In [8]:
!mysql -e "DROP DATABASE IF EXISTS cyclistic;"
!mysql -e "CREATE DATABASE cyclistic;"

First, we create our database, called cyclistic, dropping any existing copy so the notebook starts from a clean slate.

In [9]:
!mysql -e "SHOW DATABASES;"

+--------------------+
| Database           |
+--------------------+
| burke23201534      |
| cyclistic          |
| information_schema |
| mydb               |
| mysql              |
| newdb              |
| performance_schema |
| sys                |
+--------------------+


Let's take a look at our databases to make sure cyclistic is there.

In [10]:
%%bash
# 1. Download the file using curl instead of wget
FILE_NAME="Divvy_Trips_2020_Q1.zip"
DOWNLOAD_URL="https://divvy-tripdata.s3.amazonaws.com/$FILE_NAME"

echo "Downloading $FILE_NAME..."
# -L follows redirects, -o specifies the output filename
curl -L "$DOWNLOAD_URL" -o "$FILE_NAME"

# 2. Unzip the CSV
echo "Unzipping data..."
unzip -o "$FILE_NAME"

# 3. Clean up the zip file to save space
rm "$FILE_NAME"

Downloading Divvy_Trips_2020_Q1.zip...


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.1M  100 15.1M    0     0  2124k      0  0:00:07  0:00:07 --:--:-- 2290k


Unzipping data...
Archive:  Divvy_Trips_2020_Q1.zip
  inflating: Divvy_Trips_2020_Q1.csv  
  inflating: __MACOSX/._Divvy_Trips_2020_Q1.csv  


Here we download part of our data (the 2020 Q1 trip file) from a URL, unzip it, and delete the zip file to save space.

In [11]:
%%bash
mysql -D "cyclistic" -e "
CREATE TABLE trips (
    ride_id VARCHAR(255) PRIMARY KEY,
    rideable_type VARCHAR(50),
    started_at DATETIME,
    ended_at DATETIME,
    start_station_name VARCHAR(255),
    start_station_id VARCHAR(255),
    end_station_name VARCHAR(255),
    end_station_id VARCHAR(255),
    start_lat DECIMAL(10, 8),
    start_lng DECIMAL(11, 8),
    end_lat DECIMAL(10, 8),
    end_lng DECIMAL(11, 8),
    member_casual VARCHAR(50)
);"
mysql -e "SET GLOBAL local_infile = 1;"

Now we create the table schema, which must match the CSV's column order exactly. We also set local_infile globally on the MySQL server so that it accepts LOAD DATA LOCAL INFILE from the client.
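Since LOAD DATA without a column list assigns CSV fields to table columns by position rather than by name, it can help to print the CSV header before loading and compare it with the CREATE TABLE statement. Here is a minimal sketch of that check; the helper name csv_header_columns is ours, and the file name is taken from the unzip output above.

```bash
#!/usr/bin/env bash
# Print the header fields of a CSV, one per line, with double quotes and
# any carriage return stripped, so the order can be compared by eye with
# the column order in CREATE TABLE.
# (csv_header_columns is a hypothetical helper, not part of any tool.)
csv_header_columns() {
    head -n 1 "$1" | tr -d '"\r' | tr ',' '\n'
}

CSV_FILE="Divvy_Trips_2020_Q1.csv"   # assumed: the file unzipped earlier
if [ -f "$CSV_FILE" ]; then
    # nl numbers the fields, which makes a positional comparison easier
    csv_header_columns "$CSV_FILE" | nl
fi
```

If the numbered list does not line up with the thirteen columns of trips, the load may still succeed but put values in the wrong columns.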

In [12]:
%%bash
# Check the unzipped file name (it might differ from the zip name)
CSV_FILE=$(ls *.csv | head -n 1)
echo "Loading $CSV_FILE into MySQL..."

time mysql --local-infile=1 -D "cyclistic" -e "
LOAD DATA LOCAL INFILE '$CSV_FILE' 
INTO TABLE trips 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '\"' 
LINES TERMINATED BY '\n' 
IGNORE 1 ROWS;
"

Loading Divvy_Trips_2020_Q1.csv into MySQL...



real	0m3.440s
user	0m0.021s
sys	0m0.142s


This is where we load the data and note how long it takes (about 3.4 seconds here).
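The load prints nothing besides the timing, so a quick sanity check is to count the data rows in the CSV and compare that number with SELECT COUNT(*) FROM trips. The sketch below covers only the CSV side; csv_data_rows is a hypothetical helper, and it counts physical lines, so a quoted field containing a line break would inflate the result.

```bash
#!/usr/bin/env bash
# Count the data rows in a CSV: skip the header line, then count what is left.
# grep -c '' also counts a final line that lacks a trailing newline.
# (csv_data_rows is a hypothetical helper name.)
csv_data_rows() {
    tail -n +2 "$1" | grep -c ''
}

CSV_FILE="Divvy_Trips_2020_Q1.csv"   # assumed: the file unzipped earlier
if [ -f "$CSV_FILE" ]; then
    echo "data rows in $CSV_FILE: $(csv_data_rows "$CSV_FILE")"
fi
```

If the table ends up with fewer rows than the file, one likely cause is the PRIMARY KEY on ride_id: as we understand the MySQL documentation, LOAD DATA LOCAL behaves as if IGNORE were given, so rows with a duplicate key are skipped with a warning instead of stopping the load.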

In [31]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    SUM(CASE WHEN ride_id IS NULL OR ride_id = '' THEN 1 ELSE 0 END) AS missing_ids,
    SUM(CASE WHEN start_station_name IS NULL OR start_station_name = '' THEN 1 ELSE 0 END) AS missing_start_stations,
    SUM(CASE WHEN end_station_name IS NULL OR end_station_name = '' THEN 1 ELSE 0 END) AS missing_end_stations,
    SUM(CASE WHEN end_lat IS NULL OR end_lng IS NULL THEN 1 ELSE 0 END) AS missing_end_coords,
    SUM(CASE WHEN member_casual IS NULL OR member_casual = '' THEN 1 ELSE 0 END) AS missing_user_types
FROM trips;
"

*************************** 1. row ***************************
           missing_ids: 0
missing_start_stations: 94656
  missing_end_stations: 110880
    missing_end_coords: 0
    missing_user_types: 0



real	0m7.183s
user	0m0.011s
sys	0m0.007s


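Let's check the data for missing values. There are no missing ride IDs or user types, but 94,656 rides have no start station name and 110,880 have no end station name. The zero for missing end coordinates only tells us that no value is NULL.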
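One caveat on the coordinate check: to our understanding, LOAD DATA stores an empty field in a numeric column as 0 rather than NULL, so an IS NULL test cannot see coordinates that were blank in the file. A rough way to look for them is to count empty fields in the CSV itself. This is a sketch under two assumptions: the column numbers follow the CREATE TABLE order above (end_lat is 11, end_lng is 12), and no quoted field contains a comma, since awk splits on every comma.

```bash
#!/usr/bin/env bash
# Count data rows whose given column is empty (or just an empty pair of quotes).
#   $1 = CSV path, $2 = 1-based column number
# (count_empty_field is a hypothetical helper; it does not parse quoted commas.)
count_empty_field() {
    awk -F',' -v col="$2" '
        NR > 1 {
            v = $col
            gsub(/["\r]/, "", v)     # drop quotes and any carriage return
            if (v == "") n++
        }
        END { print n + 0 }
    ' "$1"
}

CSV_FILE="Divvy_Trips_2020_Q1.csv"   # assumed: the file unzipped earlier
if [ -f "$CSV_FILE" ]; then
    echo "empty end_lat fields: $(count_empty_field "$CSV_FILE" 11)"
    echo "empty end_lng fields: $(count_empty_field "$CSV_FILE" 12)"
fi
```

A nonzero count here would mean the table holds 0.00000000 for those rides, which any later distance or map analysis should filter out.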

In [26]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    member_casual, 
    rideable_type, 
    COUNT(*) as total_rides
FROM trips
GROUP BY member_casual, rideable_type
ORDER BY total_rides DESC; 
"

*************************** 1. row ***************************
member_casual: member
rideable_type: docked_bike
  total_rides: 1820108
*************************** 2. row ***************************
member_casual: casual
rideable_type: docked_bike
  total_rides: 1146005
*************************** 3. row ***************************
member_casual: member
rideable_type: electric_bike
  total_rides: 295519
*************************** 4. row ***************************
member_casual: casual
rideable_type: electric_bike
  total_rides: 209226
*************************** 5. row ***************************
member_casual: member
rideable_type: classic_bike
  total_rides: 59297
*************************** 6. row ***************************
member_casual: casual
rideable_type: classic_bike
  total_rides: 11319



real	0m7.561s
user	0m0.010s
sys	0m0.008s


Here we can see total rides broken down by bike type for customers with an annual membership (members) vs. customers who have purchased a single-ride or full-day pass (casual). Docked bikes dominate for both groups, and members account for about 61% of all rides (2,174,924 of 3,541,474).

In [27]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    member_casual, 
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, started_at, ended_at)), 2) AS avg_duration_minutes
FROM trips
GROUP BY member_casual;
"

*************************** 1. row ***************************
       member_casual: casual
avg_duration_minutes: 45.36
*************************** 2. row ***************************
       member_casual: member
avg_duration_minutes: 12.31



real	0m7.456s
user	0m0.011s
sys	0m0.009s


This gives us the average ride duration for casual customers and members: casual rides average about 45 minutes, member rides about 12.
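A note on precision: TIMESTAMPDIFF(MINUTE, ...) returns a whole number of minutes, so, as far as we understand it, the seconds past the last full minute are dropped for every ride before the average is taken. The arithmetic below illustrates the effect on one made-up ride of 12 minutes 59 seconds; the helper names and the 779-second figure are ours, not from the data.

```bash
#!/usr/bin/env bash
# Compare whole-minute truncation with fractional minutes for one duration.
# (whole_minutes and fractional_minutes are hypothetical helper names.)

# Integer division drops the leftover seconds, like a whole-minute difference.
whole_minutes() {
    echo $(( $1 / 60 ))
}

# Floating-point division keeps them, rounded to two decimals.
fractional_minutes() {
    awk -v s="$1" 'BEGIN { printf "%.2f\n", s / 60 }'
}

secs=779   # hypothetical ride: 12 min 59 s
echo "whole minutes:      $(whole_minutes "$secs")"        # 12
echo "fractional minutes: $(fractional_minutes "$secs")"   # 12.98
```

If that loss matters, one option is to average TIMESTAMPDIFF(SECOND, started_at, ended_at) / 60 instead, which keeps the seconds.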

In [28]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    start_station_name, 
    end_station_name, 
    COUNT(*) as route_count
FROM trips
WHERE start_station_name != '' AND end_station_name != ''
GROUP BY start_station_name, end_station_name
ORDER BY route_count DESC
LIMIT 5;
"

*************************** 1. row ***************************
start_station_name: Streeter Dr & Grand Ave
  end_station_name: Streeter Dr & Grand Ave
       route_count: 6673
*************************** 2. row ***************************
start_station_name: Lake Shore Dr & Monroe St
  end_station_name: Lake Shore Dr & Monroe St
       route_count: 6397
*************************** 3. row ***************************
start_station_name: Buckingham Fountain
  end_station_name: Buckingham Fountain
       route_count: 4993
*************************** 4. row ***************************
start_station_name: Millennium Park
  end_station_name: Millennium Park
       route_count: 4897
*************************** 5. row ***************************
start_station_name: Indiana Ave & Roosevelt Rd
  end_station_name: Indiana Ave & Roosevelt Rd
       route_count: 4678



real	0m23.245s
user	0m0.011s
sys	0m0.007s


Here we see the five most popular routes. All five are round trips that start and end at the same station.

In [29]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    member_casual, 
    DAYNAME(started_at) AS day_of_week, 
    COUNT(*) AS total_rides
FROM trips
GROUP BY member_casual, day_of_week
ORDER BY 
    member_casual, 
    FIELD(day_of_week, 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');
"

*************************** 1. row ***************************
member_casual: casual
  day_of_week: Monday
  total_rides: 141652
*************************** 2. row ***************************
member_casual: casual
  day_of_week: Tuesday
  total_rides: 137563
*************************** 3. row ***************************
member_casual: casual
  day_of_week: Wednesday
  total_rides: 152508
*************************** 4. row ***************************
member_casual: casual
  day_of_week: Thursday
  total_rides: 162689
*************************** 5. row ***************************
member_casual: casual
  day_of_week: Friday
  total_rides: 202187
*************************** 6. row ***************************
member_casual: casual
  day_of_week: Saturday
  total_rides: 313803
*************************** 7. row ***************************
member_casual: casual
  day_of_week: Sunday
  total_rides: 256148
*************************** 8. row ***************************
member_casual: member
  da


real	0m7.227s
user	0m0.011s
sys	0m0.006s


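This cell shows us the number of trips taken per day of the week, broken up by casual riders vs. members. Casual ridership climbs through the week and peaks on Saturday (313,803 rides) and Sunday (256,148), so day of the week correlates with usage. The member rows are cut off in the output above.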
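To put a number on the weekend effect for casual riders, we can compute the share of rides that fall on Saturday and Sunday from the seven counts printed above. This is a small sketch; weekend_share is a hypothetical helper that expects the counts in Monday-first order.

```bash
#!/usr/bin/env bash
# Percentage of rides on Saturday and Sunday, given seven daily counts
# in Monday-to-Sunday order. (weekend_share is a hypothetical helper name.)
weekend_share() {
    printf '%s\n' "$@" | awk '
        { total += $1; if (NR >= 6) weekend += $1 }   # lines 6 and 7 are Sat, Sun
        END { printf "%.1f\n", 100 * weekend / total }
    '
}

# Casual-rider counts copied from the query output above (Monday first).
weekend_share 141652 137563 152508 162689 202187 313803 256148   # prints 41.7
```

Two days out of seven would be about 28.6% if rides were spread evenly, so roughly 42% on the weekend is a clear skew for casual riders.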

In [30]:
%%bash
time mysql -E -D "cyclistic" -e "
SELECT 
    station_name, 
    SUM(activity_count) AS total_activity
FROM (
    SELECT start_station_name AS station_name, COUNT(*) AS activity_count
    FROM trips
    WHERE start_station_name != '' AND start_station_name IS NOT NULL
    GROUP BY start_station_name
    
    UNION ALL
    
    SELECT end_station_name AS station_name, COUNT(*) AS activity_count
    FROM trips
    WHERE end_station_name != '' AND end_station_name IS NOT NULL
    GROUP BY end_station_name
) AS combined_traffic
GROUP BY station_name
ORDER BY total_activity DESC
LIMIT 5;
"

*************************** 1. row ***************************
  station_name: Streeter Dr & Grand Ave
total_activity: 73132
*************************** 2. row ***************************
  station_name: Clark St & Elm St
total_activity: 63988
*************************** 3. row ***************************
  station_name: Theater on the Lake
total_activity: 60666
*************************** 4. row ***************************
  station_name: Lake Shore Dr & Monroe St
total_activity: 57660
*************************** 5. row ***************************
  station_name: Lake Shore Dr & North Blvd
total_activity: 53795



real	0m14.899s
user	0m0.011s
sys	0m0.008s


This cell shows us which stations have the most traffic, counting both rides that start there and rides that end there. These are the most popular stations and likely the ones that need the most upkeep.

In [22]:
!mysql -D "cyclistic" -e "SELECT COUNT(*) FROM trips;"

+----------+
| COUNT(*) |
+----------+
|  3541474 |
+----------+
