In [38]:
!mysql -e "DROP DATABASE IF EXISTS cyclistic;"
!mysql -e "CREATE DATABASE cyclistic;"

First, we drop any existing cyclistic database and create a fresh one.

In [39]:
!mysql -e "SHOW DATABASES;"

+--------------------+
| Database           |
+--------------------+
| burke23201534      |
| cyclistic          |
| information_schema |
| mydb               |
| mysql              |
| newdb              |
| performance_schema |
| sys                |
+--------------------+


Let's list the databases to confirm that cyclistic was created.

In [40]:
%%bash
mysql -D "cyclistic" -e "
CREATE TABLE trips (
    ride_id VARCHAR(255) PRIMARY KEY,
    rideable_type VARCHAR(50),
    started_at DATETIME,
    ended_at DATETIME,
    start_station_name VARCHAR(255),
    start_station_id VARCHAR(255),
    end_station_name VARCHAR(255),
    end_station_id VARCHAR(255),
    start_lat DECIMAL(10, 8),
    start_lng DECIMAL(11, 8),
    end_lat DECIMAL(10, 8),
    end_lng DECIMAL(11, 8),
    member_casual VARCHAR(50)
);"
mysql -e "SET GLOBAL local_infile = 1;"

Now we create the table schema. The column order must match the CSV exactly, because LOAD DATA assigns fields to columns by position when no column list is given. We also tell the MySQL server to globally allow loading data from local files.
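One way to check the column order before loading is to compare the CSV header against the schema. The sketch below is a minimal illustration, not part of the original workflow: it assumes the CSV has a header row whose names are identical to the column names in the CREATE TABLE statement, and the helper name header_mismatches is hypothetical.

```python
import csv
import io

# Column order copied from the CREATE TABLE statement.
EXPECTED_COLUMNS = [
    "ride_id", "rideable_type", "started_at", "ended_at",
    "start_station_name", "start_station_id",
    "end_station_name", "end_station_id",
    "start_lat", "start_lng", "end_lat", "end_lng",
    "member_casual",
]


def header_mismatches(csv_file, expected=EXPECTED_COLUMNS):
    """Return (position, expected_name, found_name) for every column that differs.

    Assumption: the first line of csv_file is a header row.
    """
    header = next(csv.reader(csv_file))
    mismatches = []
    for i in range(max(len(header), len(expected))):
        want = expected[i] if i < len(expected) else None
        got = header[i].strip() if i < len(header) else None
        if want != got:
            mismatches.append((i, want, got))
    return mismatches


# Hypothetical usage against the real file (not run here):
# with open("cyclistic_tripdata_2020.csv", newline="") as f:
#     print(header_mismatches(f))  # an empty list means the order matches
```

An empty result means the header and the schema agree position by position; any tuple returned points at the first place to look.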

In [31]:
%%bash
time mysql --local-infile=1 -D "cyclistic" -e "
LOAD DATA LOCAL INFILE 'cyclistic_tripdata_2020.csv' 
INTO TABLE trips 
FIELDS TERMINATED BY ',' 
ENCLOSED BY '\"' 
LINES TERMINATED BY '\n' 
IGNORE 1 ROWS;
"


real	2m22.155s
user	0m0.079s
sys	0m0.528s


This is where we load the data and note how long it takes: about 2 minutes 22 seconds of wall-clock time.

In [32]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    SUM(CASE WHEN ride_id IS NULL OR ride_id = '' THEN 1 ELSE 0 END) AS missing_ids,
    SUM(CASE WHEN start_station_name IS NULL OR start_station_name = '' THEN 1 ELSE 0 END) AS missing_start_stations,
    SUM(CASE WHEN end_station_name IS NULL OR end_station_name = '' THEN 1 ELSE 0 END) AS missing_end_stations,
    SUM(CASE WHEN end_lat IS NULL OR end_lng IS NULL THEN 1 ELSE 0 END) AS missing_end_coords,
    SUM(CASE WHEN member_casual IS NULL OR member_casual = '' THEN 1 ELSE 0 END) AS missing_user_types
FROM trips;
"

+-------------+------------------------+----------------------+--------------------+--------------------+
| missing_ids | missing_start_stations | missing_end_stations | missing_end_coords | missing_user_types |
+-------------+------------------------+----------------------+--------------------+--------------------+
|           0 |                  94656 |               110880 |                  0 |                  0 |
+-------------+------------------------+----------------------+--------------------+--------------------+



real	0m6.841s
user	0m0.010s
sys	0m0.014s


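Let's check the data for missing values. IDs and user types are complete, but 94,656 rides have no start station name and 110,880 have no end station name. The zero for missing end coordinates should be read with caution: LOAD DATA typically stores an empty numeric field as 0 rather than NULL, so an IS NULL check may not catch coordinates that were blank in the CSV.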
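The missing-value counts can be cross-checked against the raw CSV, where a blank field is unambiguous. The following is a rough sketch under the assumption that the file has a header row; count_empty_fields is a hypothetical helper, and the sample rows in the comments are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_empty_fields(csv_file):
    """Count, per column, the rows whose value is empty or whitespace-only."""
    empty = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column is None:
                # Extra fields beyond the header; ignored in this sketch.
                continue
            if value is None or value.strip() == "":
                empty[column] += 1
    return empty


# Hypothetical usage against the real file (not run here):
# with open("cyclistic_tripdata_2020.csv", newline="") as f:
#     print(count_empty_fields(f))
```

If the count for end_lat or end_lng is non-zero here while the SQL check reports zero, the blanks were most likely converted to 0 during the load.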

In [17]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    member_casual, 
    rideable_type, 
    COUNT(*) as total_rides
FROM trips
GROUP BY member_casual, rideable_type
ORDER BY total_rides DESC;
"

+---------------+---------------+-------------+
| member_casual | rideable_type | total_rides |
+---------------+---------------+-------------+
| member        | docked_bike   |     1820108 |
| casual        | docked_bike   |     1146005 |
| member        | electric_bike |      295519 |
| casual        | electric_bike |      209226 |
| member        | classic_bike  |       59297 |
| casual        | classic_bike  |       11319 |
+---------------+---------------+-------------+



real	0m6.946s
user	0m0.007s
sys	0m0.010s


Here we can see the breakdown of total rides by bike type for customers with an annual membership (member) versus customers who purchased a single-ride or full-day pass (casual).

In [18]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    member_casual, 
    ROUND(AVG(TIMESTAMPDIFF(MINUTE, started_at, ended_at)), 2) AS avg_duration_minutes
FROM trips
GROUP BY member_casual;
"

+---------------+----------------------+
| member_casual | avg_duration_minutes |
+---------------+----------------------+
| casual        |                45.36 |
| member        |                12.31 |
+---------------+----------------------+



real	0m7.036s
user	0m0.009s
sys	0m0.011s


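This gives us the average ride duration for each group: about 45 minutes for casual riders versus about 12 minutes for members. Note that TIMESTAMPDIFF(MINUTE, ...) counts only whole minutes, so each ride's duration is truncated before it is averaged.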
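To see how much whole-minute counting can shift an average, here is a small sketch that computes the mean both ways. The two rides are invented for illustration, and avg_minutes is a hypothetical helper that only mimics the truncation; it is not the query itself.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def avg_minutes(rides, truncate=True):
    """Average duration in minutes over (started_at, ended_at) string pairs.

    truncate=True drops the fractional minute of each ride before averaging;
    truncate=False keeps second-level precision.
    """
    total = 0.0
    for started_at, ended_at in rides:
        delta = datetime.strptime(ended_at, FMT) - datetime.strptime(started_at, FMT)
        minutes = delta.total_seconds() / 60
        total += int(minutes) if truncate else minutes
    return round(total / len(rides), 2)


# Toy data, not taken from the dataset.
rides = [
    ("2020-07-01 08:00:00", "2020-07-01 08:10:59"),  # 10 min 59 s
    ("2020-07-01 09:00:00", "2020-07-01 09:05:30"),  # 5 min 30 s
]
print(avg_minutes(rides, truncate=True))   # 7.5
print(avg_minutes(rides, truncate=False))  # 8.24
```

Averaging TIMESTAMPDIFF(SECOND, started_at, ended_at) and dividing by 60 would be one way to avoid the truncation in SQL.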

In [19]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    start_station_name, 
    end_station_name, 
    COUNT(*) as route_count
FROM trips
WHERE start_station_name != '' AND end_station_name != ''
GROUP BY start_station_name, end_station_name
ORDER BY route_count DESC
LIMIT 5;
"

+----------------------------+----------------------------+-------------+
| start_station_name         | end_station_name           | route_count |
+----------------------------+----------------------------+-------------+
| Streeter Dr & Grand Ave    | Streeter Dr & Grand Ave    |        6673 |
| Lake Shore Dr & Monroe St  | Lake Shore Dr & Monroe St  |        6397 |
| Buckingham Fountain        | Buckingham Fountain        |        4993 |
| Millennium Park            | Millennium Park            |        4897 |
| Indiana Ave & Roosevelt Rd | Indiana Ave & Roosevelt Rd |        4678 |
+----------------------------+----------------------------+-------------+



real	0m22.644s
user	0m0.009s
sys	0m0.011s


Here we see the five most popular routes. All five are round trips that start and end at the same station.
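Since round trips dominate the ranking, it may also be worth looking at one-way routes only. The sketch below shows the extra filter on toy data, using Python's built-in sqlite3 as a stand-in for the MySQL server (an assumption made purely so the example is self-contained); top_one_way_routes is a hypothetical helper and the station names "A" and "B" are placeholders.

```python
import sqlite3


def top_one_way_routes(rows, limit=5):
    """rows: iterable of (start_station_name, end_station_name) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips (start_station_name TEXT, end_station_name TEXT)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT start_station_name, end_station_name, COUNT(*) AS route_count
        FROM trips
        WHERE start_station_name != '' AND end_station_name != ''
          AND start_station_name != end_station_name  -- drop round trips
        GROUP BY start_station_name, end_station_name
        ORDER BY route_count DESC, start_station_name, end_station_name
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    conn.close()
    return result


# Toy data: three round trips at A, two A->B rides, one B->A ride,
# and one ride with a missing start station.
toy_rows = [("A", "A")] * 3 + [("A", "B")] * 2 + [("B", "A")] + [("", "B")]
print(top_one_way_routes(toy_rows))  # [('A', 'B', 2), ('B', 'A', 1)]
```

The only change relative to the query above is the additional start_station_name != end_station_name condition.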

In [20]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    member_casual, 
    DAYNAME(started_at) AS day_of_week, 
    COUNT(*) AS total_rides
FROM trips
GROUP BY member_casual, day_of_week
ORDER BY 
    member_casual, 
    FIELD(day_of_week, 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');
"

+---------------+-------------+-------------+
| member_casual | day_of_week | total_rides |
+---------------+-------------+-------------+
| casual        | Monday      |      141652 |
| casual        | Tuesday     |      137563 |
| casual        | Wednesday   |      152508 |
| casual        | Thursday    |      162689 |
| casual        | Friday      |      202187 |
| casual        | Saturday    |      313803 |
| casual        | Sunday      |      256148 |
| member        | Monday      |      292044 |
| member        | Tuesday     |      315416 |
| member        | Wednesday   |      329379 |
| member        | Thursday    |      328589 |
| member        | Friday      |      326048 |
| member        | Saturday    |      312376 |
| member        | Sunday      |      271072 |
+---------------+-------------+-------------+



real	0m7.279s
user	0m0.007s
sys	0m0.008s


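This cell shows the number of trips taken on each day of the week, split by casual riders and members. Usage correlates with the day of the week: casual rides peak on Saturday and Sunday, which together account for about 42% of casual rides, while member rides are spread more evenly and peak midweek, with the weekend accounting for about 27%.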
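The weekend effect can be quantified directly from the table above. This short sketch copies the counts from the query output and computes the share of rides that fall on Saturday or Sunday for each group; weekend_share is a hypothetical helper name.

```python
# Counts copied from the query output above.
rides_by_day = {
    "casual": {
        "Monday": 141652, "Tuesday": 137563, "Wednesday": 152508,
        "Thursday": 162689, "Friday": 202187,
        "Saturday": 313803, "Sunday": 256148,
    },
    "member": {
        "Monday": 292044, "Tuesday": 315416, "Wednesday": 329379,
        "Thursday": 328589, "Friday": 326048,
        "Saturday": 312376, "Sunday": 271072,
    },
}


def weekend_share(day_counts):
    """Percentage of rides taken on Saturday or Sunday, to one decimal place."""
    weekend = day_counts["Saturday"] + day_counts["Sunday"]
    return round(100 * weekend / sum(day_counts.values()), 1)


for rider_type, day_counts in rides_by_day.items():
    print(rider_type, weekend_share(day_counts))
# casual 41.7
# member 26.8
```

If rides were spread evenly across the week, the weekend share would be 2/7, or about 28.6%, so members sit slightly below that baseline and casual riders well above it.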

In [21]:
%%bash
time mysql -t -D "cyclistic" -e "
SELECT 
    station_name, 
    SUM(activity_count) AS total_activity
FROM (
    SELECT start_station_name AS station_name, COUNT(*) AS activity_count
    FROM trips
    WHERE start_station_name != '' AND start_station_name IS NOT NULL
    GROUP BY start_station_name
    
    UNION ALL
    
    SELECT end_station_name AS station_name, COUNT(*) AS activity_count
    FROM trips
    WHERE end_station_name != '' AND end_station_name IS NOT NULL
    GROUP BY end_station_name
) AS combined_traffic
GROUP BY station_name
ORDER BY total_activity DESC
LIMIT 5;
"

+----------------------------+----------------+
| station_name               | total_activity |
+----------------------------+----------------+
| Streeter Dr & Grand Ave    |          73132 |
| Clark St & Elm St          |          63988 |
| Theater on the Lake        |          60666 |
| Lake Shore Dr & Monroe St  |          57660 |
| Lake Shore Dr & North Blvd |          53795 |
+----------------------------+----------------+



real	0m13.759s
user	0m0.009s
sys	0m0.010s


This cell shows which stations see the most traffic, counting every ride that starts or ends at a station. These are the most popular stations and the ones likely to need the most upkeep.
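The counting logic behind this query can be sketched in a few lines of Python. The example below is illustrative only: station_activity is a hypothetical helper, and the station names are placeholders rather than real data. Like the UNION ALL query, it counts a round trip twice for the same station, once as a start and once as an end.

```python
from collections import Counter


def station_activity(trips, top_n=5):
    """Rank stations by starts plus ends.

    trips: iterable of (start_station_name, end_station_name) pairs.
    Empty or missing names are skipped, mirroring the WHERE clauses.
    """
    activity = Counter()
    for start, end in trips:
        if start:
            activity[start] += 1
        if end:
            activity[end] += 1
    return activity.most_common(top_n)


# Toy data with placeholder station names.
toy_trips = [("A", "A"), ("A", "B"), ("", "B"), ("C", None)]
print(station_activity(toy_trips, top_n=2))  # [('A', 3), ('B', 2)]
```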