*I'm using Google Colab for this notebook. I had planned to use RStudio to combine and clean the dataset, but RStudio kept freezing and I could not proceed with the cleaning.*

# Phase 2: Prepare
---



## Importing libraries and loading data
*The data is publicly available on an [AWS server](https://divvy-tripdata.s3.amazonaws.com/index.html). There is one file per month, for a total of 12 files. Each file has 13 columns of varying data types. We will merge the files into a single DataFrame named `combined_data`.*

In [1]:
#Importing libraries
import numpy as np
import pandas as pd
import datetime
from math import radians, sin, cos, sqrt, atan2
import os

# Mount Google Drive where the files are stored
from google.colab import drive
drive.mount('/content/drive')

# Set the path to the folder containing the CSV files
folder_path = '/content/drive/MyDrive/Dataset/'

# Get a list of all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Read each CSV file into its own DataFrame
frames = [pd.read_csv(os.path.join(folder_path, file)) for file in csv_files]

# Concatenate once at the end; calling pd.concat inside the loop re-copies all earlier rows on every pass
combined_data = pd.concat(frames, ignore_index=True)

# Display the combined data
combined_data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,65DBD2F447EC51C2,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member
1,0C201AA7EA0EA1AD,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual
2,E0B148CCB358A49D,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member
3,54C5775D2B7C9188,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.881370,-87.674930,member
4,A4891F78776D35DF,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5677605,30B44BD4C16E688C,classic_bike,2023-11-24 08:39:27,2023-11-24 08:47:03,Clark St & Wellington Ave,TA1307000136,Southport Ave & Wellington Ave,TA1307000006,41.936497,-87.647539,41.935775,-87.663600,member
5677606,094A79892812BAB9,classic_bike,2023-11-06 09:07:20,2023-11-06 09:10:00,Aberdeen St & Jackson Blvd,13157,Peoria St & Jackson Blvd,13158,41.877726,-87.654787,41.877642,-87.649618,member
5677607,F0A7DF8A44FDA3CB,electric_bike,2023-11-10 19:35:30,2023-11-10 19:44:28,Halsted St & Roscoe St,TA1309000025,Southport Ave & Wellington Ave,TA1307000006,41.943687,-87.648855,41.935775,-87.663600,member
5677608,4D5E3685BB913A3C,classic_bike,2023-11-27 09:11:23,2023-11-27 09:13:23,Aberdeen St & Jackson Blvd,13157,Peoria St & Jackson Blvd,13158,41.877726,-87.654787,41.877642,-87.649618,member


In [2]:
#Checking type of data
combined_data.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id       object
end_station_name       object
end_station_id         object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object

# Phase 3: Process
---



## Checking for nulls and duplicates

In [3]:
#Check null data
nan_count01 = combined_data.isna().sum()
print(nan_count01)

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    869289
start_station_id      869421
end_station_name      922436
end_station_id        922577
start_lat                  0
start_lng                  0
end_lat                 6879
end_lng                 6879
member_casual              0
dtype: int64


In [4]:
# Identify the duplicated rows based on all columns
duplicates = combined_data.duplicated()

# Count the number of duplicated rows
dup_count = duplicates.sum()
dup_count

0

*There are no duplicate rows.*

In [5]:
#Remove null data
df01 = combined_data.dropna()
df01.shape

(4299967, 13)

*We will add some new columns to the DataFrame, then use them to drop rows that are not needed.*

## Additional Columns and Data Transformation

*We will add the distance travelled in kilometers, computed from the start and end latitude and longitude of each ride. We will use the haversine formula, which gives the straight-line (great-circle) distance between two points.*
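
*For reference, the haversine formula implemented in the next cell is written out below. Latitudes (φ) and longitudes (λ) are in radians, and R = 6371 km is the Earth radius used in this notebook.*

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)
$$

$$
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right), \qquad d = R\,c
$$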

In [6]:
#Haversine Formula
def haversine(lat1, lon1, lat2, lon2):
    # Radius of the Earth in kilometers
    R = 6371.0

    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # Calculate the differences between latitudes and longitudes
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    # Calculate the distance
    distance = R * c

    return distance

In [7]:
#Copy to a new dataframe
df02 = df01.copy(deep=True)

# Calculate distance and store the result in a new column
df02['distance_travelled_km'] = df02.apply(lambda row: haversine(row['start_lat'], row['start_lng'], row['end_lat'], row['end_lng']), axis=1)
df02.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km
0,65DBD2F447EC51C2,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868
1,0C201AA7EA0EA1AD,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289
2,E0B148CCB358A49D,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678
3,54C5775D2B7C9188,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813
4,A4891F78776D35DF,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713
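
*Applying a Python function row by row can be slow on millions of rows. As a sketch of an alternative (not the method used in this notebook), the same haversine formula can be written with NumPy so that it works on whole columns at once. The function name `haversine_vec` and the `toy` DataFrame are assumptions made for this example; the coordinates are the first two rides shown above.*

```python
import numpy as np
import pandas as pd

def haversine_vec(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km for scalars or whole columns given in degrees."""
    # Convert every input from degrees to radians in one go
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return radius_km * c

# Toy frame holding the first two rides from the table above
toy = pd.DataFrame({
    "start_lat": [41.918244, 41.940106],
    "start_lng": [-87.657115, -87.645451],
    "end_lat": [41.922167, 41.922167],
    "end_lng": [-87.638888, -87.638888],
})
toy["distance_travelled_km"] = haversine_vec(
    toy["start_lat"], toy["start_lng"], toy["end_lat"], toy["end_lng"]
)
print(toy["distance_travelled_km"])
```

*On the full data this would be called with the four coordinate columns of `df02`. The results should agree with the row-wise version up to floating-point rounding.*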


In [8]:
#Check the data of column 'distance_travelled_km'
df02['distance_travelled_km'].describe()

count    4.299967e+06
mean     2.064339e+00
std      8.400811e+00
min      0.000000e+00
25%      8.645415e-01
50%      1.521062e+00
75%      2.674487e+00
max      9.815429e+03
Name: distance_travelled_km, dtype: float64

*Some rides have a distance of 0. These may be round trips (same start and end station). Next, we will add the ride duration.*
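
*One way to check the round-trip explanation is to count how many zero-distance rides start and end at the same station. The sketch below does this on a made-up `toy` DataFrame; the station names and values are hypothetical, and only the column names follow this notebook.*

```python
import pandas as pd

# Made-up rows standing in for df02 (hypothetical station names, not real data)
toy = pd.DataFrame({
    "start_station_name": ["A St", "A St", "B Ave", "C Blvd"],
    "end_station_name": ["A St", "B Ave", "B Ave", "D Rd"],
    "distance_travelled_km": [0.0, 1.2, 0.0, 0.0],
})

# Boolean masks: rides with zero distance, and rides returning to their start station
zero_dist = toy["distance_travelled_km"] == 0
same_station = toy["start_station_name"] == toy["end_station_name"]

n_zero = int(zero_dist.sum())
n_zero_round_trip = int((zero_dist & same_station).sum())
print(f"{n_zero_round_trip} of {n_zero} zero-distance rides start and end at the same station")
```

*Any zero-distance ride that is not a round trip would need a closer look, for example two stations recorded with identical coordinates.*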

In [9]:
#Copy to a new dataframe
df03 = df02.copy(deep=True)

#Convert 'started_at' and 'ended_at' to datetime
date_columns = ['started_at', 'ended_at']
df03[date_columns] = df03[date_columns].apply(pd.to_datetime)

#Calculate ride duration and store the result in a new column
df03['ride_duration_s'] = (df03['ended_at'] - df03['started_at']).dt.total_seconds()
df03.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s
0,65DBD2F447EC51C2,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0
1,0C201AA7EA0EA1AD,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0
2,E0B148CCB358A49D,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0
3,54C5775D2B7C9188,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0
4,A4891F78776D35DF,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0


In [10]:
#Check the data of column 'ride_duration_s'
df03['ride_duration_s'].describe()

count    4.299967e+06
mean     9.573586e+02
std      2.158577e+03
min     -3.274000e+03
25%      3.370000e+02
50%      5.880000e+02
75%      1.049000e+03
max      7.281780e+05
Name: ride_duration_s, dtype: float64

*Some ride durations are negative, and others may be shorter than 1 minute (60 seconds). We will remove rides that last less than 1 minute, which also removes the negative durations.*
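
*Before filtering, it can help to count how many rows fall into each problem group. The sketch below uses a made-up Series of durations (only the minimum, -3274 seconds, is taken from the summary above); the variable names are assumptions for this example.*

```python
import pandas as pd

# Made-up durations in seconds (illustrative only)
durations = pd.Series([-3274.0, 0.0, 45.0, 60.0, 556.0], name="ride_duration_s")

# Negative duration: the recorded end time is earlier than the start time
n_negative = int((durations < 0).sum())
# Short ride: at least 0 but under 60 seconds
n_short = int(((durations >= 0) & (durations < 60)).sum())
# The single threshold used in this notebook removes both groups
kept = durations[durations >= 60]

print(n_negative, n_short, len(kept))
```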

In [11]:
#Filter out ride duration less than 1 minute
df03_filtered = df03[df03['ride_duration_s'] >= 60]
df04 = df03_filtered.copy(deep=True)
df04.shape

(4212771, 15)

*Next, we will compute the average speed in kph and use it to remove rides with an implausible speed.*

In [12]:
#Calculate speed in kph
df04['speed_kph'] = (df04['distance_travelled_km'] / (df04['ride_duration_s'] / 3600))

#Check the spread of data of column 'speed_kph'
df04['speed_kph'].describe()

count    4.212771e+06
mean     1.054149e+01
std      5.955605e+01
min      0.000000e+00
25%      7.814031e+00
50%      1.071901e+01
75%      1.357626e+01
max      1.217888e+05
Name: speed_kph, dtype: float64

*We will remove rides with an average speed above 45 kph (about 28 mph, the pedal-assist cut-off for Class 3 e-bikes in the US).*
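
*The arithmetic behind the threshold can be sketched as follows. The helper names `speed_kph` and `kph_to_mph` are assumptions for this example. Because the haversine distance is a straight line between the two stations, the computed speed is a lower bound on the rider's actual average speed.*

```python
KM_PER_MILE = 1.609344  # exact length of the international mile in km

def speed_kph(distance_km, duration_s):
    """Average speed in km/h from a distance in km and a duration in seconds."""
    return distance_km / (duration_s / 3600)

def kph_to_mph(kph):
    """Convert km/h to miles per hour."""
    return kph / KM_PER_MILE

# First ride from the table above: 1.569868 km in 556 s
print(round(speed_kph(1.569868, 556.0), 2))
# The 45 kph cut-off expressed in mph
print(round(kph_to_mph(45), 2))
```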

In [13]:
#Copy data with speed of at most 45 kph into a new dataframe
df05 = df04[df04['speed_kph'] <= 45]
df05.shape

(4212381, 16)

*We will drop columns that are no longer needed so that the data loads faster.*

In [14]:
#Drop columns 'ride_id' and 'speed_kph'
columns_to_drop = ['ride_id', 'speed_kph']
df05_drop = df05.drop(columns=columns_to_drop)
df05_drop.shape

(4212381, 14)

*We will remove leading and trailing spaces from the string columns. Some rows are test rides (with 'test' in the station name or station ID). We will remove those rows as well.*
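
*To illustrate what the two steps do, here is a minimal sketch on a made-up DataFrame. The station names are hypothetical and only two of the four station columns are used.*

```python
import pandas as pd

# Made-up station names (hypothetical, for illustration only)
toy = pd.DataFrame({
    "start_station_name": ["  Clark St ", "Depot TEST 01", "Damen Ave"],
    "end_station_name": ["Damen Ave", "Clark St", "test station "],
})

cols = ["start_station_name", "end_station_name"]
# Strip leading and trailing whitespace in every listed column
toy[cols] = toy[cols].apply(lambda col: col.str.strip())
# True for rows where any listed column mentions 'test', ignoring case
has_test = toy[cols].apply(lambda col: col.str.contains("test", case=False)).any(axis=1)
# Keep only the rows without a 'test' station
clean = toy[~has_test]
print(clean)
```

*Note that a substring match would also drop any genuine station whose name happens to contain "test", so it is worth inspecting the matched rows before removing them.*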



In [15]:
#Remove leading and trailing spaces
columns_str = ['rideable_type', 'start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'member_casual']
df05_drop[columns_str] = df05_drop[columns_str].astype(str).apply(lambda col: col.str.strip())

#Filter rows not containing 'test' in station ID and name, both start and end
columns = ['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id']
wo_test = ~df05_drop[columns].apply(lambda col: col.str.contains('test', case=False)).any(axis=1)

#Keep rows that do not contain 'test'
df05_filtered = df05_drop[wo_test]
df05_filtered.shape

(4212292, 14)

*We will add the following categories to uncover time-based patterns:*

*   day_of_week: Monday, Tuesday, Wednesday, etc.
*   day_type: Weekday or Weekend
*   month: January, February, March, etc.
*   season: Winter, Spring, Summer, Fall
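
*As a sketch of an alternative to the helper functions used in the cells below, all four categories can be derived directly from the timestamp with vectorized pandas operations. The `toy` DataFrame, its timestamps, and the `season_by_month` mapping name are assumptions for this example; the season boundaries follow the same month grouping as this notebook.*

```python
import numpy as np
import pandas as pd

# Made-up timestamps (illustrative only)
toy = pd.DataFrame({"started_at": pd.to_datetime([
    "2022-12-05 10:47:18",
    "2022-12-18 06:42:33",
    "2023-07-15 12:00:00",
    "2023-10-03 08:30:00",
])})

toy["day_of_week"] = toy["started_at"].dt.day_name()
toy["month"] = toy["started_at"].dt.month_name()

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
toy["day_type"] = np.where(toy["started_at"].dt.dayofweek >= 5, "Weekend", "Weekday")

# Month number to season: Dec-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Nov Fall
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
toy["season"] = toy["started_at"].dt.month.map(season_by_month)
print(toy)
```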





In [16]:
#Add 'day_of_week' and 'month' based on 'started_at'
df_add = df05_filtered.copy(deep=True)
df_add['day_of_week'] = df_add['started_at'].dt.strftime('%A')
df_add['month'] = df_add['started_at'].dt.strftime('%B')
df_add.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s,day_of_week,month
0,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0,Monday,December
1,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0,Sunday,December
2,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0,Tuesday,December
3,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0,Tuesday,December
4,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0,Wednesday,December


In [17]:
#Function for 'day_type'
def day_cat(d):
    if d in ("Saturday", "Sunday"):
        return "Weekend"
    return "Weekday"

In [18]:
#Add 'day_type'
df_add['day_type'] = df_add['day_of_week'].apply(day_cat)
df_add.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s,day_of_week,month,day_type
0,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0,Monday,December,Weekday
1,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0,Sunday,December,Weekend
2,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0,Tuesday,December,Weekday
3,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0,Tuesday,December,Weekday
4,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0,Wednesday,December,Weekday


In [19]:
#Function for 'season'
def szn(m):
    if m in ("December", "January", "February"):
        return "Winter"
    elif m in ("March", "April", "May"):
        return "Spring"
    elif m in ("June", "July", "August"):
        return "Summer"
    return "Fall"

In [20]:
#Add 'season'
df_add['season'] = df_add['month'].apply(szn)
df_add.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s,day_of_week,month,day_type,season
0,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0,Monday,December,Weekday,Winter
1,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0,Sunday,December,Weekend,Winter
2,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0,Tuesday,December,Weekday,Winter
3,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0,Tuesday,December,Weekday,Winter
4,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0,Wednesday,December,Weekday,Winter


*We will add 'route_type' to classify each ride as a round trip (same 'start_station_name' and 'end_station_name') or a one-way trip.*
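
*The same classification can also be done without a row-wise function by comparing the two columns directly. This is a minimal sketch on a made-up DataFrame with hypothetical station names, not the method used in the cells below.*

```python
import numpy as np
import pandas as pd

# Made-up rides (hypothetical station names)
toy = pd.DataFrame({
    "start_station_name": ["Clark St", "Clark St"],
    "end_station_name": ["Clark St", "Damen Ave"],
})

# Compare the two columns element-wise and pick a label for each row
toy["route_type"] = np.where(
    toy["start_station_name"] == toy["end_station_name"],
    "Round trip",
    "One-way trip",
)
print(toy)
```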

In [21]:
#Function for 'route_type'
def rte_ty(row):
    if row['start_station_name'] == row['end_station_name']:
        return "Round trip"
    return "One-way trip"

In [22]:
#Add 'route_type'
df_add['route_type'] = df_add.apply(rte_ty, axis=1)
df_add.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s,day_of_week,month,day_type,season,route_type
0,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0,Monday,December,Weekday,Winter,One-way trip
1,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0,Sunday,December,Weekend,Winter,One-way trip
2,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0,Tuesday,December,Weekday,Winter,One-way trip
3,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0,Tuesday,December,Weekday,Winter,One-way trip
4,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0,Wednesday,December,Weekday,Winter,One-way trip


*We will add ride duration in minutes.*

In [23]:
#Add 'ride_duration_min'
df_add['ride_duration_min'] = df_add['ride_duration_s'] / 60
df_add.head()

Unnamed: 0,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,distance_travelled_km,ride_duration_s,day_of_week,month,day_type,season,route_type,ride_duration_min
0,electric_bike,2022-12-05 10:47:18,2022-12-05 10:56:34,Clifton Ave & Armitage Ave,TA1307000163,Sedgwick St & Webster Ave,13191,41.918244,-87.657115,41.922167,-87.638888,member,1.569868,556.0,Monday,December,Weekday,Winter,One-way trip,9.266667
1,classic_bike,2022-12-18 06:42:33,2022-12-18 07:08:44,Broadway & Belmont Ave,13277,Sedgwick St & Webster Ave,13191,41.940106,-87.645451,41.922167,-87.638888,casual,2.067289,1571.0,Sunday,December,Weekend,Winter,One-way trip,26.183333
2,electric_bike,2022-12-13 08:47:45,2022-12-13 08:59:51,Sangamon St & Lake St,TA1306000015,St. Clair St & Erie St,13016,41.885919,-87.651133,41.894345,-87.622798,member,2.525678,726.0,Tuesday,December,Weekday,Winter,One-way trip,12.1
3,classic_bike,2022-12-13 18:50:47,2022-12-13 19:19:48,Shields Ave & 31st St,KA1503000038,Damen Ave & Madison St,13134,41.838464,-87.635406,41.88137,-87.67493,member,5.785813,1741.0,Tuesday,December,Weekday,Winter,One-way trip,29.016667
4,classic_bike,2022-12-14 16:13:39,2022-12-14 16:27:50,Ashland Ave & Chicago Ave,13247,Damen Ave & Charleston St,13288,41.895954,-87.667728,41.920082,-87.677855,casual,2.810713,851.0,Wednesday,December,Weekday,Winter,One-way trip,14.183333


*We will rename some of the columns and rearrange them to keep the DataFrame organized.*

In [24]:
#Rename columns and re-arrange columns order
df_format = df_add.copy(deep=True)
df_format.rename(columns={'rideable_type': 'bike_type', 'started_at': 'start_time', 'ended_at': 'end_time', 'member_casual': 'user_type'}, inplace=True)
new_cols = ['bike_type', 'user_type', 'start_time', 'end_time',
            'day_of_week', 'day_type', 'month', 'season',
            'start_station_name', 'end_station_name', 'route_type',
            'start_lat', 'start_lng', 'end_lat', 'end_lng',
            'distance_travelled_km', 'ride_duration_s', 'ride_duration_min']
df_clean = df_format[new_cols]
df_clean.head()

Unnamed: 0,bike_type,user_type,start_time,end_time,day_of_week,day_type,month,season,start_station_name,end_station_name,route_type,start_lat,start_lng,end_lat,end_lng,distance_travelled_km,ride_duration_s,ride_duration_min
0,electric_bike,member,2022-12-05 10:47:18,2022-12-05 10:56:34,Monday,Weekday,December,Winter,Clifton Ave & Armitage Ave,Sedgwick St & Webster Ave,One-way trip,41.918244,-87.657115,41.922167,-87.638888,1.569868,556.0,9.266667
1,classic_bike,casual,2022-12-18 06:42:33,2022-12-18 07:08:44,Sunday,Weekend,December,Winter,Broadway & Belmont Ave,Sedgwick St & Webster Ave,One-way trip,41.940106,-87.645451,41.922167,-87.638888,2.067289,1571.0,26.183333
2,electric_bike,member,2022-12-13 08:47:45,2022-12-13 08:59:51,Tuesday,Weekday,December,Winter,Sangamon St & Lake St,St. Clair St & Erie St,One-way trip,41.885919,-87.651133,41.894345,-87.622798,2.525678,726.0,12.1
3,classic_bike,member,2022-12-13 18:50:47,2022-12-13 19:19:48,Tuesday,Weekday,December,Winter,Shields Ave & 31st St,Damen Ave & Madison St,One-way trip,41.838464,-87.635406,41.88137,-87.67493,5.785813,1741.0,29.016667
4,classic_bike,casual,2022-12-14 16:13:39,2022-12-14 16:27:50,Wednesday,Weekday,December,Winter,Ashland Ave & Chicago Ave,Damen Ave & Charleston St,One-way trip,41.895954,-87.667728,41.920082,-87.677855,2.810713,851.0,14.183333


In [25]:
#Check the final dataframe
df_clean.shape

(4212292, 18)
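
*The row counts reported at each step can be put side by side to see where rows were lost. The sketch below only uses the shapes printed in this notebook; the starting count is inferred from the last index (5677608) of `combined_data`, and the step labels are informal names chosen for this example.*

```python
# Row counts copied from the outputs in this notebook; the first one is
# inferred from the last index (5677608) of combined_data
rows_after = {
    "combined_data": 5_677_609,
    "dropna": 4_299_967,
    "duration >= 60 s": 4_212_771,
    "speed <= 45 kph": 4_212_381,
    "no 'test' stations": 4_212_292,
}

steps = list(rows_after.keys())
counts = list(rows_after.values())

# Rows removed by each step = previous count minus current count
dropped = [prev - cur for prev, cur in zip(counts, counts[1:])]
for step, n in zip(steps[1:], dropped):
    print(f"{step:<20} removed {n:>9,} rows")

# Share of the original rows that survive the whole cleaning pipeline
retained = counts[-1] / counts[0]
print(f"retained {retained:.1%} of the original rows")
```

*Almost all of the loss comes from dropping rows with missing values, which in this dataset are mostly missing station names and IDs.*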

## Exporting the DataFrame for analysis

*The analysis is done in a separate notebook, so we export the cleaned data to a CSV file.*
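
*A CSV file stores timestamps as plain text, so the analysis notebook has to convert them back when loading. The sketch below shows one way to do that with `parse_dates`; it reads from an in-memory string with two sample rows and a subset of the columns so that it runs on its own. In practice the first argument would be the path to the exported file.*

```python
import io
import pandas as pd

# Stand-in for the exported file: two rows, subset of the cleaned columns
csv_text = io.StringIO(
    "bike_type,user_type,start_time,end_time,ride_duration_s\n"
    "electric_bike,member,2022-12-05 10:47:18,2022-12-05 10:56:34,556.0\n"
    "classic_bike,casual,2022-12-18 06:42:33,2022-12-18 07:08:44,1571.0\n"
)

# Ask pandas to parse the two time columns back into datetimes
df = pd.read_csv(csv_text, parse_dates=["start_time", "end_time"])
print(df.dtypes)
```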

In [26]:
#Export to a CSV file and download it
from google.colab import files
df_clean.to_csv('cyclistic_clean.csv', index=False)
files.download('cyclistic_clean.csv')
