# Description and Goal

This project cleans, explores (EDA), and models the BlueBikes bicycle sharing system data from 20xx to 20xx, found [here](https://bluebikes.com/system-data).



| Column (From website)      |
| ----------- |
| Trip Duration (seconds) |
| Start Time and Date |
| Stop Time and Date |
| Start Station Name & ID |
| End Station Name & ID |
| Bike ID |
| User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member) |









# Load and Clean Data

Step 1: Preprocessing (in Python/Pandas)

Let's load the data. The website provides many files, so we need to standardize their column names and concatenate them into one DataFrame (and possibly one CSV file):
- Loop through CSVs, inspect column names, standardize them.
- Concatenate all into one big DataFrame.
- Clean data (e.g., fix datetime parsing, column types, missing values).

In [186]:
import pandas as pd

In [187]:
start_path = "/Users/ellawang/Documents/GitHub/bike_csv_files/"
old_end_path = "-hubway-tripdata.csv"
new_end_path = "-bluebikes-tripdata.csv"
def month_files(year, months, end_path):
    """Build monthly file names such as '201501-hubway-tripdata.csv'."""
    return [f"{year}{month:02d}{end_path}" for month in months]

# Note: for 201801-201803 I had to manually replace _ with - in the file names.
# Starting with 201805 the file names switch from hubway to bluebikes.
pathways = []
for year in range(2015, 2018):
    pathways += month_files(year, range(1, 13), old_end_path)
pathways += month_files(2018, range(1, 5), old_end_path)
pathways += month_files(2018, range(5, 13), new_end_path)
for year in range(2019, 2025):
    pathways += month_files(year, range(1, 13), new_end_path)
pathways += month_files(2025, range(1, 7), new_end_path)

In [188]:
# # Peek at the columns and formats/datatypes of each file
# num_total_rows = 0
# col_count = {}
# for path in pathways:
#     df = pd.read_csv(start_path + path)
#     print(f'{path}: {df.columns} : {df.shape[0]} rows')
#     num_total_rows += df.shape[0]
#     print(df.iloc[0])
#     for col in df.columns:
#         if col not in col_count:
#             col_count[col] = 0
#         col_count[col] += 1

# print(col_count)
# print(f'Num total rows: {num_total_rows}')

# # saved to "output.txt" so don't have to re-run

The files use inconsistent naming conventions. I went through the output of the print statements (the column names plus one sample row from each file) to see which columns stay the same, which are renamed or reformatted, and which are dropped or added at a certain point.

Column names in 99 files (201501 through 202303, inclusive)
- 'tripduration': 99 (ends in 202303) (e.g. 1105) -- DROPPING tentatively?
- 'bikeid': 99 (ends in 202303) (e.g. 6680) -- DROPPING tentatively?
- 'starttime': 99 (becomes started_at beginning 202304) (e.g. 2023-03-01 00:00:44.1520 --> 2023-04-13 13:49:59)
- 'stoptime': 99 (becomes ended_at beginning 202304)
- 'start station id': 99 (becomes start_station_id beginning 202304) (e.g. 386 --> A32011)
- 'start station name': 99 (becomes start_station_name beginning 202304) (e.g. Central Square at Mass Ave / Essex St; the format seems to stay the same)
- 'start station latitude': 99 (becomes start_lat beginning 202304) (e.g. 42.368605 --> 42.363713; the format stays the same)
- 'start station longitude': 99 (becomes start_lng beginning 202304) (same as latitude)
- 'end station id': 99 (becomes end_station_id beginning 202304) (e.g. 386 --> A32011, the same change as the start station id)
- 'end station name': 99 (becomes end_station_name beginning 202304) (same as the start station name)
- 'end station latitude': 99 (becomes end_lat beginning 202304) (same as the start latitude)
- 'end station longitude': 99 (becomes end_lng beginning 202304) (same as the start longitude)
- 'usertype': 99 (becomes member_casual beginning 202304) (e.g. Customer or Subscriber --> member or casual)

Column names in 64 files (201501 until 202004)
- 'birth year': 64 (e.g. 1984) -- DROPPING
- 'gender': 64 (e.g. 0 or 1 or 2) -- DROPPING

Column names in 35 files (202005 until 202303)
- 'postal code': 35 (e.g. 02118 or NaN) -- DROPPING

Column names in 27 files (202304 to 202506)
- 'ride_id': 27 (begins 202304) (e.g. 0093AA5E7E3E0158) -- DROPPING
- 'rideable_type': 27, (begins 202304) (e.g. docked_bike or classic_bike or electric_bike) -- DROPPING tentatively?
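
The inventory above came from reading every file in full. A lighter way to get the same per-column file counts might be to read only the header row of each file. The sketch below assumes `pd.read_csv(..., nrows=0)` is enough for that purpose; the helper name `count_column_names` and the two in-memory files are made up for illustration and are not part of the real dataset.

```python
import io
from collections import Counter

import pandas as pd


def count_column_names(sources):
    """Count how many files contain each column name, reading headers only."""
    counts = Counter()
    for source in sources:
        # nrows=0 parses just the header row, so no trip rows are loaded.
        header = pd.read_csv(source, nrows=0)
        counts.update(header.columns)
    return counts


# Hypothetical in-memory stand-ins for one old-style and one new-style file.
old_file = io.StringIO(
    "tripduration,starttime,usertype\n1105,2015-01-01 00:21:44,Subscriber\n"
)
new_file = io.StringIO(
    "ride_id,started_at,member_casual\n0093AA5E7E3E0158,2023-04-13 13:49:59,member\n"
)

col_count = count_column_names([old_file, new_file])
print(col_count)
```

With the real files, the list passed in would be the full paths built from `start_path` and `pathways`.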

Dropping columns: I will drop **birth year, gender, postal code** since they are present in only about half of the files or fewer and are not the most informative. I will drop **ride_id** since it only distinguishes rides from each other, **bikeid** because I don't care much about the particular bike (not sure about this assumption), and **tripduration** because it can be derived from starttime and stoptime (I'll engineer a new column after this).

I also plan to drop **start station id** and **end station id** because their format changes at 202304 and they are redundant with the start and end station names. Note that the loading code below still keeps both id columns for now.

For now, I will drop **rideable_type** because it only appears in 27 files. However, it is a meaningful variable for predicting other things, so I will do more research (maybe BlueBikes only started offering e-bikes in a certain year and before that there were only classic bikes; I also don't know the difference between a classic and a docked bike, so I will look into that later).

(Might need to rewrite/move.) Next, I'll rename columns, standardize formatting, and use EDA to visualize the data and its missing values before deciding how to fill them in.
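
If rideable_type turns out to be worth keeping, one possible approach is to select columns with `reindex`, which fills a column with NaN when a file does not have it. This is only a sketch on two tiny made-up frames, not the loading code used in this notebook, and the shortened `keep_columns` list is for illustration.

```python
import pandas as pd

keep_columns = ["started_at", "member_casual", "rideable_type"]

# Hypothetical miniature frames: the older file has no rideable_type column.
old_df = pd.DataFrame(
    {"started_at": ["2015-01-01 00:21:44"], "member_casual": ["Subscriber"]}
)
new_df = pd.DataFrame(
    {
        "started_at": ["2023-04-13 13:49:59"],
        "member_casual": ["member"],
        "rideable_type": ["electric_bike"],
    }
)

# reindex keeps the requested columns and fills absent ones with NaN,
# whereas old_df[keep_columns] would raise a KeyError.
frames = [df.reindex(columns=keep_columns) for df in (old_df, new_df)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The trade-off is that every pre-202304 row would carry a missing rideable_type, which would then need its own handling during EDA.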

In [189]:
# This takes roughly 1 min 15 sec to run.

def load_and_clean_csv(filepath):
    """Read one trip CSV, standardize its column names, and keep the shared columns."""
    df = pd.read_csv(filepath)

    # Map the pre-202304 column names to the names used from 202304 onward.
    # Files that already use the new names are left unchanged by rename().
    renames = {
        'starttime': 'started_at',
        'stoptime': 'ended_at',
        'start station id': 'start_station_id',
        'start station name': 'start_station_name',
        'start station latitude': 'start_lat',
        'start station longitude': 'start_lng',
        'end station id': 'end_station_id',
        'end station name': 'end_station_name',
        'end station latitude': 'end_lat',
        'end station longitude': 'end_lng',
        'usertype': 'member_casual'
    }
    df = df.rename(columns=renames)

    # Keep only the columns shared by every file; all other columns are dropped.
    keep_columns = ['started_at', 'ended_at', 'start_station_name',
       'start_station_id', 'start_lat', 'start_lng', 'end_station_id',
       'end_station_name', 'end_lat', 'end_lng', 'member_casual']

    return df[keep_columns]

# Apply the function to every file (pathways comes from the earlier code cell).
dfs_list = [load_and_clean_csv(start_path + pathway) for pathway in pathways]

# Concatenate the per-file DataFrames into one.
big_df = pd.concat(dfs_list, ignore_index=True)

big_df

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual
0,2015-01-01 00:21:44,2015-01-01 00:30:47,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber
1,2015-01-01 00:27:03,2015-01-01 00:34:21,MIT Stata Center at Vassar St / Main St,80,42.361962,-71.092053,95,Cambridge St - at Columbia St / Webster Ave,42.372969,-71.094445,Subscriber
2,2015-01-01 00:31:31,2015-01-01 00:35:46,One Kendall Square at Hampshire St / Portland St,91,42.366277,-71.091690,68,Central Square at Mass Ave / Essex St,42.36507,-71.1031,Subscriber
3,2015-01-01 00:53:46,2015-01-01 01:00:58,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber
4,2015-01-01 01:07:06,2015-01-01 01:19:21,Lower Cambridgeport at Magazine St/Riverside Rd,105,42.356954,-71.113687,88,Inman Square at Vellucci Plaza / Hampshire St,42.374035,-71.101427,Customer
...,...,...,...,...,...,...,...,...,...,...,...
27065605,2025-06-12 20:20:45.355,2025-06-12 21:39:54.807,Tremont St at Northampton St,C32056,42.338432,-71.081690,B32015,Landmark Center - Brookline Ave at Park Dr,42.343691,-71.102353,member
27065606,2025-06-25 17:13:30.010,2025-06-25 17:20:29.507,Park Street T Stop - Tremont St at Park St,B32068,42.356627,-71.062457,C32077,Columbus Ave at W. Canton St,42.344742,-71.076482,member
27065607,2025-06-09 16:03:20.398,2025-06-09 16:31:22.323,Innovation Lab - 125 Western Ave at Batten Way,A32011,42.363713,-71.124598,K32012,Marion St at Harvard St,42.340122,-71.120706,member
27065608,2025-06-03 09:22:41.353,2025-06-03 09:38:11.359,Centre St at Seaverns Ave,E32008,42.313580,-71.114050,B32003,HMS/HSPH - Avenue Louis Pasteur at Longwood Ave,42.337417,-71.102861,member


In [None]:
# TODO: fix the datetime format of started_at and ended_at, which changes
# beginning with the 202304 file:
#   2023-03-01 00:00:44.1520 --> 2023-04-13 13:49:59
# The station id format also changes at 202304 (e.g. 386 --> A32011), and
# usertype values change from Customer/Subscriber to member/casual.
# Station names and latitude/longitude keep the same format.

In [190]:
big_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27065610 entries, 0 to 27065609
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   started_at          object 
 1   ended_at            object 
 2   start_station_name  object 
 3   start_station_id    object 
 4   start_lat           float64
 5   start_lng           float64
 6   end_station_id      object 
 7   end_station_name    object 
 8   end_lat             object 
 9   end_lng             object 
 10  member_casual       object 
dtypes: float64(2), object(9)
memory usage: 2.2+ GB
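
The `info()` output above lists `end_lat` and `end_lng` as `object` while `start_lat` and `start_lng` are `float64`. The cause is not shown here, but a common way to get a numeric column back is `pd.to_numeric` with `errors="coerce"`. The sample values below are invented to illustrate the call, and the string `"unknown"` is a hypothetical bad value.

```python
import pandas as pd

# Hypothetical sample mimicking an object-dtype coordinate column.
end_lat = pd.Series([42.373379, "42.35", None, "unknown"], dtype="object")

# errors="coerce" converts anything that cannot be parsed as a number to NaN.
numeric_lat = pd.to_numeric(end_lat, errors="coerce")
print(numeric_lat)
print(numeric_lat.dtype)
```

Coercion would add to the null counts, so it makes sense to run it before inspecting missing values.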


In [191]:
big_df.describe()

Unnamed: 0,start_lat,start_lng
count,27065610.0,27065610.0
mean,42.35804,-71.08889
std,0.05348415,0.08861284
min,0.0,-73.56692
25%,42.34828,-71.10594
50%,42.3581,-71.08995
75%,42.36643,-71.06985
max,45.50509,0.0


In [192]:
big_df.isna().sum()

started_at                0
ended_at                  0
start_station_name     2033
start_station_id       2033
start_lat                 0
start_lng                 0
end_station_id        29036
end_station_name      28417
end_lat               21829
end_lng               21829
member_casual             0
dtype: int64

In [193]:
# view rows with at least one NaN value
big_df[big_df.isna().any(axis=1)]

# Many of these rides lasted over a day. They may be abandoned, lost, or stolen bikes,
# data entry errors, or broken docks, so I will probably drop them.
# To test this, let's engineer a new duration feature from started_at and ended_at.

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual
17220605,2023-04-18 17:29:09,2023-04-18 19:38:59,Danehy Park,M32031,42.388966,-71.132788,,,,,casual
17220606,2023-04-05 22:36:18,2023-04-13 19:26:02,Day Sq,A32051,42.379295,-71.027733,,,,,casual
17220618,2023-04-25 08:35:05,2023-04-25 09:16:57,359 Broadway - Broadway at Fayette Street,M32026,42.370803,-71.104412,,,,,casual
17220620,2023-04-09 17:30:05,2023-04-15 15:47:25,Blue Hill Ave at Almont St,C32044,42.274621,-71.093726,,,,,casual
17220622,2023-04-24 22:15:25,2023-04-28 14:44:45,Talbot Ave At Blue Hill Ave,C32043,42.294583,-71.087111,,,,,casual
...,...,...,...,...,...,...,...,...,...,...,...
27065551,2025-06-06 18:27:50.927,2025-06-07 19:27:43.969,Forest Hills,E32010,42.300923,-71.114249,,,,,casual
27065565,2025-06-16 16:42:03.580,2025-06-17 00:51:50.667,Commonwealth Ave at Naples Rd,E32016,42.351911,-71.123798,,,42.35,-71.14,casual
27065570,2025-06-05 11:07:18.462,2025-06-06 12:07:14.354,Beacon St at Charles St,D32024,42.356052,-71.069849,,,,,casual
27065577,2025-06-17 12:40:35.574,2025-06-18 13:40:25.827,Government Center - Cambridge St at Court St,B32032,42.359825,-71.059796,,,,,casual


In [194]:
# big_df[big_df.isna().any(axis=1)].index

Int64Index([17220605, 17220606, 17220618, 17220620, 17220622, 17220624,
            17220635, 17220638, 17220644, 17220646,
            ...
            27065468, 27065469, 27065479, 27065488, 27065531, 27065551,
            27065565, 27065570, 27065577, 27065603],
           dtype='int64', length=30447)
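
Instead of collecting row labels into a list, it may be simpler to keep the boolean mask itself and reuse it. The sketch below uses a tiny made-up frame and the hypothetical names `null_mask`, `null_rows`, and `clean_rows`.

```python
import pandas as pd

# Hypothetical miniature frame with missing end stations.
df = pd.DataFrame(
    {
        "start_station_name": ["Danehy Park", "Day Sq", "Porter Square Station"],
        "end_station_name": [None, None, "Davis Square"],
    }
)

# True for every row that has at least one missing value.
null_mask = df.isna().any(axis=1)

null_rows = df[null_mask]    # rows to inspect
clean_rows = df[~null_mask]  # same result as df.dropna()
print(len(null_rows), len(clean_rows))
```

The mask stays aligned with the frame's index, so it can be reused after new columns are added, as long as no rows are added or removed.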

In [195]:
null_indices = big_df[big_df.isna().any(axis=1)].index.tolist()


In [196]:
null_indices

[17220605,
 17220606,
 17220618,
 17220620,
 17220622,
 ...]

In [197]:
test_rows = big_df.iloc[:2]
test_rows

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual
0,2015-01-01 00:21:44,2015-01-01 00:30:47,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber
1,2015-01-01 00:27:03,2015-01-01 00:34:21,MIT Stata Center at Vassar St / Main St,80,42.361962,-71.092053,95,Cambridge St - at Columbia St / Webster Ave,42.372969,-71.094445,Subscriber


In [198]:
type(test_rows.iloc[0]['started_at'])

str

In [199]:
# cast string to datetime
pd.to_datetime(test_rows.iloc[0]['started_at'])


Timestamp('2015-01-01 00:21:44')

In [200]:
# apply to columns
big_df['started_at'] = pd.to_datetime(big_df['started_at'])
big_df['ended_at'] = pd.to_datetime(big_df['ended_at'])
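
The timestamps come in more than one format (with and without fractional seconds), and how `pd.to_datetime` handles mixed formats without an explicit `format` can depend on the pandas version. One hedged alternative is to parse each known format separately and combine the results. The sample strings are taken from the formats seen above; the variable names are illustrative.

```python
import pandas as pd

raw = pd.Series(
    [
        "2015-01-01 00:21:44",
        "2023-03-01 00:00:44.1520",
        "2025-06-12 20:20:45.355",
    ]
)

# errors="coerce" turns anything that fails to parse into NaT instead of raising.
whole_seconds = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")
fractional = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")

# Fill the gaps left by the first pass with the second pass.
parsed = whole_seconds.fillna(fractional)
print(parsed)
```

Any value still NaT after both passes would point to a third format worth inspecting by hand.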

In [201]:
# engineer new duration column
big_df['duration'] = big_df['ended_at'] - big_df['started_at']
big_df.head()

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
0,2015-01-01 00:21:44,2015-01-01 00:30:47,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber,0 days 00:09:03
1,2015-01-01 00:27:03,2015-01-01 00:34:21,MIT Stata Center at Vassar St / Main St,80,42.361962,-71.092053,95,Cambridge St - at Columbia St / Webster Ave,42.372969,-71.094445,Subscriber,0 days 00:07:18
2,2015-01-01 00:31:31,2015-01-01 00:35:46,One Kendall Square at Hampshire St / Portland St,91,42.366277,-71.09169,68,Central Square at Mass Ave / Essex St,42.36507,-71.1031,Subscriber,0 days 00:04:15
3,2015-01-01 00:53:46,2015-01-01 01:00:58,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber,0 days 00:07:12
4,2015-01-01 01:07:06,2015-01-01 01:19:21,Lower Cambridgeport at Magazine St/Riverside Rd,105,42.356954,-71.113687,88,Inman Square at Vellucci Plaza / Hampshire St,42.374035,-71.101427,Customer,0 days 00:12:15


In [202]:
# now let's look again at the null value rows' durations to see if they are removable outliers/exceptions

big_df.loc[null_indices]

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
17220605,2023-04-18 17:29:09.000,2023-04-18 19:38:59.000,Danehy Park,M32031,42.388966,-71.132788,,,,,casual,0 days 02:09:50
17220606,2023-04-05 22:36:18.000,2023-04-13 19:26:02.000,Day Sq,A32051,42.379295,-71.027733,,,,,casual,7 days 20:49:44
17220618,2023-04-25 08:35:05.000,2023-04-25 09:16:57.000,359 Broadway - Broadway at Fayette Street,M32026,42.370803,-71.104412,,,,,casual,0 days 00:41:52
17220620,2023-04-09 17:30:05.000,2023-04-15 15:47:25.000,Blue Hill Ave at Almont St,C32044,42.274621,-71.093726,,,,,casual,5 days 22:17:20
17220622,2023-04-24 22:15:25.000,2023-04-28 14:44:45.000,Talbot Ave At Blue Hill Ave,C32043,42.294583,-71.087111,,,,,casual,3 days 16:29:20
...,...,...,...,...,...,...,...,...,...,...,...,...
27065551,2025-06-06 18:27:50.927,2025-06-07 19:27:43.969,Forest Hills,E32010,42.300923,-71.114249,,,,,casual,1 days 00:59:53.042000
27065565,2025-06-16 16:42:03.580,2025-06-17 00:51:50.667,Commonwealth Ave at Naples Rd,E32016,42.351911,-71.123798,,,42.35,-71.14,casual,0 days 08:09:47.087000
27065570,2025-06-05 11:07:18.462,2025-06-06 12:07:14.354,Beacon St at Charles St,D32024,42.356052,-71.069849,,,,,casual,1 days 00:59:55.892000
27065577,2025-06-17 12:40:35.574,2025-06-18 13:40:25.827,Government Center - Cambridge St at Court St,B32032,42.359825,-71.059796,,,,,casual,1 days 00:59:50.253000


In [203]:
# TODO: compare summary stats with the null rows removed (this cell currently just re-displays the null rows)

big_df.loc[null_indices]

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
17220605,2023-04-18 17:29:09.000,2023-04-18 19:38:59.000,Danehy Park,M32031,42.388966,-71.132788,,,,,casual,0 days 02:09:50
17220606,2023-04-05 22:36:18.000,2023-04-13 19:26:02.000,Day Sq,A32051,42.379295,-71.027733,,,,,casual,7 days 20:49:44
17220618,2023-04-25 08:35:05.000,2023-04-25 09:16:57.000,359 Broadway - Broadway at Fayette Street,M32026,42.370803,-71.104412,,,,,casual,0 days 00:41:52
17220620,2023-04-09 17:30:05.000,2023-04-15 15:47:25.000,Blue Hill Ave at Almont St,C32044,42.274621,-71.093726,,,,,casual,5 days 22:17:20
17220622,2023-04-24 22:15:25.000,2023-04-28 14:44:45.000,Talbot Ave At Blue Hill Ave,C32043,42.294583,-71.087111,,,,,casual,3 days 16:29:20
...,...,...,...,...,...,...,...,...,...,...,...,...
27065551,2025-06-06 18:27:50.927,2025-06-07 19:27:43.969,Forest Hills,E32010,42.300923,-71.114249,,,,,casual,1 days 00:59:53.042000
27065565,2025-06-16 16:42:03.580,2025-06-17 00:51:50.667,Commonwealth Ave at Naples Rd,E32016,42.351911,-71.123798,,,42.35,-71.14,casual,0 days 08:09:47.087000
27065570,2025-06-05 11:07:18.462,2025-06-06 12:07:14.354,Beacon St at Charles St,D32024,42.356052,-71.069849,,,,,casual,1 days 00:59:55.892000
27065577,2025-06-17 12:40:35.574,2025-06-18 13:40:25.827,Government Center - Cambridge St at Court St,B32032,42.359825,-71.059796,,,,,casual,1 days 00:59:50.253000


In [204]:
big_df.head()

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
0,2015-01-01 00:21:44,2015-01-01 00:30:47,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber,0 days 00:09:03
1,2015-01-01 00:27:03,2015-01-01 00:34:21,MIT Stata Center at Vassar St / Main St,80,42.361962,-71.092053,95,Cambridge St - at Columbia St / Webster Ave,42.372969,-71.094445,Subscriber,0 days 00:07:18
2,2015-01-01 00:31:31,2015-01-01 00:35:46,One Kendall Square at Hampshire St / Portland St,91,42.366277,-71.09169,68,Central Square at Mass Ave / Essex St,42.36507,-71.1031,Subscriber,0 days 00:04:15
3,2015-01-01 00:53:46,2015-01-01 01:00:58,Porter Square Station,115,42.387995,-71.119084,96,Cambridge Main Library at Broadway / Trowbridg...,42.373379,-71.111075,Subscriber,0 days 00:07:12
4,2015-01-01 01:07:06,2015-01-01 01:19:21,Lower Cambridgeport at Magazine St/Riverside Rd,105,42.356954,-71.113687,88,Inman Square at Vellucci Plaza / Hampshire St,42.374035,-71.101427,Customer,0 days 00:12:15


In [205]:
# These rows do look suspicious, so drop every row that has a null value

clean_df = big_df[big_df.notna().all(axis=1)]

In [206]:
clean_df.shape[0]

27035163

That leaves 27,035,163 rows with no null values.

In [207]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27035163 entries, 0 to 27065609
Data columns (total 12 columns):
 #   Column              Dtype          
---  ------              -----          
 0   started_at          datetime64[ns] 
 1   ended_at            datetime64[ns] 
 2   start_station_name  object         
 3   start_station_id    object         
 4   start_lat           float64        
 5   start_lng           float64        
 6   end_station_id      object         
 7   end_station_name    object         
 8   end_lat             object         
 9   end_lng             object         
 10  member_casual       object         
 11  duration            timedelta64[ns]
dtypes: datetime64[ns](2), float64(2), object(7), timedelta64[ns](1)
memory usage: 2.6+ GB


In [208]:
clean_df.describe()

Unnamed: 0,start_lat,start_lng,duration
count,27035160.0,27035160.0,27035163
mean,42.35804,-71.08889,0 days 00:25:02.690534287
std,0.0535063,0.08865524,0 days 11:32:14.512574284
min,0.0,-73.56692,-1 days +23:05:22
25%,42.34828,-71.10594,0 days 00:06:58.086000
50%,42.3581,-71.08995,0 days 00:11:41.707000
75%,42.36643,-71.06985,0 days 00:19:37
max,45.50509,0.0,492 days 15:12:17.825000


The minimum duration is negative, which looks wrong, and the maximum duration is 492 days. Both need investigating.

In [None]:
from datetime import timedelta

# Investigate durations longer than one day
# TODO: look into these in more detail

clean_df[clean_df['duration'] > timedelta(days=1)]

# These will be filtered out

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
1495,2015-01-08 01:06:37.000,2015-01-09 14:39:54.000,Lechmere Station at Cambridge St / First St,90,42.370677,-71.076529,1,18 Dorrance Warehouse,42.387151,-71.075978,Subscriber,1 days 13:33:17
4064,2015-01-17 13:55:59.000,2015-01-19 11:21:08.000,MIT at Mass Ave / Amherst St,67,42.358100,-71.093198,1,18 Dorrance Warehouse,42.387151,-71.075978,Subscriber,1 days 21:25:09
7168,2015-01-26 17:14:12.000,2015-01-29 09:46:11.000,Ames St at Main St,107,42.362500,-71.088220,67,MIT at Mass Ave / Amherst St,42.3581,-71.093198,Customer,2 days 16:31:59
8562,2015-02-06 15:31:02.000,2015-02-08 11:03:53.000,Harvard University Gund Hall at Quincy St / Ki...,110,42.376369,-71.114025,87,Harvard University Housing - 115 Putnam Ave at...,42.366621,-71.114214,Subscriber,1 days 19:32:51
9228,2015-02-13 08:04:02.000,2015-02-14 10:23:49.000,Lafayette Square at Mass Ave / Main St / Colum...,75,42.363465,-71.100573,107,Ames St at Main St,42.3625,-71.08822,Customer,1 days 02:19:47
...,...,...,...,...,...,...,...,...,...,...,...,...
26803560,2025-06-29 11:50:47.357,2025-06-30 12:47:52.438,Boylston St at Fairfield St,C32008,42.348804,-71.082369,D32018,Boston Convention and Exhibition Center - Summ...,42.347763,-71.04536,casual,1 days 00:57:05.081000
26868490,2025-06-15 19:13:16.445,2025-06-16 19:35:06.237,Alewife Station at Russell Field,M32033,42.396232,-71.139788,S32006,Davis Square,42.396969,-71.123024,casual,1 days 00:21:49.792000
26869040,2025-06-23 06:57:30.021,2025-06-24 07:38:40.778,Cambridge Crossing at North First Street,M32077,42.371141,-71.076198,M32048,Third at Binney,42.365445,-71.082771,casual,1 days 00:41:10.757000
26955634,2025-06-07 13:06:51.222,2025-06-08 13:28:51.492,Washington Sq,K32011,42.339858,-71.134212,D32035,Harvard Ave at Brainerd Rd,42.34953,-71.130228,casual,1 days 00:22:00.270000


In [None]:
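# Also check durations < 0
clean_df[clean_df['duration'] < timedelta(days=0)] 

# Deleting these because a negative duration is not possible.
# Note: the rows shown fall around 1 a.m. on 2015-11-01 and 2024-11-03, the dates US
# daylight saving time ended, so they may be clock fall-back artifacts rather than bad data.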

Unnamed: 0,started_at,ended_at,start_station_name,start_station_id,start_lat,start_lng,end_station_id,end_station_name,end_lat,end_lng,member_casual,duration
984979,2015-11-01 01:44:08.000,2015-11-01 01:08:25.000,Union Square - Brighton Ave. at Cambridge St.,8,42.353334,-71.137313,76,Central Sq Post Office / Cambridge City Hall a...,42.366426,-71.105495,Subscriber,-1 days +23:24:17
984980,2015-11-01 01:47:16.000,2015-11-01 01:03:13.000,Inman Square at Vellucci Plaza / Hampshire St,88,42.374035,-71.101427,74,Harvard Square at Mass Ave/ Dunster,42.373268,-71.118579,Subscriber,-1 days +23:15:57
984981,2015-11-01 01:48:11.000,2015-11-01 01:13:02.000,Columbus Ave. at Mass. Ave.,57,42.340799,-71.081572,132,Summer St at Cutter St,42.394002,-71.120406,Subscriber,-1 days +23:24:51
984985,2015-11-01 01:50:17.000,2015-11-01 01:32:53.000,West Broadway at Dorchester St,121,42.335693,-71.045859,13,Boston Medical Center - East Concord at Harri...,42.336437,-71.073089,Subscriber,-1 days +23:42:36
984986,2015-11-01 01:50:36.000,2015-11-01 01:28:45.000,West Broadway at Dorchester St,121,42.335693,-71.045859,13,Boston Medical Center - East Concord at Harri...,42.336437,-71.073089,Subscriber,-1 days +23:38:09
...,...,...,...,...,...,...,...,...,...,...,...,...
24864789,2024-11-03 01:35:18.803,2024-11-03 01:01:15.174,Chinatown Gate Plaza,D32015,42.351356,-71.059367,M32012,Central Sq Post Office / Cambridge City Hall a...,42.366426,-71.105495,casual,-1 days +23:25:56.371000
24924629,2024-11-03 01:46:23.903,2024-11-03 01:00:37.868,Inman Square at Springfield St.,M32062,42.374267,-71.100265,D32010,Cross St at Hanover St,42.362811,-71.056067,member,-1 days +23:14:13.965000
24953416,2024-11-03 01:49:41.708,2024-11-03 01:02:27.414,Clinton St at North St,D32057,42.360703,-71.055249,V32011,Broadway at Beacham St,42.398361,-71.063738,member,-1 days +23:12:45.706000
24955104,2024-11-03 01:44:19.671,2024-11-03 01:14:05.677,Boylston St at Arlington St,D32007,42.352052,-71.070360,S32018,Assembly Square T,42.392233,-71.077466,casual,-1 days +23:29:46.006000
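
The negative durations shown above cluster around 1 a.m. on dates when US clocks fall back an hour. The sketch below illustrates that effect on the first row of the output. It assumes the timestamps are Boston local time (`America/New_York`), that the ride started before the fall-back and ended after it, and it uses made-up variable names. None of this is confirmed by the data source.

```python
import pandas as pd

tz = "America/New_York"  # assumption: timestamps are Boston local time

# Values copied from row 984979 above.
start_naive = pd.Timestamp("2015-11-01 01:44:08")
end_naive = pd.Timestamp("2015-11-01 01:08:25")
naive_duration = end_naive - start_naive

# The 1 a.m. hour occurs twice on this date. ambiguous=True selects the first
# (daylight time) occurrence, ambiguous=False the second (standard time) one.
start_local = start_naive.tz_localize(tz, ambiguous=True)
end_local = end_naive.tz_localize(tz, ambiguous=False)
aware_duration = end_local - start_local

print(naive_duration)
print(aware_duration)
```

Under these assumptions the ride lasts about 24 minutes, so dropping such rows loses a small number of plausibly valid trips. Recovering them in bulk would need a rule for deciding which occurrence of the repeated hour each timestamp belongs to.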


In [216]:
clean_df = clean_df[(clean_df['duration'] < timedelta(days=1)) & (clean_df['duration'] > timedelta(0))]

In [217]:
clean_df.info()
# yay final df!

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27015461 entries, 0 to 27065609
Data columns (total 12 columns):
 #   Column              Dtype          
---  ------              -----          
 0   started_at          datetime64[ns] 
 1   ended_at            datetime64[ns] 
 2   start_station_name  object         
 3   start_station_id    object         
 4   start_lat           float64        
 5   start_lng           float64        
 6   end_station_id      object         
 7   end_station_name    object         
 8   end_lat             object         
 9   end_lng             object         
 10  member_casual       object         
 11  duration            timedelta64[ns]
dtypes: datetime64[ns](2), float64(2), object(7), timedelta64[ns](1)
memory usage: 2.6+ GB


In [218]:
clean_df.describe()

Unnamed: 0,start_lat,start_lng,duration
count,27015460.0,27015460.0,27015461
mean,42.35804,-71.0889,0 days 00:17:25.553920026
std,0.05289558,0.08762087,0 days 00:34:13.984093683
min,0.0,-73.56692,0 days 00:00:01
25%,42.34828,-71.10594,0 days 00:06:58
50%,42.3581,-71.08995,0 days 00:11:41.031000
75%,42.36643,-71.06985,0 days 00:19:35.317000
max,45.50509,0.0,0 days 23:59:58
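
The summary above still shows a minimum `start_lat` of 0.0, a maximum `start_lng` of 0.0, and a maximum latitude of about 45.5, none of which is near Boston. A bounding-box check is one way to flag such rows. The box limits and the sample rows below are rough assumptions for illustration and should be tuned before being used on the real data.

```python
import pandas as pd

# Assumed rough bounding box around the Boston area; adjust as needed.
LAT_RANGE = (42.0, 42.7)
LNG_RANGE = (-71.5, -70.8)

# Hypothetical rows: one plausible station and two out-of-area coordinates.
df = pd.DataFrame(
    {
        "start_lat": [42.387995, 0.0, 45.50509],
        "start_lng": [-71.119084, 0.0, -73.56692],
    }
)

# between() is inclusive on both ends by default.
in_box = df["start_lat"].between(*LAT_RANGE) & df["start_lng"].between(*LNG_RANGE)
suspect = df[~in_box]
print(suspect)
```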


In [219]:
clean_df.isna().sum()

started_at            0
ended_at              0
start_station_name    0
start_station_id      0
start_lat             0
start_lng             0
end_station_id        0
end_station_name      0
end_lat               0
end_lng               0
member_casual         0
duration              0
dtype: int64

No null values remain, and the durations look reasonable. As a final preprocessing step, let's double-check the unique values of each column.

In [None]:
# check unique values for categorical cols like station names and ids and member_casual

clean_df.columns

Index(['started_at', 'ended_at', 'start_station_name', 'start_station_id',
       'start_lat', 'start_lng', 'end_station_id', 'end_station_name',
       'end_lat', 'end_lng', 'member_casual', 'duration'],
      dtype='object')

In [243]:
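# Note: ndarray.sort() sorts in place and returns None, and the id columns appear to mix
# ints (older files) with strings (newer files), which raises a TypeError when sorted.
# Convert the unique values to str and use sorted() so each variable holds a sorted list.
start_names = sorted(set(map(str, clean_df["start_station_name"].unique())))
end_names = sorted(set(map(str, clean_df["end_station_name"].unique())))
start_id = sorted(set(map(str, clean_df["start_station_id"].unique())))
end_id = sorted(set(map(str, clean_df["end_station_id"].unique())))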

In [244]:
start_names

In [241]:
from collections import Counter
# Compare the unique start and end values to see whether they are identical

print(Counter(start_names) == Counter(end_names))
print(Counter(start_id) == Counter(end_id))

False
False
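
Since both comparisons come back `False`, the next question is which values differ. Set differences give that directly. This sketch uses a tiny made-up frame, and the names `only_start` and `only_end` are illustrative.

```python
import pandas as pd

# Hypothetical miniature frame of trips.
df = pd.DataFrame(
    {
        "start_station_name": ["Porter Square Station", "Davis Square", "Day Sq"],
        "end_station_name": [
            "Davis Square",
            "18 Dorrance Warehouse",
            "Porter Square Station",
        ],
    }
)

start_set = set(df["start_station_name"])
end_set = set(df["end_station_name"])

# Stations that only ever appear on one side of a trip.
only_start = sorted(start_set - end_set)
only_end = sorted(end_set - start_set)
print(only_start)
print(only_end)
```

On the real data, the same two differences could be computed for the station id columns after converting the ids to strings.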


In [None]:
# rename columns
# standardize formatting e.g. starttime

In [None]:
# eda techniques

# EDA