# 01B - Data Cleaning, Toronto Bike Share

## Import relevant libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

---

## Import Data

Import the 2019 to 2023 data. See the **01A - Data Import, Toronto Bike Share** notebook for how the data was compiled.

In [2]:
dtype_dictionary = {"Trip Id": 'string', 
                    "Trip  Duration": 'Int64',
                    "Start Station Id": 'string', 
                    "Start Time": 'string', #use parse_dates in pd.read_csv()
                    "Start Station Name": 'string',
                    "End Station Id": 'string',
                    "End Time": 'string', #use parse_dates in pd.read_csv()
                    "End Station Name": 'string',
                    "Bike Id": 'string',
                    "User Type": 'string'
                   }

list_parse_date = ["Start Time", "End Time"]

In [3]:
# Read Ridership Data
df_bike_share_trips_raw = pd.read_csv("data/2019_to_2023_bike_ridership_raw.csv", 
                                      dtype=dtype_dictionary, parse_dates=list_parse_date)

# Keep the original import dataset for comparability / being able to go back 
df_bike_share_trips = df_bike_share_trips_raw.copy()

In [4]:
# Read Other Data

# Stations
df_bike_stations = pd.read_csv("data/bike_stations_raw.csv",
                              dtype={'station_id':'string', 'station_name':'string', 'lat':'float64', 'lon':'float64'}) 

# Price Plans
df_bike_priceplans = pd.read_csv("data/bike_priceplans_raw.csv",
                                dtype={'plan_id': 'string', 'name': 'string', 'currency': 'string', 'description': 'string'})

*Ridership Data - Check*

In [5]:
df_bike_share_trips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19259368 entries, 0 to 19259367
Data columns (total 10 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Trip Id             string        
 1   Trip  Duration      Int64         
 2   Start Station Id    string        
 3   Start Time          datetime64[ns]
 4   Start Station Name  string        
 5   End Station Id      string        
 6   End Time            datetime64[ns]
 7   End Station Name    string        
 8   Bike Id             string        
 9   User Type           string        
dtypes: Int64(1), datetime64[ns](2), string(7)
memory usage: 1.5 GB


In [6]:
df_bike_share_trips.head() #visualize

Unnamed: 0,Trip Id,Trip Duration,Start Station Id,Start Time,Start Station Name,End Station Id,End Time,End Station Name,Bike Id,User Type
0,4581278,1547,7021,2019-01-01 00:08:00,Bay St / Albert St,7233,2019-01-01 00:33:00,King / Cowan Ave - SMART,1296,Annual Member
1,4581279,1112,7160,2019-01-01 00:10:00,King St W / Tecumseth St,7051,2019-01-01 00:29:00,Wellesley St E / Yonge St (Green P),2947,Annual Member
2,4581280,589,7055,2019-01-01 00:15:00,Jarvis St / Carlton St,7013,2019-01-01 00:25:00,Scott St / The Esplanade,2293,Annual Member
3,4581281,259,7012,2019-01-01 00:16:00,Elizabeth St / Edward St (Bus Terminal),7235,2019-01-01 00:20:00,Bay St / College St (West Side) - SMART,283,Annual Member
4,4581282,281,7041,2019-01-01 00:19:00,Edward St / Yonge St,7257,2019-01-01 00:24:00,Dundas St W / St. Patrick St,1799,Annual Member


*Stations Data - Check*

In [7]:
df_bike_stations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785 entries, 0 to 784
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   station_id    785 non-null    string 
 1   station_name  785 non-null    string 
 2   lat           785 non-null    float64
 3   lon           785 non-null    float64
dtypes: float64(2), string(2)
memory usage: 24.7 KB


In [8]:
df_bike_stations.head()

Unnamed: 0,station_id,station_name,lat,lon
0,7000,Fort York Blvd / Capreol Ct,43.639832,-79.395954
1,7001,Wellesley Station Green P,43.664964,-79.38355
2,7002,St. George St / Bloor St W,43.667333,-79.399429
3,7003,Madison Ave / Bloor St W,43.667158,-79.402761
4,7005,King St W / York St,43.648001,-79.383177


*Price Plan Data - Check*

In [9]:
df_bike_priceplans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   plan_id      49 non-null     string 
 1   name         49 non-null     string 
 2   currency     49 non-null     string 
 3   price        49 non-null     float64
 4   description  40 non-null     string 
 5   is_taxable   49 non-null     int64  
dtypes: float64(1), int64(1), string(4)
memory usage: 2.4 KB


In [10]:
df_bike_priceplans.head()

Unnamed: 0,plan_id,name,currency,price,description,is_taxable
0,186,Annual 30,CAD,105.0,Unlimited 30-min trips on classic bikes,1
1,191,CMP-City of Toronto,CAD,90.0,CMP-City of Toronto,1
2,208,Annual 45,CAD,120.0,Unlimited 45-min trips on classic bikes,1
3,209,Corporate 30,CAD,84.0,Corporate 30,1
4,210,Corporate 45,CAD,96.0,Corporate 45,1


## Bike Station

On closer inspection, some stations that appear in the trips dataframe are missing from the stations dataframe. They were likely removed from the station feed because they are no longer active.

I found supplemental information in a CSV file online (i.e. an older hardcoded file) to help bridge the gaps in our stations dataframe.

In a professional setting, I would likely rely on older internal files, historical tables, etc. rather than files found online.

In [11]:
# Stations SUPPLEMENT
df_bike_stations_supplement = pd.read_csv("data/bikeshare_stations_supplement.csv",
                                          dtype={'station_id':'string', 'station_name':'string', 'lat':'float64', 'lon':'float64'}) 

In [12]:
df_bike_stations_supplement.head() # visualize

Unnamed: 0,station_id,name,lat,lon
0,7000,Fort York Blvd / Capreol Crt,43.639832,-79.395954
1,7001,Lower Jarvis St / The Esplanade,43.647992,-79.370907
2,7002,St. George St / Bloor St W,43.667333,-79.399429
3,7003,Madison Ave / Bloor St W,43.667158,-79.402761
4,7004,University Ave / Elm St,43.656518,-79.389099


In [13]:
df_bike_stations.head()

Unnamed: 0,station_id,station_name,lat,lon
0,7000,Fort York Blvd / Capreol Ct,43.639832,-79.395954
1,7001,Wellesley Station Green P,43.664964,-79.38355
2,7002,St. George St / Bloor St W,43.667333,-79.399429
3,7003,Madison Ave / Bloor St W,43.667158,-79.402761
4,7005,King St W / York St,43.648001,-79.383177


Even a quick side-by-side comparison of the first 5 rows shows that Station ID 7004 is no longer in the stations dataframe (which was built from the JSON information available at the Bike Share URL; see the 01A notebook for more details).

In [14]:
# Get rows where station_id is in supplemental file but not in our current station file
df_stations_to_add = df_bike_stations_supplement[~df_bike_stations_supplement['station_id'].isin(df_bike_stations['station_id'])]
df_stations_to_add

Unnamed: 0,station_id,name,lat,lon
4,7004,University Ave / Elm St,43.656518,-79.389099
10,7011,Wellington St W / Portland St,43.642982,-79.399256
50,7051,Wellesley St E / Yonge St (Green P),43.66506,-79.38357
55,7056,Parliament St / Gerrard St,43.662132,-79.36568
59,7060,Princess St / Adelaide St E,43.652123,-79.367139
61,7062,University Ave / College St,43.659226,-79.390213
66,7067,Yonge St / Harbour St,43.643795,-79.375413
88,7092,Pape Subway Green P,43.680223,-79.344062
94,7098,Riverdale Park South (Broadview Ave),43.667819,-79.35347
129,7134,Marlborough Ave / Yonge St,43.68,-79.391111


In [15]:
# Number of stations to add
df_stations_to_add.shape[0]

25

In [16]:
# Number of rows before adding
df_bike_stations.shape[0]

785

In [17]:
# Expected Number of rows after adding
df_bike_stations.shape[0] + df_stations_to_add.shape[0]

810

In [18]:
df_bike_stations = pd.concat([df_bike_stations, df_stations_to_add],ignore_index=True)

In [19]:
# Number of rows after adding
df_bike_stations.shape[0]

810
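
One thing to watch with the concatenation above: the supplement file labels its name column `name`, while the stations dataframe uses `station_name`. `pd.concat` aligns on column labels, so the appended rows would land in a separate `name` column and leave `station_name` empty for them. A minimal sketch of aligning the labels first, using toy frames (the variable names `stations`, `supplement`, `to_add` and `combined` are illustrative, not the notebook's own):

```python
import pandas as pd

# Toy stand-ins for the two station tables (a few rows for illustration only).
stations = pd.DataFrame({
    "station_id": ["7000", "7001"],
    "station_name": ["Fort York Blvd / Capreol Ct", "Wellesley Station Green P"],
    "lat": [43.639832, 43.664964],
    "lon": [-79.395954, -79.38355],
})
supplement = pd.DataFrame({
    "station_id": ["7000", "7004"],
    "name": ["Fort York Blvd / Capreol Crt", "University Ave / Elm St"],
    "lat": [43.639832, 43.656518],
    "lon": [-79.395954, -79.389099],
})

# Align the column label so concat stacks all names into one column.
supplement = supplement.rename(columns={"name": "station_name"})

# Keep only stations that are not already present, then append them.
to_add = supplement[~supplement["station_id"].isin(stations["station_id"])]
combined = pd.concat([stations, to_add], ignore_index=True)

print(combined.columns.tolist())
print(combined["station_name"].isna().sum())
```

This is a sketch on toy data, not a rerun of the notebook. The counts reported elsewhere in this notebook come from the concatenation as originally written.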

In [20]:
# Write to CSV 
df_bike_stations.to_csv("data/bike_stations_clean.csv",
                           index = False) # don't include index as a separate column

## Missing Data

Run some preliminary checks and summaries.

In [21]:
df_bike_share_trips.duplicated().sum()

0

In [22]:
# Count of missing data
df_bike_share_trips.isna().sum()

Trip Id                    0
Trip  Duration            16
Start Station Id           0
Start Time                 0
Start Station Name    783014
End Station Id          7947
End Time                   0
End Station Name      791926
Bike Id                  276
User Type                  0
dtype: int64

##### Approach:
- I plan on tackling Trip Duration first, as the amount of missing data is relatively insignificant.
- I will look at Bike Id next, as it is independent of the station data.
- I will then look at the station-related data, starting with End Station Id, as this is the unique identifier for stations. If both the Id and Name are missing, there is no way to identify the end station.
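
To help prioritise the steps above, the missing counts can be shown next to their share of total rows. The helper below is a small sketch on toy data; `missing_summary` and the `toy` frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(3),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for the trips data.
toy = pd.DataFrame({
    "Trip Id": ["1", "2", "3", "4"],
    "End Station Id": ["7000", None, "7002", None],
    "Bike Id": ["10", "11", None, "13"],
})

summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(df_bike_share_trips)`.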

### Missing Data - Trip Duration

In [23]:
# Magnitude of Missing Data
df_bike_share_trips.isna()['Trip  Duration'].sum()

16

In [24]:
# Visualize Trip Duration Missing Values
df_bike_share_trips[df_bike_share_trips['Trip  Duration'].isna()]

Unnamed: 0,Trip Id,Trip Duration,Start Station Id,Start Time,Start Station Name,End Station Id,End Time,End Station Name,Bike Id,User Type
59771,4651028,,7380,2019-01-28 12:45:00,Erskine Ave / Yonge St SMART,7380,2019-01-28 12:45:00,Erskine Ave / Yonge St SMART,329,Annual Member
74933,4669491,,7309,2019-02-09 21:07:00,Queen St. E / Rhodes Ave.,7309,2019-02-09 21:07:00,Queen St. E / Rhodes Ave.,2173,Annual Member
78219,4673600,,7100,2019-02-12 13:58:00,Dundas St E / Regent Park Blvd,7100,2019-02-12 13:59:00,Dundas St E / Regent Park Blvd,3103,Annual Member
86634,4684203,,7324,2019-02-20 09:35:00,King St W / Charlotte St (West),7324,2019-02-20 09:35:00,King St W / Charlotte St (West),1834,Annual Member
120157,4724000,,7387,2019-03-11 18:47:00,Mortimer Ave / Carlaw Ave SMART,7387,2019-03-11 18:47:00,Mortimer Ave / Carlaw Ave SMART,557,Annual Member
159043,4768737,,7341,2019-03-23 14:20:00,Eastern Ave / Winnifred Ave,7341,2019-03-23 14:20:00,Eastern Ave / Winnifred Ave,3272,Annual Member
188615,4802014,,7385,2019-03-31 15:48:00,20 Charles St E,7385,2019-03-31 15:49:00,20 Charles St E,3733,Annual Member
312368,4942306,,7203,2019-04-26 11:11:00,Bathurst St/Queens Quay(Billy Bishop Airport),7203,2019-04-26 11:11:00,Bathurst St/Queens Quay(Billy Bishop Airport),1186,Annual Member
395955,5037597,,7354,2019-05-11 13:22:00,Tommy Thompson Park (Leslie Street Spit),7354,2019-05-11 13:22:00,Tommy Thompson Park (Leslie Street Spit),714,Annual Member
407899,5051307,,7385,2019-05-14 16:29:00,20 Charles St E,7385,2019-05-14 16:29:00,20 Charles St E,3167,Annual Member


Based on a quick visual inspection, it appears that Trip Duration is null when `Start Time` equals `End Time`, or differs from it by only a minute.

In [25]:
df_bike_share_trips[df_bike_share_trips['Trip  Duration'].isna()]['End Time']

59771     2019-01-28 12:45:00
74933     2019-02-09 21:07:00
78219     2019-02-12 13:59:00
86634     2019-02-20 09:35:00
120157    2019-03-11 18:47:00
159043    2019-03-23 14:20:00
188615    2019-03-31 15:49:00
312368    2019-04-26 11:11:00
395955    2019-05-11 13:22:00
407899    2019-05-14 16:29:00
492954    2019-05-25 16:52:00
1290325   2019-08-06 07:50:00
1674887   2019-09-04 13:13:00
1749283   2019-09-10 21:31:00
2198867   2019-10-28 17:49:00
2415410   2019-12-20 12:28:00
Name: End Time, dtype: datetime64[ns]

In [26]:
# Difference between End Time and Start Time for rows with a missing Trip Duration
na_rows = df_bike_share_trips[df_bike_share_trips['Trip  Duration'].isna()]
na_trip_delta = na_rows['End Time'] - na_rows['Start Time']
na_trip_delta #show

59771     0 days 00:00:00
74933     0 days 00:00:00
78219     0 days 00:01:00
86634     0 days 00:00:00
120157    0 days 00:00:00
159043    0 days 00:00:00
188615    0 days 00:01:00
312368    0 days 00:00:00
395955    0 days 00:00:00
407899    0 days 00:00:00
492954    0 days 00:00:00
1290325   0 days 00:00:00
1674887   0 days 00:00:00
1749283   0 days 00:00:00
2198867   0 days 00:00:00
2415410   0 days 00:00:00
dtype: timedelta64[ns]

In [27]:
# Convert Time Delta to integer in seconds
na_trip_delta = (na_trip_delta / np.timedelta64(1, 's')).astype('int64')
na_trip_delta

59771       0
74933       0
78219      60
86634       0
120157      0
159043      0
188615     60
312368      0
395955      0
407899      0
492954      0
1290325     0
1674887     0
1749283     0
2198867     0
2415410     0
dtype: int64

In [28]:
# Expected rows after drop
df_bike_share_trips.shape[0] - df_bike_share_trips['Trip  Duration'].isna().sum()

19259352

In [29]:
# Drop Rows where Trip Duration is missing
df_bike_share_trips.dropna(subset=['Trip  Duration'], inplace=True)

In [30]:
# Check that total rows are as expected
df_bike_share_trips.shape

(19259352, 10)

In [31]:
#Check Missing Values
df_bike_share_trips.isna().sum()

# Check that Trip Duration missing values no longer exist

Trip Id                    0
Trip  Duration             0
Start Station Id           0
Start Time                 0
Start Station Name    783014
End Station Id          7947
End Time                   0
End Station Name      791926
Bike Id                  276
User Type                  0
dtype: int64

---

### Missing Data - End Station Id

In [32]:
# Magnitude of Missing Data
df_bike_share_trips.isna()['End Station Id'].sum()

7947

In [33]:
# Visualize missing data for End Station Id
df_bike_share_trips[df_bike_share_trips['End Station Id'].isna()]

Unnamed: 0,Trip Id,Trip Duration,Start Station Id,Start Time,Start Station Name,End Station Id,End Time,End Station Name,Bike Id,User Type
693086,5370500,696,7228,2019-06-17 13:21:00,Queen St W / Roncesvalles Ave,,2019-06-17 13:32:00,,2345,Annual Member
969363,5679465,0,7077,2019-07-11 16:45:00,College Park South,,2019-07-11 16:45:00,,1232,Casual Member
1289598,6033723,327,7444,2019-08-06 01:41:00,Clendenan Ave / Rowland St - SMART,,2019-08-06 01:46:00,,1890,Casual Member
1917536,6735107,979,7432,2019-09-25 14:48:00,Frederick St / King St E,,2019-09-25 15:04:00,,2113,Annual Member
1927909,6746731,60,7077,2019-09-26 13:31:00,College Park South,,2019-09-26 13:32:00,,1430,Annual Member
...,...,...,...,...,...,...,...,...,...,...
19257075,26680335,0,7197,2023-12-31 15:04:00,Queen St W / Dovercourt Rd,,2023-12-31 15:04:00,,5739,Casual Member
19258545,26681858,0,7770,2023-12-31 18:50:00,,,2023-12-31 18:50:00,,5715,Casual Member
19258922,26682252,0,7537,2023-12-31 20:56:00,Euclid Ave / Herrick St - SMART,,2023-12-31 20:56:00,,3727,Casual Member
19258925,26682255,0,7537,2023-12-31 20:56:00,Euclid Ave / Herrick St - SMART,,2023-12-31 20:56:00,,2991,Casual Member


In [34]:
# Check that the same rows are missing for End Station ID and Name
print ("Number of rows where End Station Id and End Station Name, both have null values: ", 
       str(df_bike_share_trips[df_bike_share_trips['End Station Id'].isna()]['End Station Name'].isna().sum()))

# If it's 7947, then the End Station Name is also blank for all rows.

Number of rows where End Station Id and End Station Name, both have null values:  7947


In [35]:
# calculate the data missing as a ratio to total rows
df_bike_share_trips[df_bike_share_trips['End Station Id'].isna()]['End Station Name'].isna().sum() / df_bike_share_trips.shape[0] 

0.000412630705332142

In [36]:
# Missing data as a percentage of total rows
print("Percent of total rows : "+
      str(round(df_bike_share_trips[df_bike_share_trips['End Station Id'].isna()]['End Station Name'].isna().sum() / df_bike_share_trips.shape[0] *100,3))+
      "%"     
     )

Percent of total rows : 0.041%


##### Findings:
- When the End Station Id is missing, so is the End Station Name. There is no way to fill in this data without knowing what happened in the system (i.e. an outage at a specific station, a glitch, etc.).
- As End Station Id and Name are both categorical (i.e. non-numerical), we cannot use numerical approaches to fill in the data (averages, medians, forward / back fills, etc.).
- The rows with missing values represent only 0.041% of the data. They will be dropped.
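
The first finding can also be checked with boolean masks, by counting rows where the Id is missing but a name survives. This is a sketch on toy data; `toy`, `recoverable` and `share_missing` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for the trips data.
toy = pd.DataFrame({
    "End Station Id": ["7000", None, "7002", None],
    "End Station Name": ["A St", None, None, None],
})

id_missing = toy["End Station Id"].isna()
name_missing = toy["End Station Name"].isna()

# Rows with no Id but a surviving name could still be identified from the name.
recoverable = int((id_missing & ~name_missing).sum())

# The mean of a boolean mask is the share of True values.
share_missing = id_missing.mean() * 100

print(recoverable, share_missing)
```

If `recoverable` is zero, every row that lacks an Id also lacks a name.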

In [37]:
# Expected rows after drop
df_bike_share_trips.shape[0] - df_bike_share_trips['End Station Id'].isna().sum()

19251405

In [38]:
# Drop Rows where End Station Id is missing
df_bike_share_trips.dropna(subset=['End Station Id'], inplace=True)

In [39]:
# Check that total rows are as expected
df_bike_share_trips.shape

(19251405, 10)

In [40]:
#Check Missing Values
df_bike_share_trips.isna().sum()

# Check that End Station Id missing values no longer exist

Trip Id                    0
Trip  Duration             0
Start Station Id           0
Start Time                 0
Start Station Name    782629
End Station Id             0
End Time                   0
End Station Name      783979
Bike Id                  275
User Type                  0
dtype: int64

### Missing Data - Start Station and End Station Names

In [41]:
# Magnitude of Missing Data
df_bike_share_trips.isna()[['Start Station Name', 'End Station Name']].sum()

Start Station Name    782629
End Station Name      783979
dtype: int64

Look at *Start Station Name* first, in conjunction with Id

In [42]:
st_id_name = df_bike_share_trips[['Start Station Id', 'Start Station Name']].drop_duplicates()
st_id_name

Unnamed: 0,Start Station Id,Start Station Name
0,7021,Bay St / Albert St
1,7160,King St W / Tecumseth St
2,7055,Jarvis St / Carlton St
3,7012,Elizabeth St / Edward St (Bus Terminal)
4,7041,Edward St / Yonge St
...,...,...
19209850,7116,555 Bloor St East
19212940,7917,
19215025,7916,
19231691,7915,


In [43]:
st_id_missing_name = st_id_name[st_id_name['Start Station Name'].isna()]['Start Station Id']
st_id_missing_name # Get series of stations Ids which are missing names

4850058     7659
4898087     7660
5068606     7662
5266799     7663
5294110     7664
            ... 
19204704    7882
19212940    7917
19215025    7916
19231691    7915
19238958    7918
Name: Start Station Id, Length: 241, dtype: string

In [44]:
# Check how many of them have names in our dataset
st_id_name[st_id_name['Start Station Id'].isin(st_id_missing_name)].dropna().count()

# Number of Ids already within our dataset

Start Station Id      15
Start Station Name    15
dtype: int64

In [45]:
# Visualize the station Ids that do have a name elsewhere in the dataset
st_id_name[st_id_name['Start Station Id'].isin(st_id_missing_name)].dropna()

Unnamed: 0,Start Station Id,Start Station Name
5437285,7667,Spadina Ave / Sussex Ave - SMART
5437382,7662,Beaty Ave / Queen St W
5437648,7659,Amroth Ave / Danforth Ave
5440055,7663,Kilgour Rd / Rumsey Rd
5441067,7665,Sunnybrook Health Centre - S Wing
5441074,7660,285 Victoria St
5449285,7666,Dundas St W / St Helen Ave - SMART
5456428,7664,Sunnybrook Health Centre - L Wing
8170265,7675,1525 Dundas St
8170304,7680,Princes Gate / Nunavut Dr


Only 15 of the 241 station Ids have a name elsewhere in the dataset, so I will need to consider data outside of it. In this case, the Bike Stations dataset can help with names.

In [46]:
# Look to see how many unique Station Names can be filled in with Stations Dataset
df_bike_stations[df_bike_stations['station_id'].isin(st_id_missing_name)]

Unnamed: 0,station_id,station_name,lat,lon,name
541,7659,Amroth Ave / Danforth Ave,43.685613,-79.311683,
542,7660,285 Victoria St,43.656633,-79.379625,
543,7662,Beaty Ave / Queen St W,43.639311,-79.440814,
544,7663,Kilgour Rd / Rumsey Rd,43.718039,-79.371914,
545,7664,Sunnybrook Health Centre - L Wing,43.722680,-79.376440,
...,...,...,...,...,...
773,7914,York St / Wellington St W (2),43.647043,-79.383141,
774,7915,Rogers Rd / Watt Ave,43.682337,-79.468059,
775,7916,University Ave / Pearl St,43.648430,-79.385566,
776,7917,King St W / University Ave,43.647818,-79.384280,


The Stations dataset can fill in names for about 234 of the 241 station Ids that are missing a name.
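
A coverage figure like this can be computed directly instead of being read off a table's row count. Counting the filtered rows only works if each `station_id` is unique and its name is populated. The sketch below uses toy data, and `missing_name_ids`, `named` and `covered` are illustrative names.

```python
import pandas as pd

# Toy stand-ins: Ids that lack a name in the trips data, and a station lookup.
missing_name_ids = pd.Series(["7659", "7660", "7999"], name="Start Station Id")
stations = pd.DataFrame({
    "station_id": ["7659", "7660", "7700"],
    "station_name": ["Amroth Ave / Danforth Ave", "285 Victoria St", None],
})

# Only count stations whose name is actually populated in the lookup.
named = stations.dropna(subset=["station_name"])
covered = missing_name_ids.isin(named["station_id"])

n_covered = int(covered.sum())
uncovered_ids = missing_name_ids[~covered].tolist()

print(f"{n_covered} / {len(missing_name_ids)} Ids can be named")
print("Still unnamed:", uncovered_ids)
```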

*Fill in Empty Start Station Names*

In [47]:
df_bike_share_trips.isna().sum()

Trip Id                    0
Trip  Duration             0
Start Station Id           0
Start Time                 0
Start Station Name    782629
End Station Id             0
End Time                   0
End Station Name      783979
Bike Id                  275
User Type                  0
dtype: int64

In [48]:
bike_stations_series = df_bike_stations.set_index('station_id')['station_name']
bike_stations_series

station_id
7000    Fort York  Blvd / Capreol Ct
7001       Wellesley Station Green P
7002      St. George St / Bloor St W
7003        Madison Ave / Bloor St W
7005             King St W / York St
                    ...             
7330                            <NA>
7369                            <NA>
7372                            <NA>
7382                            <NA>
7392                            <NA>
Name: station_name, Length: 810, dtype: string

In [49]:
# Fill in null values for Start / End Station Names 

df_bike_share_trips['Start Station Name'] = np.where(df_bike_share_trips['Start Station Name'].isna(), 
                                                    df_bike_share_trips['Start Station Id'].map(bike_stations_series),
                                                    df_bike_share_trips['Start Station Name'])


df_bike_share_trips['End Station Name'] = np.where(df_bike_share_trips['End Station Name'].isna(),
                                                   df_bike_share_trips['End Station Id'].map(bike_stations_series),
                                                   df_bike_share_trips['End Station Name'])

In [50]:
df_bike_share_trips.isna().sum()

Trip Id                   0
Trip  Duration            0
Start Station Id          0
Start Time                0
Start Station Name    25058
End Station Id            0
End Time                  0
End Station Name      20765
Bike Id                 275
User Type                 0
dtype: int64

In [51]:
# Express missing Data as a percentage 
print(round(df_bike_share_trips['Start Station Name'].isna().sum() / df_bike_share_trips.shape[0] * 100,4), " %")

0.1302  %


In [52]:
# Express missing Data as a percentage 
print(round(df_bike_share_trips['End Station Name'].isna().sum() / df_bike_share_trips.shape[0] * 100,4), " %")

0.1079  %


##### Findings:
After the fill, only a small percentage of rows is still missing a station name: about 0.13% for Start Station Name and about 0.11% for End Station Name.
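
A side note on the fill step: `np.where` returns a plain NumPy array, which is probably why the two name columns show up as `object` rather than `string` in the final `info()` output. A possible alternative, sketched here on toy data, is `fillna` with the mapped series, which keeps the `string` dtype. The `trips`, `stations` and `lookup` names are illustrative.

```python
import pandas as pd

# Toy trips table with the nullable string dtype used in this notebook.
trips = pd.DataFrame({
    "Start Station Id": ["7000", "7659", "7999"],
    "Start Station Name": ["Fort York Blvd / Capreol Ct", None, None],
}).astype("string")

# Toy station lookup: station_id -> station_name.
stations = pd.DataFrame({
    "station_id": ["7000", "7659"],
    "station_name": ["Fort York Blvd / Capreol Ct", "Amroth Ave / Danforth Ave"],
}).astype("string")
lookup = stations.set_index("station_id")["station_name"]

# Fill only the missing names; existing names are left untouched.
trips["Start Station Name"] = trips["Start Station Name"].fillna(
    trips["Start Station Id"].map(lookup)
)

print(trips["Start Station Name"].dtype)
print(trips)
```

Ids that are absent from the lookup (here `7999`) simply stay missing.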

---

### Merge Lat and Lon Data

The goal of this section is to bring the Lat and Lon values for the start and end stations into our bike share trips data.

In [53]:
# Add in Lat and Lon for Start Station
    # Use Left Join
df_bike_share_trips = pd.merge(df_bike_share_trips, df_bike_stations[['station_id','lat','lon']], 
                               how= 'left', left_on= "Start Station Id", right_on= "station_id")

# Rename Columns
df_bike_share_trips.rename(columns={'lat': 'start_lat', 'lon': 'start_lon'}, inplace=True)

# Drop 'station_id' that is brought in by merge
df_bike_share_trips.drop(columns=['station_id'], inplace=True)

In [54]:
# Add in Lat and Lon for End Station
    # Use Left Join
df_bike_share_trips = pd.merge(df_bike_share_trips, df_bike_stations[['station_id','lat','lon']], 
                               how= 'left', left_on= "End Station Id", right_on= "station_id")

# Rename Columns
df_bike_share_trips.rename(columns={'lat': 'end_lat', 'lon': 'end_lon'}, inplace=True)

# Drop 'station_id' that is brought in by merge
df_bike_share_trips.drop(columns=['station_id'], inplace=True)
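
The two merges above follow the same pattern, so they could be folded into one helper. The sketch below uses toy data, and `add_station_coords` is a hypothetical function name. It renames the station columns before merging, which avoids the separate rename and drop steps. It also passes `validate="many_to_one"`, so pandas raises an error if a `station_id` appears more than once in the lookup and would otherwise duplicate trip rows.

```python
import pandas as pd

def add_station_coords(trips, stations, id_col, prefix):
    """Left-join lat/lon onto trips for one station Id column."""
    coords = stations[["station_id", "lat", "lon"]].rename(
        columns={"station_id": id_col, "lat": f"{prefix}_lat", "lon": f"{prefix}_lon"}
    )
    # many_to_one: many trips per station, at most one lookup row per station.
    return trips.merge(coords, how="left", on=id_col, validate="many_to_one")

# Toy stand-ins for the stations and trips tables.
stations = pd.DataFrame({
    "station_id": ["7000", "7001"],
    "lat": [43.639832, 43.664964],
    "lon": [-79.395954, -79.38355],
})
trips = pd.DataFrame({
    "Start Station Id": ["7000", "7001", "7999"],
    "End Station Id": ["7001", "7999", "7000"],
})

merged = add_station_coords(trips, stations, "Start Station Id", "start")
merged = add_station_coords(merged, stations, "End Station Id", "end")
print(merged)
```

Trips whose station Id is not in the lookup keep their row and get missing coordinates, as with the left joins above.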

In [55]:
df_bike_share_trips.isna().sum()

Trip Id                    0
Trip  Duration             0
Start Station Id           0
Start Time                 0
Start Station Name     25058
End Station Id             0
End Time                   0
End Station Name       20765
Bike Id                  275
User Type                  0
start_lat             579356
start_lon             579356
end_lat               579540
end_lon               579540
dtype: int64

In [56]:
# Missing Values - Start_lat
print(round(df_bike_share_trips['start_lat'].isna().sum() / df_bike_share_trips.shape[0] * 100,4), " %")

3.0094  %


In [57]:
# Missing Values - End_lat
print(round(df_bike_share_trips['end_lat'].isna().sum() / df_bike_share_trips.shape[0] * 100,4), " %")

3.0104  %


##### Findings:
- If this were a professional setting, I would find ways to close this gap in the data. Potential solutions include looking at other sources of data (i.e. internally at the organization, or externally by searching for intersections based on station names, etc.).
- However, the goal of this project is to showcase my skills rather than to search for missing data.
- I will drop rows where station data is unavailable (so that visualizations, etc. can be performed).
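
If the gap were to be closed later, a possible first step would be to list which station Ids lack coordinates, ranked by how many trips they affect. The sketch below uses toy data; `trips` and `unmatched` are illustrative names.

```python
import pandas as pd

# Toy trips table after the coordinate merge; None marks an unmatched station.
trips = pd.DataFrame({
    "Start Station Id": ["7000", "7999", "7999", "7888"],
    "start_lat": [43.639832, None, None, None],
})

# Station Ids with no coordinates, ordered by number of affected trips.
unmatched = (
    trips.loc[trips["start_lat"].isna(), "Start Station Id"]
    .value_counts()
)

print(unmatched)
```

The Ids at the top of this list would recover the most rows per station looked up.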

In [58]:
# Drop Rows where the Station data is unavailable at this time
df_bike_share_trips.dropna(subset= ['start_lat','end_lat'], inplace=True)

In [59]:
# Display Missing values
df_bike_share_trips.isna().sum()

Trip Id                 0
Trip  Duration          0
Start Station Id        0
Start Time              0
Start Station Name      0
End Station Id          0
End Time                0
End Station Name        0
Bike Id               255
User Type               0
start_lat               0
start_lon               0
end_lat                 0
end_lon                 0
dtype: int64

In [60]:
df_bike_share_trips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18128950 entries, 0 to 19251404
Data columns (total 14 columns):
 #   Column              Dtype         
---  ------              -----         
 0   Trip Id             string        
 1   Trip  Duration      Int64         
 2   Start Station Id    string        
 3   Start Time          datetime64[ns]
 4   Start Station Name  object        
 5   End Station Id      string        
 6   End Time            datetime64[ns]
 7   End Station Name    object        
 8   Bike Id             string        
 9   User Type           string        
 10  start_lat           float64       
 11  start_lon           float64       
 12  end_lat             float64       
 13  end_lon             float64       
dtypes: Int64(1), datetime64[ns](2), float64(4), object(2), string(5)
memory usage: 2.0+ GB


In [61]:
# Write to CSV 
df_bike_share_trips.to_csv("data/2019_to_2023_bike_ridership_clean.csv",
                           index = False) # don't include index as a separate column

---