<h2 align='center'> Cyclistic: How Does a Bike-Share Navigate Speedy Success? </h2>

<h4 align='center'> (Part 1: Data Cleaning & Transformation) </h4>

### 1. Import libraries and datasets

In [1]:
import pandas as pd

In [2]:
bt_202004 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202004-divvy-tripdata.csv')
bt_202005 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202005-divvy-tripdata.csv')
bt_202006 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202006-divvy-tripdata.csv')
bt_202007 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202007-divvy-tripdata.csv')
bt_202008 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202008-divvy-tripdata.csv')
bt_202009 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202009-divvy-tripdata.csv')
bt_202010 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202010-divvy-tripdata.csv')
bt_202011 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202011-divvy-tripdata.csv')
bt_202012 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202012-divvy-tripdata.csv')
bt_202101 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202101-divvy-tripdata.csv')
bt_202102 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202102-divvy-tripdata.csv')
bt_202103 = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/202103-divvy-tripdata.csv')
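
The twelve `read_csv` calls above differ only in the month tag, so the same load could also be written as a loop. This is a minimal sketch, assuming the same directory layout; `month_tags`, `file_names`, and `load_trips` are hypothetical names that the rest of this notebook does not use.

```python
from pathlib import Path

import pandas as pd

# month tags from April 2020 through March 2021, e.g. '202004'
month_tags = [f'2020{m:02d}' for m in range(4, 13)] + [f'2021{m:02d}' for m in range(1, 4)]
file_names = [f'{tag}-divvy-tripdata.csv' for tag in month_tags]


def load_trips(data_dir):
    """Read every monthly file in data_dir into a dict keyed by month tag."""
    data_dir = Path(data_dir).expanduser()
    return {tag: pd.read_csv(data_dir / name) for tag, name in zip(month_tags, file_names)}
```

Calling `load_trips('~/PROJECTS/02_cyclistic_bike_share/original_datasets')` would return a dict in which `trips['202004']` plays the role of `bt_202004`.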

### 2. Examine datasets

#### A brief look at all 12 datasets shows that they share the same 13-column format:

| Column | This column measures |
| :---: | :---: |
| *ride_id* | unique id of each bike ride initiated |
| *rideable_type* | 3 types of bikes used by Cyclistic |
| *started_at* | start date and time of each bike ride |
| *ended_at* | end date and time of each bike ride |
| *start_station_name* | bike station name where each bike ride starts |
| *start_station_id* | bike station id where each bike ride starts |
| *end_station_name* | bike station name where each bike ride ends |
| *end_station_id* | bike station id where each bike ride ends |
| *start_lat* | latitude of bike station where each bike ride starts |
| *start_lng* | longitude of bike station where each bike ride starts |
| *end_lat* | latitude of bike station where each bike ride ends |
| *end_lng* | longitude of bike station where each bike ride ends |
| *member_casual* | whether the rider is a member of the Cyclistic program or a casual rider |

#### For each dataset, I will perform these data cleaning and transformation steps:
- check the uniqueness of *ride_id*
- drop rows that contain NULL values in any column
- check the data types of each column, and convert to the correct type if necessary
- check the consistency between *start_station_name* and *start_station_id*
- check the consistency between *end_station_name* and *end_station_id*
- drop irrelevant columns: *start_lat*, *start_lng*, *end_lat*, *end_lng*
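
Since the same steps repeat for every month, they could be gathered into one helper. The following is a sketch only: `clean_month` is a hypothetical name, the toy frame stands in for a real monthly file, and it assumes the station ids are numeric, which holds for the earlier months but not from December 2020 onward.

```python
import pandas as pd

COORD_COLS = ['start_lat', 'start_lng', 'end_lat', 'end_lng']
ID_COLS = ['start_station_id', 'end_station_id']


def clean_month(df):
    """Return a cleaned copy: no coordinate columns, no rows with NaN, integer ids."""
    out = df.drop(columns=COORD_COLS).dropna()
    return out.astype({col: 'int64' for col in ID_COLS})


# toy data standing in for one monthly file (all values are made up)
toy = pd.DataFrame({
    'ride_id': ['A1', 'B2', 'C3'],
    'start_station_name': ['Station X', 'Station Y', 'Station X'],
    'start_station_id': [10, 20, 10],
    'end_station_name': ['Station Y', None, 'Station Y'],
    'end_station_id': [20.0, None, 20.0],
    'start_lat': [41.9, 41.8, 41.9],
    'start_lng': [-87.6, -87.7, -87.6],
    'end_lat': [41.8, None, 41.8],
    'end_lng': [-87.7, None, -87.7],
})

cleaned = clean_month(toy)
print(cleaned.dtypes)
```

Returning a copy instead of using `inplace = True` keeps the raw frame available for comparison.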

***Note:*** 

For all 12 datasets, the *started_at* and *ended_at* columns are stored as strings, although they should be datetimes. However, I will leave them as-is for now, since the goal of this phase of data cleaning is mainly to remove invalid data and prepare the data for further analysis.
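
For reference, the later conversion could look like the sketch below. It assumes timestamps formatted like `2020-04-26 17:45:14`; the sample values and the `ride_length` column are illustrative only and are not part of this notebook.

```python
import pandas as pd

# made-up sample rows in the assumed 'YYYY-MM-DD HH:MM:SS' format
sample = pd.DataFrame({
    'started_at': ['2020-04-26 17:45:14', '2020-04-17 17:08:54'],
    'ended_at': ['2020-04-26 18:12:03', '2020-04-17 17:17:03'],
})

# parse both string columns into datetime64 columns
for col in ['started_at', 'ended_at']:
    sample[col] = pd.to_datetime(sample[col], format='%Y-%m-%d %H:%M:%S')

# with real datetimes, ride duration is a simple subtraction
sample['ride_length'] = sample['ended_at'] - sample['started_at']
print(sample.dtypes)
```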

#### (2-1) dataset: bike trips of April 2020

In [3]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202004.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [4]:
bt_202004.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id        int64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [5]:
# check uniqueness of ride_id
bt_202004['ride_id'].value_counts()

180BFC2CEBA46994    1
8509DEDDDAA9FFBB    1
5B87B45E922A3295    1
689A3BF47F4D1F3D    1
9C2E445CDC3F0793    1
                   ..
DB18E2F0B9FDD7BC    1
9F8A694B13CA745B    1
795891B9DB26AF89    1
A4F063F6183C3B38    1
A47C0641866DFD75    1
Name: ride_id, Length: 84776, dtype: int64
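
Because `value_counts()` sorts in descending order, a top count of 1 implies that every *ride_id* is unique. A more direct check, sketched here on a made-up frame (the `rides` and `dupes` variables are hypothetical), uses `Series.is_unique` and `Series.duplicated()`:

```python
import pandas as pd

rides = pd.DataFrame({'ride_id': ['A1', 'B2', 'C3', 'B2']})

# True only if no ride_id occurs more than once
print(rides['ride_id'].is_unique)

# keep=False flags every row that takes part in a duplicate
dupes = rides[rides['ride_id'].duplicated(keep=False)]
print(dupes)
```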

In [6]:
# check NAN values
bt_202004.isna().sum()

ride_id                0
rideable_type          0
started_at             0
ended_at               0
start_station_name     0
start_station_id       0
end_station_name      99
end_station_id        99
member_casual          0
dtype: int64

In [7]:
# drop all NAN values
bt_202004.dropna(inplace = True)

In [8]:
# convert 'end_station_id' column from floating points to integers
bt_202004['end_station_id'] = bt_202004['end_station_id'].astype('int64')
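
The *end_station_id* column loads as float64 because NaN is a floating-point value, so any missing entry forces an otherwise integer column to float. The cast to int64 above works only because the NaN rows were dropped first. As a side note, pandas also offers a nullable `Int64` dtype that can hold missing values; the sketch below uses a made-up Series to show the difference.

```python
import pandas as pd

ids = pd.Series([263.0, None, 128.0])   # a missing value forces float64
print(ids.dtype)

# the nullable extension dtype keeps integers and marks the gap as <NA>
nullable_ids = ids.astype('Int64')
print(nullable_ids)

# plain int64 needs the missing rows removed first, as done above
plain_ids = ids.dropna().astype('int64')
print(plain_ids.tolist())
```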

#### (2-2) dataset: bike trips of May 2020

In [9]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202005.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [10]:
bt_202005.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id        int64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [11]:
# check uniqueness of ride_id
bt_202005['ride_id'].value_counts()

9DA1D40B49BE87ED    1
93AC49BEBB95603C    1
029F9FB40454F748    1
C2960CFC94029246    1
B2263918A268ED11    1
                   ..
8C493B18A238FB13    1
2AEEC9C39CC4B885    1
4B9915803027704D    1
5CB85E22CB30FDBC    1
3371E3D02F2E55DC    1
Name: ride_id, Length: 200274, dtype: int64

In [12]:
# check NAN values
bt_202005.isna().sum()

ride_id                 0
rideable_type           0
started_at              0
ended_at                0
start_station_name      0
start_station_id        0
end_station_name      321
end_station_id        321
member_casual           0
dtype: int64

In [13]:
# drop all NAN values
bt_202005.dropna(inplace = True)

In [14]:
# convert 'end_station_id' from floating points to integers
bt_202005['end_station_id'] = bt_202005['end_station_id'].astype('int64')

#### (2-3) dataset: bike trips of June 2020

In [15]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202006.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [16]:
bt_202006.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id        int64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [17]:
# check uniqueness of ride_id
bt_202006['ride_id'].value_counts()

1EE2632EBD249B4B    1
39DBCAD831F4816B    1
95FBADB59D875758    1
FC76A830FE106B0C    1
F3E5D7709A2DFC61    1
                   ..
47A4903A5D7A4F8A    1
23DE09463AC304A9    1
42E81992E3CDA0E3    1
63CA826F8129C963    1
696615A559CD365A    1
Name: ride_id, Length: 343005, dtype: int64

In [18]:
# check NAN values
bt_202006.isna().sum()

ride_id                 0
rideable_type           0
started_at              0
ended_at                0
start_station_name      0
start_station_id        0
end_station_name      468
end_station_id        468
member_casual           0
dtype: int64

In [19]:
# drop all NAN values
bt_202006.dropna(inplace = True)

In [20]:
# convert 'end_station_id' from floating points to integers
bt_202006['end_station_id'] = bt_202006['end_station_id'].astype('int64')

#### (2-4) dataset: bike trips of July 2020

In [21]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202007.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [22]:
bt_202007.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [23]:
# check uniqueness of ride_id
bt_202007['ride_id'].value_counts()

37F18C5D0C6ADFC9    1
5AE0ED2811D1A181    1
D94C8F00A434E68B    1
552CA14FA6DD95A9    1
48801333DE8F78FA    1
                   ..
51D0355A0ED747C3    1
6765B9813FCF4FE4    1
0FBC7E5C58B55444    1
FFB0F68524DF45E5    1
DBB9BDE0A2070964    1
Name: ride_id, Length: 551480, dtype: int64

In [24]:
# check NAN values
bt_202007.isna().sum()

ride_id                 0
rideable_type           0
started_at              0
ended_at                0
start_station_name    149
start_station_id      152
end_station_name      967
end_station_id        969
member_casual           0
dtype: int64
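
In this month the NaN counts for names and ids no longer match (149 vs 152 for start stations, 967 vs 969 for end stations), so at least a few rows carry a station name but no id. `dropna()` removes them either way, but they could be inspected first. A sketch on made-up rows; `name_without_id` is a hypothetical variable:

```python
import pandas as pd

toy = pd.DataFrame({
    'start_station_name': ['Station X', None, 'Station Y', 'Station Z'],
    'start_station_id': [10.0, None, None, 30.0],
})

# rows where the name is present but the id is missing
name_without_id = toy[toy['start_station_name'].notna() & toy['start_station_id'].isna()]
print(name_without_id)
```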

In [25]:
# drop all NAN values
bt_202007.dropna(inplace = True)

In [26]:
# convert 'start_station_id' column from floating points to integers
bt_202007['start_station_id'] = bt_202007['start_station_id'].astype('int64')

# convert 'end_station_id' column from floating points to integers
bt_202007['end_station_id'] = bt_202007['end_station_id'].astype('int64')

#### (2-5) dataset: bike trips of August 2020

In [27]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202008.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [28]:
bt_202008.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [29]:
# check uniqueness of ride_id
bt_202008['ride_id'].value_counts()

BC10BD9EA2FF6100    1
651B5008DF498A35    1
A96E3EB2AA84BA4C    1
376D892277265523    1
DF157958D55CFF61    1
                   ..
7AFCAE0B95D8FC7E    1
074817E17F039D9E    1
EEA7454B2C642A2B    1
B62630C9347EA24D    1
4C331F38463DB43F    1
Name: ride_id, Length: 622361, dtype: int64

In [30]:
# check NAN values
bt_202008.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name     7595
start_station_id       7691
end_station_name      10035
end_station_id        10110
member_casual             0
dtype: int64

In [31]:
# drop all NAN values
bt_202008.dropna(inplace = True)

In [32]:
# convert 'start_station_id' column from floating points to integers
bt_202008['start_station_id'] = bt_202008['start_station_id'].astype('int64')

# convert 'end_station_id' column from floating points to integers
bt_202008['end_station_id'] = bt_202008['end_station_id'].astype('int64')

#### (2-6) dataset: bike trips of September 2020

In [33]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202009.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [34]:
bt_202009.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [35]:
# check uniqueness of ride_id
bt_202009['ride_id'].value_counts()

C38CFDF0A9CB024D    1
5DB7DB39B859D7D5    1
AAED78CB97E9FC86    1
A66282B15D0FBC66    1
58E30991B514B877    1
                   ..
C741DDFDC4E96E60    1
5C64D660E7D3E5FC    1
8A7A71760A24853E    1
285CBF451A7CA7DA    1
6299733F5451D8A2    1
Name: ride_id, Length: 532958, dtype: int64

In [36]:
# check NAN values
bt_202009.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    19691
start_station_id      19901
end_station_name      23373
end_station_id        23524
member_casual             0
dtype: int64

In [37]:
# drop all NAN values
bt_202009.dropna(inplace = True)

In [38]:
# convert 'start_station_id' column from floating points to integers
bt_202009['start_station_id'] = bt_202009['start_station_id'].astype('int64')

# convert 'end_station_id' column from floating points to integers
bt_202009['end_station_id'] = bt_202009['end_station_id'].astype('int64')

#### (2-7) dataset: bike trips of October 2020

In [39]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202010.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [40]:
bt_202010.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [41]:
# check uniqueness of ride_id
bt_202010['ride_id'].value_counts()

529C47F0822DC5C0    1
3501346A5E53B621    1
643794E6635E95EF    1
100FD8233A8F9952    1
CE323640AE8819E6    1
                   ..
CEBFCB046378858F    1
4F83930DBBF450FD    1
A32229F2E45808FC    1
571D240C7E9AB8AF    1
FE4F6793524CF371    1
Name: ride_id, Length: 388653, dtype: int64

In [42]:
# check NAN values
bt_202010.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    31198
start_station_id      31405
end_station_name      35631
end_station_id        35787
member_casual             0
dtype: int64

In [43]:
# drop all NAN values
bt_202010.dropna(inplace = True)

In [44]:
# convert 'start_station_id' column from floating points to integers
bt_202010['start_station_id'] = bt_202010['start_station_id'].astype('int64')

# convert 'end_station_id' column from floating points to integers
bt_202010['end_station_id'] = bt_202010['end_station_id'].astype('int64')

#### (2-8) dataset: bike trips of November 2020

In [45]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202011.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [46]:
bt_202011.dtypes

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
start_station_id      float64
end_station_name       object
end_station_id        float64
member_casual          object
dtype: object

In [47]:
# check uniqueness of ride_id
bt_202011['ride_id'].value_counts()

670D924F0A941808    1
A6111D8F8B5EC300    1
37ABDE4F6F8B13C3    1
8A31E13EC06FC3EF    1
1C81244028D1B74A    1
                   ..
D20ADADA52FAA92C    1
7488FE9C5D8BEF7B    1
6826EBF87881E5C1    1
8BC746D2D951308F    1
F8EA85941899FB8E    1
Name: ride_id, Length: 259716, dtype: int64

In [48]:
# check NAN values
bt_202011.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    24324
start_station_id      24434
end_station_name      26749
end_station_id        26826
member_casual             0
dtype: int64

In [49]:
# drop all NAN values
bt_202011.dropna(inplace = True)

In [50]:
# convert 'start_station_id' column from floating points to integers
bt_202011['start_station_id'] = bt_202011['start_station_id'].astype('int64')

# convert 'end_station_id' column from floating points to integers
bt_202011['end_station_id'] = bt_202011['end_station_id'].astype('int64')

#### (2-9) dataset: bike trips of December 2020

In [51]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202012.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [52]:
bt_202012.dtypes

ride_id               object
rideable_type         object
started_at            object
ended_at              object
start_station_name    object
start_station_id      object
end_station_name      object
end_station_id        object
member_casual         object
dtype: object

In [53]:
# check uniqueness of ride_id
bt_202012['ride_id'].value_counts()

B990CBAEAD194EA4    1
7F1AD2AE92A14C0A    1
380DC29DF49D4F85    1
7BA60FA22FEF1AED    1
7574D453E5146A83    1
                   ..
815F6FCB970E21C6    1
DD3F5639D5E79367    1
672A2535006E067B    1
AC3FF8C84C46ACB4    1
EC10274146377638    1
Name: ride_id, Length: 131573, dtype: int64

In [54]:
# check NAN values
bt_202012.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    11699
start_station_id      11699
end_station_name      13237
end_station_id        13237
member_casual             0
dtype: int64

In [55]:
# drop all NAN values
bt_202012.dropna(inplace = True)

***Note:***

After examining *start_station_id* and *end_station_id* in this dataset, I found an inconsistency between station ids and their corresponding names:
- for example, the station "Rhodes Ave & 32nd St" has id 263 in all eight previous datasets; however, in this dataset it is associated with id 13215.

Therefore, I will re-validate the ids by cross-referencing all previous datasets, and then cast these two columns to an integer datatype.
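
One way to surface such mismatches systematically is to stack the name/id pairs from several months and count the distinct ids per name. This is a sketch with made-up frames; `pairs` and `conflicts` are hypothetical names, and only the "Rhodes Ave & 32nd St" values come from the example above.

```python
import pandas as pd

earlier = pd.DataFrame({
    'start_station_name': ['Rhodes Ave & 32nd St', 'Station X'],
    'start_station_id': [263, 10],
})
december = pd.DataFrame({
    'start_station_name': ['Rhodes Ave & 32nd St', 'Station X'],
    'start_station_id': [13215, 10],
})

# stack the name/id pairs and keep each distinct pair once
pairs = pd.concat([earlier, december]).drop_duplicates()

# a name mapped to more than one id is inconsistent across months
ids_per_name = pairs.groupby('start_station_name')['start_station_id'].nunique()
conflicts = ids_per_name[ids_per_name > 1]
print(conflicts)
```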

In [56]:
# get all pairs of station names and ids from previous 8 datasets

station_id = dict(zip(bt_202004['start_station_name'], bt_202004['start_station_id']))
station_id.update(zip(bt_202004['end_station_name'], bt_202004['end_station_id']))

station_id.update(zip(bt_202005['start_station_name'], bt_202005['start_station_id']))
station_id.update(zip(bt_202005['end_station_name'], bt_202005['end_station_id']))

station_id.update(zip(bt_202006['start_station_name'], bt_202006['start_station_id']))
station_id.update(zip(bt_202006['end_station_name'], bt_202006['end_station_id']))

station_id.update(zip(bt_202007['start_station_name'], bt_202007['start_station_id']))
station_id.update(zip(bt_202007['end_station_name'], bt_202007['end_station_id']))

station_id.update(zip(bt_202008['start_station_name'], bt_202008['start_station_id']))
station_id.update(zip(bt_202008['end_station_name'], bt_202008['end_station_id']))

station_id.update(zip(bt_202009['start_station_name'], bt_202009['start_station_id']))
station_id.update(zip(bt_202009['end_station_name'], bt_202009['end_station_id']))

station_id.update(zip(bt_202010['start_station_name'], bt_202010['start_station_id']))
station_id.update(zip(bt_202010['end_station_name'], bt_202010['end_station_id']))

station_id.update(zip(bt_202011['start_station_name'], bt_202011['start_station_id']))
station_id.update(zip(bt_202011['end_station_name'], bt_202011['end_station_id']))
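
Because `dict.update` overwrites existing keys, a name that appears in several months keeps the id from the last update. The same mapping could be built with less repetition by looping over the frames, which also reduces the risk of copy-paste slips between months. A sketch with toy data; `build_station_id` and the two small frames are hypothetical stand-ins for the eight cleaned monthly dataframes.

```python
import pandas as pd


def build_station_id(frames):
    """Map each station name to its id; later frames overwrite earlier ones."""
    mapping = {}
    for df in frames:
        for side in ('start', 'end'):
            mapping.update(zip(df[f'{side}_station_name'], df[f'{side}_station_id']))
    return mapping


# made-up stand-ins for two cleaned monthly frames
april = pd.DataFrame({
    'start_station_name': ['Station X'], 'start_station_id': [10],
    'end_station_name': ['Station Y'], 'end_station_id': [20],
})
may = pd.DataFrame({
    'start_station_name': ['Station Y'], 'start_station_id': [20],
    'end_station_name': ['Station Z'], 'end_station_id': [30],
})

toy_station_id = build_station_id([april, may])
print(toy_station_id)
```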

#### I will create a function 'correct_station_name_id' that collects all start and end station names from the input dataframe and checks whether each name appears in the *station_id* dictionary:
- if it does, update the corresponding id in the input dataframe to the correct one;
- if it does not, record the station's name in the list 'not_changed' for further cleaning.

In [57]:
not_changed = []

def correct_station_name_id(dataframe):
    start_station_names = set(dataframe['start_station_name'])
    end_station_names = set(dataframe['end_station_name'])
    
    # correct start_station_ids in the dataset
    for name in start_station_names:
        if name not in station_id.keys():
            not_changed.append(name) # record station names that have not appeared in the previous 8 datasets
        else:
            dataframe.loc[dataframe['start_station_name'] == name, 'start_station_id'] = station_id[name]
    
    # correct end_station_ids in the dataset
    for name in end_station_names:
        if name in not_changed:
            continue
        elif name not in station_id.keys():
            not_changed.append(name) # record station names that have not appeared in the previous 8 datasets
        else:
            dataframe.loc[dataframe['end_station_name'] == name, 'end_station_id'] = station_id[name]
    
    print('These stations below are not existed in the previous 8 datasets:')
    return not_changed
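
The function above issues one `.loc` assignment per station name. An alternative sketch builds a boolean mask with `Series.isin` and updates all known names in a single assignment. `remap_ids`, `known`, and the toy frame are hypothetical and are not used elsewhere in this notebook; the id 99999 is made up.

```python
import pandas as pd


def remap_ids(df, mapping, side):
    """Overwrite ids for names found in mapping; return the names it does not cover."""
    name_col, id_col = f'{side}_station_name', f'{side}_station_id'
    found = df[name_col].isin(list(mapping))
    df.loc[found, id_col] = [mapping[name] for name in df.loc[found, name_col]]
    return sorted(set(df.loc[~found, name_col]))


known = {'Rhodes Ave & 32nd St': 263}
toy = pd.DataFrame({
    'start_station_name': ['Rhodes Ave & 32nd St', 'N Green St & W Lake St'],
    'start_station_id': [13215, 99999],
})

missing = remap_ids(toy, known, 'start')
print(missing)
print(toy)
```

Returning the uncovered names instead of appending to a global list means nothing has to be reset between months.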

In [58]:
# update station ids in bt_202012 and catch stations that do not exist in the previous 8 datasets
correct_station_name_id(bt_202012)

These stations below are not existed in the previous 8 datasets:


['W Oakdale Ave & N Broadway',
 'N Green St & W Lake St',
 'W Armitage Ave & N Sheffield Ave',
 'Base - 2132 W Hubbard Warehouse',
 'N Carpenter St & W Lake St']

#### As shown above, 5 station names were caught as exceptions in *not_changed*.

#### Therefore, I looked up the official dataset of all Divvy bicycle stations <a href="https://data.cityofchicago.org/d/bbyy-e7gq?category=Transportation&view_name=Divvy-Bicycle-Stations">here</a> on <a href="http://data.cityofchicago.org">the Chicago Data Portal website</a>.
- I will download this dataset and import it below as the official reference for the correct station names and/or corresponding ids.

In [59]:
# import the official bike stations information dataset
official_station_info = pd.read_csv('~/PROJECTS/02_cyclistic_bike_share/original_datasets/official_bike_stations_info.csv')

# get the station name and id in pairs from the official station information dataset
official_station_name_id = dict(zip(official_station_info['Station Name'], official_station_info['ID']))

#### I will create a function 'get_correct_id' that looks up, in *official_station_name_id*, the correct id of each station that does not exist in the previous 8 datasets:
- if the station is found, print out the station name and its corresponding id;
- if it is not found, collect the station name and report it at the end.

In [60]:
def get_correct_id():
    cannot_find_this_station = []
    for name in not_changed:
        if name not in official_station_name_id.keys():
            if name not in cannot_find_this_station:
                cannot_find_this_station.append(name)
        else:
            correct_id = official_station_name_id[name]
            print(name, ': ', correct_id)
            print()
    
    print('These stations below cannot be found in the official stations information dataset:')
    return cannot_find_this_station
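
Since `get_correct_id` only prints the ids it finds, they have to be copied by hand afterwards. A variant that returns them as a dict would let later cells reuse them directly. A sketch under that assumption; `split_by_official` and the toy inputs are hypothetical, and the ids are made-up placeholders.

```python
def split_by_official(names, official):
    """Return (found, missing): ids for names in `official`, and names without one."""
    found = {name: official[name] for name in names if name in official}
    missing = [name for name in names if name not in official]
    return found, missing


toy_official = {'Station X': 101, 'Station Y': 202}
toy_not_changed = ['Station X', 'Station Q', 'Station Y']

found, missing = split_by_official(toy_not_changed, toy_official)
print(found)
print(missing)
```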

In [61]:
# get the correct ids of stations in not_changed
get_correct_id()

W Oakdale Ave & N Broadway :  1436495100903691938

N Green St & W Lake St :  1436495109493626546

W Armitage Ave & N Sheffield Ave :  1436495105198659242

N Carpenter St & W Lake St :  1436495105198659246

These stations below cannot be found in the official stations information dataset:


['Base - 2132 W Hubbard Warehouse']

***Note:*** 

The station "Base - 2132 W Hubbard Warehouse" cannot be found in the official station information dataset. Due to lack of information, I will drop these records.

In [62]:
# drop records with start_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202012.drop(bt_202012[bt_202012['start_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace = True)

# drop records with end_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202012.drop(bt_202012[bt_202012['end_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace = True)
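
The two `drop` calls can also be expressed as a single boolean filter that keeps only the rows where neither station is the warehouse. A sketch on a made-up frame; `unknown_stations` and `toy` are hypothetical names.

```python
import pandas as pd

unknown_stations = ['Base - 2132 W Hubbard Warehouse']

toy = pd.DataFrame({
    'start_station_name': ['Station X', 'Base - 2132 W Hubbard Warehouse', 'Station Y'],
    'end_station_name': ['Base - 2132 W Hubbard Warehouse', 'Station X', 'Station X'],
})

# True where either end of the trip touches an unknown station
touches_unknown = (toy['start_station_name'].isin(unknown_stations)
                   | toy['end_station_name'].isin(unknown_stations))
toy = toy[~touches_unknown]
print(toy)
```

Because `isin` takes a list, further station names can be added without extra `drop` calls.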

***Note:***

For the other 4 stations, I will manually update the ids in bt_202012.

In [63]:
# update station's id with name of 'N Green St & W Lake St'
filt_a1 = (bt_202012['start_station_name'] == 'N Green St & W Lake St')
bt_202012.loc[filt_a1, 'start_station_id'] = 1436495109493626546

filt_a2 = (bt_202012['end_station_name'] == 'N Green St & W Lake St')
bt_202012.loc[filt_a2, 'end_station_id'] = 1436495109493626546

In [64]:
# update station's id with name of 'W Oakdale Ave & N Broadway'
filt_b1 = (bt_202012['start_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202012.loc[filt_b1, 'start_station_id'] = 1436495100903691938

filt_b2 = (bt_202012['end_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202012.loc[filt_b2, 'end_station_id'] = 1436495100903691938

In [65]:
# update station's id with name of 'W Armitage Ave & N Sheffield Ave'
filt_c1 = (bt_202012['start_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202012.loc[filt_c1, 'start_station_id'] = 1436495105198659242

filt_c2 = (bt_202012['end_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202012.loc[filt_c2, 'end_station_id'] = 1436495105198659242

In [66]:
# update station's id with name of 'N Carpenter St & W Lake St'
filt_d1 = (bt_202012['start_station_name'] == 'N Carpenter St & W Lake St')
bt_202012.loc[filt_d1, 'start_station_id'] = 1436495105198659246

filt_d2 = (bt_202012['end_station_name'] == 'N Carpenter St & W Lake St')
bt_202012.loc[filt_d2, 'end_station_id'] = 1436495105198659246
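
The four manual updates follow one pattern, so they could be driven by a small dict. The ids below are the ones printed by `get_correct_id()` above; `manual_ids` and the toy frame are hypothetical. Note that these ids are larger than 2**53, so they would not survive a round trip through float64, which is one reason to keep the id columns as integers or objects rather than floats.

```python
import pandas as pd

manual_ids = {
    'N Green St & W Lake St': 1436495109493626546,
    'W Oakdale Ave & N Broadway': 1436495100903691938,
    'W Armitage Ave & N Sheffield Ave': 1436495105198659242,
    'N Carpenter St & W Lake St': 1436495105198659246,
}

# toy frame with object-typed id columns; the starting ids are made up
toy = pd.DataFrame({
    'start_station_name': ['N Green St & W Lake St', 'Station X'],
    'start_station_id': pd.Series(['old-1', 'old-2'], dtype='object'),
    'end_station_name': ['Station X', 'W Oakdale Ave & N Broadway'],
    'end_station_id': pd.Series(['old-2', 'old-3'], dtype='object'),
})

# apply each manual id to both the start and the end columns
for name, new_id in manual_ids.items():
    for side in ('start', 'end'):
        mask = toy[f'{side}_station_name'] == name
        toy.loc[mask, f'{side}_station_id'] = new_id

print(toy)

# float64 cannot represent these ids exactly
print(int(float(1436495100903691938)) == 1436495100903691938)
```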

#### (2-10) dataset: bike trips of January 2021

In [67]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202101.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [68]:
bt_202101.dtypes

ride_id               object
rideable_type         object
started_at            object
ended_at              object
start_station_name    object
start_station_id      object
end_station_name      object
end_station_id        object
member_casual         object
dtype: object

In [69]:
# check uniqueness of ride_id
bt_202101['ride_id'].value_counts()

5352EE4B7007BA45    1
5506B158F04CA5BF    1
995F861B1B2524C9    1
E5CFFAE3A6F6CBEE    1
FE285BBC342840BA    1
                   ..
14557A062DA65751    1
625E6804CD6C756E    1
F03EE2FB00F571DF    1
3EB57EF1C89A26C8    1
8A3C1330E06CE106    1
Name: ride_id, Length: 96834, dtype: int64

In [70]:
# check NAN values
bt_202101.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name     8625
start_station_id       8625
end_station_name      10277
end_station_id        10277
member_casual             0
dtype: int64

In [71]:
# drop all NAN values
bt_202101.dropna(inplace = True)

#### After examining *start_station_id* and *end_station_id* in this dataset, I found the same inconsistency between station ids and their corresponding names as before.

#### Therefore, I will re-run the whole process of validating station names against their ids using:
- the official stations information dataset from the Chicago Data Portal website;
- the dictionary "station_id" of all station names and ids that exist in the previous 8 datasets;
- the function "correct_station_name_id" to either update ids or catch exceptions;
- the function "get_correct_id" to grab the correct ids for exceptions.

In [72]:
# reset the not_changed list to empty
not_changed = []

# update station ids in bt_202101 or catch stations that do not exist in the previous 8 datasets
correct_station_name_id(bt_202101)

These stations below are not existed in the previous 8 datasets:


['W Oakdale Ave & N Broadway',
 'N Paulina St & Lincoln Ave',
 'Malcolm X College Vaccination Site',
 'N Green St & W Lake St',
 'Broadway & Wilson - Truman College Vaccination Site',
 'W Armitage Ave & N Sheffield Ave',
 'N Southport Ave & W Newport Ave',
 'N Sheffield Ave & W Wellington Ave',
 'Base - 2132 W Hubbard Warehouse',
 'Avenue L & 114th St',
 'N Carpenter St & W Lake St',
 'N Damen Ave & W Wabansia St',
 'Western & 28th - Velasquez Institute Vaccination Site',
 'W Washington Blvd & N Peoria St']

In [73]:
# get the correct ids of exceptional stations in list not_changed
get_correct_id()

W Oakdale Ave & N Broadway :  1436495100903691938

N Paulina St & Lincoln Ave :  1436495122378528446

Malcolm X College Vaccination Site :  631

N Green St & W Lake St :  1436495109493626546

Broadway & Wilson - Truman College Vaccination Site :  293

W Armitage Ave & N Sheffield Ave :  1436495105198659242

N Southport Ave & W Newport Ave :  1436495115557663136

N Sheffield Ave & W Wellington Ave :  1436495118083561146

Avenue L & 114th St :  1448642175142467184

N Carpenter St & W Lake St :  1436495105198659246

Western & 28th - Velasquez Institute Vaccination Site :  446

W Washington Blvd & N Peoria St :  1436495109493626544

These stations below cannot be found in the official stations information dataset:


['Base - 2132 W Hubbard Warehouse', 'N Damen Ave & W Wabansia St']

***Note:***

The stations 'Base - 2132 W Hubbard Warehouse' and 'N Damen Ave & W Wabansia St' cannot be found in the official station information dataset. Due to lack of information, I will drop these records.

In [74]:
# drop records with start_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202101.drop(bt_202101[bt_202101['start_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with end_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202101.drop(bt_202101[bt_202101['end_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with start_station_name as 'N Damen Ave & W Wabansia St'
bt_202101.drop(bt_202101[bt_202101['start_station_name'] == 'N Damen Ave & W Wabansia St'].index, 
               inplace=True)

# drop records with end_station_name as 'N Damen Ave & W Wabansia St'
bt_202101.drop(bt_202101[bt_202101['end_station_name'] == 'N Damen Ave & W Wabansia St'].index, 
               inplace=True)

***Note:***

For the other 12 stations, I will manually update the ids in bt_202101.

In [75]:
# update station's id with name of 'Malcolm X College Vaccination Site'
filt_a1 = (bt_202101['start_station_name'] == 'Malcolm X College Vaccination Site')
bt_202101.loc[filt_a1, 'start_station_id'] = 631

filt_a2 = (bt_202101['end_station_name'] == 'Malcolm X College Vaccination Site')
bt_202101.loc[filt_a2, 'end_station_id'] = 631

In [76]:
# update station's id with name of 'N Southport Ave & W Newport Ave'
filt_b1 = (bt_202101['start_station_name'] == 'N Southport Ave & W Newport Ave')
bt_202101.loc[filt_b1, 'start_station_id'] = 1436495115557663136

filt_b2 = (bt_202101['end_station_name'] == 'N Southport Ave & W Newport Ave')
bt_202101.loc[filt_b2, 'end_station_id'] = 1436495115557663136

In [77]:
# update station's id with name of 'N Sheffield Ave & W Wellington Ave'
filt_c1 = (bt_202101['start_station_name'] == 'N Sheffield Ave & W Wellington Ave')
bt_202101.loc[filt_c1, 'start_station_id'] = 1436495118083561146

filt_c2 = (bt_202101['end_station_name'] == 'N Sheffield Ave & W Wellington Ave')
bt_202101.loc[filt_c2, 'end_station_id'] = 1436495118083561146

In [78]:
# update station's id with name of 'N Paulina St & Lincoln Ave'
filt_d1 = (bt_202101['start_station_name'] == 'N Paulina St & Lincoln Ave')
bt_202101.loc[filt_d1, 'start_station_id'] = 1436495122378528446

In [79]:
# update station's id with name of 'N Green St & W Lake St'
filt_e1 = (bt_202101['start_station_name'] == 'N Green St & W Lake St')
bt_202101.loc[filt_e1, 'start_station_id'] = 1436495109493626546

filt_e2 = (bt_202101['end_station_name'] == 'N Green St & W Lake St')
bt_202101.loc[filt_e2, 'end_station_id'] = 1436495109493626546

In [80]:
# update station's id with name of 'W Oakdale Ave & N Broadway'
filt_f1 = (bt_202101['start_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202101.loc[filt_f1, 'start_station_id'] = 1436495100903691938

filt_f2 = (bt_202101['end_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202101.loc[filt_f2, 'end_station_id'] = 1436495100903691938

In [81]:
# update station's id with name of 'W Armitage Ave & N Sheffield Ave'
filt_g1 = (bt_202101['start_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202101.loc[filt_g1, 'start_station_id'] = 1436495105198659242

filt_g2 = (bt_202101['end_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202101.loc[filt_g2, 'end_station_id'] = 1436495105198659242

In [82]:
# update station's id with name of 'Broadway & Wilson - Truman College Vaccination Site'
filt_h1 = (bt_202101['start_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202101.loc[filt_h1, 'start_station_id'] = 293

filt_h2 = (bt_202101['end_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202101.loc[filt_h2, 'end_station_id'] = 293

In [83]:
# update station's id with name of 'N Carpenter St & W Lake St'
filt_i1 = (bt_202101['end_station_name'] == 'N Carpenter St & W Lake St')
bt_202101.loc[filt_i1, 'end_station_id'] = 1436495105198659246

In [84]:
# update station's id with name of 'Western & 28th - Velasquez Institute Vaccination Site'
filt_j1 = (bt_202101['end_station_name'] == 'Western & 28th - Velasquez Institute Vaccination Site')
bt_202101.loc[filt_j1, 'end_station_id'] = 446

In [85]:
# update station's id with name of 'W Washington Blvd & N Peoria St'
filt_k1 = (bt_202101['end_station_name'] == 'W Washington Blvd & N Peoria St')
bt_202101.loc[filt_k1, 'end_station_id'] = 1436495109493626544

In [86]:
# update station's id with name of 'Avenue L & 114th St'
filt_l1 = (bt_202101['end_station_name'] == 'Avenue L & 114th St')
bt_202101.loc[filt_l1, 'end_station_id'] = 1448642175142467184

#### (2-11) dataset: bike trips of February 2021

In [87]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202102.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [88]:
bt_202102.dtypes

ride_id               object
rideable_type         object
started_at            object
ended_at              object
start_station_name    object
start_station_id      object
end_station_name      object
end_station_id        object
member_casual         object
dtype: object

In [89]:
# check uniqueness of ride_id
bt_202102['ride_id'].value_counts()

BEDCDEC418EE3E9E    1
7FD24B6A74774BD1    1
952F32C68DA31DB7    1
E1857BC43C61E582    1
DE15856A79C7D4FE    1
                   ..
DAC95B1BB797CF7E    1
AD34B8ACBF76B13D    1
AC121EBEC03B02C4    1
9832A441D41D835D    1
5F5FC1F78D682453    1
Name: ride_id, Length: 49622, dtype: int64
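
Because `value_counts()` sorts in descending order, a top count of 1 means every *ride_id* is unique. The same check can be expressed more directly. The sketch below runs on a small made-up frame (`demo` and its values are hypothetical, not taken from the trip data):

```python
import pandas as pd

# Hypothetical toy data standing in for one monthly trips DataFrame
demo = pd.DataFrame({'ride_id': ['A1', 'B2', 'C3', 'B2']})

# True only if every ride_id appears exactly once
all_unique = demo['ride_id'].is_unique

# Number of rows whose ride_id repeats an earlier row
n_duplicates = demo['ride_id'].duplicated().sum()

print(all_unique, n_duplicates)
```

On the real data, `bt_202102['ride_id'].is_unique` would return a single boolean instead of a truncated listing.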

In [90]:
# check NAN values
bt_202102.isna().sum()

ride_id                  0
rideable_type            0
started_at               0
ended_at                 0
start_station_name    4046
start_station_id      4046
end_station_name      5358
end_station_id        5358
member_casual            0
dtype: int64

In [91]:
# drop all NAN values
bt_202102.dropna(inplace = True)
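
The per-column NaN counts above can overlap, since one trip may lack both its start and its end station, so they do not add up to the number of rows that `dropna` removes. A minimal sketch of how to count the affected rows, using a made-up frame (`demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two rows have at least one missing station name
demo = pd.DataFrame({
    'ride_id': ['A1', 'B2', 'C3', 'D4'],
    'start_station_name': ['X', np.nan, np.nan, 'Y'],
    'end_station_name': ['Y', np.nan, 'X', 'X'],
})

rows_before = len(demo)

# Rows with a NaN in any column: exactly the rows dropna() would remove
rows_with_nan = demo.isna().any(axis=1).sum()

cleaned = demo.dropna()
rows_after = len(cleaned)

print(rows_before, rows_with_nan, rows_after)
```

Comparing `rows_with_nan` with the total row count gives the share of trips lost at this step.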

#### After examining *start_station_id* and *end_station_id* in this dataset, I found the same inconsistency between station ids and their corresponding names as before.

#### Therefore, I will re-run the whole process of validating station names against their ids, using:
- the official station information dataset from the Chicago Data Portal website;
- the dictionary "station_id" of all stations and ids that exist in the previous 8 datasets;
- the function "correct_station_name_id" to either update ids or catch exceptions;
- the function "get_correct_id" to look up the correct ids for the exceptions.

In [92]:
# reset the not_changed list to empty
not_changed = []

# update stations' ids in bt_202102, or catch stations that do not exist in the previous 8 datasets
correct_station_name_id(bt_202102)

These stations below are not existed in the previous 8 datasets:


['W Oakdale Ave & N Broadway',
 'N Paulina St & Lincoln Ave',
 'Malcolm X College Vaccination Site',
 'Broadway & Wilson - Truman College Vaccination Site',
 'W Armitage Ave & N Sheffield Ave',
 'W Washington Blvd & N Peoria St',
 'Base - 2132 W Hubbard Warehouse',
 'N Hampden Ct & W Diversey Ave',
 'N Carpenter St & W Lake St',
 'N Green St & W Lake St',
 'Western & 28th - Velasquez Institute Vaccination Site',
 'N Sheffield Ave & W Wellington Ave']

In [93]:
# get the correct ids of exceptional stations in list not_changed
get_correct_id()

W Oakdale Ave & N Broadway :  1436495100903691938

N Paulina St & Lincoln Ave :  1436495122378528446

Malcolm X College Vaccination Site :  631

Broadway & Wilson - Truman College Vaccination Site :  293

W Armitage Ave & N Sheffield Ave :  1436495105198659242

W Washington Blvd & N Peoria St :  1436495109493626544

N Carpenter St & W Lake St :  1436495105198659246

N Green St & W Lake St :  1436495109493626546

Western & 28th - Velasquez Institute Vaccination Site :  446

N Sheffield Ave & W Wellington Ave :  1436495118083561146

These stations below cannot be found in the official stations information dataset:


['Base - 2132 W Hubbard Warehouse', 'N Hampden Ct & W Diversey Ave']

***Note:***

The stations 'Base - 2132 W Hubbard Warehouse' and 'N Hampden Ct & W Diversey Ave' cannot be found in the official station information dataset. Due to lack of information, I will drop these records.

In [94]:
# drop records with start_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202102.drop(bt_202102[bt_202102['start_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with end_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202102.drop(bt_202102[bt_202102['end_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with start_station_name as 'N Hampden Ct & W Diversey Ave'
bt_202102.drop(bt_202102[bt_202102['start_station_name'] == 'N Hampden Ct & W Diversey Ave'].index, 
               inplace=True)

# drop records with end_station_name as 'N Hampden Ct & W Diversey Ave'
bt_202102.drop(bt_202102[bt_202102['end_station_name'] == 'N Hampden Ct & W Diversey Ave'].index, 
               inplace=True)
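
The four separate drops above could also be written as one boolean filter with `isin`. This is only a sketch on a made-up frame: `demo` and 'Station A' are hypothetical, while the two unknown station names come from the output above.

```python
import pandas as pd

# Hypothetical toy data; 'Station A' is a placeholder name
demo = pd.DataFrame({
    'start_station_name': ['Base - 2132 W Hubbard Warehouse', 'Station A', 'Station A'],
    'end_station_name': ['Station A', 'N Hampden Ct & W Diversey Ave', 'Station A'],
})

# Stations that could not be found in the official station information dataset
unknown_stations = ['Base - 2132 W Hubbard Warehouse', 'N Hampden Ct & W Diversey Ave']

# True where either end of the trip touches an unknown station
touches_unknown = (demo['start_station_name'].isin(unknown_stations)
                   | demo['end_station_name'].isin(unknown_stations))

# Keep only the trips that do not touch an unknown station
demo = demo[~touches_unknown]

print(len(demo))
```

Keeping the names in one list makes it easier to extend when a later month reports an additional unknown station.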

***Note:***

For the other 10 stations, I will manually update the ids in bt_202102.
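
As an alternative to one cell per station, the manual updates could be driven by a single name-to-id dictionary. The sketch below is an illustration only: `demo`, `manual_ids`, 'Station A', and the 'old-…' ids are hypothetical, and the two name/id pairs are copied from the `get_correct_id()` output above.

```python
import pandas as pd

# Hypothetical toy data; dtype=object mirrors the object dtype of the id columns
demo = pd.DataFrame({
    'start_station_name': ['Malcolm X College Vaccination Site', 'Station A'],
    'start_station_id': ['old-1', '100'],
    'end_station_name': ['Station A', 'N Paulina St & Lincoln Ave'],
    'end_station_id': ['100', 'old-2'],
}, dtype=object)

# name -> id pairs (two of the ten stations, as an example)
manual_ids = {
    'Malcolm X College Vaccination Site': 631,
    'N Paulina St & Lincoln Ave': 1436495122378528446,
}

# Apply every correction to both the start and the end columns
for name, new_id in manual_ids.items():
    for side in ('start', 'end'):
        mask = demo[f'{side}_station_name'] == name
        demo.loc[mask, f'{side}_station_id'] = new_id

print(demo[['start_station_id', 'end_station_id']])
```

Where a station does not appear on one side, the mask is empty and nothing changes, so the loop can safely cover both columns for every name.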

In [95]:
# update station's id with name of 'Malcolm X College Vaccination Site'
filt_a1 = (bt_202102['start_station_name'] == 'Malcolm X College Vaccination Site')
bt_202102.loc[filt_a1, 'start_station_id'] = 631

filt_a2 = (bt_202102['end_station_name'] == 'Malcolm X College Vaccination Site')
bt_202102.loc[filt_a2, 'end_station_id'] = 631

In [96]:
# update station's id with name of 'N Paulina St & Lincoln Ave'
filt_b1 = (bt_202102['start_station_name'] == 'N Paulina St & Lincoln Ave')
bt_202102.loc[filt_b1, 'start_station_id'] = 1436495122378528446

filt_b2 = (bt_202102['end_station_name'] == 'N Paulina St & Lincoln Ave')
bt_202102.loc[filt_b2, 'end_station_id'] = 1436495122378528446

In [97]:
# update station's id with name of 'W Oakdale Ave & N Broadway'
filt_c1 = (bt_202102['start_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202102.loc[filt_c1, 'start_station_id'] = 1436495100903691938

filt_c2 = (bt_202102['end_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202102.loc[filt_c2, 'end_station_id'] = 1436495100903691938

In [98]:
# update station's id with name of 'W Washington Blvd & N Peoria St'
filt_d1 = (bt_202102['start_station_name'] == 'W Washington Blvd & N Peoria St')
bt_202102.loc[filt_d1, 'start_station_id'] = 1436495109493626544

filt_d2 = (bt_202102['end_station_name'] == 'W Washington Blvd & N Peoria St')
bt_202102.loc[filt_d2, 'end_station_id'] = 1436495109493626544

In [99]:
# update station's id with name of 'W Armitage Ave & N Sheffield Ave'
filt_e1 = (bt_202102['start_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202102.loc[filt_e1, 'start_station_id'] = 1436495105198659242

filt_e2 = (bt_202102['end_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202102.loc[filt_e2, 'end_station_id'] = 1436495105198659242

In [100]:
# update station's id with name of 'Broadway & Wilson - Truman College Vaccination Site'
filt_f1 = (bt_202102['start_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202102.loc[filt_f1, 'start_station_id'] = 293

filt_f2 = (bt_202102['end_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202102.loc[filt_f2, 'end_station_id'] = 293

In [101]:
# update station's id with name of 'N Carpenter St & W Lake St'
filt_g1 = (bt_202102['end_station_name'] == 'N Carpenter St & W Lake St')
bt_202102.loc[filt_g1, 'end_station_id'] = 1436495105198659246

In [102]:
# update station's id with name of 'Western & 28th - Velasquez Institute Vaccination Site'
filt_h1 = (bt_202102['end_station_name'] == 'Western & 28th - Velasquez Institute Vaccination Site')
bt_202102.loc[filt_h1, 'end_station_id'] = 446

In [103]:
# update station's id with name of 'N Sheffield Ave & W Wellington Ave'
filt_i1 = (bt_202102['end_station_name'] == 'N Sheffield Ave & W Wellington Ave')
bt_202102.loc[filt_i1, 'end_station_id'] = 1436495118083561146

In [104]:
# update station's id with name of 'N Green St & W Lake St'
filt_j1 = (bt_202102['end_station_name'] == 'N Green St & W Lake St')
bt_202102.loc[filt_j1, 'end_station_id'] = 1436495109493626546

#### (2-12) dataset: bike trips of March 2021

In [105]:
# drop irrelevant columns: start_lat, start_lng, end_lat, end_lng
bt_202103.drop(columns = ['start_lat', 'start_lng', 'end_lat', 'end_lng'], inplace = True)

In [106]:
bt_202103.dtypes

ride_id               object
rideable_type         object
started_at            object
ended_at              object
start_station_name    object
start_station_id      object
end_station_name      object
end_station_id        object
member_casual         object
dtype: object

In [107]:
# check uniqueness of ride_id
bt_202103['ride_id'].value_counts()

76A1C5D305B63FFF    1
6373E7C6ADDA8648    1
E776E0D82830FA93    1
6BF9A57E48F85219    1
171F7CB6F55A87A6    1
                   ..
00B2D81DD205863F    1
153DBD11DCF784CD    1
05E268EA182D34E7    1
1FE7C0924FC356E8    1
7595BEA1DBA8B969    1
Name: ride_id, Length: 228496, dtype: int64

In [108]:
# check NAN values
bt_202103.isna().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    14848
start_station_id      14848
end_station_name      16727
end_station_id        16727
member_casual             0
dtype: int64

In [109]:
# drop all NAN values
bt_202103.dropna(inplace = True)

#### After examining *start_station_id* and *end_station_id* in this dataset, I found the same inconsistency between station ids and their corresponding names as before.

#### Therefore, I will re-run the whole process of validating station names against their ids, using:
- the official station information dataset from the Chicago Data Portal website;
- the dictionary "station_id" of all stations and ids that exist in the previous 8 datasets;
- the function "correct_station_name_id" to either update ids or catch exceptions;
- the function "get_correct_id" to look up the correct ids for the exceptions.

In [110]:
# reset the not_changed list to empty
not_changed = []

# update stations' ids in bt_202103, or catch stations that do not exist in the previous 8 datasets
correct_station_name_id(bt_202103)

These stations below are not existed in the previous 8 datasets:


['W Oakdale Ave & N Broadway',
 'N Hampden Ct & W Diversey Ave',
 'N Paulina St & Lincoln Ave',
 'Halsted & 63rd - Kennedy-King Vaccination Site',
 'Kedzie Ave & 110th St',
 'Chicago State University',
 'N Carpenter St & W Lake St',
 'Malcolm X College Vaccination Site',
 'N Green St & W Lake St',
 'Broadway & Wilson - Truman College Vaccination Site',
 'W Armitage Ave & N Sheffield Ave',
 'Damen Ave & Wabansia Ave',
 'Western & 28th - Velasquez Institute Vaccination Site',
 'N Southport Ave & W Newport Ave',
 'W Washington Blvd & N Peoria St',
 'N Sheffield Ave & W Wellington Ave',
 'Base - 2132 W Hubbard Warehouse',
 'N Damen Ave & W Wabansia St']

In [111]:
# get the correct ids of exceptional stations in list not_changed
get_correct_id()

W Oakdale Ave & N Broadway :  1436495100903691938

N Paulina St & Lincoln Ave :  1436495122378528446

Halsted & 63rd - Kennedy-King Vaccination Site :  388

Kedzie Ave & 110th St :  736

Chicago State University :  737

N Carpenter St & W Lake St :  1436495105198659246

Malcolm X College Vaccination Site :  631

N Green St & W Lake St :  1436495109493626546

Broadway & Wilson - Truman College Vaccination Site :  293

W Armitage Ave & N Sheffield Ave :  1436495105198659242

Damen Ave & Wabansia Ave :  1521686986436309688

Western & 28th - Velasquez Institute Vaccination Site :  446

N Southport Ave & W Newport Ave :  1436495115557663136

W Washington Blvd & N Peoria St :  1436495109493626544

N Sheffield Ave & W Wellington Ave :  1436495118083561146

These stations below cannot be found in the official stations information dataset:


['N Hampden Ct & W Diversey Ave',
 'Base - 2132 W Hubbard Warehouse',
 'N Damen Ave & W Wabansia St']

***Note:***

The stations 'Base - 2132 W Hubbard Warehouse', 'N Hampden Ct & W Diversey Ave', and 'N Damen Ave & W Wabansia St' cannot be found in the official station information dataset. Due to lack of information, I will drop these records.

In [112]:
# drop records with start_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202103.drop(bt_202103[bt_202103['start_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with end_station_name as 'Base - 2132 W Hubbard Warehouse'
bt_202103.drop(bt_202103[bt_202103['end_station_name'] == 'Base - 2132 W Hubbard Warehouse'].index, 
               inplace=True)

# drop records with start_station_name as 'N Hampden Ct & W Diversey Ave'
bt_202103.drop(bt_202103[bt_202103['start_station_name'] == 'N Hampden Ct & W Diversey Ave'].index, 
               inplace=True)

# drop records with end_station_name as 'N Hampden Ct & W Diversey Ave'
bt_202103.drop(bt_202103[bt_202103['end_station_name'] == 'N Hampden Ct & W Diversey Ave'].index, 
               inplace=True)

# drop records with start_station_name as 'N Damen Ave & W Wabansia St'
bt_202103.drop(bt_202103[bt_202103['start_station_name'] == 'N Damen Ave & W Wabansia St'].index, 
               inplace=True)

# drop records with end_station_name as 'N Damen Ave & W Wabansia St'
bt_202103.drop(bt_202103[bt_202103['end_station_name'] == 'N Damen Ave & W Wabansia St'].index, 
               inplace=True)

***Note:***

For the other 15 stations, I will manually update the ids in bt_202103.

In [113]:
# update station's id with name of 'Broadway & Wilson - Truman College Vaccination Site'
filt_a1 = (bt_202103['start_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202103.loc[filt_a1, 'start_station_id'] = 293

filt_a2 = (bt_202103['end_station_name'] == 'Broadway & Wilson - Truman College Vaccination Site')
bt_202103.loc[filt_a2, 'end_station_id'] = 293

In [114]:
# update station's id with name of 'Damen Ave & Wabansia Ave'
filt_b1 = (bt_202103['start_station_name'] == 'Damen Ave & Wabansia Ave')
bt_202103.loc[filt_b1, 'start_station_id'] = 1521686986436309688

filt_b2 = (bt_202103['end_station_name'] == 'Damen Ave & Wabansia Ave')
bt_202103.loc[filt_b2, 'end_station_id'] = 1521686986436309688

In [115]:
# update station's id with name of 'Western & 28th - Velasquez Institute Vaccination Site'
filt_c1 = (bt_202103['start_station_name'] == 'Western & 28th - Velasquez Institute Vaccination Site')
bt_202103.loc[filt_c1, 'start_station_id'] = 446

filt_c2 = (bt_202103['end_station_name'] == 'Western & 28th - Velasquez Institute Vaccination Site')
bt_202103.loc[filt_c2, 'end_station_id'] = 446

In [116]:
# update station's id with name of 'Chicago State University'
filt_d1 = (bt_202103['start_station_name'] == 'Chicago State University')
bt_202103.loc[filt_d1, 'start_station_id'] = 737

filt_d2 = (bt_202103['end_station_name'] == 'Chicago State University')
bt_202103.loc[filt_d2, 'end_station_id'] = 737

In [117]:
# update station's id with name of 'W Armitage Ave & N Sheffield Ave'
filt_e1 = (bt_202103['start_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202103.loc[filt_e1, 'start_station_id'] = 1436495105198659242

filt_e2 = (bt_202103['end_station_name'] == 'W Armitage Ave & N Sheffield Ave')
bt_202103.loc[filt_e2, 'end_station_id'] = 1436495105198659242

In [118]:
# update station's id with name of 'W Washington Blvd & N Peoria St'
filt_f1 = (bt_202103['start_station_name'] == 'W Washington Blvd & N Peoria St')
bt_202103.loc[filt_f1, 'start_station_id'] = 1436495109493626544

filt_f2 = (bt_202103['end_station_name'] == 'W Washington Blvd & N Peoria St')
bt_202103.loc[filt_f2, 'end_station_id'] = 1436495109493626544

In [119]:
# update station's id with name of 'N Paulina St & Lincoln Ave'
filt_g1 = (bt_202103['start_station_name'] == 'N Paulina St & Lincoln Ave')
bt_202103.loc[filt_g1, 'start_station_id'] = 1436495122378528446
              
filt_g2 = (bt_202103['end_station_name'] == 'N Paulina St & Lincoln Ave')
bt_202103.loc[filt_g2, 'end_station_id'] = 1436495122378528446

In [120]:
# update station's id with name of 'Halsted & 63rd - Kennedy-King Vaccination Site'
filt_h1 = (bt_202103['start_station_name'] == 'Halsted & 63rd - Kennedy-King Vaccination Site')
bt_202103.loc[filt_h1, 'start_station_id'] = 388

filt_h2 = (bt_202103['end_station_name'] == 'Halsted & 63rd - Kennedy-King Vaccination Site')
bt_202103.loc[filt_h2, 'end_station_id'] = 388

In [121]:
# update station's id with name of 'Kedzie Ave & 110th St'
filt_i1 = (bt_202103['start_station_name'] == 'Kedzie Ave & 110th St')
bt_202103.loc[filt_i1, 'start_station_id'] = 736

filt_i2 = (bt_202103['end_station_name'] == 'Kedzie Ave & 110th St')
bt_202103.loc[filt_i2, 'end_station_id'] = 736

In [122]:
# update station's id with name of 'N Carpenter St & W Lake St'
filt_j1 = (bt_202103['start_station_name'] == 'N Carpenter St & W Lake St')
bt_202103.loc[filt_j1, 'start_station_id'] = 1436495105198659246

filt_j2 = (bt_202103['end_station_name'] == 'N Carpenter St & W Lake St')
bt_202103.loc[filt_j2, 'end_station_id'] = 1436495105198659246

In [123]:
# update station's id with name of 'Malcolm X College Vaccination Site'
filt_k1 = (bt_202103['start_station_name'] == 'Malcolm X College Vaccination Site')
bt_202103.loc[filt_k1, 'start_station_id'] = 631

filt_k2 = (bt_202103['end_station_name'] == 'Malcolm X College Vaccination Site')
bt_202103.loc[filt_k2, 'end_station_id'] = 631

In [124]:
# update station's id with name of 'N Sheffield Ave & W Wellington Ave'
filt_l1 = (bt_202103['start_station_name'] == 'N Sheffield Ave & W Wellington Ave')
bt_202103.loc[filt_l1, 'start_station_id'] = 1436495118083561146

filt_l2 = (bt_202103['end_station_name'] == 'N Sheffield Ave & W Wellington Ave')
bt_202103.loc[filt_l2, 'end_station_id'] = 1436495118083561146

In [125]:
# update station's id with name of 'W Oakdale Ave & N Broadway'
filt_m1 = (bt_202103['start_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202103.loc[filt_m1, 'start_station_id'] = 1436495100903691938

filt_m2 = (bt_202103['end_station_name'] == 'W Oakdale Ave & N Broadway')
bt_202103.loc[filt_m2, 'end_station_id'] = 1436495100903691938

In [126]:
# update station's id with name of 'N Southport Ave & W Newport Ave'
filt_n1 = (bt_202103['start_station_name'] == 'N Southport Ave & W Newport Ave')
bt_202103.loc[filt_n1, 'start_station_id'] = 1436495115557663136

filt_n2 = (bt_202103['end_station_name'] == 'N Southport Ave & W Newport Ave')
bt_202103.loc[filt_n2, 'end_station_id'] = 1436495115557663136

In [127]:
# update station's id with name of 'N Green St & W Lake St'
filt_o1 = (bt_202103['start_station_name'] == 'N Green St & W Lake St')
bt_202103.loc[filt_o1, 'start_station_id'] = 1436495109493626546

filt_o2 = (bt_202103['end_station_name'] == 'N Green St & W Lake St')
bt_202103.loc[filt_o2, 'end_station_id'] = 1436495109493626546
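
After the updates, a quick sanity check is to confirm that each station name maps to exactly one id. The sketch below uses a made-up frame (`demo` and its station names are hypothetical). Because the id columns have dtype object, they can hold both strings and integers, so the sketch casts ids to `str` first; whether that normalization suits the real data is an assumption.

```python
import pandas as pd

# Hypothetical toy data: 'Station A' mixes int and str forms of the same id,
# 'Station C' has two different ids
demo = pd.DataFrame({
    'start_station_name': ['Station A', 'Station A', 'Station B', 'Station C', 'Station C'],
    'start_station_id': [293, '293', 631, 5, 6],
}, dtype=object)

# Count distinct ids per station name after normalizing ids to strings
ids_per_name = (
    demo.assign(start_station_id=demo['start_station_id'].astype(str))
        .groupby('start_station_name')['start_station_id']
        .nunique()
)

# Station names that still map to more than one id
inconsistent = ids_per_name[ids_per_name > 1]

print(inconsistent)
```

An empty `inconsistent` result for both the start and the end columns would suggest the name-to-id mapping is consistent within that month.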

### 3. Save cleaned datasets for next phase of data analysis

In [128]:
bt_202004.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202004.csv', index = False)
bt_202005.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202005.csv', index = False)
bt_202006.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202006.csv', index = False)
bt_202007.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202007.csv', index = False)
bt_202008.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202008.csv', index = False)
bt_202009.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202009.csv', index = False)
bt_202010.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202010.csv', index = False)
bt_202011.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202011.csv', index = False)
bt_202012.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202012.csv', index = False)
bt_202101.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202101.csv', index = False)
bt_202102.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202102.csv', index = False)
bt_202103.to_csv('~/PROJECTS/02_cyclistic_bike_share/cleaned_datasets/bt_202103.csv', index = False)
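
The twelve `to_csv` calls follow one pattern, so they could be written as a loop over a dictionary of DataFrames. A minimal sketch, assuming a dictionary named `cleaned` (hypothetical) and writing to a temporary directory rather than the project folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the cleaned monthly DataFrames
cleaned = {
    'bt_202102': pd.DataFrame({'ride_id': ['A1']}),
    'bt_202103': pd.DataFrame({'ride_id': ['B2', 'C3']}),
}

# Temporary directory for the sketch; the notebook would use the cleaned_datasets folder
out_dir = Path(tempfile.mkdtemp())

# Write one CSV per month, named after its dictionary key
for name, df in cleaned.items():
    df.to_csv(out_dir / f'{name}.csv', index=False)

saved = sorted(p.name for p in out_dir.glob('*.csv'))
print(saved)
```

The same dictionary could also drive the cleaning steps, which would keep the list of months in one place.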