# Preparing datasets: Feb'24
This notebook builds a new dataset for February 2024 from *Citibike Trip data*. The result can be used to count **bike rentals** per station at a <u>fixed time resolution</u>.

## Preparing Trip data

In [1]:
import pandas as pd
citi_feb_24 = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Datasets/Citibike Trip Data/202402-citibike-tripdata.csv")
citi_feb_24.head()



Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,B2B980E0EAE1D6F1,classic_bike,2024-02-25 20:25:40.894,2024-02-25 20:43:58.504,Greenwich St & Hubert St,5470.1,Hudson Blvd W & W 36 St,6611.07,40.721319,-74.010065,40.756765,-73.999714,member
1,1069DDA1FED20568,classic_bike,2024-02-21 22:21:38.446,2024-02-21 22:40:12.259,Greenwich St & Hubert St,5470.1,Hudson Blvd W & W 36 St,6611.07,40.721319,-74.010065,40.756765,-73.999714,member
2,B58850AF6F2D8BD5,electric_bike,2024-02-14 08:31:14.609,2024-02-14 08:42:30.427,Mercer St & Bleecker St,5679.05,W 20 St & 10 Ave,6306.01,40.727068,-73.996554,40.745686,-74.005141,member
3,D46E6C5A69048E11,electric_bike,2024-02-05 08:42:25.999,2024-02-05 08:56:26.899,E 20 St & FDR Dr,5886.13,E 74 St & 1 Ave,6953.08,40.733155,-73.975561,40.768974,-73.954823,member
4,707AF4CF2C7834C2,electric_bike,2024-02-08 11:13:15.969,2024-02-08 11:18:44.259,E 20 St & FDR Dr,5886.13,Pitt St & Stanton St,5406.04,40.733184,-73.975525,40.719261,-73.98178,member


In [3]:
# There are NaN values in the dataset
sum(citi_feb_24["start_station_name"].isna())

1866

In [7]:
# Station names are missing only for trips made with electric bikes
citi_feb_24[citi_feb_24["start_station_name"].isna()]["rideable_type"].unique()

array(['electric_bike'], dtype=object)

In [8]:
# If the dataset is filtered for 'classic bikes' all starting stations should be visible
sum(citi_feb_24[citi_feb_24["rideable_type"] == "classic_bike"]["start_station_name"].isna())

0
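The three checks above can also be done in one step by counting missing station names per bike type. This is a minimal sketch on made-up rows, not the real trip file. The toy frame `trips` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the trip data (hypothetical rows, same column names)
trips = pd.DataFrame({
    "rideable_type": ["classic_bike", "electric_bike", "electric_bike"],
    "start_station_name": ["A St", None, "B St"],
})

# Count missing start stations separately for each bike type
missing_by_type = (
    trips["start_station_name"].isna().groupby(trips["rideable_type"]).sum()
)
print(missing_by_type)
```

A bike type with a count of zero has a start station on every trip, which is the property the classic-bike filter relies on.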

In [10]:
# filtering the dataset
cbikes = citi_feb_24[citi_feb_24["rideable_type"] == "classic_bike"]
cbikes["rideable_type"].unique()

array(['classic_bike'], dtype=object)

In [11]:
# reducing columns
cbikes = cbikes[["ride_id", "started_at", "ended_at", "start_station_name", "start_station_id", "end_station_name", "end_station_id", "start_lat", "start_lng", "end_lat", "end_lng"]]

# ride_id can be used to count trips
print(len(cbikes), len(cbikes["ride_id"].unique()))

753520 753520


Instead of relying on the station IDs provided by **Citibike**, manually constructed IDs can be added to the dataset. This makes it easier to reference the *rentals* table below and to join it with the original filtered dataset.<br><br>
<u>Note</u>: "Original filtered dataset" means the classic-bike data with NaN values and redundant columns removed, i.e. the **cbikes** dataframe.

In [13]:
# Copying cbikes df
cbikes_altered = cbikes.copy()

# Merging the two dfs
new_df = pd.merge(
    left=cbikes_altered, 
    right=rentals,
    how='left',
    left_on=['start_station_name', 'start_lat', 'start_lng'],
    right_on=['name', 'lat', 'lng'],
)

# Preserving bike trip info only with the starting station info
cbikes_altered = new_df[["ride_id", "started_at", "start_station_name", "station_id", "start_lat", "start_lng"]]

# Checking validity
print(len(cbikes_altered["station_id"].unique()))
cbikes_altered.head()

2124


Unnamed: 0,ride_id,started_at,start_station_name,station_id,start_lat,start_lng
0,B2B980E0EAE1D6F1,2024-02-25 20:25:40.894,Greenwich St & Hubert St,1258,40.721319,-74.010065
1,1069DDA1FED20568,2024-02-21 22:21:38.446,Greenwich St & Hubert St,1258,40.721319,-74.010065
2,D1D8AD3420FF1023,2024-02-08 08:40:29.551,E 22 St & 2 Ave,1005,40.737169,-73.981225
3,F3F683675DEFAD69,2024-02-22 15:02:53.380,Grand Concourse & E 205 St,1238,40.875531,-73.886386
4,89C20366C7C4DFC6,2024-02-22 08:31:11.656,E 106 St & Madison Ave,904,40.793434,-73.94945
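A left merge like the one above silently keeps trips that find no matching station and duplicates trips if a station key appears twice. The sketch below shows one way to guard against both, using the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames `trips` and `stations` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trips and station lookup with the same key columns as above
trips = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3"],
    "start_station_name": ["A St", "A St", "B St"],
    "start_lat": [40.1, 40.1, 40.2],
    "start_lng": [-73.9, -73.9, -73.8],
})
stations = pd.DataFrame({
    "name": ["A St", "B St"],
    "lat": [40.1, 40.2],
    "lng": [-73.9, -73.8],
    "station_id": [0, 1],
})

merged = trips.merge(
    stations,
    how="left",
    left_on=["start_station_name", "start_lat", "start_lng"],
    right_on=["name", "lat", "lng"],
    validate="many_to_one",  # raises if a station key is duplicated
    indicator=True,          # adds a '_merge' column describing each row
)

# Trips that found no station would be flagged as 'left_only'
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

Because the keys include floating-point coordinates, a match requires exactly equal values on both sides, which holds here since the lookup table is derived from the same columns.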


In [14]:
# exporting to 'C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code'
cbikes_altered.to_csv("Tripdata_with_manual_ids_feb.csv", index=False)

In [17]:
# The trip data contains some January trips, which should be removed
min(cbikes_altered["started_at"])

'2024-01-30 23:04:35.828'

## Creating Rentals Template
The new template should contain stations with their location and time information included. Each row represents <u> one unit of chosen time resolution</u>.
<br><br>
**NOTE:** Station ID needs to be retained because there are some stations with the same name but slightly different locations!
<br><br>
In the dataset, some IDs are stored as text and others are stored as numbers. Converting the data type to *string* does not work because **\n** and some other text is being added automatically.
<br><br>
**Solution:**<br>
The stations can be manually given unique IDs for the purpose of identification.
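As a possible alternative to manual IDs, mixed ID types can sometimes be avoided by reading the ID column as text from the start. The sketch below uses the `dtype` parameter of `pd.read_csv` on a small in-memory CSV. The rows, including the ID `X123`, are made up, and whether this resolves the issue described above for the real file is an assumption that has not been verified here.

```python
import io

import pandas as pd

# Hypothetical CSV with numeric-looking and text station IDs in one column
csv_text = "ride_id,start_station_id\nr1,5470.1\nr2,X123\nr3,6611.07\n"

# Read the ID column as text so every value keeps its original spelling
df = pd.read_csv(io.StringIO(csv_text), dtype={"start_station_id": str})
print(df["start_station_id"].tolist())
```

The manual IDs used in this notebook remain the simpler choice when the IDs only need to be unique within the notebook.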

In [12]:
# Creating the new dataset
rentals = pd.DataFrame()
rentals[["name", "lat", "lng"]] = cbikes[["start_station_name", "start_lat", "start_lng"]].drop_duplicates()
rentals = rentals.sort_values(by = ["name"], ignore_index=True)

# Adding IDs manually
rentals["station_id"] = range(len(rentals))

# Verifying the number of stations
print(len(cbikes[["start_station_name", "start_lat", "start_lng"]].drop_duplicates()), len(rentals["station_id"]))

2124 2124


## Defining time resolution
Here time units are defined that will later be merged with the *rentals* template.

In [20]:
# importing the data
import pandas as pd
cbikes_altered = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Code/Tripdata_with_manual_ids_feb.csv")

# defining the 7 time units for a given date
time_res = ['08:00:00.000', '10:00:00.000', '12:00:00.000', '14:00:00.000', '16:00:00.000', '18:00:00.000', '20:00:00.000']

# repeating time units 29 times - till 29 Feb
time_res = time_res*29

# creating the 29 dates of February 2024 as 'YYYY-MM-DD' strings
dates_included = [f"2024-02-{day:02d}" for day in range(1, 30)]

dates_included[:5]

['2024-02-01', '2024-02-02', '2024-02-03', '2024-02-04', '2024-02-05']

Now the date and time information need to be combined. Each day has 7 time units, so each date must be repeated 7 times to accommodate the time resolution.

In [22]:
# repeating each element 7 times
dates_included = [element for element in dates_included for _ in range(7)]

# Adding time dimension: pair each repeated date with its time unit
date_time = [d + ' ' + t for d, t in zip(dates_included, time_res)]

date_time[:7]

['2024-02-01 08:00:00.000',
 '2024-02-01 10:00:00.000',
 '2024-02-01 12:00:00.000',
 '2024-02-01 14:00:00.000',
 '2024-02-01 16:00:00.000',
 '2024-02-01 18:00:00.000',
 '2024-02-01 20:00:00.000']
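The same grid of 29 days with 7 two-hour slots each can be built from pandas date utilities instead of string manipulation. This is a minimal sketch that produces `Timestamp` objects rather than the strings used above. The names `days` and `offsets` are introduced here for illustration.

```python
import pandas as pd

# One entry per day of February 2024
days = pd.date_range("2024-02-01", "2024-02-29", freq="D")

# Slot start times within a day: 08:00, 10:00, ..., 20:00
offsets = pd.to_timedelta(list(range(8, 22, 2)), unit="h")

# Every day combined with every slot, day-major order
date_time = [day + off for day in days for off in offsets]

print(len(date_time))
print(date_time[0], date_time[-1])
```

Working with timestamps avoids parsing the strings again later, at the cost of a different column type in the exported CSV.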

## Adding time dimensionality to the *rentals* template
A time resolution needs to be added so that the number of rentals can be calculated per time unit and per station.

In [23]:
# rows needed for each station
len(date_time)

203

In [24]:
# duplicating each station 203 times
rentals = rentals.loc[rentals.index.repeat(203)].reset_index(drop=True)

# Repeating date_time: 2124 unique stations
date_time = date_time*2124
rentals["datetime"] = date_time
rentals.head()

Unnamed: 0,name,lat,lng,station_id,datetime
0,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 08:00:00.000
1,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 10:00:00.000
2,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 12:00:00.000
3,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 14:00:00.000
4,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 16:00:00.000


In [25]:
len(rentals)

431172
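The row count is the product of stations and time units (2124 × 203 = 431172). A cross join expresses this pairing directly and avoids hard-coding either number. The sketch below uses `merge(how="cross")`, available in pandas 1.2 and later, on toy frames. `stations` and `slots` are assumed stand-ins for the station table and the list of time units.

```python
import pandas as pd

# Hypothetical station table (two stations) and time-unit table (three slots)
stations = pd.DataFrame({
    "name": ["A St", "B St"],
    "lat": [40.1, 40.2],
    "lng": [-73.9, -73.8],
    "station_id": [0, 1],
})
slots = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2024-02-01 08:00", "2024-02-01 10:00", "2024-02-01 12:00"]
    )
})

# Pair every station with every time unit
template = stations.merge(slots, how="cross")
print(len(template))
```

With this approach the template size follows from the inputs, so adding a station or a time unit does not require editing any constants.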

## Calculating Demand
The rentals are calculated for every time unit, per day per station. The dataset is being prepared only until 29 February 2024.

In [26]:
# importing
tripdata = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Code/Tripdata_with_manual_ids_feb.csv")

# adding a rentals column 
rentals['#_rentals'] = 0

rentals.head()

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals
0,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 08:00:00.000,0
1,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 10:00:00.000,0
2,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 12:00:00.000,0
3,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 14:00:00.000,0
4,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 16:00:00.000,0


In [27]:
# Updating the rentals dataset (parse the timestamps once, then use the .dt accessor)
rentals_dt = pd.to_datetime(rentals["datetime"])
rentals['year'] = rentals_dt.dt.year
rentals['month'] = rentals_dt.dt.month
rentals['day'] = rentals_dt.dt.day

rentals['hour'] = rentals_dt.dt.hour
rentals['minute'] = rentals_dt.dt.minute
rentals['second'] = rentals_dt.dt.second
rentals['microsecond'] = rentals_dt.dt.microsecond

# Updating the trips dataset
started_dt = pd.to_datetime(tripdata["started_at"])
tripdata['started_at_year'] = started_dt.dt.year
tripdata['started_at_month'] = started_dt.dt.month
tripdata['started_at_day'] = started_dt.dt.day

tripdata['started_at_hour'] = started_dt.dt.hour
tripdata['started_at_minute'] = started_dt.dt.minute
tripdata['started_at_second'] = started_dt.dt.second
tripdata['started_at_microsecond'] = started_dt.dt.microsecond

In [31]:
# The tripdata has some misleading Jan data that should be removed
min(tripdata["started_at_month"])

1

In [32]:
# After Correction
tripdata = tripdata[tripdata["started_at_month"] > 1]
min(tripdata["started_at_month"])

2
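Filtering on `month > 1` works for this file because the only stray trips are from January. A filter on explicit timestamp bounds would also exclude trips from any later month. The sketch below uses made-up rows, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trips around the edges of February 2024
trips = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2024-01-30 23:04:35.828",
        "2024-02-01 00:00:00.000",
        "2024-02-29 23:59:59.999",
        "2024-03-01 00:00:01.000",
    ])
})

# Half-open interval: from 1 Feb (inclusive) to 1 Mar (exclusive)
start = pd.Timestamp("2024-02-01")
end = pd.Timestamp("2024-03-01")
feb = trips[(trips["started_at"] >= start) & (trips["started_at"] < end)]
print(len(feb))
```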

In [33]:
# Calculating demand from tripdata for Feb 2024
for k in range(1, 30):
    
    # determining demand for 2024; 08-10hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 8) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] < 10) & (tripdata["started_at_hour"] >= 8) & (tripdata["station_id"] == i)])
        
    # determining demand for 2024; 10-12hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 10) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 10) & (tripdata["started_at_hour"] < 12) & (tripdata["station_id"] == i)])
        
    # determining demand for 2024; 12-14hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 12) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 12) & (tripdata["started_at_hour"] < 14) & (tripdata["station_id"] == i)])
        
    # determining demand for 2024; 14-16hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 14) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 14) & (tripdata["started_at_hour"] < 16) & (tripdata["station_id"] == i)])
        
    # determining demand for 2024; 16-18hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 16) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 16) & (tripdata["started_at_hour"] < 18) & (tripdata["station_id"] == i)])
        
    # determining demand for 2024; 18-20hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 18) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 18) & (tripdata["started_at_hour"] < 20) & (tripdata["station_id"] == i)])
    
    # determining demand for 2024; 20-22hrs
    for i in rentals["station_id"].unique():
        idx = (rentals["station_id"] == i) & (rentals["year"] == 2024) & (rentals["hour"] == 20) & (rentals["day"] == k)
        rentals.loc[idx, "#_rentals"] = len(tripdata[(tripdata["started_at_year"] == 2024) & (tripdata["started_at_day"] == k) & (tripdata["started_at_hour"] >= 20) & (tripdata["started_at_hour"] < 22) & (tripdata["station_id"] == i)])
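The nested loops above filter the full trip table once per station, time unit, and day, which is slow for 2124 stations. The same counts can be obtained with a single `groupby`. This is a minimal sketch on toy data, assuming the same slot definition (two-hour bins starting at 08:00 and ending at 22:00). `trips`, `template`, and `demand` are hypothetical stand-ins for `tripdata` and `rentals`.

```python
import pandas as pd

# Hypothetical trips: station 0 has three trips, station 1 has two
trips = pd.DataFrame({
    "station_id": [0, 0, 0, 1, 1],
    "started_at": pd.to_datetime([
        "2024-02-01 08:15:00", "2024-02-01 09:59:59", "2024-02-01 10:00:00",
        "2024-02-01 21:30:00", "2024-02-01 23:10:00",
    ]),
})

# Keep only trips that start between 08:00 and 22:00
hour = trips["started_at"].dt.hour
in_window = trips[(hour >= 8) & (hour < 22)].copy()

# Map each trip to the start of its two-hour slot (08, 10, ..., 20)
slot_hour = (in_window["started_at"].dt.hour // 2) * 2
in_window["datetime"] = (
    in_window["started_at"].dt.normalize() + pd.to_timedelta(slot_hour, unit="h")
)

# Count trips per station and slot
counts = (
    in_window.groupby(["station_id", "datetime"])
    .size()
    .rename("#_rentals")
    .reset_index()
)

# Hypothetical template rows; slots without trips get a count of 0
template = pd.DataFrame({
    "station_id": [0, 0, 1, 1],
    "datetime": pd.to_datetime([
        "2024-02-01 08:00", "2024-02-01 10:00",
        "2024-02-01 10:00", "2024-02-01 20:00",
    ]),
})
demand = template.merge(counts, on=["station_id", "datetime"], how="left")
demand["#_rentals"] = demand["#_rentals"].fillna(0).astype(int)
print(demand)
```

The sketch assumes the template's `datetime` column holds parsed timestamps. With the string column used in this notebook, it would need `pd.to_datetime` before the merge.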

In [34]:
# total trips on feb 1 2024
print(len(tripdata[(tripdata["started_at_year"] == 2024)&(tripdata["started_at_day"] == 1)&(tripdata["started_at_hour"] >= 8)&(tripdata["started_at_hour"] < 22)]))

# rentals recorded for feb 1 2024 
sum(rentals.loc[(rentals["year"] == 2024) & (rentals["day"] == 1),"#_rentals"])

27784


27784

The two totals match for 1 February 2024, so the rental counts are consistent with the trip data for that day.
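The comparison above covers a single day. The same idea can be extended to every day by comparing per-day totals from both tables. This is a minimal sketch on toy data that reuses the notebook's column names. `rentals_toy` and `trips_toy` are hypothetical frames.

```python
import pandas as pd

# Hypothetical per-slot rental counts and trip records
rentals_toy = pd.DataFrame({"day": [1, 1, 2, 2], "#_rentals": [2, 1, 0, 3]})
trips_toy = pd.DataFrame({
    "started_at_day": [1, 1, 1, 2, 2, 2, 2],
    "started_at_hour": [8, 9, 11, 12, 20, 21, 23],
})

# Only trips starting in hours 8..21 fall into one of the time units
in_window = trips_toy[trips_toy["started_at_hour"].between(8, 21)]

trips_per_day = in_window.groupby("started_at_day").size()
rentals_per_day = rentals_toy.groupby("day")["#_rentals"].sum()

# Days where the two totals disagree (empty list means all days match)
mismatched_days = [
    day for day in rentals_per_day.index
    if rentals_per_day[day] != trips_per_day.get(day, 0)
]
print(mismatched_days)
```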

In [35]:
# Created dataset
rentals_final = rentals.copy()

# removing unneeded columns
rentals_final.drop(columns=['minute','second','microsecond'], inplace=True)

# final look
rentals_final.head()

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals,year,month,day,hour
0,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 08:00:00.000,9,2024,2,1,8
1,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 10:00:00.000,1,2024,2,1,10
2,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 12:00:00.000,3,2024,2,1,12
3,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 14:00:00.000,5,2024,2,1,14
4,1 Ave & E 110 St,40.792327,-73.9383,0,2024-02-01 16:00:00.000,1,2024,2,1,16


In [36]:
# exporting
rentals_final.to_csv("rentals_with_demand_new_time_units_feb.csv", index=False)