# Preparing New Dataset
This notebook is meant to generate a new dataset (only for January 2024) based on *Citibike Trip data* that can be used to determine **bike rentals** based on a <u>fixed time resolution</u>. 

## Pre-processing the original data
Before the new dataset is built, the original trip data is filtered and reduced to the columns that are needed.

In [1]:
# Reading the data
import pandas as pd
complete_data = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Datasets/Citibike Trip Data/202401-citibike-tripdata.csv")
complete_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1888085 entries, 0 to 1888084
Data columns (total 13 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
dtypes: float64(4), object(9)
memory usage: 187.3+ MB


In [2]:
# Types of ride vehicles
complete_data["rideable_type"].unique()

array(['electric_bike', 'classic_bike'], dtype=object)

The dataset contains both *ebikes* and *classic bikes*. The complete dataset is filtered for **classic bikes** only, since they are treated here as the primary mobility service.

In [3]:
# filtering the dataset
cbikes = complete_data[complete_data["rideable_type"] == "classic_bike"]
cbikes["rideable_type"].unique()

array(['classic_bike'], dtype=object)

Additionally, there might be data points with <u>no information</u> about the **starting station name and location**. These data points can't provide information about rentals, hence they should be removed from the dataset.

In [10]:
# there are no rows with missing info about start
sum(cbikes["start_station_name"].isna())

0
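Had any such rows been present, they could be dropped along the lines of the sketch below. The toy frame and its values are made up for illustration; only the column names come from the trip data.

```python
import pandas as pd

# Toy stand-in for the trip data (values are illustrative, not real rows)
toy = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3"],
    "start_station_name": ["A St", None, "B St"],
    "start_lat": [40.70, None, 40.80],
    "start_lng": [-73.90, None, -74.00],
})

# Keep only rows where every start field is present
start_cols = ["start_station_name", "start_lat", "start_lng"]
toy_clean = toy.dropna(subset=start_cols)
print(len(toy), len(toy_clean))
```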

In [11]:
# checking for redundant columns
cbikes.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
15,58F39536036FB9A4,classic_bike,2024-01-15 14:15:52.165,2024-01-15 14:23:35.528,Frederick Douglass Blvd & W 145 St,7954.12,St Nicholas Ave & W 126 St,7756.1,40.823061,-73.941928,40.811432,-73.951878,member
22,21A2AE1F29B69343,classic_bike,2024-01-18 21:45:06.206,2024-01-18 22:12:03.996,West End Ave & W 107 St,7650.05,E 74 St & 1 Ave,6953.08,40.802117,-73.968181,40.768974,-73.954823,member
38,03DF018B20ED02D9,classic_bike,2024-01-13 14:51:04.773,2024-01-14 15:50:59.897,Columbia St & Kane St,4422.05,,,40.687632,-74.001626,,,casual
49,826ACF9021A6C8DD,classic_bike,2024-01-18 21:34:50.882,2024-01-18 21:55:23.975,West End Ave & W 107 St,7650.05,Adam Clayton Powell Blvd & W 118 St,7670.09,40.802117,-73.968181,40.804372,-73.951475,member
54,A0A09C76891121C6,classic_bike,2024-01-12 18:14:37.307,2024-01-12 18:20:40.605,Frederick Douglass Blvd & W 145 St,7954.12,St Nicholas Ave & W 126 St,7756.1,40.823061,-73.941928,40.811432,-73.951878,member


The columns *rideable_type* and *member_casual* are redundant and can be removed from the dataset.

In [13]:
# reducing columns
cbikes = cbikes[["ride_id", "started_at", "ended_at", "start_station_name", "start_station_id", "end_station_name", "end_station_id", "start_lat", "start_lng", "end_lat", "end_lng"]]
cbikes.columns

Index(['ride_id', 'started_at', 'ended_at', 'start_station_name',
       'start_station_id', 'end_station_name', 'end_station_id', 'start_lat',
       'start_lng', 'end_lat', 'end_lng'],
      dtype='object')

In [17]:
# 'ride_id' can be used to count bike rentals
print(len(cbikes), len(cbikes["ride_id"].unique()))

673559 673559


## Creating New Dataset
The new dataset should contain stations with their location and time information included. Each row represents <u>one unit of the chosen time resolution</u>.
<br><br>
**NOTE:** Station ID needs to be retained because there are some stations with the same name but slightly different locations!
<br><br>
In the dataset, some IDs are stored as text and others are stored as numbers. Converting the data type to *string* does not work because **\n** and some other text is added automatically.
<br><br>
**Solution:**<br>
The stations can be manually given unique IDs for the purpose of identification.
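An alternative that was not used in this notebook is to read the ID columns as text from the start, so that pandas never has to guess a type for them. The sketch below is a minimal illustration with an in-memory CSV; the ID `SYS038` is a made-up value, and whether this resolves the issue for the real file is an assumption.

```python
import io
import pandas as pd

# Tiny in-memory CSV mixing numeric-looking and textual IDs (hypothetical values)
csv_text = """ride_id,start_station_id
A1,7954.12
B2,SYS038
C3,7650.05
"""

# Force the ID column to be parsed as text
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"start_station_id": str})
print(as_text["start_station_id"].tolist())
```

The manually generated integer IDs remain useful either way, because they also distinguish stations that share a name.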

In [272]:
# Creating the new dataset: one row per unique (name, lat, lng) combination
rentals = cbikes[["start_station_name", "start_lat", "start_lng"]].drop_duplicates()
rentals.columns = ["name", "lat", "lng"]
rentals = rentals.sort_values(by="name", ignore_index=True)
rentals.head()

Unnamed: 0,name,lat,lng
0,1 Ave & E 110 St,40.792327,-73.9383
1,1 Ave & E 16 St,40.732219,-73.981656
2,1 Ave & E 18 St,40.733812,-73.980544
3,1 Ave & E 30 St,40.741444,-73.975361
4,1 Ave & E 39 St,40.74714,-73.97113


In [273]:
# Adding IDs manually
rentals["station_id"] = range(len(rentals))
rentals.head()

Unnamed: 0,name,lat,lng,station_id
0,1 Ave & E 110 St,40.792327,-73.9383,0
1,1 Ave & E 16 St,40.732219,-73.981656,1
2,1 Ave & E 18 St,40.733812,-73.980544,2
3,1 Ave & E 30 St,40.741444,-73.975361,3
4,1 Ave & E 39 St,40.74714,-73.97113,4


In [274]:
# Verifying the number of stations
print(len(cbikes[["start_station_name", "start_lat", "start_lng"]].drop_duplicates()), len(rentals["station_id"]))

2124 2124


In [275]:
# 5 stations have repeating names, but their location is different
rentals[rentals["name"].duplicated()]

Unnamed: 0,name,lat,lng,station_id
511,Amsterdam Ave & W 119 St,40.808625,-73.959621,511
856,DeKalb Ave & S Portland Ave,40.68981,-73.974931,856
1040,E 47 St & Park Ave,40.755008,-73.974627,1040
1325,Jay St & Tech Pl,40.695,-73.98917,1325
2046,Warren St & Roosevelt Ave,40.74919,-73.87043,2046


In [140]:
# example
rentals[rentals["name"] == "Amsterdam Ave & W 119 St"]

Unnamed: 0,name,lat,lng,station_id
510,Amsterdam Ave & W 119 St,40.808632,-73.959586,510
511,Amsterdam Ave & W 119 St,40.808625,-73.959621,511
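
To get a feeling for how far apart two such entries are, the distance between the two coordinate pairs above can be estimated with the haversine formula. This is a rough sketch; the helper name `haversine_m` and the mean Earth radius of 6,371 km are choices made for this illustration.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres, assuming a mean Earth radius of 6,371 km."""
    r = 6_371_000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# The two "Amsterdam Ave & W 119 St" entries (station_id 510 and 511)
gap = haversine_m(40.808632, -73.959586, 40.808625, -73.959621)
print(f"{gap:.1f} m")
```

The result is on the order of a few metres, which suggests the two rows describe practically the same spot with slightly different recorded coordinates.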


## Altering the original (filtered) dataset
Instead of relying on the station IDs provided by **Citibike**, the manually constructed IDs can be entered in the dataset. This makes it easy to reference and connect the *rentals* table with the original filtered dataset.<br><br>
<u>Note</u>: Original filtered dataset means data for classic bikes where NaN values and redundant columns are removed i.e. **cbikes** dataframe.

In [141]:
# Copying cbikes df (.copy() creates an independent DataFrame; plain assignment would only add a second name)
cbikes_altered = cbikes.copy()
cbikes_altered.head()

Unnamed: 0,ride_id,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng
1542152,FD1815E65D61FAF1,2024-01-07 08:42:55.647,2024-01-07 08:44:27.312,1 Ave & E 110 St,7522.02,E 114 St & 1 Ave,7540.02,40.792327,-73.9383,40.794566,-73.936254
139989,C81EA1D1AF08E764,2024-01-05 16:47:51.873,2024-01-05 16:55:21.591,1 Ave & E 110 St,7522.02,Central Park North & Adam Clayton Powell Blvd,7617.07,40.792327,-73.9383,40.799484,-73.955613
377860,B68072A759AEBC11,2024-01-28 21:16:58.578,2024-01-28 21:21:37.778,1 Ave & E 110 St,7522.02,2 Ave & E 99 St,7386.1,40.792327,-73.9383,40.786259,-73.945526
269167,40253EB730335B64,2024-01-14 06:21:56.489,2024-01-14 06:38:02.179,1 Ave & E 110 St,7522.02,W 110 St & Amsterdam Ave,7646.04,40.792327,-73.9383,40.802692,-73.96295
157120,F3788085AE49CFD5,2024-01-10 17:09:08.398,2024-01-10 17:28:59.593,1 Ave & E 110 St,7522.02,W 87 St & West End Ave,7484.05,40.792327,-73.9383,40.789622,-73.97757
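
When copying a DataFrame, it matters whether a real copy or just a second name is created. The toy example below (made-up column names) illustrates the difference between plain assignment and `.copy()`.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2]})

alias = original               # second name for the same object
independent = original.copy()  # separate DataFrame with the same content

# Adding a column through the alias also changes the original
alias["y"] = 0
print(list(original.columns), list(independent.columns))
```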


In [148]:
# Merging the two dfs
new_df = pd.merge(
    left=cbikes_altered, 
    right=rentals,
    how='left',
    left_on=['start_station_name', 'start_lat', 'start_lng'],
    right_on=['name', 'lat', 'lng'],
)

In [150]:
# Preserving bike trip info only with the starting station info
cbikes_altered = new_df[["ride_id", "started_at", "start_station_name", "station_id", "start_lat", "start_lng"]]
cbikes_altered.head()

Unnamed: 0,ride_id,started_at,start_station_name,station_id,start_lat,start_lng
0,FD1815E65D61FAF1,2024-01-07 08:42:55.647,1 Ave & E 110 St,0,40.792327,-73.9383
1,C81EA1D1AF08E764,2024-01-05 16:47:51.873,1 Ave & E 110 St,0,40.792327,-73.9383
2,B68072A759AEBC11,2024-01-28 21:16:58.578,1 Ave & E 110 St,0,40.792327,-73.9383
3,40253EB730335B64,2024-01-14 06:21:56.489,1 Ave & E 110 St,0,40.792327,-73.9383
4,F3788085AE49CFD5,2024-01-10 17:09:08.398,1 Ave & E 110 St,0,40.792327,-73.9383


In [151]:
# Checking validity
len(cbikes_altered["station_id"].unique())

2124

Hence, all 2124 manually assigned station IDs appear in the merged data, which is consistent with the *rentals* table.
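
A stricter check is possible with the `validate` and `indicator` arguments of `pd.merge`. The sketch below uses two toy tables with made-up values; `validate="many_to_one"` raises an error if a station key is duplicated on the right side, and the `_merge` column shows whether every trip found a matching station.

```python
import pandas as pd

# Toy trips and stations (illustrative values only)
trips = pd.DataFrame({
    "ride_id": ["a", "b", "c"],
    "start_station_name": ["S1", "S1", "S2"],
    "start_lat": [40.1, 40.1, 40.2],
    "start_lng": [-73.9, -73.9, -73.8],
})
stations = pd.DataFrame({
    "name": ["S1", "S2"],
    "lat": [40.1, 40.2],
    "lng": [-73.9, -73.8],
    "station_id": [0, 1],
})

merged = pd.merge(
    left=trips,
    right=stations,
    how="left",
    left_on=["start_station_name", "start_lat", "start_lng"],
    right_on=["name", "lat", "lng"],
    validate="many_to_one",  # each trip may match at most one station row
    indicator=True,          # adds a "_merge" column: both / left_only
)

# Number of trips that did not find a station
unmatched = int((merged["_merge"] != "both").sum())
print(unmatched)
```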

## Exporting this dataset
This dataset can be used to calculate the number of rentals per manually generated station ID.

In [153]:
# exporting to 'C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code'
cbikes_altered.to_csv("Tripdata_with_manual_ids.csv", index=False)

## Adding time dimensionality to the *rentals* dataset
Now that station location information is available, a time resolution needs to be added so that the number of rentals can be calculated per station and per unit of time.

In [156]:
# Checking the range of temporal information in trip data
print(min(cbikes_altered["started_at"])) 
print(max(cbikes_altered["started_at"]))

2023-12-31 02:36:55.648
2024-01-31 23:58:30.270


In [163]:
# how is time info stored?
print(cbikes_altered["started_at"].iloc[0],
type(cbikes_altered["started_at"].iloc[0]))

2024-01-07 08:42:55.647 <class 'str'>


In [170]:
# datetime info stored as a string
h = '2024-01-07 08:42:55.647'
print(h)
min(cbikes_altered["started_at"]) # without the print cmd

2024-01-07 08:42:55.647


'2023-12-31 02:36:55.648'

The date-time information is stored as **str** and will have to be converted to *datetime* objects.
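
One way to do this conversion is `pd.to_datetime` with an explicit format string. The sketch below parses two of the timestamps shown above; the variable names are only illustrative.

```python
import pandas as pd

started = pd.Series(["2024-01-07 08:42:55.647", "2023-12-31 02:36:55.648"])

# %f accepts the fractional seconds (here milliseconds)
parsed = pd.to_datetime(started, format="%Y-%m-%d %H:%M:%S.%f")
print(parsed.dtype)
print(parsed.min())
```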

## Generating 4 hr time resolution
Each day will be divided into four-hour units, so there should be 6 units of time per day, spanning as follows:<br>
- 0 to 4
- 4 to 8
- 8 to 12
- 12 to 16
- 16 to 20
- last 4 hours (which are assumed to be from 20:00 onwards)
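
As a sketch of how trips could later be assigned to these windows, parsed timestamps can be rounded down to the start of their four-hour unit with `dt.floor("4h")` and then counted per station. The toy trips below reuse timestamps from this notebook but pair them with arbitrary station IDs; the column names `window_start` and `rentals` are assumptions made for this example.

```python
import pandas as pd

trips = pd.DataFrame({
    "station_id": [0, 0, 0, 1],
    "started_at": pd.to_datetime([
        "2024-01-07 08:42:55.647",
        "2024-01-07 11:59:59.000",
        "2024-01-07 23:58:30.270",
        "2023-12-31 02:36:55.648",
    ]),
})

# Round each start time down to the beginning of its 4-hour window
trips["window_start"] = trips["started_at"].dt.floor("4h")

# Count rentals per station and window
counts = (
    trips.groupby(["station_id", "window_start"])
    .size()
    .reset_index(name="rentals")
)
print(counts)
```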

In [260]:
# defining time units for a given date
time_res = ['00:00:00.000', '04:00:00.000', '08:00:00.000', '12:00:00.000', '16:00:00.000', '20:00:00.000']

In [261]:
# repeating time units 32 times
time_res = time_res*32
print(time_res[1], time_res[8])

04:00:00.000 08:00:00.000


In [172]:
# zero-padded, fixed-width time strings compare correctly as plain strings
'03:00:00.000' < '04:00:00.000'

True
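
This works because strings are compared character by character, which matches chronological order only when all values share the same zero-padded, fixed-width layout. A small sketch of the caveat:

```python
# Same layout: string order equals time order
padded_ok = '03:00:00.000' < '04:00:00.000'

# Missing zero-padding: '9' is compared with '1', so the order is wrong
unpadded_ok = '9:00:00.000' < '10:00:00.000'

print(padded_ok, unpadded_ok)
```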

In [262]:
# creating dates
dates_included = []

while len(dates_included) < 32:
    dates_included.append('2024-01-01')

In [230]:
print(dates_included[:3], len(dates_included))

['2024-01-01', '2024-01-01', '2024-01-01'] 32


In [250]:
nums = ['00', '01', '02','03','04','05','06','07','08','09','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31']

In [263]:
# correcting days
for i in range(len(nums)):
    dates_included[i] = dates_included[i][:-2] + nums[i]

In [233]:
dates_included

['2024-01-00',
 '2024-01-01',
 '2024-01-02',
 '2024-01-03',
 '2024-01-04',
 '2024-01-05',
 '2024-01-06',
 '2024-01-07',
 '2024-01-08',
 '2024-01-09',
 '2024-01-10',
 '2024-01-11',
 '2024-01-12',
 '2024-01-13',
 '2024-01-14',
 '2024-01-15',
 '2024-01-16',
 '2024-01-17',
 '2024-01-18',
 '2024-01-19',
 '2024-01-20',
 '2024-01-21',
 '2024-01-22',
 '2024-01-23',
 '2024-01-24',
 '2024-01-25',
 '2024-01-26',
 '2024-01-27',
 '2024-01-28',
 '2024-01-29',
 '2024-01-30',
 '2024-01-31']

In [264]:
dates_included[0] = '2023-12-31'
dates_included[:5]

['2023-12-31', '2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']

Now the date and the time information must be combined. Each day has 6 time units, so each date must be repeated 6 times to accommodate the time resolution. Note that the 6 windows of a day are delimited by 7 boundaries (00:00 through 24:00), of which only the 6 start times are stored.

In [265]:
# repeating each element 6 times
dates_included = [element for element in dates_included for _ in range(6)]
dates_included

['2023-12-31',
 '2023-12-31',
 '2023-12-31',
 '2023-12-31',
 '2023-12-31',
 '2023-12-31',
 '2024-01-01',
 '2024-01-01',
 '2024-01-01',
 '2024-01-01',
 '2024-01-01',
 '2024-01-01',
 '2024-01-02',
 '2024-01-02',
 '2024-01-02',
 '2024-01-02',
 '2024-01-02',
 '2024-01-02',
 '2024-01-03',
 '2024-01-03',
 '2024-01-03',
 '2024-01-03',
 '2024-01-03',
 '2024-01-03',
 '2024-01-04',
 '2024-01-04',
 '2024-01-04',
 '2024-01-04',
 '2024-01-04',
 '2024-01-04',
 '2024-01-05',
 '2024-01-05',
 '2024-01-05',
 '2024-01-05',
 '2024-01-05',
 '2024-01-05',
 '2024-01-06',
 '2024-01-06',
 '2024-01-06',
 '2024-01-06',
 '2024-01-06',
 '2024-01-06',
 '2024-01-07',
 '2024-01-07',
 '2024-01-07',
 '2024-01-07',
 '2024-01-07',
 '2024-01-07',
 '2024-01-08',
 '2024-01-08',
 '2024-01-08',
 '2024-01-08',
 '2024-01-08',
 '2024-01-08',
 '2024-01-09',
 '2024-01-09',
 '2024-01-09',
 '2024-01-09',
 '2024-01-09',
 '2024-01-09',
 '2024-01-10',
 '2024-01-10',
 '2024-01-10',
 '2024-01-10',
 '2024-01-10',
 '2024-01-10',
 '2024-01-

In [266]:
# Adding time dimension: pair each date with its time unit
# (a new list is built, so dates_included is left unchanged)
date_time = [d + ' ' + t for d, t in zip(dates_included, time_res)]

date_time[:6]

['2023-12-31 00:00:00.000',
 '2023-12-31 04:00:00.000',
 '2023-12-31 08:00:00.000',
 '2023-12-31 12:00:00.000',
 '2023-12-31 16:00:00.000',
 '2023-12-31 20:00:00.000']

In [267]:
# first 14 entries: 31 December 2023, 1 January 2024 and the start of 2 January
date_time[:14]

['2023-12-31 00:00:00.000',
 '2023-12-31 04:00:00.000',
 '2023-12-31 08:00:00.000',
 '2023-12-31 12:00:00.000',
 '2023-12-31 16:00:00.000',
 '2023-12-31 20:00:00.000',
 '2024-01-01 00:00:00.000',
 '2024-01-01 04:00:00.000',
 '2024-01-01 08:00:00.000',
 '2024-01-01 12:00:00.000',
 '2024-01-01 16:00:00.000',
 '2024-01-01 20:00:00.000',
 '2024-01-02 00:00:00.000',
 '2024-01-02 04:00:00.000']
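
For comparison, the same kind of list could be produced in one step with `pd.date_range`. This is an alternative sketch, not the approach used in the cells above; the start and end values are taken from the range of `date_time`.

```python
import pandas as pd

# One timestamp every 4 hours from 31 Dec 2023 00:00 to 31 Jan 2024 20:00
stamps = pd.date_range(start="2023-12-31 00:00", end="2024-01-31 20:00", freq="4h")

# Same string layout as used in this notebook
labels = [ts.strftime("%Y-%m-%d %H:%M:%S") + ".000" for ts in stamps]
print(len(labels), labels[0], labels[-1])
```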

**Note:** Each timestamp marks the start of a four-hour window. For example, <u>2023-12-31 00:00:00.000</u> refers to bike rentals between **00-04** hours; the demand recorded at each checkpoint covers the following four hours.
<br><br>
The created date_time information must be added for each station in *rentals*.

## Finalizing the rentals dataset (without demand)

In [268]:
# rows needed for each station
len(date_time)

192

In [277]:
# duplicating each station 192 times
rentals = rentals.loc[rentals.index.repeat(192)].reset_index(drop=True)
rentals.head()

Unnamed: 0,name,lat,lng,station_id
0,1 Ave & E 110 St,40.792327,-73.9383,0
1,1 Ave & E 110 St,40.792327,-73.9383,0
2,1 Ave & E 110 St,40.792327,-73.9383,0
3,1 Ave & E 110 St,40.792327,-73.9383,0
4,1 Ave & E 110 St,40.792327,-73.9383,0


In [278]:
# Repeating date_time: 2124 unique stations
date_time = date_time*2124
rentals["datetime"] = date_time
rentals.head()

Unnamed: 0,name,lat,lng,station_id,datetime
0,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 00:00:00.000
1,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 04:00:00.000
2,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 08:00:00.000
3,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 12:00:00.000
4,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 16:00:00.000


In [279]:
len(rentals)

407808
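
The row count matches 2124 stations times 192 time units. The same station-by-time grid can also be built as a cross join, which avoids hard-coding the two repeat counts. The sketch below assumes pandas 1.2 or newer (for `how="cross"`) and uses toy tables with made-up station names.

```python
import pandas as pd

stations = pd.DataFrame({"name": ["S1", "S2"], "station_id": [0, 1]})
windows = pd.DataFrame({
    "datetime": [
        "2023-12-31 00:00:00.000",
        "2023-12-31 04:00:00.000",
        "2023-12-31 08:00:00.000",
    ]
})

# Every station is paired with every time unit
grid = stations.merge(windows, how="cross")
print(len(grid))
```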

In [280]:
# exporting the rentals dataset
rentals.to_csv("rentals_without_demand.csv", index=False)