## Motivation for the experiment
Capturing hourly variation is difficult for the model because the features must carry enough information to signal sudden short-term changes in the *target*. Aggregating to *daily demand* smooths out these sudden changes, so daily predictions are likely to be more accurate on the test set.

### Creating a daily rentals df

In [1]:
# importing the data
import pandas as pd
rentals = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Code/rentals_with_demand.csv")
rentals.head()

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals,year,month,day,hour,minute,second,microsecond
0,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 00:00:00.000,0,2023,12,31,0,0,0,0
1,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 04:00:00.000,0,2023,12,31,4,0,0,0
2,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 08:00:00.000,0,2023,12,31,8,0,0,0
3,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 12:00:00.000,0,2023,12,31,12,0,0,0
4,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 16:00:00.000,0,2023,12,31,16,0,0,0


In [2]:
# clearing out columns that are not required
rentals = rentals[["name","lat","lng","station_id","#_rentals", "year", "month", "day", "hour"]]
rentals.head()

Unnamed: 0,name,lat,lng,station_id,#_rentals,year,month,day,hour
0,1 Ave & E 110 St,40.792327,-73.9383,0,0,2023,12,31,0
1,1 Ave & E 110 St,40.792327,-73.9383,0,0,2023,12,31,4
2,1 Ave & E 110 St,40.792327,-73.9383,0,0,2023,12,31,8
3,1 Ave & E 110 St,40.792327,-73.9383,0,0,2023,12,31,12
4,1 Ave & E 110 St,40.792327,-73.9383,0,0,2023,12,31,16


In [3]:
# calculating daily demand for a given station and day
sum(rentals.loc[(rentals["station_id"] == 656) & (rentals["day"] == 11), "#_rentals"])

36

In [4]:
# creating a new df
rentals_daily = rentals.copy()

In [5]:
# converting hourly demand to daily demand for 2023 (vectorised: one sum per station and date)
mask = rentals_daily["year"] == 2023
rentals_daily.loc[mask, "#_rentals"] = rentals_daily[mask].groupby(["station_id", "month", "day"])["#_rentals"].transform("sum")

In [6]:
# Checking 31 Dec 2023 rentals, for a given station
rentals_daily[(rentals_daily["year"] == 2023) & (rentals_daily["station_id"] == 656)]

Unnamed: 0,name,lat,lng,station_id,#_rentals,year,month,day,hour
125952,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,0
125953,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,4
125954,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,8
125955,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,12
125956,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,16
125957,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,20


In [7]:
# It works: the original hourly values for this station sum to 1, matching rentals_daily
rentals[(rentals["year"] == 2023) & (rentals["station_id"] == 656)]

Unnamed: 0,name,lat,lng,station_id,#_rentals,year,month,day,hour
125952,Broadway & W 36 St,40.750977,-73.987654,656,1,2023,12,31,0
125953,Broadway & W 36 St,40.750977,-73.987654,656,0,2023,12,31,4
125954,Broadway & W 36 St,40.750977,-73.987654,656,0,2023,12,31,8
125955,Broadway & W 36 St,40.750977,-73.987654,656,0,2023,12,31,12
125956,Broadway & W 36 St,40.750977,-73.987654,656,0,2023,12,31,16
125957,Broadway & W 36 St,40.750977,-73.987654,656,0,2023,12,31,20


In [8]:
# 2024 rows are still hourly at this point (station 656, 1 Jan 2024)
rentals_daily[(rentals_daily["year"] == 2024) & (rentals_daily["day"] == 1) & (rentals_daily["station_id"] == 656)]

Unnamed: 0,name,lat,lng,station_id,#_rentals,year,month,day,hour
125958,Broadway & W 36 St,40.750977,-73.987654,656,5,2024,1,1,0
125959,Broadway & W 36 St,40.750977,-73.987654,656,0,2024,1,1,4
125960,Broadway & W 36 St,40.750977,-73.987654,656,1,2024,1,1,8
125961,Broadway & W 36 St,40.750977,-73.987654,656,1,2024,1,1,12
125962,Broadway & W 36 St,40.750977,-73.987654,656,3,2024,1,1,16
125963,Broadway & W 36 St,40.750977,-73.987654,656,3,2024,1,1,20


In [9]:
# updating for 2024 (grouping on month as well, so equal day numbers in different months are not mixed)
mask = rentals_daily["year"] == 2024
rentals_daily.loc[mask, "#_rentals"] = rentals_daily[mask].groupby(["station_id", "month", "day"])["#_rentals"].transform("sum")

In [10]:
# Checking 31 Jan 2024 rentals, for a given station
rentals_daily[(rentals_daily["year"] == 2024) & (rentals_daily["station_id"] == 656) & (rentals_daily["day"] == 31)]

Unnamed: 0,name,lat,lng,station_id,#_rentals,year,month,day,hour
126138,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,0
126139,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,4
126140,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,8
126141,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,12
126142,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,16
126143,Broadway & W 36 St,40.750977,-73.987654,656,44,2024,1,31,20


Since *#_rentals* is now identical across all time slots of a given station and day, we can <u>delete the *hour* column</u> and then use **drop duplicates** to obtain daily demand in the rentals_daily df.

In [11]:
# switching to daily demand
rentals_daily["rentals"] = rentals_daily["#_rentals"]
rentals_daily.drop(columns=["hour", "#_rentals"], inplace=True)
rentals_daily.drop_duplicates(inplace=True)
rentals_daily.head()

Unnamed: 0,name,lat,lng,station_id,year,month,day,rentals
0,1 Ave & E 110 St,40.792327,-73.9383,0,2023,12,31,0
6,1 Ave & E 110 St,40.792327,-73.9383,0,2024,1,1,2
12,1 Ave & E 110 St,40.792327,-73.9383,0,2024,1,2,28
18,1 Ave & E 110 St,40.792327,-73.9383,0,2024,1,3,16
24,1 Ave & E 110 St,40.792327,-73.9383,0,2024,1,4,13


In [18]:
# station 656, 12 Jan 2024
rentals_daily[(rentals_daily["year"] == 2024) & (rentals_daily["day"] == 12) & (rentals_daily["station_id"] == 656)]

Unnamed: 0,name,lat,lng,station_id,year,month,day,rentals
126024,Broadway & W 36 St,40.750977,-73.987654,656,2024,1,12,30


The dataset works as expected.
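
The same hourly-to-daily aggregation can also be expressed as a single `groupby`. The following is a minimal sketch on a made-up toy frame (the values in `hourly` are illustrative, not taken from the rentals data), assuming the same column names as above:

```python
import pandas as pd

# Toy stand-in for the hourly `rentals` frame (values are made up for illustration).
hourly = pd.DataFrame({
    "station_id": [0, 0, 0, 1, 1, 1],
    "year":       [2024] * 6,
    "month":      [1] * 6,
    "day":        [1, 1, 2, 1, 1, 2],
    "hour":       [0, 4, 0, 0, 4, 0],
    "#_rentals":  [2, 3, 5, 1, 0, 4],
})

# One row per station and calendar day; summing collapses the `hour` dimension.
daily = (
    hourly
    .groupby(["station_id", "year", "month", "day"], as_index=False)["#_rentals"]
    .sum()
    .rename(columns={"#_rentals": "rentals"})
)
print(daily)
```

On the real data, the descriptive columns (`name`, `lat`, `lng`) would have to be added to the grouping keys or merged back afterwards, since this sketch keeps only the keys and the summed target.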

In [19]:
# exporting
rentals_daily.to_csv("rentals_with_daily_demand.csv", index=False) 

### Preparing training and test data
The dataset cannot be used for training in its current form: it must first be split into training and test sets, and *station_id* must be converted into dummy variables.

In [13]:
# ['name', 'lat', 'lng'] variables are excluded 
## no 'time' column is used

# training data: features (31 Dec 2023 to 25 Jan 2024)
X_train = rentals_daily.loc[(rentals_daily["year"] == 2023) | ((rentals_daily["year"] == 2024) & (rentals_daily["day"] < 26)),["station_id", "year", "month", "day"]]

# test data: features (26 to 31 Jan 2024)
X_test = rentals_daily.loc[(rentals_daily["year"] == 2024) & (rentals_daily["day"] >= 26),["station_id", "year", "month", "day"]]

import random
random.seed(42)
station_ids = list(rentals_daily["station_id"].unique())

# Picking 450 unique station ids randomly
sel_st = random.sample(station_ids, 450)
sel_st[:5]

[456, 102, 1126, 1003, 914]

In [14]:
# filtering for 450 chosen stations
X_train = X_train[X_train["station_id"].isin(sel_st)]
X_test = X_test[X_test["station_id"].isin(sel_st)]

# creating dummies for X_train and X_test
X_train = pd.get_dummies(X_train, columns=['station_id'])
X_test = pd.get_dummies(X_test, columns=['station_id'])

In [15]:
# training data: target
y_train = rentals_daily.loc[(rentals_daily["year"] == 2023) | ((rentals_daily["year"] == 2024) & (rentals_daily["day"] < 26)), "rentals"]

# test data: target
y_test = rentals_daily.loc[(rentals_daily["year"] == 2024) & (rentals_daily["day"] >= 26), "rentals"]

# y_train and y_test must also be restricted to the selected stations (matched on index)
y_train = y_train[y_train.index.isin(X_train.index)]
y_test = y_test[y_test.index.isin(X_test.index)]
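
Creating dummies separately for the training and test sets only gives matching columns if every selected station appears in both periods. A minimal sketch of one way to guard against a mismatch, using made-up toy frames (`X_train_toy` and `X_test_toy` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy feature frames; station 3 is deliberately absent from the test period.
X_train_toy = pd.DataFrame({"station_id": [1, 2, 3], "day": [1, 2, 3]})
X_test_toy = pd.DataFrame({"station_id": [1, 2], "day": [26, 27]})

X_train_enc = pd.get_dummies(X_train_toy, columns=["station_id"])
X_test_enc = pd.get_dummies(X_test_toy, columns=["station_id"])

# Reindex the test columns to the training layout; missing dummies are filled with 0.
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

print(list(X_test_enc.columns))
```

If both frames already share the same stations, the reindex leaves the test set unchanged, so it is a cheap safety check before fitting.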

### Applying random forest to this df
With the number of stations held fixed, this approach should perform better than the four-hour time resolution.

In [16]:
# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# training performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_train_pred = forest.predict(X_train)
print(mean_squared_error(y_train, y_train_pred),r2_score(y_train, y_train_pred))

6.933711735459663 0.9676774213425731


In [17]:
# performance on test set
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred))

95.0652412195122 0.7058550464926872
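
Because MSE is expressed in squared rentals, taking its square root (RMSE) gives an error in the same units as the target, which may be easier to interpret. A small sketch using the MSE values printed above:

```python
import math

# MSE values copied from the training and test outputs above.
mse_train = 6.933711735459663
mse_test = 95.0652412195122

# RMSE is in rentals per day, the same unit as the target.
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

print(f"train RMSE: {rmse_train:.2f} rentals/day")
print(f"test RMSE:  {rmse_test:.2f} rentals/day")
```

The typical error is therefore roughly 2.6 rentals per day on the training set and roughly 9.8 on the test set, which, together with the drop in r2, points to some overfitting.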


### Preliminary Results
On the test data, prediction accuracy for <u>daily rentals</u> appears to improve as the number of stations is increased:
- 50 stations: the r2 score was 0.51
- 450 stations: the r2 score was 0.71

These results suggest that **daily rentals** should be adopted as the target for the research task. However, this approach has little practical value: urban mobility companies require *hourly* predictions to respond to sudden changes in demand and to manage their fleets. Increasing the amount of data and using more sophisticated models should further improve performance for *hourly rentals*.