# Rental Predictions
To make rental predictions, the *rentals* dataset first needs to be divided into training and test datasets.

In [1]:
# importing rentals 
import pandas as pd
rentals = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Code/rentals_with_demand.csv")
rentals.head()

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals,year,month,day,hour,minute,second,microsecond
0,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 00:00:00.000,0,2023,12,31,0,0,0,0
1,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 04:00:00.000,0,2023,12,31,4,0,0,0
2,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 08:00:00.000,0,2023,12,31,8,0,0,0
3,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 12:00:00.000,0,2023,12,31,12,0,0,0
4,1 Ave & E 110 St,40.792327,-73.9383,0,2023-12-31 16:00:00.000,0,2023,12,31,16,0,0,0


In [2]:
# from 2023, data is available only for 31 Dec'23 (all time units)
rentals[(rentals["year"] == 2023)&(rentals["day"]==31)&(rentals["station_id"]==656)]

Unnamed: 0,name,lat,lng,station_id,datetime,#_rentals,year,month,day,hour,minute,second,microsecond
125952,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 00:00:00.000,1,2023,12,31,0,0,0,0
125953,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 04:00:00.000,0,2023,12,31,4,0,0,0
125954,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 08:00:00.000,0,2023,12,31,8,0,0,0
125955,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 12:00:00.000,0,2023,12,31,12,0,0,0
125956,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 16:00:00.000,0,2023,12,31,16,0,0,0
125957,Broadway & W 36 St,40.750977,-73.987654,656,2023-12-31 20:00:00.000,0,2023,12,31,20,0,0,0


## How to divide the data?
Each station has records for 32 days, with each day containing six time units (*of 4hrs each*).<br><br>
So the first 25 to 27 days can serve as <u>training data</u> and the remaining days as <u>test data</u>.
<br><br>
- y: *#_rentals*
- x: *remaining variables*

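The aim is to keep roughly 80% of the 32 days as training data and the remaining ~20% as test data. Note that the filter used below keeps 31 Dec 2023 plus every 2024 record with *day* < 26, which gives 26 training days (~81%) and 6 test days (~19%), assuming the 2024 records cover January only.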
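Counting the days on each side of a cutoff makes the actual split proportions explicit. The following is a minimal sketch on a toy frame, not the real *rentals* data: the frame `toy`, the `cutoff` timestamp, and the single-station layout are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for `rentals`: one station, 32 days, six 4-hour slots per day
toy = pd.DataFrame({
    "datetime": pd.date_range("2023-12-31", periods=32 * 6, freq="4h"),
})
toy["#_rentals"] = 0  # placeholder target values

# Hypothetical cutoff: every record before 26 Jan 2024 goes to training
cutoff = pd.Timestamp("2024-01-26")
is_train = toy["datetime"] < cutoff

# Count distinct calendar days on each side of the cutoff
train_days = toy.loc[is_train, "datetime"].dt.normalize().nunique()
test_days = toy.loc[~is_train, "datetime"].dt.normalize().nunique()
print(train_days, test_days, is_train.mean())
```

Building the boolean mask once and reusing it (and its negation) for both the features and the target avoids repeating the filter expression in several places.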

In [15]:
# training data: target
y_train = rentals.loc[(rentals["year"] == 2023) | ((rentals["year"] == 2024) & (rentals["day"] < 26)), "#_rentals"]

# test data: target
y_test = rentals.loc[(rentals["year"] == 2024) & (rentals["day"] >= 26), "#_rentals"]

In [4]:
# Is there any common index value between y_train and y_test
y_train.index.intersection(y_test.index)

Index([], dtype='int64')

Hence, y_train and y_test are mutually exclusive.

In [16]:
# ['name', 'lat', 'lng'] variables are excluded
# of the time columns, only ["year", "month", "day", "hour"] are kept

# training data: features (predictors)
X_train = rentals.loc[(rentals["year"] == 2023) | ((rentals["year"] == 2024) & (rentals["day"] < 26)),["station_id", "year", "month", "day", "hour"]]

# test data: features (predictors)
X_test = rentals.loc[(rentals["year"] == 2024) & (rentals["day"] >= 26),["station_id", "year", "month", "day", "hour"]]

In [6]:
# Is there any common index value between X_train and X_test
X_train.index.intersection(X_test.index)

Index([], dtype='int64')

Hence, X_train and X_test are mutually exclusive.

### CAUTION
*X_train* and *X_test* as built above are not yet suitable for tree-based methods such as <u>Random Forest</u>. **station_id** is a categorical variable, but as an integer column it would be treated as an ordered number, so it should be encoded, for example with *One Hot Encoding*, to represent each station separately.

In [7]:
# how many stations do I have?
len(rentals["name"].unique())

2119

The number of stations is too high for dummy variables: one-hot encoding would add one column per station. Selecting a subset of stations reduces the dimensionality.

In [17]:
import random
random.seed(42)
station_ids = list(rentals["station_id"])

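# Sampling 450 ids from the full column; each id repeats once per row there,
# so the sample can contain duplicates and cover fewer than 450 distinct stations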
sel_st = random.sample(station_ids, 450)
sel_st[:5]

[1746, 304, 68, 2024, 751]
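Because `random.sample` draws from whatever list it is given, sampling from a column in which every id repeats once per time slot can return the same station more than once. A minimal sketch of drawing distinct ids instead, on a toy column; `station_col`, the subset size of 3, and the local generator `rng` are assumptions for illustration only.

```python
import random
import pandas as pd

# Toy column standing in for rentals["station_id"]: each id repeats once per time slot
station_col = pd.Series([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# Local generator, so the global random state is left untouched
rng = random.Random(42)

# Deduplicate first, then sample without replacement
unique_ids = station_col.unique().tolist()
picked = rng.sample(unique_ids, 3)  # 3 is an illustrative subset size

print(len(picked), len(set(picked)))
```

Applying this to the real data would change which stations are selected, so the results reported in this notebook would not carry over unchanged.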

In [9]:
# how station info is currently stored
X_train.head()

Unnamed: 0,station_id,year,month,day,hour
0,0,2023,12,31,0
1,0,2023,12,31,4
2,0,2023,12,31,8
3,0,2023,12,31,12
4,0,2023,12,31,16


In [18]:
# filtering for the chosen stations (ids in sel_st)
X_train = X_train[X_train["station_id"].isin(sel_st)]
X_test = X_test[X_test["station_id"].isin(sel_st)]

# creating dummies for X_train and X_test
X_train = pd.get_dummies(X_train, columns=['station_id'])
X_test = pd.get_dummies(X_test, columns=['station_id'])
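Calling `pd.get_dummies` separately on the two sets only yields matching columns if every station appears in both. As a hedged sketch of one way to guard against a mismatch, the test columns can be reindexed to the training columns; the toy frames `train` and `test` and the station id 99 are made up for illustration.

```python
import pandas as pd

# Toy frames: station 99 appears only in test, stations 68 and 72 only in train
train = pd.DataFrame({"hour": [0, 4, 8], "station_id": [17, 68, 72]})
test = pd.DataFrame({"hour": [0, 4], "station_id": [17, 99]})

train_d = pd.get_dummies(train, columns=["station_id"])
test_d = pd.get_dummies(test, columns=["station_id"])

# Align test to the training layout:
# columns missing in test are added and filled with 0, columns unseen in training are dropped
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))
```

With the filtering used in this notebook every selected station should occur in both periods, so this is a safety check rather than a required step.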

In [11]:
# ["station_id"] is transformed
X_train.columns

Index(['year', 'month', 'day', 'hour', 'station_id_17', 'station_id_68',
       'station_id_72', 'station_id_81', 'station_id_86', 'station_id_237',
       'station_id_253', 'station_id_255', 'station_id_264', 'station_id_279',
       'station_id_304', 'station_id_381', 'station_id_424', 'station_id_435',
       'station_id_542', 'station_id_587', 'station_id_597', 'station_id_601',
       'station_id_609', 'station_id_635', 'station_id_668', 'station_id_751',
       'station_id_758', 'station_id_759', 'station_id_919', 'station_id_929',
       'station_id_1037', 'station_id_1145', 'station_id_1152',
       'station_id_1154', 'station_id_1226', 'station_id_1379',
       'station_id_1488', 'station_id_1489', 'station_id_1532',
       'station_id_1609', 'station_id_1612', 'station_id_1643',
       'station_id_1746', 'station_id_1774', 'station_id_1847',
       'station_id_1906', 'station_id_1915', 'station_id_1955',
       'station_id_2011', 'station_id_2022', 'station_id_2024',
       '

In [19]:
# Also need to filter y_train and y_test
y_train = y_train[y_train.index.isin(X_train.index)]
y_test = y_test[y_test.index.isin(X_test.index)]
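An alternative way to keep the target in step with the filtered features is to select it by the feature index directly, which also guarantees the same row order. A small sketch with toy data; `X`, `y`, and the filter on `hour` are illustrative assumptions.

```python
import pandas as pd

# Toy features and target sharing the same index
X = pd.DataFrame({"hour": [0, 4, 8, 12]}, index=[10, 11, 12, 13])
y = pd.Series([1, 0, 2, 5], index=[10, 11, 12, 13], name="#_rentals")

# Filter the features, then pull the matching target rows by index label
X_sub = X[X["hour"] >= 8]
y_sub = y.loc[X_sub.index]

# Sanity check: both objects now share exactly the same index, in the same order
assert y_sub.index.equals(X_sub.index)
print(y_sub.tolist())
```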

These feature matrices can now be used for <u>tree-based</u> methods. Restricting the set of stations keeps the number of dummy columns manageable and shortens training time.

## Selecting a Model
A simple **Random Forest** is tried first to see what the performance looks like. Note that scikit-learn's random forest cannot handle *non-numeric* data until it is transformed.

In [20]:
# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# training performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_train_pred = forest.predict(X_train)
print(mean_squared_error(y_train, y_train_pred),r2_score(y_train, y_train_pred))

0.6746495255775576 0.9447442447917629


In [21]:
# performance on test set
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred))

8.455069602585258 0.5286546992647055


## Preliminary Results
- With 20 stations, R-squared on test data was 0.541 
- With 50 stations, R-squared on test data was 0.527
- With 450 stations, R-squared on test data was 0.528

More than 50% of the variation in the target variable on the test set can be explained using only station identity and *datetime* features. Adding more relevant features is expected to explain a higher portion of the total variance in *y*.
<br><br>
Hyperparameter tuning and model selection should also help increase performance on the test set.
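As a rough sketch of what such tuning could look like, the example below runs a small cross-validated grid search over two random forest parameters. Everything here is an assumption for illustration: the synthetic data `X_toy` and `y_toy`, the parameter grid, and the choice of 3-fold cross-validation. It is not the configuration used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for (X_train, y_train): the hour slot drives a noisy target
hours = rng.choice([0, 4, 8, 12, 16, 20], size=300)
X_toy = hours.reshape(-1, 1)
y_toy = 5 * np.sin(hours / 24 * 2 * np.pi) + rng.normal(0, 1, size=300)

# Small illustrative grid; larger leaves and shallower trees limit overfitting
search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=2),
    param_grid={"min_samples_leaf": [1, 5, 20], "max_depth": [None, 5]},
    scoring="r2",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

For the rental data, a split that respects time order (for example, validating on the last training days) may be a better fit than plain k-fold, since the test set lies in the future relative to the training set.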