# Predictions on Daily Rentals
This notebook tests whether adding weather-related features (temperature, precipitation, and snow depth) improves prediction performance for daily rentals.

## Creating a temperature dataset
The data is taken from the [National Weather Service](https://www.weather.gov/wrh/climate?wfo=okx). Readings from the **Central Park** station are used as the reference for New York City.

In [1]:
# starting with an empty DataFrame; the weather values are entered manually below
import pandas as pd
temp_daily = pd.DataFrame()
temp_daily.head()

In [2]:
# adding values for January 2024
day = list(range(1, 32))  # days 1..31
temp_daily["day"] = day
temp_daily["year"] = 2024
temp_daily["month"] = 1
temp_daily["avg_temp_(F)"] = 0
temp_daily.head()

Unnamed: 0,day,year,month,avg_temp_(F)
0,1,2024,1,0
1,2,2024,1,0
2,3,2024,1,0
3,4,2024,1,0
4,5,2024,1,0


In [3]:
# creating avg_temp column
temp = [41.0,35.5,38.5,36.5,31.5,34.5,36.0,40.5,46.5,50.5,44.0,45.0,47.5,35.0,26.0,27.0,20.5,28.0,29.0,22.0,25.5,31.5,37.0,43.0,52.5,44.0,46.0,40.5,38.0,36.5,37.0]
temp_daily["avg_temp_(F)"] = temp
temp_daily.head()

Unnamed: 0,day,year,month,avg_temp_(F)
0,1,2024,1,41.0
1,2,2024,1,35.5
2,3,2024,1,38.5
3,4,2024,1,36.5
4,5,2024,1,31.5


In [4]:
# creating precipitation column
T = 0.001  # placeholder value for "Trace" readings

prec = [0.03, 0.00, 0.00, 0.00, 0.00, 0.41, 0.24, 0.00, 1.73, 0.22, 0.00, 0.08, 0.81, T,
0.04, 0.28, 0.00, T,
0.04, 0.00, 0.00, 0.00, 0.05, 0.05, 0.24, 0.19, 0.00, 0.82, 0.05, T, T]

`T` stands for "Trace": precipitation occurred, but the amount was too small to be measured. It is encoded here as *0.001* to <u>avoid missing values</u>.
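
If the readings were copied as raw text rather than typed in by hand, the trace rule could be applied with a small helper. This is only a sketch: `parse_precip` and `raw_readings` are hypothetical names, and the assumption is that trace readings appear as the literal string `"T"`.

```python
TRACE = 0.001  # same placeholder value as used in this notebook

def parse_precip(raw, trace_value=TRACE):
    """Convert one raw precipitation reading (inches) to a float."""
    text = str(raw).strip()
    if text.upper() == "T":
        # "Trace": some precipitation, but too little to measure
        return trace_value
    return float(text)

# Hypothetical raw readings, as they might be copied from a report
raw_readings = ["0.03", "0.00", "T", "1.73"]
prec_example = [parse_precip(v) for v in raw_readings]
print(prec_example)
```

The resulting list can be assigned to the `prec` column in the same way as the hand-typed one.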

In [5]:
temp_daily["prec"] = prec
temp_daily.head()

Unnamed: 0,day,year,month,avg_temp_(F),prec
0,1,2024,1,41.0,0.03
1,2,2024,1,35.5,0.0
2,3,2024,1,38.5,0.0
3,4,2024,1,36.5,0.0
4,5,2024,1,31.5,0.0


In [6]:
# creating snow depth column
sd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
temp_daily["snow_depth"] = sd
temp_daily.head()

Unnamed: 0,day,year,month,avg_temp_(F),prec,snow_depth
0,1,2024,1,41.0,0.03,0
1,2,2024,1,35.5,0.0,0
2,3,2024,1,38.5,0.0,0
3,4,2024,1,36.5,0.0,0
4,5,2024,1,31.5,0.0,0


So far, the data covers only January 2024. The row for 31 Dec 2023 also needs to be added for completeness.

In [7]:
# How the data is stored
temp_daily.iloc[30]

day               31.000
year            2024.000
month              1.000
avg_temp_(F)      37.000
prec               0.001
snow_depth         0.000
Name: 30, dtype: float64

In [8]:
# Updating: prepend the row for 31 Dec 2023, then rebuild a clean 0..31 index
new_row = pd.DataFrame({'day':31, 'year':2023, 'month':12, 'avg_temp_(F)':41.0, 'prec':0.00, 'snow_depth':0}, index=[0])
temp_daily = pd.concat([new_row, temp_daily]).reset_index(drop=True)
temp_daily.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   day           32 non-null     int64  
 1   year          32 non-null     int64  
 2   month         32 non-null     int64  
 3   avg_temp_(F)  32 non-null     float64
 4   prec          32 non-null     float64
 5   snow_depth    32 non-null     int64  
dtypes: float64(2), int64(4)
memory usage: 1.6 KB


In [9]:
# It works!
print(temp_daily.iloc[0])
print(len(temp_daily))

day               31.0
year            2023.0
month             12.0
avg_temp_(F)      41.0
prec               0.0
snow_depth         0.0
Name: 0, dtype: float64
32


## Combining weather with rentals_daily
The weather-based features are merged into the daily rentals dataset.

In [10]:
# importing
rentals_daily = pd.read_csv("C:/Users/singh/Desktop/TUD (All Semesters)/Courses - Semester 5 (TU Dresden)/Research Task - Spatial Modelling/Code/rentals_with_daily_demand.csv")

In [11]:
rentals_daily[(rentals_daily["day"] == 1)]

Unnamed: 0,name,lat,lng,station_id,year,month,day,rentals
1,1 Ave & E 110 St,40.792327,-73.938300,0,2024,1,1,2
33,1 Ave & E 16 St,40.732219,-73.981656,1,2024,1,1,40
65,1 Ave & E 18 St,40.733812,-73.980544,2,2024,1,1,49
97,1 Ave & E 30 St,40.741444,-73.975361,3,2024,1,1,28
129,1 Ave & E 39 St,40.747140,-73.971130,4,2024,1,1,36
...,...,...,...,...,...,...,...,...
67809,Wyckoff Ave & Stanhope St,40.703545,-73.917775,2119,2024,1,1,15
67841,Wyckoff St & 3 Ave,40.682755,-73.982586,2120,2024,1,1,6
67873,Wythe Ave & Metropolitan Ave,40.716887,-73.963198,2121,2024,1,1,19
67905,Wythe Ave & N 13 St,40.722767,-73.957021,2122,2024,1,1,12


In [12]:
# merging the dfs
rentals_daily_temp = rentals_daily.merge(temp_daily, how='left', on= ['day','month','year'])

In [13]:
# It works!
rentals_daily_temp[(rentals_daily_temp["day"] == 31)&(rentals_daily_temp["station_id"] == 656)]

Unnamed: 0,name,lat,lng,station_id,year,month,day,rentals,avg_temp_(F),prec,snow_depth
20992,Broadway & W 36 St,40.750977,-73.987654,656,2023,12,31,1,41.0,0.0,0
21023,Broadway & W 36 St,40.750977,-73.987654,656,2024,1,31,44,37.0,0.001,0
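
Beyond spot-checking a single station, the merge can be checked as a whole. The sketch below uses small hypothetical frames (`rentals_demo`, `weather_demo`) rather than the real data. It relies on the `validate` and `indicator` options of `DataFrame.merge` to confirm that each date has at most one weather row and that every rental row found a match.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
rentals_demo = pd.DataFrame({
    "station_id": [0, 0, 1, 1],
    "year": [2023, 2024, 2023, 2024],
    "month": [12, 1, 12, 1],
    "day": [31, 1, 31, 1],
    "rentals": [1, 2, 3, 40],
})
weather_demo = pd.DataFrame({
    "year": [2023, 2024],
    "month": [12, 1],
    "day": [31, 1],
    "avg_temp_(F)": [41.0, 41.0],
})

merged_demo = rentals_demo.merge(
    weather_demo,
    how="left",
    on=["day", "month", "year"],
    validate="many_to_one",  # raises if a date appears twice in weather_demo
    indicator=True,          # adds a "_merge" column: both / left_only
)

# Number of rental rows without a matching weather row
unmatched = int((merged_demo["_merge"] != "both").sum())
print(unmatched)
```

A non-zero `unmatched` count would point to dates that are missing from the weather table.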


## Developing training and test sets
As before, a random sample of 450 stations is used by default.

In [14]:
# ['name', 'lat', 'lng'] variables are excluded 
## no 'time' column is used

# training data: features (31 Dec 2023 and 1-25 Jan 2024)
X_train = rentals_daily_temp.loc[(rentals_daily_temp["year"] == 2023) | ((rentals_daily_temp["year"] == 2024) & (rentals_daily_temp["day"] < 26)),["station_id", "year", "month", "day", "avg_temp_(F)", "prec", "snow_depth"]]

# test data: features (26-31 Jan 2024)
X_test = rentals_daily_temp.loc[(rentals_daily_temp["year"] == 2024) & (rentals_daily_temp["day"] >= 26),["station_id", "year", "month", "day", "avg_temp_(F)", "prec", "snow_depth"]]

import random
random.seed(42)
station_ids = list(rentals_daily_temp["station_id"].unique())

# Picking 450 unique station ids randomly
sel_st = random.sample(station_ids, 450)
sel_st[:5]

[456, 102, 1126, 1003, 914]

In [15]:
# filtering for 450 chosen stations
X_train = X_train[X_train["station_id"].isin(sel_st)]
X_test = X_test[X_test["station_id"].isin(sel_st)]

# creating dummies for X_train and X_test
X_train = pd.get_dummies(X_train, columns=['station_id'])
X_test = pd.get_dummies(X_test, columns=['station_id'])

In [16]:
# training data: target
y_train = rentals_daily_temp.loc[(rentals_daily_temp["year"] == 2023) | ((rentals_daily_temp["year"] == 2024) & (rentals_daily_temp["day"] < 26)), "rentals"]

# test data: target
y_test = rentals_daily_temp.loc[(rentals_daily_temp["year"] == 2024) & (rentals_daily_temp["day"] >= 26), "rentals"]

# keep only the targets of the 450 selected stations (rows are matched on the index)
y_train = y_train[y_train.index.isin(X_train.index)]
y_test = y_test[y_test.index.isin(X_test.index)]
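
Because the dummies are created separately for the training and test sets, the two frames only share the same columns if every selected station appears in both. A minimal sketch of a safeguard, using hypothetical toy frames (`train_demo`, `test_demo`), is to reindex the test columns to the training columns:

```python
import pandas as pd

# Hypothetical toy data: station 6 is missing from the test period
train_demo = pd.DataFrame({"day": [1, 2, 3], "station_id": [1, 4, 6]})
test_demo = pd.DataFrame({"day": [26, 27], "station_id": [1, 4]})

train_enc = pd.get_dummies(train_demo, columns=["station_id"])
test_enc = pd.get_dummies(test_demo, columns=["station_id"])

# Give the test set exactly the training columns, in the same order;
# dummy columns absent from the test set are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

With identical columns in both frames, a model fitted on the training set can be applied to the test set without a feature mismatch.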

## Applying a linear model
A simple linear regression model turns out to predict better on the test set than a *random forest*. This suggests that the relationship between the features and the response variable is <u>not far from linear</u>.

In [17]:
# Checking for number of stations considered
X_train.columns

Index(['year', 'month', 'day', 'avg_temp_(F)', 'prec', 'snow_depth',
       'station_id_1', 'station_id_4', 'station_id_6', 'station_id_14',
       ...
       'station_id_2086', 'station_id_2088', 'station_id_2090',
       'station_id_2092', 'station_id_2094', 'station_id_2098',
       'station_id_2099', 'station_id_2109', 'station_id_2114',
       'station_id_2122'],
      dtype='object', length=456)

In [18]:
# training performance
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(X_train, y_train)

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_train_pred_lr = lr.predict(X_train)
print(mean_squared_error(y_train, y_train_pred_lr),r2_score(y_train, y_train_pred_lr))

38.533321314102565 0.7853914177208063


In [19]:
# performance on test set
y_test_pred_lr = lr.predict(X_test)
print(mean_squared_error(y_test, y_test_pred_lr), r2_score(y_test, y_test_pred_lr))

58.979618055555555 0.7755859463435186


## Applying random forest
This time the model also receives the numeric weather attributes, which are expected to improve performance.

In [20]:
# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# training performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_train_pred = forest.predict(X_train)
print(mean_squared_error(y_train, y_train_pred),r2_score(y_train, y_train_pred))

4.839111213675214 0.9730489155452555


In [21]:
# performance on test set
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred), r2_score(y_test, y_test_pred))

76.65187622222221 0.7083440206886278


### Preliminary Results
#### Random Forest
Test-set performance on <u>daily rentals</u> as the number of stations and the feature set change:
- Switching to daily demand, the r2 score was 0.77 (50 stations)
- Switching to daily demand, the r2 score was 0.7 (450 stations)
- Adding weather information, the r2 score was 0.71 (450 stations)

The r2 score drops when going from 50 to 450 stations. Adding **weather attributes** improves the performance for *daily rentals* only marginally. The effect may become more visible once 3-4 months of data are considered.
<br><br>
**Note**: The <u>training time</u> seems to have *improved*. This was not expected.

#### Linear Regression
<u>Case for 450 stations</u>: Compared to the *random forest*, the linear model has a lower r2 score on the training set (0.785 vs. 0.973) but a higher r2 score on the test set (0.776 vs. 0.708). This suggests that the random forest is **overfitting**.
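
The overfitting argument rests on the gap between training and test r2. The sketch below makes that gap explicit for both model types. It uses synthetic, roughly linear data rather than the rentals dataset, and `r2_gap` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def r2_gap(model, X_tr, y_tr, X_te, y_te):
    """Fit the model and return (train r2, test r2, train minus test)."""
    model.fit(X_tr, y_tr)
    r2_tr = r2_score(y_tr, model.predict(X_tr))
    r2_te = r2_score(y_te, model.predict(X_te))
    return r2_tr, r2_te, r2_tr - r2_te

# Synthetic, roughly linear data (an assumption for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(size=300)
X_tr, X_te = X_demo[:225], X_demo[225:]
y_tr, y_te = y_demo[:225], y_demo[225:]

gaps = {}
for name, model in [("linear", LinearRegression()),
                    ("forest", RandomForestRegressor(random_state=2))]:
    gaps[name] = r2_gap(model, X_tr, y_tr, X_te, y_te)
    print(name, gaps[name])
```

A large positive gap for one model next to a small gap for the other is the pattern described above for the random forest and the linear model.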

### Further Actions
- With 50 stations: The r2 score of the *linear model* is 0.79
- With 450 stations: The r2 score of the *linear model* is 0.78

The performance of the linear model appears to decrease as the volume of data increases. Also, the performance of the *random forest* seems to change from run to run. This suggests that there may be room for <u>hyperparameter tuning</u>.
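
One possible starting point for the tuning is a small grid search with scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data. The parameter grid, `n_estimators=30` and `cv=3` are illustrative assumptions, not values taken from this notebook. Keeping `random_state` fixed should also make repeated runs give the same result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical; stands in for X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(size=200)

# Parameters that limit tree complexity, a common lever against overfitting
param_grid = {
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=2),
    param_grid,
    scoring="r2",
    cv=3,  # plain 3-fold CV; a time-aware split may suit daily data better
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

On the real data, `search.best_estimator_` could then be evaluated on `X_test` in the same way as the untuned forest.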