## Feature Construction Issue
In the earlier notebook *Constructing_Features*, the lagged rental value at 8 AM (the first time unit of each day) turned out not to be missing for every station. This points to an issue in the way these lagged rental values were created.

### Re-creating the old method 
Here, the plain `shift(1)` method is used to generate the lagged variables, as was done in the previous *Constructing_Features* notebook.

In [1]:
# Importing
import pandas as pd
selected_rentals = pd.read_csv(r"C:\Users\singh\Desktop\TUD (All Semesters)\Courses - Semester 5 (TU Dresden)\Research Task - Spatial Modelling\Code\rentals_near200_st.csv")

# remove data for 2023, sort by station, remove [name, lat, lng]
selected_rentals = selected_rentals[["#_rentals", "datetime", "year", "month", "day", "hour", "ID", 'coordinates']]
selected_rentals = selected_rentals[~(selected_rentals["year"] == 2023)]
selected_rentals.sort_values(by=["ID","year","month","day","hour"], ignore_index=True, inplace=True)
selected_rentals[:15]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272)
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272)
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272)
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272)
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272)
5,0,2024-01-01 18:00:00.000,2024,1,1,18,0,POINT (-73.9383 40.7923272)
6,0,2024-01-01 20:00:00.000,2024,1,1,20,0,POINT (-73.9383 40.7923272)
7,3,2024-01-02 08:00:00.000,2024,1,2,8,0,POINT (-73.9383 40.7923272)
8,0,2024-01-02 10:00:00.000,2024,1,2,10,0,POINT (-73.9383 40.7923272)
9,6,2024-01-02 12:00:00.000,2024,1,2,12,0,POINT (-73.9383 40.7923272)


In [2]:
# Introducing lagged rentals
selected_rentals["#_rentals_lag_1"] = selected_rentals["#_rentals"].shift(1)
selected_rentals[:15]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0
5,0,2024-01-01 18:00:00.000,2024,1,1,18,0,POINT (-73.9383 40.7923272),0.0
6,0,2024-01-01 20:00:00.000,2024,1,1,20,0,POINT (-73.9383 40.7923272),0.0
7,3,2024-01-02 08:00:00.000,2024,1,2,8,0,POINT (-73.9383 40.7923272),0.0
8,0,2024-01-02 10:00:00.000,2024,1,2,10,0,POINT (-73.9383 40.7923272),3.0
9,6,2024-01-02 12:00:00.000,2024,1,2,12,0,POINT (-73.9383 40.7923272),0.0


In [3]:
# Checking for 8am values for all IDs
selected_rentals[(selected_rentals["hour"] == 8) & (selected_rentals["day"] == 1)]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
217,9,2024-02-01 08:00:00.000,2024,2,1,8,0,POINT (-73.9383 40.7923272),0.0
420,3,2024-03-01 08:00:00.000,2024,3,1,8,0,POINT (-73.9383 40.7923272),0.0
637,3,2024-04-01 08:00:00.000,2024,4,1,8,0,POINT (-73.9383 40.7923272),1.0
847,0,2024-01-01 08:00:00.000,2024,1,1,8,9,POINT (-73.94594 40.7817212),2.0
...,...,...,...,...,...,...,...,...,...
168343,0,2024-04-01 08:00:00.000,2024,4,1,8,2065,POINT (-73.92037 40.812299),0.0
168553,0,2024-01-01 08:00:00.000,2024,1,1,8,2074,POINT (-73.913863 40.800933),0.0
168770,0,2024-02-01 08:00:00.000,2024,2,1,8,2074,POINT (-73.913863 40.800933),0.0
168973,0,2024-03-01 08:00:00.000,2024,3,1,8,2074,POINT (-73.913863 40.800933),0.0


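The problem here is that the lagged rental value is *NaN* for the first row of station ID '0', but 2.0 for the first row of station ID '9' (index 847). Because `shift(1)` works on the whole column regardless of station, that value is the last rental count of station '0', not a value from station '9'. For the same reason, every other 8 AM row carries over the 8 PM value of the previous day.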
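
A minimal sketch of why this happens, on a made-up two-station frame (the name `toy` and its values are illustrative, not taken from the dataset): a plain `shift(1)` moves values down the whole column, so the first row of a station receives the last value of the station before it, whereas a grouped shift restarts at each station.

```python
import pandas as pd

# Toy data: two stations with three time slots each (illustrative values only)
toy = pd.DataFrame({
    "ID":        [0, 0, 0, 9, 9, 9],
    "hour":      [8, 10, 12, 8, 10, 12],
    "#_rentals": [1, 4, 2, 0, 5, 3],
})

# Plain shift: station 9's first row inherits station 0's last value
toy["lag_plain"] = toy["#_rentals"].shift(1)

# Grouped shift: the lag restarts within each station, so that row is NaN
toy["lag_grouped"] = toy.groupby("ID")["#_rentals"].shift(1)

print(toy)
```

In the printed frame, row 3 (the first row of station 9) differs between the two lag columns; all other rows agree.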

In [4]:
import numpy as np

# resetting the lagged-rentals column (all values set to NaN)
selected_rentals['#_rentals_lag_1'] = np.nan

# sorting the data
selected_rentals.sort_values(by = ['ID', 'month', 'day', 'hour'], inplace=True, ignore_index=True)
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),


In [5]:
# For Jan, 31 days (7 time units * 31 days)
len(selected_rentals[(selected_rentals["ID"] == 0) & (selected_rentals["month"] == 1)])

217

### Developing a new method for creating lagged values
As shown above, the old method lets lagged values leak across days and stations. This section explores a better way: computing the lag separately for each station and each day.

In [6]:
# Creating lag: Selecting ID=0, 1st day of Jan all time units
lag_list = list(
selected_rentals.loc[(selected_rentals.ID == 0) & (selected_rentals.month == 1) & (selected_rentals.day == 1), "#_rentals"].shift(1)
)

lag_list

[nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [7]:
selected_rentals.ID.unique()

array([   0,    9,   54,   55,   62,   63,   66,   67,   72,   74,   75,
         76,   77,   78,   83,   85,   90,   91,   92,  101,  102,  103,
        113,  114,  115,  124,  133,  134,  143,  150,  154,  156,  157,
        168,  169,  184,  185,  281,  286,  287,  292,  293,  294,  439,
        482,  483,  484,  485,  486,  487,  488,  489,  499,  505,  506,
        507,  535,  616,  631,  634,  663,  664,  665,  732,  734,  735,
        736,  791,  793,  801,  819,  823,  861,  891,  892,  893,  894,
        895,  896,  901,  902,  903,  904,  905,  906,  907,  908,  911,
        912,  913,  916,  917,  918,  919,  920,  921,  922,  923,  924,
        927,  928,  929,  930,  931,  934,  935, 1066, 1067, 1068, 1069,
       1070, 1071, 1072, 1074, 1075, 1076, 1077, 1078, 1079, 1080, 1081,
       1082, 1083, 1084, 1085, 1088, 1089, 1090, 1091, 1092, 1093, 1094,
       1095, 1106, 1119, 1157, 1158, 1159, 1160, 1187, 1211, 1366, 1367,
       1368, 1369, 1370, 1371, 1372, 1373, 1374, 13

In [8]:
# Creating a lag list for the month of Jan only!

lag_list_jan = []

for i in selected_rentals.ID.unique():
    for k in selected_rentals[selected_rentals.month == 1].day.unique():
        lag_list_jan.append(selected_rentals.loc[(selected_rentals.ID == i) & (selected_rentals.month == 1) & (selected_rentals.day == k), "#_rentals"].shift(1))
            
lag_list_jan

[0    NaN
 1    0.0
 2    0.0
 3    0.0
 4    0.0
 5    0.0
 6    0.0
 Name: #_rentals, dtype: float64,
 7     NaN
 8     3.0
 9     0.0
 10    6.0
 11    6.0
 12    7.0
 13    1.0
 Name: #_rentals, dtype: float64,
 14    NaN
 15    3.0
 16    1.0
 17    3.0
 18    0.0
 19    5.0
 20    1.0
 Name: #_rentals, dtype: float64,
 21    NaN
 22    4.0
 23    1.0
 24    1.0
 25    1.0
 26    0.0
 27    1.0
 Name: #_rentals, dtype: float64,
 28    NaN
 29    2.0
 30    1.0
 31    1.0
 32    1.0
 33    1.0
 34    1.0
 Name: #_rentals, dtype: float64,
 35    NaN
 36    2.0
 37    0.0
 38    1.0
 39    2.0
 40    2.0
 41    0.0
 Name: #_rentals, dtype: float64,
 42    NaN
 43    1.0
 44    0.0
 45    0.0
 46    0.0
 47    4.0
 48    0.0
 Name: #_rentals, dtype: float64,
 49    NaN
 50    3.0
 51    1.0
 52    0.0
 53    0.0
 54    3.0
 55    2.0
 Name: #_rentals, dtype: float64,
 56    NaN
 57    4.0
 58    4.0
 59    1.0
 60    0.0
 61    0.0
 62    0.0
 Name: #_rentals, dtype: float64,
 63    N

In [9]:
# Creating a lag list for the month of Feb only!

lag_list_feb = []

for i in selected_rentals.ID.unique():
    for k in selected_rentals[selected_rentals.month == 2].day.unique():
        lag_list_feb.append(selected_rentals.loc[(selected_rentals.ID == i) & (selected_rentals.month == 2) & (selected_rentals.day == k), "#_rentals"].shift(1))
            
lag_list_feb

[217    NaN
 218    9.0
 219    1.0
 220    3.0
 221    5.0
 222    1.0
 223    3.0
 Name: #_rentals, dtype: float64,
 224    NaN
 225    4.0
 226    0.0
 227    2.0
 228    0.0
 229    3.0
 230    2.0
 Name: #_rentals, dtype: float64,
 231    NaN
 232    1.0
 233    3.0
 234    2.0
 235    5.0
 236    3.0
 237    1.0
 Name: #_rentals, dtype: float64,
 238    NaN
 239    0.0
 240    0.0
 241    2.0
 242    1.0
 243    5.0
 244    1.0
 Name: #_rentals, dtype: float64,
 245    NaN
 246    6.0
 247    1.0
 248    2.0
 249    1.0
 250    2.0
 251    1.0
 Name: #_rentals, dtype: float64,
 252    NaN
 253    6.0
 254    2.0
 255    2.0
 256    2.0
 257    3.0
 258    1.0
 Name: #_rentals, dtype: float64,
 259    NaN
 260    7.0
 261    3.0
 262    1.0
 263    2.0
 264    3.0
 265    1.0
 Name: #_rentals, dtype: float64,
 266    NaN
 267    3.0
 268    1.0
 269    2.0
 270    7.0
 271    5.0
 272    3.0
 Name: #_rentals, dtype: float64,
 273    NaN
 274    7.0
 275    2.0
 276    3.0
 277    

In [10]:
# Creating a lag list for the month of Mar only!

lag_list_mar = []

for i in selected_rentals.ID.unique():
    for k in selected_rentals[selected_rentals.month == 3].day.unique():
        lag_list_mar.append(selected_rentals.loc[(selected_rentals.ID == i) & (selected_rentals.month == 3) & (selected_rentals.day == k), "#_rentals"].shift(1))
            
lag_list_mar

[420    NaN
 421    3.0
 422    1.0
 423    1.0
 424    0.0
 425    0.0
 426    1.0
 Name: #_rentals, dtype: float64,
 427    NaN
 428    2.0
 429    2.0
 430    0.0
 431    0.0
 432    0.0
 433    0.0
 Name: #_rentals, dtype: float64,
 434     NaN
 435     1.0
 436     1.0
 437     2.0
 438     4.0
 439    10.0
 440     2.0
 Name: #_rentals, dtype: float64,
 441    NaN
 442    4.0
 443    6.0
 444    7.0
 445    2.0
 446    2.0
 447    5.0
 Name: #_rentals, dtype: float64,
 448    NaN
 449    0.0
 450    0.0
 451    1.0
 452    3.0
 453    0.0
 454    0.0
 Name: #_rentals, dtype: float64,
 455    NaN
 456    6.0
 457    1.0
 458    1.0
 459    0.0
 460    0.0
 461    0.0
 Name: #_rentals, dtype: float64,
 462    NaN
 463    6.0
 464    1.0
 465    0.0
 466    1.0
 467    0.0
 468    0.0
 Name: #_rentals, dtype: float64,
 469    NaN
 470    1.0
 471    2.0
 472    2.0
 473    2.0
 474    2.0
 475    4.0
 Name: #_rentals, dtype: float64,
 476    NaN
 477    2.0
 478    0.0
 479    2.0
 

In [11]:
# Creating a lag list for the month of Apr only!

lag_list_apr = []

for i in selected_rentals.ID.unique():
    for k in selected_rentals[selected_rentals.month == 4].day.unique():
        lag_list_apr.append(selected_rentals.loc[(selected_rentals.ID == i) & (selected_rentals.month == 4) & (selected_rentals.day == k), "#_rentals"].shift(1))
            
lag_list_apr

[637    NaN
 638    3.0
 639    0.0
 640    0.0
 641    1.0
 642    3.0
 643    4.0
 Name: #_rentals, dtype: float64,
 644    NaN
 645    4.0
 646    0.0
 647    4.0
 648    0.0
 649    0.0
 650    1.0
 Name: #_rentals, dtype: float64,
 651    NaN
 652    2.0
 653    0.0
 654    0.0
 655    0.0
 656    1.0
 657    1.0
 Name: #_rentals, dtype: float64,
 658    NaN
 659    3.0
 660    2.0
 661    1.0
 662    0.0
 663    1.0
 664    2.0
 Name: #_rentals, dtype: float64,
 665    NaN
 666    6.0
 667    2.0
 668    2.0
 669    4.0
 670    1.0
 671    4.0
 Name: #_rentals, dtype: float64,
 672    NaN
 673    0.0
 674    3.0
 675    4.0
 676    3.0
 677    5.0
 678    1.0
 Name: #_rentals, dtype: float64,
 679    NaN
 680    1.0
 681    4.0
 682    3.0
 683    2.0
 684    4.0
 685    1.0
 Name: #_rentals, dtype: float64,
 686    NaN
 687    6.0
 688    4.0
 689    5.0
 690    3.0
 691    0.0
 692    2.0
 Name: #_rentals, dtype: float64,
 693    NaN
 694    6.0
 695    5.0
 696    4.0
 697    

The lag lists are built month by month, and within each month station by station. To line these values up with the rows of the dataset, the dataset must be sorted the same way: by month (Jan, Feb, Mar, Apr), then by station, then by day.
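
As a possible alternative to relying on row order, a grouped shift returns a Series aligned to the original index, so it can be assigned directly without re-sorting the dataset. The sketch below assumes the same column names as `selected_rentals` but runs on a small made-up frame (`toy`, deliberately unsorted); it is not the method used in the cells that follow.

```python
import pandas as pd

# Toy frame: one station, two days, three slots per day, rows shuffled
toy = pd.DataFrame({
    "ID":        [0, 0, 0, 0, 0, 0],
    "month":     [1, 1, 1, 1, 1, 1],
    "day":       [2, 1, 1, 2, 1, 2],
    "hour":      [10, 8, 12, 8, 10, 12],
    "#_rentals": [0, 0, 2, 3, 1, 6],
})

# Sort chronologically, then shift within each station-day group
lag = (
    toy.sort_values(["ID", "month", "day", "hour"])
       .groupby(["ID", "month", "day"])["#_rentals"]
       .shift(1)
)

# Assignment aligns on the index, so toy itself never needs re-sorting
toy["#_rentals_lag_1"] = lag
print(toy.sort_values(["day", "hour"]))
```

The first slot of each day comes out as NaN, and every other slot holds the rental count of the preceding slot on the same day.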

In [12]:
# Re-sorting the dataset by month, then station, day and hour (index is reset)

selected_rentals.sort_values(by=["month", "ID", "day", "hour"], inplace=True, ignore_index=True)
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),


### A problem with lag list!
I tried to enter the combined lag list as a column in the *selected_rentals* dataset. However, an error was encountered. The reason for this error can be seen below.

In [13]:
# The problem:
print(len(lag_list_jan + lag_list_feb + lag_list_mar + lag_list_apr),
      len(selected_rentals))

24200 169400


In [14]:
# Where is the problem?
print(len(lag_list_jan), 
      len(selected_rentals[selected_rentals.month == 1]))

6200 43400


In [15]:
# The problem: each list element is a whole Series (one station-day), not a single value

print(lag_list_jan[0],
      lag_list_jan[0][:3])

0    NaN
1    0.0
2    0.0
3    0.0
4    0.0
5    0.0
6    0.0
Name: #_rentals, dtype: float64 0    NaN
1    0.0
2    0.0
Name: #_rentals, dtype: float64


In [16]:
for j in lag_list_jan[0]:
    print(j)  # Iterating over one Series yields its individual values

nan
0.0
0.0
0.0
0.0
0.0
0.0


In [17]:
# Unpacking for all months

lag_list_jan_unpacked = []

for i in range(len(lag_list_jan)):
    for j in lag_list_jan[i]:
        lag_list_jan_unpacked.append(j)
        
lag_list_feb_unpacked = []

for i in range(len(lag_list_feb)):
    for j in lag_list_feb[i]:
        lag_list_feb_unpacked.append(j)

lag_list_mar_unpacked = []

for i in range(len(lag_list_mar)):
    for j in lag_list_mar[i]:
        lag_list_mar_unpacked.append(j)
        
lag_list_apr_unpacked = []

for i in range(len(lag_list_apr)):
    for j in lag_list_apr[i]:
        lag_list_apr_unpacked.append(j)

print(len(lag_list_jan_unpacked), len(lag_list_feb_unpacked), len(lag_list_mar_unpacked), len(lag_list_apr_unpacked))

43400 40600 43400 42000
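
The nested unpacking loops above can also be written more compactly. As a sketch on a made-up list of two Series (`pieces` is a hypothetical stand-in for `lag_list_jan`), `pd.concat` joins the fragments into one Series while keeping their original index labels, and `itertools.chain.from_iterable` produces a flat Python list of values:

```python
from itertools import chain

import numpy as np
import pandas as pd

# Two fragments, as produced by shifting two separate station-days
pieces = [
    pd.Series([np.nan, 0.0, 0.0], index=[0, 1, 2], name="#_rentals"),
    pd.Series([np.nan, 3.0, 0.0], index=[3, 4, 5], name="#_rentals"),
]

# len() of the list counts Series, not values
print(len(pieces))

# Option 1: one long Series that keeps the original row labels
flat_series = pd.concat(pieces)

# Option 2: a flat list of the underlying values
flat_list = list(chain.from_iterable(pieces))

print(len(flat_series), len(flat_list))
```

Because the concatenated Series keeps the row labels, it could be assigned to a column by index alignment instead of by position.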


In [18]:
# Now using the lag_list for generating lagged rentals
selected_rentals["#_rentals_lag_1"] = lag_list_jan_unpacked + lag_list_feb_unpacked + lag_list_mar_unpacked + lag_list_apr_unpacked
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0


In [19]:
# Verifying for all 8am values
selected_rentals[selected_rentals.hour == 8]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),
7,3,2024-01-02 08:00:00.000,2024,1,2,8,0,POINT (-73.9383 40.7923272),
14,3,2024-01-03 08:00:00.000,2024,1,3,8,0,POINT (-73.9383 40.7923272),
21,4,2024-01-04 08:00:00.000,2024,1,4,8,0,POINT (-73.9383 40.7923272),
28,2,2024-01-05 08:00:00.000,2024,1,5,8,0,POINT (-73.9383 40.7923272),
...,...,...,...,...,...,...,...,...,...
169365,0,2024-04-26 08:00:00.000,2024,4,26,8,2074,POINT (-73.913863 40.800933),
169372,1,2024-04-27 08:00:00.000,2024,4,27,8,2074,POINT (-73.913863 40.800933),
169379,0,2024-04-28 08:00:00.000,2024,4,28,8,2074,POINT (-73.913863 40.800933),
169386,0,2024-04-29 08:00:00.000,2024,4,29,8,2074,POINT (-73.913863 40.800933),


In [20]:
# Rental values look good!
selected_rentals[selected_rentals.ID == 54]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1
434,0,2024-01-01 08:00:00.000,2024,1,1,8,54,POINT (-73.92743647098541 40.772768286288304),
435,1,2024-01-01 10:00:00.000,2024,1,1,10,54,POINT (-73.92743647098541 40.772768286288304),0.0
436,3,2024-01-01 12:00:00.000,2024,1,1,12,54,POINT (-73.92743647098541 40.772768286288304),1.0
437,2,2024-01-01 14:00:00.000,2024,1,1,14,54,POINT (-73.92743647098541 40.772768286288304),3.0
438,0,2024-01-01 16:00:00.000,2024,1,1,16,54,POINT (-73.92743647098541 40.772768286288304),2.0
...,...,...,...,...,...,...,...,...,...
128025,3,2024-04-30 12:00:00.000,2024,4,30,12,54,POINT (-73.92743647098541 40.772768286288304),1.0
128026,2,2024-04-30 14:00:00.000,2024,4,30,14,54,POINT (-73.92743647098541 40.772768286288304),3.0
128027,1,2024-04-30 16:00:00.000,2024,4,30,16,54,POINT (-73.92743647098541 40.772768286288304),2.0
128028,1,2024-04-30 18:00:00.000,2024,4,30,18,54,POINT (-73.92743647098541 40.772768286288304),1.0


### Testing baseline prediction performance 
This corrected dataset is used to evaluate what the baseline performance looks like without adding any complicated features for now. This performance can be compared with performance of models with added features to evaluate how effective those features are.
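
The evaluation cells that follow repeat the same split-train-score steps for every month and feature set. A minimal sketch of how this could be wrapped in one function is given here; the name `evaluate_month`, the synthetic frame `demo`, and the smaller `n_estimators=20` are assumptions made to keep the example quick and self-contained, and the synthetic data contains no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_month(df, month, split_day=25, target="#_rentals"):
    """Train on days <= split_day of `month`, test on the remaining days."""
    in_month = df["month"] == month
    train = df[in_month & (df["day"] <= split_day)]
    test = df[in_month & (df["day"] > split_day)]
    features = [c for c in df.columns if c != target]

    forest = RandomForestRegressor(n_estimators=20, random_state=2)
    forest.fit(train[features], train[target])
    pred = forest.predict(test[features])
    return mean_squared_error(test[target], pred), r2_score(test[target], pred)


# Synthetic stand-in data: 29 days x 7 time units of random counts
rng = np.random.default_rng(0)
days = np.repeat(np.arange(1, 30), 7)
demo = pd.DataFrame({
    "#_rentals": rng.poisson(2, size=days.size),
    "month": 2,
    "day": days,
    "hour": np.tile(np.arange(8, 22, 2), 29),
    "#_rentals_lag_1": rng.poisson(2, size=days.size),
})

mse, r2 = evaluate_month(demo, month=2)
print(mse, r2)
```

Since the split is by day of month, the test rows always come after the training rows in time, which matches the setup used in this notebook.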

In [21]:
# Firstly, the conversion of ID col into dummies is required!

# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1"]], columns = ["ID"], drop_first=False)
selected_rentals_dum.head()

Unnamed: 0,#_rentals,year,month,day,hour,#_rentals_lag_1,ID_0,ID_9,ID_54,ID_55,...,ID_1879,ID_1881,ID_1884,ID_2010,ID_2017,ID_2062,ID_2063,ID_2064,ID_2065,ID_2074
0,0,2024,1,1,8,,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,0,2024,1,1,10,0.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0,2024,1,1,12,0.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0,2024,1,1,14,0.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0,2024,1,1,16,0.0,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [22]:
# Developing a training/test data for Feb

# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.722882392857143 0.2567099194968341


In [23]:
# Developing a training/test data for Mar

# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

3.9153760952380954 0.4448607863858628


### Other Features I can probably work with
See the notes in the notebook for candidate features. The **datetime** column can be used to generate them.

#### Evaluating performance increase with *Name of Day* variable
Firstly, the name of the day is created using the datetime information. It is then converted into a dummy variable since it's categorical.
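
As a small illustration of these two steps on made-up rows (the frame `toy` is hypothetical), the `.dt.day_name()` accessor extracts the day name from a parsed datetime column and `pd.get_dummies` expands it into one indicator column per day:

```python
import pandas as pd

# Three timestamps in the same string format as the datetime column
toy = pd.DataFrame({"datetime": ["2024-01-01 08:00:00.000",
                                 "2024-01-06 08:00:00.000",
                                 "2024-01-07 08:00:00.000"]})

# Step 1: day name from the parsed timestamp
toy["name_of_day"] = pd.to_datetime(toy["datetime"]).dt.day_name()

# Step 2: one indicator column per distinct day name
dummies = pd.get_dummies(toy, columns=["name_of_day"])

print(toy["name_of_day"].tolist())
print(dummies.columns.tolist())
```

Only day names that occur in the data get a column, so a frame covering full weeks yields seven indicator columns.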

In [24]:
# Add name_of_day (e.g. "Monday"), derived from the datetime column
selected_rentals['name_of_day'] = pd.to_datetime(selected_rentals["datetime"]).dt.day_name()
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),,Monday
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0,Monday
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0,Monday
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0,Monday
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0,Monday


In [25]:
# creating dummies for ID and name_of_day
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "name_of_day"]], columns = ["ID", "name_of_day"], drop_first=False)
selected_rentals_dum.head()

Unnamed: 0,#_rentals,year,month,day,hour,#_rentals_lag_1,ID_0,ID_9,ID_54,ID_55,...,ID_2064,ID_2065,ID_2074,name_of_day_Friday,name_of_day_Monday,name_of_day_Saturday,name_of_day_Sunday,name_of_day_Thursday,name_of_day_Tuesday,name_of_day_Wednesday
0,0,2024,1,1,8,,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
1,0,2024,1,1,10,0.0,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
2,0,2024,1,1,12,0.0,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,0,2024,1,1,14,0.0,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,0,2024,1,1,16,0.0,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False


In [26]:
# Evaluating change in prediction performance: Feb

# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.2193783035714287 0.39415603029493884


In [27]:
# Evaluating change in prediction performance: Mar

# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

3.6642331547619045 0.4804689351535526


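Adding **name_of_day** clearly helps: the R² score rises from about 0.26 to 0.39 for February and from about 0.44 to 0.48 for March, while the MSE falls from 2.72 to 2.22 and from 3.92 to 3.66 respectively. The gain is largest for February, which had the lower R² score to begin with.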

#### Evaluating performance increase with *Weekend* variable
A *weekend* variable is created that stores binary values, depending on **name_of_day**.
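
As a sketch of an alternative, the same flag can be derived directly from the timestamp without going through the day name: pandas numbers the days of the week from Monday = 0 to Sunday = 6, so values of 5 and above are weekend days. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Friday to Monday, in the same string format as the datetime column
toy = pd.DataFrame({"datetime": ["2024-01-05 08:00:00.000",
                                 "2024-01-06 08:00:00.000",
                                 "2024-01-07 08:00:00.000",
                                 "2024-01-08 08:00:00.000"]})

# dayofweek: Monday = 0, ..., Saturday = 5, Sunday = 6
day_number = pd.to_datetime(toy["datetime"]).dt.dayofweek
toy["weekend"] = (day_number >= 5).astype(int)

print(toy)
```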

In [28]:
# weekend is 1 (i.e. True) for Saturday and Sunday, and 0 (i.e. False) otherwise
selected_rentals["weekend"] = selected_rentals["name_of_day"].isin(["Saturday", "Sunday"]).astype(int)
        
selected_rentals[selected_rentals.name_of_day == "Saturday"].head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend
35,2,2024-01-06 08:00:00.000,2024,1,6,8,0,POINT (-73.9383 40.7923272),,Saturday,1
36,0,2024-01-06 10:00:00.000,2024,1,6,10,0,POINT (-73.9383 40.7923272),2.0,Saturday,1
37,1,2024-01-06 12:00:00.000,2024,1,6,12,0,POINT (-73.9383 40.7923272),0.0,Saturday,1
38,2,2024-01-06 14:00:00.000,2024,1,6,14,0,POINT (-73.9383 40.7923272),1.0,Saturday,1
39,2,2024-01-06 16:00:00.000,2024,1,6,16,0,POINT (-73.9383 40.7923272),2.0,Saturday,1


In [29]:
# creating dummies for ID and name_of_day
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "name_of_day", "weekend"]], columns = ["ID", "name_of_day"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.179808428571429 0.40495778054732057
3.6342633214285716 0.4847181897089573


Adding the binary *weekend* variable on top of *name_of_day* improves performance only slightly: the R² score rises from about 0.394 to 0.405 for February and from about 0.480 to 0.485 for March. The *weekend* variable was not evaluated on its own here. Since the gain is small but present in both months, this variable can also be kept as a feature.
<br><br>
It can be concluded that the combination of <u>Day name</u> and <u>weekend indicator</u> gives the best results so far, in terms of R² score.

#### Evaluating performance increase with *public holiday* variable
The `holidays` package for Python can tell whether a given date is a public holiday. Public holidays should also influence **rentals** for bike-sharing systems, so a binary *is_holiday* variable is generated.
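
For comparison, a holiday flag can also be built without a third-party dependency. The sketch below swaps in pandas' built-in `USFederalHolidayCalendar` instead of the `holidays` package used in the cells that follow, so the two approaches may not agree on every date; the frame `toy` is made up for illustration.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Three timestamps in the same string format as the datetime column
toy = pd.DataFrame({"datetime": ["2024-01-01 08:00:00.000",
                                 "2024-01-02 08:00:00.000",
                                 "2024-01-15 08:00:00.000"]})

# Drop the time of day so that timestamps compare equal to holiday dates
dates = pd.to_datetime(toy["datetime"]).dt.normalize()

# All US federal holidays in 2024, as a DatetimeIndex
federal = USFederalHolidayCalendar().holidays(start="2024-01-01", end="2024-12-31")

# Vectorised membership test instead of a Python loop
toy["is_holiday"] = dates.isin(federal).astype(int)
print(toy)
```

The membership test runs on the whole column at once, which avoids looping over every row of a large frame.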

In [30]:
# pip install holidays
import holidays
us_holidays = holidays.US()
"2024-07-04" in us_holidays # 4th of July 2024 (American Independence Day)

True

In [31]:
# Converting datetime to dates
dates = pd.Series([x.date() for x in pd.to_datetime(selected_rentals["datetime"])])

# creating holiday as an empty list
holiday = []

# Evaluating if the stored date is a public holiday
for i in range(len(dates)):
    if dates[i] in us_holidays:
        holiday.append(1)
    else:
        holiday.append(0)

holiday[:9] # 1st Jan 2024 was a holiday, but 2nd Jan wasn't!

[1, 1, 1, 1, 1, 1, 1, 0, 0]

The first 7 values *(seven time units)* are the holiday flags for Jan 1st, 2024 at station ID 0. The next two values are for the same station on Jan 2nd, 2024.

In [32]:
# Adding this variable as a column
selected_rentals["is_holiday"] = holiday
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),,Monday,0,1
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1


In [33]:
# creating dummies for ID only (name_of_day and weekend are left out here)
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "is_holiday"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.716597625 0.2584255299906012
3.9179368809523805 0.4444977069439181


For the *is_holiday* variable, performance on the Feb data improved only very slightly, while for March it in fact decreased slightly. The variable also adds another column to the dataset. <u>Hence its inclusion is optional and can be tested on the completed model</u>.
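The Feb/Mar evaluation cell is repeated for every feature tested in this notebook. A minimal sketch of how that repetition could be wrapped in a function is given below. The helper name `evaluate_month`, its parameters, and the toy frame are hypothetical and not part of the original notebook; the split (days <= 25 for training, the rest for testing) mirrors the cells above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_month(df_dum, month, target="#_rentals", split_day=25, random_state=2):
    """Hypothetical helper: train on days <= split_day of `month`,
    test on the remaining days, and return (MSE, R2)."""
    in_month = df_dum["month"] == month
    train = df_dum[in_month & (df_dum["day"] <= split_day)]
    test = df_dum[in_month & (df_dum["day"] > split_day)]
    # Every column except the target is used as a feature
    features = [c for c in df_dum.columns if c != target]

    forest = RandomForestRegressor(random_state=random_state)
    forest.fit(train[features], train[target])
    pred = forest.predict(test[features])
    return mean_squared_error(test[target], pred), r2_score(test[target], pred)


# Tiny synthetic frame, only to show the call; the values are made up
demo = pd.DataFrame({
    "#_rentals": [0, 1, 2, 3, 4, 5, 6, 7],
    "month": [2] * 8,
    "day": [1, 5, 10, 20, 25, 26, 27, 28],
    "hour": [8, 10, 12, 14, 16, 8, 10, 12],
})
mse, r2 = evaluate_month(demo, month=2)
print(mse, r2)
```

With such a helper, each feature experiment would reduce to building `selected_rentals_dum` and calling the function once for month 2 and once for month 3.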

#### Evaluating performance with *2nd_lag*
Until now only the first lag was considered, i.e. the rental value for time period *t-1*. Now the performance impact of adding the rental value for time period *t-2* as a feature is evaluated.
<br><br>
Further lags *(i.e. t-3, t-4, ..., t-6)* are not considered: because the lags are computed within each day, every additional lag creates more **missing** values at the start of each day, and these contribute no valuable information to the learner!

In [34]:
# This time the month is incorporated within the loop
lag_list_two = []

# The dataset is sorted first by month, and then by ID
for i in selected_rentals.month.unique():
    for j in selected_rentals[selected_rentals.month == i].ID.unique():
        for k in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j)].day.unique():
            lag_list_two.append(selected_rentals.loc[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k), "#_rentals"].shift(2))
            
# Unpacking
lag_list_two_unpacked = []

for i in range(len(lag_list_two)):
    for j in lag_list_two[i]:
        lag_list_two_unpacked.append(j)

print(len(selected_rentals), len(lag_list_two_unpacked))

169400 169400
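The nested loops above shift the rentals separately within every (month, station, day) block. As a possible alternative, the same idea can be sketched with a single `groupby(...).shift(2)`, which aligns the result on the original index, so no unpacking step is needed. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: one station, two days, three time slots per day (made-up values)
toy = pd.DataFrame({
    "month": [1] * 6,
    "ID": [0] * 6,
    "day": [1, 1, 1, 2, 2, 2],
    "hour": [8, 10, 12, 8, 10, 12],
    "#_rentals": [0, 2, 5, 3, 1, 4],
})

# Shift within each (month, ID, day) group so values never leak across days
toy["#_rentals_lag_2"] = toy.groupby(["month", "ID", "day"])["#_rentals"].shift(2)
print(toy)
```

The first two slots of every day come out as missing, which matches the pattern produced by the loop-based version.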


In [35]:
# Verifying if dataset is sorted by month and then ID -- checking the index values
selected_rentals[215:220] # Yes, sorting and indexing are correct

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday
215,3,2024-01-31 18:00:00.000,2024,1,31,18,0,POINT (-73.9383 40.7923272),2.0,Wednesday,0,0
216,0,2024-01-31 20:00:00.000,2024,1,31,20,0,POINT (-73.9383 40.7923272),3.0,Wednesday,0,0
217,0,2024-01-01 08:00:00.000,2024,1,1,8,9,POINT (-73.94594 40.7817212),,Monday,0,1
218,1,2024-01-01 10:00:00.000,2024,1,1,10,9,POINT (-73.94594 40.7817212),0.0,Monday,0,1
219,1,2024-01-01 12:00:00.000,2024,1,1,12,9,POINT (-73.94594 40.7817212),1.0,Monday,0,1


In [36]:
# Assigning the second-lagged values to a column
selected_rentals["#_rentals_lag_2"] = lag_list_two_unpacked
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday,#_rentals_lag_2
0,0,2024-01-01 08:00:00.000,2024,1,1,8,0,POINT (-73.9383 40.7923272),,Monday,0,1,
1,0,2024-01-01 10:00:00.000,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,
2,0,2024-01-01 12:00:00.000,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0
3,0,2024-01-01 14:00:00.000,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0
4,0,2024-01-01 16:00:00.000,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0


In [37]:
# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "#_rentals_lag_2"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.5954866964285714 0.2914862865197184
3.7900022023809523 0.46263684738115396


Adding *#_rentals_lag_2* also gives a performance advantage over the baseline. Hence, this feature can also be kept.

#### Evaluating performance for *prev_day*
The rentals at a given time today might be related to the rentals at the same time on the previous day. This information is captured in the *prev_day* variable. <u>However, this approach results in missing values for the first day of every month! Missing values negatively affect model performance and should be avoided if possible</u>.

In [38]:
import numpy as np  # needed for np.nan below

prev_day_list = []

for i in selected_rentals.month.unique():
    for j in selected_rentals[selected_rentals.month == i].ID.unique():
        for k in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j)].day.unique():
            for l in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k)].hour.unique():
                if k<2:
                    prev_day_list.append(np.nan)
                else:
                    prev_day_list.append(selected_rentals.loc[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k-1) & (selected_rentals.hour == l), "#_rentals"].iloc[0])
  
print(len(selected_rentals), len(prev_day_list))

169400 169400


In [39]:
# Adding this information as a column
selected_rentals["prev_day"] = prev_day_list
selected_rentals[13:23]

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday,#_rentals_lag_2,prev_day
13,1,2024-01-02 20:00:00.000,2024,1,2,20,0,POINT (-73.9383 40.7923272),1.0,Tuesday,0,0,7.0,0.0
14,3,2024-01-03 08:00:00.000,2024,1,3,8,0,POINT (-73.9383 40.7923272),,Wednesday,0,0,,3.0
15,1,2024-01-03 10:00:00.000,2024,1,3,10,0,POINT (-73.9383 40.7923272),3.0,Wednesday,0,0,,0.0
16,3,2024-01-03 12:00:00.000,2024,1,3,12,0,POINT (-73.9383 40.7923272),1.0,Wednesday,0,0,3.0,6.0
17,0,2024-01-03 14:00:00.000,2024,1,3,14,0,POINT (-73.9383 40.7923272),3.0,Wednesday,0,0,1.0,6.0
18,5,2024-01-03 16:00:00.000,2024,1,3,16,0,POINT (-73.9383 40.7923272),0.0,Wednesday,0,0,3.0,7.0
19,1,2024-01-03 18:00:00.000,2024,1,3,18,0,POINT (-73.9383 40.7923272),5.0,Wednesday,0,0,0.0,1.0
20,0,2024-01-03 20:00:00.000,2024,1,3,20,0,POINT (-73.9383 40.7923272),1.0,Wednesday,0,0,5.0,1.0
21,4,2024-01-04 08:00:00.000,2024,1,4,8,0,POINT (-73.9383 40.7923272),,Thursday,0,0,,3.0
22,1,2024-01-04 10:00:00.000,2024,1,4,10,0,POINT (-73.9383 40.7923272),4.0,Thursday,0,0,,1.0


In [40]:
# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "prev_day"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.4637038214285716 0.327460261754435
3.8511207142857145 0.45397119641665507


The **prev_day** variable is also a very important feature, since the performance increase over the baseline is considerable.

#### Evaluating performance for *prev_week* variable
Python provides functionality to calculate the date that lies 7 days before a specified date, as shown in the cell below. With this we can look up the rental values for the same day of the week and time, 7 days earlier!

In [41]:
# Retrieving date of the previous week, for a specified date

from datetime import timedelta
pd.to_datetime("2024-01-02 20:00:00.000") - timedelta(7)

Timestamp('2023-12-26 20:00:00')

In [42]:
# Trying to convert each item in the datetime column into datetime objects
datetime_obj = [pd.to_datetime(x) for x in selected_rentals.datetime]
datetime_obj[:3]

[Timestamp('2024-01-01 08:00:00'),
 Timestamp('2024-01-01 10:00:00'),
 Timestamp('2024-01-01 12:00:00')]

In [43]:
# Re-assigning the values to datetime column
selected_rentals["datetime"] = datetime_obj
type(selected_rentals["datetime"].iloc[0]) # The plain str object is converted to a Timestamp

pandas._libs.tslibs.timestamps.Timestamp
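The conversion above goes through the column item by item. As a sketch of an alternative, `pd.to_datetime` also accepts a whole column, and the `.dt` accessor then exposes the date parts. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy column holding timestamps as strings, in the same format as the dataset
toy = pd.DataFrame({"datetime": ["2024-01-01 08:00:00.000", "2024-01-01 10:00:00.000"]})

# Convert the whole column in one call instead of element by element
toy["datetime"] = pd.to_datetime(toy["datetime"])

# Date parts are then available through the .dt accessor
print(toy["datetime"].dt.hour.tolist())
```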

In [44]:
# What the updated item looks like
selected_rentals.datetime[0]

Timestamp('2024-01-01 08:00:00')

In [45]:
# Initiating an empty list
prev_week_list = []

# Adding rental values from the previous week
for i in selected_rentals.month.unique():
    for j in selected_rentals[selected_rentals.month == i].ID.unique():
        for k in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j)].day.unique():
            for l in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k)].hour.unique():
                if (i == 1) & (k <= 7):
                    prev_week_list.append(np.nan)
                else:
                    time_obj = selected_rentals.loc[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k) & (selected_rentals.hour == l), "datetime"].iloc[0]
                    delta_obj = time_obj - timedelta(7)        # This is a timestamp object
                    prev_week_list.append(selected_rentals.loc[(selected_rentals.month == delta_obj.month) & (selected_rentals.ID == j) & (selected_rentals.day == delta_obj.day) & (selected_rentals.hour == delta_obj.hour), "#_rentals"].iloc[0])
  
print(len(selected_rentals), len(prev_week_list))

169400 169400


In [46]:
# Having a look at the generated values
print(
    prev_week_list[39:48], # The first 49 values are going to be NaN
    prev_week_list[49:59] # Data is available from this section onwards
)

[nan, nan, nan, nan, nan, nan, nan, nan, nan] [0, 0, 0, 0, 0, 0, 0, 3, 0, 6]


This method still produces *NaN* values, since no previous-week data is available for the first week of January 2024. The problem is limited to that week, however.
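The nested loop above runs one DataFrame filter per row. As a possible alternative, the same lookup can be sketched with a `(ID, datetime)` index and `reindex`: keys that do not exist (here, dates before the first observation) come back as *NaN*. The toy frame and its values are made up, and the sketch assumes each `(ID, datetime)` pair occurs only once.

```python
import pandas as pd

# Toy data: two stations, weekly observations at 08:00 (made-up values)
toy = pd.DataFrame({
    "ID": [0, 0, 0, 9, 9],
    "datetime": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-08 08:00", "2024-01-15 08:00",
        "2024-01-01 08:00", "2024-01-08 08:00",
    ]),
    "#_rentals": [2, 5, 7, 1, 4],
})

# Rentals indexed by station and timestamp
lookup = toy.set_index(["ID", "datetime"])["#_rentals"]

# For every row, build the key "same station, exactly 7 days earlier"
week_ago = pd.MultiIndex.from_arrays(
    [toy["ID"], toy["datetime"] - pd.Timedelta(days=7)]
)

# reindex returns NaN wherever the key is not present in the data
toy["prev_week"] = lookup.reindex(week_ago).to_numpy()
print(toy)
```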

In [47]:
# Creating column for value assignment
selected_rentals["prev_week"] = prev_week_list
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday,#_rentals_lag_2,prev_day,prev_week
0,0,2024-01-01 08:00:00,2024,1,1,8,0,POINT (-73.9383 40.7923272),,Monday,0,1,,,
1,0,2024-01-01 10:00:00,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,,,
2,0,2024-01-01 12:00:00,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,
3,0,2024-01-01 14:00:00,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,
4,0,2024-01-01 16:00:00,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,


In [48]:
# Verifying if the values are correct
print(
    selected_rentals.loc[(selected_rentals.ID==55)&(selected_rentals.month==2)&(selected_rentals.day==19)&(selected_rentals.hour == 16), "prev_week"].iloc[0],
    selected_rentals.loc[(selected_rentals.ID==55)&(selected_rentals.month==2)&(selected_rentals.day==12)&(selected_rentals.hour == 16), "#_rentals"].iloc[0]
)

print(
    selected_rentals.loc[(selected_rentals.ID==1881)&(selected_rentals.month==3)&(selected_rentals.day==4)&(selected_rentals.hour == 12), "prev_week"].iloc[0],
    selected_rentals.loc[(selected_rentals.ID==1881)&(selected_rentals.month==2)&(selected_rentals.day==26)&(selected_rentals.hour == 12), "#_rentals"].iloc[0]
)

# Works well!

1.0 1
1.0 1


In [49]:
# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "prev_week"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.497626642857143 0.3182000393016835
4.097686583333334 0.4190120049321757


Performance has improved considerably for the February data, but it goes down a little for March. This suggests that some **hyperparameter tuning** might be needed. For now, this feature is kept as **optional**.
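A minimal sketch of what such tuning could look like is given below, using `GridSearchCV` with `TimeSeriesSplit`. The data are synthetic stand-ins for `X_train` / `y_train`, and the parameter grid is a placeholder rather than a recommendation. `TimeSeriesSplit` assumes the rows are in time order, which would require sorting the real data by `datetime` first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for X_train / y_train (made-up, time-ordered rows)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=120)

# Hypothetical grid; the values are placeholders, not tuned recommendations
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=2),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),   # each fold validates on later rows only
    scoring="r2",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```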

### Revisiting the *prev_day* feature
As seen above in this notebook, this feature has *NaN* values for the first day of every month. This problem is fixed here and the prediction performance is re-evaluated.

In [50]:
# Initiating an empty list
prev_day_list = []

# Adding rental values from the previous day
for i in selected_rentals.month.unique():
    for j in selected_rentals[selected_rentals.month == i].ID.unique():
        for k in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j)].day.unique():
            for l in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k)].hour.unique():
                if (i == 1) & (k == 1):
                    prev_day_list.append(np.nan)
                else:
                    time_obj = selected_rentals.loc[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k) & (selected_rentals.hour == l), "datetime"].iloc[0]
                    delta_obj = time_obj - timedelta(1)        # This is a timestamp object
                    prev_day_list.append(selected_rentals.loc[(selected_rentals.month == delta_obj.month) & (selected_rentals.ID == j) & (selected_rentals.day == delta_obj.day) & (selected_rentals.hour == delta_obj.hour), "#_rentals"].iloc[0])
  
print(len(selected_rentals), len(prev_day_list))

169400 169400
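The loop above looks up, for every row, the rentals of the same station exactly one day earlier. As a sketch of an alternative, the same result can be obtained with a self-merge: a copy of the data is re-dated one day later and joined back on station and timestamp, so month boundaries are handled automatically. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: one station, two time slots, spanning a month boundary (made-up values)
toy = pd.DataFrame({
    "ID": [0, 0, 0, 0],
    "datetime": pd.to_datetime([
        "2024-01-31 08:00", "2024-01-31 10:00",
        "2024-02-01 08:00", "2024-02-01 10:00",
    ]),
    "#_rentals": [3, 1, 4, 2],
})

# Copy of the observations, re-dated one day later: each copied row now sits
# on the timestamp that should receive it as its previous-day value
previous = toy[["ID", "datetime", "#_rentals"]].copy()
previous["datetime"] = previous["datetime"] + pd.Timedelta(days=1)
previous = previous.rename(columns={"#_rentals": "prev_day"})

# Left join keeps every original row; rows without a match get NaN
toy = toy.merge(previous, on=["ID", "datetime"], how="left")
print(toy)
```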


In [51]:
# Re-assigning values in the prev_day column
selected_rentals["prev_day"] = prev_day_list

In [52]:
# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "prev_day"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.4757876249999997 0.3241616355071344
3.8568653214285717 0.4531567007936965


### What does performance in combined feature space look like?
Combining all the relevant features to see what R2 score is produced for both the Feb and Mar data.

In [53]:
# creating dummies for ID and name_of_day
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "name_of_day", "weekend", "#_rentals_lag_2", "prev_day", "prev_week"]], columns = ["ID", "name_of_day"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.184174625 0.4037659000226975
3.683919130952381 0.4776777710161154


The R2 score for both months is still under 0.5. This suggests that spatio-temporal features are also required for better performance. In addition, tuning might gain more performance. Both of these steps should lead to an R2 score of > 0.5, but this remains to be seen!
<br><br>
Some research suggests that if only spatial and temporal information is used, then even with deep learning methods such as LSTM the expected prediction performance is likely to lie in the window ***R2 score: between 0.5 and 0.7***. <br> It is a false belief that good performance can only be achieved with deep learning frameworks. In reality, traditional models can work equally well or even outperform deep learning based methods.

### Adding even more features
#### Rolling Average
Until now, we have considered time-lag values by hour, day and week. We can also consider a rolling average of rentals over the previous seven days for a given hour. For example, for *hour 10*, what was the average at this hour over the last 7 days? The average rolls forward with each passing day.

In [54]:
# Initiating an empty list
roll_avg_list = []

# Adding avg rental values from previous 7 days
for i in selected_rentals.month.unique():
    for j in selected_rentals[selected_rentals.month == i].ID.unique():
        for k in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j)].day.unique():
            for l in selected_rentals[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k)].hour.unique():
                if (i == 1) & (k <= 7):
                    roll_avg_list.append(np.nan) # Values will be missing for the first week of Jan
                else:
                    time_obj = selected_rentals.loc[(selected_rentals.month == i) & (selected_rentals.ID == j) & (selected_rentals.day == k) & (selected_rentals.hour == l), "datetime"].iloc[0]
                    # Rentals at the same hour on each of the previous 7 days
                    past_rentals = []
                    for d in range(1, 8):
                        delta_obj = time_obj - timedelta(d)        # This is a timestamp object
                        past_rentals.append(selected_rentals.loc[(selected_rentals.month == delta_obj.month) & (selected_rentals.ID == j) & (selected_rentals.day == delta_obj.day) & (selected_rentals.hour == delta_obj.hour), "#_rentals"].iloc[0])
                    roll_avg_list.append(np.mean(past_rentals)) # rolling average
  
print(len(selected_rentals), len(roll_avg_list))

169400 169400
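As a sketch of an alternative to the loop above, the same rolling average can be written with `groupby`, `shift` and `rolling`. This assumes that every station has exactly one row per day for each time slot and that the rows are sorted by time within each group. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy data: one station, one time slot, nine consecutive days (made-up values)
toy = pd.DataFrame({
    "ID": [0] * 9,
    "hour": [10] * 9,
    "datetime": pd.date_range("2024-01-01 10:00", periods=9, freq="D"),
    "#_rentals": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
toy = toy.sort_values(["ID", "hour", "datetime"])

# shift(1) excludes the current day; rolling(7) then averages the 7 days before it
toy["roll_avg"] = (
    toy.groupby(["ID", "hour"])["#_rentals"]
       .transform(lambda s: s.shift(1).rolling(7).mean())
)
print(toy[["datetime", "#_rentals", "roll_avg"]])
```

As in the loop-based version, the first seven days of each station and time slot have no rolling average, because fewer than seven previous days are available.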


In [55]:
# Assigning it to a column in the dataset
selected_rentals["roll_avg"] = roll_avg_list

In [56]:
# creating dummies for ID
selected_rentals_dum = pd.get_dummies(selected_rentals[["#_rentals", "year", "month", "day", "hour", "ID", "#_rentals_lag_1", "roll_avg"]], columns = ["ID"], drop_first=False)

# Evaluating change in prediction performance: Feb
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 2), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

# Evaluating change in prediction performance: Mar
# training data
X_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_train = selected_rentals_dum.loc[(selected_rentals_dum["day"] <=25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

X_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3),list(selected_rentals_dum.columns)[1:]]
y_test = selected_rentals_dum.loc[(selected_rentals_dum["day"] >25) & (selected_rentals_dum["month"] == 3), "#_rentals"]

# training the model
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=2) 
forest.fit(X_train, y_train)

# testing performance
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
y_test_pred = forest.predict(X_test)
print(mean_squared_error(y_test, y_test_pred),r2_score(y_test, y_test_pred))

2.3313080892857143 0.3636015342019311
3.8761801904761906 0.45041815385635486


The rolling average results in a considerable increase over baseline performance. This suggests it is a very strong and important feature for predicting rentals!

In [57]:
selected_rentals.head()

Unnamed: 0,#_rentals,datetime,year,month,day,hour,ID,coordinates,#_rentals_lag_1,name_of_day,weekend,is_holiday,#_rentals_lag_2,prev_day,prev_week,roll_avg
0,0,2024-01-01 08:00:00,2024,1,1,8,0,POINT (-73.9383 40.7923272),,Monday,0,1,,,,
1,0,2024-01-01 10:00:00,2024,1,1,10,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,,,,
2,0,2024-01-01 12:00:00,2024,1,1,12,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,,
3,0,2024-01-01 14:00:00,2024,1,1,14,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,,
4,0,2024-01-01 16:00:00,2024,1,1,16,0,POINT (-73.9383 40.7923272),0.0,Monday,0,1,0.0,,,


In [58]:
# Exporting the df with temporal features for later use:
selected_rentals.to_csv('rentals_solving_the_problem_with_lag_var.csv', index=False)