## Modeling and Validation / Cross-Validation

In [40]:
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
import numpy as np

In [48]:
data = pd.read_parquet(Path("data") / "processed_data.parquet")
data.head()

Unnamed: 0,counter_id,counter_name,site_id,site_name,bike_count,date,counter_installation_date,counter_technical_id,latitude,longitude,...,hour_day_of_week_Saturday_interaction,hour_day_of_week_Sunday_interaction,hour_day_of_week_Thursday_interaction,hour_day_of_week_Tuesday_interaction,hour_day_of_week_Wednesday_interaction,site_mean_count,site_std_count,is_weekend,is_rush_hour,is_holiday
0,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 02:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.0,0.0,0.0,2.0,0.0,21.785157,35.345153,0.0,0,0
1,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,1.0,2020-09-01 03:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.0,0.0,0.0,3.0,0.0,21.785157,35.345153,0.0,0,0
2,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,0.0,2020-09-01 04:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.0,0.0,0.0,4.0,0.0,21.785157,35.345153,0.0,0,0
3,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,4.0,2020-09-01 15:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.0,0.0,0.0,15.0,0.0,21.785157,35.345153,0.0,1,0
4,100007049-102007049,28 boulevard Diderot E-O,100007049,28 boulevard Diderot,9.0,2020-09-01 18:00:00,2013-01-18,Y2H15027244,48.846028,2.375429,...,0.0,0.0,0.0,18.0,0.0,21.785157,35.345153,0.0,1,0


In [49]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496827 entries, 0 to 496826
Data columns (total 39 columns):
 #   Column                                  Non-Null Count   Dtype         
---  ------                                  --------------   -----         
 0   counter_id                              496827 non-null  category      
 1   counter_name                            496827 non-null  category      
 2   site_id                                 496827 non-null  int64         
 3   site_name                               496827 non-null  category      
 4   bike_count                              496827 non-null  float64       
 5   date                                    496827 non-null  datetime64[us]
 6   counter_installation_date               496827 non-null  datetime64[us]
 7   counter_technical_id                    496827 non-null  category      
 8   latitude                                496827 non-null  float64       
 9   longitude                            

### Define Features and Target

In [50]:
X = data.drop(columns=['bike_count', 'log_bike_count'])  
y = data['log_bike_count']

In [51]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

Training set size: (397461, 37)
Test set size: (99366, 37)


In [52]:
# Same split as the previous cell; train_test_split is already imported in the first cell
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,       # 20% of data goes to the test set
    random_state=42      # For reproducibility
)
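The split above shuffles rows at random, so neighbouring hours from the same counter can end up on both sides. For hourly counts, a chronological split may give a more realistic estimate. The following is a minimal sketch under stated assumptions: `time_based_split` is a hypothetical helper, not part of scikit-learn, and the frame is synthetic data standing in for the real one.

```python
import numpy as np
import pandas as pd


def time_based_split(df, date_col="date", test_fraction=0.2):
    """Hypothetical helper: the most recent rows become the test set."""
    ordered = df.sort_values(date_col)
    n_test = int(round(len(ordered) * test_fraction))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Synthetic hourly rows standing in for the real frame (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2020-09-01", periods=100, freq=pd.Timedelta(hours=1)),
    "log_bike_count": rng.random(100),
})
demo = demo.sample(frac=1.0, random_state=0)  # shuffle so the sort has work to do

train_df, test_df = time_based_split(demo)
print(f"Train ends:  {train_df['date'].max()}")
print(f"Test starts: {test_df['date'].min()}")
```

When several counters share the same timestamps, cutting on a date threshold rather than a row position would avoid splitting a single hour across both sets.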

In [53]:
# Drop identifier, name and datetime columns that are not numeric model inputs
cols_to_drop = ['counter_id', 'counter_name', 'site_name', 'date', 'counter_installation_date', 'counter_technical_id']
X_train = X_train.drop(columns=cols_to_drop, errors='ignore')
X_test = X_test.drop(columns=cols_to_drop, errors='ignore')

### Linear Model 

In [54]:
from sklearn.linear_model import LinearRegression

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred_log = model.predict(X_test)

# Reverse the log transformation for predictions and actuals
y_pred = np.expm1(y_pred_log)
y_actual = np.expm1(y_test)

# Evaluate RMSE
rmse = root_mean_squared_error(y_actual, y_pred)
print(f"Baseline RMSE (original scale): {rmse}")

Baseline RMSE (original scale): 1.3227741291029847e-07


In [55]:
print(f"Actual values (first 10): {y_actual[:10]}")
print(f"Predicted values (first 10): {y_pred[:10]}")

Actual values (first 10): 71518     182.0
327824     19.0
326993      4.0
367127    291.0
293142     63.0
238328      0.0
14989     150.0
447479    167.0
79889     185.0
4728        0.0
Name: log_bike_count, dtype: float64
Predicted values (first 10): [ 1.82000000e+02  1.90000000e+01  3.99999999e+00  2.91000001e+02
  6.29999999e+01 -2.72892908e-09  1.50000000e+02  1.67000000e+02
  1.85000000e+02 -3.67352815e-10]
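The RMSE above is computed after `np.expm1`, so it is expressed in bike counts rather than in log units. As a rough illustration of how different the two scales can be, here is a small sketch on made-up numbers. It assumes the log target was built with `np.log1p`, which this notebook implies but does not show, and `rmse` is a local helper written for the example.

```python
import numpy as np


def rmse(a, b):
    """Root mean squared error between two array-likes."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Made-up counts and predictions, for illustration only.
counts_true = np.array([0.0, 4.0, 19.0, 63.0, 182.0])
counts_pred = np.array([1.0, 3.0, 25.0, 50.0, 200.0])

# Forward transform (assumed to be log1p) and its inverse (expm1).
log_true = np.log1p(counts_true)
log_pred = np.log1p(counts_pred)

rmse_log = rmse(log_true, log_pred)
rmse_counts = rmse(np.expm1(log_true), np.expm1(log_pred))

print(f"RMSE on the log scale:   {rmse_log:.3f}")
print(f"RMSE on the count scale: {rmse_counts:.3f}")
```

The same predictions give very different numbers on the two scales, so a score is only comparable with another one computed on the same scale.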


An RMSE of 1.32e-07 on the original scale is effectively zero, and the predictions above match the actual counts to several decimal places. That is too good to be plausible. Because the score is measured on the held-out test set, it points to information about the target reaching the model rather than to overfitting, which would show up as a low training error but a high test error. For comparison, the starting kit reported an RMSE of around 0.73 on its validation set.

Possible Reasons for the Very Low RMSE:

- Data Leakage: A column derived directly from `bike_count` or `log_bike_count` may still be in `X`, since only those two columns are dropped in In [50]. Aggregates such as `site_mean_count` and `site_std_count` also deserve a check if they were computed over the full dataset before the split.
- Target or Feature Transformations: If `log_bike_count` or any feature was rescaled during preprocessing, the RMSE is no longer expressed in bike counts and cannot be compared directly with other scores.
- Improper Validation Split: `train_test_split()` shuffles rows at random, so hours from the same counter and day can land in both sets. The starting kit splits by time instead, which better reflects the temporal nature of the data.
- Issue in Target Transformation: `np.expm1()` is the inverse of `np.log1p()`, so it only recovers counts if `log_bike_count` was built with `log1p`. It must be applied to both the predictions (`y_pred`) and the true values (`y_actual`) before evaluating RMSE, as In [54] does.
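The data-leakage hypothesis above can be probed by checking whether any single feature is almost perfectly correlated with the target. The sketch below is one way to do that, under stated assumptions: `suspicious_features` is a hypothetical helper, the 0.99 threshold is arbitrary, and the data is synthetic. It only catches leaks that sit in a single column; a target that can be rebuilt from a combination of columns would not be flagged.

```python
import numpy as np
import pandas as pd


def suspicious_features(X, y, threshold=0.99):
    """Hypothetical helper: numeric columns almost perfectly correlated with y."""
    numeric = X.select_dtypes(include="number")
    corr = numeric.corrwith(y).abs()
    return corr[corr > threshold].sort_values(ascending=False)


# Synthetic data (made up): one harmless feature and one rescaled copy of the target.
rng = np.random.default_rng(42)
y_demo = pd.Series(rng.random(200), name="log_bike_count")
X_demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=200),
    "leaky_copy": 2.0 * y_demo + 1.0,
})

flagged = suspicious_features(X_demo, y_demo)
print(flagged)
```

On the real data, the equivalent call would be `suspicious_features(X_train, y_train)`; any column it returns is worth tracing back to the preprocessing step that created it.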