# **Group Assignment** - Bike Sharing

- `instant`: record index
- `dteday` : date
- `season` : season (1:winter, 2:spring, 3:summer, 4:fall)
- `yr` : year (0: 2011, 1: 2012)
- `mnth` : month (1 to 12)
- `hr` : hour (0 to 23)
- `holiday` : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- `weekday` : day of the week
- `workingday` : 1 if the day is neither a weekend nor a holiday, otherwise 0
- `weathersit` : weather situation
	- 1: Clear, Few clouds, Partly cloudy
	- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
	- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
	- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- `temp` : Normalized temperature in Celsius. The values are divided by 41 (max)
- `atemp`: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- `hum`: Normalized humidity. The values are divided by 100 (max)
- `windspeed`: Normalized wind speed. The values are divided by 67 (max)
- `casual`: count of casual users
- `registered`: count of registered users
- `cnt`: count of total rental bikes including both casual and registered

In [60]:
#------------------------------#
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import numpy as np
import pandas as pd
from pandas import read_csv
#------------------------------#
from scipy.stats import uniform, randint

#------------------------------#
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold, RandomizedSearchCV,train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
#------------------------------# 
import joblib #Stores model
#------------------------------#
import warnings
warnings.filterwarnings('ignore')

# Set the option to display all columns
pd.set_option('display.max_columns', None)

In [61]:
data=read_csv('bike_sharing_hourly.csv')

In [62]:
data.info() #no nulls

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB


In [63]:
data = data.astype({
    'season': 'object',
    'yr': 'object',
    'holiday': 'object',
    'weekday': 'object',
    'workingday': 'object',
    'weathersit': 'object',
    'mnth': 'object'
})

## PART I: Exploratory Data Analysis

#### First, we remove the `instant` column, since `instant = index + 1`, and convert `dteday` to datetime format

In [64]:
data.drop('instant', axis=1, inplace=True)
data.dteday = pd.to_datetime(data.dteday)

In [65]:
data.head()

Unnamed: 0,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


In [66]:
data['dteday'] = data['dteday'] + pd.to_timedelta(data['hr'], unit='h')
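
The cell above folds the hour of day into the timestamp. As a minimal sketch, the same operation on two hand-written rows (toy values, not taken from the dataset) shows what `pd.to_timedelta(..., unit='h')` adds to the midnight dates:

```python
import pandas as pd

# Toy frame (illustrative values only): a date column and an hour-of-day column
toy = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-01", "2011-01-01"]),
    "hr": [0, 4],
})

# Convert the integer hours to timedeltas and add them to the midnight timestamps
toy["dteday"] = toy["dteday"] + pd.to_timedelta(toy["hr"], unit="h")
print(toy)
```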

In [67]:
data.head()

Unnamed: 0,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2011-01-01 00:00:00,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1


#### Unnormalize values

In [68]:
columns = ['temp', 'atemp', 'hum', 'windspeed']  # normalized columns
max_values = [41, 50, 100, 67]  # divisors from the data dictionary, in the same order
data[columns] = data[columns].mul(max_values)
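
As a quick sanity check of this step, the sketch below applies the same multiplication to the first row printed by `data.head()` earlier (0.24, 0.2879, 0.81, 0.0). The one-row frame `row` is a stand-in built by hand for illustration, not part of the notebook.

```python
import pandas as pd

# First row of the normalized data as printed by data.head() earlier in the notebook
row = pd.DataFrame({"temp": [0.24], "atemp": [0.2879], "hum": [0.81], "windspeed": [0.0]})

columns = ["temp", "atemp", "hum", "windspeed"]
max_values = [41, 50, 100, 67]  # divisors listed in the data dictionary

# .mul with a list multiplies column by column, in the order of `columns`
restored = row[columns].mul(max_values)
print(restored.round(3))
```

The order of `max_values` has to match the order of `columns`, since the list is aligned by position rather than by name.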

#### All `mnth` column values match the month in the `dteday` column.

In [69]:
all((data['dteday'].dt.month ==  data['mnth']))

True

In [70]:
data.head()

Unnamed: 0,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,2011-01-01 00:00:00,1,0,1,0,0,6,0,1,9.84,14.395,81.0,0.0,3,13,16
1,2011-01-01 01:00:00,1,0,1,1,0,6,0,1,9.02,13.635,80.0,0.0,8,32,40
2,2011-01-01 02:00:00,1,0,1,2,0,6,0,1,9.02,13.635,80.0,0.0,5,27,32
3,2011-01-01 03:00:00,1,0,1,3,0,6,0,1,9.84,14.395,75.0,0.0,3,10,13
4,2011-01-01 04:00:00,1,0,1,4,0,6,0,1,9.84,14.395,75.0,0.0,0,1,1


#### `temp` and `atemp` are very highly correlated, so we can drop `atemp` for now.

In [71]:
def plot_temperature_by_month_year(month, year):
    filtered_data = data[(data['mnth'] == month) & (data['yr'] == year)]
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=filtered_data['dteday'], y=filtered_data['temp'], name='Temperature'))
    fig.add_trace(go.Scatter(x=filtered_data['dteday'], y=filtered_data['atemp'], name='Feeling Temperature'))
    
    fig.update_layout(title='Temperature and Feeling Temperature Over Time',
                      xaxis_title='Date',
                      yaxis_title='Temperature (Celsius)')
    
    fig.show()
plot_temperature_by_month_year(2, 0)
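
The plot gives a visual impression; a correlation coefficient can quantify it. As a sketch, the snippet below computes the Pearson correlation on the five denormalized rows printed by `data.head()`, which is far too small a sample to be conclusive. On the notebook's full frame the equivalent call would be `data['temp'].corr(data['atemp'])`.

```python
import pandas as pd

# The five denormalized rows shown by data.head(); a tiny sample, for illustration only
sample = pd.DataFrame({
    "temp":  [9.84, 9.02, 9.02, 9.84, 9.84],
    "atemp": [14.395, 13.635, 13.635, 14.395, 14.395],
})

# Pearson correlation between the two temperature columns
corr = sample["temp"].corr(sample["atemp"])
print(round(corr, 4))
```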

#### We noticed an error in the documented meaning of the `season` values. The actual mapping is {1:winter, 2:spring, 3:summer, 4:fall}, as the late-December records below all have season 1

In [72]:
start_date = pd.to_datetime('2011-12-21 00:00:00')
end_date = pd.to_datetime('2012-01-01 23:00:00')
filtered_data = data[(data['dteday'] >= start_date) & (data['dteday'] <= end_date)]
print(filtered_data.season.unique())

[1]


#### As category 4 of `weathersit` only has 3 records, we decided to merge it with category 3

In [73]:
data.weathersit.value_counts()

weathersit
1    11413
2     4544
3     1419
4        3
Name: count, dtype: int64

In [74]:
data.loc[data['weathersit'] == 4, 'weathersit'] = 3
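
To illustrate the effect of the merge, the sketch below rebuilds a series from the category counts reported by `value_counts()` and folds category 4 into category 3. It uses `Series.replace` rather than the `.loc` assignment above; both are intended to give the same result, and the series `weathersit` here is a reconstruction for illustration, not the notebook's column.

```python
import numpy as np
import pandas as pd

# Rebuild a series with the category counts reported by value_counts() above
weathersit = pd.Series(np.repeat([1, 2, 3, 4], [11413, 4544, 1419, 3]))

# Map the rare category 4 onto category 3; other categories are left untouched
merged = weathersit.replace({4: 3})
counts = merged.value_counts()
print(counts)
```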

#### Humidity vs cnt (no correlation)

In [75]:
def plot_humidity_vs_count(data, month, year):
    filtered_data = data[(data['mnth'] == month) & (data['yr'] == year)]
    
    fig = go.Figure(data=go.Scatter(x=filtered_data['hum'], y=filtered_data['cnt'], mode='markers'))
    fig.update_layout(title='Humidity vs Total Count', xaxis_title='Humidity', yaxis_title='Counts')
    fig.show()
plot_humidity_vs_count(data, 2, 1)

#### Wind vs Cnt

In [76]:
def plot_wind_by_month_year(data, month, year):
    filtered_data = data[(data['mnth'] == month) & (data['yr'] == year)]
    
    fig = go.Figure(data=go.Scatter(x=filtered_data['windspeed'], y=filtered_data['cnt'], mode='markers'))
    fig.update_layout(title='Wind vs Users', xaxis_title='Wind', yaxis_title='Total User')
    fig.show()
plot_wind_by_month_year(data, 9, 1)

There is a slight difference in demand between high and low wind speeds. We considered splitting `windspeed` into two categories, but the drawbacks outweighed the benefits.

#### Boxplot of bike usage depending on the weather situation

In [77]:
def plot_bike_counts_by_weather(data, month, year):
    filtered_data = data[(data['mnth'] == month) & (data['yr'] == year)]
    
    fig = go.Figure()
    
    for category in filtered_data['weathersit'].unique():
        fig.add_trace(go.Box(
            x=filtered_data[filtered_data['weathersit'] == category]['weathersit'],
            y=filtered_data[filtered_data['weathersit'] == category]['cnt'],
            name=f"Weathersit {category}"
        ))

    fig.update_layout(
        title="Distribution of bike counts for each category of weather situation",
        xaxis_title="Weather situation",
        yaxis_title="Counts"
    )

    fig.show()
plot_bike_counts_by_weather(data, 1, 0)

#### Visual representation of demand by customer type (casual or registered)

In [39]:
def generate_stacked_area_chart(month, year):
    filtered_data = data[(data['dteday'].dt.month == month) & (data['dteday'].dt.year == year)]
    fig = px.area(filtered_data, x='dteday', y=['casual', 'registered'], title=f'Stacked Area Chart - Casual, Registered {month}/{year}')
    fig.update_layout(xaxis_title='Date', yaxis_title='Counts')
    fig.show()

In [40]:
generate_stacked_area_chart(7, 2011)

#### Percentage of casual users per month

In [41]:
mean_casual = data['casual'].mean()
mean_regular = data['registered'].mean()

print("Mean amount for casual users:", round(mean_casual, 0))
print("Mean amount for regular users:", round(mean_regular,0))

Mean amount for casual users: 36.0
Mean amount for regular users: 154.0


In [42]:
data['year'] = data['dteday'].dt.year
data['month'] = data['dteday'].dt.month

monthly_data = data.groupby(['year', 'month']).agg(total_users=('cnt', 'sum'),
                                                 casual_users=('casual', 'sum'))

monthly_data['casual_percentage'] = (monthly_data['casual_users'] / monthly_data['total_users']) * 100

monthly_data.reset_index(inplace=True)

fig = px.line(monthly_data, x='month', y='casual_percentage', color='year', line_group='year',
              labels={'casual_percentage': 'Casual User (%)', 'month': 'Month', 'year': 'Year'}) 
fig.update_traces(mode='markers+lines')
fig.add_shape(type='line',
              x0=0,
              x1=13,
              y0=19.88,
              y1=19.88,
              line=dict(color='blue', width=1.5, dash='dash'))
fig.add_shape(type='line',
              x0=0,
              x1=13,
              y0=18.18,
              y1=18.18,
              line=dict(color='red', width=1.5, dash='dash'))
fig.show()
data.drop(columns=['year','month'], axis=1,inplace=True)
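
The two dashed reference lines above are drawn at hard-coded heights. If they are meant to represent each year's overall casual share (an assumption; the notebook does not say what they are), they could be derived from the data instead of typed in. The sketch below shows that calculation on a made-up frame `toy`; the counts are invented for illustration.

```python
import pandas as pd

# Toy data (made-up counts, for illustration only)
toy = pd.DataFrame({
    "year":   [2011, 2011, 2012, 2012],
    "casual": [10, 30, 10, 20],
    "cnt":    [100, 100, 100, 100],
})

# Overall casual share per year, in percent
yearly = toy.groupby("year")[["casual", "cnt"]].sum()
yearly_share = yearly["casual"] / yearly["cnt"] * 100
print(yearly_share)
```

Each resulting value could then be used as the `y0`/`y1` of the corresponding `fig.add_shape` call.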

In addition, the datetime column will not be used from now on, since the predictions will be made at most monthly.

In [43]:
data.drop(columns=['dteday','yr','atemp'], axis=1,inplace=True)

#### Here we split the dataframe

In [44]:
data_casual= data.drop(columns=['registered','cnt'], axis=1)
data_regular=data.drop(columns=['casual','cnt'], axis=1)

#### Correlation matrices

In [45]:
px.scatter_matrix(data_casual, color="casual", height=1200, template="none")

In [46]:
fig = px.imshow(data_casual.corr())
fig.update_layout(title="Correlation Matrix - Casual user")
fig.update_layout(width=1200, height=600)
fig.show()

In [47]:
fig = px.imshow(data_regular.corr())
fig.update_layout(title="Correlation Matrix - Regular user")
fig.update_layout(width=1200, height=600)
fig.show()

## PART II: Prediction Model (CASUAL)

#### Feature and target split for casual users

In [48]:
X_casual = data_casual.drop(columns=['casual','holiday'], axis=1)
y_casual = data_casual["casual"]

In [49]:
X_casual.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   season      17379 non-null  object 
 1   mnth        17379 non-null  object 
 2   hr          17379 non-null  int64  
 3   weekday     17379 non-null  object 
 4   workingday  17379 non-null  object 
 5   weathersit  17379 non-null  object 
 6   temp        17379 non-null  float64
 7   hum         17379 non-null  float64
 8   windspeed   17379 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 1.2+ MB


#### Train-test split

In [50]:
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_casual, y_casual, test_size=0.2, random_state=99)

#### Column transformation

In [51]:
def transform(X):
    categorical_columns = X.select_dtypes(["O"]).columns
    numerical_columns = X.select_dtypes(["int", "float"]).columns
    preprocessing = ColumnTransformer(
        [
            # One-hot encode the categorical (object) columns
            ("ohe", OneHotEncoder(sparse_output=False), categorical_columns),
            # Scale the numerical columns to the [0, 1] range
            ("scaler", MinMaxScaler(), numerical_columns),
        ],
        remainder="passthrough",  # this can be "drop", "passthrough", or another Estimator
    )

    return preprocessing

In [52]:
preprocessing=transform(X_casual)

#### Prediction error metrics

In [53]:
def my_prediction(model, X_train, y_train, X_test, y_test):
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    print('Metrics for the train set')
    print("------------------------")    
    print(f"R2: {r2_score(y_train, y_train_pred)}")
    print(f"RMSE: {mean_squared_error(y_train, y_train_pred,squared=False)}")
    print(f"MAE: {mean_absolute_error(y_train, y_train_pred)}")
    
    print()
    
    print('Metrics for the test set')
    print("------------------------")
    print(f"R2: {r2_score(y_test, y_test_pred)}")
    print(f"RMSE: {mean_squared_error(y_test, y_test_pred,squared=False)}")
    print(f"MAE: {mean_absolute_error(y_test, y_test_pred)}")
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=y_test.values,
        y=y_test_pred,
        mode="markers",
        name="Samples"
    ))
    fig.add_trace(go.Scatter(
        x=y_test.values,
        y=y_test.values,
        mode="lines",
        name="Reference"
    ))
    fig.update_layout(title="True -vs- Predicted Outcome", xaxis_title="True", yaxis_title="Predicted", template="none")
    return fig
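
`my_prediction` relies on `mean_squared_error(..., squared=False)`. In recent scikit-learn releases the `squared` argument is deprecated and later removed, so the call may fail on a newer installation. A version-independent alternative is to take the square root explicitly, as in the sketch below; the helper name `rmse` is hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, computed without the `squared` keyword."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Toy values for illustration
print(rmse([1, 2, 3], [1, 2, 5]))
```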

#### Pipeline: Model Training

In [54]:
class ModelTrainerRandomizedSearchCV:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.pipeline = Pipeline(steps=
                                 [("preprocessing", preprocessing),
                                  ('model', None)]
                                 )
        self.random_state = 42

    def train_evaluate_model(self, parameters):
        kfold = KFold(n_splits=5, random_state=self.random_state, shuffle=True)
        
        random_search = RandomizedSearchCV(self.pipeline, parameters, cv=kfold, scoring='neg_root_mean_squared_error', 
                                           refit=True, n_jobs=-1, n_iter=50, return_train_score=True,error_score='raise').fit(self.X, self.y)

        pipeline = random_search.best_estimator_
        
        mean_score_train = random_search.cv_results_['mean_train_score'][random_search.best_index_]
        print(f'The mean score error on the training is {-mean_score_train}\n')
        
        
        mean_score_test = random_search.cv_results_['mean_test_score'][random_search.best_index_]
        print(f'The mean score error on the validation is {-mean_score_test}\n')
      
        return pipeline


    def linearmodel(self):
        parameters = [{
            'model': [LinearRegression()]
            }]
        return self.train_evaluate_model(parameters)
    
    def params_random_forest(self):
        parameters = [{
            'model': [RandomForestRegressor(random_state=self.random_state)],   
            'model__n_estimators': randint(10, 200),   
            'model__max_depth': randint(2, 32),
            'model__min_samples_split' :  randint(2, 10),
            'model__min_samples_leaf' :  randint(1, 10),
            'model__max_features': ['sqrt',None],
            'model__bootstrap': [True, False]
        }]
        return self.train_evaluate_model(parameters)
    
    def params_light(self):
        parameters = [
            {
                'model': [LGBMRegressor(random_state=self.random_state)],
                'model__learning_rate': uniform(0.01, 0.1),
                'model__n_estimators': randint(100, 300),
                'model__num_leaves': randint(20, 40),
                'model__lambda_l1': randint(0,100),
                'model__lambda_l2': randint(0,100),
            }
        ]
        return self.train_evaluate_model(parameters)
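
The search spaces above are defined with `scipy.stats` distributions, whose arguments are easy to misread: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below samples from two of the distributions used in `params_light` to show the ranges they actually cover; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

learning_rate = uniform(0.01, 0.1)   # loc=0.01, scale=0.1 -> values in [0.01, 0.11]
num_leaves = randint(20, 40)         # integers from 20 to 39 inclusive

# Draw samples to inspect the range each distribution covers
lr_samples = learning_rate.rvs(size=1000, random_state=0)
leaf_samples = num_leaves.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(leaf_samples.min(), leaf_samples.max())
```

So `uniform(0.01, 0.1)` searches learning rates up to 0.11, not up to 0.1.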

#### Feature Importance

In [55]:
def feat_importance(model,X_train):
    feature_importance = model.feature_importances_

    # Get the names of the features
    X_train_transformed = pd.DataFrame(preprocessing.fit_transform(X_train), columns=preprocessing.get_feature_names_out())
    feature_names = X_train_transformed.columns

    feature_importance_dict = dict(zip(feature_names, feature_importance))

    sorted_feature_importance = dict(sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True))

    fig = go.Figure()
    fig.add_trace(go.Bar(x=list(sorted_feature_importance.keys()), y=list(sorted_feature_importance.values())))

    fig.update_layout(
        title='Feature Importance Plot',
        xaxis_title='Feature',
        yaxis_title='Feature Importance'
    )
    fig.show()

#### Random Forest Regressor

In [376]:
%%time
model_trainer = ModelTrainerRandomizedSearchCV(X_train_c, y_train_c)
pipeline_casual_rf = model_trainer.params_random_forest()

The mean score error on the training is 2.5899002179040993

The mean score error on the validation is 17.033340249320435

CPU times: total: 7.12 s
Wall time: 8min 50s


In [377]:
pipeline_casual_rf.fit(X_train_c,y_train_c)
joblib.dump(pipeline_casual_rf, 'models/model_casual_rf.pkl')

['models/model_casual_rf.pkl']

In [378]:
my_prediction(pipeline_casual_rf, X_train_c, y_train_c, X_test_c, y_test_c)

Metrics for the train set
------------------------
R2: 0.9969132601920505
RMSE: 2.7550679869230295
MAE: 1.3303312189779863

Metrics for the test set
------------------------
R2: 0.9005857335242461
RMSE: 15.17154371306727
MAE: 9.141815158916883


In [379]:
feat_importance(pipeline_casual_rf[-1], X_train_c)

#### Linear model

In [56]:
%%time
model_trainer = ModelTrainerRandomizedSearchCV(X_train_c, y_train_c)
pipeline_casual_linear = model_trainer.linearmodel()

The mean score error on the training is 35.635331589262265

The mean score error on the validation is 35.71187320518861

CPU times: total: 203 ms
Wall time: 8.47 s


In [57]:
pipeline_casual_linear.fit(X_train_c,y_train_c)
joblib.dump(pipeline_casual_linear, 'models/model_casual_linear.pkl')

['models/model_casual_linear.pkl']

In [58]:
my_prediction(pipeline_casual_linear,X_train_c, y_train_c, X_test_c, y_test_c)

Metrics for the train set
------------------------
R2: 0.4832825326368133
RMSE: 35.64584222129212
MAE: 24.369380709199454

Metrics for the test set
------------------------
R2: 0.46317817785859183
RMSE: 35.25498201435602
MAE: 24.489355581127732


#### LightGBM Regressor

In [428]:
%%time
model_trainer = ModelTrainerRandomizedSearchCV(X_train_c, y_train_c)
pipeline_casual_lightgbm= model_trainer.params_light()

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001754 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 238
[LightGBM] [Info] Number of data points in the train set: 13903, number of used features: 32
[LightGBM] [Info] Start training from score 36.049198
The mean score error on the training is 12.990663493846307

The mean score error on the validation is 17.065403861957478

CPU times: total: 3.92 s
Wall time: 1min 47s


In [429]:
pipeline_casual_lightgbm.fit(X_train_c, y_train_c)
joblib.dump(pipeline_casual_lightgbm, 'models/model_casual_lightgmr.pkl')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.012580 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 238
[LightGBM] [Info] Number of data points in the train set: 13903, number of used features: 32
[LightGBM] [Info] Start training from score 36.049198


['models/model_casual_lightgmr.pkl']

In [430]:
my_prediction(pipeline_casual_lightgbm, X_train_c, y_train_c, X_test_c, y_test_c)



Metrics for the train set
------------------------
R2: 0.9285737919938785
RMSE: 13.252905648370048
MAE: 8.105966890246847

Metrics for the test set
------------------------
R2: 0.8989287111256398
RMSE: 15.297459725621563
MAE: 9.128319497535633


In [431]:
feat_importance(pipeline_casual_lightgbm[-1], X_train_c)

There are a few things to note about these models.
1. After several rounds of adding and removing features according to the feature importance plots, we found that the best-performing models were the ones that kept as many variables as possible. The plots ranked `season`, `weathersit` and `weekday` as the least important features.
   
2. The linear model performs worst (test R2 of about 0.46), most likely because the features do not share a clear linear relationship with the target variable.
   
3. The random forest overfits: it does very well on the training set (RMSE of about 2.8) but much worse on the test set (RMSE of about 15.2). LightGBM reaches a similar test RMSE and MAE (about 15.3 and 9.1) with a much smaller gap to its training RMSE (about 13.3). For this reason we choose LightGBM as our predictive model.
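
The overfitting argument can be made explicit by comparing train and test RMSE for each model. The sketch below copies the values printed by `my_prediction` in the cells above and computes the gap between them; the dictionary layout is our own choice.

```python
# Train and test RMSE as reported by my_prediction for the casual-user models
rmse = {
    "random_forest": {"train": 2.7550679869230295, "test": 15.17154371306727},
    "linear": {"train": 35.64584222129212, "test": 35.25498201435602},
    "lightgbm": {"train": 13.252905648370048, "test": 15.297459725621563},
}

# Generalization gap: how much worse the model does on unseen data
gaps = {name: scores["test"] - scores["train"] for name, scores in rmse.items()}
for name, gap in gaps.items():
    print(f"{name}: test - train RMSE = {gap:.2f}")
```

A large positive gap, as for the random forest, signals overfitting; LightGBM reaches nearly the same test RMSE with a much smaller gap.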

## PART II: Prediction Model (REGULAR)

In [36]:
X_regular = data_regular.drop(columns=['registered','holiday'])
y_regular = data_regular["registered"]

In [37]:
X_regular.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   season      17379 non-null  object 
 1   mnth        17379 non-null  object 
 2   hr          17379 non-null  int64  
 3   weekday     17379 non-null  object 
 4   workingday  17379 non-null  object 
 5   weathersit  17379 non-null  object 
 6   temp        17379 non-null  float64
 7   hum         17379 non-null  float64
 8   windspeed   17379 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 1.2+ MB


#### Train-test split

In [38]:
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_regular, y_regular, test_size=0.2, random_state=99)

#### Column transformation

In [39]:
preprocessing=transform(X_regular)

#### LightGBM Regressor

In [40]:
%%time
model_trainer = ModelTrainerRandomizedSearchCV(X_train_r, y_train_r)
pipeline_regular_lightgbr = model_trainer.params_light()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000558 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 238
[LightGBM] [Info] Number of data points in the train set: 13903, number of used features: 32
[LightGBM] [Info] Start training from score 154.717903
The mean score error on the training is 46.69034461801641

The mean score error on the validation is 56.43656635370526

CPU times: total: 3 s
Wall time: 1min 38s


In [41]:
pipeline_regular_lightgbr.fit(X_train_r, y_train_r)
joblib.dump(pipeline_regular_lightgbr, 'models/model_lightgmr.pkl')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000688 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 238
[LightGBM] [Info] Number of data points in the train set: 13903, number of used features: 32
[LightGBM] [Info] Start training from score 154.717903


['models/model_lightgmr.pkl']

In [42]:
my_prediction(pipeline_regular_lightgbr, X_train_r, y_train_r, X_test_r, y_test_r)

Metrics for the train set
------------------------
R2: 0.9023289667768881
RMSE: 47.4011167167559
MAE: 31.856098284334998

Metrics for the test set
------------------------
R2: 0.8778299764231937
RMSE: 52.43337478450099
MAE: 34.938460421499485


In [43]:
feat_importance(pipeline_regular_lightgbr[-1], X_train_r)