# Machine Learning (Scikit-learn, XGBoost, LightGBM, CatBoost): Regressor

This document compares the performance of eight Machine Learning (ML) algorithms on a regression problem, that is, predicting the value of a continuous variable, which in this case is the fuel consumption of a car. The following algorithms are implemented:

Algorithm 1: Lasso (linear regression with L1 regularization).

Algorithm 2: Ridge (linear regression with L2 regularization).

Algorithm 3: DecisionTreeRegressor.

Algorithm 4: RandomForestRegressor.

Algorithm 5: GradientBoostingRegressor.

Algorithm 6: XGBRegressor (XGBoost library).

Algorithm 7: LGBMRegressor (LightGBM library).

Algorithm 8: CatBoostRegressor (CatBoost library).

## Exploratory data analysis (EDA)

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import joblib
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from lightgbm import LGBMRegressor

In [None]:
# Read DataFrame
df = pd.read_csv('../data/raw/auto_cons_us.csv', sep=',', header=0)

# Show DataFrame
df.head(3)

Column 'Fuel consumption' will be renamed to 'target'.

In [None]:
# Rename column
df = df.rename(columns={'Fuel consumption': 'target'})

# DataFrame information
df.info()

Format of all columns is correct.

## Data preprocessing
Data preprocessing consists of:
1. Filling null values and dropping duplicates.
2. Processing outliers and multicollinearity.
3. Converting categorical variables into binary ones.
4. Standardizing (scaling) the data.

### Fill null values and drop duplicates

In [None]:
# Verify the number of null values per column
df.isna().sum()

There are only a few null values, so the rows containing them can be deleted.

In [None]:
# Delete null values
df = df.dropna()

# Verify the number of null values per column
df.isna().sum()

In [None]:
# Delete duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# Verify the number of remaining duplicate rows
print(df.duplicated().sum())

### Process outliers and multicollinearity
A box plot is shown to check for outliers (there are no considerable outliers).

In [None]:
# Box plot
df.drop('target',axis=1).plot(kind='box', figsize=[8,4],
title='Distribution of numeric features', xlabel='Features', ylabel='Value')
plt.xticks(rotation=45)
plt.show()
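
The visual check above can be complemented with a numeric one. The following is a minimal sketch, assuming the usual 1.5 × IQR rule (the default whisker length of a Matplotlib box plot); the helper `count_iqr_outliers` and the toy DataFrame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy data (not the car dataset)
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data, the same helper could be applied to `df.drop('target', axis=1)`.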

To check for multicollinearity, a heatmap of the correlation matrix is shown (excluding column 'Origin').

In [None]:
# Correlation matrix
cm = df.drop('Origin',axis=1).corr()

# Heatmap
plt.figure(figsize = (12,12))
sns.heatmap(cm, annot=True, square=True, cmap='coolwarm', fmt='.3f',
annot_kws={"size": 10}, linewidths=0.5, linecolor='black')
plt.show()

Column 'target' is correlated with all other columns, and '# of cylinders' is correlated with 'Engine displacement'.
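
Reading strongly correlated pairs off a heatmap becomes harder as the number of columns grows. A possible way to list them programmatically is sketched below; the helper `correlated_pairs`, the 0.9 threshold, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) never pass the comparison
    return pairs[pairs > threshold].sort_values(ascending=False)

# Hypothetical toy data: 'y' is an exact multiple of 'x', 'z' is unrelated
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                    "y": [2, 4, 6, 8, 10],
                    "z": [5, 1, 4, 2, 3]})
pairs = correlated_pairs(toy)
print(pairs)
```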

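**Remark:** The processed data is saved at this point so that it can be reloaded later for the models that do not belong to the Scikit-Learn library.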

In [None]:
# Save processed data
df.to_csv('../data/processed/auto_cons_us_processed.csv', index=False)

# DataFrame information
df.info()

### Convert categorical variables

In [None]:
# Show the columns of the DataFrame
print(df.columns)

# Convert categorical variables to dummy variables
df = pd.get_dummies(df)
print(df.head(3))

Two columns were added.

### Data standardization

In [None]:
# Obtain the feature matrix (x) and the target variable (y)
x = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Train the StandardScaler by using 'x_train', then transform 'x_train' and 'x_test'
scaler = StandardScaler()
x_train_st = scaler.fit_transform(x_train)
x_test_st = scaler.transform(x_test)
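
The scaler is fitted on the training set only, so the test set is transformed with the training mean and standard deviation. The sketch below illustrates the effect on synthetic data (the array `features` is a hypothetical stand-in for the real feature matrix): the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only approximately centered.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption, not the car data)
rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

train, test = train_test_split(features, test_size=0.2, random_state=0)

# Fit on the training split only, then reuse the same statistics for the test split
demo_scaler = StandardScaler().fit(train)
train_st = demo_scaler.transform(train)
test_st = demo_scaler.transform(test)

print("train mean:", train_st.mean(axis=0).round(6))
print("train std: ", train_st.std(axis=0).round(6))
print("test mean: ", test_st.mean(axis=0).round(3))
```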

## Models belonging to Scikit-Learn library
Five different models will be created.

In [None]:
# Define the regression models
models = [Lasso(), Ridge(), DecisionTreeRegressor(),
          RandomForestRegressor(), GradientBoostingRegressor()]

def mape(y_true, y_pred):
    '''Calculate the Mean Absolute Percentage Error (MAPE) as a fraction'''
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def make_prediction(m, x_train, y_train, x_test, y_test):
    '''Fit model m on the training set and print its metrics on the test set'''
    m.fit(x_train, y_train)
    y_pred = m.predict(x_test)
    print('MAE:{:.2f} MSE:{:.2f} MAPE:{:.2f} R2:{:.2%}'
          .format(mean_absolute_error(y_test, y_pred),
                    mean_squared_error(y_test, y_pred),
                    mape(y_test, y_pred),
                    r2_score(y_test, y_pred)))
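
As a sanity check, a hand-written MAPE can be compared with `mean_absolute_percentage_error` from `sklearn.metrics` (available in scikit-learn 0.24 and later). The sketch below uses a vectorized helper named `mape_vectorized` and made-up values; both are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def mape_vectorized(y_true, y_pred):
    """MAPE as a fraction: mean of |y_true - y_pred| / |y_true|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Made-up values: relative errors are 0.1, 0.1 and 0.0
y_true_demo = [10.0, 20.0, 40.0]
y_pred_demo = [11.0, 18.0, 40.0]

ours = mape_vectorized(y_true_demo, y_pred_demo)
theirs = mean_absolute_percentage_error(y_true_demo, y_pred_demo)
print(ours, theirs)
```

Both functions return a fraction rather than a percentage, so a value of 0.07 corresponds to an error of about 7%.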

In [None]:
# Iterate through the models and make predictions
for i in models:
    print(i)
    make_prediction(i, x_train_st, y_train, x_test_st, y_test)

**Conclusion:** 'Random Forest Regressor' and 'Gradient Boosting Regressor' obtained the best results. Both will be compared with the gradient boosting models from other libraries in the next section.

**Remark:** The column names and the feature importance coefficients (for Random Forest Regressor)
are displayed to show which features have the greatest impact on the model's predictions.

In [None]:
# Show features
print(x.columns)

# Feature weights
print(models[3].feature_importances_)

Clearly, 'Weight' (feature importance coefficient = 0.5) is the most important factor affecting fuel consumption.
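
Printing the column names and the importances separately requires matching them by position. One way to pair and rank them is sketched below on synthetic data; the column names `strong`, `weak`, and `noise` and the target formula are hypothetical and only serve to make the ranking predictable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): the target depends mostly on 'strong'
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
demo_y = 5.0 * demo_x["strong"] + 0.5 * demo_x["weak"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(demo_x, demo_y)

# Pair each importance with its column name and sort in descending order
importances = pd.Series(forest.feature_importances_,
                        index=demo_x.columns).sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern would use `models[3].feature_importances_` together with `x.columns`.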

## Models with Gradient Boosting that do not belong to Scikit-Learn library

In [None]:
# Read DataFrame
df = pd.read_csv('../data/processed/auto_cons_us_processed.csv', sep=',', header=0)

# Show DataFrame
print(df.head(3))

Tree-based ensemble models such as RandomForestRegressor and XGBRegressor generally work well with label encoding, so label encoding is used in this section instead of dummy variables.

In [None]:
label_encoders = {}

# Apply LabelEncoder to categorical columns
for col in df.select_dtypes(include=['object']).columns:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    label_encoders[col] = encoder

# Save label encoders
joblib.dump(label_encoders, '../models/rg_label_enc.pkl')
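
To make the difference between the two encodings concrete, the sketch below applies `LabelEncoder` and `pd.get_dummies` to a small hypothetical column (the category values are invented and may not match the dataset). `LabelEncoder` assigns integer codes in sorted order of the class labels, and a fitted encoder can map the codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column (values are assumptions, not taken from the data)
origin = pd.Series(["US", "Europe", "Asia", "US"], name="Origin")

# Label encoding: one integer column, classes sorted alphabetically
encoder = LabelEncoder()
encoded = encoder.fit_transform(origin)
print(list(encoder.classes_))   # sorted class labels
print(list(encoded))            # integer code per row

# The fitted encoder can decode the integers back to labels
decoded = encoder.inverse_transform(encoded)

# Dummy (one-hot) encoding: one binary column per category
dummies = pd.get_dummies(origin)
print(dummies.shape)
```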

In [None]:
# Obtain the feature matrix (x) and the target variable (y)
x = df.drop('target', axis=1)
y = df['target']

# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

# Train the StandardScaler by using 'x_train', then transform 'x_train' and 'x_test'
scaler = StandardScaler()
x_train_st = scaler.fit_transform(x_train)
x_test_st = scaler.transform(x_test)

# Save scaler
joblib.dump(scaler, '../models/rg_scaler.pkl')

In [None]:
# Define the regression models
models = [RandomForestRegressor(criterion='squared_error',
                                max_depth=30,n_estimators=100, random_state=0),
          GradientBoostingRegressor(loss='squared_error', random_state=0),
          XGBRegressor(objective='reg:squarederror',
                       n_estimators=100, learning_rate=0.1, random_state=0),
          LGBMRegressor(objective='regression', metric='rmse', verbose=0,
                        n_estimators=10, learning_rate=0.1, random_state=0),
          CatBoostRegressor(loss_function="RMSE", iterations=50, verbose=10,
                            depth=10, cat_features=None, random_state=0),
          ]

In [None]:
# Iterate through the models and make predictions
for i in models:
    print(i)
    make_prediction(i, x_train_st, y_train, x_test_st, y_test)

GradientBoostingRegressor obtained the best performance, so it will be used to make predictions in production in the next section.

In [None]:
# Save the trained model
joblib.dump(models[1], '../models/rg_GradientBoosting_model.joblib')
print("Model saved successfully.")
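
Before relying on saved artifacts, it can be useful to confirm that a reloaded scaler and model reproduce the original predictions. The following is a minimal round-trip sketch on synthetic data; it writes to a temporary directory rather than to `../models/`, and the file names and data are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, not the car dataset)
rng = np.random.default_rng(0)
demo_x = rng.normal(size=(100, 4))
demo_y = demo_x @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

demo_scaler = StandardScaler().fit(demo_x)
demo_model = GradientBoostingRegressor(random_state=0).fit(
    demo_scaler.transform(demo_x), demo_y)

# Save both artifacts, then load them back as a prediction service would
with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = Path(tmp_dir) / "scaler.pkl"
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(demo_scaler, scaler_path)
    joblib.dump(demo_model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows must be scaled with the saved scaler before prediction
new_rows = rng.normal(size=(5, 4))
original_pred = demo_model.predict(demo_scaler.transform(new_rows))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
print(np.allclose(original_pred, reloaded_pred))
```

Artifacts saved with joblib are generally expected to be loaded with the same library versions that were used to create them.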