# Quick Evaluation of Regression Algorithms
This notebook provides a quick evaluation of publicly available regression algorithms (regressors) that can be used to train a model.

In [1]:
import numpy as np
import pandas as pd

# load the training data
training_data = pd.read_csv('data/training_data.csv')
training_data.head()

Unnamed: 0,year,month,shop_id,shop_t,shop_t-1,shop_t-11,categ_id,categ_t,categ_t-1,categ_t-11,item_id,t,t-1,t-2,t-5,t-11,t+1
0,2014,0,2,890.0,1322.0,488.0,40,76.0,93.0,40.0,32,1.0,0.0,0.0,0.0,0.0,0.0
1,2014,0,2,890.0,1322.0,488.0,37,44.0,55.0,21.0,33,1.0,1.0,2.0,0.0,0.0,0.0
2,2014,0,2,890.0,1322.0,488.0,37,44.0,55.0,21.0,99,1.0,0.0,0.0,0.0,0.0,0.0
3,2014,0,2,890.0,1322.0,488.0,73,4.0,3.0,10.0,482,2.0,1.0,2.0,0.0,1.0,1.0
4,2014,0,2,890.0,1322.0,488.0,73,4.0,3.0,10.0,485,1.0,1.0,0.0,0.0,0.0,1.0


We will use a small subset of the available features for this evaluation.

In [2]:
features_to_use = ['year','month','shop_id','categ_id','item_id','t']
categ_features = ['year','month','shop_id','categ_id','item_id']

y = training_data['t+1']
X = training_data[features_to_use]
X.head()

Unnamed: 0,year,month,shop_id,categ_id,item_id,t
0,2014,0,2,40,32,1.0
1,2014,0,2,37,33,1.0
2,2014,0,2,37,99,1.0
3,2014,0,2,73,482,2.0
4,2014,0,2,73,485,1.0


In [3]:
from sklearn.model_selection import train_test_split

# Split the features (X) and the target 't+1' (y) into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Validation set has {} samples.".format(X_valid.shape[0]))

Training set has 711895 samples.
Validation set has 177974 samples.
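
The sample counts follow from `test_size = 0.2`: scikit-learn rounds the validation share up and assigns the remaining rows to training. A quick check of that arithmetic, using the totals printed above (the variable names here are illustrative only):

```python
import math

n_total = 711895 + 177974            # total rows, taken from the output above
n_valid = math.ceil(0.2 * n_total)   # validation share, rounded up
n_train = n_total - n_valid          # the remainder goes to training

print(n_train, n_valid)
```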


In [4]:
from sklearn.metrics import mean_squared_error

# calculates the root mean squared error
def calc_RMSE(actuals, predictions):
    return np.sqrt(mean_squared_error(actuals, predictions))
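
As a sanity check on the metric, RMSE is the square root of the mean of the squared differences between actuals and predictions. The sketch below recomputes it from first principles on four made-up values (the helper name `rmse_by_hand` and the numbers are illustrative, not part of the notebook's data):

```python
import math

def rmse_by_hand(actuals, predictions):
    """Root mean squared error computed without any library helpers."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# The errors are 0.5, -0.5, 0.0 and -1.0, so the mean squared error is 0.375
example_rmse = rmse_by_hand([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(example_rmse)  # about 0.612
```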

In [5]:
# helper function to assess the performance of the algorithm on the training and validation sets
def assess_model(regressor):
    regressor_model = regressor.fit(X_train, y_train)

    predictions_train = regressor_model.predict(X_train)
    print('Training RMSE:', calc_RMSE(y_train, predictions_train))

    predictions_valid = regressor_model.predict(X_valid)
    print('Validation RMSE:', calc_RMSE(y_valid, predictions_valid))

## Standard Regression Algorithms
Let's start with the basic regression algorithms: LinearRegression and DecisionTreeRegressor.

In [6]:
from sklearn.linear_model import LinearRegression

algo = LinearRegression()
assess_model(algo)

Training RMSE: 6.652138218634412
Validation RMSE: 6.390189175834743
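
An RMSE value is easier to interpret next to a trivial reference model. One possible reference is a regressor that always predicts the training mean, sketched here with scikit-learn's `DummyRegressor`. The data is a small synthetic stand-in (an assumption, since the notebook's CSV is not reproduced here), so the number it prints says nothing about the real dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption: NOT the notebook's training_data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.poisson(lam=1.0, size=200).astype(float)

# Fit on the first 160 rows, evaluate on the remaining 40
baseline = DummyRegressor(strategy="mean").fit(X_demo[:160], y_demo[:160])
baseline_pred = baseline.predict(X_demo[160:])
baseline_rmse = np.sqrt(mean_squared_error(y_demo[160:], baseline_pred))

print("Baseline validation RMSE:", baseline_rmse)
```

In the notebook itself, the same idea would amount to calling `assess_model(DummyRegressor())`.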


In [8]:
from sklearn.tree import DecisionTreeRegressor

algo = DecisionTreeRegressor()
assess_model(algo)

Training RMSE: 0.0
Validation RMSE: 6.420117462616004


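The training RMSE is zero but the validation RMSE is high. An unconstrained decision tree can grow until it memorizes the training set, so a gap like this is a sign of overfitting.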
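
A zero training error alongside a much higher validation error is the typical pattern of a fully grown tree. The sketch below shows the effect on synthetic noisy data (an assumption, not the notebook's dataset) by comparing an unconstrained tree with one limited to `max_depth=3`. The depth value is arbitrary and only for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (assumption: stand-in for the real features)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=500)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

def demo_rmse(actuals, predictions):
    return np.sqrt(mean_squared_error(actuals, predictions))

# Map each max_depth setting to its (training RMSE, validation RMSE)
results = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(Xd_train, yd_train)
    results[depth] = (demo_rmse(yd_train, tree.predict(Xd_train)),
                      demo_rmse(yd_valid, tree.predict(Xd_valid)))
    print("max_depth =", depth, "train/valid RMSE:", results[depth])
```

With `max_depth=None` the tree reproduces every training target exactly, which is why its training RMSE is zero. The depth-limited tree gives up that perfect fit.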

## Regression Algorithms in the sklearn.ensemble Package
Let's move on to the algorithms available in the sklearn.ensemble package.

In [9]:
from sklearn.ensemble import AdaBoostRegressor

algo = AdaBoostRegressor()
assess_model(algo)

Training RMSE: 4.923568341793056
Validation RMSE: 6.28408682611635


In [10]:
from sklearn.ensemble import BaggingRegressor

algo = BaggingRegressor()
assess_model(algo)

Training RMSE: 2.631988421732797
Validation RMSE: 5.438492565864821


In [12]:
from sklearn.ensemble import ExtraTreesRegressor

algo = ExtraTreesRegressor()
assess_model(algo)

Training RMSE: 0.0
Validation RMSE: 5.428250604616416


In [13]:
from sklearn.ensemble import GradientBoostingRegressor

algo = GradientBoostingRegressor()
assess_model(algo)

Training RMSE: 4.7093941510127735
Validation RMSE: 5.65825705694143


In [14]:
from sklearn.ensemble import RandomForestRegressor

algo = RandomForestRegressor()
assess_model(algo)

Training RMSE: 2.6289030423632918
Validation RMSE: 5.417305498170443


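It looks like the algorithms that use averaging methods (BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor) achieved lower validation RMSE than the ones that use boosting (AdaBoostRegressor, GradientBoostingRegressor).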

## Other Notable Algorithms
Let's try other notable regressors. XGBoost has been used to win many Kaggle competitions. LightGBM was developed by Microsoft.

In [15]:
import xgboost as xgb

algo = xgb.XGBRegressor()
assess_model(algo)

Training RMSE: 4.789713818327151
Validation RMSE: 5.64767217677085


In [16]:
from lightgbm.sklearn import LGBMRegressor

algo = LGBMRegressor()
assess_model(algo)

Training RMSE: 4.409052280719458
Validation RMSE: 5.193557247410162


LGBMRegressor (LightGBM) had the lowest validation RMSE (about 5.19) of all the regressors evaluated.
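
To make the comparison easier to read, the validation RMSE values printed in the cells above can be collected and sorted. The sketch below copies those values by hand, rounded to three decimals. The dictionary name `validation_rmse` is illustrative:

```python
# Validation RMSE values copied from the cell outputs above (rounded)
validation_rmse = {
    "LinearRegression": 6.390,
    "DecisionTreeRegressor": 6.420,
    "AdaBoostRegressor": 6.284,
    "BaggingRegressor": 5.438,
    "ExtraTreesRegressor": 5.428,
    "GradientBoostingRegressor": 5.658,
    "RandomForestRegressor": 5.417,
    "XGBRegressor": 5.648,
    "LGBMRegressor": 5.194,
}

# Sort from lowest (best) to highest validation RMSE
ranked = sorted(validation_rmse.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name:<28}{score:.3f}")
```

All regressors were run with their default settings, so this ranking reflects untuned models only.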