# Regression Approach

This notebook models the prediction of a track's view count as a regression problem. Several regression techniques are evaluated, and the hyperparameters of the best-performing one are then tuned.

In [None]:
%%capture
# Change directory for cleaner path handling (the cell magic must be the first line of the cell)
%cd ..

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree, linear_model, dummy, kernel_ridge, gaussian_process
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.feature_selection import RFECV

from src.evaluation import compare_models, evaluate_model

In [4]:
df = pd.read_csv("data/processed/tracks.csv")
targets = df['views']
features = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'time_signature',
    'duration_ms',
    'popularity'
]

data = df[features]

## Model comparison
Different regression models are compared with their default parameters to establish a baseline. The evaluation includes standard scaling as part of a pipeline, so no further pre-processing is needed.
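
The implementation of `compare_models` lives in `src.evaluation` and is not shown in this notebook. As a rough illustration of the idea described above, the sketch below wraps each model in a pipeline with standard scaling and cross-validates it. The name `compare_models_sketch`, the `cv=5` default and the synthetic data are assumptions made for this example, not the project's actual code.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compare_models_sketch(models, metrics, X, y, cv=5):
    """Hypothetical stand-in for src.evaluation.compare_models."""
    rows = []
    for name, estimator in models:
        # Scaling is part of the pipeline, so it is refit on the training part of every fold
        pipeline = Pipeline([("std", StandardScaler()), ("model", estimator)])
        scores = cross_validate(pipeline, X, y, scoring=metrics, cv=cv)
        # Average fit_time, score_time and every test_<metric> over the folds
        row = {"model": name}
        row.update({key: values.mean() for key, values in scores.items()})
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic data, used only to show the shape of the result
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_models = [
    ("Baseline", DummyRegressor(strategy="mean")),
    ("Linear Regression", LinearRegression()),
]
demo_table = compare_models_sketch(demo_models, ["r2", "neg_mean_absolute_error"], X_demo, y_demo)
print(demo_table)
```

Keeping the scaler inside the pipeline means the test part of each fold never influences the scaling statistics.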

In [5]:
models = [
    ("Baseline", dummy.DummyRegressor(strategy='mean')),
    ("Linear Regression", linear_model.LinearRegression()),
    ("Polynomial Regression", Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('linear', linear_model.LinearRegression())
    ])),
    ("Decision Tree", tree.DecisionTreeRegressor()),
    ("Kernel Ridge", kernel_ridge.KernelRidge()),
    ("Gaussian Process", gaussian_process.GaussianProcessRegressor()),
]
metrics = [
    'r2',
    'neg_mean_absolute_error',
    'neg_root_mean_squared_error',
]
compare_models(models, metrics, data, targets, regression=True)

Evaluating Baseline
Evaluating Linear Regression
Evaluating Polynomial Regression
Evaluating Decision Tree
Evaluating Kernel Ridge
Evaluating Gaussian Process


Unnamed: 0,model,fit_time,score_time,test_r2,test_neg_mean_absolute_error,test_neg_root_mean_squared_error
0,Baseline,0.006748,0.001785,-0.00049,-6919707.0,-13881260.0
1,Linear Regression,0.011114,0.002221,0.029208,-6719111.0,-13674200.0
2,Polynomial Regression,0.516653,0.014402,-2.891988,-7783316.0,-23318220.0
3,Decision Tree,0.148316,0.003164,-0.749538,-7938193.0,-18355140.0
4,Kernel Ridge,3.706343,0.060602,-0.090724,-5855657.0,-14492400.0
5,Gaussian Process,9.938632,1.379045,-2.742904,-12709160.0,-26265990.0


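Almost none of the models capture the data. Every model except linear regression has a negative R2 score, meaning it does worse than always predicting the mean, and even linear regression only reaches an R2 of about 0.03. The errors are large throughout, with mean absolute errors between roughly 5.9 and 12.7 million views.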
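
To make the meaning of a negative R2 concrete, here is a small sketch on made-up numbers (not taken from the dataset). Predicting the mean of the evaluated targets gives exactly 0, and anything worse than that goes negative. The baseline above lands slightly below 0 because its mean comes from the training folds, which differs a little from the mean of each test fold.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predicting the mean of the evaluated targets: residual sum equals total sum of squares
mean_prediction = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_prediction)

# Predictions that are worse than the mean (here: the targets in reverse order)
bad_prediction = y_true[::-1]
r2_bad = r2_score(y_true, bad_prediction)

print(r2_mean, r2_bad)  # 0.0 and -3.0 (1 - 40/10)
```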

### Feature Selection
In an attempt to improve the prediction metrics, features are selected with Recursive Feature Elimination, scored by 5-fold cross-validation (RFECV) on the training split.

In [6]:
# Hold out 20% of the data for the final evaluation
train_X, test_X, train_y, test_y = train_test_split(data, targets, test_size=.2, random_state=1)
model = Pipeline([
    ('std', StandardScaler()),
    ('reg', linear_model.LinearRegression())
])
# Rank features by the coefficients of the linear regression step
selector = RFECV(model, cv=KFold(shuffle=True, random_state=1), scoring="r2", importance_getter='named_steps.reg.coef_')
selector.fit(train_X.values, train_y)
# Rank 1 marks the features that RFECV kept
selected_features = np.array(features)[selector.ranking_ == 1]
selected_features

array(['danceability', 'instrumentalness', 'popularity'], dtype='<U16')
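
For readers who want to try the selection step in isolation, the following sketch repeats the same RFECV setup on synthetic data from `make_regression` (an assumption for this example; the real notebook uses the track features). It also shows that the boolean mask `support_` carries the same information as `ranking_ == 1`, and that `n_features_` reports how many features were kept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, of which 3 actually drive the target
X_syn, y_syn, true_coef = make_regression(
    n_samples=300, n_features=8, n_informative=3,
    noise=5.0, coef=True, random_state=0,
)

model = Pipeline([
    ("std", StandardScaler()),
    ("reg", LinearRegression()),
])
selector = RFECV(
    model,
    cv=KFold(shuffle=True, random_state=1),  # 5 folds by default
    scoring="r2",
    importance_getter="named_steps.reg.coef_",
)
selector.fit(X_syn, y_syn)

# Indices of the kept features, next to the truly informative ones
kept = np.flatnonzero(selector.support_)
informative = np.flatnonzero(true_coef)
print("kept:", kept)
print("informative:", informative)
print("number kept:", selector.n_features_)
```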

In [7]:
compare_models(models, metrics, data[selected_features], targets, regression=True)

Evaluating Baseline
Evaluating Linear Regression
Evaluating Polynomial Regression
Evaluating Decision Tree
Evaluating Kernel Ridge
Evaluating Gaussian Process


Unnamed: 0,model,fit_time,score_time,test_r2,test_neg_mean_absolute_error,test_neg_root_mean_squared_error
0,Baseline,0.00366,0.001998,-0.00049,-6919707.0,-13881260.0
1,Linear Regression,0.003873,0.001878,0.029636,-6711314.0,-13671220.0
2,Polynomial Regression,0.009773,0.00239,0.042456,-6554856.0,-13578940.0
3,Decision Tree,0.028774,0.002477,-0.78148,-7946711.0,-18481650.0
4,Kernel Ridge,3.998189,0.06259,-0.090376,-5829469.0,-14490110.0
5,Gaussian Process,9.721217,1.32272,-462.395484,-19232800.0,-257320700.0


With the selected features, polynomial regression achieves the highest R2 score (about 0.042, compared with 0.030 for linear regression). Adding L2 regularization (ridge regression) and tuning its strength may raise the R2 score slightly further.
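
As a reminder of what the L2 penalty does, ridge regression minimizes the squared error plus `alpha` times the squared norm of the coefficients, so larger `alpha` values pull the coefficients toward zero. The sketch below illustrates this on synthetic data (an assumption for this example, not the track data) with the same scaling and degree-3 polynomial steps.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with 3 features, mirroring the number of selected features
X_syn, y_syn = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)

coef_norms = {}
for alpha in [0.1, 10, 1000]:
    pipe = Pipeline([
        ("std", StandardScaler()),
        ("poly", PolynomialFeatures(degree=3)),
        ("reg", Ridge(alpha=alpha)),
    ])
    pipe.fit(X_syn, y_syn)
    # Euclidean norm of the fitted coefficients for this regularization strength
    coef_norms[alpha] = np.linalg.norm(pipe.named_steps["reg"].coef_)

for alpha, norm in coef_norms.items():
    print(f"alpha={alpha}: coefficient norm {norm:.2f}")
```

Shrinking the many degree-3 terms is one plausible reason a fairly large `alpha` can help on noisy data, although the effect on this dataset has to be judged from the cross-validated scores.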

In [60]:
model = Pipeline([
    ('std', StandardScaler()),
    ('poly', PolynomialFeatures(degree=3)),
    ('reg', linear_model.Ridge())
])
params = {
    'reg__alpha': [0.1, 1, 10, 100, 1000, 10000],
}
optimizer = GridSearchCV(model, params, scoring='r2')
optimizer.fit(train_X[selected_features], train_y)

GridSearchCV(estimator=Pipeline(steps=[('std', StandardScaler()),
                                       ('poly', PolynomialFeatures(degree=3)),
                                       ('reg', Ridge())]),
             param_grid={'reg__alpha': [0.1, 1, 10, 100, 1000, 10000]},
             scoring='r2')

In [61]:
optimizer.best_params_

{'reg__alpha': 1000}

In [62]:
optimizer.best_score_

0.043466959443468414

In [63]:
optimizer.score(test_X[selected_features],test_y)

0.03957779938682138