# Regression Approach

This notebook models the problem as a regression task. Several regression techniques are evaluated, and the hyperparameters of the best-performing one are then optimized.

In [1]:
# Change directory for cleaner path handling
%cd ..

C:\Users\georg\Documents\msc-project


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree, linear_model, dummy, kernel_ridge, gaussian_process
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from src.evaluation import compare_models, evaluate_model

## Feature selection

In [3]:
df = pd.read_csv("data/interim/tracks.csv")
targets = df['views']
features = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'time_signature',
    'duration_ms',
    'popularity'
]

data = df[features]

## Model comparison
Several regression models are compared using their default parameters (only the polynomial degree is set explicitly) to establish a baseline.

In [5]:
models = [
    ("Baseline", dummy.DummyRegressor(strategy='mean')),
    ("Linear Regression", linear_model.LinearRegression()),
    ("Polynomial Regression", Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('linear', linear_model.LinearRegression())
    ])),
    ("Decision Tree", tree.DecisionTreeRegressor()),
    ("Kernel Ridge", kernel_ridge.KernelRidge()),
    ("Gaussian Process", gaussian_process.GaussianProcessRegressor()),
]
metrics = [
    'r2',
    'neg_mean_absolute_error',
    'neg_root_mean_squared_error',
]
compare_models(models, metrics, data, targets)

Evaluating Baseline
Evaluating Linear Regression
Evaluating Polynomial Regression
Evaluating Decision Tree
Evaluating Kernel Ridge
Evaluating Gaussian Process


| | model | fit_time | score_time | test_r2 | test_neg_mean_absolute_error | test_neg_root_mean_squared_error |
|---|---|---|---|---|---|---|
| 0 | Baseline | 0.006547 | 0.002088 | -0.267814 | -6421567.0 | -12305560.0 |
| 1 | Linear Regression | 0.009212 | 0.002394 | -0.266371 | -6291845.0 | -12215500.0 |
| 2 | Polynomial Regression | 0.458402 | 0.013294 | -130.267059 | -8215892.0 | -47034460.0 |
| 3 | Decision Tree | 0.128865 | 0.002877 | -2.913785 | -7631578.0 | -17721190.0 |
| 4 | Kernel Ridge | 2.570465 | 0.046525 | -0.216374 | -5277402.0 | -12569480.0 |
| 5 | Gaussian Process | 6.05797 | 1.021435 | -16950.368704 | -166738600.0 | -856029100.0 |
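
The implementation of `compare_models` in `src.evaluation` is not shown in this notebook. The output columns (`fit_time`, `score_time`, `test_r2`, and so on) match the keys returned by scikit-learn's `cross_validate`, so a minimal sketch, assuming the helper wraps `cross_validate` and averages the folds, could look like the following. The name `compare_models_sketch` and the synthetic data are illustrative assumptions, not the project's code.

```python
import pandas as pd
from sklearn import dummy, linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate


def compare_models_sketch(models, metrics, X, y, cv=5):
    """Cross-validate each (name, estimator) pair and return mean scores per model.

    Hypothetical stand-in for src.evaluation.compare_models.
    """
    rows = []
    for name, estimator in models:
        print(f"Evaluating {name}")
        scores = cross_validate(estimator, X, y, scoring=metrics, cv=cv)
        row = {"model": name}
        # Average every timing and score array over the folds.
        row.update({key: values.mean() for key, values in scores.items()})
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = [
    ("Baseline", dummy.DummyRegressor(strategy="mean")),
    ("Linear Regression", linear_model.LinearRegression()),
]
metrics = ["r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"]

results = compare_models_sketch(models, metrics, X, y)
print(results)
```

Averaging over folds hides the spread between them; returning the per-fold arrays instead would make unstable models easier to spot.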


## Model optimization
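Kernel Ridge achieved the highest R² and the lowest mean absolute error in the comparison above, although every model, including the baseline, has a negative R². Its hyperparameters are therefore optimized with a grid search over kernel type, regularization strength, and kernel-specific parameters.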

In [6]:
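# Each dict is a separate sub-grid; GridSearchCV searches their union.
params = [
    {'kr__alpha': [0.1, 1, 10]},
    # Note: 'chi2' accepts only non-negative inputs, so it may be the source of the
    # failed fits reported below once StandardScaler has centred the features.
    {'kr__kernel': ['rbf', 'sigmoid', 'chi2'], 'kr__gamma': [0.1, 1, 10]},
    {'kr__kernel': ['polynomial'], 'kr__degree': [2, 3, 4]},
]

# Standardize the features before fitting Kernel Ridge.
model = Pipeline([
    ("std", StandardScaler()),
    ("kr", kernel_ridge.KernelRidge()),
])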

optimizer = GridSearchCV(
    model,
    param_grid=params,
    scoring='r2',
    n_jobs=3,
    verbose=4
)
optimizer.fit(data, targets)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


15 fits failed out of a total of 75.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "c:\users\georg\documents\msc-project\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\users\georg\documents\msc-project\venv\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\users\georg\documents\msc-project\venv\lib\site-packages\sklearn\kernel_ridge.py", line 197, in fit
    K = self._get_kernel(X)
  File "c:\users\georg\documents\msc-project\venv\lib\site-packages\sklearn\kernel_ridge.py"

GridSearchCV(estimator=Pipeline(steps=[('std', StandardScaler()),
                                       ('kr', KernelRidge())]),
             n_jobs=3,
             param_grid=[{'kr__alpha': [0.1, 1, 10]},
                         {'kr__gamma': [0.1, 1, 10],
                          'kr__kernel': ['rbf', 'sigmoid', 'chi2']},
                         {'kr__degree': [2, 3, 4],
                          'kr__kernel': ['polynomial']}],
             scoring='r2', verbose=4)
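
The traceback above is truncated, so the cause of the 15 failed fits is not confirmed. One plausible explanation is the `chi2` kernel: it accepts only non-negative inputs, while `StandardScaler` centres each feature around zero. The sketch below uses synthetic data (an assumption; the real features are not used) to count the candidates in the grid and to reproduce that kind of failure.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Same grid as in the notebook cell above.
params = [
    {"kr__alpha": [0.1, 1, 10]},
    {"kr__kernel": ["rbf", "sigmoid", "chi2"], "kr__gamma": [0.1, 1, 10]},
    {"kr__kernel": ["polynomial"], "kr__degree": [2, 3, 4]},
]

# A list of dicts expands to the union of the sub-grids: 3 + 9 + 3 candidates.
candidates = list(ParameterGrid(params))
print(len(candidates))

# Fits that use the chi2 kernel, assuming the default 5-fold cross-validation.
n_folds = 5
n_chi2_fits = n_folds * sum(c.get("kr__kernel") == "chi2" for c in candidates)
print(n_chi2_fits)

# Synthetic non-negative features; standardizing them introduces negative values.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 3))
y = rng.normal(size=20)
X_scaled = StandardScaler().fit_transform(X)

# The chi2 kernel fits on the raw, non-negative features.
KernelRidge(kernel="chi2", gamma=1).fit(X, y)
raw_fit_ok = True

# On the standardized features the same fit is expected to raise a ValueError.
try:
    KernelRidge(kernel="chi2", gamma=1).fit(X_scaled, y)
    chi2_failed = False
except ValueError as err:
    chi2_failed = True
    print(f"chi2 fit failed: {err}")
```

If this is indeed the cause, the count matches: three `chi2` candidates times five folds gives 15 fits. Passing `error_score='raise'` to `GridSearchCV`, as the warning suggests, would surface the full error message.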

In [7]:
optimizer.best_params_

{'kr__alpha': 10}

In [9]:
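# Note: this evaluates KernelRidge directly; the StandardScaler step used in the
# grid-search pipeline is not included here.
scores = evaluate_model(kernel_ridge.KernelRidge(alpha=10), metrics, data, targets)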

In [10]:
pd.DataFrame(scores)

| | fit_time | score_time | test_r2 | test_neg_mean_absolute_error | test_neg_root_mean_squared_error |
|---|---|---|---|---|---|
| 0 | 2.534089 | 0.044117 | -0.156519 | -8783641.0 | -20639730.0 |
| 1 | 2.50628 | 0.045136 | -0.253605 | -5960049.0 | -11266650.0 |
| 2 | 2.563355 | 0.044612 | -0.071045 | -5806377.0 | -15931810.0 |
| 3 | 2.56035 | 0.04464 | -0.037648 | -3114419.0 | -10281090.0 |
| 4 | 2.562359 | 0.044121 | -0.561338 | -2712654.0 | -4725487.0 |
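
scikit-learn scorers follow a higher-is-better convention, which is why the error metrics above carry a `neg_` prefix and negative values. The sketch below shows one way the per-fold scores could be summarized, with the sign flipped back to ordinary MAE and RMSE. The values are copied from the table above; the `summary` layout is only an illustration.

```python
import pandas as pd

# Per-fold scores copied from the evaluation output above.
scores = pd.DataFrame({
    "test_r2": [-0.156519, -0.253605, -0.071045, -0.037648, -0.561338],
    "test_neg_mean_absolute_error": [
        -8783641.0, -5960049.0, -5806377.0, -3114419.0, -2712654.0,
    ],
    "test_neg_root_mean_squared_error": [
        -20639730.0, -11266650.0, -15931810.0, -10281090.0, -4725487.0,
    ],
})

# Flip the sign of the negated error metrics, then aggregate over the folds.
summary = pd.DataFrame({
    "r2": scores["test_r2"],
    "mae": -scores["test_neg_mean_absolute_error"],
    "rmse": -scores["test_neg_root_mean_squared_error"],
}).agg(["mean", "std"])

print(summary)
```

Reporting the standard deviation alongside the mean makes the large variation between folds visible.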
