# Regression Approach

This notebook models the problem as a regression task. Several regression techniques are evaluated, and the best-performing one is then tuned over its hyper-parameters.

In [1]:
# Change directory for cleaner path handling
%cd ..

C:\Users\georg\Documents\msc-project


In [54]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree, linear_model, dummy, kernel_ridge, gaussian_process
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, KFold
from sklearn.feature_selection import RFECV

from src.evaluation import compare_models, evaluate_model

In [51]:
df = pd.read_csv("data/processed/tracks.csv")
targets = df['views']
features = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'time_signature',
    'duration_ms',
    'popularity'
]

data = df[features]

## Model comparison
Different regression models are compared with their default parameters to establish a baseline. The evaluation includes standard scaling as part of a pipeline, so no further pre-processing is needed.
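The source of `compare_models` is not shown in this notebook. A minimal sketch of what such a helper might do is given here, assuming it wraps each estimator in a `StandardScaler` pipeline and scores it with `cross_validate`. The name `compare_models_sketch`, the 5-fold default, and the synthetic demo data are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd
from sklearn import dummy, linear_model
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compare_models_sketch(models, metrics, X, y, cv=5):
    """Hypothetical stand-in for src.evaluation.compare_models."""
    rows = []
    for name, estimator in models:
        # Scaling sits inside the pipeline, so it is fitted on each training fold only
        pipeline = Pipeline([("std", StandardScaler()), ("model", estimator)])
        scores = cross_validate(pipeline, X, y, cv=cv, scoring=metrics)
        # Average fit_time, score_time and every test_<metric> over the folds
        row = {"model": name}
        row.update({key: values.mean() for key, values in scores.items()})
        rows.append(row)
    return pd.DataFrame(rows)


# Synthetic data, used only to show the call pattern
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_models = [
    ("Baseline", dummy.DummyRegressor(strategy="mean")),
    ("Linear", linear_model.LinearRegression()),
]
results = compare_models_sketch(demo_models, ["r2", "neg_mean_absolute_error"], X_demo, y_demo)
print(results)
```

Keeping the scaler inside the pipeline means the test fold never influences the scaling statistics, which is why no separate pre-processing step is needed before the comparison.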

In [79]:
models = [
    ("Baseline", dummy.DummyRegressor(strategy='mean')),
    # ElasticNet is a linear model with a combined L1/L2 penalty (defaults: alpha=1.0, l1_ratio=0.5)
    ("Linear Regression", linear_model.ElasticNet()),
    ("Polynomial Regression", Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('linear', linear_model.LinearRegression())
    ])),
    ("Decision Tree", tree.DecisionTreeRegressor()),
    ("Kernel Ridge", kernel_ridge.KernelRidge()),
    ("Gaussian Process", gaussian_process.GaussianProcessRegressor()),
]
metrics = [
    'r2',
    'neg_mean_absolute_error',
    'neg_root_mean_squared_error',
]
compare_models(models, metrics, data, targets, regression=True)

None of the selected models seems able to model the data: apart from linear regression, they all yield negative R² scores and very large errors.
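A negative R² means the model does worse than always predicting the mean of the targets, since R² = 1 - SS_res / SS_tot. The toy numbers below are made up purely to illustrate this and are unrelated to the track data.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Always predicting the mean gives SS_res == SS_tot, so R2 is 0
mean_prediction = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_prediction)

# Predictions that are anti-correlated with the targets give SS_res > SS_tot, so R2 < 0
reversed_prediction = y_true[::-1]
r2_reversed = r2_score(y_true, reversed_prediction)

print(r2_mean, r2_reversed)
```

Under cross-validation, even a mean-predicting baseline can score slightly below zero, because the mean is estimated on the training folds and differs a little from the mean of the held-out fold.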

### Feature Selection
To try to improve the prediction metrics, features are selected using Recursive Feature Elimination with 5-fold cross-validation (RFECV).

In [70]:
# Hold out 20% of the data; feature selection only sees the training split
train_X, test_X, train_y, test_y = train_test_split(
    data, targets, test_size=0.2, random_state=1
)
model = Pipeline([
    ('std', StandardScaler()),
    ('reg', linear_model.ElasticNet(l1_ratio=1)),  # l1_ratio=1 is a pure L1 (lasso) penalty
])
selector = RFECV(
    model,
    cv=KFold(shuffle=True, random_state=1),  # n_splits defaults to 5
    scoring="r2",
    importance_getter='named_steps.reg.coef_',
)
selector.fit(train_X.values, train_y)
# Features ranked 1 are the ones RFECV kept
selected_features = np.array(features)[selector.ranking_ == 1]
selected_features

RFECV(cv=KFold(n_splits=5, random_state=1, shuffle=True),
      estimator=Pipeline(steps=[('std', StandardScaler()),
                                ('reg', ElasticNet(l1_ratio=1))]),
      importance_getter='named_steps.reg.coef_', scoring='r2')
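RFECV repeatedly refits the estimator, drops the feature with the weakest coefficient, and keeps the feature count that scores best under cross-validation. The sketch below shows this on synthetic data where only some columns carry signal. The dataset, the feature names `f0`..`f7`, and the noise level are assumptions chosen for illustration; only the pipeline and `RFECV` arguments mirror the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 8 features, of which 3 are informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

model = Pipeline([
    ("std", StandardScaler()),
    ("reg", ElasticNet(l1_ratio=1)),
])
selector = RFECV(
    model,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="r2",
    # The pipeline itself has no coef_, so point RFECV at the regressor step
    importance_getter="named_steps.reg.coef_",
)
selector.fit(X_demo, y_demo)

# support_ is the boolean mask of kept features; it matches ranking_ == 1
kept = feature_names[selector.support_]
print(selector.n_features_, kept)
```

Indexing with `selector.support_` is equivalent to the `selector.ranking_ == 1` mask used above, and `selector.n_features_` reports how many features survived.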

In [80]:
compare_models(models, metrics, data[selected_features], targets, regression=True)

Evaluating Baseline
Evaluating Linear Regression
Evaluating Polynomial Regression
Evaluating Decision Tree
Evaluating Kernel Ridge
Evaluating Gaussian Process


| | model | fit_time | score_time | test_r2 | test_neg_mean_absolute_error | test_neg_root_mean_squared_error |
|---|---|---|---|---|---|---|
| 0 | Baseline | 0.003571 | 0.001978 | -0.00049 | -6919707.0 | -13881260.0 |
| 1 | Linear Regression | 0.003758 | 0.001694 | 0.025061 | -6722681.0 | -13703490.0 |
| 2 | Polynomial Regression | 0.010015 | 0.002384 | 0.042456 | -6554856.0 | -13578940.0 |
| 3 | Decision Tree | 0.028568 | 0.002386 | -0.793068 | -7931063.0 | -18544750.0 |
| 4 | Kernel Ridge | 3.65552 | 0.061206 | -0.090376 | -5829469.0 | -14490110.0 |
| 5 | Gaussian Process | 9.32273 | 1.339202 | -462.395484 | -19232800.0 | -257320700.0 |
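scikit-learn reports error metrics negated (`neg_*`) so that higher is always better. One possible way to read the table, sketched here with the values copied from it, is to flip the sign and express each model's RMSE relative to the baseline. The column names `mae`, `rmse`, and `rmse_change_vs_baseline_pct` are introduced only for this illustration.

```python
import pandas as pd

# Values copied from the results table above
results = pd.DataFrame({
    "model": ["Baseline", "Linear Regression", "Polynomial Regression",
              "Decision Tree", "Kernel Ridge", "Gaussian Process"],
    "test_neg_mean_absolute_error": [-6919707.0, -6722681.0, -6554856.0,
                                     -7931063.0, -5829469.0, -19232800.0],
    "test_neg_root_mean_squared_error": [-13881260.0, -13703490.0, -13578940.0,
                                         -18544750.0, -14490110.0, -257320700.0],
})

# Flip the sign to recover ordinary (positive) errors
results["mae"] = -results["test_neg_mean_absolute_error"]
results["rmse"] = -results["test_neg_root_mean_squared_error"]

# Percentage change in RMSE relative to the mean-predicting baseline (negative is better)
baseline_rmse = results.loc[results["model"] == "Baseline", "rmse"].iloc[0]
results["rmse_change_vs_baseline_pct"] = 100 * (results["rmse"] - baseline_rmse) / baseline_rmse

best_by_rmse = results.sort_values("rmse").iloc[0]["model"]
best_by_mae = results.sort_values("mae").iloc[0]["model"]
print(results[["model", "mae", "rmse", "rmse_change_vs_baseline_pct"]])
print(best_by_rmse, best_by_mae)
```

Ranking by RMSE and by MAE need not agree: in this table the lowest MAE belongs to a different model than the lowest RMSE, which suggests that a few very large errors dominate the RMSE.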
