# Regression Approach

This notebook models the problem as a regression task. Several regression techniques are evaluated, and the best-performing one is then tuned over its hyper-parameters.

In [None]:
# Change directory for cleaner path handling
%cd ..

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import tree, linear_model, dummy, kernel_ridge, gaussian_process
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from src.evaluation import compare_models, evaluate_model

In [3]:
# Load the processed track data; the view count is the regression target
df = pd.read_csv("data/processed/tracks.csv")
targets = df['views']
features = [
    'danceability',
    'energy',
    'key',
    'loudness',
    'mode',
    'speechiness',
    'acousticness',
    'instrumentalness',
    'liveness',
    'valence',
    'tempo',
    'time_signature',
    'duration_ms',
    'popularity'
]

data = df[features]

## Model comparison
Different regression models are compared with their default parameters to establish a baseline. The evaluation includes standard scaling as part of a pipeline, so no further pre-processing is needed.
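The helper `src.evaluation.compare_models` is not shown in this notebook. As a rough illustration of what "standard scaling as part of a pipeline" can look like, the sketch below wraps each model in a `StandardScaler` pipeline and cross-validates it. The function name `compare_models_sketch`, the 5-fold default, and the use of `cross_validate` are assumptions made for this example, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def compare_models_sketch(models, metrics, X, y, cv=5):
    """Hypothetical stand-in for src.evaluation.compare_models."""
    rows = []
    for name, model in models:
        # The scaler is fitted inside each training fold, so no statistics
        # from the held-out fold leak into the model.
        pipeline = Pipeline([("scale", StandardScaler()), ("model", model)])
        scores = cross_validate(pipeline, X, y, scoring=metrics, cv=cv)
        # Average fit_time, score_time and every test_<metric> over the folds.
        rows.append({"model": name, **{k: v.mean() for k, v in scores.items()}})
    return pd.DataFrame(rows)
```

Called with a list of `(name, estimator)` pairs and a list of scorer names, this returns one row per model with columns such as `fit_time`, `score_time`, and `test_r2`, which mirrors the layout of the results table in this notebook.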

In [4]:
models = [
    ("Baseline", dummy.DummyRegressor(strategy='mean')),
    ("Linear Regression", linear_model.LinearRegression()),
    ("Polynomial Regression", Pipeline([
        ('poly', PolynomialFeatures(degree=3)),
        ('linear', linear_model.LinearRegression())
    ])),
    ("Decision Tree", tree.DecisionTreeRegressor()),
    ("Kernel Ridge", kernel_ridge.KernelRidge()),
    ("Gaussian Process", gaussian_process.GaussianProcessRegressor()),
]
# scikit-learn scorer names; error metrics are negated so that higher is always better
metrics = [
    'r2',
    'neg_mean_absolute_error',
    'neg_root_mean_squared_error',
]
compare_models(models, metrics, data, targets)

Evaluating Baseline
Evaluating Linear Regression
Evaluating Polynomial Regression
Evaluating Decision Tree
Evaluating Kernel Ridge
Evaluating Gaussian Process


Unnamed: 0,model,fit_time,score_time,test_r2,test_neg_mean_absolute_error,test_neg_root_mean_squared_error
0,Baseline,0.004766,0.001787,-0.250216,-7019707.0,-12900200.0
1,Linear Regression,0.009721,0.002183,-0.229132,-6822186.0,-12763790.0
2,Polynomial Regression,0.465134,0.013007,-11.271168,-8038891.0,-22971940.0
3,Decision Tree,0.136889,0.002778,-2.719684,-8317683.0,-18769930.0
4,Kernel Ridge,2.626323,0.047413,-0.2372,-5941487.0,-13225910.0
5,Gaussian Process,7.335853,0.998132,-9621.142582,-155802400.0,-741379300.0


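None of the selected models is able to model the data. Every R2 score is negative, which means each model predicts worse than a constant equal to the mean of the evaluation targets, and the mean absolute errors are in the millions of views. Linear Regression and Kernel Ridge are marginally better than the mean baseline on R2 and mean absolute error, while Polynomial Regression, Decision Tree, and Gaussian Process are considerably worse on every metric.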
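It may seem surprising that even the mean baseline gets a negative R2. A small worked example, using made-up view counts rather than the notebook's data, shows how this can happen: R2 is defined as 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the evaluation targets, so a constant fitted on training data with a different mean scores below zero.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative numbers only: one viral track inflates the training mean.
train_views = [1_000, 2_000, 3_000, 50_000]  # mean 14_000
test_views = [1_000, 2_000, 3_000]           # mean 2_000

# A mean baseline predicts the training mean for every test sample.
train_mean = sum(train_views) / len(train_views)
baseline_pred = [train_mean] * len(test_views)

score = r2(test_views, baseline_pred)
print(score)  # -216.0
```

Predicting the test mean itself would give exactly 0, so any negative value indicates a fit worse than that reference. With heavily skewed targets, a few extreme values in one fold are enough to push the score far below zero.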