# Non Neural Network Predictive Model
* We can now move on with our **predictive models**.
* In this notebook we will try some regressors without using Neural Networks.

## Import packages
* To begin with, we import the packages that we will use in the following sections of the project:
    * [pandas](https://pandas.pydata.org/)
    * [NumPy](https://www.numpy.org/)
    * [statsmodels](https://www.statsmodels.org/stable/index.html)
    * [scikit-learn](https://scikit-learn.org/stable/)
    * [XGBoost](https://xgboost.readthedocs.io/en/stable/)
    * [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
* Note that these packages **must be installed locally** before they can be used.

In [51]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cross_decomposition import PLSRegression
import sklearn.metrics
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_validate, train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

import xgboost as xgb
import lightgbm as lgb

## Data Load
* We will use the files created in the <code>data_preparation</code> notebook.
* These files can be found in the <code>data</code> folder.
* Specifically, we will load the following:
  * <code>tracks.csv</code>
  * <code>analyses_downsampled.csv</code>

In [2]:
tracks = pd.read_csv('data/tracks.csv', index_col=0)
analyses_ds = pd.read_csv('data/analyses_downsampled.csv', index_col=0)

* We will join these two tables on their shared track <code>id</code> index, so that all the data lives in a single DataFrame.

In [3]:
full_tracks = tracks.join(analyses_ds)
full_tracks.head()

id,danceability,energy,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,...,timbre_7_std_mid,timbre_8_std_mid,timbre_9_std_mid,timbre_10_std_mid,timbre_11_std_mid,timbre_12_std_mid,loudness_start_std_mid,loudness_max_std_mid,loudness_max_time_std_mid,duration_std_mid
0AJfVstyUONRJhyYgTkHMR,0.253,0.385,-8.031,0,0.0341,0.729,0.0,0.0428,0.646,203.512,...,31.028081,21.500966,18.934411,19.095865,14.053563,16.839561,4.294521,3.070072,0.046489,0.10483
1V3NH8aYfFyRjfNOHgoX3G,0.219,0.222,-16.941,1,0.039,0.978,0.000149,0.0837,0.0632,72.999,...,22.268559,11.564204,19.853243,9.968599,12.719616,23.051455,5.631639,5.568326,0.077225,0.259447
4kfEy7h6bWBS5OaDpccTap,0.846,0.651,-5.481,0,0.128,0.634,0.0,0.125,0.581,140.048,...,28.513058,29.713144,17.411532,23.796638,22.653034,18.470448,5.112536,2.455965,0.034787,0.084622
7qgM0MdETmhd3mORTqvJRm,0.398,0.255,-11.205,1,0.0373,0.944,3e-06,0.114,0.386,83.883,...,23.401036,22.397868,17.243494,16.434974,19.959498,13.918116,5.835958,5.094191,0.06642,0.140004
1HtgEaqki4dFclmVJlJDx1,0.612,0.579,-7.264,1,0.0269,0.606,9e-06,0.353,0.81,136.641,...,24.076927,18.603327,15.858383,13.746648,16.996667,16.838443,4.014044,3.318366,0.037726,0.137322
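
As a side note, <code>DataFrame.join</code> aligns rows on the index and defaults to a left join, so a track without a matching analysis row would end up with <code>NaN</code> in the analysis columns. The following is a minimal sketch with two made-up frames (the ids <code>a</code> and <code>b</code> and the column <code>timbre_demo</code> are hypothetical, not part of the real data) of how one might check for such rows:

```python
import pandas as pd

# Hypothetical stand-ins for `tracks` and `analyses_ds`, both indexed by track id.
demo_tracks = pd.DataFrame({'energy': [0.4, 0.7]},
                           index=pd.Index(['a', 'b'], name='id'))
demo_analyses = pd.DataFrame({'timbre_demo': [12.0]},
                             index=pd.Index(['a'], name='id'))

# join() keeps every row of the left frame and fills unmatched rows with NaN.
demo_joined = demo_tracks.join(demo_analyses)

# ids of tracks that have no analysis row
unmatched_ids = demo_joined[demo_joined['timbre_demo'].isna()].index.tolist()
print(unmatched_ids)
```

If such a list were non-empty for the real data, the affected rows would need to be dropped or imputed before fitting estimators that do not accept missing values.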


## Dataset Generation
* Here, we define the datasets that our models will use.
* Specifically, we will use two datasets: the first contains **all** the variables that we have, and the second only the **important** ones.
* Lastly, we will add a very simple dataset that contains **only the audio features of the songs**. We will use it only as a baseline for evaluating the other models.

### All Variables Dataset
* First, we will create the dataset that contains all the variables that we have.
* So we will use the <code>full_tracks</code> dataset.
* We have to split our dataset into *train* and *test* sets.
* The test set will contain **20%** of the whole dataset.

In [4]:
all_X, y = (full_tracks.drop(columns='valence'), full_tracks['valence'])

all_X_train, all_X_test, y_train, y_test = train_test_split(all_X, y, test_size=0.2, 
                                                    random_state=1)

### Important Variables Dataset
* Next, we will create another dataset that contains only the variables that seem to have an impact on valence.
* We will start with the audio features that affected valence the most, as we saw in the statistical analysis.
* Specifically, we saw that **danceability**, **energy** and **acousticness** are important, so we will include them in the dataset.
* We will also include **loudness**, **tempo** and **duration_ms**, which have a smaller but still significant impact on *valence*.

In [5]:
important_audio_features = ['danceability', 'energy', 'acousticness', 'loudness', 'tempo', 'duration_ms']
important_features_df = full_tracks[important_audio_features]
important_features_df.head()

id,danceability,energy,acousticness,loudness,tempo,duration_ms
0AJfVstyUONRJhyYgTkHMR,0.253,0.385,0.729,-8.031,203.512,161000
1V3NH8aYfFyRjfNOHgoX3G,0.219,0.222,0.978,-16.941,72.999,219707
4kfEy7h6bWBS5OaDpccTap,0.846,0.651,0.634,-5.481,140.048,223921
7qgM0MdETmhd3mORTqvJRm,0.398,0.255,0.944,-11.205,83.883,246573
1HtgEaqki4dFclmVJlJDx1,0.612,0.579,0.606,-7.264,136.641,195531


* Next, we also have to include data from the deep analysis.
* To get more **comprehensive** and more **concentrated** information, we will use [PLSRegression](https://scikit-learn.org/dev/modules/generated/sklearn.cross_decomposition.PLSRegression.html#sklearn.cross_decomposition.PLSRegression) for dimensionality reduction. We chose it over PCA because it is a supervised algorithm: we want our components to capture as much information as possible for explaining the variance of *valence*, rather than the variance of our predictors in general.
* To do that correctly, we first have to split our dataset into train and test, so that PLS is fitted only on the train data.
* Here, we will reuse the split that we did previously.

In [6]:
train_ids = all_X_train.index
test_ids = all_X_test.index
train_analyses_df = analyses_ds.filter(items=train_ids, axis=0)
test_analyses_df = analyses_ds.filter(items=test_ids, axis=0)

* Now we can execute our dimensionality reduction.
* We will create 30 components.

In [70]:
plsr = PLSRegression(n_components=30)
plsr.fit(train_analyses_df, y_train)
train_analyses_reduced = pd.DataFrame(plsr.transform(train_analyses_df), index=train_ids)
test_analyses_reduced = pd.DataFrame(plsr.transform(test_analyses_df), index=test_ids)

train_analyses_reduced.head()

id,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
3c5i1PV78s4Mgz5DPvg5p8,0.2182,4.120789,2.410577,1.749678,-0.705838,-0.810013,-0.819584,1.090276,-0.550811,-0.620698,...,0.657423,0.899749,1.343165,-0.272157,0.926544,0.902467,0.833886,0.130493,-0.631656,1.06517
5W3Y5I9y5ps6M6rl09e7zg,-2.189106,1.951263,2.06027,-5.580001,-0.494353,-2.002322,-0.378216,-2.279076,-0.231867,1.033581,...,0.246576,-0.527923,-1.003374,1.470143,-0.455528,-0.96152,-0.121035,-0.818958,0.21047,0.18177
1FssTKSXO3JLiQc3udfK5I,-6.102108,0.417266,1.352917,-1.263815,-1.273376,-1.772931,-0.769974,-1.044958,0.33371,2.249149,...,-0.226232,0.699581,-0.254394,-0.419115,-0.136135,0.270881,-0.109923,-0.626718,0.428838,-0.043215
0WWz2AaqxLoO0fa9ou6Fqc,0.507525,3.218161,4.13461,0.876526,-1.179215,0.180405,-0.398905,-0.77257,0.946695,-0.624516,...,0.280837,-0.014383,0.110718,0.283696,-0.457087,0.564088,1.585374,0.003861,0.83312,0.242135
5EeQ940KC0N4JCNLPIdRd1,-0.153389,0.908295,0.966328,1.847866,-4.404077,1.116358,0.391466,0.611845,0.195411,-0.911948,...,-0.885772,-0.880681,0.072334,-0.12132,0.07463,-0.893758,-0.231688,0.229336,0.500625,0.129391


* We can see that the names of the columns are integers. This may raise a warning later, so we will convert them to strings.

In [71]:
def make_columns_strings(df):
    # rename the columns in place: 0 -> 'col_0', 1 -> 'col_1', ...
    df.columns = ['col_' + str(c) for c in df.columns]

make_columns_strings(train_analyses_reduced)
make_columns_strings(test_analyses_reduced)

train_analyses_reduced.head()

id,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_20,col_21,col_22,col_23,col_24,col_25,col_26,col_27,col_28,col_29
3c5i1PV78s4Mgz5DPvg5p8,0.2182,4.120789,2.410577,1.749678,-0.705838,-0.810013,-0.819584,1.090276,-0.550811,-0.620698,...,0.657423,0.899749,1.343165,-0.272157,0.926544,0.902467,0.833886,0.130493,-0.631656,1.06517
5W3Y5I9y5ps6M6rl09e7zg,-2.189106,1.951263,2.06027,-5.580001,-0.494353,-2.002322,-0.378216,-2.279076,-0.231867,1.033581,...,0.246576,-0.527923,-1.003374,1.470143,-0.455528,-0.96152,-0.121035,-0.818958,0.21047,0.18177
1FssTKSXO3JLiQc3udfK5I,-6.102108,0.417266,1.352917,-1.263815,-1.273376,-1.772931,-0.769974,-1.044958,0.33371,2.249149,...,-0.226232,0.699581,-0.254394,-0.419115,-0.136135,0.270881,-0.109923,-0.626718,0.428838,-0.043215
0WWz2AaqxLoO0fa9ou6Fqc,0.507525,3.218161,4.13461,0.876526,-1.179215,0.180405,-0.398905,-0.77257,0.946695,-0.624516,...,0.280837,-0.014383,0.110718,0.283696,-0.457087,0.564088,1.585374,0.003861,0.83312,0.242135
5EeQ940KC0N4JCNLPIdRd1,-0.153389,0.908295,0.966328,1.847866,-4.404077,1.116358,0.391466,0.611845,0.195411,-0.911948,...,-0.885772,-0.880681,0.072334,-0.12132,0.07463,-0.893758,-0.231688,0.229336,0.500625,0.129391


* Next, we will merge our important audio features with the reduced dimensions of the deep analysis.

In [77]:
imp_X_train = train_analyses_reduced.join(important_features_df)
imp_X_test = test_analyses_reduced.join(important_features_df)
imp_X_train.head()

id,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_26,col_27,col_28,col_29,danceability,energy,acousticness,loudness,tempo,duration_ms
3c5i1PV78s4Mgz5DPvg5p8,0.2182,4.120789,2.410577,1.749678,-0.705838,-0.810013,-0.819584,1.090276,-0.550811,-0.620698,...,0.833886,0.130493,-0.631656,1.06517,0.218,0.939,0.000134,-4.524,174.393,243320
5W3Y5I9y5ps6M6rl09e7zg,-2.189106,1.951263,2.06027,-5.580001,-0.494353,-2.002322,-0.378216,-2.279076,-0.231867,1.033581,...,-0.121035,-0.818958,0.21047,0.18177,0.865,0.394,0.0832,-12.235,115.936,521693
1FssTKSXO3JLiQc3udfK5I,-6.102108,0.417266,1.352917,-1.263815,-1.273376,-1.772931,-0.769974,-1.044958,0.33371,2.249149,...,-0.109923,-0.626718,0.428838,-0.043215,0.834,0.775,0.00694,-6.144,133.979,220227
0WWz2AaqxLoO0fa9ou6Fqc,0.507525,3.218161,4.13461,0.876526,-1.179215,0.180405,-0.398905,-0.77257,0.946695,-0.624516,...,1.585374,0.003861,0.83312,0.242135,0.491,0.783,0.00521,-8.997,127.437,328220
5EeQ940KC0N4JCNLPIdRd1,-0.153389,0.908295,0.966328,1.847866,-4.404077,1.116358,0.391466,0.611845,0.195411,-0.911948,...,-0.231688,0.229336,0.500625,0.129391,0.379,0.901,0.0157,-5.777,169.97,261960


### Simple Audio Features Model
* As mentioned before, we will also use a simple dataset that contains only the audio features, as a baseline for evaluating our results.
* *Reminder: Before starting our statistical analysis, we created the <code>independent</code> collection, which contains the names of the audio features without valence. Here we derive that list again from the columns of <code>tracks</code>, as <code>audio_features</code>.*

In [9]:
audio_features = tracks.columns.drop('valence')
audio_features

Index(['danceability', 'energy', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'tempo', 'duration_ms',
       'time_signature', 'key_0', 'key_1', 'key_2', 'key_3', 'key_4', 'key_5',
       'key_6', 'key_7', 'key_8', 'key_9', 'key_10', 'key_11'],
      dtype='object')

* In our simple model, we will include these columns only.



In [10]:
simple_X_train = all_X_train[audio_features]
simple_X_test = all_X_test[audio_features]
simple_X_train.head()

id,danceability,energy,loudness,mode,speechiness,acousticness,instrumentalness,liveness,tempo,duration_ms,...,key_2,key_3,key_4,key_5,key_6,key_7,key_8,key_9,key_10,key_11
3c5i1PV78s4Mgz5DPvg5p8,0.218,0.939,-4.524,0,0.0866,0.000134,0.00229,0.11,174.393,243320,...,0,0,0,0,0,0,0,0,1,0
5W3Y5I9y5ps6M6rl09e7zg,0.865,0.394,-12.235,1,0.0528,0.0832,0.00231,0.302,115.936,521693,...,0,0,0,0,0,0,0,0,0,1
1FssTKSXO3JLiQc3udfK5I,0.834,0.775,-6.144,0,0.18,0.00694,2.4e-05,0.476,133.979,220227,...,0,0,0,0,0,0,0,0,0,0
0WWz2AaqxLoO0fa9ou6Fqc,0.491,0.783,-8.997,0,0.0484,0.00521,0.0361,0.0951,127.437,328220,...,0,0,0,0,0,0,1,0,0,0
5EeQ940KC0N4JCNLPIdRd1,0.379,0.901,-5.777,1,0.0522,0.0157,8.4e-05,0.0779,169.97,261960,...,0,0,1,0,0,0,0,0,0,0


* Now we are ready to run our Machine Learning Algorithms.

## K-Nearest Neighbors Regression
* The first algorithm that we will use is a [KNN Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).
* Let's set the scene.
* KNN relies on distances between points, so our data must be scaled before we use it.
* So, we will create a <code>Pipeline</code> that standardizes the features before the regressor.
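
To see why scaling matters for a distance-based method, consider the following small sketch. The five tracks are invented for the illustration. Without scaling, <code>duration_ms</code> (hundreds of thousands) dominates the Euclidean distance and <code>danceability</code> (between 0 and 1) is practically ignored:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical tracks: [danceability, duration_ms]; row 0 is the query track.
X_demo = np.array([[0.20, 200_000.0],
                   [0.90, 201_000.0],   # very different danceability, similar duration
                   [0.22, 215_000.0],   # similar danceability, somewhat longer
                   [0.55, 300_000.0],
                   [0.70, 150_000.0]])

def nearest_to_first(A):
    """Return the row index (other than 0) that is closest to row 0."""
    distances = np.linalg.norm(A[1:] - A[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_neighbor = nearest_to_first(X_demo)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X_demo))
print(raw_neighbor, scaled_neighbor)
```

On the raw values the nearest neighbor is row 1, purely because of its duration. After standardization it is row 2, which is close in both features.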

In [12]:
knn_regr = make_pipeline(StandardScaler(), KNeighborsRegressor(n_jobs=-1))

* Let's create an initial solution with our simple dataset.

In [13]:
knn_regr.fit(simple_X_train, y_train)
simple_knn_pred = knn_regr.predict(simple_X_test)
sklearn.metrics.mean_absolute_error(y_test, simple_knn_pred)

0.15917493790344578
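
The metric used throughout this notebook is the Mean Absolute Error (MAE), i.e. the average absolute difference between the predicted and the true values. Valence ranges from 0 to 1, so an MAE of about 0.16 means that the predictions are off by about 0.16 on that scale on average. A tiny sketch with made-up numbers shows that the scikit-learn function matches the manual computation:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up valence values, only for illustration.
y_true_demo = np.array([0.20, 0.50, 0.90])
y_pred_demo = np.array([0.30, 0.45, 0.70])

# MAE = mean of |true - predicted|
manual_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
sklearn_mae = mean_absolute_error(y_true_demo, y_pred_demo)
print(manual_mae, sklearn_mae)
```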

### All-Dataset
* We will first use our dataset that contains all the variables.
* To tune our algorithm we will use [Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

In [16]:
parameters = [
  {
   'kneighborsregressor__n_neighbors': [5, 10, 20, 50],
   'kneighborsregressor__weights': ['uniform', 'distance'],
   'kneighborsregressor__p': [1, 2]
  }
]

knn_grid = GridSearchCV(knn_regr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
knn_grid.fit(all_X_train, y_train)
all_knn_r = knn_grid.best_estimator_
all_knn_r

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsregressor',
                 KNeighborsRegressor(n_neighbors=10, p=1, weights='distance'))])
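
Besides <code>best_estimator_</code>, a fitted <code>GridSearchCV</code> also exposes the cross-validated score of every candidate. Because the scoring is <code>neg_mean_absolute_error</code>, the scores are negative and have to be negated to read them as MAE. The sketch below uses synthetic data and a reduced grid (both are assumptions made to keep the example self-contained), not the notebook's dataset:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsRegressor()),
    {'kneighborsregressor__n_neighbors': [3, 5, 10]},
    cv=5, scoring='neg_mean_absolute_error')
demo_grid.fit(X_demo, y_demo)

# Negate the score to obtain the cross-validated MAE of the best candidate.
best_cv_mae = -demo_grid.best_score_

# One row per candidate, best first.
summary = (pd.DataFrame(demo_grid.cv_results_)
           [['param_kneighborsregressor__n_neighbors', 'mean_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(demo_grid.best_params_, best_cv_mae)
print(summary)
```

Comparing the cross-validated MAE with the test MAE is a quick way to spot a model that was tuned too closely to the training folds.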

* Let's see how well our Regressor operates on the test set.

In [19]:
all_knn_pred = all_knn_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_knn_pred)

0.12206381476315419

* The result is much better than the simple model.

#### Weights
* Here, an interesting idea is to add **weights** that are proportional to the impact each variable has on valence.
* The point is that dimensions with a significant impact on valence should contribute more to the distance between points.
* A simple way to do that is to use the **standardized coefficients** of a linear regression as weights.
* However, we can't reuse the coefficients that we calculated before, because the dataset we used then also contains our current test set.
* So we have to calculate them again, using only our train dataset.
* *Note: To simplify things, here we will include key_11 in the model, as we found that keys do not affect valence either way.*
* To do all of this, we will create a custom scikit-learn **Transformer**.
* It will weight each cell of every row using the appropriate coefficient (simple multiplication).
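
The weighting idea can be sketched in a few lines. This sketch is a simplification and not the transformer used in this notebook: it uses scikit-learn's <code>LinearRegression</code> instead of the statsmodels formula API, and the data and its coefficients are invented. Columns with a large standardized coefficient are stretched, so they contribute more to any distance computed afterwards:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Hypothetical target: strong effect of column 0, weak effect of column 1, none of column 2.
y_demo = 2.0 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Standardize the predictors, then fit a linear model to get standardized coefficients.
X_std = StandardScaler().fit_transform(X_demo)
coefs = LinearRegression().fit(X_std, y_demo).coef_

# Broadcasting multiplies column j of every row by coefs[j].
X_weighted = X_std * coefs
print(np.round(np.abs(coefs), 2))
```

After the weighting, a difference of one standard deviation in column 0 moves two points much further apart than the same difference in column 2.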

In [47]:
class WeightAdjustingTransformer(BaseEstimator, TransformerMixin):
    """Scale each column by its standardized linear-regression coefficient."""

    def __init__(self):
        pass

    def fit(self, X, y):
        # turn X and y into dataframes with formula-friendly column names
        X = np.asarray(X)
        X_df = pd.DataFrame(X, columns=['x_' + str(i) for i in range(X.shape[1])])
        y_df = pd.DataFrame({'y': np.asarray(y).ravel()}, index=X_df.index)
        # get the standardized coefficients
        coeff_dict, pvalues_dict, sorted_metrics = self.get_standarized_coefficients(
            X_df.join(y_df), X_df.columns, 'y')
        # store the weights on the instance rather than on the class, so that
        # separately fitted (e.g. cloned) transformers do not overwrite each other
        self.standarized_coeffs_ = np.array([coeff_dict[x] for x in X_df.columns])
        return self

    def transform(self, X, y=None):
        # broadcasting multiplies every row by the coefficient vector
        return np.asarray(X) * self.standarized_coeffs_

    @staticmethod
    def get_standarized_coefficients(df, independent, target='valence'):
        standarized_df = WeightAdjustingTransformer.get_standarized_dataframe(df, independent)
        standarized_df = pd.concat((standarized_df, df[target]), axis=1)  # add target column
        standard_lr = WeightAdjustingTransformer.get_lr_res(standarized_df, independent, target)
        coefs = standard_lr.params.iloc[1:]  # don't include the Intercept
        pvalues = standard_lr.pvalues.iloc[1:]
        coef_dict = dict(zip(independent, coefs))
        pvalues_dict = dict(zip(independent, pvalues))
        # column names ranked by absolute coefficient, largest first
        ranked = sorted(coef_dict, key=lambda x: abs(coef_dict[x]), reverse=True)
        return coef_dict, pvalues_dict, ranked

    @staticmethod
    def get_standarized_dataframe(df_source, columns):
        scaler = StandardScaler()
        standarized_df = pd.DataFrame(scaler.fit_transform(df_source[columns]),
                                      index=df_source.index,
                                      columns=columns)
        return standarized_df

    @staticmethod
    def get_lr_res(df, independent, target='valence'):
        formula = f'{target} ~ ' + '+'.join(independent)
        model = smf.ols(formula=formula, data=df)
        return model.fit()

* We can now create our **Weighted KNN Regressor** pipeline.

In [32]:
weighted_knn_regr = make_pipeline(StandardScaler(), WeightAdjustingTransformer(), KNeighborsRegressor())

* Let's see how well this new idea performs on the simple dataset.

In [162]:
weighted_knn_regr.fit(simple_X_train, y_train)
simple_weighted_knn_pred = weighted_knn_regr.predict(simple_X_test)
sklearn.metrics.mean_absolute_error(y_test, simple_weighted_knn_pred)

0.14881601927447952

* This seems to work better than the unweighted KNN on the same dataset (0.149 vs. 0.159).
* We will do Grid Search again.

In [6]:
weighted_knn_grid = GridSearchCV(weighted_knn_regr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
weighted_knn_grid.fit(all_X_train, y_train)
all_weighted_knn_r = weighted_knn_grid.best_estimator_
all_weighted_knn_r

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('weightadjustingtransformer', WeightAdjustingTransformer()),
                ('kneighborsregressor',
                 KNeighborsRegressor(n_neighbors=10, p=1, weights='distance'))])

* We will evaluate on the test set again.

In [165]:
all_weighted_knn_pred = all_weighted_knn_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_weighted_knn_pred)

0.11269524398187614

* The results are pretty good relative to the previous attempts.

### Important-Dataset
* We will use KNN for our Dataset with the "important" variables too.

In [7]:
knn_grid.fit(imp_X_train, y_train)
imp_knn_r = knn_grid.best_estimator_
imp_knn_r

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('kneighborsregressor',
                 KNeighborsRegressor(n_jobs=-1, n_neighbors=10, p=1,
                                     weights='distance'))])

* Let's see how this one performs on the test set.

In [173]:
imp_knn_pred = imp_knn_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, imp_knn_pred)

0.11922329934561932

* The results here are better than those of the unweighted KNN with all the variables (0.119 vs. 0.122).

#### Weights
* We will use weights for this dataset too.

In [9]:
weighted_knn_grid.fit(imp_X_train, y_train)
imp_weighted_knn_r = weighted_knn_grid.best_estimator_
imp_weighted_knn_r

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('weightadjustingtransformer', WeightAdjustingTransformer()),
                ('kneighborsregressor',
                 KNeighborsRegressor(n_neighbors=20, p=1, weights='distance'))])

* And we will test this regressor.

In [179]:
weighted_imp_knn_pred = imp_weighted_knn_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, weighted_imp_knn_pred)

0.1067694786474835

* This is the lowest error of any model that we have tried so far.

## Linear Support Vector Regression
* The next Machine Learning algorithm that we will use is the [Linear Support Vector Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR).
* Let's create one.
* We have to remember that SVMs are sensitive to feature scaling too.

In [25]:
lsvr = make_pipeline(StandardScaler(), LinearSVR(max_iter=10000))

* We will create a base solution using our simple dataset.

In [26]:
lsvr.fit(simple_X_train, y_train)
simple_lsvr_pred = lsvr.predict(simple_X_test)
sklearn.metrics.mean_absolute_error(y_test, simple_lsvr_pred)

0.15661181039930386

* The first results of SVR are not particularly good.
* We will see how it performs with our more detailed datasets.

### All-Dataset
* Again we will use Grid Search.

In [29]:
parameters = [
  {
   'linearsvr__epsilon': [0, 0.1, 0.3],
   'linearsvr__C': [0.5, 1, 2],
  }
]

lsvr_grid = GridSearchCV(lsvr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
lsvr_grid.fit(all_X_train, y_train)
all_lsvr_r = lsvr_grid.best_estimator_
all_lsvr_r



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvr', LinearSVR(C=0.5, epsilon=0, max_iter=10000))])

* Let's see the results on test.

In [30]:
all_lsvr_pred = all_lsvr_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_lsvr_pred)

0.12218039898841296

#### Weights
* As we have already created the <code>WeightAdjustingTransformer</code>, we can use it again.

In [52]:
weighted_lsvr = make_pipeline(StandardScaler(), WeightAdjustingTransformer(), LinearSVR())

* We will run Grid Search on this new LSVR and the dataset with all columns.

In [53]:
weighted_lsvr_grid = GridSearchCV(weighted_lsvr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
weighted_lsvr_grid.fit(all_X_train, y_train)
all_weighted_lsvr_r = weighted_lsvr_grid.best_estimator_
all_weighted_lsvr_r



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('weightadjustingtransformer', WeightAdjustingTransformer()),
                ('linearsvr', LinearSVR(C=2, epsilon=0))])

* Let's see the results on test.

In [54]:
all_weighted_lsvr_pred = all_weighted_lsvr_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_weighted_lsvr_pred)

0.12194024193998934

* Here, the improvement is marginal (0.1219 vs. 0.1222).

### Important-Dataset
* We will apply the same procedure for the important dataset too.

In [78]:
lsvr_grid.fit(imp_X_train, y_train)
imp_lsvr_r = lsvr_grid.best_estimator_
imp_lsvr_r



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearsvr', LinearSVR(C=1, epsilon=0, max_iter=10000))])

* And on test.

In [79]:
imp_lsvr_pred = imp_lsvr_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, imp_lsvr_pred)

0.12395255787752489

* The results are worse in this case (0.1240 vs. 0.1222 with all the variables).

#### Weights

In [80]:
weighted_lsvr_grid.fit(imp_X_train, y_train)
imp_weighted_lsvr_r = weighted_lsvr_grid.best_estimator_
imp_weighted_lsvr_r

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('weightadjustingtransformer', WeightAdjustingTransformer()),
                ('linearsvr', LinearSVR(C=2, epsilon=0))])

* Let's test again.

In [82]:
imp_weighted_lsvr_pred = imp_weighted_lsvr_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, imp_weighted_lsvr_pred)

0.12392248046461102

* The result here is almost identical to the unweighted one (0.1239 vs. 0.1240), and worse than the model with all columns.

## Tree-Based Algorithms
* In this section, we will use tree-based algorithms, and specifically:
  * [XGBoost](https://xgboost.readthedocs.io/en/stable/)
  * [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
* Both are gradient boosting implementations based on decision trees, and they tend to work well in practice on tabular data.

## XGBoost
* We will start by creating an XGBoost Regressor.

In [53]:
xgbr = xgb.XGBRegressor()

* Let's get a base error estimation, using the simple dataset.

In [56]:
xgbr.fit(simple_X_train, y_train)
simple_xgb_pred = xgbr.predict(simple_X_test)
sklearn.metrics.mean_absolute_error(y_test, simple_xgb_pred)



0.14064690888568981

### All-Dataset
* Here the runtime is very long, so instead of an exhaustive Grid Search we will use [Randomized Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), which evaluates only a random sample of the parameter combinations.
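
To get a feeling for the savings, the sketch below counts the combinations in the XGBoost search space used in this section and draws a random sample from it. <code>RandomizedSearchCV</code> evaluates <code>n_iter</code> candidates, which is 10 by default, instead of the full grid. The <code>random_state=0</code> is an assumption added here for reproducibility:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the XGBoost cell in this section.
search_space = {
    'booster': ['gbtree', 'dart'],
    'eta': [0.1, 0.3, 0.6],
    'gamma': [0.5, 3, 10, 20],
    'max_depth': [3, 6, 10],
}

# Size of the exhaustive grid: 2 * 3 * 4 * 3 combinations.
n_combinations = len(ParameterGrid(search_space))

# A randomized search with the default n_iter=10 only tries a sample like this one.
sampled = list(ParameterSampler(search_space, n_iter=10, random_state=0))
print(n_combinations, len(sampled))
```

With 5-fold cross validation, this means 50 model fits instead of 360. Without a fixed <code>random_state</code>, the sampled candidates, and therefore the selected parameters, can differ between runs.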

In [71]:
parameters = [
  {
   'booster': ['gbtree', 'dart'],
   'eta': [0.1, 0.3, 0.6],
   'gamma': [0.5, 3, 10, 20],
   'max_depth': [3, 6, 10]
  }
]

xgbr_random_grid = RandomizedSearchCV(xgbr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
xgbr_random_grid.fit(all_X_train, y_train)
all_xgbr_r = xgbr_random_grid.best_estimator_
all_xgbr_r



XGBRegressor(eta=0.3, gamma=0.5, max_depth=6)

* Let's see how well this one performs on the test set.

In [72]:
all_xgbr_pred = all_xgbr_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_xgbr_pred)

0.09842982709213195

* The results seem pretty good.

### Important-Dataset

In [73]:
xgbr_random_grid.fit(imp_X_train, y_train)
imp_xgbr_r = xgbr_random_grid.best_estimator_
imp_xgbr_r



XGBRegressor(booster='dart', eta=0.1, gamma=0.5, max_depth=6)

* Test again.

In [74]:
imp_xgbr_pred = imp_xgbr_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, imp_xgbr_pred)

0.10414892631559375

* Pretty good results again, but not as good as the dataset with all the variables.

## LightGBM
* We will use a LightGBM Regressor next.
* Let's create it.

In [75]:
lgbmr = lgb.LGBMRegressor()

* Again we will get a base error for this estimator.

In [76]:
lgbmr.fit(simple_X_train, y_train)
simple_lgbm_pred = lgbmr.predict(simple_X_test)
sklearn.metrics.mean_absolute_error(y_test, simple_lgbm_pred)

0.1333431255291485

* This is the lowest baseline error on the simple dataset so far, so the first indication is very promising.
* Let's do our Grid Search.

In [163]:
parameters = [
  {
   'boosting_type': ['gbdt', 'dart'],
   'num_leaves': [10, 31, 60, 90],
   'max_depth': [-1, 5 , 20],
   'learning_rate': [0.05, 0.1, 0.2, 0.3]
  }
]

lgbmr_grid = GridSearchCV(lgbmr, parameters, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
lgbmr_grid.fit(all_X_train, y_train)
all_lgbmr_r = lgbmr_grid.best_estimator_
all_lgbmr_r

LGBMRegressor(learning_rate=0.2, num_leaves=90)

* And have a look at how well it performs.

In [255]:
all_lgbmr_pred = all_lgbmr_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, all_lgbmr_pred)

0.09164649690604323

* Wow! This is noticeably lower than any value found previously.
* Is there any chance that we can **tune it a bit more**?
* The idea here is to give our Regressor **more estimators** to fit. But to take advantage of them, it is necessary to **increase the number of leaves** and **reduce the learning rate**.
* After experimenting a bit, we concluded that some parameters that work relatively well are:
  * <code>learning_rate</code>: 0.05
  * <code>num_leaves</code>: 200
  * <code>n_estimators</code>: 1500
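
The interplay between the learning rate and the number of estimators can be illustrated with a small sketch. It uses scikit-learn's <code>GradientBoostingRegressor</code> on synthetic data as a stand-in for LightGBM and the real dataset, so the numbers are not comparable to the ones in this notebook. With a low learning rate each tree only makes a small correction, so the error keeps dropping as more trees are added:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the real data.
X_demo, y_demo = make_friedman1(n_samples=600, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

gbr = GradientBoostingRegressor(learning_rate=0.05, n_estimators=300, random_state=0)
gbr.fit(X_tr, y_tr)

# staged_predict yields the test predictions after 1, 2, ..., n_estimators trees.
staged_mae = [mean_absolute_error(y_te, pred) for pred in gbr.staged_predict(X_te)]
mae_30_trees, mae_300_trees = staged_mae[29], staged_mae[-1]
print(f'MAE after 30 trees: {mae_30_trees:.3f}, after 300 trees: {mae_300_trees:.3f}')
```

Plotting <code>staged_mae</code> against the number of trees is one way to judge whether adding more estimators is still paying off.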


In [85]:
fine_all_lgbmr_r = lgb.LGBMRegressor(learning_rate=0.05, num_leaves=200, n_estimators=1500)

* So, let's check its performance. 
* First, we will do a cross validation on the train set.

In [86]:
cv_results = cross_validate(fine_all_lgbmr_r, all_X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
print('Results: {0}, Average: {1}'.format(-cv_results['test_score'], -cv_results['test_score'].mean()))

Results: [0.08575149 0.08835177 0.08765204 0.08921213 0.08784873], Average: 0.08776323228900765


* For comparison, let's see how well our initial LGBM Regressor performs.

In [92]:
cv_results = cross_validate(all_lgbmr_r, all_X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
print('Results: {0}, Average: {1}'.format(-cv_results['test_score'], -cv_results['test_score'].mean()))

Results: [0.09085612 0.09369718 0.0933885  0.09414988 0.09293705], Average: 0.09300574470905838


* The results of the fine-tuned model are considerably better (average MAE of 0.0878 vs. 0.0930).
* And finally, let's see the performance on the test set.

In [93]:
fine_all_lgbmr_r.fit(all_X_train, y_train)
fine_all_lgbmr_pred = fine_all_lgbmr_r.predict(all_X_test)
sklearn.metrics.mean_absolute_error(y_test, fine_all_lgbmr_pred)

0.08537355791917575

* The results are the best we have seen.

### Important-Dataset
* Is there any chance we can get better results with the important dataset?
* Let's perform Grid Search again.

In [278]:
lgbmr_grid.fit(imp_X_train, y_train)
imp_lgbmr_r = lgbmr_grid.best_estimator_
imp_lgbmr_r

LGBMRegressor(max_depth=20, num_leaves=90)

* Let's test the performance.

In [280]:
imp_lgbmr_pred = imp_lgbmr_r.predict(imp_X_test)
sklearn.metrics.mean_absolute_error(y_test, imp_lgbmr_pred)

0.09876587876028549

* The error we get here is significantly higher than that of the previous model, so we won't try to tune it any further.

### Conclusions - Non Neural Network Machine Learning
* The best results were achieved with the **LGBM Regressor**.
* The minimum Mean Absolute Error on the test set was **0.0854**.
* But let's dive a little deeper and try to interpret how these results were calculated.
* To do that, we will use the **feature importances** of our best regressor.

In [94]:
importances = fine_all_lgbmr_r.feature_importances_
indices = np.argsort(importances)[::-1]
feature_cols = all_X_test.columns
    
for f in range(10):
    print("%d. feature %d %s (%f)" % (f + 1, indices[f], feature_cols[indices[f]], importances[indices[f]]))

1. feature 9 duration_ms (4058.000000)
2. feature 8 tempo (3534.000000)
3. feature 7 liveness (3388.000000)
4. feature 4 speechiness (3191.000000)
5. feature 134 duration_std_mid (3166.000000)
6. feature 78 duration_std_start (2993.000000)
7. feature 0 danceability (2938.000000)
8. feature 108 pitches_2_std_mid (2917.000000)
9. feature 79 pitches_1_means_mid (2853.000000)
10. feature 97 timbre_7_means_mid (2849.000000)


* The three most important predictors according to the feature importances are:
  * <code>duration_ms</code>
  * <code>tempo</code>
  * <code>liveness</code>
* This may be a little surprising, but we have to remember that each of these is combined with all the others to get good results. Also, by default LightGBM reports how many times each feature is used in a split, which is not the same as how much each feature reduces the error.
* We can't be sure, but for example we can see that <code>duration_std_mid</code> and <code>duration_std_start</code> also rank very high. So the combination of the duration of the song with the standard deviation of the duration of its segments may be important for the calculation of valence.
* Also, <code>liveness</code> and <code>speechiness</code> are columns that we didn't include in our important dataset, which may explain the better results we get with the dataset that contains all the variables.
* Lastly, *pitch* features appear twice in the top feature importances (<code>pitches_2_std_mid</code> and <code>pitches_1_means_mid</code>), which suggests that pitch is among the most informative features that we get from the deep analysis.
* Additionally, <code>loudness</code> does not appear a single time here, which may indicate that it does not affect valence much after all.
* In general, a higher-level interpretation of the feature importances indicates that *valence* has a lot to do with the rhythm of the song (the variance of the duration of the segments, tempo and danceability all relate to rhythm).
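
Since importances of this kind are hard to read on their own, a possible cross-check (not performed in this notebook) is permutation importance: shuffle one column of the test set at a time and measure how much the error grows. The sketch below shows the mechanics on synthetic data with a scikit-learn gradient boosting model. Both are stand-ins chosen to keep the example self-contained:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first 5 of the 10 columns influence the target.
X_demo, y_demo = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time and measure how much the test MAE degrades.
result = permutation_importance(model, X_te, y_te, scoring='neg_mean_absolute_error',
                                n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking[:3]:
    print(f'feature {idx}: {result.importances_mean[idx]:.3f}')
```

Applied to the fitted LightGBM regressor and the real test set, a ranking like this could confirm or challenge the conclusions drawn above.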