# Comparing Models

- Run the models 3 times with different random seeds (24, 45, 93) in the train/test split to get 3 different R2-scores
- Take the STD of the 3 R2-scores and use it to draw error bars for the different models.
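
The two steps above can be sketched as follows. This is only a minimal illustration: the data is synthetic (`make_regression`), the model is a plain `LinearRegression`, and `test_size=0.3` is an assumed value, none of which are taken from the actual notebooks.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the notebook uses the Spotify/lyrics features instead
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

seeds = (24, 45, 93)  # the three random seeds named above
r2_scores = []
for seed in seeds:
    # a different seed gives a different train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    r2_scores.append(model.score(X_test, y_test))  # R2 on the test set

r2_mean = np.mean(r2_scores)  # height of the point in the comparison plot
r2_std = np.std(r2_scores)    # length of the error bar
print(r2_scores, r2_mean, r2_std)
```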

The train/test split with the first random seed was done in notebook 12, so take the first set of metrics from there.
This notebook provides the metrics for random seeds 2 and 3. 

I have cleaned up the code a little so that it runs faster.
There is less explanation in this notebook than in notebook 12, because I do the same thing two more times. For more detailed explanations, please see notebook 12.

## Import stuff

In [None]:
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
import sklearn.metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('seaborn-bright')

%matplotlib inline

## Load data

In [None]:
data = pd.read_csv('./data_top10c_more_lyrics.csv')
data.head(3)

## Tidy up the data a little

#### Get dummies for the Country column + include in the data frame

In [None]:
country_dum = pd.get_dummies(data['Country'])

data_c = pd.concat([data, country_dum], axis=1)

data_c.head(3)

#### Get dummies for the Position column  + include in the data frame

**Gradient #3 - TF-IDF + top 3 coefs**

In [None]:
# define X and y
X_train26 = indep_train_tvec[['Energy', 'Acousticness', 'Tempo']]
y_train26 = dep_train
X_test26 = indep_test_tvec[['Energy', 'Acousticness', 'Tempo']]
y_test26 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('ls', 'lad', 'huber', 'quantile'),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'verbose' : np.arange(0, 1)}

grad6 = GradientBoostingRegressor(random_state=24)

get_best_hype(grad6, params, X_train26, y_train26)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
grad6 = GradientBoostingRegressor(random_state=24, n_estimators=70, loss='huber', max_depth=2, max_features='auto',
                                 verbose=0)

# call function
evaluate_model(grad6, X_train26, X_test26, y_train26, y_test26)

## Compare model visualization

R2-score and STD of the 3 R2-scores obtained with different random seeds in the train/test split

https://matplotlib.org/1.2.1/examples/pylab_examples/errorbar_demo.html

https://stackoverflow.com/questions/22364565/python-pylab-scatter-plot-error-bars-the-error-on-each-point-is-unique
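
A minimal sketch of such an error-bar plot, using the three linear regression R2-scores per TF-IDF feature set that are listed in the STD cell of this notebook. The figure size, labels and title are illustrative choices, not the original plot.

```python
import numpy as np
from matplotlib import pyplot as plt

# R2-scores of the linear regression with TF-IDF features,
# one per random seed (24, 45, 93), copied from the STD cell in this notebook
scores = {
    'All features': [0.021622186821517175, 0.06084852472490698, 0.04931998355658407],
    'Top 10 features': [0.15924513606808388, 0.1421096962323608, 0.16564254674555934],
    'Top 3 features': [0.15789602776517975, 0.1214683266365274, 0.14426340969288443],
}

labels = list(scores)
means = [np.mean(v) for v in scores.values()]  # point height
stds = [np.std(v) for v in scores.values()]    # error bar length

fig, ax = plt.subplots(figsize=(6, 4))
ax.errorbar(range(len(labels)), means, yerr=stds, fmt='o', capsize=5)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_ylabel('R2-score (mean of 3 seeds)')
ax.set_title('Linear regression, TF-IDF')
```

Each point is the mean of the three scores and each bar extends one STD above and below it, so feature sets whose bars overlap heavily are hard to tell apart on this evidence.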

### Independent variables (X):
### Spotify API, Countries, Position, AvgStreams, AvgPosition, TextBlob, NLP

Spotify API: Acousticness, Energy, Instrumentalness, Mode, Tempo, Valence<BR />
Countries: au, ca, de, fr, gb, it, nl, us<BR />
Position: 1-200<BR />
TextBlob: Polarity, Subjectivity<BR />
NLP: CountVec / TF-IDF

**LinearRegressor**
Random_State(24, 45, 93)

Get the STD

In [None]:
# All features - cvec
print(np.std([0.013486410677653882,0.018647294642352596,0.027289480032708258]))

# Top 10 features - cvec
print(np.std([0.15703025345048383,0.0027594332154159407,0.010353867606340716]))

# Top 3 features - cvec
print(np.std([0.14873571822331177,-0.007729319613470675,0.005940599633448396]))

# All features - tvec
print(np.std([0.021622186821517175,0.06084852472490698,0.04931998355658407]))

# Top 10 features - tvec
print(np.std([0.15924513606808388,0.1421096962323608,0.16564254674555934]))

# Top 3 features - tvec
print(np.std([0.15789602776517975,0.1214683266365274,0.14426340969288443]))
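
One detail worth keeping in mind: `np.std` uses `ddof=0` by default, i.e. the population standard deviation. With only three runs, the sample standard deviation (`ddof=1`) is larger by a factor of sqrt(3/2), about 1.22. The sketch below shows the mean and both variants for one of the score triples above; which variant to report is a judgment call, not something this notebook prescribes.

```python
import numpy as np

# Top 10 features - tvec, copied from the cell above
r2 = [0.15924513606808388, 0.1421096962323608, 0.16564254674555934]

print(np.mean(r2))         # average R2-score over the three seeds
print(np.std(r2))          # population STD (ddof=0), NumPy's default
print(np.std(r2, ddof=1))  # sample STD (ddof=1), divides by n - 1
```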

Get average R2-score

In [None]:
def get_best_hype(model, params, X_train, y_train):  
    # standardize
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
     
    # Best Hyperparameters
    rs = RandomizedSearchCV(model, params, n_iter=27)
    
    # fit
    rs.fit(X_train_s, y_train)
     
    return {'best_score': rs.best_score_,'best_params': rs.best_params_} 

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # standardize the predictors (the scaler is fitted on the training data only)
    ss = StandardScaler()
    ss.fit(X_train)
    X_train_s = ss.transform(X_train)
    X_test_s = ss.transform(X_test)
    
    # fit
    model.fit(X_train_s, y_train)
    
    # Evaluate: predict
    y_pred = model.predict(X_test_s)
    y_true = y_test
    
    # root mean squared error (square root of the MSE)
    rmse = np.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    
    # Evaluate: R^2 on the test set
    score = model.score(X_test_s, y_test)
    
    return {'Score (R^2)': score, 'RMSE': rmse}
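
Note that `get_best_hype` fits the scaler on the whole training set before the cross-validated search, so every validation fold has already contributed to the scaling statistics. One possible alternative, sketched here on synthetic data with an assumed reduced parameter grid and assumed `n_iter=5`, `cv=3`, is to put the scaler and the model into a `Pipeline`, so that the scaler is refitted inside each fold. For tree-based models the practical difference is likely small, since they are largely insensitive to feature scaling.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Scaling and model in one estimator: the scaler is fitted per CV fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', GradientBoostingRegressor(random_state=24)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
params = {
    'model__n_estimators': np.arange(10, 100, 10),
    'model__max_depth': np.arange(2, 20, 3),
}

rs = RandomizedSearchCV(pipe, params, n_iter=5, cv=3, random_state=24)
rs.fit(X_train, y_train)
print(rs.best_score_, rs.best_params_)
```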

**Ada #1 - TF-IDF + all coefs**

In [None]:
# Declare indep and dep
X_train23 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train23 = dep_train
X_test23 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test23 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('linear', 'square', 'exponential')}

ada2 = AdaBoostRegressor(random_state=24) 
# base_estimator = DecisionTreeRegressor() (default)
# learning_rate: (default=1.0)

get_best_hype(ada2, params, X_train23, y_train23)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
ada2 = AdaBoostRegressor(n_estimators=10, loss='linear', random_state=24)
# I'm keeping the default base_estimator, because using a random forest regressor takes too long

# call function
evaluate_model(ada2, X_train23, X_test23, y_train23, y_test23)

**Gradient #1 - TF-IDF + all coefs**

In [None]:
## Declare indep and dep
X_train24 = indep_train_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train24 = dep_train
X_test24 = indep_test_tvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test24 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('ls', 'lad', 'huber', 'quantile'),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'verbose' : np.arange(0, 1)}

grad4 = GradientBoostingRegressor(random_state=24) 
# learning_rate: (default=0.1)
# init = BaseEstimator, default = None (loss.init_estimator)

get_best_hype(grad4, params, X_train24, y_train24)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
grad4 = GradientBoostingRegressor(random_state=24, n_estimators=70, loss='ls', max_depth=5, max_features='auto',
                                 verbose=0)
# alpha is only used when loss = huber or quantile (default alpha=0.9)

# call function
evaluate_model(grad4, X_train24, X_test24, y_train24, y_test24)

In [None]:
# Look at the feature importances with feature_importances_
pd.Series(dict(zip(X_train24.columns,grad4.feature_importances_ ))).abs().sort_values(ascending=False).head(15)

**Gradient #2 - TF-IDF + top 10 coefs**

In [None]:
# define X and y
X_train25 = indep_train_tvec[['Energy', 'Acousticness', 'Subjectivity', 'baby', 'Instrumentalness', 'Tempo', 
                              'girl', 'moi', 'avg_Position', 'hab']]
y_train25 = dep_train
X_test25 = indep_test_tvec[['Energy', 'Acousticness', 'Subjectivity', 'baby', 'Instrumentalness', 'Tempo', 
                              'girl', 'moi', 'avg_Position', 'hab']]
y_test25 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('ls', 'lad', 'huber', 'quantile'),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'verbose' : np.arange(0, 1)}

grad5 = GradientBoostingRegressor(random_state=24)

get_best_hype(grad5, params, X_train25, y_train25)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
grad5 = GradientBoostingRegressor(random_state=24, n_estimators=90, loss='lad', max_depth=8, max_features='log2',
                                 verbose=0)

# call function
evaluate_model(grad5, X_train25, X_test25, y_train25, y_test25)

In [None]:
position_dum = pd.get_dummies(data['Position'])

data_c_p = pd.concat([data_c, position_dum], axis=1)

data_c_p.head(3)

#### Group by ID, sum()

In [None]:
data_cp_groupbyID = data_c_p.groupby('ID').sum()
data_cp_groupbyID.head(3)

#### Make 2 new columns for average Streams and average Position

In [None]:
# Declare indep and dep
X_train10 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train10 = dep_train
X_test10 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test10 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('linear', 'square', 'exponential')}

ada = AdaBoostRegressor(random_state=24) 
# base_estimator = DecisionTreeRegressor() (default)
# learning_rate: (default=1.0)

get_best_hype(ada, params, X_train10, y_train10)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
ada = AdaBoostRegressor(n_estimators=10, loss='exponential', random_state=24)
# I'm keeping the default base_estimator, because using a random forest regressor takes too long

# call function
evaluate_model(ada, X_train10, X_test10, y_train10, y_test10)

**Gradient #1 - CountVec + all coefs**

In [None]:
# Declare indep and dep
X_train11 = indep_train_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_train11 = dep_train
X_test11 = indep_test_cvec.drop(['Track Name', 'Artist', 'ID', 'Lyrics', 'Valence'], axis=1)
y_test11 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('ls', 'lad', 'huber', 'quantile'),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'verbose' : np.arange(0, 1)}


grad = GradientBoostingRegressor(random_state=24) 
# learning_rate: (default=0.1)
# init = BaseEstimator, default = None (loss.init_estimator)

get_best_hype(grad, params, X_train11, y_train11)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
grad = GradientBoostingRegressor(random_state=24, n_estimators=30, loss='lad', max_depth=17 , max_features='auto',
                                 verbose=0)

# call function
evaluate_model(grad, X_train11, X_test11, y_train11, y_test11)

In [None]:
# Look at the feature importances with feature_importances_
pd.Series(dict(zip(X_train11.columns,grad.feature_importances_ ))).abs().sort_values(ascending=False).head(15)

**Gradient #2 - CountVec + top 10 coefs**

In [None]:
# define X and y
X_train12 = indep_train_cvec[['Energy', 'Acousticness', 'Tempo', 'Subjectivity', 'Polarity','avg_Streams', 
                              'Instrumentalness', 'avg_Position', 'baby', 'know']]
y_train12 = dep_train
X_test12 = indep_test_cvec[['Energy', 'Acousticness', 'Tempo', 'Subjectivity', 'Polarity','avg_Streams', 
                              'Instrumentalness', 'avg_Position', 'baby', 'know']]
y_test12 = dep_test

# RandomizedSearch
params = {'n_estimators': np.arange(10, 100, 10),
        'loss': ('ls', 'lad', 'huber', 'quantile'),
        'max_depth': np.arange(2, 20, 3),
        'max_features' : ('auto', 'sqrt', 'log2'),
        'verbose' : np.arange(0, 1)}

grad2 = GradientBoostingRegressor(random_state=24)

get_best_hype(grad2, params, X_train12, y_train12)

In [None]:
# choose model and use best hyperparameters (from RandomizedSearchCV)
grad2 = GradientBoostingRegressor(random_state=24, n_estimators=90, loss='huber', max_depth=2, max_features='auto',
                                 verbose=0)

# call function
evaluate_model(grad2, X_train12, X_test12, y_train12, y_test12)

**Gradient #3 - CountVec + top 3 coefs**