# Spotify Popularity Predictor (39%)

The goal of this challenge is to create a model that predicts the popularity of a song based on its features.

The dataset contains a list of tracks with the following characteristics:
- `acousticness`: whether the track is acoustic
- `danceability`: describes how suitable a track is for dancing
- `duration_ms`: duration of the track in milliseconds
- `energy`: represents a perceptual measure of intensity and activity
- `explicit`: whether the track has explicit lyrics
- `id`: id for the track
- `instrumentalness`: predicts whether a track contains no vocals
- `key`: the key the track is in
- `liveness`: detects the presence of an audience in the recording
- `loudness`: the overall loudness of a track in decibels
- `mode`: modality of a track
- `name`: name of the track
- `popularity`: popularity of the track
- `release_date`: release date
- `speechiness`: detects the presence of spoken words in a track
- `tempo`: overall estimated tempo of a track in beats per minute
- `valence`: describes the musical positiveness conveyed by a track
- `artist`: artist who performed the track

# Model

## Data collection

📝 **Load the `spotify_popularity_train.csv` dataset from the provided URL**
- Display the first few rows
- Perform the basic cleaning operations (remove redundant lines, as well as those with missing values)
- Store the result in a `DataFrame` named `data`

In [1]:
url = "https://wagon-public-datasets.s3.amazonaws.com/certification_paris_2021Q1/spotify_popularity_train.csv"

In [2]:
import pandas as pd

In [3]:
data = pd.read_csv(url)

In [4]:
data.dropna(inplace=True)

In [5]:
data.drop_duplicates(inplace=True)

In [6]:
data = data.reset_index().drop(columns='index')

🧪 **Run the following cell to save your results**

In [7]:
from nbresult import ChallengeResult

ChallengeResult(
    "data_cleaning",
    shape=data.shape).write()

## Simple model

📝 **Which sklearn's scoring [metric](https://scikit-learn.org/stable/modules/model_evaluation.html) should we use if we want to:**
- **Strongly penalize** largest errors
- Measure errors **in the same unit** than `popularity` 
- Is better when greater (metric_good_model > metric_bad_model)

👉 Store its exact name as `string` in the variable `scoring` below

🚨 You must use this metric for the rest of the challenge

In [23]:
scoring = "neg_root_mean_squared_error"

**📝 Let's build a first simple linear model using only the numerical features in our dataset to start with**
- Build `X_simple` keeping only numerical features
- Build `y` your target containing the `popularity`

In [7]:
X_simple = data.drop(columns=['id', 'name', 'artist', 'release_date', 'popularity'])
y = data['popularity']

### Holdout evaluation

**📝 Create the 4 variables `X_train_simple` `y_train`, `X_test_simple`, `y_test` with a 50% split with random sampling**

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train_simple, X_test_simple, y_train, y_test = train_test_split(X_simple, y, test_size=0.5)

**📝 Fit and evaluate a basic linear model (do not fine tune it) with this holdout method**
- Store your model true performance in a float variable `score_simple_holdout`

In [10]:
from sklearn.linear_model import LinearRegression

In [11]:
model = LinearRegression()

In [12]:
model.fit(X_train_simple, y_train)

LinearRegression()

In [13]:
model.score(X_test_simple, y_test)

0.2905958969342868

In [14]:
score_simple_holdout = model.score(X_test_simple, y_test)

### Cross-validation evaluation

📝 **Let's be sure our score is representative**: 
- 5-times cross validate a basic linear model on the whole numeric dataset (`X_simple`, `y`)
- Do not fine tune your model
- Store your mean performance in a variable `score_simple_cv_mean` as a `float`
- Store the standard deviation of your performances in a float variable `score_simple_cv_std`

In [15]:
from sklearn.model_selection import cross_validate

In [16]:
model = LinearRegression()

In [17]:
model.fit(X_simple, y)

LinearRegression()

In [18]:
cv_results = cross_validate(model, X_simple, y, cv=5, scoring=['neg_root_mean_squared_error'])

In [19]:
cv_results

{'fit_time': array([0.02853131, 0.07288718, 0.04941654, 0.04913688, 0.04779673]),
 'score_time': array([0.01571178, 0.01795745, 0.01088333, 0.01157951, 0.01131749]),
 'test_neg_root_mean_squared_error': array([-18.41383767, -18.45203104, -18.30188799, -18.43790327,
        -18.19713279])}

In [20]:
score_simple_cv_mean = cv_results['test_neg_root_mean_squared_error'].mean()

In [21]:
score_simple_cv_std = cv_results['test_neg_root_mean_squared_error'].std()

🧪 **Run the following cell to save your results**

In [24]:
from nbresult import ChallengeResult

ChallengeResult(
    "simple_model",
    scoring=scoring,
    shape_train = X_train_simple.shape,
    score_simple_holdout=score_simple_holdout,
    score_simple_cv_mean=score_simple_cv_mean,
    score_simple_cv_std=score_simple_cv_std,
).write()

## Feature engineering

(From now on, we will stop using train/test split but cross-validation on the whole dataset instead)  

Let's try to improve performance using the feature `release_date`

**📝 Create `X_engineered` by adding a new column `year` to `X`, containing the release year of the track as `integer`**

In [25]:
X = data.drop(columns='popularity')

In [26]:
X_engineered = X.copy()

In [27]:
X_engineered.reset_index(inplace=True)

In [28]:
X_engineered.drop(columns='index', inplace=True)

In [29]:
X_engineered.release_date = pd.to_datetime(X_engineered.release_date, format="%Y/%m/%d")

In [30]:
X_engineered['release_date'][0].year

1971

In [32]:
X_engineered['release_year'] = X_engineered['release_date']

In [33]:
for i in range(len(X_engineered)):
    
    X_engineered['release_year'][i] = X_engineered['release_year'][i].year

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_engineered['release_year'][i] = X_engineered['release_year'][i].year


In [34]:
X_engineered['year'] = X_engineered['release_year']

In [35]:
X_engineered.drop(columns="release_year", inplace=True)

📝 **Let's see how this impact the performance of our model.**
- Retrain the same simple linear model on numerical values only, adding the new feature `year`
- Save the mean cross-validated performance metric in a variable named `score_engineered` as a `float`
- Do not fine tune the model yet

In [36]:
X_simple_2 = X_engineered.drop(columns=['id', 'name', 'artist', 'release_date'])

In [37]:
model = LinearRegression()

In [38]:
model.fit(X_simple_2, y)

LinearRegression()

In [39]:
cv_results = cross_validate(model, X_simple_2, y, cv=5, scoring=[scoring])

In [40]:
cv_results

{'fit_time': array([0.07592726, 0.07020521, 0.06992221, 0.1041441 , 0.06935501]),
 'score_time': array([0.0164175 , 0.01378155, 0.01358843, 0.01538897, 0.01327038]),
 'test_neg_root_mean_squared_error': array([-17.4412204 , -17.4317139 , -17.2453717 , -17.32187397,
        -17.06965388])}

In [41]:
score_engineered = cv_results['test_neg_root_mean_squared_error'].mean()

🧪 **Run the following cell to save your results**

In [42]:
from nbresult import ChallengeResult

ChallengeResult("feature_engineering",
    cols = X_engineered.columns,
    years = X_engineered.get("year"),
    score_engineered=score_engineered
).write()

## Pipelining

Let's now look for maximum performance by creating a solid preprocessing pipeline.

**📝 Create a sklearn preprocessing [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and store it as `preproc`**

- Feel free to add any preprocessing steps you think of
- You may want to integrate your feature engineering for `year`
- You may also further improve it using the `ArtistPopularityTransformer` class given to you below
- Don't add any model to it yet

🚨 Advice: It is better for you to have a working pipeline (even simple one) rather than NO pipeline at all

In [43]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion

In [71]:
X = data.copy()

In [45]:
def year_only(df, column):
    df[column] = pd.to_datetime(df[column], format="%Y/%m/%d")
    df['year'] = df[column]
    for i in range(len(df)):
        df['year'][i] = df[column][i].year
    df.drop(columns=column, inplace=True)
    return df  

In [72]:
X = year_only(X, 'release_date')

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['year'][i] = df[column][i].year


In [73]:
X = X.drop(columns=['id', 'name', 'popularity', 'artist'])

In [355]:
# 👉 Do not hesitate to reload a clean new dataset if you need a fresh start.
#X = X_simple_2
#y = data[['popularity']] 

In [50]:
from sklearn.preprocessing import MinMaxScaler

In [59]:
# We are giving you below a custom transformer that you may want to use in your pipeline (make sure you understanding it)

from sklearn.base import BaseEstimator, TransformerMixin

class ArtistPopularityTransformer(BaseEstimator, TransformerMixin):
    """
    Compute, as a new feature of the test set, the mean popularity of 
    all songs made by the artist on the train set.
    """

    def __init__(self):
        pass

    def fit(self, X, y=None):
        """
        process artist mean popularity from artists songs popularity
        process song global mean popularity
        """

        # process artist popularity
        self.artist_popularity = y.groupby(X.artist).agg("mean")
        self.artist_popularity.name = "artist_popularity"

        # process mean popularity
        self.mean_popularity = y.mean()

        return self

    def transform(self, X, y=None):
        """
        apply artist mean popularity vs song global mean popularity to songs
        """

        # inject artist popularity
        X_copy = X.merge(self.artist_popularity, how="left", left_on="artist", right_index=True)

        # fills popularity of unknown artists with song global mean popularity
        X_copy.replace(np.nan, self.mean_popularity, inplace=True)

        return X_copy[["artist_popularity"]]

In [329]:
type(X['year'][0])

int

In [365]:
X.head(1)

Unnamed: 0,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence,year
0,0.654,0.499,219827,0.19,0,0.00409,7,0.0898,-16.435,1,0.0454,149.46,0.43,1971


In [75]:
preproc = ColumnTransformer([
     ('num_encoder', MinMaxScaler(), ['acousticness', 'danceability', 'duration_ms', 'energy',\
                                     'explicit', 'instrumentalness', 'key', 'liveness', 'loudness',\
                                     'mode', 'speechiness', 'tempo', 'valence', 'year'])
])


In [64]:
#feature_engineering = ColumnTransformer([
#    ('artist_popularity_transformer', ArtistPopularityTransformer(), [X['artist'], y])
#])

In [65]:
#preproc = Pipeline([
#    ('preproc1', preproc1),
#    ('engineering', feature_engineering)
#])

**📝 Store the number of columns/feature after preprocessing your inputs in a variable `col_number`**

In [52]:
X_preproc = pd.DataFrame(preproc.fit_transform(X))

In [53]:
col_number = X_preproc.shape[1]

🧪 **Run the following cells to save your results**

In [55]:
# Visually print your preproc
from sklearn import set_config; set_config(display='diagram')
preproc

In [56]:
# Save your preproc
from nbresult import ChallengeResult

ChallengeResult(
    "preprocessing",
    col_number=col_number,
    first_observation = preproc.fit_transform(X, y)[0]
).write()

## Training

📝 **Time to fine tune your models**

- Add an **estimator** to your pipeline (only from Scikit-learn) 

- Train your pipeline and **fine-tune** (optimize) your estimator to maximize prediction score

- You must try to fine tune at least 2 different models: 
    - create one pipeline with a **linear model** of your choice
    - create one pipeline with an **ensemble model** of your choice

Then, 

- Save your two best 5-time cross-validated scores as _float_: `score_linear` and `score_ensemble`

- Save your two best trained pipelines as _Pipeline_ objects: `pipe_linear` and `pipe_ensemble`

### Linear

In [100]:
from sklearn.linear_model import Ridge, ElasticNet

In [138]:
linear_pipe = Pipeline([
    ('preprocessing', preproc),
    ('linear_regression', Ridge())
])

In [139]:
linear_pipe.fit(X,y)

In [82]:
cv_results = cross_validate(pipe_linear, X, y, cv=5, scoring=[scoring])

In [83]:
cv_results

{'fit_time': array([0.13554263, 0.1370616 , 0.14158034, 0.17317319, 0.18955541]),
 'score_time': array([0.02642107, 0.02631593, 0.0343008 , 0.04089046, 0.04140639]),
 'test_neg_root_mean_squared_error': array([-17.4411798 , -17.43167301, -17.24536567, -17.32181964,
        -17.06973384])}

In [95]:
model2 = Ridge(max_iter=1000)

In [90]:
grid = {'alpha': [0.01, 0.1, 1],
        'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}

In [91]:
from sklearn.model_selection import GridSearchCV


In [96]:
search = GridSearchCV(model2, grid, 
                      scoring = scoring,
                      cv=5)

In [97]:
search.fit(X,y)



In [98]:
search.best_params_

{'alpha': 1, 'solver': 'auto'}

In [99]:
search.best_score_

-17.30196630645738

In [87]:
model2.get_params()

{'alpha': 1.0,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}

In [101]:
model3 = ElasticNet()

In [103]:
from scipy import stats

In [109]:
grid = {'l1_ratio': [0, 0.25, 0.5, 0.75, 1],
       'alpha': [0.001, 0.01, 0.1, 1]}

In [111]:
from sklearn.model_selection import RandomizedSearchCV

In [114]:
search2 = RandomizedSearchCV(model3, grid, 
                      scoring = scoring,
                      cv=5, n_iter=1000)

In [115]:
search2.fit(X,y)

  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


In [117]:
search2.best_params_

{'l1_ratio': 1, 'alpha': 0.001}

In [149]:
linear_pipe.fit(X,y)

In [150]:
cv_results = cross_validate(linear_pipe, X, y, cv=5, scoring=[scoring])

In [151]:
pipe_linear = cv_results[f'test_{scoring}'].mean()

In [153]:
model = Ridge()

In [154]:
model.fit(X,y)

In [155]:
cv_results = cross_validate(model, X, y, cv=5, scoring=[scoring])

In [145]:
score_linear = cv_results[f'test_{scoring}'].mean()

In [152]:
pipe_linear

-17.30195439188277

In [156]:
score_linear

-17.30196630645738

### Ensemble

In [158]:
from sklearn.ensemble import RandomForestRegressor

In [159]:
forest = RandomForestRegressor(n_estimators=100)

In [160]:
cv_results = cross_validate(forest, X, y, scoring = scoring, cv=10)

In [164]:
score_ensemble = cv_results['test_score'].mean()

In [165]:
ensemble_pipe = Pipeline([
    ('preprocessing', preproc),
    ('linear_regression', RandomForestRegressor(n_estimators=100))
])

In [166]:
cv_results = cross_validate(ensemble_pipe, X, y, scoring = scoring, cv=5)

In [168]:
pipe_ensemble = cv_results['test_score'].mean()

🧪 **Run the following cells to save your results**

In [169]:
# Print below your best pipe for correction purpose
from sklearn import set_config; set_config(display='diagram')
pipe_linear

-17.30195439188277

In [170]:
# Print below your best pipe for correction purpose
pipe_ensemble

-12.832238448162144

In [171]:
from nbresult import ChallengeResult

ChallengeResult("model_tuning",
    scoring = scoring,
    score_linear=score_linear,
    score_ensemble=score_ensemble).write()

## API 

Time to put a pipeline in production!

👉 Go to https://github.com/lewagon/data-certification-api and follow instructions

**This final part is independent from the above notebook**