Scikit-learn comes with a number of useful built-in data transformation functions that allow you to [impute missing values](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), [scale numerical data](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), etc. Eventually, however, you will want to manipulate your data in a way that is not supported by the built-in offerings. Fortunately, it's not too hard to construct your own transformer that can easily be integrated with the greater sklearn workflow.

Scikit-learn is an incredibly powerful data processing and machine learning library straight out of the box. 
The majority of tasks most people will want to do are supported by its built-in methods, which is great. 
However, there will inevitably come a day when you want to do something that is not possible straight out of the box.
Fortunately, sklearn is also incredibly [well designed](https://arxiv.org/abs/1309.0238) and makes creating your own methods and integrating them into your workflow super straightforward.

## How the sausage gets made

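sklearn's API is built around a small set of consistent interfaces: estimators learn from data with `fit`, and transformers additionally modify data with `transform`.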
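As a rough sketch of what a custom transformer needs: a class that defines `fit` and `transform` and inherits from `BaseEstimator` and `TransformerMixin` picks up `get_params`/`set_params` and `fit_transform` without any extra code. The `ConstantMultiplier` class and its `factor` parameter are hypothetical names invented for this illustration, not part of sklearn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ConstantMultiplier(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor=2.0):
        # Store constructor arguments unchanged so get_params/set_params work
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self by convention
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


X = np.array([[1.0, 2.0], [3.0, 4.0]])
multiplier = ConstantMultiplier(factor=2.0)

X_out = multiplier.fit_transform(X)  # provided by TransformerMixin
params = multiplier.get_params()     # provided by BaseEstimator
```

Here `fit_transform` simply chains `fit` and `transform`, and `get_params` reports the constructor arguments, which is what tools like grid search rely on.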

For this exercise, I'm going to be using data about my Spotify Discover Weekly playlists. 
If you don't know about [Discover Weekly](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) (who are you?!), you'll probably want to acquaint yourself. 
I've been collecting data on my weekly playlists since fall of 2019 with vague plans to turn it into some sort of project. 
The [spotipy](https://spotipy.readthedocs.io/en/latest/) library makes it easy to access all of the amazing data that Spotify makes available through its API.


```python
import pandas as pd
import pathlib

# The notebook lives one level below the project root
project_dir = pathlib.Path.cwd().parent
song_features = pd.read_pickle(project_dir / 'data/raw/dw_combined.pkl')

cols = ['song_length_ms', 'tempo', 'tempo_confidence',
        'instrumentalness', 'liveness', 'loudness', 'speechiness',
        'valence', 'acousticness', 'danceability',
        'energy', 'popularity']

song_features = song_features[cols]
```

```python
song_features.head()
```

| | song_length_ms | tempo | tempo_confidence | instrumentalness | liveness | loudness | speechiness | valence | acousticness | danceability | energy | popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 217131 | 151.858 | 0.094 | 0.0205 | 0.323 | -13.417 | 0.0555 | 0.501 | 0.976 | 0.553 | 0.281 | 35 |
| 1 | 255800 | 125.58 | 0.361 | 0.0309 | 0.142 | -12.015 | 0.0306 | 0.218 | 0.938 | 0.421 | 0.374 | 38 |
| 2 | 188000 | 127.38 | 0.231 | 0.0158 | 0.258 | -14.418 | 0.05 | 0.296 | 0.791 | 0.512 | 0.205 | 28 |
| 3 | 448349 | 146.544 | 0.046 | 0.121 | 0.946 | -11.329 | 0.0637 | 0.268 | 0.902 | 0.286 | 0.506 | 36 |
| 4 | 291080 | 80.059 | 0.126 | 0.0062 | 0.0878 | -12.572 | 0.0316 | 0.334 | 0.849 | 0.491 | 0.262 | 37 |


Generally, when we want to make a custom transformer, we should have some reason to believe that a specific combination or transformation of one or more variables will be a good predictor for the target variable. There's no super obvious relationship like that here, so I'm going to make some hypotheses without any evidence in their favor. Don't do this in the wild.

My first hypothesis is that faster songs will be more popular. But I want to adjust for uncertainty in song speed. To do this I can use `tempo_confidence` to weight `tempo`, essentially "slowing down" tracks that have a lower degree of tempo certainty.
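To make the weighting concrete, here is a small sketch that multiplies `tempo` by `tempo_confidence` for the first three tracks in the table above. The standalone arrays are just for illustration; the real computation happens on the full dataset inside the transformer.

```python
import numpy as np

# tempo and tempo_confidence for the first three tracks shown above
tempo = np.array([151.858, 125.58, 127.38])
tempo_confidence = np.array([0.094, 0.361, 0.231])

# Element-wise product: low-confidence tracks get "slowed down" the most
weighted_tempo = tempo * tempo_confidence
# weighted_tempo is approximately [14.27, 45.33, 29.42]
```

The first track has the fastest raw tempo of the three but the lowest confidence, so it ends up with the smallest weighted tempo.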

The second hypothesis is a bit more far-fetched, so I am only going to include it as an optional argument in my transformer. I'm going to say more danceable songs will be more popular, so long as they are not too long (people get tired, right?). So if we divide `song_length_ms` by `danceability`, the resulting feature (`fatigue_factor`) should have an inverse relationship with popularity. We can always see if this brazen assumption holds after we make the transformation.
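One simple way to check the assumption is to look at the correlation between `fatigue_factor` and `popularity`. The sketch below does this on only the five rows shown in the table above, so the resulting number means nothing statistically; it just shows the mechanics. The `sample` DataFrame is a stand-in built for this example.

```python
import pandas as pd

# Five sample rows copied from the table above (a stand-in for the full data)
sample = pd.DataFrame({
    'song_length_ms': [217131, 255800, 188000, 448349, 291080],
    'danceability': [0.553, 0.421, 0.512, 0.286, 0.491],
    'popularity': [35, 38, 28, 36, 37],
})

# Long, hard-to-dance-to songs get a high fatigue factor
sample['fatigue_factor'] = sample['song_length_ms'] / sample['danceability']

# Pearson correlation; a negative value would be consistent with the hypothesis
corr = sample['fatigue_factor'].corr(sample['popularity'])
```

On the full dataset, a clearly negative `corr` would support the hypothesis, while a value near zero or above would suggest the feature is not worth keeping.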

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Look up the column positions by name: safer than hard-coding the indices
tempo_ix, tempo_conf_ix, dance_ix, length_ix = [
    list(song_features.columns).index(col)
    for col in ('tempo', 'tempo_confidence', 'danceability', 'song_length_ms')]

class CustomFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_fatigue_factor=True):  # no *args or **kwargs
        self.add_fatigue_factor = add_fatigue_factor
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        # X is a NumPy array; the new features are appended as extra columns
        weighted_tempo = X[:, tempo_ix] * X[:, tempo_conf_ix]
        if self.add_fatigue_factor:
            fatigue_factor = X[:, length_ix] / X[:, dance_ix]
            return np.c_[X, weighted_tempo, fatigue_factor]
        else:
            return np.c_[X, weighted_tempo]

features_adder = CustomFeaturesAdder(add_fatigue_factor=True)
music_plus = features_adder.transform(song_features.values)
```

```python
pd.DataFrame(music_plus).head()
```

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 217131.0 | 151.858 | 0.094 | 0.0205 | 0.323 | -13.417 | 0.0555 | 0.501 | 0.976 | 0.553 | 0.281 | 35.0 | 14.274652 | 392642.0 |
| 1 | 255800.0 | 125.58 | 0.361 | 0.0309 | 0.142 | -12.015 | 0.0306 | 0.218 | 0.938 | 0.421 | 0.374 | 38.0 | 45.33438 | 607601.0 |
| 2 | 188000.0 | 127.38 | 0.231 | 0.0158 | 0.258 | -14.418 | 0.05 | 0.296 | 0.791 | 0.512 | 0.205 | 28.0 | 29.42478 | 367187.5 |
| 3 | 448349.0 | 146.544 | 0.046 | 0.121 | 0.946 | -11.329 | 0.0637 | 0.268 | 0.902 | 0.286 | 0.506 | 36.0 | 6.741024 | 1567654.0 |
| 4 | 291080.0 | 80.059 | 0.126 | 0.0062 | 0.0878 | -12.572 | 0.0316 | 0.334 | 0.849 | 0.491 | 0.262 | 37.0 | 10.087434 | 592831.0 |
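Since the point of writing a transformer is to plug it into the rest of the sklearn workflow, here is a hedged sketch of how `CustomFeaturesAdder` might sit inside a `Pipeline` ahead of a `StandardScaler`. To keep the block self-contained it redefines the class and uses a four-column toy array built from the rows above, so the column indices and the step names (`features_adder`, `scaler`) are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Column order of the toy array below (an assumption for this sketch)
length_ix, tempo_ix, tempo_conf_ix, dance_ix = 0, 1, 2, 3


class CustomFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_fatigue_factor=True):
        self.add_fatigue_factor = add_fatigue_factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        weighted_tempo = X[:, tempo_ix] * X[:, tempo_conf_ix]
        if self.add_fatigue_factor:
            fatigue_factor = X[:, length_ix] / X[:, dance_ix]
            return np.c_[X, weighted_tempo, fatigue_factor]
        return np.c_[X, weighted_tempo]


# song_length_ms, tempo, tempo_confidence, danceability for five tracks
X = np.array([
    [217131, 151.858, 0.094, 0.553],
    [255800, 125.580, 0.361, 0.421],
    [188000, 127.380, 0.231, 0.512],
    [448349, 146.544, 0.046, 0.286],
    [291080,  80.059, 0.126, 0.491],
])

pipeline = Pipeline([
    ('features_adder', CustomFeaturesAdder()),
    ('scaler', StandardScaler()),
])
X_prepared = pipeline.fit_transform(X)  # 4 original + 2 new columns, scaled

# The optional feature can be switched off like any other hyperparameter
pipeline.set_params(features_adder__add_fatigue_factor=False)
X_no_fatigue = pipeline.fit_transform(X)  # 4 original + 1 new column
```

Because `add_fatigue_factor` is an ordinary constructor argument, it can be toggled with `set_params` or searched over with something like `GridSearchCV`, which is a convenient way to test whether the far-fetched feature actually helps.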
