Scikit-learn comes with a number of useful built-in data transformers that allow you to [impute missing values](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute), [scale numerical data](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing), and more. Eventually, however, you will want to manipulate your data in a way that is not supported by the built-in offerings. Fortunately, it's not too hard to construct your own transformer that integrates easily with the broader sklearn workflow.

Scikit-learn is an incredibly powerful data processing and machine learning library straight out of the box.
Most of the tasks people want to do are supported by its built-in methods, which is great.
However, there will inevitably come a day when you want to do something that the built-in tools don't support.
Fortunately, sklearn is also incredibly [well designed](https://arxiv.org/abs/1309.0238) and makes creating your own methods and integrating them into your workflow straightforward.

## How the sausage gets made

- Include a brief intro to sklearn's API design
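
As a rough sketch of that API design: estimators learn from data in `fit`, store what they learned in attributes ending with an underscore, and transformers then apply it in `transform` (with `fit_transform` as a shortcut for both). The toy array below is made up purely for illustration, and `StandardScaler` is just one convenient built-in to demonstrate the pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()

# fit() learns per-column statistics and stores them on the object
scaler.fit(X)
print(scaler.mean_)   # learned column means

# transform() applies what was learned without changing the fitted state
X_scaled = scaler.transform(X)

# fit_transform() does both steps in one call
X_scaled_again = StandardScaler().fit_transform(X)
```

A custom transformer only has to follow the same contract (`fit` returns `self`, `transform` returns the transformed data) to work with the rest of the library.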

For this exercise, I'm going to be using data about my Spotify Discover Weekly playlists. 
If you don't know about [Discover Weekly](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe) (who are you?!), you'll probably want to acquaint yourself. 
I've been collecting data on my weekly playlists since the fall of 2019 with vague plans to turn it into some sort of project.
The [spotipy](https://spotipy.readthedocs.io/en/latest/) library makes it easy to access all of the amazing data that Spotify makes available through its API.


In [2]:
import pathlib

import pandas as pd

# Assumes the notebook is run from a subdirectory of the project root
project_dir = pathlib.Path.cwd().parent
df = pd.read_pickle(project_dir / 'data/raw/dw_combined.pkl')

# Audio features and metadata to keep for the custom transformer
cols = ['song_length_ms', 'key', 'mode', 'time_signature',
        'instrumentalness', 'liveness', 'loudness', 'speechiness',
        'valence', 'acousticness', 'tempo', 'danceability',
        'energy', 'popularity']

df = df[cols]

The above code loads the Discover Weekly data and selects the columns that we will use to build our custom transformer.

Let's say we think that it

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# NOTE: `housing` is not defined in this post; this cell assumes it is a
# pandas DataFrame that contains the four columns named below.

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn, so just return the transformer

    def transform(self, X, y=None):
        # X is expected to be a 2-D NumPy array, not a DataFrame
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# .values converts the DataFrame to the NumPy array that transform() expects
housing_extra_attribs = attr_adder.transform(housing.values)
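
To show how the same pattern might carry over to the Spotify columns and into a `Pipeline`, here is a minimal sketch. The `SongLengthToMinutes` class, the `song_length_min` column, and the toy DataFrame are all hypothetical, invented for illustration; only the `song_length_ms` and `tempo` column names come from the data loaded above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class SongLengthToMinutes(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: converts a millisecond column to minutes."""

    def __init__(self, length_col="song_length_ms", drop_original=True):
        # Store hyperparameters unchanged so get_params()/set_params() work
        self.length_col = length_col
        self.drop_original = drop_original

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["song_length_min"] = X[self.length_col] / 60_000
        if self.drop_original:
            X = X.drop(columns=[self.length_col])
        return X


# Toy stand-in for the real Discover Weekly data (values are made up)
toy = pd.DataFrame({
    "song_length_ms": [180_000, 240_000, 210_000],
    "tempo": [120.0, 95.5, 140.2],
})

# The transformer on its own; fit_transform comes from TransformerMixin
minutes_only = SongLengthToMinutes().fit_transform(toy)

# The same transformer chained with a built-in step
pipe = Pipeline([
    ("minutes", SongLengthToMinutes()),
    ("scale", StandardScaler()),
])
result = pipe.fit_transform(toy)
```

Working on a DataFrame by column name, rather than on a NumPy array by position, is a design choice made here for readability; it means this transformer has to come before any step that returns a plain array.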