# TPOT Explained with Movies

In this example, we will walk through a simplified machine learning problem to explore the TPOT tool. 
For more information on TPOT check out their [documentation](https://epistasislab.github.io/tpot/). 
For more information about this example please refer to my [medium article]().
### What is TPOT?
TPOT is an AutoML tool, meaning it is designed to automate the creation of a machine learning pipeline. TPOT is built on top of [scikit-learn](https://scikit-learn.org/stable/), a popular Python library for machine learning. Therefore, the syntax it uses and the pipelines it generates should be easy to follow for those with a basic understanding of scikit-learn.

Instead of requiring you to manually run multiple experiments involving scikit-learn grid searches, TPOT automatically generates many pipelines and compares them. This can reduce the manual time needed to design and evaluate these experiments.

### Our Scenario
In this simple scenario, we are building a recommendation system for movies based on a streaming service. I used a Kafka service that streamed data about movie files watched by users and movie ratings they submitted. The original data had ~1 million users and ~27 thousand movies. I streamed this data, parsed it, and saved it in a database. 

In this scenario, we are going to use data regarding the user and movie to predict how the user would rate the movie on a scale of 1-5. This can be used in a recommendation service to sort the highest predicted ratings and recommend a movie.
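
To make the recommendation step concrete, here is a minimal sketch of sorting predicted ratings and picking the top movies. The `recommend` helper and the predicted scores are hypothetical, made up for illustration; only the `movie_id` format comes from the dataset.

```python
import heapq

# Hypothetical predicted ratings for one user. The ids mimic the
# `movie_id` format used in this dataset, but the scores are made up.
predicted_ratings = {
    "jurassic+park+1993": 4.3,
    "cold+turkey+1971": 3.1,
    "braveheart+1995": 4.7,
    "flodder+1986": 2.8,
}

def recommend(predictions, k=2):
    """Return the k movie ids with the highest predicted rating."""
    return heapq.nlargest(k, predictions, key=predictions.get)

top_movies = recommend(predicted_ratings, k=2)
print(top_movies)
```

Because the predictions are continuous numbers rather than the five discrete ratings, ties at the top of the list are much less likely.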

## Let's Get Started!
First, we are going to start by loading the data.

## Loading the Data
The CSV files stored in the repo are a snapshot of our database tables at one point in time.

In [1]:
import pandas as pd

movies_raw = pd.read_csv('../movie-data.csv')
users_raw = pd.read_csv('../users-data.csv')
ratings_raw = pd.read_csv('../ratings-data.csv')

users_raw.set_index('user_id', inplace=True)
movies_raw.set_index('movie_id', inplace=True)

### Movie Data
Below we can see the movie data returned. One thing to note is this data comes from external sources. The data is based on TMDB or IMDB. Therefore, the vote_average is not the ratings from the stream. We can see that several columns are not going to be useful and some will need to be transformed even before using TPOT.

In [2]:
movies_raw.head()

movie_id,budget,genres,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,title,tmdb_id,vote_average,vote_count
if+you+could+only+cook+1935,0.0,"['Comedy', 'Romance']",tt0026519,en,If You Could Only Cook,An auto engineer (Herbert Marshall) and a prof...,0.430813,['Columbia Pictures'],['United States of America'],1935-12-30,0.0,72,['English'],If You Could Only Cook,126083,6.8,4
the+tattooed+widow+1998,0.0,[],tt0120299,sv,Den tatuerade änkan,Ester is the perfect grandmother everyone expe...,0.012333,[],[],1998-03-10,0.0,0,[],The Tattooed Widow,80059,0.0,0
pride+and+prejudice+1980,0.0,['Drama'],tt0078672,en,Pride and Prejudice,Mrs. Bennet is determined to find husbands for...,0.073979,"['Australian Broadcasting Corporation', 'Briti...","['United Kingdom', 'Australia']",1980-01-13,0.0,265,['English'],Pride and Prejudice,77172,6.8,14
flodder+1986,0.0,['Comedy'],tt0091060,fr,Flodder,A low-class a-social family ends up in a rich ...,6.70284,['First Floor Features'],['Netherlands'],1986-12-17,0.0,111,['Nederlands'],Flodder,10570,6.6,32
cold+turkey+1971,0.0,['Comedy'],tt0066927,en,Cold Turkey,Reverend Brooks leads the town in a contest to...,0.60908,"['Tandem Productions', 'DFI']",['United States of America'],1971-02-19,11000000.0,99,['English'],Cold Turkey,42493,7.5,4


### User Data

In [3]:
users_raw.head()

user_id,age,gender,occupation
800720,42,M,scientist
577694,31,M,college/grad student
207937,28,F,college/grad student
584780,33,M,executive/managerial
149240,34,F,sales/marketing


### Rating Data
This table contains the labels for our data. It tells us the rating a user gave a specific movie.

In [4]:
ratings_raw.head()

Unnamed: 0,movie_id,rating,user_id
0,jackass+the+movie+2002,4,615847
1,mission+impossible+iii+2006,4,284854
2,good+will+hunting+1997,5,792329
3,happy+gilmore+1996,3,709941
4,moulin+rouge+2001,3,531076


## Data Cleaning
Before we pass the data into TPOT, we should do some basic cleaning. TPOT currently works only with numerical data, although there is ongoing work to add automatic [data cleaning](https://github.com/rhiever/datacleaner/issues/1). Therefore, we need to transform some of our data into a format that TPOT will understand. The best way to do this is with scikit-learn [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) and [column transformers](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html). This makes the transformations a repeatable process, which is important because we will need to apply the same transformations when making a prediction in our hypothetical production system.



### Categorical Features
Some of our features are categorical, such as `genres`. I decided to turn the categorical features into binary features. For example, instead of `genres`, I would have `Action` with a value of `1` if the movie was an action movie and `0` if it was not an action movie.
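
As a rough, plain-Python sketch of that idea (not the scikit-learn implementation used later in this notebook), the snippet below turns a few genre lists into 0/1 columns. The variable names are made up for illustration.

```python
# Toy genre lists; the real values come from the `genres` column.
movies = {
    "if+you+could+only+cook+1935": ["Comedy", "Romance"],
    "pride+and+prejudice+1980": ["Drama"],
    "the+tattooed+widow+1998": [],
}

# The sorted set of every label seen becomes the new column names.
labels = sorted({genre for genres in movies.values() for genre in genres})

# One row of 0/1 flags per movie, in the same order as `labels`.
binary_rows = {
    movie_id: [1 if label in genres else 0 for label in labels]
    for movie_id, genres in movies.items()
}

print(labels)
for movie_id, row in binary_rows.items():
    print(movie_id, row)
```

A movie with no genres simply gets a row of zeros, which is also how empty arrays end up being represented after binarization.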

To do this, I created a pipeline to apply scikit-learn's [MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html?highlight=multilabel%20binarizer#sklearn.preprocessing.MultiLabelBinarizer). First I needed to turn the cells of the columns into arrays instead of strings that resembled arrays. 

In [5]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer

class MultiLabelStringToArray(BaseEstimator, TransformerMixin):
    """
    This shapes the data to be passed into the MultiLabelBinarizer. It takes
    columns that are array-like strings and turns them into arrays.
    """
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        for column_name in df.columns:
            df[column_name] = self._transform_column_to_array(df[column_name])
        return df

    def _transform_column_to_array(self, pd_column):
        # replace null cells with an empty array string, then parse each cell
        return pd_column.fillna('[]').apply(self._parse_arraystr)

    def _parse_arraystr(self, array_str):
        """
        Applies a number of rules to turn an array-like string into an array
          - remove brackets
          - remove quotes
          - remove all spaces (including spaces inside a label)
          - split on commas
          - remove empty string entries in the array
        """
        str_without_brackets = array_str.replace("[","").replace("]","")
        str_without_quotes = str_without_brackets.replace("'","")
        str_without_spaces = str_without_quotes.replace(" ","")
        # drop empty entries left behind by empty arrays or blank labels
        return np.array([item for item in str_without_spaces.split(',') if item])
    
class MultiLabelBinarizerTransformer(BaseEstimator, TransformerMixin):
    """
    This transformer creates a MultiLabelBinarizer for every column passed in.
    """
    def __init__(self):
        self.mlbs = {}
    def fit(self, X, y=None):
        """Fit the MultiLabelBinarizer to the data passed in"""
        df = X.copy()
        for column_name in df.columns:
            mlb = MultiLabelBinarizer()
            mlb.fit(df[column_name])
            # Uncomment the following line if you want to print out the values
            # that the MultiLabelbinarizer discovered.
            #print('Column: {NAME} Values: {VALUES}'.format(NAME=column_name, VALUES=mlb.classes_))
            self.mlbs[column_name] = mlb
        return self
    def transform(self, X, y=None):
        """
        Returns a dataframe with the binarized columns. When applied in a
        ColumnTransformer this will effectively remove the original column and 
        replace it with the binary columns
        """
        df = X.copy()
        binarized_cols = pd.DataFrame()
        for column_name in df.columns:
            mlb = self.mlbs.get(column_name)
            new_cols = pd.DataFrame(mlb.transform(df[column_name]),columns=mlb.classes_)
            binarized_cols = pd.concat([binarized_cols, new_cols], axis=1)
        return binarized_cols
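
The string parsing above strips every space, so a label such as `United States of America` loses its spaces. If that matters, one alternative is to parse the cell with the standard library's `ast.literal_eval`. This is a sketch under my own assumptions, not the approach used in this notebook; `parse_array_string` is a hypothetical helper, and falling back to a single-label list for plain values such as `M` is a design choice made here.

```python
import ast

def parse_array_string(value):
    """Parse a string such as "['Comedy', 'Romance']" into a list of labels.

    Spaces inside a label are preserved. Missing cells become an empty
    list, plain values such as 'M' become a one-item list, and empty
    labels are dropped.
    """
    if not isinstance(value, str):
        return []  # NaN or None cells
    text = value.strip()
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        parsed = text  # not a Python literal, treat it as one plain label
    if not isinstance(parsed, (list, tuple)):
        parsed = [parsed]
    labels = [str(label).strip() for label in parsed]
    return [label for label in labels if label]

print(parse_array_string("['United States of America']"))
print(parse_array_string("['', 'English']"))
print(parse_array_string("M"))
```

Note that cells whose text was truncated or otherwise malformed would be kept as one odd-looking label by this fallback, so it is worth inspecting the discovered classes before relying on it.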

### Date Features
We have a `release_date` feature; however, it is currently stored as a string, so we want to extract meaningful data from it. Since I do not believe that the day has much impact on how a user would rate something, I am going to leave it out. I think the year could have some impact because users could be more excited by new movies, or our streaming service might only contain very popular old movies. I also think the month could be helpful: the model could discover that users are more likely to rate a movie highly if it is released during "Oscar season".

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class ExtractReleaseDateFeatures(BaseEstimator, TransformerMixin):
    """
    This transformer takes a column with a date string formatted as 
    'YYYY-mm-dd', extracts the year and month, and returns a DataFrame with
    those columns.
    """
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        """
        Returns a dataframe with the year and month as integer fields. When 
        applied in a ColumnTransformer this will effectively remove the
        original column and replace it with the new columns.
        """
        df = X.copy()

        # fill null values with a placeholder that won't show up in valid data
        df = df.fillna('0000-00-00') 

        df['year'] = df.iloc[:,0].apply(lambda x: str(x)[:4])
        df['month'] = df.iloc[:,0].apply(lambda x: str(x)[5:7])
        df = df.astype({'year':'int64', 'month':'int64'})

        return df.loc[:,['year','month']]
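
For comparison, the same year and month extraction can be sketched with pandas' own date parsing. This is an alternative I am assuming would fit, not what the transformer above does: `errors="coerce"` turns missing or malformed values into `NaT`, which are then mapped to 0 to mirror the `'0000-00-00'` placeholder.

```python
import pandas as pd

# Toy column with one valid date, one missing value, and one malformed value.
release_dates = pd.Series(["1935-12-30", None, "not-a-date"])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(release_dates, format="%Y-%m-%d", errors="coerce")

date_features = pd.DataFrame({
    "year": parsed.dt.year.fillna(0).astype("int64"),
    "month": parsed.dt.month.fillna(0).astype("int64"),
})
print(date_features)
```

Unlike slicing the string, this version does not fail on a malformed date, at the cost of silently mapping it to year 0 and month 0.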

### Column Transformation
Now let's combine all those transformations to create our pipeline. First, we create a pipeline to sequentially execute the steps for our categorical columns. Next, we define a `ColumnTransformer` which will apply the categorical transformations, date transformations, and will pass our other feature columns through into the final data. All other columns not specified here will be dropped.

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Pipeline to create binary columns
multilabel_binarizer_pipeline = Pipeline([
    ('multilabel_str_to_array',MultiLabelStringToArray()),
    ('binarizer', MultiLabelBinarizerTransformer()),
],verbose=True)

MULTILABEL_BINARIZER_COLUMNS = ['genres','production_countries', 'spoken_languages', 'gender', 'occupation']
RELEASE_DATE_COLUMNS = ['release_date']
PASSTHROUGH_COLUMNS = ['age','budget','popularity','revenue', 'runtime', 'vote_average', 'vote_count']

full_data_clean_pipeline = ColumnTransformer([
    ('multilabel_binarizer', multilabel_binarizer_pipeline, MULTILABEL_BINARIZER_COLUMNS),
    ('release_date', ExtractReleaseDateFeatures(), RELEASE_DATE_COLUMNS),
    ('passthrough_columns','passthrough', PASSTHROUGH_COLUMNS)
],remainder='drop',verbose=True)
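
To show why a fitted `ColumnTransformer` makes the cleaning repeatable, here is a small self-contained sketch on toy data. It uses scikit-learn's `OneHotEncoder` as a stand-in for the custom transformers above, and the DataFrame contents are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "age": [42, 31, 28],
    "gender": ["M", "M", "F"],
    "title": ["a", "b", "c"],  # not listed below, so it is dropped
})

toy_transformer = ColumnTransformer([
    ("gender", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("passthrough_columns", "passthrough", ["age"]),
], remainder="drop", sparse_threshold=0)  # sparse_threshold=0 forces a dense array

# Learn the categories from the training rows and transform them.
train_features = toy_transformer.fit_transform(train)

# At prediction time, reuse the fitted transformer with transform() only.
new_row = pd.DataFrame({"age": [33], "gender": ["F"], "title": ["d"]})
new_features = toy_transformer.transform(new_row)

print(train_features.shape)
print(new_features)
```

The new row comes out with the same column layout as the training data, which is what a production system would need before calling `predict`.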

## Training
With our pipeline set up, we are ready to try it out on some data. First, I combine all the raw data loaded from the repo's CSV files into a single DataFrame. Next, we sort the data by `user_id` because we want to train and test our model on different users, to see whether what the model learned about one user's preferences applies to other users. Finally, I've decided to drop rows that contain nulls in columns that we are not applying transformations to. I chose to do this because it affected only ~650 rows. If more rows had null values, I might consider a different approach because dropping them could mean losing too much data. Another possibility is that there could be hidden meaning in the null values, such as null values being a proxy for old movies where that data could be harder to get. Either way, 650 rows is not even 1/100th of our data set, so I'm not going to lose sleep over it.

In [8]:
records = ratings_raw.join(users_raw, on='user_id', how='left')
records = records.join(movies_raw, on='movie_id', how='left')

records = records.sort_values(by=['user_id'])
records = records.dropna(subset=['budget','popularity','revenue','runtime','vote_average','vote_count'])
records.head()

Unnamed: 0,movie_id,rating,user_id,age,gender,occupation,budget,genres,imdb_id,original_language,...,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,title,tmdb_id,vote_average,vote_count
18981,legends+of+the+fall+1994,5,8,30,M,college/grad student,30000000.0,"['Adventure', 'Drama', 'Romance', 'War']",tt0110322,en,...,"['Bedford Falls Productions', 'TriStar Picture...",['United States of America'],1994-12-16,160639000.0,133.0,"['', 'English']",Legends of the Fall,4476.0,7.2,636.0
18869,miracle+on+34th+street+1994,4,8,30,M,college/grad student,0.0,"['Fantasy', 'Drama', 'Family']",tt0110527,en,...,['Twentieth Century Fox Film Corporation'],['United States of America'],1994-11-18,46264400.0,114.0,['English'],Miracle on 34th Street,10510.0,6.4,199.0
20104,four+weddings+and+a+funeral+1994,4,8,30,M,college/grad student,6000000.0,"['Comedy', 'Drama', 'Romance']",tt0109831,en,...,"['Channel Four Films', 'PolyGram Filmed Entert...",['United Kingdom'],1994-03-09,254701000.0,117.0,['English'],Four Weddings and a Funeral,712.0,6.6,654.0
20117,jurassic+park+1993,4,8,30,M,college/grad student,63000000.0,"['Adventure', 'Science Fiction']",tt0107290,en,...,"['Universal Pictures', 'Amblin Entertainment']",['United States of America'],1993-06-11,920100000.0,127.0,"['English', 'Español']",Jurassic Park,329.0,7.6,4956.0
17547,braveheart+1995,5,8,30,M,college/grad student,72000000.0,"['Action', 'Drama', 'History', 'War']",tt0112573,en,...,"['Icon Entertainment International', 'The Ladd...",['United States of America'],1995-05-24,210000000.0,177.0,"['English', 'Français', 'Latin', '']",Braveheart,197.0,7.7,3404.0


Before we start playing around with TPOT, we need to grab some train and test data. In this scenario, I'm going to put all my faith in TPOT to come up with the best model, so I don't create a separate validation dataset. Let's start with a small amount of data just to see TPOT in action. First, we fit and transform our dataset with the data cleaning pipeline we built; then I select 10,000 records each for training and testing. We have a lot more data, but the more data there is, the longer TPOT takes, so let's just start with 10k.

In [12]:
X_all = full_data_clean_pipeline.fit_transform(records)
y_all = records['rating']

X_train = X_all[:10000]
y_train = y_all[:10000]
X_test = X_all[10000:20000]
y_test = y_all[10000:20000]

[Pipeline]  (step 1 of 2) Processing multilabel_str_to_array, total=   9.9s
[Pipeline] ......... (step 2 of 2) Processing binarizer, total=  20.5s
[ColumnTransformer]  (1 of 3) Processing multilabel_binarizer, total=  31.8s
[ColumnTransformer] .. (2 of 3) Processing release_date, total=   0.7s
[ColumnTransformer]  (3 of 3) Processing passthrough_columns, total=   0.0s
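
Slicing rows that are sorted by `user_id` keeps most users on one side of the split, but the user at the slice boundary can land in both sets. If a strict separation is needed, one option (an assumption on my part, not what this notebook does) is scikit-learn's `GroupShuffleSplit`, sketched here on toy data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 ratings from 4 users (two ratings each).
user_ids = np.array([8, 8, 15, 15, 23, 23, 42, 42])
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([5, 4, 3, 4, 2, 5, 4, 1])

# Every rating from a given user goes entirely to train or entirely to test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=user_ids))

train_users = set(user_ids[train_idx])
test_users = set(user_ids[test_idx])
print(sorted(train_users), sorted(test_users))
```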


Now the time you've all been waiting for... afternoon tea (or whatever time of day you happen to be reading this). TPOT supports both regression and classification problems. I decided that this would be better as a regression problem because too many movies would tie for the top spot otherwise. 
Let's review some of the configuration options I chose:

* `generations` - This is the number of iterations of pipeline generation that TPOT will run for. Alternatively, you could specify `max_time_mins` to stop TPOT after a certain amount of time.

* `population_size` - This is the number of pipelines trained during each generation.

* `verbosity` - This gives us some feedback to let us know that TPOT is boiling away. Training can take a long time, so I find this reassuring to make sure nothing is frozen.

* `random_state` - This ensures that if we run this a second time we start with the same seed.

* `template` - This describes how I want my pipeline to look. Since I have done little feature engineering, I want to start with a Selector to find the best features, then transform those features, and finally use a regressor. If I did not specify a template, TPOT would pick whatever combination worked best. In my trials, the shape of the pipeline would end up as `Regressor-Regressor-Regressor`.

* `n_jobs` - The number of parallel processes to use for evaluation. A value of `-1` uses all available cores.

* `warm_start` - This tells TPOT whether to reuse populations from the last call to fit. This is good if you want to stop and restart the fit process.

* `periodic_checkpoint_folder` - Where to intermittently save pipelines during the training. This can help make sure you get an output even if TPOT suddenly dies or you decide to stop the training early.

For a full list of TPOT's configuration options, check out their [documentation](https://epistasislab.github.io/tpot/api/).

With `generations=100` and `population_size=100` (TPOT's defaults), TPOT evaluates 10,100 pipelines and compares them using 5-fold cross-validation (another configuration option; I used the default) and a negative mean squared error scoring function. Not all 10,100 pipelines are unique: TPOT skips any repeat pipelines it generates, so that run produces about 2,500 unique pipelines.

**Warning:** Training with those settings takes about 6 hours. The cell below uses the shortened settings, `generations=10` and `population_size=10`, which generate only 110 pipelines and should take around 10 minutes to train.
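
The pipeline counts quoted above follow from how TPOT's search works: it evaluates the initial population plus one batch of offspring per generation. A small sketch of that arithmetic, where `pipelines_evaluated` is a hypothetical helper and not part of TPOT:

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Number of pipelines TPOT evaluates: the initial population plus
    one batch of offspring per generation."""
    if offspring_size is None:
        # By default TPOT produces as many offspring as the population size.
        offspring_size = population_size
    return population_size + generations * offspring_size

print(pipelines_evaluated(generations=10, population_size=10))
print(pipelines_evaluated(generations=100, population_size=100))
```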

In [13]:
from tpot import TPOTRegressor

pipeline_optimizer = TPOTRegressor(generations=10, population_size=10, verbosity=2, random_state=42,
                                   template='Selector-Transformer-Regressor', n_jobs=-1,
                                   warm_start=True, periodic_checkpoint_folder='../tpot-intermediate-save/')
pipeline_optimizer.fit(X_train, y_train)



Generation 1 - Current best internal CV score: -0.9547903226888407
Generation 2 - Current best internal CV score: -0.9547903226888407
Generation 3 - Current best internal CV score: -0.952147120834435
Generation 4 - Current best internal CV score: -0.94918041475317
Generation 5 - Current best internal CV score: -0.94918041475317
Generation 6 - Current best internal CV score: -0.94918041475317
Generation 7 - Current best internal CV score: -0.9491529361806773
Generation 8 - Current best internal CV score: -0.9491441900056168
Generation 9 - Current best internal CV score: -0.9490030567432297
Generation 10 - Current best internal CV score: -0.9490030567432297
Best pipeline: LassoLarsCV(MaxAbsScaler(VarianceThreshold(input_matrix, threshold=0.0005)), normalize=False)


TPOTRegressor(generations=10,
              log_file=<ipykernel.iostream.OutStream object at 0x103138f70>,
              n_jobs=-1,
              periodic_checkpoint_folder='../tpot-intermediate-save/',
              population_size=10, random_state=42,
              template='Selector-Transformer-Regressor', verbosity=2,
              warm_start=True)
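
The "Best pipeline" line reads inside out: the innermost call runs first. As a rough sketch, the same Selector-Transformer-Regressor pipeline can be written by hand with scikit-learn. The synthetic data is a stand-in so the example runs on its own, and `normalize=False` is left out on the assumption that a recent scikit-learn release, where that argument no longer exists, is installed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LassoLarsCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Innermost call runs first: selector -> transformer -> regressor.
best_pipeline = make_pipeline(
    VarianceThreshold(threshold=0.0005),  # drop near-constant features
    MaxAbsScaler(),                       # scale each feature to [-1, 1]
    LassoLarsCV(),                        # linear model with built-in CV
)

# Synthetic stand-in data so the sketch runs without the movie dataset.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

best_pipeline.fit(X_demo, y_demo)
predictions = best_pipeline.predict(X_demo[:3])
print([name for name, _ in best_pipeline.steps])
```

TPOT can also write out equivalent code itself through the optimizer's `export` method, which is usually the more convenient route.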

## Evaluation
Just like scikit-learn, TPOT comes with a built-in evaluation mechanism. We can use the test data to evaluate our pipeline with the same scoring function that we used in training (the default, negative mean squared error). The test score is similar to the cross-validation scores seen during training. A mean squared error of about 1 means our model's predictions are typically off by almost a whole rating point. This is likely inadequate for our scenario; however, we would need to examine what sort of errors the model is making. For example, if the model estimates one point too low every time, the ranking is unaffected and we would still recommend the correct movie. However, if the direction of the error varies, some unfavorable movies would be recommended (personally, I find the difference between a 3 and a 4 on a 5-point scale enormous).

In [14]:
pipeline_optimizer.score(X_test, y_test)

-0.9941578822208572
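
To put that score in rating points, and to illustrate why the direction of the error matters, here is a small sketch. The `mean_signed_error` helper and the toy ratings are made up for illustration; only the score itself comes from the output above.

```python
import math

neg_mse = -0.9941578822208572  # test score reported above
rmse = math.sqrt(-neg_mse)     # typical error size in rating points

def mean_signed_error(y_true, y_pred):
    """Average of (prediction - truth); near 0 means no consistent bias."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up ratings: both prediction sets have a squared error of 1 per row.
truth = [5, 4, 3, 2]
always_one_low = [4, 3, 2, 1]   # ranking preserved
mixed_direction = [4, 5, 2, 3]  # ranking scrambled

print(round(rmse, 3))
print(mean_signed_error(truth, always_one_low))
print(mean_signed_error(truth, mixed_direction))
```

Both toy prediction sets score the same mean squared error, yet only the first keeps the movies in the right order, so a metric like this is worth checking alongside the TPOT score.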

## More Data is Better (maybe)
Since we have so much data, and machine learning models are almost always better when trained on more data, let's use everything we've got. I'm going to split the data into 500k rows for training and the remaining ~240k for testing.

In [15]:
X_train_large = X_all[:500000]
y_train_large = y_all[:500000]
X_test_large = X_all[500000:]
y_test_large = y_all[500000:]

More is not always better. TPOT is already a time-consuming process because so many pipelines are generated and evaluated using k-fold cross-validation. The larger the dataset the models are trained on, the longer this process takes. For that reason, I added one parameter to the configuration below: `config_dict="TPOT light"` restricts TPOT's search to simpler, fast-running operators, which keeps each pipeline evaluation manageable on a large dataset.

**Warning**: This training takes somewhere between 12 and 20 hours to complete. Google Colab's runtime may time out before training finishes. You may need to clone or fork my repo and use the [Jupyter Lab notebook](https://github.com/bialesdaniel/se4ai-i5-tpot/blob/master/notebooks/TPOT-Movies-Explained.ipynb) there to run this.

In [16]:
pipeline_optimizer_large = TPOTRegressor(generations=10, population_size=10, verbosity=2, random_state=42,
                                         template='Selector-Transformer-Regressor', config_dict="TPOT light", n_jobs=-1,
                                         warm_start=True, periodic_checkpoint_folder='../tpot-intermediate-save-large/')
pipeline_optimizer_large.fit(X_train_large, y_train_large)



Generation 1 - Current best internal CV score: -0.9833348958387484
Generation 2 - Current best internal CV score: -0.9833348958387484
Generation 3 - Current best internal CV score: -0.9833348958387484
Generation 4 - Current best internal CV score: -0.9833348958387484
Generation 5 - Current best internal CV score: -0.9833348958387484
Generation 6 - Current best internal CV score: -0.9833348958387484
Generation 7 - Current best internal CV score: -0.9833348958387484
Generation 8 - Current best internal CV score: -0.9833348958387484
Generation 9 - Current best internal CV score: -0.9833348958387484
Generation 10 - Current best internal CV score: -0.9833323195498851
Best pipeline: ElasticNetCV(StandardScaler(VarianceThreshold(input_matrix, threshold=0.0001)), l1_ratio=0.8, tol=0.01)


TPOTRegressor(config_dict='TPOT light', generations=10,
              log_file=<ipykernel.iostream.OutStream object at 0x103138f70>,
              n_jobs=-1,
              periodic_checkpoint_folder='../tpot-intermediate-save-large/',
              population_size=10, random_state=42,
              template='Selector-Transformer-Regressor', verbosity=2,
              warm_start=True)

In [17]:
pipeline_optimizer_large.score(X_test_large, y_test_large)

-0.9893467038216204

## Conclusions
As we have seen, TPOT is quite easy to use. The AutoML process can save valuable time and effort in feature engineering and hyperparameter tuning. On the other hand, TPOT is slow: it can take a long time to find a good pipeline.

Sorry, these conclusions are rather shallow because this notebook is mostly focused on how to set up and use TPOT for our movie recommendation system. For more in-depth analysis of the tool please continue reading the [Medium article]() from which you came.