# TPOT Explained with Movies

In this example we will walk through a simplified machine learning problem to explore the TPOT tool. 
For more information on TPOT, check out its [documentation](https://epistasislab.github.io/tpot/). 
For more information about this example, please refer to my [medium article]().
### What is TPOT?
TPOT is an AutoML tool, which means it is designed to automate the creation of a machine learning pipeline. TPOT is built on top of [scikit-learn](https://scikit-learn.org/stable/), a popular Python library for machine learning. Therefore, the syntax it uses and the pipelines it generates should be easy to follow for anyone with a basic understanding of scikit-learn.

Rather than manually running multiple experiments involving scikit-learn grid searches, TPOT automatically generates many candidate pipelines and compares them. This can reduce the manual time needed to design and evaluate these experiments.
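
For contrast, here is a minimal sketch of the manual approach that TPOT is meant to replace. The synthetic data, the `Ridge` model and the `alpha` grid are assumptions chosen for illustration only; they are not part of the movie example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (illustrative only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One hand-picked pipeline structure.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge()),
])

# One hand-picked hyperparameter grid for that structure.
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["model__alpha"]
print(best_alpha, search.best_score_)
```

Each new model or preprocessing step means writing another pipeline and grid by hand. TPOT searches over both the pipeline structure and its hyperparameters.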

### Our Scenario
In this simple scenario we are building a recommendation system for movies based on a streaming service. I used a Kafka service that streamed data about movie files watched by users and movie ratings they submitted. The original data had ~1 million users and ~27 thousand movies. I streamed this data, parsed it, and saved it in a database. 

In this scenario we are going to use data about the user and the movie to predict what the user would rate the movie on a scale of 1-5. A recommendation service can then sort movies by predicted rating and recommend the highest-rated ones.
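
The last step, turning predicted ratings into a recommendation, can be sketched in plain Python. The `recommend` helper and the predicted values below are hypothetical and made up for illustration; they are not part of the original notebook.

```python
def recommend(predicted_ratings, n=3):
    """Return the n movie ids with the highest predicted rating."""
    ranked = sorted(predicted_ratings.items(), key=lambda item: item[1], reverse=True)
    return [movie_id for movie_id, _ in ranked[:n]]


# Hypothetical model output for a single user: movie_id -> predicted rating.
predicted = {
    "good+will+hunting+1997": 4.6,
    "happy+gilmore+1996": 3.1,
    "moulin+rouge+2001": 3.8,
    "cold+turkey+1971": 2.9,
}

top_two = recommend(predicted, n=2)
print(top_two)
```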

## Let's Get Started!
First we are going to start by loading the data.

## Loading the Data
The CSV files stored in the repo are a snapshot of our database tables at one point in time.

In [1]:
import pandas as pd

movies_raw = pd.read_csv('../movie-data.csv')
users_raw = pd.read_csv('../users-data.csv')
ratings_raw = pd.read_csv('../ratings-data.csv')

users_raw.set_index('user_id', inplace=True)
movies_raw.set_index('movie_id', inplace=True)

### Movie Data
Below we can see the movie data. Note that this data comes from external sources (TMDB or IMDB), so `vote_average` is not the rating from our stream. A number of columns are not going to be useful, and some will need to be transformed before we can use them with TPOT.

In [2]:
movies_raw.head()

movie_id,budget,genres,imdb_id,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,title,tmdb_id,vote_average,vote_count
if+you+could+only+cook+1935,0.0,"['Comedy', 'Romance']",tt0026519,en,If You Could Only Cook,An auto engineer (Herbert Marshall) and a prof...,0.430813,['Columbia Pictures'],['United States of America'],1935-12-30,0.0,72,['English'],If You Could Only Cook,126083,6.8,4
the+tattooed+widow+1998,0.0,[],tt0120299,sv,Den tatuerade änkan,Ester is the perfect grandmother everyone expe...,0.012333,[],[],1998-03-10,0.0,0,[],The Tattooed Widow,80059,0.0,0
pride+and+prejudice+1980,0.0,['Drama'],tt0078672,en,Pride and Prejudice,Mrs. Bennet is determined to find husbands for...,0.073979,"['Australian Broadcasting Corporation', 'Briti...","['United Kingdom', 'Australia']",1980-01-13,0.0,265,['English'],Pride and Prejudice,77172,6.8,14
flodder+1986,0.0,['Comedy'],tt0091060,fr,Flodder,A low-class a-social family ends up in a rich ...,6.70284,['First Floor Features'],['Netherlands'],1986-12-17,0.0,111,['Nederlands'],Flodder,10570,6.6,32
cold+turkey+1971,0.0,['Comedy'],tt0066927,en,Cold Turkey,Reverend Brooks leads the town in a contest to...,0.60908,"['Tandem Productions', 'DFI']",['United States of America'],1971-02-19,11000000.0,99,['English'],Cold Turkey,42493,7.5,4


### User Data

In [3]:
users_raw.head()

user_id,age,gender,occupation
800720,42,M,scientist
577694,31,M,college/grad student
207937,28,F,college/grad student
584780,33,M,executive/managerial
149240,34,F,sales/marketing


### Rating Data
This table contains the labels for our data. It tells us the rating a user gave a specific movie.

In [4]:
ratings_raw.head()

Unnamed: 0,movie_id,rating,user_id
0,jackass+the+movie+2002,4,615847
1,mission+impossible+iii+2006,4,284854
2,good+will+hunting+1997,5,792329
3,happy+gilmore+1996,3,709941
4,moulin+rouge+2001,3,531076


## Data Cleaning
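Before we pass the data into TPOT we should do some basic cleaning. The transformers below parse the list-valued string columns into arrays, binarize them into indicator columns, and extract the year and month from the release date.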

In [60]:
import inspect
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MultiLabelBinarizer

class MultiLabelStringToArray(BaseEstimator, TransformerMixin):
    """Parse stringified lists such as "['Comedy', 'Romance']" into arrays of labels."""
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        df = X.copy()
        for column_name in df.columns:
            # Treat missing values as empty lists, then parse each string.
            df[column_name] = df[column_name].fillna('[]').apply(self._parse_arraystr)
        return df
    def _parse_arraystr(self, value):
        # Strip brackets, quotes and spaces, then split on commas.
        # Note: this also removes the spaces inside a label.
        cleaned = value.replace("[","").replace("]","").replace("'","").replace(" ","")
        # Drop the empty strings that an empty list leaves behind.
        return np.array([label for label in cleaned.split(',') if label])
    
class MultiLabelBinarizerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mlbs = {}
    def fit(self, X, y=None):
        df = X.copy()
        for column_name in df.columns:
            mlb = MultiLabelBinarizer()
            mlb.fit(df[column_name])
            #print('Column: {NAME} Values: {VALUES}'.format(NAME=column_name, VALUES=mlb.classes_))
            self.mlbs[column_name] = mlb
        return self
    def transform(self, X, y=None):
        df = X.copy()
        binarized_cols = pd.DataFrame()
        for column_name in df.columns:
            mlb = self.mlbs.get(column_name)
            new_cols = pd.DataFrame(mlb.transform(df[column_name]),columns=mlb.classes_)
            binarized_cols = pd.concat([binarized_cols, new_cols], axis=1)
        return binarized_cols
    
class ExtractReleaseDateFeatures(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        df = X.copy()
        df = df.fillna('0000-00-00')
        df['year'] = df.iloc[:,0].apply(lambda x: str(x)[:4])
        df['month'] = df.iloc[:,0].apply(lambda x: str(x)[5:7])
        df = df.astype({'year':'int64', 'month':'int64'})
        return df.loc[:,['year','month']]
    
multilabel_binarizer_columns = ['genres','production_countries', 'spoken_languages', 'gender', 'occupation']
release_date_columns = ['release_date']
passthrough_cols = ['age','budget','popularity','revenue', 'runtime', 'vote_average', 'vote_count']

multilabel_binarizer_pipeline = Pipeline([
    ('multilabel_str_to_array',MultiLabelStringToArray()),
    ('binarizer', MultiLabelBinarizerTransformer()),
],verbose=True)

release_date_pipeline = Pipeline([
    ('extract_date_features',ExtractReleaseDateFeatures()),
],verbose=True)

full_data_clean_pipeline = ColumnTransformer([
    ('multilabel_binarizer', multilabel_binarizer_pipeline, multilabel_binarizer_columns),
    ('release_date', release_date_pipeline, release_date_columns),
    ('remove_unnecessary_cols','passthrough', passthrough_cols)
],remainder='drop',verbose=True)
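
To make the first pipeline easier to follow, here is a small sketch of what the parse-then-binarize steps do to a single column. `parse_array_string` is a standalone copy of the parsing logic written for this illustration, and the three sample values are taken from the `genres` column shown earlier.

```python
from sklearn.preprocessing import MultiLabelBinarizer


def parse_array_string(value):
    """Turn a string like "['Comedy', 'Romance']" into ['Comedy', 'Romance']."""
    cleaned = value.replace("[", "").replace("]", "").replace("'", "").replace(" ", "")
    # Splitting an empty string yields [''], so drop empty labels.
    return [label for label in cleaned.split(",") if label]


genres = ["['Comedy', 'Romance']", "[]", "['Drama']"]
parsed = [parse_array_string(g) for g in genres]

# One indicator column per distinct label, one row per movie.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(parsed)

print(list(mlb.classes_))
print(encoded.tolist())
```

Plain strings such as `M` or `scientist` parse to a one-element list, which is why `gender` and `occupation` can go through the same pipeline. Because spaces are stripped, a label like `United States of America` becomes `UnitedStatesofAmerica`.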

In [62]:
records = ratings_raw.join(users_raw, on='user_id', how='left')
records = records.join(movies_raw, on='movie_id', how='left')

records = records.sort_values(by=['user_id'])
records = records.dropna(subset=['budget','popularity','revenue','runtime','vote_average','vote_count'])
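
The join logic above can be checked on a toy example. The three small frames below are invented for illustration; only the pattern (left joins on the index of the lookup tables, then `dropna` on selected columns) mirrors the cell above.

```python
import pandas as pd

ratings = pd.DataFrame({
    "movie_id": ["a+1999", "b+2001", "c+2010"],
    "rating": [4, 5, 3],
    "user_id": [1, 2, 3],
})
# Lookup tables are indexed by their key, as in the notebook.
users = pd.DataFrame({"user_id": [1, 2], "age": [42, 31]}).set_index("user_id")
movies = pd.DataFrame(
    {"movie_id": ["a+1999", "b+2001", "c+2010"], "runtime": [90.0, None, 120.0]}
).set_index("movie_id")

# A left join keeps every rating, even when the user or movie is unknown.
joined = ratings.join(users, on="user_id", how="left")
joined = joined.join(movies, on="movie_id", how="left")

# Only rows missing one of the listed columns are dropped.
cleaned = joined.dropna(subset=["runtime"])
print(cleaned)
```

In this sketch the rating for user 3 survives with a missing `age`, because `age` is not in the `subset` list. The same applies to the notebook's `records` frame.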

In [66]:
from tpot import TPOTRegressor

X_train = full_data_clean_pipeline.fit_transform(records)
y_train = records['rating']

pipeline_optimizer = TPOTRegressor(generations=100, population_size=100, verbosity=3, random_state=42,
                                   template='Selector-Transformer-Regressor', n_jobs=8,
                                   warm_start=True, periodic_checkpoint_folder='../tpot-intermediate-save-3/')
pipeline_optimizer.fit(X_train[:10000],y_train.head(10000))

[Pipeline]  (step 1 of 2) Processing multilabel_str_to_array, total=   9.7s
[Pipeline] ......... (step 2 of 2) Processing binarizer, total=  19.5s
[ColumnTransformer]  (1 of 3) Processing multilabel_binarizer, total=  30.5s
(740732, 1)
[Pipeline]  (step 1 of 1) Processing extract_date_features, total=   0.7s
[ColumnTransformer] .. (2 of 3) Processing release_date, total=   0.7s
[ColumnTransformer]  (3 of 3) Processing remove_unnecessary_cols, total=   0.0s
30 operators have been imported by TPOT.


Skipped pipeline #17 due to time out. Continuing to the next pipeline.
Skipped pipeline #19 due to time out. Continuing to the next pipeline.
Skipped pipeline #74 due to time out. Continuing to the next pipeline.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 2 is required by FeatureAgglomeration..

TPOTRegressor(log_file=<ipykernel.iostream.OutStream object at 0x106091f70>,
              n_jobs=8,
              periodic_checkpoint_folder='../tpot-intermediate-save-3/',
              random_state=42, template='Selector-Transformer-Regressor',
              verbosity=3, warm_start=True)

In [65]:
# Score on rows that were not used for fitting; .iloc makes the positional slice explicit.
pipeline_optimizer.score(X_train[10001:20000], y_train.iloc[10001:20000])

-1.004738489339057
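
The score is negative because no `scoring` argument was passed. This reading assumes TPOT's default regression metric, negative mean squared error, is in effect. Under that assumption, the value can be converted into a root mean squared error on the 1-5 rating scale:

```python
import math

# Score reported by pipeline_optimizer.score(...) above,
# assumed to be negative mean squared error.
neg_mse = -1.004738489339057

# Flip the sign to get the MSE, then take the square root for the RMSE.
rmse = math.sqrt(-neg_mse)
print(round(rmse, 3))
```

An RMSE of roughly one means the predictions are off by about one rating point on average.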

In [None]:

pipeline_optimizer_large = TPOTRegressor(generations=10, population_size=50, verbosity=3, random_state=42,
                                         template='Selector-Transformer-Regressor', config_dict="TPOT light",
                                         # n_jobs=4,
                                         warm_start=True, periodic_checkpoint_folder='../tpot-intermediate-save-2/')
# X_train is the array returned by the ColumnTransformer, so slice it instead of calling .head().
pipeline_optimizer_large.fit(X_train[:500000], y_train.iloc[:500000])

In [None]:
pipeline_optimizer_large.score(X_train[500001:700000], y_train.iloc[500001:700000])