# [Computational Social Science] 
## 4-2 TPOT

## Virtual Environment
Remember to always activate your virtual environment before you install packages or run a notebook! This helps prevent conflicts between dependencies across projects and ensures that you are using the correct package versions. You should have created an Anaconda virtual environment in the `Anaconda Installation` lab. If you have not, or if you want to create a new virtual environment, follow the instructions in the `Anaconda Installation` lab. 

<br>

If you have already created a virtual environment, you can run the following command to activate it: 

<br>

`conda activate <virtual_env_name>`

<br>

For example, if your virtual environment is named CSS, run the following command. 

<br>

`conda activate CSS`

<br>

To deactivate your virtual environment after you are done working with the lab, run the following command. 

<br>

`conda deactivate`

<br>

Without an extensive background in the statistics and mathematics behind different machine learning models, it can be difficult to determine what the best model for a given dataset is. This also applies to tuning the parameters. As you have probably noticed, the models we've used in this class so far have many different parameters, and it's by no means obvious how to tune them. 

Moreover, testing many different models, along with many different combinations of parameters, can be extremely time-consuming and impractical. 

[TPOT](http://epistasislab.github.io/tpot/) is a tool that automates model selection and hyperparameter tuning using [genetic programming](https://en.wikipedia.org/wiki/Genetic_programming). Genetic programming is a strategy for moving from a population of poorly fit models to a population of well-fit models. The intuition is that it borrows the idea of [natural selection](https://en.wikipedia.org/wiki/Natural_selection) to search for a well-fitting model more efficiently. A helpful metaphor is the following: 

Imagine you’re trying to build the best paper airplane ever. You make a bunch of paper airplanes (these are like "programs" or "models" in our case). Then you test them to see which one flies the farthest (this is called "fitness"). The best ones are saved, and you use them to create new airplanes by mixing their designs or making small changes (this is like "crossover" and "mutation" in genetics). You keep repeating this process (making, testing, and improving planes) until you have an airplane that flies super far. This is roughly how genetic programming works, except that instead of paper airplanes, it’s creating computer programs to solve problems.
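
The loop in the metaphor can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not TPOT's actual algorithm: each "airplane" is just a list of numbers, the hypothetical `fitness` function rewards designs close to a made-up `TARGET`, and `crossover` and `mutate` are simple stand-ins for the genetic operators.

```python
import random

# Hypothetical "ideal design" that the search is trying to discover.
TARGET = [0.2, 0.8, 0.5, 0.1]

def fitness(design):
    """Higher is better: negative squared distance to the target design."""
    return -sum((d - t) ** 2 for d, t in zip(design, TARGET))

def crossover(parent_a, parent_b, rng):
    """Mix two designs by taking each trait from one parent at random."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(design, rng, scale=0.1):
    """Make a small random change to one trait."""
    child = list(design)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-scale, scale)
    return child

def evolve(generations=30, population_size=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.random() for _ in TARGET] for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        # Test every design and keep the better half.
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        survivors = population[: population_size // 2]
        # Refill the population with mutated mixes of the survivors.
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    best_per_generation.append(fitness(population[0]))
    return population[0], best_per_generation

best_design, history = evolve()
print("best fitness, first generation:", round(history[0], 4))
print("best fitness, last generation: ", round(history[-1], 4))
```

Because the best designs always survive into the next round, the best fitness never gets worse from one generation to the next. TPOT applies the same idea to whole machine learning pipelines rather than lists of numbers.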

TPOT also determines what preprocessing, if any, is necessary, such as PCA or standard scaling. It can then export the best pipeline it found to a file, with the scikit-learn code written for you. Although it is in your best interest to learn as much about the theory behind machine learning as possible, tools like TPOT can, in principle, do much of this work for you. 
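
To make that concrete, here is a hand-written sketch of the kind of scikit-learn pipeline such a file might contain. It is illustrative only: the steps (standard scaling, PCA, logistic regression), their settings, and the synthetic data from `make_classification` are assumptions chosen for the example, not output produced by TPOT.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the census data used in this lab).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Preprocessing steps followed by a model, chained into a single estimator.
demo_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)

demo_pipeline.fit(X_tr, y_tr)
print("test accuracy:", demo_pipeline.score(X_te, y_te))
```

TPOT's search amounts to trying many such combinations of preprocessing steps, models, and settings, and keeping the ones that score best.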

TPOT can be used for both classification and regression. First let's install tpot:

In [None]:
# uncomment to install
#!pip install tpot

In [2]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from tpot import TPOTRegressor
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split



## Classification

First, let's see how TPOT works with classification. Let's load our census data one last time:

In [3]:
#
# process census data
# --------------------------------


# set random seed 
# ----------
np.random.seed(10)

# Create a list of column names, found in "adult.names"
# ----------
col_names = ['age', 
             'workclass', 
             'fnlwgt',
             'education', 
             'education-num',
             'marital-status', 
             'occupation', 
             'relationship', 
             'race', 
             'sex', 
             'capital-gain',
             'capital-loss', 
             'hours-per-week',
             'native-country', 
             'income-bracket']

# Read table from the data folder
# ----------
census = pd.read_table("../../data/adult.data", sep = ',', names = col_names)

# process target
# ----------
lb_style = LabelBinarizer()
y = census['income-bracket-binary'] = lb_style.fit_transform(census["income-bracket"])

# process features
# ----------
X = census.drop(['income-bracket', 'income-bracket-binary'], axis = 1)
X = pd.get_dummies(X)

In [4]:
# split data 
# ----------
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    train_size=0.75, 
                                                    test_size=0.25)

TPOT has a few key hyperparameters that we need to set.
- **Generations**: The number of iterations that TPOT runs while searching for the best pipeline
- **Population_Size**: The number of candidate pipelines that TPOT keeps and evaluates in each generation

By default, TPOT uses 100 generations and a population size of 100. Note the nod to genetics in the parameter names (*generations* and *population_size*). The number of configurations it searches through is roughly `generations * population_size`, so by default it will search on the order of 10,000 different pipelines. Letting it search through more pipelines generally improves the chance of finding a better one, at the cost of a longer run time. Here we initialize the model with just 2 generations and a population size of 2:

In [5]:
#
# run TPOT for classification
# --------------------------------

# specify TPOT
# ----------
tpot = TPOTClassifier(generations=2,      # set the number of iterations 
                      population_size=2,  # set number of models
                      random_state = 1)   # set random seed

# fit to training data
# ----------
tpot.fit(X_train, 
         y_train.ravel())

# print results
# ----------
print(tpot.score(X_test, 
                 y_test.ravel()))
# export 
# ----------
tpot.export('tpot_census_pipeline.py')

0.85837120746837


The cell above also exported the best pipeline to a file. We can now check the code that was generated for it:

In [6]:
# Mac/Linux users:
# ----------
#!cat tpot_census_pipeline.py

# Windows users:
# ----------
#!type tpot_census_pipeline.py
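
If you would rather not depend on the operating system's shell, a cross-platform alternative is to read the file with Python itself. The sketch below uses only the standard library. The helper name `show_file` is hypothetical, and it assumes the export step above has already written `tpot_census_pipeline.py` to the working directory.

```python
from pathlib import Path

def show_file(path):
    """Return the text of `path`, or a short notice if it does not exist."""
    file_path = Path(path)
    if not file_path.exists():
        return f"{file_path} not found; run the export step first."
    return file_path.read_text()

print(show_file("tpot_census_pipeline.py"))
```

The same helper should work for any other exported pipeline file, such as the one written in the regression section.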

## Regression

We can also use TPOT for regression! Let's return to our bike dataset:

In [7]:
#
# process bike data
# --------------------------------

# load bike data
# ----------
bike = pd.read_csv('../../data/day.csv')

# reformat the date column to integers representing the day of the year, 1-366
# ----------
bike['dteday'] = pd.to_datetime(bike['dteday']).dt.dayofyear

# get rid of the index column
# ----------
bike = bike.drop('instant', axis=1)

In [8]:
# the features used to predict riders
# ----------
X_bike = bike.drop(['casual', 'registered', 'cnt'], axis=1)

# the number of riders
# ----------
y_bike = bike['cnt']

# split data
# ----------
X_bike_train, X_bike_test, y_bike_train, y_bike_test = train_test_split(X_bike, 
                                                                        y_bike,
                                                                        train_size=0.75, 
                                                                        test_size=0.25)

In [9]:
bike.head()

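|   | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |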


Now let's search through some regression models. Again we will use just 4 configurations:

In [10]:
#
# run TPOT for regression
# --------------------------------

# specify TPOT
# ----------
tpot = TPOTRegressor(generations=2,        # set the number of iterations
                     population_size=2,    # set number of models
                     scoring='r2',         # set scoring to r2
                     random_state = 2)     # set random seed



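# fit to training data (convert the pandas Series to a NumPy array;
# Series.ravel() is deprecated in recent pandas versions)
# ----------
tpot.fit(X_bike_train, 
         y_bike_train.to_numpy())

# print results
# ----------
print(tpot.score(X_bike_test, 
                 y_bike_test.to_numpy()))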

# export
# ----------
tpot.export('tpot_bike_pipeline.py')


0.8255022864905334



In [11]:
# Mac/Linux users:
# ----------
#!cat tpot_bike_pipeline.py

# Windows users:
# ----------
#!type tpot_bike_pipeline.py

## Challenge

Using either the census or bike dataset, try playing with the TPOT hyperparameters. Note that the more you increase generations and population, the longer it will take the code to run. In fact, the TPOT documentation suggests letting the pipeline run for several hours or even days if you can. 
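
To get a rough feel for how quickly the search grows, the sketch below tabulates the number of pipelines evaluated for a few settings, using the generations × population_size approximation described earlier. The helper `search_budget` and the assumed 30 seconds per pipeline are hypothetical and for illustration only. Real evaluation times depend on the dataset, the models tried, and the cross-validation settings.

```python
def search_budget(generations, population_size):
    """Approximate number of pipelines evaluated (rule of thumb from this lab)."""
    return generations * population_size

SECONDS_PER_PIPELINE = 30  # assumed average cost per pipeline (hypothetical)

for generations, population_size in [(2, 2), (5, 5), (20, 20), (100, 100)]:
    n_pipelines = search_budget(generations, population_size)
    hours = n_pipelines * SECONDS_PER_PIPELINE / 3600
    print(f"generations={generations:>3}, population_size={population_size:>3}: "
          f"{n_pipelines:>6} pipelines, about {hours:.1f} hours")
```

Under that assumed cost, the default settings of 100 generations and a population size of 100 work out to roughly 83 hours, which is consistent with the advice to let the search run for hours or days.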

In [12]:
#
# run TPOT 
# --------------------------------

# specify TPOT
# ----------
tpot = TPOTClassifier(generations=5,             # play with the number of iterations
                      population_size=5,         # play with the number of models
                      scoring = 'f1',            # set scoring to f1
                      random_state = 3)          # set random seed

# fit to training data
# ----------
tpot.fit(X_train, 
         y_train.ravel())

# print
# ----------
print(tpot.score(X_test, 
                 y_test.ravel()))

# export
# ----------
tpot.export('tpot_census_pipeline_new_params.py')

0.7069160997732427
