# **AIDI-1010**

# **Contents:**
**Module: TPOT**

1.   Installing The Module (TPOT)
2.   Example-A (Classification)
3.   Example-B (Regression)
4.   Takeaways & Homework
5.   Offline Examples





# **1 - Installing TPOT (Google Colab)**

In [None]:
#1.1 - Install Linux Dependencies & Module (1.5mins)
!sudo apt-get install build-essential swig
!pip install TPOT
!pip install dask==2021.6.2 dask-glm==0.2.0 dask-ml==1.0.0
!pip install distributed==2021.6.2
!pip install cloudpickle==1.5.0
!pip install dask distributed --upgrade
!pip install tornado==5.1.0
!pip install xgboost==1.1.0
!pip install pipelineprofiler

**Note: Restart Runtime & Re-Run below**

In [None]:
#1.2 - Re-Check Linux Dependencies & Module
!sudo apt-get install build-essential swig
!pip install TPOT
!pip install dask==2021.6.2 dask-glm==0.2.0 dask-ml==1.0.0
!pip install distributed==2021.6.2
!pip install cloudpickle==1.5.0
!pip install dask distributed --upgrade
!pip install tornado==5.1.0
!pip install xgboost==1.1.0
!pip install pipelineprofiler

# **2 - Example-A (Classification)**

In [None]:
#Scenario: Sonar dataset, binary classification. Predict whether sonar returns indicate a rock or simulated mine.

#2.1 - Load Modules
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from tpot import TPOTClassifier


In [None]:
#2.2 - Load Dataset
dfA = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv', header=None) #file has no header row; header=None keeps the first record as data

print("Data types: \n",dfA.dtypes,"\n")

In [None]:
dfA.info() #info() prints its own summary and returns None, so no print() is needed

In [None]:
print("Head records: \n",dfA.head(),"\n")
print("Shape/Data in dataframe: \n",dfA)

In [None]:
#2.3.1 - Split data into input/output elements
datasetA = dfA.values
X2, y2 = datasetA[:,:-1], datasetA[:,-1]

#2.3.2 - Minimally prepare the dataset
X2 = X2.astype('float32')
y2 = LabelEncoder().fit_transform(y2.astype('str'))

In [None]:
#2.4 - Define evaluation procedure
cvA = RepeatedStratifiedKFold(n_splits=10, n_repeats=3,random_state=1)
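
As a rough guide to what this evaluation procedure costs, the sketch below counts the train/test splits the cross-validator produces. It uses the same settings as `cvA`; the arrays `X_demo` and `y_demo` are synthetic stand-ins (an assumption for illustration), not the sonar data.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Same settings as cvA in step 2.4
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Synthetic stand-in data (assumption): 100 rows, two balanced classes
X_demo = np.zeros((100, 4))
y_demo = np.array([0, 1] * 50)

# Total number of train/test splits: 10 folds x 3 repeats
n_fits = cv_demo.get_n_splits(X_demo, y_demo)
print(n_fits)  # 30

# Each split holds out one tenth of the rows for testing
train_idx, test_idx = next(cv_demo.split(X_demo, y_demo))
print(len(train_idx), len(test_idx))  # 90 10
```

Each candidate pipeline is scored on all 30 splits, which is worth keeping in mind when choosing `max_time_mins`.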

In [None]:
#2.5 - Define Model
modelTPOTA = TPOTClassifier(generations=5,population_size=50,cv=cvA,scoring='accuracy',verbosity=2,random_state=1,n_jobs=1,max_time_mins=2)

#2.6 - Fit data into Model
modelTPOTA.fit(X2, y2)

#2.7 - Export best model
modelTPOTA.export('TPOT_A_sonar_best_model.py')

#2.8 - Show the best pipeline
modelTPOTA.fitted_pipeline_

**Takeaway1:** Can you reach a higher accuracy? How?

**Takeaway2:** What's the accuracy on newer data (if you created any)? What other useful parameters were you able to find?

**Takeaway3:** Make a prediction on a new row of data
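
One possible way to approach Takeaway3 is sketched below. Since `fitted_pipeline_` is a scikit-learn pipeline, a plain scikit-learn pipeline is used here as a stand-in so the example runs without TPOT. The data is synthetic (60 random features with 'R'/'M' labels), and the names `stand_in_pipeline`, `encoder`, and `new_row` are hypothetical. Note that step 2.3.2 discards its `LabelEncoder`; keeping a reference to it, as done here, is an assumption made so the integer prediction can be mapped back to a label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for the sonar data (assumption): 40 rows, 60 features
X_demo = rng.random((40, 60)).astype('float32')
labels = np.array(['R', 'M'] * 20)

# Keep the encoder so predictions can be decoded later
encoder = LabelEncoder()
y_demo = encoder.fit_transform(labels)  # classes are sorted: 'M' -> 0, 'R' -> 1

# Stand-in for modelTPOTA.fitted_pipeline_ (not the pipeline TPOT would find)
stand_in_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
stand_in_pipeline.fit(X_demo, y_demo)

# A single new row must be reshaped to 2D: (1 row, 60 features)
new_row = rng.random(60).astype('float32')
new_row_2d = new_row.reshape(1, -1)

pred_int = stand_in_pipeline.predict(new_row_2d)
pred_label = encoder.inverse_transform(pred_int)
print("Predicted class:", pred_label[0])
```

With the real model, the same pattern would be `modelTPOTA.predict(new_row_2d)` on a row holding actual sonar readings.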

# **3 - Example-B (Regression)**

In [None]:
#Scenario: Auto-insurance dataset, regression. Predict the claims again, as we did last time with auto-sklearn.

#3.1 - Load Modules
import pandas as pd
from sklearn.model_selection import RepeatedKFold
from tpot import TPOTRegressor

In [None]:
#3.2 - Load Dataset
dfB = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv', header=None) #file has no header row; header=None keeps the first record as data

print("Data types: \n",dfB.dtypes,"\n")
dfB.info() #info() prints its own summary and returns None, so no print() is needed
print("Head Records: \n",dfB.head(),"\n")
print("Shape/Data in dataframe: \n",dfB)

In [None]:
#3.3 - Split data into input/output elements
datasetB = dfB.values
datasetB = datasetB.astype('float32')
X3, y3 = datasetB[:,:-1], datasetB[:,-1]

In [None]:
#3.4 - Define evaluation procedure
cvB = RepeatedKFold(n_splits=10,n_repeats=3,random_state=1)

In [None]:
#3.5 - Define Model
modelTPOTB = TPOTRegressor(generations=5,population_size=50,scoring='neg_mean_absolute_error',cv=cvB, verbosity=2, random_state=1,n_jobs=1,max_time_mins=5)

#3.6 - Fit data into Model
modelTPOTB.fit(X3,y3)

#3.7 - Export best model
modelTPOTB.export('TPOT_B_insurance_best_model.py')

#3.8 - Show the best pipeline
modelTPOTB.fitted_pipeline_

**Takeaway1:** Can you reach a lower error (MAE)? How?

**Takeaway2:** What's the error on newer data (if you created any)? What other useful parameters were you able to find?

**Takeaway3:** Make a prediction on a new row of data
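
When comparing runs for the takeaways above, it may help to recall that `scoring='neg_mean_absolute_error'` follows scikit-learn's "higher is better" convention, so the score is the mean absolute error with its sign flipped. The sketch below computes the metric by hand so the sign convention is visible; the values are made up for illustration and do not come from the insurance dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values purely for illustration (not from the dataset)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]

mae = mean_absolute_error(y_true, y_pred)  # (2 + 2 + 3 + 0) / 4
neg_mae = -mae  # sign convention behind 'neg_mean_absolute_error'

print(mae)      # 1.75
print(neg_mae)  # -1.75
```

A score closer to zero therefore means a better regression pipeline.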

# **4 - Takeaways & Homework**

https://epistasislab.github.io/tpot/examples/ 


https://towardsdatascience.com/tpot-automated-machine-learning-in-python-4c063b3e5de9

# **5A - (Classification)**

In [None]:
#Credit: https://towardsdatascience.com/tpot-automated-machine-learning-in-python-e56800e69c11

#5A.1 - Loading Modules to prepare Dataset
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
#5A.2 - Import Data
data5A = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
data5A.head()

In [None]:
#5A.3 - Data preparation: apply the 6 steps below

#5A.3.1 - Drop irrelevant columns
data5A.drop(['Ticket', 'PassengerId'], axis=1, inplace=True)

#5A.3.2 - Remap/Convert Sex column to integer representation: 0s/1s
gender_mapper = {'male':0, 'female':1} #males zero and females ones
data5A['Sex'] = data5A['Sex'].replace(gender_mapper) #assign the result back; inplace replace on a selected column is unreliable in recent pandas

#5A.3.3 - Extract the title from the Name column and flag whether it is unusual (like Dr.) or common (Mr., Miss., Mrs.)
data5A['Title'] = data5A['Name'].apply(lambda x: x.split(',')[1].strip().split(' ')[0])
data5A['Title'] = [0 if x in ['Mr.', 'Miss.', 'Mrs.'] else 1 for x in data5A['Title']] #set to zero if title is common; extracted titles keep their trailing period
data5A = data5A.rename(columns={'Title': 'Title_Unusual'}) #rename column
data5A.drop('Name',axis=1,inplace=True) #drop the original "Name" column

#5A.3.4 - Check if cabin information was known, i.e. the value in the Cabin column is NOT NaN
data5A['Cabin_Known'] = [0 if str(x) == 'nan' else 1 for x in data5A['Cabin']]
data5A.drop('Cabin',axis=1,inplace=True)

#5A.3.5 - Create dummy variables from Embarked column
emb_dummies = pd.get_dummies(data5A['Embarked'],drop_first=True,prefix='Embarked') #this results in another df object which needs to be concatenated to the original dataframe
data5A = pd.concat([data5A,emb_dummies],axis=1) #concatenation
data5A.drop('Embarked',axis=1,inplace=True)

#5A.3.6 - Fill Age values with simple mean of the column
data5A['Age'] = data5A['Age'].fillna(int(data5A['Age'].mean()))

data5A.head()

Columns "Age" and "Fare" hold continuous values on a much larger scale than the other columns, so the features are standardized with StandardScaler.
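
To make the scaling step concrete, here is a minimal sketch of the calculation StandardScaler performs on a single column: subtract the training mean and divide by the training (population) standard deviation. The ages are made up for illustration, and the helper `standardize` is a hypothetical name, not part of scikit-learn.

```python
from statistics import fmean, pstdev

# Made-up ages for illustration only (not taken from the Titanic data)
train_ages = [22.0, 38.0, 26.0, 35.0]
test_ages = [54.0, 2.0]

# "fit": learn the mean and population standard deviation from training rows only
mu = fmean(train_ages)
sigma = pstdev(train_ages)

def standardize(values):
    """Apply z = (x - mean) / std using the training statistics."""
    return [(v - mu) / sigma for v in values]

# "transform": the same training statistics are reused for the test rows
train_scaled = standardize(train_ages)
test_scaled = standardize(test_ages)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test rows mirrors calling `fit_transform` on the training set and `transform` on the test set, so no information from the test set leaks into preprocessing.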

In [None]:
#5A.3.7 - Split the Data
X = data5A.drop('Survived',axis=1)
y = data5A['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8) #80% for training and rest for testing

In [None]:
#5A.3.8 - Standard Scaler to fit train data and transform train/test data

ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

In [None]:
#5A.4 - Loading Modules to use TPOT
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tpot import TPOTClassifier #TPOT's Classifier
import pandas as pd
import PipelineProfiler

In [None]:
#5A.5 - Training Process/Fit the data into model
tpot5A = TPOTClassifier(verbosity=2,max_time_mins=10) #10 minutes is a common time budget
tpot5A.fit(X_train_scaled, y_train)

**Question: What do the statistics show?**
- How many runs were there? How many were successful?

In [None]:
#5A.6 - Show the best pipeline
tpot5A.fitted_pipeline_

In [None]:
#5A.7 - Show the accuracy on test set
tpot5A.score(X_test_scaled, y_test)

**Takeaway1:** Can you reach a higher percentage? How? 

**Takeaway2:** What's the accuracy on newer data (if you created any)? What other useful parameters were you able to find?

# **5B - (Classification)**

Scenario:
Predict the onset of diabetes within 5 years using the Pima Indians Diabetes dataset.

Maximum accuracy achieved has been 77.47% thus far.

In [None]:
#5B.1 - Load Modules
import tpot
from tpot import TPOTClassifier
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
import os

In [None]:
#5B.2 - Load Data
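df5B = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', header=None) #file has no header row; header=None keeps the first record as data
print("Data Types: \n",df5B.dtypes,"\n")
df5B.info() #fixed variable name (df5B, not df5b); info() prints its own summary and returns None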
print("Head Records: \n",df5B.head(),"\n")
print("Shape/Data in dataframe: \n",df5B)


In [None]:
#5B.3 - Splitting Data into input/output features
data5B = df5B.values
X5B, y5B = data5B[:, :-1], data5B[:, -1]
print(X5B.shape, y5B.shape)

X5B = X5B.astype('float32')
y5B = LabelEncoder().fit_transform(y5B.astype('str'))

In [None]:
#5B.4.1 - Define model evaluation using StratifiedKFold with 10 folds
cv5B = StratifiedKFold(n_splits=10)

#5B.4.2 - Define TPOTClassifier
modelTPOT5B = TPOTClassifier(generations=5,population_size=50,cv=cv5B,scoring='accuracy',verbosity=2,random_state=1,n_jobs=1,max_time_mins=10)

#5B.4.3 - Fit the data into model
modelTPOT5B.fit(X5B,y5B)

#5B.4.4 - Export best model
modelTPOT5B.export('TPOT_5B_data.py')

In [None]:
#5B.4.5 - Show the best pipeline
modelTPOT5B.fitted_pipeline_

**Takeaway1:** Can you reach a higher accuracy? How?

**Takeaway2:** What's the accuracy on newer data (if you created any)? What other useful parameters were you able to find?

**Takeaway3:** Try experimenting with StratifiedKFold using 5 folds

**Takeaway4:** Make a prediction on a new row of data
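
Before trying the 5-fold experiment, it can help to see what stratification does to each fold. The sketch below splits synthetic labels (an assumption: 65 negatives and 35 positives, not the actual diabetes data) with `StratifiedKFold(n_splits=5)` and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (assumption): 65 negatives and 35 positives
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.zeros((len(y_demo), 1))  # features do not affect how rows are split

cv5 = StratifiedKFold(n_splits=5)

# Count how many samples of each class land in every test fold
fold_counts = []
for train_idx, test_idx in cv5.split(X_demo, y_demo):
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_counts.append((int(counts[0]), int(counts[1])))

print(fold_counts)  # every fold keeps the 65/35 ratio: (13, 7)
```

With fewer folds, each test fold is larger (20 rows here instead of 10), so every pipeline is fit fewer times, at the price of training on a smaller share of the data.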