https://towardsdatascience.com/tpot-automated-machine-learning-in-python-e56800e69c11

*Genetic Programming (GP) is a type of Evolutionary Algorithm (EA), a subset of machine learning. EAs are used to discover solutions to problems that humans do not know how to solve directly. Free of human preconceptions or biases, the adaptive nature of EAs can generate solutions that are comparable to, and often better than, the best human efforts.
Inspired by biological evolution and its fundamental mechanisms, GP software systems implement an algorithm that uses random mutation, crossover, a fitness function, and multiple generations of evolution to resolve a user-defined task. GP can be used to discover a functional relationship between features in data (symbolic regression), to group data into categories (classification), and to assist in the design of electrical circuits, antennae, and quantum algorithms. GP is applied to software engineering through code synthesis, genetic improvement, automatic bug-fixing, and the development of game-playing strategies, among other uses.*
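
To make the four ingredients named above concrete (random mutation, crossover, a fitness function, and multiple generations), here is a minimal sketch. It is a plain genetic algorithm on bit strings (the "OneMax" toy problem), not true genetic programming on program trees and not TPOT's own implementation. All names and parameter values (`evolve`, `rate`, `pop_size`, and so on) are assumptions chosen for illustration.

```python
import random


def fitness(bits):
    # Fitness function: count the 1s in the bit string ("OneMax").
    return sum(bits)


def mutate(bits, rate, rng):
    # Random mutation: flip each bit independently with probability `rate`.
    return [1 - b if rng.random() < rate else b for b in bits]


def crossover(a, b, rng):
    # One-point crossover: prefix from one parent, suffix from the other.
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(n_bits=20, pop_size=30, generations=40, rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # keep the fitter half as parents
        children = [population[0]]             # elitism: best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = children
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is copied forward unchanged, the best fitness can never decrease from one generation to the next. The "Current best internal CV score" lines that TPOT prints further down behave the same way, with a cross-validated score in place of this toy fitness function.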

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

In [27]:
data = pd.read_csv( 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv' )
                   
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [28]:
data.drop( [ 'Ticket' , 'PassengerId' ] , axis = 1 , inplace = True )

In [29]:
gender_mapper = { 'male' : 0 , 'female' : 1 }

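# map() returns a new Series; assigning it back avoids the chained in-place replace, which newer pandas versions warn about
data[ 'Sex' ] = data[ 'Sex' ].map( gender_mapper )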

In [30]:
data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.05,,S


In [31]:
data[ 'Title' ] = data[ 'Name' ].apply( lambda x : x.split( ',' )[ 1 ].strip().split( ' ' )[ 0 ] )

data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title
0,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.25,,S,Mr.
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,C85,C,Mrs.
2,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.925,,S,Miss.
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1,C123,S,Mrs.
4,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.05,,S,Mr.


In [32]:
data[ 'Title' ] = [ 0 if x in [ 'Mr.' , 'Miss.' , 'Mrs.' ] else 1 for x in data[ 'Title' ] ]

data = data.rename( columns = { 'Title' : 'Title_Unusual' } )

data.drop( 'Name' , axis = 1 , inplace = True ) 

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked,Title_Unusual
0,0,3,0,22.0,1,0,7.25,,S,0
1,1,1,1,38.0,1,0,71.2833,C85,C,0
2,1,3,1,26.0,0,0,7.925,,S,0
3,1,1,1,35.0,1,0,53.1,C123,S,0
4,0,3,0,35.0,0,0,8.05,,S,0
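
The two cells above extract a title from each name and then flag anything other than `Mr.`, `Miss.`, or `Mrs.` as unusual. As a rough illustration of what that string logic does, the sketch below applies it to plain Python strings. The first two names come from the output above; the last two are hypothetical examples, and the helper names `extract_title` and `is_unusual` are assumptions introduced here.

```python
def extract_title(name):
    # Same logic as the lambda above: take the text after the comma,
    # strip surrounding whitespace, and keep the first space-separated token.
    return name.split(',')[1].strip().split(' ')[0]


def is_unusual(title):
    # 0 for the three common titles, 1 for everything else.
    return 0 if title in ('Mr.', 'Miss.', 'Mrs.') else 1


names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Example, Dr. Jane',                     # hypothetical name
    'Example, the Countess. of (Jane Doe)',  # hypothetical multi-word title
]

for name in names:
    title = extract_title(name)
    print(repr(title), is_unusual(title))
```

Note that a multi-word title is cut at its first word (`the` in the last example). It still ends up flagged as unusual, so the final feature is unaffected, but the intermediate `Title` column should not be read as a clean title.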


In [33]:
data[ 'Cabin_Known' ] = data[ 'Cabin' ].notna().astype( int )

data.drop( 'Cabin' , axis = 1 , inplace = True )

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title_Unusual,Cabin_Known
0,0,3,0,22.0,1,0,7.25,S,0,0
1,1,1,1,38.0,1,0,71.2833,C,0,1
2,1,3,1,26.0,0,0,7.925,S,0,0
3,1,1,1,35.0,1,0,53.1,S,0,1
4,0,3,0,35.0,0,0,8.05,S,0,0


In [34]:
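# dtype = int keeps the dummy columns as 0/1; recent pandas versions default to True/False
emb_dummies = pd.get_dummies( data[ 'Embarked' ] , drop_first = True , prefix = 'Embarked' , dtype = int )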

data = pd.concat( [ data , emb_dummies ] , axis = 1 )

data.drop( 'Embarked' , axis = 1 , inplace = True )

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Title_Unusual,Cabin_Known,Embarked_Q,Embarked_S
0,0,3,0,22.0,1,0,7.25,0,0,0,1
1,1,1,1,38.0,1,0,71.2833,0,1,0,0
2,1,3,1,26.0,0,0,7.925,0,0,0,1
3,1,1,1,35.0,1,0,53.1,0,1,0,1
4,0,3,0,35.0,0,0,8.05,0,0,0,1


In [35]:
data[ 'Age' ] = data[ 'Age' ].fillna( int( data[ 'Age' ].mean() ) )

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Title_Unusual,Cabin_Known,Embarked_Q,Embarked_S
0,0,3,0,22.0,1,0,7.25,0,0,0,1
1,1,1,1,38.0,1,0,71.2833,0,1,0,0
2,1,3,1,26.0,0,0,7.925,0,0,0,1
3,1,1,1,35.0,1,0,53.1,0,1,0,1
4,0,3,0,35.0,0,0,8.05,0,0,0,1


In [36]:
X = data.drop( 'Survived' , axis = 1 )

y = data[ 'Survived' ]

In [37]:
X_train , X_test , y_train , y_test = train_test_split( X , y , train_size = 0.8 )

In [41]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform( X_train )

X_test_scaled = scaler.transform( X_test )
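
The cell above fits the scaler on the training set only and reuses the learned statistics on the test set, so no information from the test set leaks into preprocessing. A minimal pure-Python sketch of that idea for a single column is shown below. The helper names `fit_scaler` and `apply_scaler` are assumptions, the fares are taken from the `data.head()` output, and the train/test split of those five values is made up for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    # Learn the mean and population standard deviation from training values only.
    mean = fmean(column)
    std = pstdev(column)
    # Guard against a constant column, where the standard deviation is zero.
    return mean, (std if std > 0 else 1.0)


def apply_scaler(column, mean, std):
    # Standardize using statistics learned elsewhere (here: from the training set).
    return [(x - mean) / std for x in column]


train_fare = [7.25, 71.2833, 7.925, 53.1]  # illustrative "training" fares
test_fare = [8.05]                         # illustrative "test" fare

mean, std = fit_scaler(train_fare)
train_scaled = apply_scaler(train_fare, mean, std)
test_scaled = apply_scaler(test_fare, mean, std)  # reuses the training mean and std

print(train_scaled, test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1 by construction. The test column generally does not, which is expected.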

In [42]:
from tpot import TPOTClassifier

In [43]:
tpot = TPOTClassifier( verbosity = 2 , max_time_mins = 10 )

tpot.fit( X_train_scaled , y_train )




Generation 1 - Current best internal CV score: 0.831370038412292
Generation 2 - Current best internal CV score: 0.831370038412292
Generation 3 - Current best internal CV score: 0.8342164877376146
Generation 4 - Current best internal CV score: 0.8356150891362158
Generation 5 - Current best internal CV score: 0.8384418398502905
Generation 6 - Current best internal CV score: 0.8384418398502905

10.001521966666667 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LinearSVC(RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.7000000000000001, min_samples_leaf=15, min_samples_split=20, n_estimators=100), C=0.5, dual=False, loss=squared_hinge, penalty=l1, tol=0.0001)


TPOTClassifier(generations=1000000, max_time_mins=10, verbosity=2)

In [46]:
# best pipeline

tpot.fitted_pipeline_

Pipeline(steps=[('stackingestimator',
                 StackingEstimator(estimator=RandomForestClassifier(max_features=0.7000000000000001,
                                                                    min_samples_leaf=15,
                                                                    min_samples_split=20))),
                ('linearsvc', LinearSVC(C=0.5, dual=False, penalty='l1'))])
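
The fitted pipeline above stacks two models: a random forest whose outputs are passed on as extra features, followed by a linear SVM. The sketch below imitates that structure with scikit-learn alone on synthetic data. It assumes that the stacking step adds the forest's predicted label and class probabilities as new columns in front of the original features; treat that column layout, the helper name `add_forest_features`, and the synthetic dataset as assumptions for illustration, not as TPOT's exact behaviour.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# First stage: random forest with the hyperparameters reported by TPOT.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features=0.7,
    min_samples_leaf=15,
    min_samples_split=20,
    random_state=0,
).fit(X_tr, y_tr)


def add_forest_features(forest, X):
    # Put the forest's predicted label and class probabilities
    # in front of the original feature columns.
    predicted = forest.predict(X).reshape(-1, 1)
    return np.hstack([predicted, forest.predict_proba(X), X])


# Second stage: L1-penalized linear SVM trained on the augmented features.
svc = LinearSVC(C=0.5, dual=False, penalty='l1')
svc.fit(add_forest_features(forest, X_tr), y_tr)

accuracy = svc.score(add_forest_features(forest, X_te), y_te)
print(accuracy)
```

For real use, `tpot.fitted_pipeline_` already is a fitted scikit-learn `Pipeline` and can be used for prediction directly; the sketch is only meant to show what the nested `LinearSVC(RandomForestClassifier(...))` notation stands for.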

In [47]:
tpot.score( X_test_scaled , y_test )

0.8100558659217877