#Genetic Algorithm

---

![genetic algorithm](https://i.imgur.com/HTg4SSJ.png)


---

>Genetic algorithm is a type of optimization algorithm that are used to find the maximum and minimum of functions of a computational problem. The working represent the concept of evolution.

>The name make it seems very hard, But after going through the blog you will find how easy it is to understand and implement.

>The Term Naive is used is because, Like in nature where there are many factors which effects the natural selection may be habitat,characteristics of the population etc.We try to replicate those here using randomness created by us but more of in a controlled way.

>These algorithms are far powerful and efficient than random search and other exhaustive algorithms yet doesn't require much information about the problem.

---

**The flow of the Algorithm:**

    - Initial population
    - A fitness function for optimization.
    - Selection of which chromosomes to reproduce.
    - Crossover of chromosomes to produce next generation.
    - Mutation of chromosomes in generation. 
    - Again goes back to the fitness function for optimization. 

---
Terms used: 



>***Chromosome:*** Its a numerical representation of parameters in the problem. The values inside  the chromosomes can be binary, symbol or numerical values.
These are the parameter values.

>If a problem has $N_{par}$ dimensions then,

>Chromosome=  [ $p_1,p_2,p_3,...p_{Npar}$ ]

>Where $p_i$ represents a particular value of $i^{th}$ parameter.

>So our problem has to be formed in the shape of these chromosomes.


>***Initial population:*** This is the initial set of chromosomes which is taken at random,

>***Fitness function:*** This is the most important part of the genetic algorithm. Here each chromosome is tested on the fitness function to see how well it solves the problem. We can consider the fitness function as the habitat to which the chromosomes evolve. It determines how the chromosomes evolves over time and can mean the difference between finding an optimal solution and finding no solution at all.

>Fitness function has to say more than what is good or bad chromosomes, It needs to accurately score the chromosomes based on range of fitness value, so that a complete solution can be distinguished from a more complete solution which will determine how the population will move.

___**Selection for reproduction:**___

How can we select chromosomes for reproduction?


One mechanism which is simple is by taking the probability of the fitness score of the chromosomes. 


$P_i =\frac {f_i} {\sum_i f_i}$

Same chromosome can be taken more than once.

**Another mechanism can be a Rank space.**



$P_1=P_c$

$P_2=(1-P_c)*P_c$

$P_3=(1-P_c)^2*P_c$

$P_{N-1}=(1-P_c)^{N-2}*P_c$

$P_N=(1-P_c)^{N-1}$


*where*


$P_c$  Probability constant value for choosing that chromosome.

$(1-Pc)$ Probability constant value for not choosing that chromosome.

$P_1$  Probability of first chromosome being chosen.

$P_2$  Probability of second chromosome being chosen.


___One more mechanism can be thought of:___

Where we not only consider the fitness score. We don't want just only the most fittest one in the population, we also want diversity in our population. 

Where we can choose chromosomes which fall under desired area of fitness and diversity level.

**Crossover:**
>In this simple example we see that how a part of the A chromosome is cross over to the part of the B chromosome.Just like mitosis.

---

![crossover](https://i.imgur.com/IL9wmva.png)

---

**Mutation:**

>It means we are flipping individual bits in the new chromosomes.

---

![mutation](https://i.imgur.com/DNBkU8E.png).

---


>Mutation plays a very important role after the process of selection and crossover the generation is superior compared to the initial population it generated from  like converging to a local optima. Mutation prevents being stuck in this local optimum helps in achieving the global optimum. Mutation helps prevent the algorithm reaching a sub-optimal solution by maintaining diversity.

>Now again this process is repeated where the new generation is again tested . It stops after it replaces the entire initial population.This one iteration is termed as a run.

>To get a better performance from genetic algorithm, there is a paper which tries to calculate value of mutation probability, which maximizes the probability that the genetic algorithm finds the optimum value of the objective function under simple assumptions. It can be accessed [here](https://www.sciencedirect.com/science/article/pii/089571779500035Z).


---
***How the performance is affected in genetic algorithm?***

>It can vary based on how the method used to encode candidates solution into chromosomes. The fitness solution, what it is actually measuring, the probability of mutation, the probability of cross over, the size of the population, the number of runs.

>These values can be adjusted after assessing the algorithm’s performance on a
few trial runs

>Now the most famous examples which is always taken when explaining is the traveling sales man problem, well we are not going to do that here. 
We are going to use it for much practical data set, We will use it in the implementation of the German credit data set.

---

>Well basic objective is to find out whether to give credit to a particular client or not based on certain parameters. These parameters and the data set are available [here](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)).

>This is actually a simple data set with 1000 observations.We have ,AGE,Sex,Job,Owns a house,Savings Account,Checking Account,credit amount,Duration,Purpose,Risk.Now with the 1000 observations its actually less but lets run our normal procedure and find out how well our machine learning model predicts the output.

---



#Implementation

---


In [0]:
#loading the file into google collaboratory
from google.colab import files
uploaded = files.upload()

Saving german_credit_data.csv to german_credit_data (1).csv


In [0]:
print (uploaded['german_credit_data.csv'][:200].decode('utf-8') + '...')

,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,NA,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,mal...


In [0]:
#checking the initial data
import pandas as pd
import io

df = pd.read_csv(io.StringIO(uploaded['german_credit_data.csv'].decode('utf-8')))
df

Unnamed: 0.1,Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,0,67,male,2,own,,little,1169,6,radio/TV,good
1,1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,2,49,male,1,own,little,,2096,12,education,good
3,3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,4,53,male,2,free,little,little,4870,24,car,bad
5,5,35,male,1,free,,,9055,36,education,good
6,6,53,male,2,own,quite rich,,2835,24,furniture/equipment,good
7,7,35,male,3,rent,little,moderate,6948,36,car,good
8,8,61,male,1,own,rich,,3059,12,radio/TV,good
9,9,28,male,3,own,little,moderate,5234,30,car,bad


In [0]:
#dropping the unnecessary column
df=df.drop(['Unnamed: 0'],axis=1)

In [0]:
#target variable set
y=df['Risk']
df=df.drop(['Risk'],axis=1)

In [0]:
y.head()

0    good
1     bad
2    good
3    good
4     bad
Name: Risk, dtype: object

In [0]:
#peaking into the data
df.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,67,male,2,own,,little,1169,6,radio/TV
1,22,female,2,own,little,moderate,5951,48,radio/TV
2,49,male,1,own,little,,2096,12,education
3,45,male,2,free,little,little,7882,42,furniture/equipment
4,53,male,2,free,little,little,4870,24,car


In [0]:
#bucketizing the age variable
for i in range(0,1000):
    if (df.iloc[i,0]<=20):
        df.iloc[i,0]=0
    elif (df.iloc[i,0]<=40 and df.iloc[i,0]>20):
        df.iloc[i,0]=1
    elif (df.iloc[i,0]<=60 and df.iloc[i,0]>40):
        df.iloc[i,0]=2
    else:
        df.iloc[i,0]=3

In [0]:
#making NAN an unknown category
import numpy as np 
df = df.replace(np.nan, 'unkwn')


In [0]:
df.head(40)


Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,3,male,2,own,unkwn,little,1169,6,radio/TV
1,1,female,2,own,little,moderate,5951,48,radio/TV
2,2,male,1,own,little,unkwn,2096,12,education
3,2,male,2,free,little,little,7882,42,furniture/equipment
4,2,male,2,free,little,little,4870,24,car
5,1,male,1,free,unkwn,unkwn,9055,36,education
6,2,male,2,own,quite rich,unkwn,2835,24,furniture/equipment
7,1,male,3,rent,little,moderate,6948,36,car
8,3,male,1,own,rich,unkwn,3059,12,radio/TV
9,1,male,3,own,little,moderate,5234,30,car


In [0]:
#Encoding our data for processing in our model

from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labenc=LabelEncoder()
df['Sex']=labenc.fit_transform(df['Sex'])
df['Housing']=labenc.fit_transform(df['Housing'])
df['Saving accounts']=df['Saving accounts'].astype(str)
df['Saving accounts']=labenc.fit_transform(df['Saving accounts'])
df['Checking account']=df['Checking account'].astype(str)
df['Checking account']=labenc.fit_transform(df['Checking account'])
df['Purpose']=labenc.fit_transform(df['Purpose'])
y=labenc.fit_transform(y)

xout=df.iloc[:,0:9].values
onehotencoder=OneHotEncoder(categorical_features=[2,3,5,8])
xout=onehotencoder.fit_transform(xout).toarray()



In [0]:
xout

array([[0.000e+00, 0.000e+00, 1.000e+00, ..., 4.000e+00, 1.169e+03,
        6.000e+00],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 0.000e+00, 5.951e+03,
        4.800e+01],
       [0.000e+00, 1.000e+00, 0.000e+00, ..., 0.000e+00, 2.096e+03,
        1.200e+01],
       ...,
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 0.000e+00, 8.040e+02,
        1.200e+01],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 0.000e+00, 1.845e+03,
        4.500e+01],
       [0.000e+00, 0.000e+00, 1.000e+00, ..., 1.000e+00, 4.576e+03,
        4.500e+01]])

In [0]:
y[:18]

array([1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1])

In [0]:
#Our data is now in the correct format,checking again
df

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose
0,3,1,2,1,4,0,1169,6,5
1,1,0,2,1,0,1,5951,48,5
2,2,1,1,1,0,3,2096,12,3
3,2,1,2,0,0,0,7882,42,4
4,2,1,2,0,0,0,4870,24,1
5,1,1,1,0,4,3,9055,36,3
6,2,1,2,1,2,3,2835,24,4
7,1,1,3,2,0,1,6948,36,1
8,3,1,1,1,3,3,3059,12,5
9,1,1,3,1,0,1,5234,30,1


In [0]:
x=xout

In [0]:
#splitting the data into training and validation set
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [0]:
#using our first naive bayes model,
from sklearn.naive_bayes import GaussianNB
cl=GaussianNB()
cl.fit(x_train,y_train)
y_pred=cl.predict(x_test)

In [0]:
#Finding the metrices for accuracy and other measures of the model
from sklearn.metrics import accuracy_score,confusion_matrix,f1_score
from sklearn.model_selection import cross_val_score
cv_acc=cross_val_score(estimator=cl,X=x_train,y=y_train,cv=10)
cv_acc=cv_acc.mean()
print('Cross validation accuracy is {} percentage'.format(cv_acc*100))
print('--------------------------------------------------------------')
training_accs=cl.score(x_train,y_train)
print('Training accuracy is {} percentage'.format(training_accs*100))
print('--------------------------------------------------------------')
validation_accs=cl.score(x_test,y_test)
print('validation accuracy is {} percentage'.format(validation_accs*100))
print('--------------------------------------------------------------')
test_accs=accuracy_score(y_test,y_pred)
print('Testing accuracy is {} percentage'.format(test_accs*100))
print('--------------------------------------------------------------')
f1=f1_score(y_test, y_pred)
print('F1 score is <<{}>>'.format(f1))
print('--------------------------------------------------------------')
cm=confusion_matrix(y_test,y_pred)
print('confusion matrix is ->>',cm)
print('--------------------------------------------------------------')

Cross validation accuracy is 70.12466791686201 percentage
--------------------------------------------------------------
Training accuracy is 72.125 percentage
--------------------------------------------------------------
validation accuracy is 70.0 percentage
--------------------------------------------------------------
Testing accuracy is 70.0 percentage
--------------------------------------------------------------
F1 score is <<0.7916666666666666>>
--------------------------------------------------------------
confusion matrix is ->> [[ 26  32]
 [ 28 114]]
--------------------------------------------------------------


### inference

>Now we have normally used a classifier to solve our problem with an okay accuracy and the model is abel to predict.

>Now we shall bring in Genetic algorithm to improve our model here.

>The library which does this job is TPOT.

>What does it do,well its an automated machine learning tool that optimizes machine learning pipelines using genetic programming. 

**Here is the structure**

![tpot](https://i.imgur.com/eUCHPoL.png)

Yes it does all the job for us.
Tpot provides you with the code fro the best pipeline it found so you can modify it for further optimization.

**Now lets use this in our dataset and find out.**


In [0]:
!pip3 install tpot

Collecting tpot
[?25l  Downloading https://files.pythonhosted.org/packages/c4/e6/a41be0ddb23a411dc78b92f6a90b8129e65856a8248f8f11b2f14d8eeee3/TPOT-0.9.3.tar.gz (888kB)
[K    100% |████████████████████████████████| 890kB 7.0MB/s 
Collecting deap>=1.0 (from tpot)
[?25l  Downloading https://files.pythonhosted.org/packages/af/29/e7f2ecbe02997b16a768baed076f5fc4781d7057cd5d9adf7c94027845ba/deap-1.2.2.tar.gz (936kB)
[K    100% |████████████████████████████████| 942kB 10.6MB/s 
[?25hCollecting update_checker>=0.16 (from tpot)
  Downloading https://files.pythonhosted.org/packages/17/c9/ab11855af164d03be0ff4fddd4c46a5bd44799a9ecc1770e01a669c21168/update_checker-0.16-py2.py3-none-any.whl
Collecting tqdm>=4.11.2 (from tpot)
[?25l  Downloading https://files.pythonhosted.org/packages/93/24/6ab1df969db228aed36a648a8959d1027099ce45fad67532b9673d533318/tqdm-4.23.4-py2.py3-none-any.whl (42kB)
[K    100% |████████████████████████████████| 51kB 18.0MB/s 
[?25hCollecting stopit>=1.1.1 (from tpot)


 ---
 
 >BIG warning this process is going to be really slow, unless you have super fast computers, i tried using my CPU with i3 processor,4 GB ram , to produce 5 generations it took me 8 hours.

>So I used google collaboratory for processing this it took me around 45 minutes for completing 50 generations.
>Its finding best model,also best parameter and the dataset is being mutated, so yes it takes time.

---

In [0]:
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=50, population_size=50, verbosity=2)
tpot.fit(x_train, y_train)


Optimization Progress:   4%|▍         | 100/2550 [00:46<12:56,  3.16pipeline/s]

Generation 1 - Current best internal CV score: 0.7450309094105239


Optimization Progress:   6%|▌         | 150/2550 [01:15<14:43,  2.72pipeline/s]

Generation 2 - Current best internal CV score: 0.7501091351224657


Optimization Progress:   8%|▊         | 200/2550 [01:54<26:45,  1.46pipeline/s]

Generation 3 - Current best internal CV score: 0.7501091351224657


Optimization Progress:  10%|▉         | 250/2550 [02:37<25:07,  1.53pipeline/s]

Generation 4 - Current best internal CV score: 0.7501091351224658


Optimization Progress:  12%|█▏        | 300/2550 [03:48<30:41,  1.22pipeline/s]

Generation 5 - Current best internal CV score: 0.7575547872963788


Optimization Progress:  14%|█▎        | 350/2550 [04:40<29:20,  1.25pipeline/s]

Generation 6 - Current best internal CV score: 0.7575547872963788


Optimization Progress:  16%|█▌        | 400/2550 [06:06<1:01:50,  1.73s/pipeline]

Generation 7 - Current best internal CV score: 0.7575547872963788


Optimization Progress:  18%|█▊        | 450/2550 [07:05<41:37,  1.19s/pipeline]

Generation 8 - Current best internal CV score: 0.7575547872963788


Optimization Progress:  20%|█▉        | 500/2550 [08:15<30:02,  1.14pipeline/s]

Generation 9 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  22%|██▏       | 550/2550 [09:17<40:08,  1.20s/pipeline]

Generation 10 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  24%|██▎       | 600/2550 [10:06<36:14,  1.12s/pipeline]

Generation 11 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  25%|██▌       | 650/2550 [10:59<28:12,  1.12pipeline/s]

Generation 12 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  27%|██▋       | 700/2550 [11:50<31:25,  1.02s/pipeline]

Generation 13 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  29%|██▉       | 750/2550 [12:35<18:40,  1.61pipeline/s]

Generation 14 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  31%|███▏      | 800/2550 [13:17<21:38,  1.35pipeline/s]

Generation 15 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  33%|███▎      | 850/2550 [14:02<23:57,  1.18pipeline/s]

Generation 16 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  35%|███▌      | 900/2550 [14:45<38:34,  1.40s/pipeline]

Generation 17 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  37%|███▋      | 950/2550 [15:40<16:20,  1.63pipeline/s]

Generation 18 - Current best internal CV score: 0.7612814953709129


Optimization Progress:  39%|███▉      | 1000/2550 [16:11<15:01,  1.72pipeline/s]

Generation 19 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  41%|████      | 1050/2550 [16:43<14:01,  1.78pipeline/s]

Generation 20 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  43%|████▎     | 1100/2550 [17:16<10:23,  2.32pipeline/s]

Generation 21 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  45%|████▌     | 1150/2550 [17:49<19:22,  1.20pipeline/s]

Generation 22 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  47%|████▋     | 1200/2550 [18:19<14:06,  1.60pipeline/s]

Generation 23 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  49%|████▉     | 1250/2550 [18:50<12:00,  1.80pipeline/s]

Generation 24 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  51%|█████     | 1300/2550 [19:25<10:51,  1.92pipeline/s]

Generation 25 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  53%|█████▎    | 1350/2550 [19:56<08:55,  2.24pipeline/s]

Generation 26 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  55%|█████▍    | 1400/2550 [20:26<10:26,  1.84pipeline/s]

Generation 27 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  57%|█████▋    | 1450/2550 [20:56<12:59,  1.41pipeline/s]

Generation 28 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  59%|█████▉    | 1500/2550 [21:38<15:02,  1.16pipeline/s]

Generation 29 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  61%|██████    | 1550/2550 [22:11<18:35,  1.12s/pipeline]

Generation 30 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  63%|██████▎   | 1600/2550 [22:45<06:49,  2.32pipeline/s]

Generation 31 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  65%|██████▍   | 1650/2550 [23:26<07:07,  2.11pipeline/s]

Generation 32 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  67%|██████▋   | 1700/2550 [24:16<08:26,  1.68pipeline/s]

Generation 33 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  69%|██████▊   | 1750/2550 [25:02<09:20,  1.43pipeline/s]

Generation 34 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  71%|███████   | 1800/2550 [25:41<10:32,  1.19pipeline/s]

Generation 35 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  73%|███████▎  | 1850/2550 [26:13<06:29,  1.80pipeline/s]

Generation 36 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  75%|███████▍  | 1900/2550 [26:56<23:12,  2.14s/pipeline]

Generation 37 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  76%|███████▋  | 1950/2550 [27:28<07:43,  1.29pipeline/s]

Generation 38 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  78%|███████▊  | 2000/2550 [27:58<03:46,  2.43pipeline/s]

Generation 39 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  80%|████████  | 2050/2550 [28:32<04:01,  2.07pipeline/s]

Generation 40 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  82%|████████▏ | 2100/2550 [29:05<04:14,  1.77pipeline/s]

Generation 41 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  84%|████████▍ | 2150/2550 [29:38<03:10,  2.10pipeline/s]

Generation 42 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  86%|████████▋ | 2200/2550 [30:03<02:43,  2.15pipeline/s]

Generation 43 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  88%|████████▊ | 2250/2550 [30:48<03:41,  1.35pipeline/s]

Generation 44 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  90%|█████████ | 2300/2550 [31:19<01:57,  2.13pipeline/s]

Generation 45 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  92%|█████████▏| 2350/2550 [31:47<01:42,  1.96pipeline/s]

Generation 46 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  94%|█████████▍| 2400/2550 [32:24<01:42,  1.47pipeline/s]

Generation 47 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  96%|█████████▌| 2450/2550 [33:00<01:25,  1.17pipeline/s]

Generation 48 - Current best internal CV score: 0.7613049826165084


Optimization Progress:  98%|█████████▊| 2500/2550 [33:39<00:30,  1.65pipeline/s]

Generation 49 - Current best internal CV score: 0.7613049826165084




Generation 50 - Current best internal CV score: 0.7613049826165084

Best pipeline: GradientBoostingClassifier(input_matrix, learning_rate=0.1, max_depth=2, max_features=0.6500000000000001, min_samples_leaf=4, min_samples_split=19, n_estimators=100, subsample=0.45)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT....3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}}}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=50, max_eval_time_mins=5,
        max_time_mins=None, memory=None, mutation_rate=0.9, n_jobs=1,
        offspring_size=50, periodic_checkpoint_folder=None,
        population_size=50, random_state=None, scoring=None, subsample=1.0,
        verbosity=2, warm_start=False)

>*Vola! , it gave me the best model to be used, the best parameters to use and  I didn't have to do anything as it automates model selection and parameter optimization using genetic algorithm back engine.*

>*Lets test it if it will improve our model*

>*Defining our new model*

In [0]:
from sklearn.ensemble import GradientBoostingClassifier
gbmodel=GradientBoostingClassifier(learning_rate=0.1, max_depth=2, max_features=0.6500000000000001, min_samples_leaf=4, min_samples_split=19, n_estimators=100, subsample=0.45)
gbmodel.fit(x_train,y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=2,
              max_features=0.6500000000000001, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=4, min_samples_split=19,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=0.45, verbose=0,
              warm_start=False)

In [0]:
y_predgb=gbmodel.predict(x_test)



In [0]:
from sklearn.metrics import accuracy_score,confusion_matrix,f1_score
from sklearn.model_selection import cross_val_score
cv_acc=cross_val_score(estimator=gbmodel,X=x_train,y=y_train,cv=10)
cv_accgb=cv_acc.mean()
print('For Gradient Boosting-Cross validation accuracy is {} percentage'.format(cv_accgb*100))
print('--------------------------------------------------------------')
training_accsgb=gbmodel.score(x_train,y_train)
print('For Gradient Boosting-Training accuracy is {} percentage'.format(training_accsgb*100))
print('--------------------------------------------------------------')
validation_accsgb=gbmodel.score(x_test,y_test)
print('For Gradient Boosting-validation accuracy is {} percentage'.format(validation_accsgb*100))
print('--------------------------------------------------------------')
test_accsgb=accuracy_score(y_test,y_predgb)
print('For Gradient Boosting-Testing accuracy is {} percentage'.format(test_accsgb*100))
print('--------------------------------------------------------------')
f1gb=f1_score(y_test, y_predgb)
print('For Gradient Boosting-F1 score is <<{}>>'.format(f1gb))
print('--------------------------------------------------------------')
cmgb=confusion_matrix(y_test,y_pred)
print('For Gradient Boosting-confusion matrix is ->>',cmgb)
print('--------------------------------------------------------------')

For Gradient Boosting-Cross validation accuracy is 73.62169870292232 percentage
--------------------------------------------------------------
For Gradient Boosting-Training accuracy is 79.5 percentage
--------------------------------------------------------------
For Gradient Boosting-validation accuracy is 76.0 percentage
--------------------------------------------------------------
For Gradient Boosting-Testing accuracy is 76.0 percentage
--------------------------------------------------------------
For Gradient Boosting-F1 score is <<0.8410596026490065>>
--------------------------------------------------------------
For Gradient Boosting-confusion matrix is ->> [[ 26  32]
 [ 28 114]]
--------------------------------------------------------------


>***As we can see it has improved model accuracy and our predictive power represented by F1 score.***


# UNDERSAMPLING
---

We will do the same procedure with undersampling the data to make it balance as the target variable has 1 and 0 in the ratio 70% to 30%.

To find if there is any difference in the result in our model.


Lets do undersampling and find out 

In [0]:
from google.colab import files
uploaded = files.upload()

Saving german_credit_data.csv to german_credit_data (2).csv


In [0]:
import pandas as pd
import io

df2 = pd.read_csv(io.StringIO(uploaded['german_credit_data.csv'].decode('utf-8')))


In [0]:
df2=df2.drop(['Unnamed: 0'],axis=1)
y=df2['Risk']
df2=df2.drop(['Risk'],axis=1)

for i in range(0,1000):
    if (df2.iloc[i,0]<=20):
        df2.iloc[i,0]=0
    elif (df2.iloc[i,0]<=40 and df2.iloc[i,0]>20):
        df2.iloc[i,0]=1
    elif (df2.iloc[i,0]<=60 and df2.iloc[i,0]>40):
        df2.iloc[i,0]=2
    else:
        df2.iloc[i,0]=3


import numpy as np 
df2 = df2.replace(np.nan, 'unkwn')



In [0]:
!pip3 install imblearn

Collecting imblearn
  Downloading https://files.pythonhosted.org/packages/81/a7/4179e6ebfd654bd0eac0b9c06125b8b4c96a9d0a8ff9e9507eb2a26d2d7e/imblearn-0.0-py2.py3-none-any.whl
Collecting imbalanced-learn (from imblearn)
[?25l  Downloading https://files.pythonhosted.org/packages/80/a4/900463a3c0af082aed9c5a43f4ec317a9469710c5ef80496c9abc26ed0ca/imbalanced_learn-0.3.3-py3-none-any.whl (144kB)
[K    100% |████████████████████████████████| 153kB 4.8MB/s 
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.3.3 imblearn-0.0


In [0]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labenc=LabelEncoder()
df2['Sex']=labenc.fit_transform(df2['Sex'])
df2['Housing']=labenc.fit_transform(df2['Housing'])
df2['Saving accounts']=df2['Saving accounts'].astype(str)
df2['Saving accounts']=labenc.fit_transform(df2['Saving accounts'])
df2['Checking account']=df2['Checking account'].astype(str)
df2['Checking account']=labenc.fit_transform(df2['Checking account'])
df2['Purpose']=labenc.fit_transform(df2['Purpose'])
y=labenc.fit_transform(y)

xout2=df2.iloc[:,0:9].values
onehotencoder=OneHotEncoder(categorical_features=[2,3,5,8])
xout2=onehotencoder.fit_transform(xout2).toarray()

x=xout2




In [0]:
#	FOR UNDER SAMPLING 
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
x_res, y_res = rus.fit_sample(x, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))

Resampled dataset shape Counter({0: 300, 1: 300})


In [0]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [0]:
tpot2 = TPOTClassifier(generations=50, population_size=50, verbosity=2)
tpot2.fit(x_train, y_train)


Optimization Progress:   4%|▍         | 100/2550 [00:40<30:07,  1.36pipeline/s]

Generation 1 - Current best internal CV score: 0.750062551271534


Optimization Progress:   6%|▌         | 150/2550 [01:14<19:52,  2.01pipeline/s]

Generation 2 - Current best internal CV score: 0.7538439001523497


Optimization Progress:   8%|▊         | 200/2550 [02:11<28:14,  1.39pipeline/s]

Generation 3 - Current best internal CV score: 0.7538439001523497


Optimization Progress:  10%|▉         | 250/2550 [03:39<56:00,  1.46s/pipeline]

Generation 4 - Current best internal CV score: 0.7538439001523497


Optimization Progress:  12%|█▏        | 300/2550 [04:40<47:22,  1.26s/pipeline]

Generation 5 - Current best internal CV score: 0.7538439001523497


Optimization Progress:  14%|█▎        | 350/2550 [06:19<44:29,  1.21s/pipeline]

Generation 6 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  16%|█▌        | 400/2550 [07:49<53:44,  1.50s/pipeline]

Generation 7 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  18%|█▊        | 450/2550 [09:11<1:01:17,  1.75s/pipeline]

Generation 8 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  20%|█▉        | 500/2550 [10:33<37:16,  1.09s/pipeline]

Generation 9 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  22%|██▏       | 550/2550 [11:58<45:46,  1.37s/pipeline]

Generation 10 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  24%|██▎       | 600/2550 [13:18<43:52,  1.35s/pipeline]

Generation 11 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  25%|██▌       | 650/2550 [14:59<45:38,  1.44s/pipeline]

Generation 12 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  27%|██▋       | 700/2550 [16:58<41:15,  1.34s/pipeline]

Generation 13 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  29%|██▉       | 750/2550 [18:26<35:56,  1.20s/pipeline]

Generation 14 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  31%|███▏      | 800/2550 [20:12<36:27,  1.25s/pipeline]

Generation 15 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  33%|███▎      | 850/2550 [22:11<1:05:23,  2.31s/pipeline]

Generation 16 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  35%|███▌      | 900/2550 [23:34<45:19,  1.65s/pipeline]

Generation 17 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  37%|███▋      | 950/2550 [24:58<41:40,  1.56s/pipeline]

Generation 18 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  39%|███▉      | 1000/2550 [26:17<38:18,  1.48s/pipeline]

Generation 19 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  41%|████      | 1050/2550 [28:02<37:27,  1.50s/pipeline]

Generation 20 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  43%|████▎     | 1100/2550 [29:31<38:25,  1.59s/pipeline]

Generation 21 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  45%|████▌     | 1150/2550 [30:48<25:28,  1.09s/pipeline]

Generation 22 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  47%|████▋     | 1200/2550 [32:24<36:37,  1.63s/pipeline]

Generation 23 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  49%|████▉     | 1250/2550 [33:55<27:22,  1.26s/pipeline]

Generation 24 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  51%|█████     | 1300/2550 [36:05<38:20,  1.84s/pipeline]

Generation 25 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  53%|█████▎    | 1350/2550 [38:00<2:26:04,  7.30s/pipeline]

Generation 26 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  55%|█████▍    | 1400/2550 [39:28<28:00,  1.46s/pipeline]

Generation 27 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  57%|█████▋    | 1450/2550 [41:11<34:25,  1.88s/pipeline]

Generation 28 - Current best internal CV score: 0.7650160650806672


Optimization Progress:  59%|█████▉    | 1500/2550 [42:43<34:34,  1.98s/pipeline]

Generation 29 - Current best internal CV score: 0.7650317883511075


Optimization Progress:  61%|██████    | 1550/2550 [43:55<18:19,  1.10s/pipeline]

Generation 30 - Current best internal CV score: 0.7650317883511075


Optimization Progress:  63%|██████▎   | 1600/2550 [45:48<14:54,  1.06pipeline/s]

Generation 31 - Current best internal CV score: 0.7650317883511075


Optimization Progress:  65%|██████▍   | 1650/2550 [47:33<43:25,  2.90s/pipeline]

Generation 32 - Current best internal CV score: 0.7650317883511075


Optimization Progress:  67%|██████▋   | 1700/2550 [49:24<51:45,  3.65s/pipeline]

Generation 33 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  69%|██████▊   | 1750/2550 [50:58<24:57,  1.87s/pipeline]

Generation 34 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  71%|███████   | 1800/2550 [53:09<22:14,  1.78s/pipeline]

Generation 35 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  73%|███████▎  | 1850/2550 [54:58<17:07,  1.47s/pipeline]

Generation 36 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  75%|███████▍  | 1900/2550 [56:32<18:19,  1.69s/pipeline]

Generation 37 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  76%|███████▋  | 1950/2550 [58:05<14:35,  1.46s/pipeline]

Generation 38 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  78%|███████▊  | 2000/2550 [1:02:55<13:39,  1.49s/pipeline]

Generation 39 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  80%|████████  | 2050/2550 [1:04:36<20:22,  2.45s/pipeline]

Generation 40 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  82%|████████▏ | 2100/2550 [1:06:27<11:22,  1.52s/pipeline]

Generation 41 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  84%|████████▍ | 2150/2550 [1:08:17<16:52,  2.53s/pipeline]

Generation 42 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  86%|████████▋ | 2200/2550 [1:10:17<17:53,  3.07s/pipeline]

Generation 43 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  88%|████████▊ | 2250/2550 [1:12:44<07:54,  1.58s/pipeline]

Generation 44 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  90%|█████████ | 2300/2550 [1:14:36<07:16,  1.75s/pipeline]

Generation 45 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  92%|█████████▏| 2350/2550 [1:16:15<04:17,  1.29s/pipeline]

Generation 46 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  94%|█████████▍| 2400/2550 [1:18:05<05:26,  2.18s/pipeline]

Generation 47 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  96%|█████████▌| 2450/2550 [1:19:42<02:31,  1.51s/pipeline]

Generation 48 - Current best internal CV score: 0.7674772452048908


Optimization Progress:  98%|█████████▊| 2500/2550 [1:21:57<01:23,  1.66s/pipeline]

Generation 49 - Current best internal CV score: 0.7687428708152664




Generation 50 - Current best internal CV score: 0.7687428708152664

Best pipeline: RandomForestClassifier(Normalizer(SelectPercentile(RobustScaler(input_matrix), percentile=64), norm=max), bootstrap=True, criterion=entropy, max_features=0.25, min_samples_leaf=1, min_samples_split=6, n_estimators=100)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT....3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}}}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=50, max_eval_time_mins=5,
        max_time_mins=None, memory=None, mutation_rate=0.9, n_jobs=1,
        offspring_size=50, periodic_checkpoint_folder=None,
        population_size=50, random_state=None, scoring=None, subsample=1.0,
        verbosity=2, warm_start=False)

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf1=RandomForestClassifier(bootstrap=True, criterion='entropy', max_features=0.25, min_samples_leaf=1, min_samples_split=6, n_estimators=100)
rf1.fit(x_train,y_train)
y_predrf=rf1.predict(x_test)

In [0]:

cv_acc=cross_val_score(estimator=rf1,X=x_train,y=y_train,cv=10)
cv_accrf=cv_acc.mean()
print('For Random Forest-Cross validation accuracy is {} percentage'.format(cv_accrf*100))
print('--------------------------------------------------------------')
training_accsrf=rf1.score(x_train,y_train)
print('For Random Forest-Training accuracy is {} percentage'.format(training_accsrf*100))
print('--------------------------------------------------------------')
validation_accsrf=rf1.score(x_test,y_test)
print('For Random Forest-validation accuracy is {} percentage'.format(validation_accsrf*100))
print('--------------------------------------------------------------')
test_accsrf=accuracy_score(y_test,y_predgb)
print('For Random Forest-Testing accuracy is {} percentage'.format(test_accsrf*100))
print('--------------------------------------------------------------')
f1rf=f1_score(y_test, y_predgb)
print('For Random Forest-F1 score is <<{}>>'.format(f1rf))
print('--------------------------------------------------------------')
cmrf=confusion_matrix(y_test,y_pred)
print('For Random Forest-confusion matrix is ->>',cmrf)
print('--------------------------------------------------------------')

For Random Forest-Cross validation accuracy is 74.7374003750586 percentage
--------------------------------------------------------------
For Random Forest-Training accuracy is 94.375 percentage
--------------------------------------------------------------
For Random Forest-validation accuracy is 74.0 percentage
--------------------------------------------------------------
For Random Forest-Testing accuracy is 76.0 percentage
--------------------------------------------------------------
For Random Forest-F1 score is <<0.8410596026490065>>
--------------------------------------------------------------
For Random Forest-confusion matrix is ->> [[ 26  32]
 [ 28 114]]
--------------------------------------------------------------


###Inference After Undersampling

---

>*Well undersampling does not change our results to much degree, so we can take the above model with normal data.*


>*So, We have understood basic concept of genetic algorithm, How we can use it in our data science problem for automating certain process. Genetic algorithm is used in wide range of optimizing problem. There are many factors where we can experiment. Especially if we write a manual function for solving the same problem, the mutation probability the fitness function, everything can be set and changed to get an optimal solution.*
