<a href="https://colab.research.google.com/github/Zahramashayekhpour/organic-and-nonorganic-fruit-classification/blob/master/Genetic_algorithm_for_feature_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Genetic Algorithm
_________
#### The Genetic Algorithm (GA) is an evolutionary algorithm (EA) inspired by Charles Darwin's theory of natural selection, often summarized as "survival of the fittest". In natural selection, the fittest individuals are selected to produce offspring, and the parents' characteristics are passed on to the offspring through cross-over and mutation, improving their chances of survival. Genetic algorithms are randomized search algorithms that look for high-quality solutions to optimization problems by imitating this process with three operators: selection, cross-over, and mutation.

### Terminology for Genetic Algorithm
![](https://miro.medium.com/max/695/1*vIrsxg12DSltpdWoO561yA.png)
#### **Population** is the set of candidate solutions from which the stochastic search starts. The GA iterates over multiple generations until it finds an acceptable solution. The first generation is generated randomly.
#### **Chromosome** represents one candidate solution in the population. A chromosome is also referred to as a genotype, and it is composed of genes that hold the values of the variables being optimized.
#### **Phenotype** is the decoded parameter list for a genotype. A mapping is applied to the genotype to convert it to a phenotype.
#### The **Fitness function**, or objective function, evaluates each individual solution (phenotype) in every generation to identify the fittest members.
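For feature selection, the genotype-to-phenotype mapping can be pictured as a boolean mask over the feature columns. The sketch below is only an illustration: the feature names and the `decode` helper are hypothetical and are not part of the notebook.

```python
import numpy as np

# Hypothetical feature names, used only for illustration.
feature_names = np.array(["contrast", "energy", "entropy", "homogeneity"])

# Genotype: one boolean gene per feature (True = keep the feature).
chromosome = np.array([True, False, True, True])

def decode(chromosome, names):
    """Map a genotype (boolean mask) to its phenotype (selected feature names)."""
    return names[chromosome].tolist()

phenotype = decode(chromosome, feature_names)
print(phenotype)  # ['contrast', 'entropy', 'homogeneity']
```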
__________
### Different Genetic Operators
#### **Selection** is the process of choosing the fittest solutions from a population to act as parents of the next generation, so that the next generation inherits their strong features. Selection can be performed with **Roulette Wheel Selection** or **Ranked Selection**, both based on the fitness value.
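The two selection schemes named above could be sketched as follows. This is a minimal illustration that assumes non-negative fitness values with a positive sum; the function names are hypothetical. Note that the notebook's own `selection()` function uses neither scheme: it simply keeps the top `n_parents` chromosomes.

```python
import numpy as np

def roulette_wheel_selection(fitness, n_parents, rng):
    """Sample parent indices with probability proportional to fitness."""
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()
    return rng.choice(len(fitness), size=n_parents, replace=True, p=probs)

def rank_selection(fitness, n_parents, rng):
    """Sample parent indices with probability proportional to rank (worst = 1)."""
    fitness = np.asarray(fitness, dtype=float)
    ranks = np.empty(len(fitness))
    ranks[np.argsort(fitness)] = np.arange(1, len(fitness) + 1)
    probs = ranks / ranks.sum()
    return rng.choice(len(fitness), size=n_parents, replace=True, p=probs)

rng = np.random.default_rng(0)
fitness = [0.84, 0.80, 0.75, 0.60]   # illustrative fitness values
roulette_parents = roulette_wheel_selection(fitness, 6, rng)
rank_parents = rank_selection(fitness, 6, rng)
print(roulette_parents, rank_parents)
```

Rank selection depends only on the ordering of the fitness values, so it is less sensitive to one chromosome having a much larger fitness than the rest.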

#### **Cross-over**, or recombination, happens when genes from two fit parents are exchanged to form a new genotype (solution). Cross-over can be one-point or multi-point, depending on how many segments of the parents' genes are exchanged.
![image.png](attachment:e240e0f3-60da-44b4-81f7-16bb1e506ff5.png)
#### Here **One-point Cross-over** is used.
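A minimal sketch of one-point cross-over on boolean chromosomes is shown below; the function name and the example parents are illustrative. The notebook's `crossover()` function is a special case: it fixes the cut at the midpoint and keeps one child per pair of parents.

```python
import numpy as np

def one_point_crossover(parent_a, parent_b, point):
    """Return two children that swap the tails of the parents after `point`."""
    child_1 = np.concatenate((parent_a[:point], parent_b[point:]))
    child_2 = np.concatenate((parent_b[:point], parent_a[point:]))
    return child_1, child_2

parent_a = np.array([1, 1, 1, 1, 1, 1], dtype=bool)
parent_b = np.array([0, 0, 0, 0, 0, 0], dtype=bool)
child_1, child_2 = one_point_crossover(parent_a, parent_b, point=len(parent_a) // 2)
print(child_1.astype(int))  # [1 1 1 0 0 0]
print(child_2.astype(int))  # [0 0 0 1 1 1]
```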
#### After a new population is created through selection and cross-over, it is randomly modified through **mutation**. Mutation randomly changes genes in a genotype to promote diversity in the population, which helps the search find better solutions.
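One common form of mutation for binary chromosomes is bit-flip mutation, sketched below under the assumption that every gene is flipped independently with probability `mutation_rate`. The function name is hypothetical. The notebook's `mutation()` function is a variant of this idea: it flips a fixed number of randomly drawn positions per chromosome.

```python
import numpy as np

def bit_flip_mutation(chromosome, mutation_rate, rng):
    """Flip each gene independently with probability `mutation_rate`; return a copy."""
    flip = rng.random(chromosome.shape[0]) < mutation_rate
    return np.logical_xor(chromosome, flip)

rng = np.random.default_rng(42)
chromosome = np.ones(10, dtype=bool)
mutant = bit_flip_mutation(chromosome, 0.2, rng)
print(mutant.astype(int))
```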
![](https://miro.medium.com/max/385/1*bk6zF_rpgGi8IcPIY6fCWg.png)
______
### Usage of Genetic Algorithm in Artificial Intelligence
#### A Genetic Algorithm is used for search and optimization: it iterates toward the best solution among many candidate solutions.
#### 1. A Genetic Algorithm can find an appropriate set of hyperparameters and their values for a deep learning model to improve its performance.
#### 2. A Genetic Algorithm can also be used to determine the best subset of features to include in a machine learning model for predicting the target variable.
____

### Working of Genetic Algorithm
![](https://miro.medium.com/max/598/1*TZ840m0DvghL80GodVGLeQ.png)
____
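The loop in the flowchart above can be sketched on a toy problem. The example below is not the notebook's feature-selection code: it assumes a fitness equal to the number of `True` genes (the "OneMax" problem), and every name and default value in it is illustrative.

```python
import numpy as np

def run_toy_ga(n_genes=20, pop_size=30, n_parents=10, mutation_rate=0.05,
               n_gen=15, seed=0):
    """Maximise the number of True genes and return the best fitness per generation."""
    rng = np.random.default_rng(seed)
    population = rng.random((pop_size, n_genes)) < 0.5    # random first generation
    history = []
    for _ in range(n_gen):
        fitness = population.sum(axis=1)                   # fitness = count of True genes
        order = np.argsort(fitness)[::-1]                  # best first
        history.append(int(fitness[order[0]]))
        parents = population[order[:n_parents]]            # keep the top n_parents
        children = []
        while len(children) < pop_size - n_parents:
            a, b = parents[rng.choice(n_parents, size=2, replace=False)]
            point = rng.integers(1, n_genes)               # one-point cross-over
            child = np.concatenate((a[:point], b[point:]))
            flip = rng.random(n_genes) < mutation_rate     # bit-flip mutation
            children.append(np.logical_xor(child, flip))
        # Parents survive unchanged, so the best fitness never decreases.
        population = np.vstack((parents, np.array(children)))
    return history

history = run_toy_ga()
print(history)
```

Because the selected parents are copied into the next generation without mutation, the best fitness per generation is non-decreasing in this sketch.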

In [None]:
from google.colab import drive   ## allow Colab to connect your Drive
root = '/content/gdrive/'
drive.mount( root )

### Importing the required libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from random import randint
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
def split(df,label):
    X_tr, X_te, Y_tr, Y_te = train_test_split(df, label, test_size=0.25, random_state=42)
    return X_tr, X_te, Y_tr, Y_te

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

import sklearn as sk
from sklearn.neural_network import MLPClassifier



classifiers = ['LinearDiscriminantAnalysis',   # names must stay in the same order as `models` below
               'LinearSVM', 'RadialSVM',
                'RandomForest',
                'MLP' ]

models = [LinearDiscriminantAnalysis(),
          svm.SVC(kernel='linear'),
          svm.SVC(kernel='rbf'),
          RandomForestClassifier(n_estimators=200, random_state=0),
          MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(30, 10), random_state=1)
]



def acc_score(df, label, k=5):
    """Return the k-fold CV accuracy, precision, recall and ROC AUC (in %) of every classifier."""
    # scoring name passed to cross_val_score -> column name in the result table
    scorings = {"accuracy": "Accuracy", "precision": "precision",
                "recall": "recall", "roc_auc": "auc"}
    Score = pd.DataFrame({"Classifier": classifiers})
    for scoring, column in scorings.items():
        # Evaluate on the arguments (df, label), not on the globals x_data / y_data.
        Score[column] = [
            round(np.mean(cross_val_score(model, df, label, cv=k, scoring=scoring)) * 100, 2)
            for model in models
        ]
    # Scores stay numeric so that the sort is numeric rather than alphabetical.
    Score.sort_values(by="Accuracy", ascending=False, inplace=True)
    Score.reset_index(drop=True, inplace=True)
    return Score

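def plot(score, x, y, c="b"):
    """Plot the best score of each generation; x and y are the y-axis limits."""
    gen = list(range(1, len(score) + 1))  # one tick per generation instead of a fixed 1..5
    plt.figure(figsize=(6, 4))
    ax = sns.pointplot(x=gen, y=score, color=c)
    ax.set(xlabel="Generation", ylabel="Accuracy")
    ax.set(ylim=(x, y))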

In [None]:
import numpy as np
from random import randint
from sklearn.model_selection import cross_val_score

def initilization_of_population(size, n_feat):
    population = []
    for i in range(size):
        chromosome = np.ones(n_feat, dtype=bool)  # Updated from np.bool
        chromosome[:int(0.3 * n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population
k=5
def fitness_score(population):
    """Score every chromosome by k-fold CV accuracy (%) and return both lists sorted best first."""
    scores = []
    for chromosome in population:
        # Relies on the globals logmodel, x_data, y_data and k defined elsewhere in the notebook.
        cv_accuracy = cross_val_score(logmodel, x_data.iloc[:, chromosome], y_data, cv=k, scoring="accuracy")
        # Keep the score numeric: sorting formatted strings would order them alphabetically.
        scores.append(round(np.mean(cv_accuracy) * 100, 2))
    scores, population = np.array(scores), np.array(population)
    inds = np.argsort(scores)[::-1]  # indices from best to worst
    return scores[inds].tolist(), list(population[inds, :])


def selection(pop_after_fit,n_parents):
    population_nextgen = []
    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])
    return population_nextgen


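def crossover(pop_after_sel):
    """Append one child per consecutive parent pair: first half of one parent, second half of the next."""
    pop_nextgen = list(pop_after_sel)  # copy, so the caller's list is not modified
    # Stop at len - 1 so an odd-sized parent list cannot raise an IndexError.
    for i in range(0, len(pop_after_sel) - 1, 2):
        parent_1, parent_2 = pop_after_sel[i], pop_after_sel[i + 1]
        half = len(parent_1) // 2
        child = np.concatenate((parent_1[:half], parent_2[half:]))
        pop_nextgen.append(child)
    return pop_nextgen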


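def mutation(pop_after_cross, mutation_rate, n_feat):
    """Flip int(mutation_rate * n_feat) randomly chosen genes in a copy of every chromosome."""
    mutation_range = int(mutation_rate * n_feat)
    pop_next_gen = []
    for chromosome in pop_after_cross:
        # Work on a copy: mutating in place would also alter the chromosomes already
        # ranked by fitness_score(), including the one recorded as the generation's best.
        chromo = chromosome.copy()
        for _ in range(mutation_range):
            # Positions are drawn with replacement, so a gene hit twice flips back.
            pos = randint(0, n_feat - 1)
            chromo[pos] = not chromo[pos]
        pop_next_gen.append(chromo)
    return pop_next_gen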
#def generations(df,label,size,n_feat,n_parents,mutation_rate,n_gen,X_train,
                                   #X_test, Y_train, Y_test):
def generations(df,label,size,n_feat,n_parents,mutation_rate,n_gen):
    best_chromo= []
    best_score= []
    population_nextgen=initilization_of_population(size,n_feat)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print('Best score in generation',i+1,':',scores[:1])  #2
        pop_after_sel = selection(pop_after_fit,n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross,mutation_rate,n_feat)
        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])
    return best_chromo,best_score

____
### Function Description
#### 1. split():
Splits the dataset into training and test set.
#### 2. acc_score():
Returns the cross-validated accuracy, precision, recall, and AUC of all the classifiers.
#### 3. plot():
For plotting the results.
_____
### Function Description for Genetic Algorithm
#### 1. initilization_of_population():
To initialize a random population.
#### 2. fitness_score():
Returns the chromosomes sorted from best to worst, along with their scores.
#### 3. selection():
Selects the best `n_parents` chromosomes as parents.
#### 4. crossover():
Combines the first half of the first parent with the second half of the second parent.
#### 5. mutation():
Randomly flips selected bits of every chromosome that comes out of the crossover step.
#### 6. generations():
Executes all the above functions for the specified number of generations.
____
### Plan of action:

* Looking at dataset (includes a little preprocessing)
* Checking Accuracy (comparing accuracies with the new dataset)
* Visualization (Plotting the graphs)
____

## Implementation of Genetic Algorithm for Feature Selection
________
#### First, we run a function to initialize a random population.
#### The randomized population is then run through the fitness function, which ranks the chromosomes by accuracy, best first.
#### The top chromosomes are selected as parents; how many is set by the `n_parents` parameter.
#### The selected parents are then put through the crossover and mutation functions, in that order.
#### Crossover combines genes from two parents: the child takes the first half of the first parent and the second half of the second parent.
#### Mutation randomly flips selected bits of each chromosome produced by the crossover step.
#### A new generation is thus created by selecting the fittest parents from the previous generation and applying cross-over and mutation.
#### This process is repeated for `n_gen` generations.
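Once the loop has finished, the best chromosome can be turned back into a list of feature names and a reduced feature matrix. The sketch below uses small hypothetical stand-ins (`best_chromo`, `best_score`, `x_demo`) with illustrative values rather than the real outputs of `generations()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two lists returned by generations();
# the masks and scores are illustrative, not real results.
best_chromo = [np.array([True, False, True, False]),
               np.array([True, True, False, False])]
best_score = [0.80, 0.75]

# Hypothetical feature table with the same column order as the chromosomes.
x_demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                      columns=["f0", "f1", "f2", "f3"])

best_gen = int(np.argmax(best_score))        # generation with the highest score
mask = best_chromo[best_gen]
selected = x_demo.columns[mask].tolist()     # names of the kept features
x_selected = x_demo.loc[:, mask]             # reduced feature matrix

print(best_gen + 1, selected)  # 1 ['f0', 'f2']
```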
______

### 1. Looking at dataset

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/thesis-article-2023/classic/features_DF/mushroom-RGB-TOTAL(texture).csv')
df = df.dropna(axis='columns')

## remove class and images name column from dataframe to achieve only feature matrix
y_data=df['class']  #label
x_data=df.loc[:, df.columns != 'images_name']
x_data=x_data.loc[:, x_data.columns != 'class']   #feature dataframe
x_data = x_data.loc[:, ~x_data.columns.str.contains('^Unnamed')]

In [None]:

#x_data.drop(['lbp_energy','lbp_entropy'], axis=1, inplace=True)
#x_data

In [None]:
display(x_data.head())
print("All the features in this dataset have continuous values")

### 2. Checking Accuracy

In [62]:
score1 = acc_score(x_data,y_data)
score1

| | Classifier | Accuracy | precision | recall | auc |
|---|---|---|---|---|---|
| 0 | MLP | 86.04 | 88.82 | 82.62 | 93.09 |
| 1 | LinearDiscriminantAnalysis | 83.62 | 87.28 | 78.56 | 92.17 |
| 2 | RandomForest | 82.55 | 84.65 | 79.02 | 91.01 |
| 3 | RadialSVM | 80.37 | 81.61 | 77.28 | 88.23 |
| 4 | LinearSVM | 75.13 | 81.54 | 63.45 | 81.92 |


In [63]:
#score1.to_csv('/content/gdrive/MyDrive/thesis-article-2023/classic/evaluation result/lda-svm-rf-mlp/RGB_mushroom.csv')

In [64]:
n_feat=x_data.shape[1]

#### Choosing the best classifier for further calculations

In [None]:
%%time
logmodel = RandomForestClassifier(n_estimators=200, random_state=0)
X_train,X_test, Y_train, Y_test = split(x_data,y_data)
#chromo_df_bc,score_bc=generations(x_data,y_data,size=80,n_feat=x_data.shape[1],n_parents=64,mutation_rate=0.20,n_gen=10,
                        # X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)
chromo_df_bc,score_bc=generations(x_data,y_data,size=80,n_feat=x_data.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5)

Best score in generation 1 : ['84.29']
Best score in generation 2 : ['84.08']


In [None]:
%%time
logmodel = LinearDiscriminantAnalysis()
X_train,X_test, Y_train, Y_test = split(x_data,y_data)
chromo_df_bc,score_bc=generations(x_data,y_data,size=80,n_feat=x_data.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5)
                         #X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)

In [None]:
%%time
logmodel = svm.SVC(kernel='rbf')  # must be named logmodel: fitness_score() reads that global
X_train,X_test, Y_train, Y_test = split(x_data,y_data)
chromo_df_bc,score_bc=generations(x_data,y_data,size=80,n_feat=x_data.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5)
                         #X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)

In [None]:
%%time
logmodel = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(30, 10), random_state=1)
X_train,X_test, Y_train, Y_test = split(x_data,y_data)
chromo_df_bc,score_bc=generations(x_data,y_data,size=80,n_feat=x_data.shape[1],n_parents=64,mutation_rate=0.20,n_gen=5)
                        # X_train = X_train,X_test = X_test,Y_train = Y_train,Y_test = Y_test)