# **Feature Engineering using PSO and GA**

In this notebook, we use a loan dataset, available at https://www.kaggle.com/datasets/mirzahasnine/loan-data-set, to identify the most significant features in the data with two population-based search methods: Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA).

---



In [1]:
import pandas as pd

# Define the file path
file_path = '/content/sample_data/loan_train.csv'

# Load the CSV file into a DataFrame
df = pd.read_csv(file_path)
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,Applicant_Income,Coapplicant_Income,Loan_Amount,Term,Credit_History,Area,Status
0,Male,No,0,Graduate,No,584900,0.0,15000000,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,458300,150800.0,12800000,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,300000,0.0,6600000,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,258300,235800.0,12000000,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,600000,0.0,14100000,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,290000,0.0,7100000,360.0,1.0,Rural,Y
610,Male,Yes,3+,Graduate,No,410600,0.0,4000000,180.0,1.0,Rural,Y
611,Male,Yes,1,Graduate,No,807200,24000.0,25300000,360.0,1.0,Urban,Y
612,Male,Yes,2,Graduate,No,758300,0.0,18700000,360.0,1.0,Urban,Y


In [2]:
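# Drop rows with null values; .copy() makes df_cleaned an independent DataFrame,
# so assigning to its columns later does not raise SettingWithCopyWarning
df_cleaned = df.dropna().copy()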

# Check for null values
null_values = df_cleaned.isnull().sum()

# Display the null value counts
print("Null value counts:")
print(null_values)


Null value counts:
Gender                0
Married               0
Dependents            0
Education             0
Self_Employed         0
Applicant_Income      0
Coapplicant_Income    0
Loan_Amount           0
Term                  0
Credit_History        0
Area                  0
Status                0
dtype: int64


In [3]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Apply label encoding to categorical columns
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Area','Status']
for column in categorical_columns:
    df_cleaned[column] = label_encoder.fit_transform(df_cleaned[column])

# Display the DataFrame after label encoding
print(df_cleaned.head())

   Gender  Married  Dependents  Education  Self_Employed  Applicant_Income  \
0       1        0           0          0              0            584900   
1       1        1           1          0              0            458300   
2       1        1           0          0              1            300000   
3       1        1           0          1              0            258300   
4       1        0           0          0              0            600000   

   Coapplicant_Income  Loan_Amount   Term  Credit_History  Area  Status  
0                 0.0     15000000  360.0             1.0     2       1  
1            150800.0     12800000  360.0             1.0     0       0  
2                 0.0      6600000  360.0             1.0     2       1  
3            235800.0     12000000  360.0             1.0     2       1  
4                 0.0     14100000  360.0             1.0     2       1  


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cleaned[column] = label_encoder.fit_transform(df_cleaned[column])


# **Particle Swarm Optimization (PSO)**

To apply Particle Swarm Optimization (PSO) for feature selection on this dataset, we follow a structured approach. PSO is a computational method that optimizes a problem by iteratively improving candidate solutions with respect to a given measure of quality. It maintains a population of candidate solutions, called particles, and moves them through the space of possible solutions according to simple update rules for each particle's position and velocity. Each particle is pulled toward the best position it has found itself (its personal best) and toward the best position found by any particle in the swarm (the global best); both are updated whenever a better position is found.
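For reference, the update rules described above are commonly written as follows in the inertia-weight formulation of PSO. The notation is an assumption introduced here for illustration; only $w$, $c_1$ and $c_2$ map directly onto this notebook, where they are the `w`, `c1` and `c2` entries of the `options` dictionary.

$$
v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right),
\qquad
x_i^{t+1} = x_i^{t} + v_i^{t+1}
$$

Here $x_i^{t}$ and $v_i^{t}$ are the position and velocity of particle $i$ at iteration $t$, $p_i$ is that particle's personal best position, $g$ is the global best position, and $r_1$, $r_2$ are random numbers drawn uniformly from $[0, 1]$ at each update.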

Standard PSO searches a continuous space, so it has to be adapted for feature selection. The setup can be outlined as follows:

**Objective Function:** Your objective function needs to measure the quality of a subset of features. This can be the accuracy of a classifier using those features, for example.

**Particles:** Each particle represents a potential solution to the problem, which, in this case, is a subset of features. You can represent each particle as a binary vector, where each element corresponds to a feature (1 if the feature is selected and 0 otherwise).

**Initialization:** Randomly initialize the positions and velocities of the particles in the swarm. In the context of feature selection, the position of a particle can be initialized by randomly selecting features.

**Velocity and Position Update:** At each iteration, update the velocity and position of each particle based on its own experience (best solution it has found) and the experience of the entire swarm (the global best solution). The velocity update will determine how the particle moves towards these best solutions.

**Evaluation:** After moving the particles, evaluate their new positions according to the objective function. If a particle finds a better position (a set of features resulting in a higher accuracy), update its best known position. Also, update the global best if necessary.

**Termination:** Repeat the velocity and position update steps until a stopping criterion is met (e.g., a maximum number of iterations or a satisfactory objective function value).
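The velocity and position update for binary particles can be sketched as below. This is a minimal illustration under stated assumptions, not the code this notebook runs: the function name `binary_pso_step`, the sigmoid transfer function, and the toy swarm of 5 particles over 11 features are all chosen here for the example. The notebook itself uses `GlobalBestPSO` with continuous positions and thresholds them inside the objective function; pyswarms also ships a discrete optimizer, `ps.discrete.BinaryPSO`, for binary positions.

```python
import numpy as np

def binary_pso_step(positions, velocities, personal_best, global_best,
                    rng, w=0.9, c1=0.5, c2=0.3):
    """One velocity/position update for binary particles (illustrative only)."""
    n_particles, n_features = positions.shape
    r1 = rng.random((n_particles, n_features))
    r2 = rng.random((n_particles, n_features))
    # Pull each particle toward its own best and the swarm's best position
    velocities = (w * velocities
                  + c1 * r1 * (personal_best - positions)
                  + c2 * r2 * (global_best - positions))
    # The sigmoid turns each velocity into the probability that the bit is 1
    prob_one = 1.0 / (1.0 + np.exp(-velocities))
    positions = (rng.random((n_particles, n_features)) < prob_one).astype(int)
    return positions, velocities

# Toy swarm: 5 particles, 11 features (hypothetical values)
rng = np.random.default_rng(0)
positions = rng.integers(0, 2, size=(5, 11))
velocities = rng.uniform(-1, 1, size=(5, 11))
new_positions, new_velocities = binary_pso_step(
    positions, velocities, positions.copy(), positions[0], rng)
print(new_positions)
```

Each row of `new_positions` is a feature mask of 0s and 1s that could be passed to an objective function for evaluation.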


In [4]:
!pip install pyswarms




In [5]:
import numpy as np
import pyswarms as ps
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


X = df_cleaned.drop('Status', axis=1).values  # Feature matrix
y = df_cleaned['Status'].values  # Target variable

# PSO hyperparameters: cognitive (c1) and social (c2) coefficients, inertia weight (w)
options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}

# Define bounds
max_bound = 1.0 * np.ones(X.shape[1])
min_bound = 0.0 * np.ones(X.shape[1])
bounds = (min_bound, max_bound)

# Objective function
def f_per_particle(m, alpha):
    """Compute the objective value for a single particle.

    Inputs
    ------
    m : numpy.ndarray
        Particle position with one value in [0, 1] per feature;
        features whose value exceeds 0.5 are selected.
    alpha: float
        Weight that trades off classification error against the
        fraction of features left out.

    Returns
    -------
    float
        Cost to minimize (lower is better).
    """
    total_features = X.shape[1]
    # Apply mask to features
    X_subset = X[:, m > 0.5]
    if X_subset.shape[1] == 0:
        # No feature selected: return an infinite cost so this position is never preferred
        return float('inf')
    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X_subset, y, test_size=0.3, random_state=42)
    # Fit the model
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_train, y_train)
    # Predict and calculate accuracy
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Calculate objective. Note: the second term shrinks as MORE features are
    # selected, so with this formula larger subsets receive a lower cost.
    j = (alpha * (1.0 - accuracy) + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
    return j

# Define objective function
def f(x, alpha=0.5):
    """Higher-level method to do classification in the
    whole swarm.

    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search

    Returns
    -------
    numpy.ndarray of shape (n_particles, )
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)

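# Initialize swarm (GlobalBestPSO moves continuous positions within the bounds;
# the objective function thresholds them at 0.5 to obtain a feature mask)
optimizer = ps.single.GlobalBestPSO(n_particles=50, dimensions=X.shape[1], options=options, bounds=bounds)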

# Perform optimization
cost, pos = optimizer.optimize(f, iters=100)


2024-04-02 01:22:46,609 - pyswarms.single.global_best - INFO - Optimize for 100 iters with {'c1': 0.5, 'c2': 0.3, 'w': 0.9}
pyswarms.single.global_best: 100%|██████████|100/100, best_cost=0.09
2024-04-02 01:31:17,909 - pyswarms.single.global_best - INFO - Optimization finished | best cost: 0.09000000000000002, best pos: [0.79358531 0.79357312 0.61057031 0.55163501 0.59481175 0.65287157
 0.85299039 0.87030244 0.53853014 0.5598393  0.72719174]
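
The logged result can be read back through the objective formula. Every entry of the best position is above the 0.5 cut-off used in `f_per_particle`, so that position selects all 11 features and the feature term of the cost is zero. Assuming the logged position is the one that produced the best cost, a cost of 0.09 with `alpha=0.5` corresponds to a test accuracy of 0.82. The sketch below re-evaluates the same formula with the accuracy supplied directly; the helper name `objective` and the 7-feature comparison are hypothetical, added only to show how the cost reacts to subset size.

```python
def objective(accuracy, n_selected, n_total, alpha=0.5):
    """Same formula as f_per_particle, with the accuracy passed in directly."""
    return alpha * (1.0 - accuracy) + (1.0 - alpha) * (1.0 - n_selected / n_total)

# All 11 features at an accuracy of 0.82: the feature term vanishes
cost_all = objective(0.82, 11, 11)

# Hypothetical: 7 of 11 features at the same accuracy
cost_seven = objective(0.82, 7, 11)

print(round(cost_all, 4), round(cost_seven, 4))
```

At equal accuracy the smaller subset gets the higher cost, which is why the swarm settles on a position that keeps every feature.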


In [7]:
#'pos' is the variable containing the best position returned by optimizer.optimize()
best_pos = np.array(pos)

print("Position as NumPy Array:", best_pos)


Position as NumPy Array: [0.79358531 0.79357312 0.61057031 0.55163501 0.59481175 0.65287157
 0.85299039 0.87030244 0.53853014 0.5598393  0.72719174]


In [11]:
# Define a threshold to determine whether a feature is selected.
# Note: the objective function used a 0.5 cut-off; the stricter 0.6 used here
# yields a smaller subset than the one the best cost was computed on.
threshold = 0.6

# Create a boolean mask from 'best_pos' based on the threshold
selected_features_mask = best_pos > threshold

# All column names except the target variable ('Status', the last column)
feature_names = df_cleaned.columns[:-1]

# Use the mask to select the names of the features
selected_features = feature_names[selected_features_mask]

print("Selected Features:", selected_features)


Selected Features: Index(['Gender', 'Married', 'Dependents', 'Applicant_Income',
       'Coapplicant_Income', 'Loan_Amount', 'Area'],
      dtype='object')


# **Genetic Algorithm (GA)**

A Genetic Algorithm (GA) can also be used for feature selection, searching for the subset of features that contributes most to the performance of a predictive model. Like Particle Swarm Optimization (PSO), a GA is a population-based search method. Unlike PSO, it simulates the process of natural selection: the fittest individuals are selected for reproduction in order to produce the offspring of the next generation.

For feature selection, the Genetic Algorithm process can be summarized as follows:

**Initialization:** Randomly generate an initial population of individuals. Each individual, or chromosome, represents a possible solution to the feature selection problem and can be encoded as a binary string where each bit represents the presence (1) or absence (0) of a feature.

**Fitness Evaluation:** Evaluate the fitness of each individual. In the context of feature selection, the fitness could be the accuracy of a predictive model that uses the subset of features represented by the individual.

**Selection:** Select individuals for reproduction. Individuals with higher fitness are more likely to be chosen. This can be done through various methods such as tournament selection, roulette wheel selection, etc.

**Crossover:** Combine pairs of individuals (parents) to produce offspring for the next generation. This can introduce new genetic combinations into the population.

**Mutation:** Apply random changes to individual offspring. This helps to maintain genetic diversity within the population and can introduce new solutions.

**Replacement:** Form a new generation of individuals from the offspring and possibly some members of the current generation.

**Termination:** Repeat the steps from fitness evaluation through replacement until a termination condition is met (e.g., a set number of generations or a satisfactory level of fitness).
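
The crossover and mutation steps above can be sketched on plain bit lists as follows. This is a simplified stand-in written for illustration, assuming 11 features; the function names `two_point_crossover` and `flip_bit_mutation` are hypothetical and are not DEAP's implementation, although DEAP offers operators for the same purpose (`tools.cxTwoPoint` and `tools.mutFlipBit`).

```python
import random

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two cut points of two equal-length bit lists."""
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def flip_bit_mutation(individual, rng, indpb=0.05):
    """Flip each bit independently with probability indpb."""
    return [1 - bit if rng.random() < indpb else bit for bit in individual]

rng = random.Random(0)
parent_a = [1] * 11  # selects every feature
parent_b = [0] * 11  # selects no feature
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
mutant = flip_bit_mutation(child_a, rng)
print(child_a, child_b, mutant)
```

Using all-ones and all-zeros parents makes the exchanged segment easy to see in the printed children.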

Here is a basic template using the DEAP (Distributed Evolutionary Algorithms in Python) library, which simplifies the implementation of evolutionary algorithms in Python. If DEAP is not installed, it can be installed with pip:

In [9]:
!pip install deap



In [10]:
from deap import base, creator, tools, algorithms
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X = df_cleaned.drop('Status', axis=1).values
y = df_cleaned['Status'].values

# Genetic Algorithm constants:
POPULATION_SIZE = 50
P_CROSSOVER = 0.9  # probability for crossover
P_MUTATION = 0.1   # probability for mutating an individual
MAX_GENERATIONS = 50
HALL_OF_FAME_SIZE = 5
FEATURES = X.shape[1]

# Fitness function
def fitness(individual):
    mask = np.array(individual, dtype=bool)
    if np.sum(mask) == 0:  # prevent the classifier from failing
        return 0,
    X_selected = X[:, mask]
    X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.3, random_state=42)
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    return accuracy_score(y_test, predictions),

# Set up DEAP
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, FEATURES)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", fitness)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

# Initialize population
population = toolbox.population(n=POPULATION_SIZE)

# Run Genetic Algorithm
result = algorithms.eaSimple(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                             ngen=MAX_GENERATIONS, stats=None, halloffame=tools.HallOfFame(HALL_OF_FAME_SIZE),
                             verbose=True)

# Extract the best individual of the final generation
# (eaSimple updates `population` in place)
best_ind = tools.selBest(population, 1)[0]
print("Best Individual = ", best_ind)
print("Best Fitness = ", best_ind.fitness.values[0])

# Translate the best individual into selected features
selected_features_indices = [index for index, value in enumerate(best_ind) if value == 1]
selected_features_names = df_cleaned.drop('Status', axis=1).columns[selected_features_indices]
print("Selected Features by GA:", selected_features_names)


gen	nevals
0  	50    
1  	44    
2  	48    
3  	45    
4  	46    
5  	47    
6  	41    
7  	46    
8  	47    
9  	46    
10 	46    
11 	44    
12 	42    
13 	46    
14 	39    
15 	45    
16 	42    
17 	50    
18 	49    
19 	46    
20 	47    
21 	47    
22 	49    
23 	46    
24 	46    
25 	48    
26 	46    
27 	46    
28 	46    
29 	46    
30 	38    
31 	48    
32 	48    
33 	46    
34 	50    
35 	48    
36 	47    
37 	48    
38 	41    
39 	48    
40 	46    
41 	44    
42 	50    
43 	44    
44 	45    
45 	48    
46 	48    
47 	48    
48 	50    
49 	44    
50 	46    
Best Individual =  [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
Best Fitness =  0.8666666666666667
Selected Features by GA: Index(['Dependents', 'Education', 'Applicant_Income', 'Loan_Amount', 'Term',
       'Credit_History', 'Area'],
      dtype='object')
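
The two selections printed in this notebook can be compared with plain set operations. The feature names below are copied from the PSO and GA outputs above; the variable names are chosen here for illustration. Since both searches depend on random initialization, a different run may give different subsets.

```python
# Features selected by PSO (threshold 0.6) and by GA, copied from the outputs above
pso_features = {'Gender', 'Married', 'Dependents', 'Applicant_Income',
                'Coapplicant_Income', 'Loan_Amount', 'Area'}
ga_features = {'Dependents', 'Education', 'Applicant_Income', 'Loan_Amount',
               'Term', 'Credit_History', 'Area'}

shared = sorted(pso_features & ga_features)
only_pso = sorted(pso_features - ga_features)
only_ga = sorted(ga_features - pso_features)

print("Selected by both:", shared)
print("Only PSO:", only_pso)
print("Only GA:", only_ga)
```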
