<div>
<img src="https://www.ul.ie/themes/custom/ul/logo.jpg" />
</div>

# **MSc in Artificial Intelligence and Machine Learning**
## CS6271 - Evolutionary Algorithms and Humanoid Robotics 2023
### Kaggle Competition


Module Leader: Conor Ryan

Developer: Allan De Lima

Link to access the competition: https://www.kaggle.com/competitions/cs6271-20234-final-project

Link to join the competition: https://www.kaggle.com/t/2b316ba38c144f23ac780c8fc898b4d7



## Introduction

Predict whether income exceeds $50K/yr based on census data. This is a shorter version of the Adult dataset, also known as the "Census Income" dataset (donated on 4/30/1996).

In [1]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

Class:

income: >50K, <=50K.


Listing of features:

age: continuous.

workclass: categorical (Private, Self-emp-not-inc, Local-gov, State-gov).

education: categorical (Bachelors, Some-college, HS-grad, Masters, Doctorate).

marital-status: categorical (Married-civ-spouse, Divorced, Never-married).

relationship: categorical (Wife, Husband, Not-in-family, Other-relative).

race: categorical (White, Asian-Pac-Islander, Black).

sex: categorical (Female, Male).

capital-gain: continuous.

capital-loss: continuous.

hours-per-week: continuous.

native-country: categorical (United-States, Others).


### Load the dataset

In [2]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [3]:
## mount your Google drive
# 1) run this cell
# 2) sign in
# 3) verify your drive is mounted

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


First, clone the GRAPE repository, because the dataset to be used is already there.

In [4]:
import os
# Get the library from our BDS research Group
# copy the path from your drive
PATH = '/content/drive/MyDrive/grape/'

# check if 'grape' already exists
if os.path.exists(PATH):
    print('grape directory already exists')
else:
    %cd /content/drive/MyDrive/
    !git clone https://github.com/bdsul/grape.git
    print('Cloning grape in your Drive')

# change directory to 'grape'
%cd /content/drive/MyDrive/grape/

grape directory already exists
/content/drive/MyDrive/grape


Now you have a grape folder in your Drive account.

Upload the files adult_training.csv and adult_test.csv to the folder grape/datasets in your Drive before running the next cells.

### Train set

In [5]:
train_file = 'datasets/adult_training.csv'

In [6]:
# load train set
df_train = pd.read_csv(PATH+train_file)
df_train.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,28,Private,Bachelors,Never-married,Not-in-family,White,Male,0,0,40,United-States,<=50K
1,34,Self-emp-not-inc,Bachelors,Married-civ-spouse,Husband,Black,Male,0,1887,48,United-States,>50K
2,32,Private,Bachelors,Never-married,Not-in-family,Black,Female,0,0,40,United-States,<=50K
3,46,Private,Bachelors,Divorced,Not-in-family,White,Male,0,0,40,Others,<=50K
4,44,Private,Bachelors,Married-civ-spouse,Husband,White,Male,0,0,50,United-States,>50K


In [7]:
unique_workclass_values = df_train['workclass'].unique()
print(unique_workclass_values)

['Private' 'Self-emp-not-inc' 'Local-gov' 'State-gov']


In [8]:
unique_marital_status_values = df_train['marital-status'].unique()
print(unique_marital_status_values)

['Never-married' 'Married-civ-spouse' 'Divorced']


In [9]:
unique_relationship_values = df_train['relationship'].unique()
print(unique_relationship_values)

['Not-in-family' 'Husband' 'Wife' 'Other-relative']


In [10]:
unique_education_values = df_train['education'].unique()
print(unique_education_values)

['Bachelors' 'Some-college' 'HS-grad' 'Masters' 'Doctorate']


In [11]:
unique_race_values = df_train['race'].unique()
print(unique_race_values)

['White' 'Black' 'Asian-Pac-Islander']


In [12]:
df_train.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,5200.0,5200.0,5200.0,5200.0
mean,39.688077,1059.895,109.486346,42.786538
std,11.973363,6687.36408,442.694051,10.937644
min,17.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,2559.0,99.0


### Pre-processing

1. Encoding

In [13]:
df_train['sex']= df_train['sex'].map({'Male':1, 'Female': 0})
df_train['native-country']= df_train['native-country'].map({'United-States':1, 'Others': 0})
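`Series.map` with a dictionary returns NaN for any value that is not a key, so a misspelt or unexpected category would silently become a missing value. The following is a minimal sketch of a check, using a toy series rather than the competition data (the value `'male'` is invented for illustration):

```python
import pandas as pd

# Toy column; 'male' is a deliberately wrong spelling, not a value from the dataset.
sex = pd.Series(['Male', 'Female', 'male'])

encoded = sex.map({'Male': 1, 'Female': 0})

# Values that had no key in the dictionary come back as NaN.
unmapped = sex[encoded.isna()].unique()
print("Unmapped values:", list(unmapped))
```

Running the same check on `df_train['sex']` and `df_train['native-country']` after the mapping should report no unmapped values if the categories match the listing above.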

In [14]:
df_train = pd.get_dummies(df_train, columns=['workclass', 'education', 'marital-status', 'relationship', 'race'])

In [15]:
X_train = df_train.copy()
# warning: cannot drop it more than once
X_train.drop(['income'], axis=1, inplace=True)

In [16]:
X_train

Unnamed: 0,age,sex,capital-gain,capital-loss,hours-per-week,native-country,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,...,marital-status_Divorced,marital-status_Married-civ-spouse,marital-status_Never-married,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White
0,28,1,0,0,40,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,1
1,34,1,0,1887,48,1,0,0,1,0,...,0,1,0,1,0,0,0,0,1,0
2,32,0,0,0,40,1,0,1,0,0,...,0,0,1,0,1,0,0,0,1,0
3,46,1,0,0,40,0,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
4,44,1,0,0,50,1,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5195,57,0,0,0,40,1,0,1,0,0,...,1,0,0,0,1,0,0,0,0,1
5196,23,1,0,0,40,1,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1
5197,47,1,0,0,65,1,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1
5198,66,1,20051,0,40,1,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1


You should represent the outputs with 0 where the income is less than or equal to 50K and with 1 where it is greater than 50K.

Follow exactly this approach, because the test targets are represented like this in the competition.

In [17]:
# class labels: 1 where income is '>50K', 0 where it is '<=50K'
y_train = (df_train['income'] == '>50K').astype(int).to_numpy()

In [18]:
print(y_train[0:5]) #print head

[0 1 0 0 1]


### Test set

In [19]:
test_file = 'datasets/adult_test.csv'

In [20]:
# load test set
df_test = pd.read_csv(PATH+test_file)
df_test.head()

Unnamed: 0,age,workclass,education,marital-status,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,33,Private,HS-grad,Never-married,Not-in-family,White,Male,3325,0,50,United-States
1,58,Private,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,40,United-States
2,30,Self-emp-not-inc,HS-grad,Married-civ-spouse,Husband,White,Male,0,0,60,United-States
3,26,Private,Some-college,Never-married,Not-in-family,White,Female,0,0,20,United-States
4,43,State-gov,HS-grad,Never-married,Not-in-family,White,Male,0,0,60,United-States


In [21]:
df_test.describe()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
count,10402.0,10402.0,10402.0,10402.0
mean,39.811575,1280.969237,106.101038,42.749567
std,12.063746,7826.438595,438.826968,11.200949
min,18.0,0.0,0.0,1.0
25%,30.0,0.0,0.0,40.0
50%,38.0,0.0,0.0,40.0
75%,48.0,0.0,0.0,48.0
max,90.0,99999.0,3683.0,99.0


In [22]:
X_test = df_test.copy()

You will need to prepare both training and test datasets before working with a Machine Learning method.

Keep in mind that categorical data needs some encoding method.

You are free to use any other pre-processing ideas.

In [23]:
# Mapping 'sex' column
X_test['sex'] = X_test['sex'].map({'Male': 1, 'Female': 0})

# Mapping 'native-country' column
X_test['native-country'] = X_test['native-country'].map({'United-States': 1, 'Others': 0})

# Perform one-hot encoding for specified columns
X_test = pd.get_dummies(X_test, columns=['workclass', 'education', 'marital-status', 'relationship', 'race'])
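Because the training and test sets are one-hot encoded separately, a category that is absent from one of them would leave the two matrices with different columns, and the `x[i]` indices in the grammar would then point at different features. In this notebook both sets end up with 25 columns, so the sketch below is only a precaution. It uses toy frames, not the competition data, and shows one way to force the test columns to follow the training columns:

```python
import pandas as pd

# Toy frames: the test frame lacks two of the categories seen in training.
train = pd.DataFrame({'race': ['White', 'Black', 'Asian-Pac-Islander']})
test = pd.DataFrame({'race': ['White', 'White']})

train_enc = pd.get_dummies(train, columns=['race'])
test_enc = pd.get_dummies(test, columns=['race'])

# Reorder the test columns like the training ones; missing dummies are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```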


In [24]:
X_test.head()

Unnamed: 0,age,sex,capital-gain,capital-loss,hours-per-week,native-country,workclass_Local-gov,workclass_Private,workclass_Self-emp-not-inc,workclass_State-gov,...,marital-status_Divorced,marital-status_Married-civ-spouse,marital-status_Never-married,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_White
0,33,1,3325,0,50,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,1
1,58,1,0,0,40,1,0,1,0,0,...,0,1,0,1,0,0,0,0,0,1
2,30,1,0,0,60,1,0,0,1,0,...,0,1,0,1,0,0,0,0,0,1
3,26,0,0,0,20,1,0,1,0,0,...,0,0,1,0,1,0,0,0,0,1
4,43,1,0,0,60,1,0,0,0,1,...,0,0,1,0,1,0,0,0,0,1


Convert the datasets to NumPy arrays to make them easier to use.

In [25]:
# data features
# X_train = X_train.to_numpy()
X_test = X_test.to_numpy()

In [26]:
X_train = X_train.to_numpy()

## GRAPE

<div>
<img src="https://drive.google.com/uc?export=view&id=1hw43Oi3lGTCkspQ0ged2bZB8q2EpcPhz" width="150"/>
</div>

GRammatical Algorithms in Python for Evolution (GRAPE)


You can import functions to be used with your grammar from [functions.py](https://github.com/UL-BDS/grape/blob/main/functions.py) in the GRAPE repository, and/or you can define your own functions.

In [27]:
!pip install deap

import grape
import algorithms
from os import path
from deap import creator, base, tools
import random
import csv
import matplotlib.pyplot as plt

Collecting deap
  Downloading deap-1.4.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (135 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m135.4/135.4 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: deap
Successfully installed deap-1.4.1


In [28]:
from functions import add, sub, mul, pdiv, psqrt, plog, and_, or_, not_, less_than_or_equal, greater_than_or_equal

'heartDisease.bnf' is a grammar used for another problem just to check if everything is working well.

Write your own grammar in a text file and save it in your Drive account.

Put the full path in GRAMMAR_FILE and print the file to check it.

In [29]:
GRAMMAR_FILE = 'Adult_grammar.bnf'  # grammar for this problem, stored in grape/grammars

# print the grammar to check it
with open(path.join("grammars", GRAMMAR_FILE), "r") as f:
    print(f.read())

<log_op> ::= <conditional_branches> | and_(<log_op>,<log_op>) | or_(<log_op>,<log_op>) | not_(<log_op>) | <boolean_feature>
<conditional_branches> ::= less_than_or_equal(<num_op>,<num_op>) | greater_than_or_equal(<num_op>, <num_op>)
<num_op>   ::= add(<num_op>,<num_op>) | sub(<num_op>,<num_op>) | mul(<num_op>,<num_op>) | pdiv(<num_op>,<num_op>) | <nonboolean_feature>
<boolean_feature> ::= x[1]|x[5]|x[6]|x[7]|x[8]|x[9]|x[10]|x[11]|x[12]|x[13]|x[14]|x[15]|x[16]|x[17]|x[18]|x[19]|x[20]|x[21]|x[22]|x[23]|x[24]
<nonboolean_feature> ::= x[0]|x[2]|x[3]|x[4]|<c><c>.<c><c>
<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9


Run the following cell to put your grammar on the class Grammar.

In [30]:
BNF_GRAMMAR = grape.Grammar(path.join("grammars", GRAMMAR_FILE))
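The grammar refers to features only by position (`x[0]` to `x[24]`), so it helps to know which column each index stands for. The sketch below rebuilds the expected column order by hand. It assumes the encoding cells above were run unchanged and that `pd.get_dummies` appends the dummy columns feature by feature with the categories in alphabetical order. The education columns are hidden in the printed tables, so their positions here are inferred rather than read from the output.

```python
# Columns that stay as single features, in their order after the mapping step.
columns = ['age', 'sex', 'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country']

# Categorical features in the order passed to pd.get_dummies.
categories = {
    'workclass': ['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov'],
    'education': ['Bachelors', 'Some-college', 'HS-grad', 'Masters', 'Doctorate'],
    'marital-status': ['Married-civ-spouse', 'Divorced', 'Never-married'],
    'relationship': ['Wife', 'Husband', 'Not-in-family', 'Other-relative'],
    'race': ['White', 'Asian-Pac-Islander', 'Black'],
}

for feature, values in categories.items():
    columns += [f'{feature}_{value}' for value in sorted(values)]

for i, name in enumerate(columns):
    print(f'x[{i}] = {name}')
```

With the real data, `list(df_train.drop(columns=['income']).columns)` gives the authoritative order and is worth comparing against this list before trusting an interpretation of the evolved expression.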

The fitness function here is the fraction of outputs wrongly predicted (the classification error), so lower values are better.

You can write your own fitness function if you prefer.

In [31]:
def fitness_eval(individual, points):
    """
    Fitness function: fraction of samples whose class is wrongly predicted.
    """

    x = points[0]  # features in rows, samples in columns; the phenotype uses x[i]
    Y = points[1]  # expected classes (0 or 1)

    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    compare = np.equal(Y, pred)
    fitness = 1 - np.mean(compare)

    return fitness,
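To make the calculation concrete, here is a small worked example of the same steps on toy data. The two helper functions are hypothetical stand-ins written for this sketch; the real ones in GRAPE's `functions.py` may be implemented differently. The phenotype string is also invented, although it follows the shape allowed by the grammar.

```python
import numpy as np

# Hypothetical stand-ins for two helpers from functions.py (assumed behaviour).
def less_than_or_equal(a, b):
    return np.less_equal(a, b)

def and_(a, b):
    return np.logical_and(a, b)

# Toy data: 2 features in rows, 4 samples in columns.
x = np.array([[25, 40, 52, 33],   # x[0]: a numeric feature
              [1, 0, 1, 1]])      # x[1]: a 0/1 feature
Y = np.array([0, 0, 1, 1])        # expected classes

phenotype = "and_(x[1], less_than_or_equal(35.00, x[0]))"
pred = eval(phenotype)            # one boolean per sample

fitness = 1 - np.mean(np.equal(Y, pred))
print(pred, fitness)              # 3 of 4 samples correct, so the error is 0.25
```

Because each `x[i]` is a whole row, one `eval` call scores every sample at once, which is why the features have to be in the rows.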

To use the fitness function above properly with GRAPE, the features must be in the rows and the samples in the columns, so if your data is not arranged like that, you need to transpose the matrix.

Check the printed shapes. If you run this cell twice, the matrices will be transposed back and the fitness function will not work properly.

In [32]:
print('Before Transpose Training (X,Y):\t', X_train.shape, y_train.shape)
print('Before Transpose Test (X):\t', X_test.shape)

X_train = np.transpose(X_train)
X_test = np.transpose(X_test)

print('Training (X,Y):\t', X_train.shape, y_train.shape)
print('Test (X):\t', X_test.shape)

Before Transpose Training (X,Y):	 (5200, 25) (5200,)
Before Transpose Test (X):	 (10402, 25)
Training (X,Y):	 (25, 5200) (5200,)
Test (X):	 (25, 10402)
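One way to avoid the double-transpose problem is to transpose only when the shape calls for it. The helper below is a sketch, not part of GRAPE; its name and the `n_features` argument are made up for illustration, and it cannot tell rows from columns if the number of samples happens to equal the number of features.

```python
import numpy as np

def features_in_rows(X, n_features):
    """Return X with shape (n_features, n_samples), transposing only if needed."""
    if X.shape[0] == n_features:
        return X                    # already in the layout the fitness function expects
    if X.shape[1] == n_features:
        return X.T                  # samples were in the rows, so flip once
    raise ValueError(f"no axis of {X.shape} has length {n_features}")

X_demo = np.zeros((5200, 25))       # same shape as the training matrix above
once = features_in_rows(X_demo, 25)
twice = features_in_rows(once, 25)  # running it again changes nothing
print(once.shape, twice.shape)
```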


Set the Grammatical Evolution parameters.

Make sure you set a random seed just in case we need to re-run your experiments.

In [33]:
POPULATION_SIZE = 1000
MAX_GENERATIONS = 200
P_CROSSOVER = 0.7
P_MUTATION = 0.01
ELITE_SIZE = 2
HALL_OF_FAME_SIZE =  10

TOURNAMENT_SIZE = 15
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

CODON_CONSUMPTION = 'lazy'
GENOME_REPRESENTATION = 'list'
MAX_GENOME_LENGTH = None

MAX_INIT_TREE_DEPTH = 15
MIN_INIT_TREE_DEPTH = 5
MAX_TREE_DEPTH = 90
MAX_WRAPS = 0
CODON_SIZE = 255

REPORT_ITEMS = ['gen', 'invalid', 'avg', 'std', 'min', 'max',
                'best_ind_length', 'avg_length',
                'best_ind_nodes', 'avg_nodes',
                'best_ind_depth', 'avg_depth',
                'avg_used_codons', 'best_ind_used_codons',
                'structural_diversity', 'fitness_diversity',
                'selection_time', 'generation_time']
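The cell above seeds only Python's `random` module. The sketch below shows what that buys: re-seeding with the same value reproduces the same sequence of draws. It assumes, without checking GRAPE's source, that the library takes its random numbers from `random`. If some part of a pipeline used NumPy's generator instead, that would need its own seed.

```python
import random

SEED = 42

random.seed(SEED)
first_draws = [random.randint(0, 255) for _ in range(5)]

random.seed(SEED)                   # same seed again
second_draws = [random.randint(0, 255) for _ in range(5)]

print(first_draws == second_draws)  # the two sequences are identical
```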

In [34]:
toolbox = base.Toolbox()

# define a single objective, minimising fitness strategy:
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))

creator.create('Individual', grape.Individual, fitness=creator.FitnessMin)

toolbox.register("populationCreator", grape.sensible_initialisation, creator.Individual)

toolbox.register("evaluate", fitness_eval)

# Tournament selection:
toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)

# Single-point crossover:
toolbox.register("mate", grape.crossover_onepoint)

# Flip-int mutation:
toolbox.register("mutate", grape.mutation_int_flip_per_codon)
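For readers new to tournament selection, the idea can be sketched in a few lines of plain Python. This is an illustration of the principle for a minimised fitness, not DEAP's implementation; the function name and the use of a bare list of fitness values are assumptions made for the example.

```python
import random

def tournament_select(fitnesses, tournsize, rng):
    """Pick one index: draw `tournsize` candidates at random (with replacement)
    and keep the one with the lowest error."""
    candidates = [rng.randrange(len(fitnesses)) for _ in range(tournsize)]
    return min(candidates, key=lambda i: fitnesses[i])

rng = random.Random(0)
errors = [0.30, 0.10, 0.50, 0.20]   # toy classification errors, lower is better

winner = tournament_select(errors, tournsize=200, rng=rng)
print(winner)
```

A larger tournament makes it more likely that the best individuals are among the candidates, so selection pressure rises with `TOURNAMENT_SIZE`.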

In [35]:
# create initial population (generation 0):
population = toolbox.populationCreator(pop_size=POPULATION_SIZE,
                                           bnf_grammar=BNF_GRAMMAR,
                                           min_init_depth=MIN_INIT_TREE_DEPTH,
                                           max_init_depth=MAX_INIT_TREE_DEPTH,
                                           codon_size=CODON_SIZE,
                                           codon_consumption=CODON_CONSUMPTION,
                                           genome_representation=GENOME_REPRESENTATION
                                            )

# define the hall-of-fame object:
hof = tools.HallOfFame(HALL_OF_FAME_SIZE)

# prepare the statistics object:
stats = tools.Statistics(key=lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("std", np.nanstd)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)

In [36]:
population, logbook = algorithms.ge_eaSimpleWithElitism(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                                              ngen=MAX_GENERATIONS, elite_size=ELITE_SIZE,
                                              bnf_grammar=BNF_GRAMMAR,
                                              codon_size=CODON_SIZE,
                                              max_tree_depth=MAX_TREE_DEPTH,
                                              max_genome_length=MAX_GENOME_LENGTH,
                                              points_train=[X_train, y_train],
                                              codon_consumption=CODON_CONSUMPTION,
                                              report_items=REPORT_ITEMS,
                                              genome_representation=GENOME_REPRESENTATION,
                                              stats=stats, halloffame=hof, verbose=False)

gen = 0 , Best fitness = (0.3011538461538461,)
gen = 1 , Best fitness = (0.2973076923076923,) , Number of invalids = 164
gen = 2 , Best fitness = (0.28711538461538466,) , Number of invalids = 175
gen = 3 , Best fitness = (0.2828846153846154,) , Number of invalids = 104
gen = 4 , Best fitness = (0.2828846153846154,) , Number of invalids = 122
gen = 5 , Best fitness = (0.2828846153846154,) , Number of invalids = 156
gen = 6 , Best fitness = (0.2792307692307693,) , Number of invalids = 148
gen = 7 , Best fitness = (0.2759615384615385,) , Number of invalids = 43
gen = 8 , Best fitness = (0.27365384615384614,) , Number of invalids = 37
gen = 9 , Best fitness = (0.2619230769230769,) , Number of invalids = 26
gen = 10 , Best fitness = (0.2619230769230769,) , Number of invalids = 4
gen = 11 , Best fitness = (0.2619230769230769,) , Number of invalids = 0
gen = 12 , Best fitness = (0.2619230769230769,) , Number of invalids = 0
gen = 13 , Best fitness = (0.2619230769230769,) , Number of invalids 

Show the best individual as an expression.

In [37]:
# Best individual
import textwrap
best = hof.items[0].phenotype
print("Best individual: \n","\n".join(textwrap.wrap(best,80)))
print("\nTraining Fitness: ", hof.items[0].fitness.values[0])
print("Depth: ", hof.items[0].depth)
print("Length of the genome: ", len(hof.items[0].genome))
print(f'Used portion of the genome: {hof.items[0].used_codons/len(hof.items[0].genome):.2f}')

Best individual: 
 or_(and_(or_(x[13],or_(less_than_or_equal(91.86,pdiv(x[2],54.75)),or_(less_than_
or_equal(74.99,pdiv(x[0],x[0])),and_(greater_than_or_equal(x[3], x[2]),or_(less_
than_or_equal(74.59,pdiv(x[3],64.74)),or_(x[11],or_(less_than_or_equal(74.55,pdi
v(x[3],24.75)),not_(not_(x[10]))))))))),x[16]),greater_than_or_equal(pdiv(sub(65
.08,x[3]),add(add(sub(44.01,98.69),pdiv(x[2],79.55)),pdiv(19.54,pdiv(79.43,sub(6
7.80,pdiv(x[3],add(add(sub(40.34,98.92),pdiv(x[2],79.53)),pdiv(19.51,pdiv(79.55,
sub(67.53,pdiv(60.41,pdiv(x[3],76.01)))))))))))), sub(98.55,pdiv(x[2],79.97))))

Training Fitness:  0.20019230769230767
Depth:  19
Length of the genome:  808
Used portion of the genome: 0.28


Define a function to predict values, without comparing to expected outputs.

In [38]:
def predict(individual, X):
    x = X  # the phenotype refers to the features as x[i]

    if individual.invalid:
        return np.nan

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan
    assert np.isrealobj(pred)

    return pred

Predict the classes of the test set.

Make sure the predictions printed here, in the notebook you submit to Brightspace, are the same ones you used in your best submission to the Kaggle competition.

In [39]:
y_pred = predict(hof.items[0], X_test)
print("Predicted classes of the test set: ", y_pred)

Predicted classes of the test set:  [False False False ... False False False]


In [40]:
type(y_pred)

numpy.ndarray

In [41]:
y_pred_new = y_pred
y_pred_new

array([False, False, False, ..., False, False, False])

In [42]:
# convert the boolean predictions to 0/1 integers, as the competition expects
y_pred_int = y_pred.astype(int)
y_pred_int

In [43]:
#Code to create a .csv

In [44]:
import csv

# Create a CSV file with the specified format
csv_file_path = 'income.csv'

# Open the CSV file in write mode
with open(csv_file_path, 'w', newline='') as csvfile:
    # Create a CSV writer
    csv_writer = csv.writer(csvfile)

    # Write the header row
    csv_writer.writerow(['index', 'income'])

    # Write the data rows
    for i, pred in enumerate(y_pred_int):
        csv_writer.writerow([i, pred])
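The same file can also be produced with pandas, which makes it easy to inspect the submission before uploading it. The sketch below uses a toy prediction array and writes to an in-memory buffer instead of `income.csv`. The header names `index` and `income` are taken from the cell above.

```python
import io
import numpy as np
import pandas as pd

y_pred_demo = np.array([False, True, False])    # toy predictions, not real output

submission = pd.DataFrame({
    'index': np.arange(len(y_pred_demo)),
    'income': y_pred_demo.astype(int),           # booleans become 0/1
})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)           # index=False avoids an extra unnamed column
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

With the real predictions, replacing the buffer with a file name such as `'income.csv'` writes the submission to disk.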