## Description

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!

In [None]:
# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Dataset

The dataset provides 12 input variables that are a mixture of categorical, ordinal, boolean and numerical data types:

1. PassengerId
2. HomePlanet
3. CryoSleep
4. Destination
5. Age
6. VIP
7. RoomService
8. FoodCourt
9. ShoppingMall
10. Spa
11. VRDeck
12. Name


This is a binary classification problem where the task is to predict whether a passenger was transported to an alternate dimension.

### Load the dataset

In [None]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [None]:
## mount your Google drive
# 1) click on the link
# 2) sign in
# 3) copy the provided code
# 4) paste it in the text box below
# 5) click the folder icon at the right
# 6) verify your drive is mounted

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


First, clone the GRAPE repository, because it already contains the dataset used here.

In [None]:
import os
# Get the library from our BDS research Group
# copy the path from your drive
PATH = '/content/drive/MyDrive/grape/'

# check if 'grape' already exists
if os.path.exists(PATH):
    print('grape directory already exists')
else:
    %cd /content/drive/MyDrive/
    !git clone https://github.com/UL-BDS/grape.git
    print('Cloning grape in your Drive')

# change directory to 'grape'
%cd /content/drive/MyDrive/grape/

grape directory already exists
/content/drive/MyDrive/grape


### Train set

In [None]:
train_file = 'datasets/spaceshipTitanic_train.csv'

In [None]:
# load train set
df_train = pd.read_csv(PATH+train_file)
df_train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0,Earth,False,55 Cancri e,22,False,0,833,381,0,12,Miranda Pratt,True
1,1,Mars,True,TRAPPIST-1e,61,False,0,0,0,0,0,Isaac Werner,True
2,2,Mars,True,TRAPPIST-1e,5,False,0,0,0,0,0,Elisha Rosario,True
3,3,Earth,False,55 Cancri e,14,False,653,0,4,0,0,Deshawn Hall,False
4,4,Earth,False,PSO J318.5-22,2,False,0,0,0,0,0,Justice Archer,True


In [None]:
df_train.describe()

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,999.5,28.5555,213.4605,497.9025,166.237,342.252,269.211
std,577.494589,14.629112,615.762402,1763.257082,509.568841,1236.474773,1021.074852
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,499.75,20.0,0.0,0.0,0.0,0.0,0.0
50%,999.5,26.0,0.0,0.0,0.0,0.0,0.0
75%,1499.25,37.0,32.0,61.25,23.0,67.25,37.0
max,1999.0,79.0,6899.0,27723.0,10424.0,18572.0,14485.0


In [None]:
X_train = df_train.copy()
# warning: cannot drop it more than once
X_train.drop(['Transported'], axis=1, inplace=True)

In [None]:
# class labels as a boolean NumPy array
y_train = df_train['Transported'].to_numpy(dtype=bool)

In [None]:
#y_train.head()
print(y_train[0:5])

[ True  True  True False  True]


### Test set

In [None]:
test_file = 'datasets/spaceshipTitanic_test.csv'

In [None]:
# load test set
df_test = pd.read_csv(PATH+test_file)
df_test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,2000,Mars,False,TRAPPIST-1e,54,False,676,0,231,379,0,Dawson Knox
1,2001,Mars,False,TRAPPIST-1e,43,False,336,11,796,15,0,Jaylee Navarro
2,2002,Europa,False,55 Cancri e,33,False,77,2381,0,3656,150,Dario Hart
3,2003,Earth,True,55 Cancri e,30,False,0,0,0,0,0,Alden Parker
4,2004,Europa,False,TRAPPIST-1e,31,False,0,53,0,2963,1017,Gina Frank


In [None]:
df_test.describe()

Unnamed: 0,PassengerId,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,4923.0,4923.0,4923.0,4923.0,4923.0,4923.0,4923.0
mean,4461.0,29.028235,231.206175,473.335568,184.646354,308.407678,317.807434
std,1421.292018,14.466997,696.138873,1634.705363,677.528376,1126.346091,1164.989135
min,2000.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3230.5,19.0,0.0,0.0,0.0,0.0,0.0
50%,4461.0,27.0,0.0,0.0,0.0,0.0,0.0
75%,5691.5,38.0,59.0,84.0,31.0,63.0,59.5
max,6922.0,79.0,14327.0,29813.0,23492.0,22408.0,20336.0


In [None]:
X_test = df_test.copy()

We need to prepare both training and test datasets before working with a Machine Learning method.

Consider the following tips:

1.   Remove columns that you think do not influence the class label (for example 'PassengerId');
2.   Use some encoding method with categorical data.

You are free to use any other pre-processing ideas.

For instance, you could use one-hot encoding for the categorical data, as shown when we studied the heart disease dataset.


Number of categories in each categorical column:



1.   HomePlanet: 3
2.   Destination: 3
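As an illustration of the encoding tip above, here is a minimal sketch of one-hot encoding with `pd.get_dummies` on toy data. The two small frames and the `reindex` step are assumptions made for this example: the step is only needed when a category is missing from one of the sets, which would otherwise leave the train and test matrices with different columns.

```python
import pandas as pd

# Toy frames (made up for illustration); the test frame lacks "Europa"
train = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa"], "Age": [22, 61, 5]})
test = pd.DataFrame({"HomePlanet": ["Mars", "Earth"], "Age": [54, 30]})

# One column per category, named <column>_<category>
train_enc = pd.get_dummies(train, columns=["HomePlanet"])
test_enc = pd.get_dummies(test, columns=["HomePlanet"])

# Give the test frame the same columns, in the same order, as the train frame;
# categories that never occur in the test frame are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

With three categories in both `HomePlanet` and `Destination`, each of those columns expands into three indicator columns.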



In [None]:
#Include your code here
X_train.drop(['PassengerId'], axis=1, inplace=True)
X_train.drop(['Name'], axis=1, inplace=True)
X_test.drop(['PassengerId'], axis=1, inplace=True)
X_test.drop(['Name'], axis=1, inplace=True)
X_train = pd.get_dummies(X_train, columns=['HomePlanet', 'Destination'])
X_test = pd.get_dummies(X_test, columns=['HomePlanet', 'Destination'])


In [None]:
display(X_train.head(100))
display(X_test.head(100))


Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,22,False,0,833,381,0,12,1,0,0,1,0,0
1,True,61,False,0,0,0,0,0,0,0,1,0,0,1
2,True,5,False,0,0,0,0,0,0,0,1,0,0,1
3,False,14,False,653,0,4,0,0,1,0,0,1,0,0
4,False,2,False,0,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,True,17,False,0,0,0,0,0,1,0,0,1,0,0
96,False,31,False,1,104,1338,698,126,0,1,0,0,0,1
97,False,13,False,16,0,87,544,0,1,0,0,0,1,0
98,False,37,False,0,0,0,0,0,0,1,0,0,0,1


Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,False,54,False,676,0,231,379,0,0,0,1,0,0,1
1,False,43,False,336,11,796,15,0,0,0,1,0,0,1
2,False,33,False,77,2381,0,3656,150,0,1,0,1,0,0
3,True,30,False,0,0,0,0,0,1,0,0,1,0,0
4,False,31,False,0,53,0,2963,1017,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,True,4,False,0,0,0,0,0,1,0,0,1,0,0
96,True,2,False,0,0,0,0,0,1,0,0,0,0,1
97,False,18,False,0,0,0,391,283,1,0,0,0,1,0
98,False,8,False,0,0,0,0,0,1,0,0,0,0,1


Convert the datasets to NumPy arrays so that they are easier to work with.

In [None]:
# data features
X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
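When a DataFrame mixes boolean and integer columns, `to_numpy()` can fall back to the generic `object` dtype. The sketch below shows one optional way to get a plain numeric array by casting to `float` first. The toy frame is an assumption for illustration, and whether the logical functions used later accept 0.0/1.0 inputs has not been checked here.

```python
import pandas as pd

# Toy frame (made up) with the same kinds of columns as the prepared data
df = pd.DataFrame({
    "CryoSleep": [False, True],      # boolean
    "Age": [22, 61],                 # integer
    "HomePlanet_Earth": [1, 0],      # one-hot indicator
})

# Cast every column to float so the resulting array has a single numeric dtype
X = df.astype(float).to_numpy()

print(X.dtype, X.shape)
```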


## GRAPE

<div>
<img src="https://drive.google.com/uc?export=view&id=1hw43Oi3lGTCkspQ0ged2bZB8q2EpcPhz" width="150"/>
</div>

GRammatical Algorithms in Python for Evolution (GRAPE)


In [None]:
!pip install deap==1.3

import grape
import algorithms

from os import path
from deap import creator, base, tools
import random
import csv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting deap==1.3
  Downloading deap-1.3.0-cp37-cp37m-manylinux2010_x86_64.whl (152 kB)
Installing collected packages: deap
Successfully installed deap-1.3.0


You can import functions to use in your grammar from [functions.py](https://github.com/UL-BDS/grape/blob/main/functions.py) in the GRAPE repository, define your own functions, or both.

In [None]:
from functions import add, sub, mul, pdiv, psqrt, plog, neg, and_, or_, not_, less_than_or_equal, greater_than_or_equal, nand_, nor_
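If you define your own functions, a protected operator is a common choice in evolutionary search, because evolved expressions frequently divide by zero. The sketch below is a hypothetical `protected_div`, written for illustration only: it returns 1 wherever the denominator is 0, which is an assumption, and GRAPE's own `pdiv` may follow a different convention.

```python
import numpy as np

def protected_div(a, b):
    """Element-wise a / b that returns 1.0 wherever b is zero (illustrative convention)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Start from the fallback value and overwrite only the safe positions
    out = np.ones(np.broadcast(a, b).shape)
    np.divide(a, b, out=out, where=(b != 0))
    return out

result = protected_div([1, 4, 6], [0, 2, 3])
print(result)
```

Any function you add this way must also appear by name in a production rule of your grammar, otherwise no individual will ever use it.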

'spacetitanic_demo.bnf' is a demo grammar from the `grammars` folder of the GRAPE repository, used here just to check that everything is working.

Write your own grammar in a text file and save it in your Drive account.

Set GRAMMAR_FILE to the full path of your grammar and print the file to check it.

In [None]:
#GRAMMAR_FILE = '/content/drive/MyDrive/data/example.bnf' #put the whole address of your own grammar and remove the # in the beginning of this line
GRAMMAR_FILE = 'spacetitanic_demo.bnf' #remove this line when you are using your own grammar

#f = open(GRAMMAR_FILE, "r") #remove the # in the beginning of this line when you are using your own grammar
f = open("grammars/" + GRAMMAR_FILE, "r") #remove this line when you are using your own grammar
print(f.read())
f.close()


<log_op> ::= <conditional_branches> | and_(<log_op>,<log_op>) | or_(<log_op>,<log_op>) | not_(<log_op>) | <x>
<conditional_branches> ::= less_than_or_equal(<num_op>,<num_op>) | greater_than_or_equal(<num_op>, <num_op>) 
<num_op>   ::= add(<num_op>,<num_op>) | sub(<num_op>,<num_op>) | mul(<num_op>,<num_op>) | pdiv(<num_op>,<num_op>) | <y>
<x> ::= x[0]|x[2]|x[8]|x[9]|x[10]|x[11]|x[12]|x[13]
<y> ::= x[1]|x[3]|x[4]|x[5]|x[6]|x[7]
<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10



Run the following cell to load your grammar into a `grape.Grammar` object.

In [None]:
#BNF_GRAMMAR = grape.Grammar(GRAMMAR_FILE) #remove the # in the beginning of this line when you are using your own grammar
BNF_GRAMMAR = grape.Grammar(path.join("grammars", GRAMMAR_FILE)) #remove this line when you are using your own grammar
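The `<x>` and `<y>` rules of the demo grammar list by hand which rows of `x` are logical and which are numeric. When writing your own grammar, a sketch like the one below could generate those two rules from the prepared DataFrame. The helper `grammar_terminals` is hypothetical (it is not part of GRAPE), and it assumes that a column is logical whenever all of its values are 0/1 or True/False, so a numeric column that happens to contain only 0 and 1 would be misclassified.

```python
import pandas as pd

def grammar_terminals(df):
    """Return the <x> (logical) and <y> (numeric) rules for the columns of df."""
    logical, numeric = [], []
    for i, col in enumerate(df.columns):
        values = set(df[col].dropna().unique().tolist())
        # True == 1 and False == 0 in Python, so this covers bool and 0/1 columns
        if values <= {0, 1}:
            logical.append(f"x[{i}]")
        else:
            numeric.append(f"x[{i}]")
    return "<x> ::= " + "|".join(logical), "<y> ::= " + "|".join(numeric)

# Toy frame (made up) with the first four columns of the prepared data
toy = pd.DataFrame({
    "CryoSleep": [False, True],
    "Age": [22, 61],
    "VIP": [False, True],
    "RoomService": [0, 653],
})

x_rule, y_rule = grammar_terminals(toy)
print(x_rule)
print(y_rule)
```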

The fitness function is the fraction of outputs wrongly predicted, i.e. the classification error rate.

You can use any other fitness function, but keep the trailing comma in each return statement: DEAP expects fitness values as a tuple.

In this notebook, GRAPE is used only for minimisation, so the fitness function must return a value where lower is better. If you want to maximise instead, you also need to change the weights in the toolbox.

Moreover, this fitness function treats the predicted output as True or False. If your grammar allows individuals with other outputs, you need to change the fitness function.

In [None]:
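def mae(y, yhat):
    """
    Calculate the fraction of mismatches between inputs.

    For 0/1 labels this equals the mean absolute error.

    :param y: The expected output (i.e. from dataset).
    :param yhat: The predicted output (i.e. from phenotype).
    :return: The fraction of wrong predictions, between 0 and 1.
    """

    compare = np.equal(y, yhat)

    return 1 - np.mean(compare)

def fitness_eval(individual, points):
    """
    Fitness function: error rate of the individual's phenotype on the
    given points (features in points[0], labels in points[1]).
    """

    # The phenotype refers to the feature matrix by the name x
    x = points[0]
    Y = points[1]

    # Invalid individuals have no phenotype to evaluate
    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # Threshold the output at 0 to obtain one class label per sample
    try:
        Y_class = [1 if pred[i] > 0 else 0 for i in range(len(Y))]
    except (IndexError, TypeError):
        return np.nan,
    fitness = mae(Y, Y_class)

    return fitness,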
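As a quick check of the error-rate fitness, the sketch below computes it by hand on five made-up labels and wraps the result in a one-element tuple, which is what the trailing comma in the return statements produces. The arrays are illustrative values, not data from the competition.

```python
import numpy as np

# Made-up labels and predictions: positions 1 and 3 are wrong
y_true = np.array([True, True, False, False, True])
y_pred = np.array([True, False, False, True, True])

# Fraction of wrong predictions: 2 mismatches out of 5 samples
error_rate = 1 - np.mean(y_true == y_pred)

# A trailing comma turns the value into a one-element tuple
fitness = error_rate,

print(fitness)
```

A fitness of 0 therefore means every sample is classified correctly, and 0.5 is what a constant guess would score on a perfectly balanced set.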

To use the fitness function above with GRAPE, the features must be in the rows and the samples in the columns, so if your data is not arranged like that, you need to transpose the matrix.

Check the printed shapes. If you run this cell twice, the matrices will be transposed back and the fitness function will not work properly.

In [None]:
X_train = np.transpose(X_train)
X_test = np.transpose(X_test)

print('Training (X,Y):\t', X_train.shape, y_train.shape)
print('Test (X):\t', X_test.shape)

Training (X,Y):	 (14, 2000) (2000,)
Test (X):	 (14, 4923)
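One way to make the transposition safe to re-run is to check the shape before transposing. The helper below, `features_first`, is a hypothetical sketch (not part of GRAPE) that assumes a 2-D array and a known number of features. If the matrix happens to be square, it cannot tell the two layouts apart and returns the input unchanged.

```python
import numpy as np

def features_first(X, n_features):
    """Return X with shape (n_features, n_samples), transposing only if needed."""
    X = np.asarray(X)
    if X.shape[0] == n_features:
        return X            # already features x samples
    if X.shape[1] == n_features:
        return X.T          # samples x features, so transpose once
    raise ValueError(f"no axis of {X.shape} has length {n_features}")

# Same shape as the prepared training data before the transpose
X = np.zeros((2000, 14))
once = features_first(X, 14)
twice = features_first(once, 14)   # a second call changes nothing

print(once.shape, twice.shape)
```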


Set the Grammatical Evolution parameters.

In [None]:
POPULATION_SIZE = 4000
MAX_GENERATIONS = 100
P_CROSSOVER = 0.8
P_MUTATION = 0.05
ELITE_SIZE = round(0.01*POPULATION_SIZE) #it should be smaller or equal to HALLOFFAME_SIZE
HALLOFFAME_SIZE = round(0.01*POPULATION_SIZE) #it should be at least 1

TOURNAMENT_SIZE = 100
RANDOM_SEED = 42
random.seed(RANDOM_SEED)

CODON_CONSUMPTION = 'lazy'
GENOME_REPRESENTATION = 'list'
MAX_GENOME_LENGTH = None

MAX_INIT_TREE_DEPTH = 14
MIN_INIT_TREE_DEPTH = 5
MAX_TREE_DEPTH = 20
MAX_WRAPS = 0
CODON_SIZE = 255

REPORT_ITEMS = ['gen', 'invalid', 'avg', 'std', 'min', 'max',
                'best_ind_length', 'avg_length',
                'best_ind_nodes', 'avg_nodes',
                'best_ind_depth', 'avg_depth',
                'avg_used_codons', 'best_ind_used_codons',
                'structural_diversity', 'fitness_diversity',
                'selection_time', 'generation_time']

Create a toolbox.

In [None]:
toolbox = base.Toolbox()

# define a single objective, minimising fitness strategy:
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))

creator.create('Individual', grape.Individual, fitness=creator.FitnessMin)

toolbox.register("populationCreator", grape.sensible_initialisation, creator.Individual)

toolbox.register("evaluate", fitness_eval)

# Tournament selection:
toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)

# Single-point crossover:
toolbox.register("mate", grape.crossover_onepoint)

# Flip-int mutation:
toolbox.register("mutate", grape.mutation_int_flip_per_codon)
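To make the selection step concrete, here is a simplified stand-in for tournament selection under a minimising fitness. It is not DEAP's implementation of `tools.selTournament`; the function names and the use of plain fitness lists are assumptions for illustration. Each tournament draws `tournsize` individuals at random and keeps the one with the lowest error.

```python
import random

def best_of(contenders, fitnesses):
    """Index of the contender with the lowest fitness (minimisation)."""
    return min(contenders, key=lambda i: fitnesses[i])

def tournament_min(fitnesses, tournsize, rng):
    """Draw tournsize random individuals (with replacement) and return the best one."""
    contenders = [rng.randrange(len(fitnesses)) for _ in range(tournsize)]
    return best_of(contenders, fitnesses)

# Made-up error rates for a population of three individuals
fitnesses = [0.30, 0.10, 0.50]
rng = random.Random(42)
winner = tournament_min(fitnesses, tournsize=2, rng=rng)

print(winner, fitnesses[winner])
```

Larger tournaments raise the selection pressure, since the best individuals in the population are more likely to appear in each draw.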

In [None]:
# create initial population (generation 0):
population = toolbox.populationCreator(pop_size=POPULATION_SIZE,
                                           bnf_grammar=BNF_GRAMMAR,
                                           min_init_depth=MIN_INIT_TREE_DEPTH,
                                           max_init_depth=MAX_INIT_TREE_DEPTH,
                                           codon_size=CODON_SIZE,
                                           codon_consumption=CODON_CONSUMPTION,
                                           genome_representation=GENOME_REPRESENTATION
                                            )

# define the hall-of-fame object:
hof = tools.HallOfFame(HALLOFFAME_SIZE)

# prepare the statistics object:
stats = tools.Statistics(key=lambda ind: ind.fitness.values)
stats.register("avg", np.nanmean)
stats.register("std", np.nanstd)
stats.register("min", np.nanmin)
stats.register("max", np.nanmax)

Run Grammatical Evolution.

In [None]:
population, logbook = algorithms.ge_eaSimpleWithElitism(population, toolbox, cxpb=P_CROSSOVER, mutpb=P_MUTATION,
                                              ngen=MAX_GENERATIONS, elite_size=ELITE_SIZE,
                                              bnf_grammar=BNF_GRAMMAR,
                                              codon_size=CODON_SIZE,
                                              max_tree_depth=MAX_TREE_DEPTH,
                                              max_genome_length=MAX_GENOME_LENGTH,
                                              points_train=[X_train, y_train],
                                              codon_consumption=CODON_CONSUMPTION,
                                              report_items=REPORT_ITEMS,
                                              genome_representation=GENOME_REPRESENTATION,
                                              stats=stats, halloffame=hof, verbose=False)

gen = 0 , Fitness = (0.26049999999999995,)
gen = 1 , Best fitness = (0.22050000000000003,) , Length of the best ind = 106
gen = 2 , Best fitness = (0.22050000000000003,) , Length of the best ind = 106
gen = 3 , Best fitness = (0.22050000000000003,) , Length of the best ind = 106
gen = 4 , Best fitness = (0.22050000000000003,) , Length of the best ind = 106
gen = 5 , Best fitness = (0.22050000000000003,) , Length of the best ind = 106
gen = 6 , Best fitness = (0.22050000000000003,) , Length of the best ind = 19
gen = 7 , Best fitness = (0.22050000000000003,) , Length of the best ind = 163
gen = 8 , Best fitness = (0.21599999999999997,) , Length of the best ind = 33
gen = 9 , Best fitness = (0.21599999999999997,) , Length of the best ind = 33
gen = 10 , Best fitness = (0.21599999999999997,) , Length of the best ind = 33
gen = 11 , Best fitness = (0.21599999999999997,) , Length of the best ind = 33
gen = 12 , Best fitness = (0.21599999999999997,) , Length of the best ind = 33
gen = 13 , B

Show the best individual as an expression.

In [None]:
# Best individual
import textwrap
best = hof.items[0].phenotype
print("Best individual: \n","\n".join(textwrap.wrap(best,80)))
print("\nTraining Fitness: ", hof.items[0].fitness.values[0])
print("Depth: ", hof.items[0].depth)
print("Length of the genome: ", len(hof.items[0].genome))
print(f'Used portion of the genome: {hof.items[0].used_codons/len(hof.items[0].genome):.2f}')

Best individual: 
 greater_than_or_equal(sub(x[1],add(x[7],x[7])),
add(sub(x[3],sub(x[1],add(x[3],x[7]))),add(sub(x[7],x[4]),add(x[6],x[3]))))

Training Fitness:  0.21050000000000002
Depth:  9
Length of the genome:  274
Used portion of the genome: 0.12
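The best individual can be read more easily after expanding the nested calls. Collecting terms gives the rule `2*x[1] + x[4] >= 3*x[3] + 4*x[7] + x[6]`, that is, with the column order of the prepared data, `2*Age + FoodCourt >= 3*RoomService + 4*VRDeck + Spa`. The sketch below checks that simplification numerically on random integers. The three helper functions are plain NumPy stand-ins written for this check, assumed to behave like the ones imported from `functions.py`.

```python
import numpy as np

# Stand-ins (assumptions) for the functions used in the phenotype
def add(a, b):
    return a + b

def sub(a, b):
    return a - b

def greater_than_or_equal(a, b):
    return a >= b

# Random integer features: 14 rows (features) by 200 columns (samples)
rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=(14, 200))

# The phenotype exactly as printed for the best individual
original = eval(
    "greater_than_or_equal(sub(x[1],add(x[7],x[7])),"
    "add(sub(x[3],sub(x[1],add(x[3],x[7]))),add(sub(x[7],x[4]),add(x[6],x[3]))))"
)

# The same rule after collecting terms
simplified = 2 * x[1] + x[4] >= 3 * x[3] + 4 * x[7] + x[6]

print(np.array_equal(original, simplified))
```

Read this way, the evolved rule predicts a passenger as transported when spending on room service, the VR deck and the spa is low relative to age and food court spending.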


Define a function to predict values.

In [None]:
def predict(individual, X):
    # The phenotype refers to the feature matrix by the name x
    x = X

    if individual.invalid:
        return np.nan,

    # Evaluate the expression
    try:
        pred = eval(individual.phenotype)
    except (FloatingPointError, ZeroDivisionError, OverflowError,
            MemoryError):
        return np.nan,
    assert np.isrealobj(pred)

    # One prediction per sample (samples are in the columns)
    _, c = x.shape

    try:
        Y_class = [True if pred[i] > 0 else False for i in range(c)]
    except (IndexError, TypeError):
        return np.nan,

    return Y_class

Predict the classes of the test set.

In [None]:
y_pred = predict(hof.items[0], X_test)
for i in range(len(y_pred)):
  y_pred[i] = str(y_pred[i]).upper()
print("Predicted classes of the test set: ", y_pred)

Predicted classes of the test set:  ['FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'FALSE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE', 'FALSE', 

Save the predictions to a .csv file and submit it to the Kaggle competition.

The format is as follows:
1. First column is the original `PassengerId` column in the test set;
2. Second column is named `Transported` and contains the binary predictions (the code in this notebook writes them as the strings `TRUE` and `FALSE`).

In [None]:
df_id = df_test['PassengerId']
df_class = pd.DataFrame(data=y_pred, columns = ['Transported'])
df_pred = pd.concat([df_id, df_class], axis=1)

df_pred.to_csv('predictions1.csv', sep=',', index=False)
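Before uploading, it can help to verify that the file has the expected layout. The helper below, `check_submission`, is a hypothetical sketch: the column names come from the format described above, while the allowed labels `TRUE` and `FALSE` are an assumption based on what this notebook writes.

```python
import pandas as pd

def check_submission(df, expected_ids, allowed=("TRUE", "FALSE")):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ["PassengerId", "Transported"]:
        return ["columns must be PassengerId, Transported"]
    problems = []
    if list(df["PassengerId"]) != list(expected_ids):
        problems.append("PassengerId values differ from the test set")
    if not df["Transported"].isin(allowed).all():
        problems.append("Transported has values outside the allowed labels")
    return problems

# Made-up frames for illustration
good = pd.DataFrame({"PassengerId": [2000, 2001], "Transported": ["FALSE", "TRUE"]})
bad = pd.DataFrame({"PassengerId": [2000, 2001], "Transported": ["FALSE", "maybe"]})

print(check_submission(good, [2000, 2001]))
print(check_submission(bad, [2000, 2001]))
```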

## Summary



In conclusion, my submission achieved a score of 0.78.




