# Titanic - Redes Neurais

Primeiro, lemos os arquivos e filtramos o dataset.

A coluna Survived do dataset de treino foi separada,
pois nela estão os labels dos dados de treino.

In [104]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

trainDataset = pd.read_csv('train.csv')
testData = pd.read_csv('test.csv')

trainLabels = trainDataset['Survived']
trainData = trainDataset.drop(['Survived'], axis=1)

## Pré processamento

Primeiramente, buscamos analisar os dados. Queremos verificar
quais dados precisamos pré processar para que estes possam ser
processados corretamente pelo algoritmo de rede neural:

In [105]:
display(trainData.head(10))
display(trainLabels.head(5))

print('Tipos:')
display(trainData.dtypes)

print('Únicos:')
display(trainData.nunique())

print('Nulos:')
display(trainData.isna().sum())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Tipos:


PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Únicos:


PassengerId    891
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

Nulos:


PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Como percebemos que há valores textuais e valores nulos, buscaremos
transformar esses dados em valores numéricos, para que possam ser processados pelo algoritmo.

Para as colunas com valores textuais, as que tem um número pequeno de valores
repetidos podem ser interpretadas como classes, em que cada valor representa
uma classe diferente (como as colunas Sex e Embarked). A partir disso,
podemos mapear essas classes para valores numéricos que irão representá-las.
Usaremos o OneHotEncoder para mapear cada classe para um novo atributo.

A coluna Pclass, apesar de numérica, também representa classes, portanto também separaremos esta em uma coluna para cada classe.

In [106]:
from sklearn.preprocessing import OneHotEncoder

def split_class(trainData, testData, c):
    enc = OneHotEncoder(handle_unknown='ignore')
    col = trainData[c].to_numpy().reshape(-1, 1)
    enc.fit(col)
    cols = np.vectorize(lambda s: f'{c}_{s}')(np.char.capitalize(enc.categories_[0].astype('str')))
    
    trainData = trainData.loc[:, trainData.columns != c].join(pd.DataFrame(enc.transform(col).toarray(), columns=cols))
    testData = testData.loc[:, testData.columns != c].join(pd.DataFrame(enc.transform(col).toarray(), columns=cols))
    return trainData, testData

trainData, testData = split_class(trainData, testData, 'Sex')
trainData, testData = split_class(trainData, testData, 'Embarked')
trainData, testData = split_class(trainData, testData, 'Pclass')

display(trainData.head(4))
display(testData.head(4))

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,892,"Kelly, Mr. James",34.5,0,0,330911,7.8292,,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,893,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0,,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,894,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,895,"Wirz, Mr. Albert",27.0,0,0,315154,8.6625,,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


Agora, iremos tratar os campos sem valor definido, retirando os dados faltantes.
Para transformá-los em valores numéricos, iremos preencher os dados faltantes
em cada coluna com o valor mais frequente e com a mediana na coluna Age (visto que é inteira):

Iremos, também, separar a coluna com o id dos passageiros para que esta não afete
o modelo a ser treinado, mas usaremos esta depois para fazer a previsão dos dados
do titanic.

As outras colunas com texto deverão ser descartadas (Name, Ticket e Cabin):

In [107]:
# Retirando colunas de texto
trainData = trainData.drop(['Name', 'Ticket', 'Cabin'], axis=1)
testData = testData.drop(['Name', 'Ticket', 'Cabin'], axis=1)

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
ageImp = SimpleImputer(missing_values=np.nan, strategy='median')

trainData.loc[:, trainData.columns != 'Age'] = imp.fit_transform(trainData.loc[:, trainData.columns != 'Age'])
trainData[['Age']] = ageImp.fit_transform(trainData[['Age']])

testData.loc[:, testData.columns != 'Age'] = imp.transform(testData.loc[:, testData.columns != 'Age'])
testData[['Age']] = ageImp.transform(testData[['Age']])

trainIds = trainData.pop('PassengerId')
testIds = testData.pop('PassengerId')

display(trainData.head(4))
display(testData.head(4))
display(trainIds.head(4))
display(testIds.head(4))

  trainData.loc[:, trainData.columns != 'Age'] = imp.fit_transform(trainData.loc[:, trainData.columns != 'Age'])
  testData.loc[:, testData.columns != 'Age'] = imp.transform(testData.loc[:, testData.columns != 'Age'])


Unnamed: 0,Age,SibSp,Parch,Fare,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,22.0,1.0,0.0,7.25,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,38.0,1.0,0.0,71.2833,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,26.0,0.0,0.0,7.925,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,35.0,1.0,0.0,53.1,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


Unnamed: 0,Age,SibSp,Parch,Fare,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,34.5,0.0,0.0,7.8292,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,47.0,1.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,62.0,0.0,0.0,9.6875,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,27.0,0.0,0.0,8.6625,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


0    1.0
1    2.0
2    3.0
3    4.0
Name: PassengerId, dtype: float64

0    892.0
1    893.0
2    894.0
3    895.0
Name: PassengerId, dtype: float64

Podemos ver que os dados ainda não estão normalizados. Iremos, portanto, normalizar os dados para que estes fiquem entre 0 e 1:

In [108]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

trainData[:] = scaler.fit_transform(trainData)
testData[:] = scaler.transform(testData)

display(trainData.head(4))
display(testData.head(4))

Unnamed: 0,Age,SibSp,Parch,Fare,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,0.271174,0.125,0.0,0.014151,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.472229,0.125,0.0,0.139136,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.321438,0.0,0.0,0.015469,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.434531,0.125,0.0,0.103644,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


Unnamed: 0,Age,SibSp,Parch,Fare,Sex_Female,Sex_Male,Embarked_C,Embarked_Q,Embarked_S,Embarked_Nan,Pclass_1,Pclass_2,Pclass_3
0,0.428248,0.0,0.0,0.015282,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.585323,0.125,0.0,0.013663,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.773813,0.0,0.0,0.018909,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,0.334004,0.0,0.0,0.016908,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


## Treinando e analisando o modelo

Agora, podemos treinar o nosso modelo.

Vamos fazer um k-fold com 5 divisões testando o modelo com os otimizadores rmsprop e adam e com batch sizes de 32, 64, 96, 128 com 10 épocas:

In [None]:
from tensorflow import keras
from tensorflow.keras import layers, optimizers
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


def train_model(m, bs, epochs, n_splits, X, y):
    split_scores = []
    kf = StratifiedKFold(n_splits, shuffle=False)
    for i, (tr_id, te_id) in enumerate(kf.split(X, y)):
        trData, trLabels = X.iloc[tr_id], y.iloc[tr_id]
        teData, teLabels = X.iloc[te_id], y.iloc[te_id]
        
        m.fit(trData, trLabels, epochs=epochs, batch_size=bs, verbose=0, use_multiprocessing=True)
        split_scores.append(m.evaluate(teData, teLabels, verbose=0)[1])
    
    return np.average(split_scores)


def train_keras(n_splits, X, y, models_scores, m_creator, epochs):
    for opt in adams:
        # 8 16 32 64 128
        for bs in (2**i for i in range(3, 6)):
            m = m_creator(opt)
            avg_score = train_model(m, bs, epochs, n_splits, X, y)
            models_scores.append([(opt, bs), m, avg_score])


def create_keras(num_cols, n_layers, n_neurons):
    model = keras.Sequential()
    
    for i in range(n_layers):
        model.add(layers.Dense(
            n_neurons,
            activation='relu',
            kernel_initializer='uniform'
        ))
    
    model.add(layers.Dense(
        1,
        activation='softmax',
        kernel_initializer='uniform'
    ))

    model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model


accuracies = []
highest = (None, None, -1)
X, y = trainData, trainLabels
kf = StratifiedKFold(5, shuffle=True, random_state=42)
for i, (tr_id, te_id) in enumerate(kf.split(X, y)):
    trData, trLabels = X.iloc[tr_id], y.iloc[tr_id]
    teData, teLabels = X.iloc[te_id], y.iloc[te_id]

    models_scores = []
    num_cols = len(trainData.columns)
    train_keras(5, trData, trLabels, models_scores, lambda *args: create_keras(num_cols, *args), 10)
    
    scores = np.array(list(zip(*models_scores))[2])
    highest_accuracy = models_scores[np.argmax(scores)]
    
    (opt, bs), _, score = tuple(highest_accuracy)
    print('Model with highest accuracy:', (opt, bs), ', average accuracy: ', score)
    
    m = create_keras(num_cols, opt)
    m.fit(trData, trLabels, epochs=10, batch_size=bs, verbose=0, use_multiprocessing=True)
    acc = m.evaluate(teData, teLabels, verbose=0)[1]
    accuracies.append(acc)
    
    print('Model accuracy with test dataset:', acc)

    if acc > highest[2]:
        highest = ((opt, bs), m, acc)

total_acc = np.average(accuracies)
print('Average accuracy for all models:', total_acc)

print('Model with highest overall accuracy:', highest[0], ', accuracy with test dataset:', highest[2])


## Prevendo resultados para o dataset de teste

Agora podemos usar o nosso modelo para prever o dataset de teste
e submeter os dados de sobrevivência previstos.

Primeiro, vamos treinar o modelo por mais épocas e, depois, o usaremos
para prever os resultados:

In [8]:
opt, bs = highest[0]

model = create_keras(len(trainData.columns), opt, bs, 100)
model.fit(trData, trLabels)

testLabelsPredict = model.predict(testData)
display(testIds[:4])
display(testLabelsPredict[:4])
submission = pd.DataFrame()
submission['PassengerId'] = testIds
submission['Survived'] = testLabelsPredict
display(submission.head(6))
submission.to_csv('submission.csv', index=False)



0    892
1    893
2    894
3    895
Name: PassengerId, dtype: int64

array([[1.],
       [1.],
       [1.],
       [1.]], dtype=float32)

Unnamed: 0,PassengerId,Survived
0,892,1.0
1,893,1.0
2,894,1.0
3,895,1.0
4,896,1.0
5,897,1.0


Após submeter os dados para a plataforma, o score recebido foi de 0.79665.