Utilizar o conjunto de dados Titanic.

Entregáveis:
* Conjunto de teste não utilizado no treinamento
* Análise dos resultados (porque escolheu esse modelo, como foi a avaliação, etc)
* Utilizar FastAPI para fazer o predict de cada amostra individualmente

Dicas:
* Realizar limpeza dos dados
* Realizar experimentos utilizando normalização, padronização e dados brutos
* Fazer categorização dos dados. Se fizer categorização, realizar experimentos utilizando OneHotEncoding
* Realizer experimentos selecionando atributos
* Utilizar técnicas de otimização de hiperparametros


## Importações

In [2]:
import pandas as pd
from sklearn.preprocessing import Normalizer,LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
import joblib
import glob
import os

In [3]:
df_test = pd.read_csv("datasets/test.csv")
df_train = pd.read_csv("datasets/train.csv")
df_test_surv = pd.read_csv('datasets/gender_submission.csv')

## Recebendo dados e analizando

In [4]:
df_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [5]:
y_test = df_test_surv["Survived"]

In [6]:
df_proc = df_train.drop(columns=["PassengerId","Name","Ticket","Fare","Cabin","Embarked",])
x_test = df_test.drop(columns=["PassengerId","Name","Ticket","Fare","Cabin","Embarked",])

In [7]:
df_proc.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch
0,0,3,male,22.0,1,0
1,1,1,female,38.0,1,0
2,1,3,female,26.0,0,0
3,1,1,female,35.0,1,0
4,0,3,male,35.0,0,0


In [8]:
x_train = df_proc.drop(columns=["Survived"])

In [9]:
y_train = df_proc["Survived"]

In [10]:
#Substitui os NaN's por 0
x_train = x_train.fillna(0.0)
x_test = x_test.fillna(0.0)

In [11]:
x_train["Sex"] = LabelEncoder().fit_transform(x_train["Sex"])
x_test["Sex"] = LabelEncoder().fit_transform(x_test["Sex"])
# male == 1 , female == 0 

In [12]:
x_train.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch
0,3,1,22.0,1,0
1,1,0,38.0,1,0
2,3,0,26.0,0,0
3,1,0,35.0,1,0
4,3,1,35.0,0,0


## Pré-processamento

In [13]:
transformer = Normalizer().fit(x_train)
x_train = transformer.transform(x_train)
transformer = Normalizer().fit(x_test)
x_test = transformer.transform(x_test)

## KNN

In [11]:
nknn = [1,3,5,7,9,11,13,15]
params = {'n_neighbors': nknn}
knn = KNeighborsClassifier()
gs_knn = GridSearchCV(knn,params)
gs_knn.fit(x_train,y_train)

GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]})

In [12]:
pred_knn = gs_knn.predict(x_test)

In [13]:
accuracy_score(y_test,pred_knn)
print(f"Accuracy = {accuracy_score(y_test,pred_knn)}")

Accuracy = 0.8325358851674641


## Decision Tree

In [14]:
criterion = ['gini','entropy']
params = {'criterion': criterion}
dt = DecisionTreeClassifier()
gs_dt = GridSearchCV(dt,params)
gs_dt.fit(x_train,y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy']})

In [15]:
pred_dt = gs_dt.predict(x_test)

In [16]:
print(f"Accuracy = {accuracy_score(pred_dt,y_test)}")

Accuracy = 0.8086124401913876


## Random Forest

In [17]:
rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)

RandomForestClassifier()

In [18]:
lista_rf = [x for x in range(50,300,50)]
params = {'n_estimators':lista_rf}
rfc = RandomForestClassifier()
clf_rf = RandomizedSearchCV(rfc, params, n_iter=3)
clf_rf.fit(x_train,y_train)

RandomizedSearchCV(estimator=RandomForestClassifier(), n_iter=3,
                   param_distributions={'n_estimators': [50, 100, 150, 200,
                                                         250]})

In [19]:
pred_rfc = clf_rf.predict(x_test)

In [20]:
print(f"Accuracy = {accuracy_score(pred_rfc,y_test)}")

Accuracy = 0.8397129186602871


## Bayes

In [21]:
gnb = GaussianNB()

In [22]:
pred = gnb.fit(x_train,y_train).predict(x_test)

In [23]:
print(f"Accuracy = {accuracy_score(pred,y_test)}")

Accuracy = 0.65311004784689


## MLP

In [24]:
clf = MLPClassifier().fit(x_train,y_train)



In [25]:
pred = clf.predict(x_test)

In [26]:
print(f"Accuracy = {accuracy_score(pred,y_test)}")

Accuracy = 0.8421052631578947


### Tentando melhorar o MLP

In [14]:
solver = ['sgd', 'adam']
hidden_layer = [(100,), (50,50), (100,50)]
params = {'solver': solver,'hidden_layer_sizes':hidden_layer}
mlp = MLPClassifier(max_iter=4500)
clf_mlp = RandomizedSearchCV(mlp,params, n_jobs=-1)
clf_mlp.fit(x_train,y_train)
pred2 = clf_mlp.predict(x_test)



In [15]:
print(f"Accuracy = {accuracy_score(y_test,pred2)}")

Accuracy = 0.9234449760765551


## Resultados dos classificadores (com hiperparametros default)

* KNN - 0.8325358851674641
* Decision Tree - 0.8277511961722488
* Random Forrest - 0.8373205741626795
* Bayes - 0.65311004784689
* MLP - 0.7990430622009569

## Resultado dos classificadores (Utilizando GridSearch e RandomSearch)

*  KNN - 0.8325358851674641
*  Decision Tree - 0.8325358851674641
*  Random Forest - 0.8421052631578947
*  MLP - 0.9401913875598086

## Carregando o modelo

In [21]:
save_models_path = './results'
if not os.path.exists(save_models_path):
    os.mkdir(save_models_path)

In [22]:
filename = save_models_path+'/titanic.pkl'
joblib.dump(clf_mlp, filename)

['./results/titanic.pkl']

In [23]:
model_load = joblib.load('./results/titanic.pkl')

In [26]:
accuracy_score(y_test,model_load.predict(x_test))

0.9354066985645934

## Conclusão

###  Baseado nos testes realizados com cinco classificadores, entendi que o mais adequado para a situação seja o Multilayer Perceptron, pois seus resultados foram de 93,06% de precisão quando utilizado técnica de otimização, como o Randomized Search que alterei o hiperparâmetro do solver e max_iter, quando executei sem alteração nos hiperparametros, encontrei o Random Forest como o melhor classificador com cerca de 83,73% de precisão.