## **Predicting Titanic Survivors**
---
The goal of this project is to predict which passengers survived the sinking of the Titanic, based on the features (attributes) recorded for each passenger.

---
The dataset contains the following features:

*   PassengerId (unique identifier of each passenger)
*   Survived (survival) - 0 for "no" and 1 for "yes"
*   Pclass (ticket class) - 1 for first class (upper), 2 for second class (middle), and 3 for third class (lower)
*   Name (name of each passenger)
*   Sex (sex) - male and female
*   Age (age)
*   SibSp (number of siblings/spouses aboard the Titanic)
*   Parch (number of parents/children aboard the Titanic)
*   Ticket (ticket number)
*   Fare (passenger fare)
*   Cabin (cabin number)
*   Embarked (port of embarkation) - C for "Cherbourg", Q for "Queenstown", and S for "Southampton"

The data are already split into training and test sets (the test set does not include the target column, `Survived`) and were obtained from [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview).

Author: Whendel Muniz dos Santos

E-mail: whendel.muniz@ufpe.br

## Library Installation and Imports

In [None]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [None]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from io import open
from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostClassifier, Pool, cv
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from scipy import stats

## Data Loading and First Analysis

In [None]:
tit_train = pd.read_csv('/content/train.csv')
tit_test = pd.read_csv("/content/test.csv")

In [None]:
tit_train  # Training dataset, unmodified

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
tit_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [None]:
tit_train.isnull().sum()  # There are missing (null) values, mainly in the passengers' age (Age) and cabin identifier (Cabin).

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
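
Raw counts are easier to compare across columns as percentages. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in for `tit_train`, and its values are illustrative only) to show one way of doing that:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tit_train (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

Applied to the real training data, the same expression would show that most `Cabin` values are missing (687 of 891 rows).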

In [None]:
tit_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [None]:
tit_passid = tit_test['PassengerId']  # Keep only the PassengerId column, since it will be needed later for the test predictions.
tit_passid

0       892
1       893
2       894
3       895
4       896
       ... 
413    1305
414    1306
415    1307
416    1308
417    1309
Name: PassengerId, Length: 418, dtype: int64

In [None]:
tit_test.isnull().sum()  # The test data also has null values, mainly in Age and Cabin (plus one in Fare).

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64



---
There are several ways to handle null values. In this project, the performance of the machine learning methods is compared across three data treatments:

*   Treatment 1 (control): drop the rows with null values and drop the columns that are not useful for machine learning;
*   Treatment 2: fill the missing values using values related to other columns and drop the columns that are not useful for machine learning;
*   Treatment 3: Treatment 2 with an improved way of filling in Age.


---






## General Data Treatment

### Sex

In [None]:
tit_train['Sex'] = tit_train['Sex'].replace(['male','female'],[0,1])
tit_test['Sex'] = tit_test['Sex'].replace(['male','female'],[0,1])
# Male and female were replaced with 0 and 1, respectively, in both the training and test data.

### Name

In [None]:
tit_train['Name'] = tit_train['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())  # The title (form of address) is already a good feature for ML, so the rest of each name was removed, keeping only the title.

In [None]:
tit_train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Mr,0,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,Mrs,1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Miss,1,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,Mrs,1,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,Mr,0,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,Rev,0,27.0,0,0,211536,13.0000,,S
887,888,1,1,Miss,1,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,Miss,1,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,Mr,0,26.0,0,0,111369,30.0000,C148,C


In [None]:
tit_test['Name'] = tit_test['Name'].apply(lambda name: name.split(',')[1].split('.')[0].strip())  # The same approach is applied to the test dataset

In [None]:
tit_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,Mr,0,34.5,0,0,330911,7.8292,,Q
1,893,3,Mrs,1,47.0,1,0,363272,7.0000,,S
2,894,2,Mr,0,62.0,0,0,240276,9.6875,,Q
3,895,3,Mr,0,27.0,0,0,315154,8.6625,,S
4,896,3,Mrs,1,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,Mr,0,,0,0,A.5. 3236,8.0500,,S
414,1306,1,Dona,1,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,Mr,0,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,Mr,0,,0,0,359309,8.0500,,S


In [None]:
tit_train['Name'].value_counts()  # Count of every title in the training dataset

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: Name, dtype: int64

In [None]:
tit_test['Name'].value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: Name, dtype: int64

In [None]:
tit_train.loc[tit_train['Name'] == 'Dr']  # Check the sex of the passengers whose title is Dr.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
245,246,0,1,Dr,0,44.0,2,0,19928,90.0,C78,Q
317,318,0,2,Dr,0,54.0,0,0,29011,14.0,,S
398,399,0,2,Dr,0,23.0,0,0,244278,10.5,,S
632,633,1,1,Dr,0,32.0,0,0,13214,30.5,B50,C
660,661,1,1,Dr,0,50.0,2,0,PC 17611,133.65,,S
766,767,0,1,Dr,0,,0,0,112379,39.6,,C
796,797,1,1,Dr,1,49.0,0,0,17465,25.9292,D17,S


In [None]:
tit_test.loc[tit_test['Name'] == 'Dr']

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
293,1185,1,Dr,0,53.0,1,1,33638,81.8583,A34,S


In [None]:
tit_train.loc[tit_train['Name'] == 'Major']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
449,450,1,1,Major,0,52.0,0,0,113786,30.5,C104,S
536,537,0,1,Major,0,45.0,0,0,113050,26.55,B38,S


---
Among the titles, Mr, Miss (girl or unmarried woman), Mrs (married woman), and Master (used at the time for boys who were not yet adults) are by far the most frequent. Dr (doctor, used for both men and women), Rev (Reverend, a member of the clergy), Mlle (French title equivalent to Miss), Major (military rank), Col (Colonel), the Countess, Capt (Captain), Ms (a woman who did not want to be identified by marital status), Sir, Lady, Mme (French equivalent of Mrs), Don and Dona (Spanish honorifics for a man and a woman, respectively), and Jonkheer (Dutch honorific for a member of the untitled nobility) appear only a few times in the two datasets. For practical reasons, these values are reassigned to the four main titles according to the closest estimated match. The only female Dr (index 796 in the training data) is set to Mrs separately.


In [None]:
tit_train['Name'] = tit_train['Name'].replace(['Dr','Rev','Mlle','Major','Col','the Countess','Capt','Ms','Sir','Lady','Mme','Don','Jonkheer'],['Mr','Mr','Miss','Mr','Mr','Mrs','Mr','Miss','Mr','Mrs','Mrs','Mr','Master'])
tit_test['Name'] = tit_test['Name'].replace(['Dr','Col','Rev','Ms','Dona',],['Mr','Mr','Mr','Miss','Mrs'])

In [None]:
tit_train.at[796, 'Name'] = 'Mrs'
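
The same regrouping can also be written as a single dictionary lookup, with `Dr` resolved from the `Sex` column instead of a hard-coded row index. This is only a sketch under assumptions: `TITLE_MAP` and `normalize_titles` are hypothetical names that do not appear in the notebook, and the toy frame assumes `Sex` is already encoded as 0 (male) and 1 (female).

```python
import pandas as pd

# Hypothetical mapping from rare titles to the four main groups
TITLE_MAP = {
    "Rev": "Mr", "Major": "Mr", "Col": "Mr", "Capt": "Mr", "Sir": "Mr", "Don": "Mr",
    "Mlle": "Miss", "Ms": "Miss",
    "the Countess": "Mrs", "Lady": "Mrs", "Mme": "Mrs", "Dona": "Mrs",
    "Jonkheer": "Master",
}

def normalize_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with rare titles in 'Name' merged into Mr/Mrs/Miss/Master."""
    out = df.copy()
    out["Name"] = out["Name"].replace(TITLE_MAP)
    # 'Dr' is ambiguous, so it is resolved from the Sex column (0 = male, 1 = female)
    is_dr = out["Name"] == "Dr"
    out.loc[is_dr & (out["Sex"] == 1), "Name"] = "Mrs"
    out.loc[is_dr & (out["Sex"] == 0), "Name"] = "Mr"
    return out

toy = pd.DataFrame({"Name": ["Dr", "Dr", "Mlle", "Mr", "Jonkheer"], "Sex": [0, 1, 1, 0, 0]})
print(normalize_titles(toy)["Name"].tolist())
```

One mapping serves both the training and the test file, so a title that appears in only one of them (such as `Dona`) needs no separate call.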

## Specific Treatments and Model Tests

### Treatment 1

#### Deleting Null Values


---
The **dropna()** function with the *subset* argument removes rows that have null values in specific columns. Since **isnull().sum()** already reports the number of null values in each column, the rows with nulls in those columns can be dropped.

---
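
As a minimal illustration of how `subset` restricts the check, the toy frame below (made-up values, not the Titanic data) keeps only the rows that have no null in the listed columns, while a null in another column is ignored:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", "B42", np.nan],
    "Fare": [7.25, np.nan, 7.92],
})

# Only rows with a null in 'Age' or 'Cabin' are dropped; nulls in 'Fare' are ignored
kept = toy.dropna(subset=["Age", "Cabin"])
print(kept.index.tolist())
print(len(toy) - len(kept), "rows dropped")
```

The original index labels are preserved, which is why the treated frames later in the notebook show non-consecutive row numbers. Dropping rows from the test file also means that no prediction can be made for the dropped passengers.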



In [None]:
tita_train_t1 = tit_train.copy()
tita_test_t1 = tit_test.copy()

In [None]:
tita_train_t1 = tit_train.dropna(subset=['Age', 'Cabin', 'Embarked'])
tita_train_t1.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [None]:
tita_test_t1 = tit_test.dropna(subset = ['Age', 'Cabin'])
tita_test_t1.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

#### New Dataset

In [None]:
tita_train_t1 = tita_train_t1.drop(columns=['Ticket','Cabin'])  # The Ticket and Cabin columns were dropped because they were judged not useful for machine learning.

In [None]:
tita_train_t1 = tita_train_t1.drop(columns=['PassengerId'])

In [None]:
tita_test_t1 = tita_test_t1.drop(columns=['Ticket','Cabin'])

In [None]:
tita_test_t1 = tita_test_t1.drop(columns=['PassengerId'])

In [None]:
tita_train_t1

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
1,1,1,Mrs,1,38.0,1,0,71.2833,C
3,1,1,Mrs,1,35.0,1,0,53.1000,S
6,0,1,Mr,0,54.0,0,0,51.8625,S
10,1,3,Miss,1,4.0,1,1,16.7000,S
11,1,1,Miss,1,58.0,0,0,26.5500,S
...,...,...,...,...,...,...,...,...,...
871,1,1,Mrs,1,47.0,1,1,52.5542,S
872,0,1,Mr,0,33.0,0,0,5.0000,S
879,1,1,Mrs,1,56.0,0,1,83.1583,C
887,1,1,Miss,1,19.0,0,0,30.0000,S


In [None]:
tita_test_t1

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
12,1,Mrs,1,23.0,1,0,82.2667,S
14,1,Mrs,1,47.0,1,0,61.1750,S
24,1,Mrs,1,48.0,1,3,262.3750,C
26,1,Miss,1,22.0,0,1,61.9792,C
28,1,Mr,0,41.0,0,0,30.5000,S
...,...,...,...,...,...,...,...,...
404,1,Mr,0,43.0,1,0,27.7208,C
405,2,Mr,0,20.0,0,0,13.8625,C
407,1,Mr,0,50.0,1,1,211.5000,C
411,1,Mrs,1,37.0,1,0,90.0000,Q


In [None]:
tita_train_t1.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [None]:
tita_test_t1.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

### Treatment 2



---
In this treatment, the missing data are analyzed and filled in based on their relationship with other columns of the dataset.

---




In [None]:
tit_train.corr(numeric_only=True)  # Shows the correlation of each feature with the other numeric features in the dataset


Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,-0.042939,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,0.543351,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.1319,-0.369226,0.083081,0.018443,-0.5495
Sex,-0.042939,0.543351,-0.1319,1.0,-0.093254,0.114631,0.245489,0.182333
Age,0.036847,-0.077221,-0.369226,-0.093254,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,0.114631,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,0.245489,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.182333,0.096067,0.159651,0.216225,1.0


#### Age


---
The correlation table shows that Age is more strongly correlated with ticket class (Pclass, about -0.37) than with any other feature. The missing ages can therefore be estimated from the mean age of each passenger's class.

---
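
A minimal sketch of this kind of group-mean imputation, on a made-up frame (the values are illustrative and are not the Titanic means):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 20.0, 30.0, np.nan],
})

# transform('mean') returns, for every row, the mean Age of that row's Pclass group
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# fillna with an index-aligned Series only touches the missing entries
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy["Age"].tolist())
```

Filling a group with its own mean leaves that group's mean unchanged, so the order in which the rows are filled does not matter.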




In [None]:
for i in tit_train['Pclass'].unique():
  print(f'Mean age for Pclass {i}: {tit_train[tit_train["Pclass"] == i]["Age"].mean()}')

Mean age for Pclass 3: 25.14061971830986
Mean age for Pclass 1: 38.233440860215055
Mean age for Pclass 2: 29.87763005780347


In [None]:
for i in tit_test['Pclass'].unique():
  print(f'Mean age for Pclass {i}: {tit_test[tit_test["Pclass"] == i]["Age"].mean()}')

Mean age for Pclass 3: 24.02794520547945
Mean age for Pclass 2: 28.7775
Mean age for Pclass 1: 40.91836734693877


In [None]:
tita_train_t2 = tit_train.copy()
tita_test_t2 = tit_test.copy()

In [None]:
# Fill each missing Age with the mean Age of the passenger's Pclass.
# Vectorized: groupby(...).transform('mean') gives each row its class mean, and fillna only touches the nulls.
tita_train_t2['Age'] = tita_train_t2['Age'].fillna(tita_train_t2.groupby('Pclass')['Age'].transform('mean'))


In [None]:
# Same fill for the test data: each missing Age gets the mean Age of the passenger's Pclass.
tita_test_t2['Age'] = tita_test_t2['Age'].fillna(tita_test_t2.groupby('Pclass')['Age'].transform('mean'))


#### Port of Embarkation (Embarked)

In [None]:
tita_train_t2[tita_train_t2['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,Miss,1,38.0,0,0,113572,80.0,B28,
829,830,1,1,Mrs,1,62.0,0,0,113572,80.0,B28,


In [None]:
for i in tita_train_t2['Embarked'].unique():
  print(f'Mean Pclass for Embarked = {i}: {tita_train_t2[tita_train_t2["Embarked"] == i]["Pclass"].mean()}')  # As an estimate: most passengers who boarded at port C were in 1st or 2nd class. Since the two passengers with a null port were in 1st class, they are assumed to have boarded at C.

Mean Pclass for Embarked = S: 2.3509316770186337
Mean Pclass for Embarked = C: 1.8869047619047619
Mean Pclass for Embarked = Q: 2.909090909090909
Mean Pclass for Embarked = nan: nan


In [None]:
tita_train_t2['Embarked'] = tita_train_t2['Embarked'].fillna('C')

#### Fare

In [None]:
tita_test_t2[tita_test_t2['Fare'].isnull()]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,Mr,0,60.5,0,0,3701,,,S


In [None]:
tita_test_t2[tita_test_t2["Pclass"] == 3]['Fare'].mean()  # Uses the mean fare of third class, since this specific passenger travelled in that class.

12.459677880184334

In [None]:
tita_test_t2['Fare'] = tita_test_t2['Fare'].fillna(tita_test_t2[tita_test_t2["Pclass"] == 3]['Fare'].mean())

#### New Dataset

In [None]:
tita_train_t2 = tita_train_t2.drop(columns=['Cabin','Ticket', 'PassengerId'])

In [None]:
tita_test_t2 = tita_test_t2.drop(columns=['Cabin','Ticket', 'PassengerId'])

In [None]:
tita_train_t2

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,Mr,0,22.00000,1,0,7.2500,S
1,1,1,Mrs,1,38.00000,1,0,71.2833,C
2,1,3,Miss,1,26.00000,0,0,7.9250,S
3,1,1,Mrs,1,35.00000,1,0,53.1000,S
4,0,3,Mr,0,35.00000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,0,2,Mr,0,27.00000,0,0,13.0000,S
887,1,1,Miss,1,19.00000,0,0,30.0000,S
888,0,3,Miss,1,25.14062,1,2,23.4500,S
889,1,1,Mr,0,26.00000,0,0,30.0000,C


In [None]:
tita_test_t2

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,Mr,0,34.500000,0,0,7.8292,Q
1,3,Mrs,1,47.000000,1,0,7.0000,S
2,2,Mr,0,62.000000,0,0,9.6875,Q
3,3,Mr,0,27.000000,0,0,8.6625,S
4,3,Mrs,1,22.000000,1,1,12.2875,S
...,...,...,...,...,...,...,...,...
413,3,Mr,0,24.027945,0,0,8.0500,S
414,1,Mrs,1,39.000000,0,0,108.9000,C
415,3,Mr,0,38.500000,0,0,7.2500,S
416,3,Mr,0,24.027945,0,0,8.0500,S


In [None]:
tita_train_t2.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [None]:
tita_test_t2.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64


### Treatment 3

---
In this treatment, the passengers' titles are taken into account. Titles carried real information at the time: Master, for example, was used for children and pre-adolescents, yet the mean age per Pclass does not match that age range, so using the class mean for those passengers would be inconsistent. Instead, the missing ages are filled with the mean age of each title in the dataset. The other columns are handled as in Treatment 2.


---
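
The notebook computes the per-title means separately for the training and the test file. A common alternative, shown here only as a sketch and not as what the notebook does, is to learn the means from the training data and reuse them on the test data, so that no test-set statistics enter the preprocessing. The helper names `fit_group_means` and `apply_group_means` are hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def fit_group_means(train: pd.DataFrame, key: str, target: str) -> pd.Series:
    """Mean of `target` per value of `key`, learned from the training data only."""
    return train.groupby(key)[target].mean()

def apply_group_means(df: pd.DataFrame, means: pd.Series, key: str, target: str) -> pd.DataFrame:
    """Return a copy of df with missing `target` values filled from the pre-computed group means."""
    out = df.copy()
    out[target] = out[target].fillna(out[key].map(means))
    return out

train = pd.DataFrame({"Name": ["Mr", "Mr", "Master", "Master"], "Age": [30.0, 34.0, 4.0, 6.0]})
test = pd.DataFrame({"Name": ["Master", "Mr"], "Age": [np.nan, 50.0]})

title_means = fit_group_means(train, key="Name", target="Age")
filled = apply_group_means(test, title_means, key="Name", target="Age")
print(filled["Age"].tolist())
```

A title that appears only in the test data would map to NaN and stay unfilled, so in practice a fallback such as the overall training mean would be needed.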




In [None]:
tita_train_t3 = tit_train.copy()
tita_test_t3 = tit_test.copy()

#### Age

In [None]:
for i in tita_train_t3['Name'].unique():
  print(f'Mean age for title {i}: {tita_train_t3[tita_train_t3["Name"] == i]["Age"].mean()}')

Mean age for title Mr: 32.97235576923077
Mean age for title Mrs: 35.99107142857143
Mean age for title Miss: 21.845637583892618
Mean age for title Master: 5.477567567567568


In [None]:
for i in tita_test_t3['Name'].unique():
  print(f'Mean age for title {i}: {tita_test_t3[tita_test_t3["Name"] == i]["Age"].mean()}')

Mean age for title Mr: 32.340425531914896
Mean age for title Mrs: 38.904761904761905
Mean age for title Miss: 21.774843750000002
Mean age for title Master: 7.406470588235294


In [None]:
# Fill each missing Age with the mean Age of the passenger's title (Mr, Mrs, Miss, or Master).
# Vectorized: groupby(...).transform('mean') gives each row its title mean, and fillna only touches the nulls.
tita_train_t3['Age'] = tita_train_t3['Age'].fillna(tita_train_t3.groupby('Name')['Age'].transform('mean'))

In [None]:
# Same fill for the test data: each missing Age gets the mean Age of the passenger's title.
tita_test_t3['Age'] = tita_test_t3['Age'].fillna(tita_test_t3.groupby('Name')['Age'].transform('mean'))

#### Port of Embarkation (Embarked)

In [None]:
tita_train_t3['Embarked'] = tita_train_t3['Embarked'].fillna('C')

#### Fare

In [None]:
tita_test_t3['Fare'] = tita_test_t3['Fare'].fillna(tita_test_t3[tita_test_t3["Pclass"] == 3]['Fare'].mean())

#### New Dataset

In [None]:
tita_train_t3 = tita_train_t3.drop(columns=['Cabin','Ticket', 'PassengerId'])

In [None]:
tita_test_t3= tita_test_t3.drop(columns=['Cabin','Ticket', 'PassengerId'])

In [None]:
tita_train_t3

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,Mr,0,22.000000,1,0,7.2500,S
1,1,1,Mrs,1,38.000000,1,0,71.2833,C
2,1,3,Miss,1,26.000000,0,0,7.9250,S
3,1,1,Mrs,1,35.000000,1,0,53.1000,S
4,0,3,Mr,0,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,0,2,Mr,0,27.000000,0,0,13.0000,S
887,1,1,Miss,1,19.000000,0,0,30.0000,S
888,0,3,Miss,1,21.845638,1,2,23.4500,S
889,1,1,Mr,0,26.000000,0,0,30.0000,C


In [None]:
tita_test_t3

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,Mr,0,34.500000,0,0,7.8292,Q
1,3,Mrs,1,47.000000,1,0,7.0000,S
2,2,Mr,0,62.000000,0,0,9.6875,Q
3,3,Mr,0,27.000000,0,0,8.6625,S
4,3,Mrs,1,22.000000,1,1,12.2875,S
...,...,...,...,...,...,...,...,...
413,3,Mr,0,32.340426,0,0,8.0500,S
414,1,Mrs,1,39.000000,0,0,108.9000,C
415,3,Mr,0,38.500000,0,0,7.2500,S
416,3,Mr,0,32.340426,0,0,8.0500,S


In [None]:
tita_train_t3.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

In [None]:
tita_test_t3.isnull().sum()

Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64

## Model Application

#### Treatment 1

In [None]:
tita_model_train_t1 = tita_train_t1.copy()
tita_model_test_t1 = tita_test_t1.copy()

In [None]:
tita_model_train_t1 = pd.get_dummies(tita_model_train_t1, columns =["Pclass", "Name", "Embarked"] )
tita_model_test_t1 = pd.get_dummies(tita_model_test_t1, columns =["Pclass", "Name", "Embarked"] )

In [None]:
tita_model_test_t1['Survived'] = np.nan

In [None]:
x_train_t1 = tita_model_train_t1.drop('Survived', axis = 1)
y_train_t1 = tita_model_train_t1['Survived']
x_test_t1 = tita_model_test_t1.drop('Survived', axis = 1)

In [None]:
y_test_t1= tita_model_test_t1['Survived']

In [None]:
tita_model_train_t1

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S
1,1,1,38.0,1,0,71.2833,1,0,0,0,0,0,1,1,0,0
3,1,1,35.0,1,0,53.1000,1,0,0,0,0,0,1,0,0,1
6,0,0,54.0,0,0,51.8625,1,0,0,0,0,1,0,0,0,1
10,1,1,4.0,1,1,16.7000,0,0,1,0,1,0,0,0,0,1
11,1,1,58.0,0,0,26.5500,1,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
871,1,1,47.0,1,1,52.5542,1,0,0,0,0,0,1,0,0,1
872,0,0,33.0,0,0,5.0000,1,0,0,0,0,1,0,0,0,1
879,1,1,56.0,0,1,83.1583,1,0,0,0,0,0,1,1,0,0
887,1,1,19.0,0,0,30.0000,1,0,0,0,1,0,0,0,0,1


In [None]:
tita_model_test_t1

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S,Survived
12,1,23.0,1,0,82.2667,1,0,0,0,0,0,1,0,0,1,
14,1,47.0,1,0,61.1750,1,0,0,0,0,0,1,0,0,1,
24,1,48.0,1,3,262.3750,1,0,0,0,0,0,1,1,0,0,
26,1,22.0,0,1,61.9792,1,0,0,0,1,0,0,1,0,0,
28,0,41.0,0,0,30.5000,1,0,0,0,0,1,0,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
404,0,43.0,1,0,27.7208,1,0,0,0,0,1,0,1,0,0,
405,0,20.0,0,0,13.8625,0,1,0,0,0,1,0,1,0,0,
407,0,50.0,1,1,211.5000,1,0,0,0,0,1,0,1,0,0,
411,1,37.0,1,0,90.0000,1,0,0,0,0,0,1,0,1,0,
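
Calling `pd.get_dummies` separately on the training and test frames yields matching columns only when every category occurs in both. Here the columns happen to match, but after dropping rows this is not guaranteed. A hedged sketch of one way to guard against a mismatch, on made-up data:

```python
import pandas as pd

train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.66, 12.29]})

train_d = pd.get_dummies(train, columns=["Embarked"])
test_d = pd.get_dummies(test, columns=["Embarked"])

# reindex adds any dummy column missing from the test frame (filled with 0)
# and puts the columns in the same order as the training frame
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d.columns.tolist())
```

With the real data, the feature columns of the training frame (everything except `Survived`) would be used as the reference for `reindex`.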


#### Treatment 2

In [None]:
tita_model_train_t2 = tita_train_t2.copy()
tita_model_test_t2 = tita_test_t2.copy()

In [None]:
tita_model_train_t2 = pd.get_dummies(tita_model_train_t2, columns =["Pclass", "Name", "Embarked"] )
tita_model_test_t2 = pd.get_dummies(tita_model_test_t2, columns =["Pclass", "Name", "Embarked"] )

In [None]:
tita_model_test_t2['Survived'] = np.nan

In [None]:
x_train_t2 = tita_model_train_t2.drop('Survived', axis = 1)
y_train_t2 = tita_model_train_t2['Survived']
x_test_t2 = tita_model_test_t2.drop('Survived', axis = 1)

In [None]:
tita_model_train_t2

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S
0,0,0,22.00000,1,0,7.2500,0,0,1,0,0,1,0,0,0,1
1,1,1,38.00000,1,0,71.2833,1,0,0,0,0,0,1,1,0,0
2,1,1,26.00000,0,0,7.9250,0,0,1,0,1,0,0,0,0,1
3,1,1,35.00000,1,0,53.1000,1,0,0,0,0,0,1,0,0,1
4,0,0,35.00000,0,0,8.0500,0,0,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,0,27.00000,0,0,13.0000,0,1,0,0,0,1,0,0,0,1
887,1,1,19.00000,0,0,30.0000,1,0,0,0,1,0,0,0,0,1
888,0,1,25.14062,1,2,23.4500,0,0,1,0,1,0,0,0,0,1
889,1,0,26.00000,0,0,30.0000,1,0,0,0,0,1,0,1,0,0


In [None]:
tita_model_test_t2

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S,Survived
0,0,34.500000,0,0,7.8292,0,0,1,0,0,1,0,0,1,0,
1,1,47.000000,1,0,7.0000,0,0,1,0,0,0,1,0,0,1,
2,0,62.000000,0,0,9.6875,0,1,0,0,0,1,0,0,1,0,
3,0,27.000000,0,0,8.6625,0,0,1,0,0,1,0,0,0,1,
4,1,22.000000,1,1,12.2875,0,0,1,0,0,0,1,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,0,24.027945,0,0,8.0500,0,0,1,0,0,1,0,0,0,1,
414,1,39.000000,0,0,108.9000,1,0,0,0,0,0,1,1,0,0,
415,0,38.500000,0,0,7.2500,0,0,1,0,0,1,0,0,0,1,
416,0,24.027945,0,0,8.0500,0,0,1,0,0,1,0,0,0,1,


In [None]:
y_test_t2 = tita_model_test_t2['Survived']

#### Treatment 3

In [None]:
tita_model_train_t3 = tita_train_t3.copy()
tita_model_test_t3 = tita_test_t3.copy()

In [None]:
tita_model_train_t3 = pd.get_dummies(tita_model_train_t3, columns=["Pclass", "Name", "Embarked"])
tita_model_test_t3 = pd.get_dummies(tita_model_test_t3, columns=["Pclass", "Name", "Embarked"])

In [None]:
tita_model_test_t3['Survived'] = np.nan

In [None]:
x_train_t3 = tita_model_train_t3.drop('Survived', axis = 1)
y_train_t3 = tita_model_train_t3['Survived']
x_test_t3 = tita_model_test_t3.drop('Survived', axis = 1)

In [None]:
y_test_t3  = tita_model_test_t3['Survived']

In [None]:
tita_model_train_t3

Unnamed: 0,Survived,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S
0,0,0,22.000000,1,0,7.2500,0,0,1,0,0,1,0,0,0,1
1,1,1,38.000000,1,0,71.2833,1,0,0,0,0,0,1,1,0,0
2,1,1,26.000000,0,0,7.9250,0,0,1,0,1,0,0,0,0,1
3,1,1,35.000000,1,0,53.1000,1,0,0,0,0,0,1,0,0,1
4,0,0,35.000000,0,0,8.0500,0,0,1,0,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,0,27.000000,0,0,13.0000,0,1,0,0,0,1,0,0,0,1
887,1,1,19.000000,0,0,30.0000,1,0,0,0,1,0,0,0,0,1
888,0,1,21.845638,1,2,23.4500,0,0,1,0,1,0,0,0,0,1
889,1,0,26.000000,0,0,30.0000,1,0,0,0,0,1,0,1,0,0


In [None]:
tita_model_test_t3

Unnamed: 0,Sex,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Name_Master,Name_Miss,Name_Mr,Name_Mrs,Embarked_C,Embarked_Q,Embarked_S,Survived
0,0,34.500000,0,0,7.8292,0,0,1,0,0,1,0,0,1,0,
1,1,47.000000,1,0,7.0000,0,0,1,0,0,0,1,0,0,1,
2,0,62.000000,0,0,9.6875,0,1,0,0,0,1,0,0,1,0,
3,0,27.000000,0,0,8.6625,0,0,1,0,0,1,0,0,0,1,
4,1,22.000000,1,1,12.2875,0,0,1,0,0,0,1,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,0,32.340426,0,0,8.0500,0,0,1,0,0,1,0,0,0,1,
414,1,39.000000,0,0,108.9000,1,0,0,0,0,0,1,1,0,0,
415,0,38.500000,0,0,7.2500,0,0,1,0,0,1,0,0,0,1,
416,0,32.340426,0,0,8.0500,0,0,1,0,0,1,0,0,0,1,


#### Training

In [None]:
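def funcao_modelos(x_train, y_train, modelo, tratamento, cv):
    """Fit `modelo` and print its training and cross-validated accuracy."""
    am = modelo
    am.fit(x_train, y_train)
    # Accuracy on the same rows used for fitting (optimistic by construction).
    am_score_train = am.score(x_train, y_train)
    # Out-of-fold predictions: each row is predicted by a model that did not see it.
    prev_train = cross_val_predict(am, x_train, y_train, cv=cv)
    am_score_train_cross = round(metrics.accuracy_score(y_train, prev_train) * 100, 2)
    print(f"A acurácia da base de treino para o tratamento {tratamento} foi: {am_score_train * 100:.2f}%")
    print(f'A acurácia da validação cruzada: {am_score_train_cross}% ')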
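
The helper above only prints its results, so comparing models means reading the outputs one by one. As a rough sketch, a variant that returns both accuracies makes it possible to collect them in a table. The function name `avaliar_modelo`, the dictionary keys and the synthetic data from `make_classification` are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def avaliar_modelo(x_train, y_train, modelo, cv=10):
    """Return training accuracy and cross-validated accuracy, both in percent."""
    modelo.fit(x_train, y_train)
    acc_treino = modelo.score(x_train, y_train) * 100
    # cross_val_predict refits clones of the model on each fold, so every
    # prediction comes from a model that did not see that row.
    prev = cross_val_predict(modelo, x_train, y_train, cv=cv)
    acc_cv = accuracy_score(y_train, prev) * 100
    return {"train_acc": round(acc_treino, 2), "cv_acc": round(acc_cv, 2)}


# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

resultados = pd.DataFrame(
    {
        "DecisionTree": avaliar_modelo(X, y, DecisionTreeClassifier(random_state=0)),
        "LogisticRegression": avaliar_modelo(X, y, LogisticRegression(max_iter=1000)),
    }
).T  # one row per model, one column per metric
print(resultados)
```

A large gap between `train_acc` and `cv_acc`, as with an unpruned decision tree, is the usual sign of overfitting, which is why the cross-validated figure is the one worth comparing across models.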


##### Random Forest

In [None]:
funcao_modelos(x_train_t1, y_train_t1, RandomForestClassifier(n_estimators=1000), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 100.00%
A acurácia da validação cruzada: 72.13% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, RandomForestClassifier(n_estimators=1000), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 98.20%
A acurácia da validação cruzada: 81.14% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, RandomForestClassifier(n_estimators=1000), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 98.20%
A acurácia da validação cruzada: 80.81% 


##### Logistic Regression

In [None]:
funcao_modelos(x_train_t1, y_train_t1, LogisticRegression(max_iter=1000), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 79.23%
A acurácia da validação cruzada: 77.6% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, LogisticRegression(max_iter=1000), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 83.50%
A acurácia da validação cruzada: 83.05% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, LogisticRegression(max_iter=1000), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 83.28%
A acurácia da validação cruzada: 82.6% 


##### K-Nearest Neighbours

In [None]:
funcao_modelos(x_train_t1, y_train_t1, KNeighborsClassifier(), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 74.86%
A acurácia da validação cruzada: 67.21% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, KNeighborsClassifier(), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 81.59%
A acurácia da validação cruzada: 72.05% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, KNeighborsClassifier(), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 81.37%
A acurácia da validação cruzada: 71.49% 


##### Gaussian Naive Bayes

In [None]:
funcao_modelos(x_train_t1, y_train_t1, GaussianNB(), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 78.69%
A acurácia da validação cruzada: 77.6% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, GaussianNB(), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 81.71%
A acurácia da validação cruzada: 80.92% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, GaussianNB(), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 81.82%
A acurácia da validação cruzada: 80.58% 


##### Linear SVC

In [None]:
funcao_modelos(x_train_t1, y_train_t1, LinearSVC(dual=False), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 79.78%
A acurácia da validação cruzada: 78.69% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, LinearSVC(dual=False), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 83.05%
A acurácia da validação cruzada: 82.94% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, LinearSVC(dual=False), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 83.05%
A acurácia da validação cruzada: 82.83% 


##### Stochastic Gradient Descent

In [None]:
funcao_modelos(x_train_t1, y_train_t1, SGDClassifier(), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 68.31%
A acurácia da validação cruzada: 67.76% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, SGDClassifier(), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 80.36%
A acurácia da validação cruzada: 69.92% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, SGDClassifier(), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 76.77%
A acurácia da validação cruzada: 73.29% 


##### Decision Tree

In [None]:
funcao_modelos(x_train_t1, y_train_t1, DecisionTreeClassifier(), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 100.00%
A acurácia da validação cruzada: 75.41% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, DecisionTreeClassifier(), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 98.20%
A acurácia da validação cruzada: 79.8% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, DecisionTreeClassifier(), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 98.20%
A acurácia da validação cruzada: 79.12% 


##### Gradient Boosting

In [None]:
funcao_modelos(x_train_t1, y_train_t1, GradientBoostingClassifier(), 1, 10)  # Treatment 1

A acurácia da base de treino para o tratamento 1 foi: 97.27%
A acurácia da validação cruzada: 78.14% 


In [None]:
funcao_modelos(x_train_t2, y_train_t2, GradientBoostingClassifier(), 2, 10)  # Treatment 2

A acurácia da base de treino para o tratamento 2 foi: 90.12%
A acurácia da validação cruzada: 82.94% 


In [None]:
funcao_modelos(x_train_t3, y_train_t3, GradientBoostingClassifier(), 3, 10)  # Treatment 3

A acurácia da base de treino para o tratamento 3 foi: 90.24%
A acurácia da validação cruzada: 82.38% 




---
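Of all the models trained, the linear SVC (`LinearSVC`) had the best cross-validation accuracy overall across the three treatments (78.69%, 82.94% and 82.83%). Logistic Regression was marginally ahead on treatment 2 only (83.05%).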
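
The comparison can be checked by tabulating the cross-validation accuracies printed in the training cells above. This is a small sketch. The numbers are copied from those outputs, while the row labels and the `mean` column are added here for illustration.

```python
import pandas as pd

# Cross-validation accuracies (%) copied from the training outputs above.
cv_acc = pd.DataFrame(
    {
        "T1": [72.13, 77.60, 67.21, 77.60, 78.69, 67.76, 75.41, 78.14],
        "T2": [81.14, 83.05, 72.05, 80.92, 82.94, 69.92, 79.80, 82.94],
        "T3": [80.81, 82.60, 71.49, 80.58, 82.83, 73.29, 79.12, 82.38],
    },
    index=[
        "RandomForest", "LogisticRegression", "KNeighbors", "GaussianNB",
        "LinearSVC", "SGD", "DecisionTree", "GradientBoosting",
    ],
)

# Average over the three treatments, then rank the models.
cv_acc["mean"] = cv_acc[["T1", "T2", "T3"]].mean(axis=1).round(2)
print(cv_acc.sort_values("mean", ascending=False))

# Best model per treatment.
print(cv_acc[["T1", "T2", "T3"]].idxmax())
```

The margins between the top models are small, and some estimators (for example `SGDClassifier` and `RandomForestClassifier`) were run without a fixed `random_state`, so the ranking may shift slightly between runs.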

#### Test

In [None]:
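parametros_aval = {
    'C': [0.1, 1, 10],
    'class_weight': [None, 'balanced'],  # Options for class_weight
    'break_ties': [True, False],       # Only affects multiclass 'ovr' prediction; no effect on this binary target
    'cache_size': [100, 200, 300],    # Kernel cache size in MB; affects speed, not accuracy
    'coef0': [0.0, 1.0, 2.0]          # Only used by the 'poly' and 'sigmoid' kernels; ignored by the default 'rbf'
}

modelo_uso = SVC()  # Default RBF kernel, not the LinearSVC evaluated above
svc_cruzada = GridSearchCV(modelo_uso, parametros_aval, cv=5)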

In [None]:
svc_cruzada.fit(x_train_t2, y_train_t2)  # Treatment 2 was chosen because it had the highest accuracy.

print("Melhores parâmetros:", svc_cruzada.best_params_)

Melhores parâmetros: {'C': 10, 'break_ties': True, 'cache_size': 100, 'class_weight': 'balanced', 'coef0': 0.0}


In [None]:
print(f'Melhor score: {svc_cruzada.best_score_ * 100:.2f}%')  # best_score_ is a fraction, so scale it to a percentage

Melhor score: 75.77%
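
The tuned `SVC` reaches a lower cross-validation score than the `LinearSVC` runs above. One possible reason, offered here as an assumption rather than a diagnosis, is that an RBF kernel is sensitive to unscaled features such as `Age` and `Fare`. The sketch below shows one way to search over parameters that do change the decision function, with scaling done inside a pipeline. The grid values, the name `busca` and the synthetic data are illustrative and are not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid (an assumption, not the notebook's): parameters that
# change the decision function of a binary SVC.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1],  # ignored by the linear kernel
    "svc__class_weight": [None, "balanced"],
}

busca = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
busca.fit(X, y)

print("Best parameters:", busca.best_params_)
print(f"Best CV accuracy: {busca.best_score_ * 100:.2f}%")
```

On the real data, `X` and `y` would be replaced by `x_train_t2` and `y_train_t2`. Whether this improves on the 82.94% obtained with `LinearSVC` would need to be measured.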


In [None]:
y_pred = svc_cruzada.predict(x_test_t3)  # Treatment 3's test set was used because, in this context, its filtering was the most realistic. Note that svc_cruzada was fitted on treatment 2.


In [None]:
data_kaggle = pd.DataFrame({'PassengerId': tit_passid, 'Survived': y_pred})  # DataFrame with only the test-set passenger IDs and the treatment 3 predictions, for submission to Kaggle.

In [None]:
data_kaggle.to_csv('Desafio_Titanic.csv', index = False)

In [None]:
from google.colab import files
files.download('Desafio_Titanic.csv')  # This submission reached 71% accuracy on Kaggle.

