# Solving Spaceship Titanic

## Introduction

This problem is similar to the original Titanic one; the difference is the last column, which is "Transported" (to another dimension or not) instead of "Survived".

I will use a Naive Bayes model here.

The columns in the file are the following:
    
* ```PassengerId```: Passenger identifier in the format ```gggg_pp```, where ```gggg``` is the group the passenger is traveling with and ```pp``` is their number within that group.

* ```HomePlanet```: Home planet.

* ```CryoSleep```: ```True``` if the passenger was in suspended animation during the voyage (and therefore confined to their cabin) and ```False``` otherwise.

* ```Cabin```: Cabin code.

* ```Destination```: Destination planet.

* ```Age```: Passenger's age.

* ```VIP```: ```True``` if the passenger paid for VIP service and ```False``` otherwise.

* ```RoomService```, ```FoodCourt```, ```ShoppingMall```, ```Spa```, ```VRDeck```: Amount the passenger spent on each of these services.

* ```Name```: Passenger's name.

* ```Transported```: ```True``` if the passenger was transported to another dimension and ```False``` otherwise.
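
The ```gggg_pp``` format can be taken apart with a string split. The sketch below is only an illustration on three made-up rows; the column names ```Group```, ```NumberInGroup``` and ```GroupSize``` are hypothetical, and the rest of this notebook drops ```PassengerId``` rather than using it this way.

```python
import pandas as pd

# Toy identifiers in the gggg_pp format described above
ids = pd.Series(['0001_01', '0003_01', '0003_02'], name='PassengerId')

# Split once on the underscore: the left part is the group,
# the right part is the passenger's number within that group
parts = ids.str.split('_', expand=True)
parts.columns = ['Group', 'NumberInGroup']  # hypothetical column names

# Number of passengers sharing each group
group_size = parts.groupby('Group')['NumberInGroup'].transform('count')

print(parts.assign(GroupSize=group_size))
```

A group-size column like this could be tried as an extra feature, but that is outside what this notebook does.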

## Initial data

In [26]:
from pandas import read_csv, DataFrame


df = read_csv('train.csv')

df

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False
8689,9278_01,Earth,True,G/1499/S,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False


## Data preparation

For this problem I will ignore the ```PassengerId```, ```Cabin``` and ```Name``` columns.

In [27]:
colunas = ['PassengerId', 'Cabin', 'Name']

df.drop(columns = colunas,
        inplace = True)

df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True
...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False


Next, add up everything each passenger spent into a single column.

In [28]:
df['GastoTotal'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']

df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,GastoTotal
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,0.0
1,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,736.0
2,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,10383.0
3,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,5176.0
4,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1091.0
...,...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,8536.0
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,False,0.0
8690,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,1873.0
8691,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,4637.0
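
One detail of the ```+``` operator is worth seeing on a small example: if any of the terms is missing, the whole sum is missing, which is why ```GastoTotal``` ends up with empty cells later on. The sketch below uses a made-up two-column frame; the ```sum(axis=1, min_count=1)``` variant is shown only as a possible alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy spending table with some missing values
spend = pd.DataFrame({
    'RoomService': [109.0, np.nan, np.nan],
    'FoodCourt':   [9.0,   10.0,   np.nan],
})

# Same approach as the cell above: one missing term makes the total missing
total_plus = spend['RoomService'] + spend['FoodCourt']

# Alternative: skip missing terms, but keep NaN when every term is missing
total_sum = spend.sum(axis=1, min_count=1)

print(DataFrame := pd.DataFrame({'plus': total_plus, 'sum': total_sum}))
```

Which behavior is preferable depends on whether a missing bill should be read as "unknown" or as "nothing spent"; the notebook keeps the first reading and imputes the mean afterwards.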


Then drop the individual expense columns.

In [29]:
colunas2 = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

df.drop(columns = colunas2,
        inplace = True)

df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
0,Europa,False,TRAPPIST-1e,39.0,False,False,0.0
1,Earth,False,TRAPPIST-1e,24.0,False,True,736.0
2,Europa,False,TRAPPIST-1e,58.0,True,False,10383.0
3,Europa,False,TRAPPIST-1e,33.0,False,False,5176.0
4,Earth,False,TRAPPIST-1e,16.0,False,True,1091.0
...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,False,8536.0
8689,Earth,True,PSO J318.5-22,18.0,False,False,0.0
8690,Earth,False,TRAPPIST-1e,26.0,False,True,1873.0
8691,Europa,False,55 Cancri e,32.0,False,False,4637.0


Before encoding the categorical data, let us fill in the empty cells.

In [30]:
df[ df['HomePlanet'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
59,,True,TRAPPIST-1e,33.0,False,True,
113,,False,TRAPPIST-1e,39.0,False,False,9307.0
186,,True,55 Cancri e,24.0,False,True,
225,,False,TRAPPIST-1e,18.0,False,False,1288.0
234,,True,55 Cancri e,54.0,False,True,0.0
...,...,...,...,...,...,...,...
8515,,False,TRAPPIST-1e,25.0,False,False,1299.0
8613,,False,55 Cancri e,53.0,False,False,7177.0
8666,,False,55 Cancri e,38.0,,True,2416.0
8674,,False,TRAPPIST-1e,13.0,False,False,1148.0


In [31]:
from statistics import mode

df[ df['HomePlanet'].isnull() ]

PlanetaModa = mode(df['HomePlanet'])

df['HomePlanet'].fillna(value = PlanetaModa,
                        inplace = True)

df[ df['HomePlanet'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
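
The same fill can be written with pandas alone. This is a minimal sketch on a made-up column, assuming only that ```Series.mode()``` skips missing values by default and returns the most frequent values in sorted order; the names ```toy``` and ```most_common``` are hypothetical. Assigning the result back, instead of using ```inplace = True``` on a selected column, also avoids the chained-assignment warnings that I believe recent pandas versions emit.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'HomePlanet': ['Earth', 'Europa', np.nan, 'Earth', np.nan]})

# mode() ignores NaN by default; [0] picks the first (most frequent) value
most_common = toy['HomePlanet'].mode()[0]

# Assign the filled column back rather than filling in place
toy['HomePlanet'] = toy['HomePlanet'].fillna(most_common)

print(toy)
```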


In [32]:
df[ df['CryoSleep'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
92,Earth,,TRAPPIST-1e,2.0,False,True,0.0
98,Earth,,TRAPPIST-1e,27.0,False,False,703.0
104,Europa,,TRAPPIST-1e,40.0,False,False,2018.0
111,Mars,,TRAPPIST-1e,26.0,False,True,
152,Earth,,TRAPPIST-1e,58.0,False,True,990.0
...,...,...,...,...,...,...,...
8620,Europa,,55 Cancri e,44.0,False,True,0.0
8651,Earth,,TRAPPIST-1e,8.0,False,False,0.0
8664,Earth,,TRAPPIST-1e,32.0,False,True,0.0
8675,Earth,,TRAPPIST-1e,44.0,False,True,


In [33]:
ModaCryoSleep = mode(df['CryoSleep'])

df['CryoSleep'].fillna(value = ModaCryoSleep,
                       inplace = True)

df[ df['CryoSleep'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal


In [34]:
df[ df['Destination'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
47,Mars,True,,19.0,False,True,0.0
128,Earth,False,,34.0,False,False,793.0
139,Earth,False,,41.0,False,False,607.0
347,Earth,False,,23.0,False,False,720.0
430,Earth,True,,50.0,False,False,0.0
...,...,...,...,...,...,...,...
8372,Earth,True,,20.0,False,True,0.0
8551,Mars,True,,41.0,False,True,0.0
8616,Mars,True,,33.0,False,True,0.0
8621,Europa,False,,41.0,True,False,17041.0


In [35]:
DestinoModa = mode(df['Destination'])

df['Destination'].fillna(value = DestinoModa,
                         inplace = True)

df[ df['Destination'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal


In [36]:
df[ df['Age'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
50,Earth,False,TRAPPIST-1e,,False,False,4689.0
64,Mars,False,TRAPPIST-1e,,False,False,1048.0
137,Earth,True,55 Cancri e,,False,True,0.0
181,Europa,False,55 Cancri e,,False,True,
184,Europa,False,55 Cancri e,,False,True,2981.0
...,...,...,...,...,...,...,...
8274,Earth,True,TRAPPIST-1e,,False,False,0.0
8301,Europa,True,TRAPPIST-1e,,False,True,0.0
8374,Earth,False,TRAPPIST-1e,,False,False,834.0
8407,Earth,True,TRAPPIST-1e,,False,True,0.0


In [37]:
MediaIdade = df['Age'].mean()

df['Age'].fillna(value = MediaIdade,
                 inplace = True)

df[ df['Age'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal


In [38]:
df[ df['VIP'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
38,Earth,False,55 Cancri e,15.00000,,False,961.0
102,Earth,False,TRAPPIST-1e,0.00000,,True,0.0
145,Mars,True,TRAPPIST-1e,35.00000,,True,0.0
228,Mars,True,55 Cancri e,14.00000,,True,0.0
566,Mars,False,TRAPPIST-1e,28.82793,,False,2383.0
...,...,...,...,...,...,...,...
8494,Earth,True,TRAPPIST-1e,0.00000,,True,
8512,Earth,False,PSO J318.5-22,16.00000,,False,761.0
8542,Earth,True,55 Cancri e,55.00000,,False,0.0
8630,Europa,True,TRAPPIST-1e,52.00000,,True,0.0


In [39]:
VIPmoda = mode(df['VIP'])

df['VIP'].fillna(value = VIPmoda,
                 inplace = True)

df[ df['VIP'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal


In [40]:
df[df['Transported'].isnull()]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal


In [41]:
df[ df['GastoTotal'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
7,Earth,True,TRAPPIST-1e,28.0,False,True,
10,Europa,True,TRAPPIST-1e,34.0,False,True,
16,Mars,False,55 Cancri e,27.0,False,False,
23,Earth,True,55 Cancri e,29.0,False,False,
25,Earth,True,PSO J318.5-22,1.0,False,False,
...,...,...,...,...,...,...,...
8642,Earth,True,TRAPPIST-1e,21.0,False,False,
8643,Mars,True,TRAPPIST-1e,50.0,False,True,
8665,Earth,True,TRAPPIST-1e,33.0,False,False,
8667,Europa,False,TRAPPIST-1e,29.0,False,True,


In [42]:
GastoTotalMedio = df['GastoTotal'].mean()

df['GastoTotal'].fillna(value = GastoTotalMedio,
                        inplace = True)

df[ df['GastoTotal'].isnull() ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
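
Instead of inspecting one column at a time, the number of empty cells per column can be read in a single call. A small sketch on made-up data (the frame ```toy``` is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'HomePlanet': ['Earth', None, 'Mars'],
    'Age': [39.0, np.nan, np.nan],
    'VIP': [False, False, True],
})

# One count of missing cells per column
missing_per_column = toy.isna().sum()

print(missing_per_column)
```

After the fills above, the equivalent call on ```df``` should report zero for every column.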


Before wrapping up this part, let us check whether there are duplicate rows.

In [43]:
df[ df.duplicated(keep = False) ]

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,Transported,GastoTotal
7,Earth,True,TRAPPIST-1e,28.0,False,True,1484.601541
9,Europa,True,55 Cancri e,14.0,False,True,0.000000
10,Europa,True,TRAPPIST-1e,34.0,False,True,1484.601541
12,Mars,False,TRAPPIST-1e,32.0,False,True,1309.000000
15,Earth,False,TRAPPIST-1e,31.0,False,False,908.000000
...,...,...,...,...,...,...,...
8680,Earth,True,TRAPPIST-1e,31.0,False,True,0.000000
8681,Earth,True,55 Cancri e,33.0,False,True,0.000000
8684,Earth,True,TRAPPIST-1e,23.0,False,True,0.000000
8685,Europa,False,TRAPPIST-1e,0.0,False,True,0.000000


Because I dropped some columns, rows that differed only in those columns now look identical. For that reason, I will not remove the duplicate rows.

In [44]:
aa = read_csv('train.csv')

aa[ aa.duplicated(keep = False) ]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported


Indeed, with all columns considered, there are no repeated rows.
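
The effect is easy to reproduce on two made-up passengers who differ only in their identifier (the frame ```toy``` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01'],
    'HomePlanet': ['Earth', 'Earth'],
    'Age': [23.0, 23.0],
})

# With the identifier present, no row repeats
dups_full = toy.duplicated(keep=False).sum()

# Without it, both rows become indistinguishable
dups_reduced = toy.drop(columns=['PassengerId']).duplicated(keep=False).sum()

print(dups_full, dups_reduced)
```

These are two distinct passengers, so removing one of them would throw away a genuine observation.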

## Reordering the columns

Before encoding the categories, let us swap the last two columns so that the class comes last, which makes it easier to slice the data into features and class.

In [45]:
cols = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'GastoTotal', 'Transported']

df = df[cols]

df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,GastoTotal,Transported
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,False
1,Earth,False,TRAPPIST-1e,24.0,False,736.0,True
2,Europa,False,TRAPPIST-1e,58.0,True,10383.0,False
3,Europa,False,TRAPPIST-1e,33.0,False,5176.0,False
4,Earth,False,TRAPPIST-1e,16.0,False,1091.0,True
...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,8536.0,False
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,False
8690,Earth,False,TRAPPIST-1e,26.0,False,1873.0,True
8691,Europa,False,55 Cancri e,32.0,False,4637.0,False


## Encoding the categories

Now that every cell is filled in, let us encode the categories as follows:
    
* ```HomePlanet```: One Hot Encoding.

* ```CryoSleep```: Label Encoder.

* ```Destination```: One Hot Encoding.

* ```VIP```: Label Encoder.

Note: we will not encode the ```Transported``` column. It is the class to be predicted, and scikit-learn's Naive Bayes classifiers accept non-numeric class labels; only the features need to be numeric.

Let us first encode HomePlanet and Destination. Starting with HomePlanet, we have

In [46]:
from sklearn.preprocessing import OneHotEncoder

CodificadorPlaneta = OneHotEncoder(sparse = False,
                                    drop = 'first')

ArrayPlaneta = CodificadorPlaneta.fit_transform(df[['HomePlanet']])

ArrayPlaneta

array([[1., 0.],
       [0., 0.],
       [1., 0.],
       ...,
       [0., 0.],
       [1., 0.],
       [1., 0.]])

In [47]:
CodificadorDestino = OneHotEncoder(sparse = False,
                                    drop = 'first')

ArrayDestino = CodificadorDestino.fit_transform(df[['Destination']])

ArrayDestino

array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [0., 0.],
       [0., 1.]])

In [58]:
from numpy import concatenate

arraytotal = concatenate((ArrayPlaneta, ArrayDestino), 
                         axis = 1)

arraytotal

array([[1., 0., 0., 1.],
       [0., 0., 0., 1.],
       [1., 0., 0., 1.],
       ...,
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [1., 0., 0., 1.]])

In [67]:
dfArraytotal = DataFrame(arraytotal)

df = df.join(dfArraytotal)

df

Unnamed: 0,HomePlanet,CryoSleep,Destination,Age,VIP,GastoTotal,Transported,0,1,2,3
0,Europa,False,TRAPPIST-1e,39.0,False,0.0,False,1.0,0.0,0.0,1.0
1,Earth,False,TRAPPIST-1e,24.0,False,736.0,True,0.0,0.0,0.0,1.0
2,Europa,False,TRAPPIST-1e,58.0,True,10383.0,False,1.0,0.0,0.0,1.0
3,Europa,False,TRAPPIST-1e,33.0,False,5176.0,False,1.0,0.0,0.0,1.0
4,Earth,False,TRAPPIST-1e,16.0,False,1091.0,True,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...
8688,Europa,False,55 Cancri e,41.0,True,8536.0,False,1.0,0.0,0.0,0.0
8689,Earth,True,PSO J318.5-22,18.0,False,0.0,False,0.0,0.0,1.0,0.0
8690,Earth,False,TRAPPIST-1e,26.0,False,1873.0,True,0.0,0.0,0.0,1.0
8691,Europa,False,55 Cancri e,32.0,False,4637.0,False,1.0,0.0,0.0,0.0


In [70]:
df.drop(columns = ['HomePlanet', 'Destination'],
        inplace = True)

df

Unnamed: 0,CryoSleep,Age,VIP,GastoTotal,Transported,0,1,2,3
0,False,39.0,False,0.0,False,1.0,0.0,0.0,1.0
1,False,24.0,False,736.0,True,0.0,0.0,0.0,1.0
2,False,58.0,True,10383.0,False,1.0,0.0,0.0,1.0
3,False,33.0,False,5176.0,False,1.0,0.0,0.0,1.0
4,False,16.0,False,1091.0,True,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,8536.0,False,1.0,0.0,0.0,0.0
8689,True,18.0,False,0.0,False,0.0,0.0,1.0,0.0
8690,False,26.0,False,1873.0,True,0.0,0.0,0.0,1.0
8691,False,32.0,False,4637.0,False,1.0,0.0,0.0,0.0


Reordering the columns one last time.

In [72]:
df = df[['CryoSleep', 'Age', 'VIP', 'GastoTotal', 0, 1, 2, 3, 'Transported']]

df

Unnamed: 0,CryoSleep,Age,VIP,GastoTotal,0,1,2,3,Transported
0,False,39.0,False,0.0,1.0,0.0,0.0,1.0,False
1,False,24.0,False,736.0,0.0,0.0,0.0,1.0,True
2,False,58.0,True,10383.0,1.0,0.0,0.0,1.0,False
3,False,33.0,False,5176.0,1.0,0.0,0.0,1.0,False
4,False,16.0,False,1091.0,0.0,0.0,0.0,1.0,True
...,...,...,...,...,...,...,...,...,...
8688,False,41.0,True,8536.0,1.0,0.0,0.0,0.0,False
8689,True,18.0,False,0.0,0.0,0.0,1.0,0.0,False
8690,False,26.0,False,1873.0,0.0,0.0,0.0,1.0,True
8691,False,32.0,False,4637.0,1.0,0.0,0.0,0.0,False


Now let us use the Label Encoder for the 'CryoSleep' and 'VIP' columns. Before that, we split the data into features and class.

In [75]:
atributos = df.iloc[:,0:8].values

atributos

array([[False, 39.0, False, ..., 0.0, 0.0, 1.0],
       [False, 24.0, False, ..., 0.0, 0.0, 1.0],
       [False, 58.0, True, ..., 0.0, 0.0, 1.0],
       ...,
       [False, 26.0, False, ..., 0.0, 0.0, 1.0],
       [False, 32.0, False, ..., 0.0, 0.0, 0.0],
       [False, 44.0, False, ..., 0.0, 0.0, 1.0]], dtype=object)

In [77]:
classe = df.iloc[:,8].values

classe

array([False,  True, False, ...,  True, False,  True])

In [78]:
from sklearn.preprocessing import LabelEncoder

LabelCry = LabelEncoder()

LabelVIP = LabelEncoder()

atributos[:,0] = LabelCry.fit_transform(atributos[:,0])

atributos[:,2] = LabelVIP.fit_transform(atributos[:,2])

atributos

array([[0, 39.0, 0, ..., 0.0, 0.0, 1.0],
       [0, 24.0, 0, ..., 0.0, 0.0, 1.0],
       [0, 58.0, 1, ..., 0.0, 0.0, 1.0],
       ...,
       [0, 26.0, 0, ..., 0.0, 0.0, 1.0],
       [0, 32.0, 0, ..., 0.0, 0.0, 0.0],
       [0, 44.0, 0, ..., 0.0, 0.0, 1.0]], dtype=object)
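
For boolean columns, the Label Encoder simply maps ```False``` to 0 and ```True``` to 1, so a plain integer cast gives the same result. A small sketch on made-up values (the array ```flags``` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

flags = np.array([False, True, False, True])

# LabelEncoder sorts the distinct values (False, True) and numbers them 0, 1
encoded = LabelEncoder().fit_transform(flags)

# Casting booleans to integers yields the same codes
as_int = flags.astype(int)

print(encoded, as_int)
```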

## Splitting the data

In [80]:
from sklearn.model_selection import train_test_split

xTreino, xTeste, yTreino, yTeste = train_test_split(atributos, classe,
                                                    test_size = 0.3,
                                                    random_state = 0)

xTreino

array([[0, 64.0, 0, ..., 0.0, 0.0, 0.0],
       [0, 24.0, 0, ..., 0.0, 1.0, 0.0],
       [0, 44.0, 0, ..., 0.0, 0.0, 1.0],
       ...,
       [0, 29.0, 0, ..., 1.0, 0.0, 1.0],
       [0, 0.0, 0, ..., 0.0, 0.0, 1.0],
       [0, 45.0, 0, ..., 0.0, 0.0, 1.0]], dtype=object)
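
The sketch below shows, on ten made-up rows, what ```test_size = 0.3``` and a fixed ```random_state``` do: 30% of the rows go to the test set, and repeating the call with the same seed reproduces the same split. The arrays ```X``` and ```y``` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Same seed, same split
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```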

## Implementing the Naive Bayes model

Let us fit the Naive Bayes model to this problem.

In [81]:
from sklearn.naive_bayes import GaussianNB

modelo = GaussianNB()

modelo.fit(xTreino, yTreino)

previsao = modelo.predict(xTeste)

previsao

array([False, False, False, ...,  True, False, False])

## Model accuracy

Let us now compute the model's accuracy by comparing its predictions with the expected labels.

In [82]:
from sklearn.metrics import confusion_matrix, accuracy_score

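# Note: the predictions are passed first here, so rows are predicted labels
# and columns are true labels (the transpose of scikit-learn's usual
# confusion_matrix(y_true, y_pred) layout).
matriz = confusion_matrix(previsao, yTeste)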

matriz

array([[1028,  475],
       [ 275,  830]])

In [83]:
TaxaAcerto = accuracy_score(previsao, yTeste)

TaxaAcerto

0.7124233128834356

The model reached an accuracy of about 71% on the held-out 30% of the training data.
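
The accuracy can be recovered by hand from the confusion matrix, which also makes the difference between accuracy, precision and recall concrete. The sketch below reuses the four counts reported above; since the predictions were passed as the first argument, rows are read as predicted labels and columns as true labels. The variable names are hypothetical.

```python
import numpy as np

# Counts reported above: rows = predicted (False, True), columns = true (False, True)
matrix = np.array([[1028, 475],
                   [275, 830]])

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = np.trace(matrix) / matrix.sum()

# Precision for the True class: of everything predicted True, how much was True
precision_true = matrix[1, 1] / matrix[1, :].sum()   # 830 / (275 + 830)

# Recall for the True class: of everything actually True, how much was found
recall_true = matrix[1, 1] / matrix[:, 1].sum()      # 830 / (475 + 830)

print(f'accuracy={accuracy:.4f}')
print(f'precision={precision_true:.4f} recall={recall_true:.4f}')
```

The first value agrees with the ```accuracy_score``` result above.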

## Applying the model to the test file

With that done, let us apply the model to the ```test.csv``` file and generate the file with the answer to the problem.

In [85]:
df2 = read_csv('test.csv')

df2.drop(columns = colunas,
        inplace = True)

df2['GastoTotal'] = df2['RoomService'] + df2['FoodCourt'] + df2['ShoppingMall'] + df2['Spa'] + df2['VRDeck']

df2.drop(columns = colunas2,
        inplace = True)

PlanetaModa2 = mode(df2['HomePlanet'])
df2['HomePlanet'].fillna(value = PlanetaModa2,
                        inplace = True)


ModaCryoSleep2 = mode(df2['CryoSleep'])
df2['CryoSleep'].fillna(value = ModaCryoSleep2,
                       inplace = True)


DestinoModa2 = mode(df2['Destination'])
df2['Destination'].fillna(value = DestinoModa2,
                         inplace = True)


MediaIdade2 = df2['Age'].mean()
df2['Age'].fillna(value = MediaIdade2,
                 inplace = True)


VIPmoda2 = mode(df2['VIP'])
df2['VIP'].fillna(value = VIPmoda2,
                 inplace = True)


GastoTotalMedio2 = df2['GastoTotal'].mean()
df2['GastoTotal'].fillna(value = GastoTotalMedio2,
                        inplace = True)

cols2 = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'GastoTotal']

df2 = df2[cols2]


CodificadorPlaneta2 = OneHotEncoder(sparse = False,
                                    drop = 'first')

CodificadorDestino2 = OneHotEncoder(sparse = False,
                                    drop = 'first')

ArrayPlaneta2 = CodificadorPlaneta2.fit_transform(df2[['HomePlanet']])

ArrayDestino2 = CodificadorDestino2.fit_transform(df2[['Destination']])

arraytotal2 = concatenate((ArrayPlaneta2, ArrayDestino2),
                         axis = 1)

dfArraytotal2 = DataFrame(arraytotal2)

df2 = df2.join(dfArraytotal2)

df2.drop(columns = ['HomePlanet', 'Destination'],
        inplace = True)


df2 = df2[['CryoSleep', 'Age', 'VIP', 'GastoTotal', 0, 1, 2, 3]]


atributos2 = df2.iloc[:,0:8].values


LabelCry2 = LabelEncoder()

LabelVIP2 = LabelEncoder()

atributos2[:,0] = LabelCry2.fit_transform(atributos2[:,0])

atributos2[:,2] = LabelVIP2.fit_transform(atributos2[:,2])

previsao2 = modelo.predict(atributos2)

In [86]:
previsao2

array([ True, False,  True, ...,  True, False,  True])
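
The cell above fits new encoders on the test file. That gives the same column layout as long as the test file contains the same categories as the training file, which is an assumption rather than a guarantee. A more defensive variant is to reuse the encoder fitted on the training data and only call ```transform``` on the new data. The sketch below illustrates the difference on made-up frames; ```train``` and ```test``` are hypothetical stand-ins, not the real files.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Destination': ['TRAPPIST-1e', '55 Cancri e', 'PSO J318.5-22']})
# This toy test set happens to lack one of the destinations
test = pd.DataFrame({'Destination': ['TRAPPIST-1e', 'PSO J318.5-22']})

# Fit on the training data, then only transform the test data
encoder = OneHotEncoder(drop='first')
encoder.fit(train[['Destination']])
reused = encoder.transform(test[['Destination']]).toarray()

# Refitting on the test data alone changes the number of columns
refit = OneHotEncoder(drop='first').fit_transform(test[['Destination']]).toarray()

print(reused.shape, refit.shape)
```

With the reused encoder the test features keep the layout the model was trained on; the refitted one silently produces a different layout when a category is absent.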

## Generating the final file

In [87]:
len(previsao2)

4277

In [89]:
IdPassageiros = read_csv('test.csv',
                          usecols = ['PassengerId'])

IdPassageiros

Unnamed: 0,PassengerId
0,0013_01
1,0018_01
2,0019_01
3,0021_01
4,0023_01
...,...
4272,9266_02
4273,9269_01
4274,9271_01
4275,9273_01


In [93]:
dicionario = {'PassengerId': IdPassageiros['PassengerId'], 
              'Transported': previsao2}

resultado = DataFrame(dicionario)

resultado

Unnamed: 0,PassengerId,Transported
0,0013_01,True
1,0018_01,False
2,0019_01,True
3,0021_01,False
4,0023_01,False
...,...,...
4272,9266_02,True
4273,9269_01,False
4274,9271_01,True
4275,9273_01,False


In [94]:
resultado.to_csv('RespostaSpaceshipTitanic.csv',
                  index = False)