# DOJO CargoX - Data Science

## What needs to be done?

This challenge consists of a _dataset_ that we would like you to study and use to propose a predictive model.

### Objectives

* Analyze the data and present what you found and the decisions you made, and
* Build a predictive model on top of the challenge data.

It is worth stressing that **there is no right answer** to this challenge, and
that the work and the process behind these two objectives matter as much as
the final result.

---

# The Thomas Andrews Insurance Company

You work for the Thomas Andrews Insurance Company, which insures the
passengers of the [three _Olympic_-class ocean
liners](https://en.wikipedia.org/wiki/Olympic-class_ocean_liner) of the British
company [White Star Line](https://en.wikipedia.org/wiki/White_Star_Line).

Unfortunately, the second ship of the class, the [RMS
Titanic](https://en.wikipedia.org/wiki/RMS_Titanic), sank on her maiden
voyage. Many passengers died because the ship did not carry enough lifeboats,
causing an enormous loss for Thomas Andrews.

## Could tragedies be avoided?

To avoid future tragedies (financial ones for the company, in this case), the
Board of Thomas Andrews is asking for a model that tries to predict which
passengers may die on upcoming voyages, so that it can charge a _premium_ on
their insurance, or even prevent them from boarding!

To that end, the insurer is providing two files with Titanic data for your
work:

* passengers whose survival outcome the company knows (`train.csv`),
* passengers for whom the company has data but does not know the outcome (`test.csv`).

There is one more attached file, `variables.txt`, which describes the fields
you will find in both passenger lists.

From this information, the Board expects you to understand the collected data
and model something that can help the company prevent future losses!

## Next Board meeting

You are expected to present your results at the next Thomas Andrews Board
meeting. The presentation is not formal, but you need to show your progress
and the decisions you made along the way.

Good luck, and see you at the next meeting!

---
## Variables
survived:
  * whether the passenger survived
  * possible values:
    0 = No
    1 = Yes

pclass:
  * ticket class on the ship
  * possible values:
    1 = first class,
    2 = business class,
    3 = economy class

sex:
  * the passenger's gender

age:
  * age in years

sibsp:
  * number of siblings/spouses aboard

parch:
  * number of parents and children aboard

ticket:
  * ticket code

fare:
  * ticket fare

cabin:
  * cabin number

embarked:
  * port where the passenger boarded the Titanic
  * possible values:
    C = Cherbourg,
    Q = Queenstown,
    S = Southampton



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from os import path

In [2]:
df_train = pd.read_csv("train.csv")
df_train

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3.0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",Female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3.0,"Heikkinen, Miss. Laina",women,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female_,35.0,1,0,113803,53.1000,C123,S
4,5,0,3.0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3.0,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1.0,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3.0,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3.0,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",woman,27.0,0,2,347742,11.1333,,S
9,10,1,2.0,"Nasser, Mrs. Nicholas (Adele Achem)",Female,14.0,1,0,237736,30.0708,,C


In [3]:
df_test = pd.read_csv("test.csv")
df_test

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.2250,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0000,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.1500,,S
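
The two files do not share a header convention: `train.csv` uses lowercase column names (`passengerid`, `pclass`, ...) while `test.csv` uses `PassengerId`, `Pclass`, and so on. A minimal sketch of one way to align them before any joint processing is shown below. It runs on a tiny stand-in frame rather than the real file, and the helper name `lowercase_columns` is hypothetical.

```python
import pandas as pd


def lowercase_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column labels are stripped and lowercased."""
    return df.rename(columns=lambda c: str(c).strip().lower())


# Stand-in for df_test, using the CamelCase headers shown in the output above.
sample_test = pd.DataFrame(
    {"PassengerId": [892, 893], "Pclass": [3, 3], "Sex": ["male", "female"]}
)
sample_test = lowercase_columns(sample_test)
print(list(sample_test.columns))
```

Applied to both frames, the same column selection code can then be reused for training and prediction.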


In [4]:
df_train.sex.isna().sum()

46
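
The count above covers a single column. As a rough sketch, the same check can be run over every column at once to see where missing values concentrate. The frame below is made-up stand-in data, not the challenge file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; the real notebook would use df_train instead.
sample = pd.DataFrame(
    {
        "sex": ["male", np.nan, "female", np.nan],
        "age": [22.0, 38.0, np.nan, 35.0],
        "fare": [7.25, 71.2833, 7.925, 53.1],
    }
)

missing = sample.isna().sum()         # number of missing values per column
missing_share = sample.isna().mean()  # fraction of missing values per column
print(missing)
print(missing_share)
```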

In [5]:
size_train = df_train.shape[0]
print(size_train)
size_test = df_test.shape[0]
print(size_test)

936
418


In [6]:
df_train.sex.unique()

array(['male', 'Female ', 'women', 'female_', 'male ', 'woman', ' male',
       'man', 'female', 'mal', nan, 'Male', ' female  '], dtype=object)

In [7]:
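# Map every spelling seen in df_train.sex.unique() to a canonical label.
# Caveat: the raw value is 'Female ' (trailing space), which the key 'Female'
# below does not match, so .map() turns those rows into NaN.
dict_sex = {'male': 'male',
            'male ': 'male',
            ' male': 'male',
            'man': 'male',
            'mal': 'male',
            'Male': 'male',
            'Female': 'female',
            'women': 'female',
            'female_': 'female',
            'woman': 'female',
            'female': 'female',
            ' female  ': 'female'}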

In [8]:
df_train.sex = df_train.sex.map(dict_sex)
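
A hand-written dictionary has to list every raw spelling exactly. Here the raw value `'Female '` (with a trailing space) has no matching key, so `.map()` leaves those rows as NaN. One possible alternative, sketched below, is to normalize whitespace, stray underscores, and case first, and only then map a short alias table. The names `ALIASES` and `normalize_sex` are hypothetical, and the input list is copied from the `unique()` output above.

```python
import numpy as np
import pandas as pd

# Alias table applied AFTER normalization, so it only needs lowercase keys.
ALIASES = {
    "male": "male", "man": "male", "mal": "male",
    "female": "female", "woman": "female", "women": "female",
}


def normalize_sex(values: pd.Series) -> pd.Series:
    """Strip whitespace and underscores, lowercase, then map aliases."""
    cleaned = values.str.strip().str.strip("_").str.lower()
    return cleaned.map(ALIASES)  # unknown spellings and NaN stay NaN


raw = pd.Series(['male', 'Female ', 'women', 'female_', 'male ', 'woman',
                 ' male', 'man', 'female', 'mal', np.nan, 'Male', ' female  '])
clean = normalize_sex(raw)
print(clean.unique())
```

With this approach only genuinely missing entries remain NaN, which keeps the later null analysis focused on rows that really lack the information.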

In [9]:
df_train

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3.0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3.0,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3.0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3.0,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1.0,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3.0,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3.0,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2.0,"Nasser, Mrs. Nicholas (Adele Achem)",,14.0,1,0,237736,30.0708,,C


In [10]:
df_train.sex.unique()

array(['male', nan, 'female'], dtype=object)

In [12]:
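# NaN never compares equal to anything (itself included), so this mask is
# all False and the selection comes back empty; isnull() is the right test.
df_train.name[df_train.sex == np.nan]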

Series([], Name: name, dtype: object)
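
The empty result follows from how NaN behaves: under IEEE 754 rules, NaN is not equal to any value, including itself, so `== np.nan` yields `False` for every row. A small sketch on stand-in data shows the difference from `isna()`:

```python
import numpy as np
import pandas as pd

sex = pd.Series(["male", np.nan, "female"])

eq_mask = sex == np.nan  # element-wise comparison: always False
na_mask = sex.isna()     # dedicated missing-value test

print(eq_mask.tolist())
print(na_mask.tolist())
```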

In [13]:
df_train.name[df_train.sex.isnull()]

1      Cumings, Mrs. John Bradley (Florence Briggs Th...
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
14                  Vestrom, Miss. Hulda Amanda Adolfina
22                           McGowan, Miss. Anna "Annie"
24                         Palsson, Miss. Torborg Danira
31        Spencer, Mrs. William Augustus (Marie Eugenie)
33                                 Wheadon, Mr. Edward H
39                           Nicola-Yarred, Miss. Jamila
56                                     Rugg, Miss. Emily
61                                   Icard, Miss. Amelie
67                              Crease, Mr. Ernest James
81                           Sheerlinck, Mr. Jan Baptist
111                                 Zabour, Miss. Hileni
119                    Andersson, Miss. Ellis Anna Maria
186      O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)
190                                  Pinsky, Mrs. (Rosa)
215                            
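
Most of the names listed above carry a title (`Mr.`, `Mrs.`, `Miss.`), which suggests one way the missing `sex` values could be recovered. The sketch below is an assumption about how that might be done, not part of the original analysis: the title-to-sex table and the helper `fill_sex_from_title` are hypothetical, and ambiguous titles are deliberately left unmapped.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup; titles not listed here stay NaN on purpose.
TITLE_TO_SEX = {"Mr": "male", "Master": "male", "Mrs": "female", "Miss": "female"}


def fill_sex_from_title(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing 'sex' values using the title between the comma and the first period."""
    title = df["name"].str.extract(r",\s*([^.,]+)\.", expand=False).str.strip()
    inferred = title.map(TITLE_TO_SEX)
    out = df.copy()
    out["sex"] = out["sex"].fillna(inferred)  # existing values are kept as they are
    return out


# Stand-in rows using names that appear in the notebook output.
sample = pd.DataFrame(
    {
        "name": [
            "Cumings, Mrs. John Bradley (Florence Briggs Th...",
            "Wheadon, Mr. Edward H",
            "Rugg, Miss. Emily",
            "Braund, Mr. Owen Harris",
        ],
        "sex": [np.nan, np.nan, np.nan, "male"],
    }
)
filled = fill_sex_from_title(sample)
print(filled["sex"].tolist())
```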

In [14]:
df_train[df_train.survived == 1].groupby(['pclass']).count()

pclass,passengerid,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1.0,142,142,142,127,128,142,142,142,142,120,140
2.0,88,88,88,69,84,88,88,88,88,13,88
3.0,128,128,128,111,90,128,128,128,128,6,128


In [15]:
df_train.groupby(['pclass']).count()

pclass,passengerid,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1.0,224,224,224,206,193,224,224,224,224,180,222
2.0,187,187,187,163,176,187,187,187,187,16,187
3.0,516,516,516,467,370,516,516,516,516,12,516


In [16]:
sobreviventes = df_train[df_train.survived == 1].groupby(['pclass']).count().survived
total = df_train.groupby(['pclass']).count().survived
resul = sobreviventes/total
print(resul)

pclass
1.0    0.633929
2.0    0.470588
3.0    0.248062
Name: survived, dtype: float64
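
Because `survived` is coded as 0/1, the ratio of survivors to total passengers in each class is simply the mean of `survived` within that class. The sketch below, on made-up stand-in rows, compares the two formulations:

```python
import pandas as pd

# Made-up stand-in rows; the real notebook would use df_train.
sample = pd.DataFrame(
    {
        "pclass": [1.0, 1.0, 2.0, 3.0, 3.0, 3.0],
        "survived": [1, 0, 1, 0, 0, 1],
    }
)

# Count-based ratio, in the style of the cell above.
ratio = (
    sample[sample.survived == 1].groupby("pclass").survived.count()
    / sample.groupby("pclass").survived.count()
)

# Equivalent one-liner: the mean of a 0/1 column is the survival rate.
rate = sample.groupby("pclass")["survived"].mean()
print(rate)
```

The two differ in one edge case: if a group had no survivors at all, the count-based ratio would give NaN for it (the group is absent from the numerator), whereas the mean gives 0.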


In [17]:
df_train

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3.0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3.0,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3.0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3.0,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1.0,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3.0,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3.0,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2.0,"Nasser, Mrs. Nicholas (Adele Achem)",,14.0,1,0,237736,30.0708,,C


In [18]:
sobreviventes = df_train[df_train.survived == 1].groupby(['sex']).count().survived
total = df_train.groupby(['sex']).count().survived
resul = sobreviventes/total
print(resul)

sex
female    0.756554
male      0.184348
Name: survived, dtype: float64


In [19]:
sobreviventes = df_train[df_train.survived == 1].groupby(['sex','pclass']).count().survived
total = df_train.groupby(['sex', 'pclass']).count().survived
resul = sobreviventes/total
print(resul)

sex     pclass
female  1.0       0.965116
        2.0       0.898305
        3.0       0.537190
male    1.0       0.366667
        2.0       0.153846
        3.0       0.132948
Name: survived, dtype: float64
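
The same two-way breakdown can also be laid out as a grid, with one row per sex and one column per class, which may be easier to present. A minimal sketch using `pivot_table` on made-up stand-in rows:

```python
import pandas as pd

# Made-up stand-in rows; the real notebook would use df_train.
sample = pd.DataFrame(
    {
        "sex": ["female", "female", "male", "male", "male", "female"],
        "pclass": [1.0, 3.0, 1.0, 3.0, 3.0, 3.0],
        "survived": [1, 0, 0, 0, 1, 1],
    }
)

# Mean of the 0/1 'survived' column per (sex, pclass) cell.
table = sample.pivot_table(
    index="sex", columns="pclass", values="survived", aggfunc="mean"
)
print(table)
```

Note that `groupby` drops rows whose grouping key is NaN by default, so passengers with a missing `sex` do not contribute to any of these rates.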


In [20]:
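# ndarray.sort() sorts in place and returns None, hence the output below;
# np.sort() (next cell) returns a sorted copy instead.
print(df_train.age.unique().sort())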

None
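
The `None` comes from NumPy's API rather than from the data: `ndarray.sort()` sorts the array in place and returns nothing, while `np.sort()` returns a sorted copy. A short illustration with stand-in values:

```python
import numpy as np

ages = np.array([22.0, 2.0, 38.0])
result = ages.sort()  # sorts 'ages' in place; the return value is None

sorted_copy = np.sort(np.array([22.0, 2.0, 38.0]))  # returns a new sorted array

print(result)
print(ages)
print(sorted_copy)
```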


In [21]:
np.sort(df_train.age.unique())

array([   0.42,    0.67,    0.75,    0.83,    0.92,    1.  ,    2.  ,
          3.  ,    4.  ,    5.  ,    6.  ,    7.  ,    8.  ,    9.  ,
         10.  ,   11.  ,   12.  ,   13.  ,   14.  ,   14.5 ,   15.  ,
         16.  ,   17.  ,   18.  ,   19.  ,   20.  ,   20.5 ,   21.  ,
         22.  ,   23.  ,   23.5 ,   24.  ,   24.5 ,   25.  ,   26.  ,
         27.  ,   28.  ,   28.5 ,   29.  ,   30.  ,   30.5 ,   31.  ,
         32.  ,   32.5 ,   33.  ,   34.  ,   34.5 ,   35.  ,   36.  ,
         36.5 ,   37.  ,   38.  ,   39.  ,   40.  ,   40.5 ,   41.  ,
         42.  ,   43.  ,   44.  ,   45.  ,   45.5 ,   46.  ,   47.  ,
         48.  ,   49.  ,   50.  ,   51.  ,   52.  ,   53.  ,   54.  ,
         55.  ,   56.  ,   57.  ,   58.  ,   59.  ,   60.  ,   61.  ,
         62.  ,   63.  ,   64.  ,   65.  ,   70.  ,   70.5 ,   71.  ,
         74.  ,   80.  ,  117.  ,  194.  ,     nan])

In [22]:
df_train.age.mean()

29.777633689839568

In [23]:
np.sort(df_train.age.replace([117., 194.], np.nan).unique())

array([  0.42,   0.67,   0.75,   0.83,   0.92,   1.  ,   2.  ,   3.  ,
         4.  ,   5.  ,   6.  ,   7.  ,   8.  ,   9.  ,  10.  ,  11.  ,
        12.  ,  13.  ,  14.  ,  14.5 ,  15.  ,  16.  ,  17.  ,  18.  ,
        19.  ,  20.  ,  20.5 ,  21.  ,  22.  ,  23.  ,  23.5 ,  24.  ,
        24.5 ,  25.  ,  26.  ,  27.  ,  28.  ,  28.5 ,  29.  ,  30.  ,
        30.5 ,  31.  ,  32.  ,  32.5 ,  33.  ,  34.  ,  34.5 ,  35.  ,
        36.  ,  36.5 ,  37.  ,  38.  ,  39.  ,  40.  ,  40.5 ,  41.  ,
        42.  ,  43.  ,  44.  ,  45.  ,  45.5 ,  46.  ,  47.  ,  48.  ,
        49.  ,  50.  ,  51.  ,  52.  ,  53.  ,  54.  ,  55.  ,  56.  ,
        57.  ,  58.  ,  59.  ,  60.  ,  61.  ,  62.  ,  63.  ,  64.  ,
        65.  ,  70.  ,  70.5 ,  71.  ,  74.  ,  80.  ,    nan])

In [24]:
df_train.age = df_train.age.replace([117., 194.], np.nan)
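
Replacing the two implausible values by hand works for this file. A range check is one alternative that does not depend on knowing the bad values in advance. The sketch below is only an illustration: the cutoff `MAX_PLAUSIBLE_AGE` is an assumed threshold, not something stated in the challenge.

```python
import numpy as np
import pandas as pd

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff, chosen for illustration only

# Stand-in values, including the two outliers seen in the output above.
ages = pd.Series([22.0, 117.0, 0.42, 194.0, np.nan, 80.0])

# Keep ages inside [0, MAX_PLAUSIBLE_AGE]; everything else becomes NaN.
cleaned = ages.where(ages.between(0, MAX_PLAUSIBLE_AGE))
print(cleaned.tolist())
```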

In [25]:
df_train.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3.0,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3.0,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3.0,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [26]:
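# Use item assignment: attribute assignment would only set a Python
# attribute on the object and would not create a new column.
df_train["age_group"] = np.nan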
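
One possible way to populate an `age_group` column is to bin `age` with `pd.cut`. The sketch below uses stand-in ages, and both the bin edges and the labels are hypothetical choices made for illustration; the notebook does not say which groups were intended.

```python
import numpy as np
import pandas as pd

# Stand-in ages; the real notebook would use df_train["age"].
df = pd.DataFrame({"age": [0.42, 14.0, 35.0, 70.5, np.nan]})

AGE_BINS = [0, 12, 18, 60, np.inf]                 # hypothetical bin edges
AGE_LABELS = ["child", "teen", "adult", "senior"]  # hypothetical labels

# right=False makes each bin closed on the left: [0, 12), [12, 18), ...
df["age_group"] = pd.cut(df["age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
print(df)
```

Rows with a missing age keep a missing `age_group`, so they can still be handled separately later on.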