# Introduction


This notebook is an adaptation of a notebook by Chris Deotte, a Kaggle user. It predicts whether a passenger survives based on sex, age, and the survival of other passengers from the same family. Family groups are identified by the passengers' surnames.

The Kaggle notebook chosen was: https://www.kaggle.com/cdeotte/titanic-using-name-only-0-81818

The original is a relatively simple implementation in R, but it has a clever idea: it groups family members and decides whether each person survives based on the group, combined with the WCG pattern (in which women and children tend strongly to survive).

The final Kaggle score of my implementation was lower than expected: 0.76555, against an expected score above 0.8. This notebook will be improved so that it can be published on sites such as Medium.

# Development

In [0]:
import pandas as pd
import numpy as np

pandas and numpy are used. Note that I decided not to use any plotting library: the goal of this activity was to reproduce a notebook, and the chosen notebook uses few plots (only to show differences).

In [0]:
# load the datasets
train = pd.read_csv("train.csv").set_index('PassengerId')
test = pd.read_csv("test.csv").set_index('PassengerId')
full = pd.concat([train, test], axis=0, sort=False)


The datasets are read and immediately concatenated, since they come as separate files. They will be modified separately, so the "full" dataset will be rebuilt when that happens.

In [492]:
# Show the first records
train.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [493]:
test.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [0]:
# Work on a copy, so the data does not have to be re-imported if an operation goes wrong
train_df = train.copy()

In [495]:
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [0]:
# Function that extracts the passenger's title (Mr, Mrs, etc.) from the Name field
def FindTitle(data):
    title = data.Name.split(', ')[1]
    title = title.split('.')[0]
    return title

In [497]:
# Apply the function that finds the title based on the name
train_df["Title"] = train_df.apply(FindTitle, axis=1)
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr

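The same title extraction can also be written without a row-wise `apply`. The sketch below is only an illustration on a handful of names: the `names` series and the regular expression are assumptions made for this example, and the last name is invented to exercise a multi-word title.

```python
import pandas as pd

# Toy names in the same "Surname, Title. Given names" layout as the Name column.
# The last entry is a made-up name, used only to exercise a multi-word title.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Peter, Master. Michael J",
    "Oliva y Ocana, Dona. Fermina",
    "Example, the Countess. of (Jane Doe)",
])

# Capture the text between the first comma and the next period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so the result can be assigned directly to a column.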

In [0]:
# Function that assigns a label (man, woman or boy) based on the title
# Titles outside these lists fall through and return None
def FindLabelByTitle(data):
    manTitles = ["Capt","Don","Major","Col","Rev","Dr","Sir","Mr","Jonkheer"]
    womanTitles = ["Dona","the Countess","Mme","Mlle","Ms","Miss","Lady","Mrs"]
    if (data.Title in manTitles):
        return "man"
    elif (data.Title in womanTitles):
        return "woman"
    elif (data.Title == "Master"):
        return "boy"

In [499]:
# Apply the function above
train_df["Label"] = train_df.apply(FindLabelByTitle, axis=1)
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man

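The title-to-label rule can also be expressed as a lookup table. This is a minimal sketch, not the notebook's code: the `TITLE_TO_LABEL` dictionary and the toy `titles` series are names introduced here, and "Duke" is a made-up title used to show what happens to a title that is in none of the lists.

```python
import pandas as pd

MAN_TITLES = ["Capt", "Don", "Major", "Col", "Rev", "Dr", "Sir", "Mr", "Jonkheer"]
WOMAN_TITLES = ["Dona", "the Countess", "Mme", "Mlle", "Ms", "Miss", "Lady", "Mrs"]

# One dictionary holding the same rule as the if/elif chain
TITLE_TO_LABEL = {title: "man" for title in MAN_TITLES}
TITLE_TO_LABEL.update({title: "woman" for title in WOMAN_TITLES})
TITLE_TO_LABEL["Master"] = "boy"

titles = pd.Series(["Mr", "Mrs", "Master", "Dona", "Duke"])

# Titles missing from the dictionary become NaN, which makes them easy to spot
labels = titles.map(TITLE_TO_LABEL)
print(labels.tolist())
print("unmapped titles:", titles[labels.isna()].tolist())
```

Checking `labels.isna()` after the mapping is a cheap way to confirm that every title in the data was covered by one of the lists.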

In [500]:
# Extract the passenger's surname, without first or middle names
train_df["Surname"] = train_df.Name.str.split(",").str[0]
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,Braund
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,Cumings
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,Heikkinen
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,Futrelle
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,Allen


In [501]:
# Men (but not boys) receive the surname noGroup
train_df.loc[train_df.Label == 'man', 'Surname'] = 'noGroup'
train_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,Cumings
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,Heikkinen
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,Futrelle
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup


In [502]:
# Count how many times each surname appears, to get its frequency
train_df["SurnameFreq"] = train_df.groupby("Surname")["Surname"].transform("count")
train_df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,538
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,Cumings,1
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,Heikkinen,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,Futrelle,1
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup,538
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,man,noGroup,538
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,man,noGroup,538
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,boy,Palsson,4
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,woman,Johnson,3
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,woman,Nasser,1


In [503]:
# Passengers travelling alone (frequency <= 1) also receive the surname noGroup
train_df.loc[train_df.SurnameFreq <= 1, 'Surname'] = 'noGroup'
train_df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,538
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,noGroup,1
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,noGroup,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,noGroup,1
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup,538
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,man,noGroup,538
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,man,noGroup,538
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,boy,Palsson,4
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,woman,Johnson,3
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,woman,noGroup,1


In [511]:
# Mean survival of each woman + child group
train_df['SurnameSurvival'] = train_df.groupby('Surname')['Survived'].transform('mean')
train_df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,538,0.34713
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,noGroup,1,0.34713
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,noGroup,1,0.34713
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,noGroup,1,0.34713
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup,538,0.34713
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr,man,noGroup,538,0.34713
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Mr,man,noGroup,538,0.34713
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,Master,boy,Palsson,4,0.0
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Mrs,woman,Johnson,3,1.0
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Mrs,woman,noGroup,1,0.34713


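The code above may be incorrect. In the R notebook, the author used:

`$SurnameSurvival <- ave(train$Survived, train$Surname)`

However, he does not show the output. R's `ave` applies the mean by default, so it returns each row's group mean, which is the same quantity that `groupby('Surname')['Survived'].transform('mean')` computes here.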
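
To make the comparison concrete, here is a small sketch of what a per-row group mean looks like in pandas. The `toy` frame, its group names and its values are invented for this illustration.

```python
import pandas as pd

# Made-up groups: A has no survivors, B has one of two, C is a single survivor
toy = pd.DataFrame({
    "Surname": ["A", "B", "A", "B", "C"],
    "Survived": [0, 1, 0, 0, 1],
})

# transform("mean") returns one value per row (the mean of that row's group),
# so the result has the same length and index as the original frame
toy["SurnameSurvival"] = toy.groupby("Surname")["Survived"].transform("mean")
print(toy)
```

Because the result is aligned row by row, it can be stored as a new column without any merge step, which is the role `ave` plays in the R version.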

In [512]:
# WCG groups in which everyone dies
train_df[train_df.SurnameSurvival == 0].Surname.unique()

array(['Palsson', 'Rice', 'Vander Planke', 'Panula', 'Goodwin', 'Skoog',
       'Zabour', 'Jussila', 'Boulos', 'Ford', 'Sage', 'Lefebre', 'Strom',
       'Barbara', 'Van Impe', 'Bourke'], dtype=object)

In [513]:
# WCG groups in which everyone lives
train_df[train_df.SurnameSurvival == 1].Surname.unique()

array(['Johnson', 'Sandstrom', 'Nicola-Yarred', 'Laroche', 'Harper',
       'West', 'Moubarek', 'Caldwell', 'Fortune', 'Doling', 'Peter',
       'Goldsmith', 'Becker', 'Navratil', 'Brown', 'Newell', 'Collyer',
       'Murphy', 'Hamalainen', 'Graham', 'Mellinger', 'Kelly', 'Hays',
       'Ryerson', 'Wick', 'Hippach', 'Coutts', 'Richards', 'Hart',
       'Baclini', 'Quick', 'Taussig', 'Herman', 'Moor'], dtype=object)

In [514]:
# WCG groups with both outcomes
train_df[(train_df.SurnameSurvival > 0) & (train_df.SurnameSurvival < 1)].Surname.unique()

array(['noGroup', 'Asplund', 'Andersson', 'Allison', 'Carter'],
      dtype=object)

In [0]:
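# The survival rate is adjusted.
# Note: as written, operator precedence evaluates this as (...) / SurnameFreq, minus 1,
# rather than a division by (SurnameFreq - 1); the negative values in the output reflect that.
train_df["AdjustedSurvival"] = (train_df.SurnameSurvival * train_df.SurnameFreq - train_df.Survived) / train_df.SurnameFreq-1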

In [518]:
# Predictions are then made based on the survival rate.
# Note: Title holds values such as 'Mr' or 'Mrs', never 'woman' or 'boy' (those are in Label),
# so these masks match no rows, and the chained [...][...] assignments write to a temporary
# copy. As a result predict is still 0 for every row shown in the output.
train_df["predict"] = 0
train_df.predict[train_df.Title=='woman'] = 1
train_df.predict[train_df.Title=='boy'][train_df.AdjustedSurvival==1] = 1
train_df.predict[train_df.Title=='woman'][train_df.AdjustedSurvival==0] = 0
train_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,538,0.34713,0,-0.65287
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,noGroup,1,0.34713,0,-1.65287
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,noGroup,1,0.34713,0,-1.65287
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,noGroup,1,0.34713,0,-1.65287
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup,538,0.34713,0,-0.65287

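Assuming the adjustment is meant to be a leave-one-out rate (the survival rate of the other members of the group, excluding the passenger in question), it could be sketched as follows. The `toy` frame and the handling of single-member groups are assumptions made for this example, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Made-up groups: A has three members, B has two, C has a single member
toy = pd.DataFrame({
    "Surname": ["A", "A", "A", "B", "B", "C"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

grouped = toy.groupby("Surname")["Survived"]
freq = grouped.transform("count")   # group size, per row
total = grouped.transform("sum")    # survivors in the group, per row

# A group of one has nobody left after removing the passenger: use NaN there
others = (freq - 1).where(freq > 1)

# Survivors among the other members, divided by the number of other members
toy["AdjustedSurvival"] = (total - toy["Survived"]) / others
print(toy)
```

The parentheses around `freq - 1` are what distinguish this from dividing by the group size and then subtracting 1.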

In [519]:
# Copy and show the test dataset
test_df = test.copy()
test_df.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [520]:
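# Apply the same title and label functions to the test set
test_df["Title"] = test_df.apply(FindTitle, axis=1)
test_df["Label"] = test_df.apply(FindLabelByTitle, axis=1)
# New columns must be created with bracket assignment; attribute assignment does not add a column
test_df["Survived"] = np.nan
test_df["predict"] = np.nan
# Reset the columns that will be recomputed on the combined data
train_df["AdjustedSurvival"] = np.nan
train_df["Surname"] = ""
train_df["SurnameFreq"] = np.nan
train_df["SurnameSurvival"] = np.nan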

full_df = pd.concat([train_df, test_df], axis=0, sort=False)
full_df.tail()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S,Mr,man,,,,,
1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C,Dona,woman,,,,,
1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S,Mr,man,,,,,
1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S,Mr,man,,,,,
1309,,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C,Master,boy,,,,,


The cell above applies the same functions used on the training data frame. After that, some train_df columns are reset, because they need to be recomputed with the information from test_df (the surname frequency, for example). The two datasets are also concatenated.

In [521]:
# Fill in the surnames and their frequencies
full_df["Surname"] = full_df.Name.str.split(",").str[0]
full_df.loc[full_df.Label == 'man', 'Surname'] = 'noGroup'
full_df["SurnameFreq"] = full_df.groupby("Surname")["Surname"].transform("count")
full_df.loc[full_df.SurnameFreq <= 1, 'Surname'] = 'noGroup'
full_df.tail()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S,Mr,man,noGroup,783,,,
1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C,Dona,woman,noGroup,1,,,
1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S,Mr,man,noGroup,783,,,
1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S,Mr,man,noGroup,783,,,
1309,,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C,Master,boy,Peter,3,,,


In [522]:
full_df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,783,,0.0,
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,noGroup,1,,0.0,
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,noGroup,1,,0.0,
4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Mrs,woman,noGroup,1,,0.0,
5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Mr,man,noGroup,783,,0.0,


In [523]:
# Any passenger still without a surname also goes to noGroup
full_df.loc[full_df.Surname.isna(), 'Surname'] = 'noGroup'


In [524]:
# Recompute the survival rate on the first (training) records.
# Note: the slice 0:890 covers the first 890 rows, while the training set has 891,
# so the last training row keeps SurnameSurvival = NaN.
full_df.SurnameSurvival = np.nan
full_df[0:890]['SurnameSurvival'] = full_df[0:890].groupby('Surname')['Survived'].transform('mean')
full_df.head(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Mr,man,noGroup,783,0.321127,0.0,
2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs,woman,noGroup,1,0.321127,0.0,
3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Miss,woman,noGroup,1,0.321127,0.0,


In [525]:
# The test rows are handled differently, since it is not known whether
# these passengers died or not.
# full_df[i:i+1] selects the row at position i, whose PassengerId label is i+1.
for i in range(891, 1309):
    if (full_df[i:i+1].SurnameFreq[i+1] > 1):
        full_df[i:i+1]['SurnameSurvival'][i+1] = 1
    else:
        full_df[i:i+1]['SurnameSurvival'][i+1] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

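The loop above can also be expressed as a single masked assignment. This is a sketch on a made-up four-row frame (the `toy` frame and the `is_test` mask are names introduced here); it mirrors the notebook's rule for test rows, where the value is 1 when the surname appears more than once and 0 otherwise.

```python
import numpy as np
import pandas as pd

# Two "training" rows with a known outcome and two "test" rows without one
toy = pd.DataFrame(
    {
        "Survived": [0.0, 1.0, np.nan, np.nan],
        "SurnameFreq": [3, 1, 3, 1],
        "SurnameSurvival": [0.5, 1.0, np.nan, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Test rows are the ones whose outcome is unknown
is_test = toy["Survived"].isna()

# 1.0 where the surname occurs more than once, 0.0 otherwise, test rows only
toy.loc[is_test, "SurnameSurvival"] = (toy.loc[is_test, "SurnameFreq"] > 1).astype(float)
print(toy)
```

Selecting the test rows through `Survived.isna()` avoids hard-coded positions such as 891 and 1309, and a single `.loc` assignment writes to the frame itself rather than to a temporary copy.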

In [526]:
# New predictions, now on all records.
# Note: the two chained [...][...] assignments write to a temporary copy and leave
# full_df unchanged, so only the first rule (women survive) takes effect in the output.
full_df["predict"] = 0
full_df.predict[full_df.Label=='woman'] = 1
full_df.predict[full_df.Label=='boy'][full_df.SurnameSurvival==1] = 1
full_df.predict[full_df.Label=='woman'][full_df.SurnameSurvival==0] = 0

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

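For the three prediction rules to all take effect, each one needs a single `.loc` assignment with a combined mask. The sketch below shows the pattern on an invented five-row frame; the `toy` frame and its values are assumptions for illustration, and applying it to the real data would change the predictions and therefore the score reported above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Label": ["man", "woman", "woman", "boy", "boy"],
    "SurnameSurvival": [0.3, 1.0, 0.0, 1.0, 0.0],
})

# Default: nobody survives
toy["predict"] = 0

# Rule 1: women survive
toy.loc[toy["Label"] == "woman", "predict"] = 1

# Rule 2: boys survive when their whole group survived
toy.loc[(toy["Label"] == "boy") & (toy["SurnameSurvival"] == 1), "predict"] = 1

# Rule 3: women die when their whole group died
toy.loc[(toy["Label"] == "woman") & (toy["SurnameSurvival"] == 0), "predict"] = 0

print(toy)
```

Combining the conditions with `&` inside one `.loc` call means the assignment targets rows of the original frame directly, so no copy warning is raised.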

In [528]:
full_df.tail()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title,Label,Surname,SurnameFreq,SurnameSurvival,predict,AdjustedSurvival
1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S,Mr,man,noGroup,783,1.0,0,
1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C,Dona,woman,noGroup,1,0.0,1,
1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S,Mr,man,noGroup,783,1.0,0,
1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S,Mr,man,noGroup,783,1.0,0,
1309,,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C,Master,boy,Peter,3,1.0,0,


In [0]:
# Create the submission file
holdout_ids = test_df.index
submission_df = {"PassengerId": holdout_ids,
                 "Survived": full_df[891:].predict}
submission = pd.DataFrame(submission_df)

submission.to_csv("submission.csv",index=False)
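
As a hedged alternative, the submission can be built by selecting the test rows through their missing outcome instead of a positional slice. The sketch uses a made-up four-row frame and writes to an in-memory buffer rather than to `submission.csv`; the names `full`, `is_test` and `buffer` are introduced only for this example.

```python
import io

import numpy as np
import pandas as pd

# Two rows with a known outcome (training) and two without (test)
full = pd.DataFrame(
    {"Survived": [0.0, 1.0, np.nan, np.nan], "predict": [0, 1, 1, 0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

is_test = full["Survived"].isna()

# Keep the predictions of the test rows, as integers, under the column name Survived
submission = (
    full.loc[is_test, "predict"]
    .astype(int)
    .rename("Survived")
    .reset_index()
)

# Write to a buffer here; a file path such as "submission.csv" works the same way
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

`reset_index` turns the `PassengerId` index into a regular column, so the frame has the two columns in the same order as the dictionary used in the cell above.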