# Titanic

## Contents

- **Survived**: Whether the passenger survived the disaster: 0 for those who did not survive, 1 for those who did;
- **Pclass**: Class in which the passenger travelled: 1 for first class, 2 for second, and 3 for third;
- **Name**: Passenger's name;
- **Sex**: Passenger's sex;
- **Age**: Passenger's age in years;
- **SibSp**: Number of siblings and spouses aboard;
- **Parch**: Number of parents and children aboard;
- **Ticket**: Ticket number;
- **Fare**: Ticket price;
- **Cabin**: Passenger's cabin number;
- **Embarked**: Port at which the passenger embarked. There are only three possible values: Cherbourg, Queenstown, and Southampton, denoted by the letters "C", "Q", and "S", respectively.

## Loading the Libraries

In [1]:
library(repr)
options(repr.plot.width = 4, repr.plot.height = 4)

## Loading the DataFrame

In [157]:
df = read.csv("../datasets/titanic/train.csv", na.strings = '')
head(df)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


# Variable Processing

## Name - Extracting the Titles

- Gentleman: Sir / Mr / Don / Jonkheer.
- Married woman: Madam / Mrs / Ms / Mme (French) / Countess.
- Unmarried woman / young lady / girl: Miss, Mlle (mademoiselle).
- Boys (formal): Master
- Doctor: Dr
- Reverend: Rev
- Captain: Capt
- Major: Major
- Colonel: Col

In [158]:
# Extract the title: the text between ", " and the first "." (an optional "the " is skipped)
title <- sub("^[^,]*, *(the )?([^.]*)\\..*$", "\\2", as.character(df$Name))

# Map each title to a group. Matching the whole title avoids the false hits that
# substring patterns such as "Col" or "Dr" can produce on surnames.
# 'Tripulacao' ("crew") is the label used here for the military ranks.
groups <- c(Mr = 'Mr', Don = 'Mr', Sir = 'Mr', Jonkheer = 'Mr',
            Mrs = 'Mrs', Dona = 'Mrs', Countess = 'Mrs', Mme = 'Mrs', Lady = 'Mrs',
            Miss = 'Miss', Mlle = 'Miss', Ms = 'Miss',
            Master = 'Master', Dr = 'Dr', Rev = 'Rev',
            Capt = 'Tripulacao', Major = 'Tripulacao', Col = 'Tripulacao')
df$Name <- unname(groups[title])  # any title missing from `groups` becomes NA

unique(df$Name)
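
When titles are matched inside full names, a bare substring pattern can also hit a surname. The sketch below illustrates the difference with hypothetical names (they are made up for this example and only follow the "Surname, Title. Given names" layout of the dataset):

```r
# Hypothetical names in the "Surname, Title. Given names" layout
names_demo <- c("Collins, Miss. Anna", "Drake, Mr. John", "Smith, Col. Peter")

# A bare substring also hits the surname "Collins"
loose <- grepl("Col", names_demo)

# Matching the literal ", Col. " only hits the actual title
strict <- grepl(", Col. ", names_demo, fixed = TRUE)

print(loose)
print(strict)
```

Only the third name carries the title, so `strict` flags one entry while `loose` flags two.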

## Name - Categorical to Numeric

In [159]:
df$Name = factor(df$Name, levels = unique(df$Name), labels = 1:7)
head(df)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,1,male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,2,female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,2,female,35.0,1,0,113803,53.1,C123,S
5,0,3,1,male,35.0,0,0,373450,8.05,,S
6,0,3,1,male,,0,0,330877,8.4583,,Q
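
Note that `factor(..., labels = 1:7)` produces a factor whose labels are the strings "1" to "7", not a numeric column. If true numbers are needed later, one possible follow-up is sketched below; the vector `titles` is a made-up example, and `seq_along()` is used so the number of labels does not have to be hard-coded:

```r
titles <- c("Mr", "Mrs", "Miss", "Mrs", "Mr")

# Levels follow order of first appearance; labels are generated to match their count
lv <- unique(titles)
encoded <- factor(titles, levels = lv, labels = seq_along(lv))

# Go through as.character() so the labels, not the internal codes, are converted
codes <- as.integer(as.character(encoded))
print(codes)
```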


## Age - Replacing Missing Values

- 177 ages are missing

In [160]:
media = mean(df$Age, na.rm = TRUE)
df$Age = ifelse(is.na(df$Age), media, df$Age)
head(df)

sum(is.na(df$Age))

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,1,male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,2,female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,2,female,35.0,1,0,113803,53.1,C123,S
5,0,3,1,male,35.0,0,0,373450,8.05,,S
6,0,3,1,male,29.69912,0,0,330877,8.4583,,Q
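
The imputation step can be checked on a small vector. In this sketch the values in `age` are assumptions chosen for illustration; it shows that filling with the mean removes every `NA` and leaves the column mean unchanged:

```r
age <- c(22, 38, NA, 35)

# Mean of the observed values only
avg <- mean(age, na.rm = TRUE)

# Replace each NA with that mean; observed values stay untouched
age_filled <- ifelse(is.na(age), avg, age)

print(sum(is.na(age)))         # missing before
print(sum(is.na(age_filled)))  # missing after
print(all.equal(mean(age_filled), avg))
```

Mean imputation preserves the mean but reduces the spread of the column, which is worth keeping in mind when many values are filled in.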


## Embarked - Replacing Missing Values with the Mode

In [161]:
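t = table(df$Embarked)
moda = names(t)[which.max(t)]  # most frequent port; which.max() keeps a single value even on ties

df$Embarked <- ifelse(is.na(df$Embarked), moda, as.character(df$Embarked))

head(df)
sum(is.na(df$Embarked))  # is.null() would test the whole vector at once, not each element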

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,1,male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,2,female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,2,female,35.0,1,0,113803,53.1,C123,S
5,0,3,1,male,35.0,0,0,373450,8.05,,S
6,0,3,1,male,29.69912,0,0,330877,8.4583,,Q
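
Mode imputation for a categorical column can be sketched the same way. The vector `port` below is a made-up example; the last two lines contrast `is.na()`, which checks each element, with `is.null()`, which only reports whether the object itself is `NULL`:

```r
port <- c("S", "C", NA, "S", "Q", NA)

# table() ignores NA by default; which.max() returns the first most frequent value
counts <- table(port)
mode_value <- names(counts)[which.max(counts)]

port_filled <- ifelse(is.na(port), mode_value, port)

print(port_filled)
print(sum(is.na(port_filled)))  # element-wise check for missing values
print(is.null(port_filled))     # only tests whether the object itself is NULL
```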


## Sex - Categorical to Numeric

In [162]:
df$Sex = factor(df$Sex, levels = unique(df$Sex), labels = 0:1)
head(df)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,1,0,22.0,1,0,A/5 21171,7.25,,S
2,1,1,2,1,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,3,1,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,2,1,35.0,1,0,113803,53.1,C123,S
5,0,3,1,0,35.0,0,0,373450,8.05,,S
6,0,3,1,0,29.69912,0,0,330877,8.4583,,Q
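
With `levels = unique(df$Sex)`, the code assigned to each value depends on which one appears first in the file (in the output above, male is 0 and female is 1). Listing the levels explicitly makes the mapping independent of row order. A sketch with an assumed toy vector in which "female" happens to come first:

```r
sex <- c("female", "male", "male", "female")

# Order of appearance: here "female" comes first and therefore gets 0
by_appearance <- factor(sex, levels = unique(sex), labels = 0:1)

# Explicit levels: "male" is always 0 and "female" always 1
explicit <- factor(sex, levels = c("male", "female"), labels = 0:1)

print(as.character(by_appearance))
print(as.character(explicit))
```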


## Embarked - Categorical to Numeric

In [163]:
df$Embarked = factor(df$Embarked, levels = unique(df$Embarked), labels = 1:3)
head(df)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,1,0,22.0,1,0,A/5 21171,7.25,,1
2,1,1,2,1,38.0,1,0,PC 17599,71.2833,C85,2
3,1,3,3,1,26.0,0,0,STON/O2. 3101282,7.925,,1
4,1,1,2,1,35.0,1,0,113803,53.1,C123,1
5,0,3,1,0,35.0,0,0,373450,8.05,,1
6,0,3,1,0,29.69912,0,0,330877,8.4583,,3


## Removing Columns

- PassengerId - The identifier does not affect the probability of survival
- Ticket - Does not affect the probability of survival
- Cabin - Treated as redundant with Pclass, since both are taken to indicate the same thing.

In [164]:
df$PassengerId = NULL
df$Ticket = NULL
df$Cabin = NULL
head(df)

Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,0,22.0,1,0,7.25,1
1,1,2,1,38.0,1,0,71.2833,2
1,3,3,1,26.0,0,0,7.925,1
1,1,2,1,35.0,1,0,53.1,1
0,3,1,0,35.0,0,0,8.05,1
0,3,1,0,29.69912,0,0,8.4583,3
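
As an alternative to assigning `NULL` column by column, several columns can be dropped in one step by keeping only the names that are not in a drop list. A minimal sketch on a made-up data frame (`toy` and `drop_cols` are names introduced here for illustration):

```r
toy <- data.frame(PassengerId = 1:3, Ticket = c("a", "b", "c"),
                  Cabin = c(NA, "C85", NA), Fare = c(7.25, 71.28, 7.93))

# Keep every column whose name is not in the drop list
drop_cols <- c("PassengerId", "Ticket", "Cabin")
toy <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]

print(names(toy))
```

`drop = FALSE` keeps the result a data frame even when a single column remains.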


## Feature Scaling

- Age and Fare should be scaled

In [167]:
df[, c('Age', 'Fare')] = scale(df[, c('Age', 'Fare')])

head(df)

Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,3,1,0,-0.592148,1,0,-0.5021631,1
1,1,2,1,0.6384304,1,0,0.7864036,2
1,3,3,1,-0.2845034,0,0,-0.4885799,1
1,1,2,1,0.407697,1,0,0.4204941,1
0,3,1,0,0.407697,0,0,-0.4860644,1
0,3,1,0,2.010702e-17,0,0,-0.4778481,3
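
By default, `scale()` standardizes each column as z = (x - mean) / sd. An entry equal to the column mean, such as an age that was imputed with the mean, therefore maps to zero up to floating-point error, which is why the sixth row shows an Age on the order of 1e-17. A sketch on assumed values (the vector `x` is illustrative, with 30 chosen to be exactly its mean):

```r
x <- c(22, 38, 30, 35, 25)   # illustrative values; 30 is exactly the mean

# scale() centers on the mean and divides by the sample standard deviation
z <- as.numeric(scale(x))
manual <- (x - mean(x)) / sd(x)

print(all.equal(z, manual))
print(round(mean(z), 10))
print(sd(z))
```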
