# Solving the Kaggle Titanic problem in R

## 1. Introduction
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we are asked to analyse what sorts of people were likely to survive. In particular, we apply the tools of machine learning to predict which passengers survived the tragedy. 

## 2. Load the necessary packages

In [76]:
# load required libraries
library(tidyverse) # tidyverse includes dplyr and stringr (used below), plus ggplot2, tibble, tidyr, readr
library(rpart) # decision tree
library(randomForest) # random forest

## 3. Load the necessary data
The training and test data are provided by Kaggle.

In [77]:
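# load 'train.csv' data
# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0),
# as in the str() output below
train <- read.csv("train.csv", stringsAsFactors = TRUE)

# load 'test.csv' data
test <- read.csv("test.csv", stringsAsFactors = TRUE)

# the test set has no outcome column; add it as NA so both sets share the same columns
test$Survived <- NA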

## 4. Combine Train and Test Data and view
The training data is used to fit a model, which then predicts survival for the passengers in the test data. Combining the two sets ensures that the same rules are used to fill missing values in both.

In [78]:
# combine train and test data
combined <- rbind(train, test)
str(combined)

'data.frame':	1309 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
 $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...


## 5. Check for Missing Values

In [79]:
# check for missing values
sapply(combined, function(x) sum(is.na(x)))
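
Note that `is.na()` only counts true `NA` values. Blank text fields read by `read.csv()` arrive as empty strings (the `""` level visible for `Cabin` and `Embarked` in the `str()` output above), so they are not included in that count. A minimal sketch of a check that counts both, using a small made-up data frame and a hypothetical helper `count_missing()`:

```r
# Toy data frame (made up for illustration, not the Titanic data)
toy <- data.frame(
  Age      = c(22, NA, 35),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: count values that are NA or an empty string
count_missing <- function(x) sum(is.na(x) | as.character(x) == "")

missing_counts <- sapply(toy, count_missing)
print(missing_counts)
```

Applied to the real data, the same helper could be passed to `sapply(combined, count_missing)`.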

In [80]:
# drop the Cabin column: most of its values are missing (stored as empty strings, so not counted by is.na above)
combined$Cabin <- NULL

In [81]:
# check which Embarked values are missing
which(combined$Embarked == "")

In [82]:
# set missing Embarked values to "S" (Southampton, the most common port of embarkation)
combined$Embarked[which(combined$Embarked == "")] <- "S"

In [83]:
# fill the missing Fare value with the mean fare, rounded to 4 decimal places
combined$Fare[is.na(combined$Fare)] <- round(mean(combined$Fare, na.rm = TRUE), 4)

## 6. Calculate mean ages for different groups of passengers

In [84]:
# average age of male passenger
avg_male_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(Sex == "male") %>%
    summarize(round(mean(Age)))

print("average age of male passenger")

avg_male_age[[1]]

# average age of a female passenger
avg_female_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(Sex == "female") %>%
    summarize(round(mean(Age)))

print("average age of female passenger")
avg_female_age[[1]]

# average age male passenger with Mr title
avg_mr_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(str_detect(Name, fixed("Mr."))) %>%
    summarize(round(mean(Age)))

print("average age of male passenger with Mr title")

avg_mr_age[[1]]

# average age of female passenger with Mrs title
avg_mrs_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(str_detect(Name, fixed("Mrs."))) %>%
    summarize(round(mean(Age)))

print("average age of female passenger with Mrs title")

avg_mrs_age[[1]]

# average age of female passenger with Miss title
avg_miss_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(str_detect(Name, fixed("Miss."))) %>%
    summarize(round(mean(Age)))

print("average age of female passenger with Miss title")

avg_miss_age[[1]]

# average age of passenger with Master title
avg_master_age <- combined %>%
    filter(!is.na(Age)) %>%
    filter(Sex == "male" & str_detect(Name, fixed("Master."))) %>%
    summarize(round(mean(Age)))

print("average age of passenger with Master title")

avg_master_age[[1]]

[1] "average age of male passenger"


[1] "average age of female passenger"


[1] "average age of male passenger with Mr title"


[1] "average age of female passenger with Mrs title"


[1] "average age of female passenger with Miss title"


[1] "average age of passenger with Master title"


## 7. Fill missing age values with average age of similar group passengers

In [85]:
# set missing age for male passengers with Mr. or Dr. title to the average age of passengers with Mr. title
combined$Age[is.na(combined$Age) & combined$Sex == "male" & (str_detect(combined$Name, fixed("Mr.")) | str_detect(combined$Name, fixed("Dr.")) )] <- avg_mr_age[[1]]

# set missing age for passengers with Mrs. title according to average age of available passengers with same title
combined$Age[is.na(combined$Age) & combined$Sex == "female" & str_detect(combined$Name, fixed("Mrs."))] <- avg_mrs_age[[1]]

# set missing age for passengers with Miss. or Ms. title to the average age of passengers with Miss. title
combined$Age[is.na(combined$Age) & combined$Sex == "female" & (str_detect(combined$Name, fixed("Miss.")) | str_detect(combined$Name, fixed("Ms.")))] <- avg_miss_age[[1]]

# set missing age for passengers with Master. title according to average age of available passengers with same title
combined$Age[is.na(combined$Age) & combined$Sex == "male" & str_detect(combined$Name, fixed("Master."))] <- avg_master_age[[1]]
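
The title-by-title assignments above can also be expressed as one grouped operation. The following is a minimal base R sketch on made-up names and ages, not the notebook's own method: it assumes every name follows the `Surname, Title. Given names` pattern, and the `Title` column is introduced here purely for illustration.

```r
# Toy data (made up for illustration); names follow "Surname, Title. Given names"
toy <- data.frame(
  Name = c("Smith, Mr. John", "Jones, Mr. Alan", "Brown, Mrs. Ann",
           "Green, Mrs. Eve", "White, Mr. Tom", "Black, Mrs. Joy"),
  Age  = c(30, 40, 50, 60, NA, NA),
  stringsAsFactors = FALSE
)

# Extract the title: the text between the comma and the first full stop
toy$Title <- sub("^.*, *([^.]+)\\..*$", "\\1", toy$Name)

# Replace each missing age with the rounded mean age of the same title group
toy$Age <- ave(toy$Age, toy$Title, FUN = function(a) {
  a[is.na(a)] <- round(mean(a, na.rm = TRUE))
  a
})

print(toy[, c("Title", "Age")])
```

A title group with no known ages at all would receive `NaN` under this approach, so rare titles would still need a fallback rule such as the Dr. and Ms. handling above.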

## 8. Feature engineering - classify passengers on the basis of their age
"Ladies and children first".
Grouping passengers by age shows that younger passengers had a higher chance of survival than other passengers. Hence, it is probably helpful to divide passengers on the basis of age, which requires some feature engineering.
All passengers aged 18 or below are assigned a value of 1 in the Child column.

In [86]:
# feature engineering - classify as child if aged 18 or below
combined$Child <- 0
combined$Child[combined$Age <= 18] <- 1
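
The same flag can be written as a single vectorised expression. A sketch on made-up ages, assuming that a passenger whose age is still missing should be treated as not a child (mirroring the default of 0 above):

```r
# Made-up ages for illustration; NA stands for an age that is still missing
age <- c(4, 18, 19, 42, NA)

# 1 if aged 18 or below, otherwise 0; `%in% TRUE` maps NA comparisons to FALSE
child <- as.integer((age <= 18) %in% TRUE)

print(child)
```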

In [87]:
# check for any other missing values
combined %>%
    sapply(function(x) sum(is.na(x)))

## 9. Separate training and test data

In [88]:
# separate train and test data sets
train <- combined[1:891,]
test <- combined[892:1309,]
rownames(test) <- NULL
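
The row indices above rely on the training file having exactly 891 rows. Because `Survived` was set to `NA` for every test row before combining, the split can also be derived from the data itself. A minimal sketch on a made-up stand-in for `combined` (the `toy_*` names are hypothetical):

```r
# Toy stand-in for `combined` (made up): test rows carry NA in Survived
toy_combined <- data.frame(
  PassengerId = 1:5,
  Survived    = c(0, 1, 1, NA, NA)
)

# Rows with a known outcome form the training set; the rest form the test set
toy_train <- toy_combined[!is.na(toy_combined$Survived), ]
toy_test  <- toy_combined[is.na(toy_combined$Survived), ]
rownames(toy_test) <- NULL

print(nrow(toy_train))
print(nrow(toy_test))
```

This only works if no training row has a missing outcome, which holds here because `Survived` is `NA` only where it was set explicitly for the test rows.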

In [89]:
# survival rate by sex
train %>%
    group_by(Sex) %>%
    summarize(mean = mean(Survived))

Sex,mean
female,0.7420382
male,0.1889081


In [90]:
# survival rate by passenger class and sex
train %>%
    group_by(Pclass, Sex) %>%
    summarize(mean(Survived))
# Female passengers in Pclass 1 and 2 survived in over 90% of cases

Pclass,Sex,mean(Survived)
1,female,0.9680851
1,male,0.3688525
2,female,0.9210526
2,male,0.1574074
3,female,0.5
3,male,0.1354467
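
The same rates can be laid out as a class-by-sex grid with base R's `tapply()`, which makes the comparison across classes easier to read. A sketch on made-up rows, not the Titanic data:

```r
# Made-up passengers for illustration
toy <- data.frame(
  Pclass   = c(1, 1, 1, 1, 3, 3, 3, 3),
  Sex      = c("female", "female", "male", "male",
               "female", "female", "male", "male"),
  Survived = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Mean of Survived for every combination of class (rows) and sex (columns)
rates <- tapply(toy$Survived, list(Pclass = toy$Pclass, Sex = toy$Sex), mean)

print(rates)
```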


## 10. Predicting survival rate

### 10.1 - everyone dies

In [91]:
test$Survived <- 0
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission-1.csv", row.names = FALSE)

**score - 0.62679**
This is broadly in line with the overall share of passengers who did not survive.
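
Kaggle scores this competition by accuracy, the share of passengers predicted correctly, so a rule-based prediction like this one can be checked against the training labels before submitting. A minimal sketch with a hypothetical helper `accuracy()` and made-up outcomes:

```r
# Hypothetical helper: share of predictions that match the known outcomes
accuracy <- function(predicted, actual) mean(predicted == actual)

# Made-up outcomes for illustration: 5 of 8 passengers died
actual <- c(0, 0, 1, 0, 1, 0, 1, 0)

# "Everyone dies" predicts 0 for every passenger
everyone_dies <- rep(0, length(actual))

baseline <- accuracy(everyone_dies, actual)
print(baseline)
```

On the real data, `accuracy(rep(0, nrow(train)), train$Survived)` would give the share of training passengers who died.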

### 10.2 - all females survived, all males died

In [92]:
test$Survived <- 0
test$Survived[test$Sex == "female"] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission-2.csv", row.names = FALSE)

**score - 0.76555**
An improvement on the everyone-dies baseline (0.62679).

### 10.3 - all females in Pclass1 and 2 Survived, everyone else died

In [93]:
test$Survived <- 0
test$Survived[test$Sex == "female" & test$Pclass == 1] <- 1
test$Survived[test$Sex == "female" & test$Pclass == 2] <- 1
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "submission-3.csv", row.names = FALSE)

**score - 0.75598**
This does not improve on the all-females-survive rule (0.76555), although it is only slightly worse.

### 10.4 - using decision tree classifier

In [94]:
# fit a classification tree on the training data
my_tree <- rpart(Survived ~
                 Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")

# predict the class (0 or 1) for each test passenger and write the submission file
prediction <- predict(my_tree, test, type = "class")
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)
write.csv(solution, file = "submission-4.csv", row.names = FALSE)

**Score - 0.79904**
This puts the score at the 85th percentile on Kaggle, a significant improvement on the earlier submissions.
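
A tree can also be assessed locally, without a submission, by holding back part of the labelled data. The following is a sketch on simulated data rather than the Titanic files; the 25% hold-out share, the seed and the simulated survival probabilities are all arbitrary assumptions.

```r
library(rpart)

set.seed(42)

# Simulated stand-in for the training data (not the Titanic data)
n <- 400
sim <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
# Simulated outcome: women survive more often than men
sim$Survived <- rbinom(n, 1, ifelse(sim$Sex == "female", 0.75, 0.2))

# Hold out 25% of the rows for validation and fit on the rest
holdout_idx <- sample(n, size = n / 4)
fit <- rpart(Survived ~ Sex + Pclass + Age,
             data = sim[-holdout_idx, ], method = "class")

# Accuracy on rows the tree has not seen
predicted <- predict(fit, sim[holdout_idx, ], type = "class")
holdout_accuracy <- mean(as.integer(as.character(predicted)) ==
                           sim$Survived[holdout_idx])
print(holdout_accuracy)
```

The hold-out accuracy is only an estimate, and it will vary with the seed and with which rows are held back.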

### 10.5 - Random Forest

In [443]:
# Set seed for reproducibility
set.seed(111)

# Apply the Random Forest Algorithm
my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                          data = train, importance = TRUE, ntree = 1000)

prediction <- predict(my_forest, test)

solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)

write.csv(solution, file = "submission-5.csv", row.names = FALSE)

**Score - 0.7655**
This is no better than the decision tree classifier (0.79904).
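
The call above sets `importance = TRUE`, but the fitted forest is never inspected. A random forest also keeps an out-of-bag (OOB) error estimate, which gives a rough idea of accuracy without a submission. A sketch on simulated data (not the Titanic files), with the number of trees and the simulated probabilities chosen arbitrarily:

```r
library(randomForest)

set.seed(111)

# Simulated stand-in for the training data (not the Titanic data)
n <- 300
sim <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)
sim$Survived <- factor(rbinom(n, 1, ifelse(sim$Sex == "female", 0.75, 0.2)))

fit <- randomForest(Survived ~ Sex + Pclass + Age, data = sim,
                    importance = TRUE, ntree = 200)

# Out-of-bag error rate after the final tree
oob_error <- fit$err.rate[fit$ntree, "OOB"]
print(oob_error)

# Permutation-based importance (mean decrease in accuracy) per predictor
var_importance <- importance(fit, type = 1)
print(var_importance)
```

On the real model, `importance(my_forest, type = 1)` would show which of the seven predictors the forest relies on most.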