# Introduction

I used a random forest to make predictions for this challenge. First, I created some simple features from the given data. I then used mice to impute the missing data, in particular the missing Age values. In a second round of feature engineering, I used the completed Age data (among other things) to create some more features.

The final stage is feature selection and prediction. The resulting model scored 0.80383 on Kaggle, which put it in the top 18% of all submissions.

I thank Megan L. Risdal for her script that inspired me.

# 1. Loading Data and Libraries

In [None]:
library(plyr)
library(rpart)
library(randomForest)
library(ggplot2)
library(mice)

In [None]:
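# Treat empty strings as NA; stringsAsFactors = TRUE keeps the pre-R 4.0 default of reading strings as factors
train <- read.csv("train.csv", na.strings = "", stringsAsFactors = TRUE)
test <- read.csv("test.csv", na.strings = "", stringsAsFactors = TRUE)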
test$Survived <- NA
all_data <- rbind(train, test)

# 2. Feature Engineering

## 2.1 Title

In [None]:
# Extract Title from name
all_data$Title <- gsub('(.*, )|(\\..*)', '', all_data$Name)
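
To see what the regular expression does, here is a small sketch on a few names written in the same "Last, Title. First" format as the Name column. The `sample_names` vector is only an illustration and is not read from the data files.

```r
# Illustrative names in the "Last, Title. First" format of the Name column
sample_names <- c("Braund, Mr. Owen Harris",
                  "Heikkinen, Miss. Laina",
                  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# '(.*, )' removes everything up to and including the comma and space;
# '(\\..*)' removes the first period and everything after it
sample_titles <- gsub('(.*, )|(\\..*)', '', sample_names)
print(sample_titles)
```

What remains between the two removed parts is the title, including multi-word titles such as "the Countess".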

In [None]:
# Show title counts by sex
print(table(all_data$Sex, all_data$Title))

In [None]:
all_data$Title <- mapvalues(all_data$Title, from = c("Mlle", "Col", "Major", "Jonkheer", "Don", "Mme", "Capt", 
                                                     "the Countess", "Ms", "Dona"), 
                            to = c("Miss", "Officer", "Officer", "Sir", "Sir", "Mrs", "Officer", 
                                   "Lady", "Mrs", "Mrs"))
all_data$Title <- factor(all_data$Title)
print(unique(all_data$Title))

In [None]:
# Separate the last names from names
all_data$SplitName <- strsplit(as.character(all_data$Name), "," )
all_data$LastName <- sapply(all_data$SplitName, "[", 1)

## 2.2 Family Size

In [None]:
all_data$FamilySize <- all_data$SibSp + all_data$Parch + 1
all_data$Family <- paste(all_data$LastName, all_data$FamilySize, sep = "_")

In [None]:
# Plot family size vs. survival (rows 1:891 are the training set)
ggplot(all_data[1:891,], aes(x = FamilySize, fill = factor(Survived))) +
  geom_bar(position='dodge')

Because large families are rare and have similar survival rates, it is useful to group family sizes into a few bins.

In [None]:
# Discretize family size: 1 = travelling alone, 2 = small family (2-4), 3 = large family (5 or more)
all_data$FsizeD[all_data$FamilySize == 1] <- 1
all_data$FsizeD[all_data$FamilySize < 5 & all_data$FamilySize > 1] <- 2
all_data$FsizeD[all_data$FamilySize > 4] <- 3

# Show family size by survival using a mosaic plot
mosaicplot(table(all_data$FsizeD, all_data$Survived), main='Family Size by Survival', shade=TRUE)
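
The same three bins can also be produced in one step with base R's `cut()`. This is only a sketch on a made-up vector (`family_size` is hypothetical), and unlike the assignments above it returns a factor directly.

```r
# Toy family sizes covering all three bins
family_size <- c(1, 2, 4, 5, 11)

# Intervals are right-closed by default: (0,1], (1,4], (4,Inf]
fsize_binned <- cut(family_size, breaks = c(0, 1, 4, Inf), labels = c(1, 2, 3))
print(table(fsize_binned))
```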

## 2.3 Missing Age and Cabin

Sometimes the missingness of data can be a clue too. We create a feature called "MissingAgeCabin" to indicate if a person is missing both Age and Cabin data.

In [None]:
all_data$MissingAgeCabin[(is.na(all_data$Age))&(is.na(all_data$Cabin))] <- 1
all_data$MissingAgeCabin[(!is.na(all_data$Age)) | (!is.na(all_data$Cabin))] <- 0
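
Since the two conditions are exact complements, the indicator can also be written as a single vectorized expression. A minimal sketch on a toy data frame (the `toy` data is invented for illustration):

```r
# Toy data: only the second passenger is missing both Age and Cabin
toy <- data.frame(Age   = c(22, NA, NA, 35),
                  Cabin = c("C85", NA, "E46", NA),
                  stringsAsFactors = FALSE)

# TRUE/FALSE converted to 1/0
toy$MissingAgeCabin <- as.integer(is.na(toy$Age) & is.na(toy$Cabin))
print(toy)
```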

## 2.4 Ticket Group Size

Notice that some ticket numbers are shared by several passengers. They probably bought their tickets together, so they may be relatives or friends. We create a feature called "TicketGroupSize" that gives the number of people who share the same Ticket number.

In [None]:
# Count the passengers per ticket once, then map each ticket to its count
ticket_counts <- as.data.frame(table(all_data$Ticket))
all_data$TicketGroupSize <- mapvalues(all_data$Ticket, from = as.vector(ticket_counts$Var1),
                                      to = as.vector(ticket_counts$Freq))
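
Note that `mapvalues` keeps the type of its input, so the result above is a factor (or character vector) of count labels rather than a number. If a numeric group size is wanted instead, base R's `ave()` is one option. The sketch below uses a made-up `tickets` vector; whether the forest should treat the group size as numeric or categorical is a modeling choice that this example does not settle.

```r
# Toy ticket numbers: the first ticket is shared by three passengers
tickets <- c("A/5 21171", "PC 17599", "A/5 21171", "113803", "A/5 21171")

# For every passenger, count how many rows carry the same ticket
group_size <- ave(seq_along(tickets), tickets, FUN = length)
print(group_size)
```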

# 3. Fill in Missing Data

## 3.1 Embarked

First, see whose Embarked data is missing.

In [None]:
print(all_data[is.na(all_data$Embarked),])

It looks like only passengers 62 and 830 have missing Embarked data. Both are female, both paid a fare of 80, and they share the same Ticket number, so it is likely that they embarked at the same place.

In [None]:
ggplot(all_data[(!is.na(all_data$Embarked))&(!is.na(all_data$Fare)),], 
       aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot()

From the plot, the fare they paid suggests that they most likely departed from port "C".

In [None]:
# Fill the Embarked data with "C"
all_data$Embarked[is.na(all_data$Embarked)] <- 'C'

In [None]:
# Fill in the one missing Fare value (row 1044) with the median fare of third-class passengers who embarked at "C"
all_data$Fare[1044] <- median(all_data[all_data$Pclass == '3' & all_data$Embarked == 'C', ]$Fare, na.rm = TRUE)
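
As an alternative to hard-coding the row number and the group, the class and port can be looked up from the passenger whose fare is missing. The following is a sketch on invented data (`toy` and its values are hypothetical), not the imputation used for the submission.

```r
# Toy data: the last passenger has no Fare
toy <- data.frame(
  Pclass   = c(3, 3, 3, 1, 3),
  Embarked = c("S", "S", "C", "S", "S"),
  Fare     = c(8.05, 7.25, 14.45, 80, NA),
  stringsAsFactors = FALSE
)

# For each missing fare, use the median fare of passengers
# with the same class and the same port of embarkation
for (i in which(is.na(toy$Fare))) {
  same_group <- toy$Pclass == toy$Pclass[i] & toy$Embarked == toy$Embarked[i]
  toy$Fare[i] <- median(toy$Fare[same_group], na.rm = TRUE)
}
print(toy)
```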

In [None]:
print(all_data$Embarked[is.na(all_data$Embarked)])

## 3.2 Age

Quite a lot of Age values are missing.

In [None]:
print(sum(is.na(all_data$Age)))

We give two ways of filling in the missing data: one using a random forest, the other using mice. Mice gave the better result here, so the random forest version is left commented out.

In [None]:
# # Set a random seed
# set.seed(129)

# # Fit a random forest to impute Age
# predicted_age <- randomForest(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + FamilySize,
#                        data = all_data[!is.na(all_data$Age),], ntree = 1000)
# all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])

In [None]:
# Use mice to impute Age
# Convert the categorical variables into factors
factor_vars <- c('PassengerId','Pclass','Sex','Embarked',
                 'Title','LastName','Family','FsizeD')

all_data[factor_vars] <- lapply(all_data[factor_vars], function(x) as.factor(x))

# Set a random seed
set.seed(129)
# Perform mice imputation, excluding certain less-than-useful variables:
mice_mod <- mice(all_data[, !names(all_data) %in% 
                          c('PassengerId','Name','Ticket','Cabin','Family','LastName','Survived','SplitName')], 
                 method='rf') 
# Save the complete output 
mice_output <- complete(mice_mod)

# Compare the Age distribution before and after imputation
par(mfrow=c(1,2))
hist(all_data$Age, freq=F, main='Age: Original Data', 
  col='darkgreen', ylim=c(0,0.04))
hist(mice_output$Age, freq=F, main='Age: MICE Output', 
  col='lightgreen', ylim=c(0,0.04))

# Replace the Age variable with the mice output
# (done after plotting, so the first histogram still shows the original data)
all_data$Age <- mice_output$Age
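
For readers new to mice, here is a minimal sketch of the same workflow on `nhanes`, a small example dataset bundled with the package (not the Titanic data). It uses the package's default imputation method rather than `method='rf'`, and the object names `imp` and `nhanes_filled` are my own.

```r
library(mice)

# nhanes ships with mice and contains missing values in several columns
print(colSums(is.na(nhanes)))

# Create m = 5 imputed datasets; seed makes the run reproducible,
# printFlag = FALSE silences the iteration log
imp <- mice(nhanes, m = 5, seed = 129, printFlag = FALSE)

# complete() returns one filled-in dataset; action = 1 selects the first imputation
nhanes_filled <- complete(imp, action = 1)
print(sum(is.na(nhanes_filled)))
```

Calling `complete()` without an `action` argument, as in the cell above, also returns the first of the imputed datasets.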

In [None]:
# Show new number of missing Age values
sum(is.na(all_data$Age))

In [None]:
# We look at the relation between age and survival
ggplot(all_data[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram()

# 4. Age Related Feature Engineering

In [None]:
# Create a Child feature
all_data$Child[all_data$Age < 10] <- "Child"
all_data$Child[all_data$Age >= 10]<- "Adult"

In [None]:
# Create a Mother feature
all_data$Mother <- 'Not Mother'
all_data$Mother[all_data$Sex == 'female' & all_data$Parch > 0 & all_data$Age > 18 & all_data$Title != 'Miss'] <- 'Mother'
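
The same two features can be written with `ifelse()`, which makes the rule for each label explicit. A small sketch on invented passengers (the `toy` data frame is hypothetical):

```r
# Toy passengers chosen to exercise each rule
toy <- data.frame(
  Sex   = c("female", "female", "male", "female"),
  Age   = c(30, 8, 40, 25),
  Parch = c(2, 1, 1, 1),
  Title = c("Mrs", "Miss", "Mr", "Miss"),
  stringsAsFactors = FALSE
)

# Child: younger than 10
toy$Child <- ifelse(toy$Age < 10, "Child", "Adult")

# Mother: adult woman with at least one parent/child aboard whose title is not "Miss"
is_mother <- toy$Sex == "female" & toy$Parch > 0 & toy$Age > 18 & toy$Title != "Miss"
toy$Mother <- ifelse(is_mother, "Mother", "Not Mother")
print(toy)
```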

In [None]:
# Factor the features
all_data$Child  <- factor(all_data$Child)
all_data$Mother <- factor(all_data$Mother)
all_data$FsizeD <- factor(all_data$FsizeD)

# 5. Model

In [None]:
train_filled <- data.frame(all_data[1:891,],stringsAsFactors = TRUE)
test_filled <- data.frame(all_data[892:1309,],stringsAsFactors = TRUE)

In [None]:
# See all features
print(colnames(train_filled))

In [None]:
set.seed(754)

# Build the model (note: not all possible variables are used)
rf_full <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + 
                                            Fare + Embarked + Title + Child + Mother + FsizeD + 
                                            TicketGroupSize + MissingAgeCabin,
                                            data = train_filled, importance = TRUE, proximity=TRUE, do.trace= TRUE,
                                            ntree = 1000, nodesize = 100)

# Show model error
plot(rf_full, ylim=c(0,0.36))
legend('topright', colnames(rf_full$err.rate), col=1:3, fill=1:3)

## 5.1 Feature Selection

In [None]:
print(varImpPlot(rf_full))

Select the features Title, Sex, Pclass, TicketGroupSize, FsizeD, SibSp and Child. This selection was made after a few rounds of trial and error, together with some guessing based on common sense.
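
One way to make such a trial-and-error round a little more systematic is to rank the features by importance and compare the out-of-bag (OOB) error of the full and the reduced model. The sketch below does this on the built-in `iris` data rather than the Titanic data. Keeping the top two features is an arbitrary choice for illustration, and names such as `rf_all` and `rf_small` are my own.

```r
library(randomForest)

set.seed(754)

# Fit a forest on all features and record permutation importance
rf_all <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# type = 1 is the mean decrease in accuracy; sort from most to least important
imp <- importance(rf_all, type = 1)
ranked <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(ranked)

# Refit using only the two highest-ranked features
top_features <- rownames(ranked)[1:2]
rf_small <- randomForest(reformulate(top_features, response = "Species"),
                         data = iris, ntree = 200)

# Compare the OOB error after the last tree of each forest
oob_all   <- rf_all$err.rate[rf_all$ntree, "OOB"]
oob_small <- rf_small$err.rate[rf_small$ntree, "OOB"]
print(c(all = oob_all, selected = oob_small))
```

If the reduced model's OOB error is about the same or lower, the dropped features were probably not helping.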

In [None]:
set.seed(754)

# Build the model using only the selected features
my_forest <- randomForest(as.factor(Survived) ~ Title + Sex + Pclass + TicketGroupSize + FsizeD + SibSp + Child,
                                            data = train_filled, importance = TRUE, proximity=TRUE, do.trace= TRUE,
                                            ntree = 1000, nodesize = 100)


# 6. Prediction

In [None]:
# Predict using the test set
prediction <- predict(my_forest, test_filled)

# Save the solution to a data frame with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerId = test$PassengerId, Survived = prediction)

# Write the solution to file
write.csv(solution, file = 'improved_R_Solution.csv', row.names = FALSE)
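
Before uploading, it can be worth reading the file back and checking its shape. The sketch below runs the check on a made-up three-row solution written to a temporary file. The expected column names follow the comment above, and the passenger IDs and predictions are invented.

```r
# Toy solution with the two expected columns
toy_solution <- data.frame(PassengerId = 892:894, Survived = factor(c(0, 1, 0)))

# Write to a temporary file and read it back
out_file <- tempfile(fileext = ".csv")
write.csv(toy_solution, file = out_file, row.names = FALSE)
check <- read.csv(out_file)

# Basic sanity checks: column names, 0/1 predictions, no missing values
stopifnot(identical(names(check), c("PassengerId", "Survived")),
          all(check$Survived %in% c(0, 1)),
          !anyNA(check))
print(check)
```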