
# Exploring Titanic Survival Rates

This file tests the notebook environment and serves as practice for a competition submission.

## Loading and Setting Up the Data

In [None]:
#include libraries
library(ggplot2) #library for producing plots

system("ls ../input")
#load the training data:
df=read.csv("../input/train.csv",stringsAsFactors=FALSE) 

#Display a summary of all the variables and their type
str(df)

The previous code used the "stringsAsFactors = FALSE" argument so that I get to set up the variables myself, which forces a look at the content of each variable. Now to set up some factor variables:

In [None]:
#Change the survived variable to make summary tables prettier:
df$Survived=factor(df$Survived, 
                   levels=c(0,1),
                   labels =c("died","lived"))
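
As a small illustration of the recoding above, the sketch below applies the same `factor()` call to a short made-up vector (`survived_raw` is a hypothetical stand-in for `df$Survived`, not real data) and prints the resulting labels and counts.

```r
# Toy 0/1 vector standing in for df$Survived (values are made up)
survived_raw <- c(0, 1, 1, 0, 0)

# Map 0 -> "died" and 1 -> "lived", exactly as in the cell above
survived <- factor(survived_raw,
                   levels = c(0, 1),
                   labels = c("died", "lived"))

print(levels(survived))  # the two labels, in level order
print(table(survived))   # counts per label instead of counts per 0/1
```

Because the labels replace the raw codes, every later `table()` call shows "died" and "lived" rather than 0 and 1.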

## Effect of Gender on Survival
This section explores how gender affects mortality:

In [None]:
df$Sex=factor(df$Sex) #change the gender variable to a factor
table(df$Survived,df$Sex) #Summary of mortality by gender

The above table shows that about 1 in 4 women died and about 4 in 5 men died. This means that gender has a real effect on survival chances and will be used as one of the inputs into our model.
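
Reading ratios such as "1 in 4" off a table of raw counts is easier once the counts are turned into column proportions. The following is a minimal sketch using `prop.table()` on a small made-up data frame; `toy` and its rows are hypothetical and only illustrate the call, they are not the Titanic figures.

```r
# Hypothetical toy data; the rows are illustrative, not the Titanic counts
toy <- data.frame(
  Survived = factor(c("died", "lived", "lived", "lived",
                      "died", "died", "died", "lived"),
                    levels = c("died", "lived")),
  Sex = factor(c("female", "female", "female", "female",
                 "male", "male", "male", "male"))
)

counts <- table(toy$Survived, toy$Sex)

# margin = 2 normalises each column, so every Sex column sums to 1
rates <- prop.table(counts, margin = 2)
print(round(rates, 2))
```

On the real data the same call would be `prop.table(table(df$Survived, df$Sex), margin = 2)`.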

## Effect of Age on Survival
This section examines survival rates by age. It makes sense that kids are more likely to survive, but what about the elderly? Let's explore the age groups and the survival percentage of each group.

In [None]:
options(repr.plot.width=5, repr.plot.height=3)#Plot size Options

#Determine age distribution
age_range=cut(df$Age, seq(0,100,10)) #Sub-divide the age range into 10 year sections
qplot(age_range, xlab="Age Range", main="Age distribution on the Titanic") #plot age distribution

#Determine survival percentage:
ggplot(df, aes(x=Age, fill=Survived))+
  geom_histogram(binwidth = 5,position="fill")+
  ggtitle("Survival percentage amongst the age groups")

#check percentage of unknown age passengers:
print("Survival rate of passengers whose age is unknown:")
table(df$Survived[is.na(df$Age)]) 

#Replace the missing age entries with the average age
df$Age[is.na(df$Age)]=mean(df$Age, na.rm=TRUE)

The above graphs show that the survival percentage depends strongly on age. Younger passengers have a higher survival rate; elderly passengers do not. There are also over 150 entries that do not have an age value, and of those only about 1 in 3 survived. Finally, the missing ages are replaced with the average age, which is one strategy for dealing with missing values. This seemed appropriate since the survival chances of the passengers with a missing age were in the same range as those of passengers around the average age.
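
Mean imputation is only one option. The sketch below, on a made-up `age` vector, shows the mean replacement used above next to a median replacement, and keeps a flag for which entries were originally missing. The names `age_missing`, `age_mean` and `age_median` are hypothetical, and the median variant is an alternative I am adding for comparison, not something the analysis above uses.

```r
# Hypothetical ages with two missing entries
age <- c(22, 38, NA, 35, NA, 54, 2)

# Remember which entries were missing before anything is overwritten
age_missing <- is.na(age)

# Strategy used above: replace missing values with the mean
age_mean <- age
age_mean[age_missing] <- mean(age, na.rm = TRUE)

# Alternative: the median, which is less sensitive to extreme ages
age_median <- age
age_median[age_missing] <- median(age, na.rm = TRUE)

print(age_mean)
print(age_median)
```

Keeping the `age_missing` flag would let a model treat "age unknown" as information in its own right, if that ever turns out to be useful.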

## Embark Location
The next step is to examine and clean the embark location:

In [None]:
#Explore embark location
df$Embarked[df$Embarked==""]="S" #replace missing values with majority (S), highest chance of being right
df$Embarked=factor(df$Embarked, levels=c("S","C","Q")) #Set as factor in order of S->C->Q
table(df$Survived,df$Embarked) #show summary table of survival chances

The Titanic sailed from Southampton to Cherbourg to Queenstown. No apparent trend is visible, but there could be one if people from certain cities were placed in certain locations on the ship (I won't spend time researching this). The embark location will be used in the model.
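
The replacement value "S" was hard-coded above. As a hedged sketch, the most common port can also be computed from the data before filling the blanks; the `embarked` vector below is made up and `most_common` is a hypothetical helper name.

```r
# Hypothetical embark codes with two blank (missing) entries
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Count only the non-missing codes and pick the most frequent one
counts <- table(embarked[embarked != ""])
most_common <- names(counts)[which.max(counts)]

# Fill the blanks with the majority value, then fix the level order
embarked[embarked == ""] <- most_common
embarked <- factor(embarked, levels = c("S", "C", "Q"))

print(most_common)
print(table(embarked))
```

Computing the mode avoids a silent mistake if the same code is later reused on data where "S" is not the majority.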

## Family Relationships
This section explores the likelihood of survival if there is family aboard. It makes sense that this matters, since people may not have wanted to separate families.

In [None]:
print("Survival of people who have parents/children aboard")
table(df$Survived,df$Parch) #parent children

print("Survival of people who have siblings/spouses aboard")
table(df$Survived,df$SibSp) #siblings/spouse

It appears that people who had no family aboard have about a 1 in 3 survival chance, and people with 1 family member about 1 in 2. Once the number of family members increases further, though, the chance of dying increases again.
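
The two tables look at parents/children and siblings/spouses separately. One way to look at "number of family members" directly is to add the two counts and compute a survival rate per family size. This is a minimal sketch on made-up rows; `FamilySize` is a hypothetical column name that does not exist in the dataset, and `Survived` is kept as 0/1 here so that `mean()` gives a rate.

```r
# Hypothetical passengers (made-up rows, not taken from train.csv)
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 1, 0, 0, 0),
  SibSp    = c(0, 1, 0, 0, 1, 4, 0, 3),
  Parch    = c(0, 0, 1, 0, 1, 2, 0, 2)
)

# Total relatives aboard = siblings/spouses + parents/children
toy$FamilySize <- toy$SibSp + toy$Parch

# Mean of a 0/1 variable within each group is that group's survival rate
surv_rate <- tapply(toy$Survived, toy$FamilySize, mean)
print(round(surv_rate, 2))
```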

## Passenger Class and Fare
There should be a difference in how many people survived depending on how much they paid and what class they were in. A histogram shows that the fare distribution is heavily right-skewed (roughly exponential). We will take the log of the fare to create something that looks more like a normal distribution.

In [None]:
print("Survival rate against class")
table(df$Survived,df$Pclass) #Summary of passenger vs. class

#Show the histogram of the log-fare
hist(log(df$Fare)) #histogram, which looks more normal than the skewed Fare distribution

#Some values have Fare=0, and log(0) is -Inf, so we replace these values with
#the mean of the log-fare of the non-zero fares
df$logfare=log(df$Fare)
df$logfare[df$Fare==0] = mean(log( df$Fare[df$Fare>0])  )

#Show the survival as a function of log Fare
ggplot(df, aes(x=logfare, fill=Survived))+ #use the cleaned logfare column (no -Inf values)
  geom_histogram(binwidth = 0.5,position="fill")+
  ggtitle("Survival likelihood vs. log-fare")

As the log-fare increases, so do people's survival chances.
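
The zero-fare handling can be seen in isolation on a short made-up vector. The sketch below reproduces the replacement used above and, for comparison, shows `log1p()` (which computes log(1 + x)) as an alternative that never produces `-Inf`. The `log1p()` variant is my suggestion and is not what the analysis above uses; `fare`, `zero` and `logfare_alt` are hypothetical names.

```r
# Hypothetical fares, including two zero fares
fare <- c(7.25, 0, 71.28, 8.05, 0, 26.55)

logfare <- log(fare)   # the zero fares become -Inf at this point
zero <- fare == 0

# Approach used above: replace them with the mean log of the non-zero fares
logfare[zero] <- mean(log(fare[!zero]))

# Alternative (not used above): log(1 + fare) is finite for a zero fare
logfare_alt <- log1p(fare)

print(round(logfare, 3))
print(round(logfare_alt, 3))
```

The two approaches place zero-fare passengers differently: the first puts them in the middle of the distribution, the second at the bottom.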

## Variable Summary
The takeaway is: the more money you have, the younger you are, and the more family you have (but not too much), the more likely you are to live.

# Model Training and Tuning
Now it is time to divide the dataset into training and testing data.

In [None]:
library(caret) #provides createDataPartition() and confusionMatrix()
set.seed(3456) #set a seed for reproducible results

trainIndex <- createDataPartition(df$Survived, p = .8,list=FALSE)
df_train=df[trainIndex,]
df_test=df[-trainIndex,]
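
`createDataPartition()` samples within each class of the outcome, so the training and test sets keep roughly the same died/lived balance. The base-R sketch below approximates that idea on a made-up outcome vector; it is an illustration of stratified sampling under my own assumptions (the names `y`, `train_idx`, `train_y` and `test_y` are hypothetical), not caret's actual implementation.

```r
set.seed(3456)

# Hypothetical outcome: 60 "died" and 40 "lived"
y <- factor(rep(c("died", "lived"), times = c(60, 40)))

# Within each class, sample 80% of that class's row indices
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Both parts keep the original 60/40 class balance
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```

A plain random split would usually come close to this balance as well, but on small datasets the stratified version removes one source of noise from the accuracy estimate.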

Now it is time to train a model to the data. I will use C5.0 decision trees. First, let's try the default settings to see what the expected accuracy will be.

In [None]:
library(C50) #Import the C5.0 library

mc5=C5.0(Survived~Sex+Age+Embarked+logfare+Pclass+SibSp+Parch,
        data=df_train) #Train model

newval=predict(mc5, newdata=df_test) #Predict new values
confusionMatrix(newval, df_test$Survived) #Evaluate the performance
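
`confusionMatrix()` reports, among other things, the accuracy and Cohen's kappa. As a rough sketch of where those two numbers come from, the code below computes both by hand from a made-up 2x2 table; the counts in `cm` are hypothetical and are not the results of the model above.

```r
# Hypothetical confusion matrix: rows = predicted class, columns = true class
cm <- matrix(c(50, 5, 10, 35), nrow = 2,
             dimnames = list(Prediction = c("died", "lived"),
                             Reference  = c("died", "lived")))
n <- sum(cm)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa: how far accuracy exceeds chance, scaled by the maximum possible excess
kappa <- (accuracy - expected) / (1 - expected)

print(accuracy)
print(round(kappa, 3))
```

Kappa is the more conservative of the two, since a model that always predicts the majority class can reach a high accuracy while its kappa stays at zero.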

An accuracy of 0.84 and Kappa value of 0.65 are pretty good. Let's try tuning the model by adding a higher cost to the misclassified values:

In [None]:
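#Cost matrix: zero cost for correct predictions, a cost of 5 for either kind of error
error_cost=matrix(c(0, 5, 5, 0), nrow = 2,
                  dimnames=list(c("died","lived"), c("died","lived"))) #name rows/columns by class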
mc5=C5.0(Survived~Sex+Age+Embarked+logfare+Pclass+SibSp+Parch,
         data=df_train,
         costs = error_cost)

newval=predict(mc5, newdata=df_test) #Predict new values
confusionMatrix(newval, df_test$Survived) #Evaluate the performance

That had almost no impact on the previously misclassified values: it only shifted one of the misclassification counts from 25 to 26. If there were more data, a different weighting could be given to each class.
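
One possible reason for the small effect is that the cost matrix is symmetric: both kinds of error are penalised by the same factor, so neither is favoured over the other. The sketch below contrasts it with an asymmetric matrix. The weights are hypothetical, and the layout assumes that rows are predicted classes and columns are true classes, which is my reading of the C50 documentation and worth checking before relying on it.

```r
classes <- c("died", "lived")

# Symmetric costs, as used above: either mistake costs 5
sym_cost <- matrix(c(0, 5, 5, 0), nrow = 2,
                   dimnames = list(classes, classes))

# Asymmetric costs (hypothetical weights): assuming rows = predicted and
# columns = true, predicting "died" for a passenger who lived costs 5,
# while predicting "lived" for a passenger who died costs 1
asym_cost <- matrix(c(0, 1, 5, 0), nrow = 2,
                    dimnames = list(classes, classes))

print(sym_cost)
print(asym_cost)
```

A matrix like `asym_cost` could be passed through the same `costs` argument to push the tree away from one specific kind of error.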

I played around with the C5.0Control() function to change the training controls, but was not able to achieve an increase in accuracy.

The next step will add boosting to the model, which generally helps with accuracy. A tuned value of 5 boosting iterations showed promising results.

In [None]:
mc5=C5.0(Survived~Sex+Age+Embarked+logfare+Pclass+SibSp+Parch,
         data=df_train,
         trials=5) #Number of boosting iterations

newval=predict(mc5, newdata=df_test) #Predict new values
confusionMatrix(newval, df_test$Survived) #Evaluate the performance

There was a minor reduction in misclassified values (by 1), which increased the accuracy. There was also a shift in which classes were misclassified. I will retrain the model on the entire data, run that model on test.csv, and submit the result.

# Final Submission Code
The final submission file is created via the following code (I will not go into too much detail):

In [None]:
#Importing testing data
dft=read.csv("../input/test.csv", stringsAsFactors = FALSE)

#Mandatory Data Manipulation prior to running the model:
#Age:
dft$Age[is.na(dft$Age)]=mean(dft$Age, na.rm=TRUE) 

#Embarked:
dft$Embarked[dft$Embarked==""]="S"
dft$Embarked=factor(dft$Embarked, levels=c("S","C","Q"))

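#Missing or zero Fare (handled the same way as in the training data):
dft$logfare=log(dft$Fare)
bad_fare=is.na(dft$Fare) | dft$Fare==0 #log() gives NA or -Inf for these rows
dft$logfare[bad_fare]= mean(log( dft$Fare[!bad_fare])  )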



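#Retrain the boosted model on the entire training data, as described above
mc5=C5.0(Survived~Sex+Age+Embarked+logfare+Pclass+SibSp+Parch,
         data=df,
         trials=5)

newval=predict(mc5, newdata=dft) #Predict the test data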
dft$Survived=newval #add the predicted survival rates to dft
levels(dft$Survived)= c(0,1) #change the "survived" variable from died/lived to 0/1 as requested

write.csv(dft[c("PassengerId","Survived")], #select column names 
          file="submission.csv", #output file name
          row.names=FALSE, #do not print row names
          quote=FALSE) #do not encapsulate data by quotation marks
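
Before uploading, it can be worth checking that the file has the expected shape. The following is a minimal sketch of such a check on a made-up five-row data frame; the ids, the predictions and the names `submission`, `tmp` and `back` are all hypothetical, and the checks reflect my assumptions about what a valid file looks like (two columns, 0/1 values, unique ids).

```r
# Hypothetical predictions for five passengers
submission <- data.frame(
  PassengerId = 892:896,
  Survived = factor(c("died", "lived", "died", "died", "lived"))
)
levels(submission$Survived) <- c(0, 1)  # died -> 0, lived -> 1

# Write to a temporary file with the same options as above, then read it back
tmp <- tempfile(fileext = ".csv")
write.csv(submission, file = tmp, row.names = FALSE, quote = FALSE)
back <- read.csv(tmp)

# Basic sanity checks on the round-tripped file
ok_cols <- identical(names(back), c("PassengerId", "Survived"))
ok_vals <- all(back$Survived %in% c(0, 1))
ok_ids  <- !anyDuplicated(back$PassengerId)
print(c(columns = ok_cols, values = ok_vals, ids = ok_ids))
```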