In [None]:
knitr::opts_chunk$set(echo = TRUE)



# Titanic Survivors Prediction using R

## Introduction
As a **beginner** in ML and DL, I tried to find a very basic project that I could do.  
I came across the "Titanic - Machine Learning from Disaster" competition from Kaggle.  
Since I do not have any experience with Python, I decided to use R to predict the survivors.  
And so, Titanic Survival Prediction Using R begins.  

## Steps
1. Import the train & test data sets

2. Merge two data sets and clean them up  
(ex. dealing with missing values)

3. Split the data to begin building a model  

4. (Cross-validation or some other kind of error estimation, but I didn't do this part)

5. Build a model

6. Run the model on the test data set


## References
https://www.kaggle.com/redhorse93/r-titanic-data/report  
https://www.youtube.com/watch?v=Zx2TguRHrJ  
  
  
***
***

### First, let's load the packages that I need


In [None]:
library(dplyr) # for data pre-processing
library(VIM) # for visualizing missing values
library(randomForest) # for modeling



### Import Raw Data


In [None]:
t_train <- readr::read_csv("train.csv") 
t_test <- readr::read_csv("test.csv")
names(t_train)
names(t_test)


Since "test.csv" is the set that we need to test our model with, it does not have a "Survived" column.   
Even so, I want to combine the two sets so that I can check both for missing values and do any data pre-processing in one place.


### Combining Two Data Sets


In [None]:
t_full <- dplyr::bind_rows(t_train, t_test)



(I was originally using the "rbind()" function, but one of the references that I checked says "dplyr::bind_rows()" lets you combine data sets whose columns do not match.  
Therefore, I didn't need to create a "Survived" column in the test set and fill it with NA's.  
It does that for me automatically.)
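
To see this behavior in isolation, here is a minimal sketch with two toy data frames (the names `toy_train`, `toy_test`, and `toy_full` and their values are made up for illustration, not taken from the Kaggle files). The second data frame lacks the "Survived" column, and "bind_rows()" fills it with NA's.

```r
library(dplyr)

# Toy stand-ins for the train and test sets (hypothetical values)
toy_train <- data.frame(PassengerId = 1:2, Survived = c(0, 1))
toy_test  <- data.frame(PassengerId = 3:4)

# Columns are matched by name; columns missing from one input are filled with NA
toy_full <- bind_rows(toy_train, toy_test)
toy_full$Survived
```

Calling "rbind()" on the same two toy data frames would stop with an error, because their numbers of columns differ.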


### Changing Character Variables to Factor Variables


In [None]:
str(t_full)
summary(t_full)
t_full <- t_full %>% 
            mutate(Survived = factor(Survived),
                   Pclass = factor(Pclass, ordered = TRUE),
                   Name = factor(Name),
                   Sex = factor(Sex),
                   Cabin = factor(Cabin),
                   Embarked = factor(Embarked))
str(t_full)
summary(t_full)


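In order to run the model later, I need to change the character variables to factor variables.  
"Survived" also becomes a factor so that randomForest treats the task as classification.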


### Checking for Missing Values using VIM package


In [None]:
VIM::aggr(t_full, prop = FALSE, combined = TRUE, numbers = TRUE,
          sortVars = TRUE, sortCombs = TRUE)


(I didn't know about this package until I saw it in one of the references listed above.  
My original way of finding missing values was to use just the "summary()" and "table(is.na())" functions to see whether certain columns have missing values, then use the "which()" function to locate them and replace them with the median of each column.  
Those are actually my next steps, but I wanted some sort of visual indicator to make the missing values easy to find.)
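
The non-visual check described above can be sketched in base R like this. The data frame `toy` and its values are made up for illustration. `colSums(is.na())` is a compact alternative to calling "table(is.na())" column by column.

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  Age  = c(22, NA, 35, NA),
  Fare = c(7.25, 71.28, NA, 8.05),
  Sex  = c("male", "female", "female", "male")
)

# Number of NA's in each column
na_counts <- colSums(is.na(toy))
na_counts

# Row positions of the missing Age values
which(is.na(toy$Age))
```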


### Substituting Mode and Median Values for NA's


In [None]:
# Embarked: replace NA's with the mode ("S" is the most frequent port)
summary(t_full$Embarked)
t_full$Embarked <- replace(t_full$Embarked,
                           which(is.na(t_full$Embarked)),
                           "S")
summary(t_full$Embarked)

# Age: replace NA's with the median age
table(is.na(t_full$Age))
age.median <- median(t_full$Age, na.rm = TRUE)
t_full[is.na(t_full$Age), "Age"] <- age.median
table(is.na(t_full$Age))

# Fare: replace NA's with the median fare
table(is.na(t_full$Fare))
fare.median <- median(t_full$Fare, na.rm = TRUE)
t_full[is.na(t_full$Fare), "Fare"] <- fare.median
table(is.na(t_full$Fare))


For my model, I need to make sure that none of the variables that I'm going to use have NA's.  
I'm sure there are much better ways to replace NA's, especially for the "Age" variable.  
But I decided to substitute everything with median values (or the mode, for the "Embarked" variable).
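
As a small sketch of the same idea on toy vectors, the mode can be computed instead of typed in by hand. The helper `get_mode()` and the vectors `toy_embarked` and `toy_age` are hypothetical names introduced only for this example.

```r
# Hypothetical helper: most frequent non-missing value of a vector
get_mode <- function(x) {
  counts <- table(x)  # table() leaves NA's out by default
  names(counts)[which.max(counts)]
}

# Categorical variable: fill NA's with the mode
toy_embarked <- factor(c("S", "C", "S", NA, "Q", "S"))
toy_embarked[is.na(toy_embarked)] <- get_mode(toy_embarked)

# Numeric variable: fill NA's with the median of the observed values
toy_age <- c(22, NA, 35, 40)
toy_age[is.na(toy_age)] <- median(toy_age, na.rm = TRUE)
```

If two values are tied for most frequent, `which.max()` picks the first one, so this helper is only a rough stand-in for a proper tie-breaking rule.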


### Splitting the Full Data Set into Two Data Sets


In [None]:
t_train <- t_full[1:891, ]
t_test <- t_full[892:1309, ]
t_train$Survived
t_test$Survived


Since I replaced all the NA's that needed to be replaced, I split the full data set back into its original forms: one train set and one test set.  
Now, using the train set, I'm going to build a simple randomForest model and run the test data through it.
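
The split above relies on the row numbers 1:891 and 892:1309. A possible alternative, sketched here on a toy data frame with made-up values, is to split on whether "Survived" is missing. This assumes that "Survived" is NA for the test rows and only for them.

```r
# Toy combined data: Survived is known for training rows and NA for test rows
toy_full <- data.frame(
  PassengerId = 1:6,
  Survived    = factor(c(0, 1, 1, NA, NA, NA))
)

# Rows with a known outcome form the train set, the rest form the test set
toy_train <- toy_full[!is.na(toy_full$Survived), ]
toy_test  <- toy_full[is.na(toy_full$Survived), ]
```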


### Building a Predictive Model


In [None]:
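# Model formula: survival as a function of seven predictors
eqtn <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
formula <- as.formula(eqtn)

# mtry is about sqrt(7) for the 7 predictors; nodesize is 1% of the training rows
t_model <- randomForest(formula, t_train, ntree = 500, mtry = 2.64575, nodesize = 0.01 * nrow(t_train))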


I should do cross-validation here, but it takes me a very long time to figure things out on Kaggle, Google, and YouTube.  
Like I said, I'm a beginner, and a lot of things just fly over my head, so I decided to skip that part and move on to predicting the test data with this model.
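
For reference, the skipped step could look roughly like the following k-fold cross-validation sketch. It is not what this notebook did. It runs on simulated data rather than the Kaggle files, and the names `toy`, `fold`, and `cv_acc`, as well as the choice of 5 folds and 200 trees, are assumptions made for illustration.

```r
library(randomForest)

set.seed(42)

# Simulated stand-in for the training set (hypothetical values)
n <- 200
toy <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = runif(n, 1, 80),
  Fare = runif(n, 5, 100)
)
# Make survival depend on Sex so there is some signal to learn
p_surv <- ifelse(toy$Sex == "female", 0.8, 0.2)
toy$Survived <- factor(rbinom(n, 1, p_surv), levels = c(0, 1))

# Assign every row to one of k folds at random
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, then score on the held-out fold
cv_acc <- sapply(1:k, function(i) {
  fit  <- randomForest(Survived ~ Sex + Age + Fare,
                       data = toy[fold != i, ], ntree = 200)
  pred <- predict(fit, newdata = toy[fold == i, ])
  mean(pred == toy$Survived[fold == i])
})

# Average held-out accuracy across the k folds
mean(cv_acc)
```

Printing a fitted randomForest classification model also shows an out-of-bag (OOB) error estimate, which is another way to gauge error without setting aside a validation split.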

### Final Result


In [None]:
# Predict survival for the passengers in the test set
Survived <- predict(t_model, newdata = t_test)

# Assemble the result: one row per test passenger
PassengerId <- t_test$PassengerId
final_df <- as.data.frame(PassengerId)
final_df$Survived <- Survived

final_df
