The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

> In this notebook, you'll work through the end-to-end machine learning pipeline and score in the top 1% of all participants in the Kaggle competition.

Kaggle is a site that hosts predictive modeling competitions. In applied machine learning, a score in the top 5% of this competition is considered a resume-worthy bullet point.


In [None]:
# Import the libraries you'll need for this project
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

**XGBoost** is an open-source machine learning library that implements gradient boosting, a technique that builds an ensemble of decision trees in which each new tree corrects the errors of the trees before it. It is a supervised learning algorithm: it learns patterns from labeled examples and uses them to predict labels for new data. The two most common supervised learning tasks are classification and regression, and XGBoost builds highly accurate models for both very quickly. This makes it a top choice for real-world models on structured (tabular) datasets, and it has become the gold standard for competitive modeling.
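
To make the term "gradient boosting" a little more concrete, here is a minimal sketch in plain Python. It is not XGBoost's actual implementation: it fits one-split decision stumps in sequence, each trained on the residuals left by the ensemble so far, which is the core idea for squared-error loss. The toy data, the helper names (`fit_stump`, `boost`, `mse`), and the learning rate are all assumptions made for illustration.

```python
# Toy gradient boosting for regression with squared-error loss.
# Illustrative only: XGBoost adds regularization, second-order gradients,
# and many engineering optimizations on top of this basic idea.

def fit_stump(xs, residuals):
    """Find the single threshold split on xs that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Build an additive model: a constant plus a sum of scaled stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # Each new stump is fit to what the current ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up one-dimensional data with a step-like pattern.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

def baseline(x):
    """Always predict the mean of the targets."""
    return sum(ys) / len(ys)

boosted = boost(xs, ys)

print("MSE of constant baseline:", round(mse(baseline, xs, ys), 4))
print("MSE after boosting:      ", round(mse(boosted, xs, ys), 4))
```

The boosted model's training error should come out well below the constant baseline, because every round removes part of the remaining residual.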

> XGBoost has been part of many winning solutions for structured datasets on Kaggle and in other modeling competitions.


In [None]:
# Create a variable called data to hold the titanic dataset
data = pd.read_csv("titanic.csv")
data

**Data Dictionary**


- [Survived] Survival: 0 = No, 1 = Yes (this is the target variable)
- [Pclass] Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
- [Sex] Sex
- [Age] Age in years
- [SibSp] Number of siblings / spouses aboard the Titanic
- [Parch] Number of parents / children aboard the Titanic
- [Ticket] Ticket number
- [Fare] Passenger fare
- [Cabin] Cabin number
- [Embarked] Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton



In the next cell, let's view the percentage of passengers who died versus those who survived.

In [None]:
# Import the libraries needed for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plot survived vs. died: pie chart of proportions (left) and raw counts (right)
f, ax = plt.subplots(1, 2, figsize=(19, 8))
data['Survived'].value_counts().plot.pie(explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)
ax[0].set_title('Survived')
ax[0].set_ylabel('')
# Pass the column as x=...; recent seaborn releases no longer accept it positionally
sns.countplot(x='Survived', data=data, ax=ax[1])
ax[1].set_title('Survived')
plt.show()

Not many survived this tragedy. Out of 891 passengers in the training set, only about 340 survived.

> Only **38.4%** of the passengers in the training set survived the sinking.

Let's plot survival by sex. Spoiler alert: the men didn't do too well.

In [None]:
#draw a bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=data)

#print percentages of females vs. males that survive
print("Percentage of females who survived:", data["Survived"][data["Sex"] == 'female'].value_counts(normalize = True)[1]*100)

print("Percentage of males who survived:", data["Survived"][data["Sex"] == 'male'].value_counts(normalize = True)[1]*100)
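
The same percentages can also be read off a single `groupby`, because the mean of a 0/1 column is the fraction of ones. The sketch below uses a small made-up DataFrame (`toy`) rather than the Titanic data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical six-passenger sample, not taken from the real dataset.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column = share of survivors; multiply by 100 for a percentage.
survival_pct = toy.groupby("Sex")["Survived"].mean() * 100
print(survival_pct.round(1))
```

Applied to the notebook's `data`, the same expression would give both percentages in one step, without indexing into `value_counts`.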

In [None]:
# Keep only the attributes we want. .copy() makes an independent DataFrame,
# so later column assignments don't trigger pandas' SettingWithCopyWarning.
data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].copy()

In [None]:
# Viewing the dataset shows that the only non-numeric attribute is Sex
data.head(50)

In [None]:
# Use label encoding to convert Sex to numbers
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder() 
data['Sex']= label_encoder.fit_transform(data['Sex']) 
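
To see which number each label receives, it can help to run the encoder on a tiny list first. The sketch below uses made-up values standing in for the Sex column; `LabelEncoder` stores the labels it has seen in sorted order in `classes_`, and a label's position in that array is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values standing in for the Sex column.
sex = ["male", "female", "female", "male"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex)

# classes_ lists the original labels; a label's position is its integer code.
print(encoder.classes_.tolist())
# One integer code per input row.
print(encoded.tolist())
# inverse_transform maps the codes back to the original strings.
print(encoder.inverse_transform(encoded).tolist())
```

With two labels, "female" sorts first and becomes 0, while "male" becomes 1.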

In [None]:
# Analyze how many null values are in the dataset
data.isnull().sum()

In the code below, I've chosen to drop all rows with NaN values instead of replacing them.

> I did try mean-value imputation, but the model performed worse than it did when I simply dropped those rows.

In [None]:
# Drop every row whose Age is NaN
data = data.dropna(subset=['Age'])
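
For comparison, the two options mentioned above (dropping rows versus mean imputation) can be sketched side by side. The DataFrame `toy` and its values are made up for illustration, and this is only one simple way to impute, not necessarily the exact code tried for this notebook.

```python
import pandas as pd

# Made-up ages with two missing values (None becomes NaN).
toy = pd.DataFrame({
    "Age":      [22.0, None, 30.0, None, 38.0],
    "Survived": [0, 1, 1, 0, 1],
})

# Option 1: drop rows where Age is missing (the approach used in this notebook).
dropped = toy.dropna(subset=["Age"])

# Option 2: mean imputation, replacing each missing Age with the column mean.
age_mean = toy["Age"].mean()  # NaN values are skipped when computing the mean
imputed = toy.assign(Age=toy["Age"].fillna(age_mean))

print(len(dropped), "rows after dropping")
print(len(imputed), "rows after imputing; filled value =", age_mean)
```

Dropping shrinks the dataset, while imputing keeps every row but gives all passengers with a missing age the same value.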

In [None]:
# Defining X and Y. 
X = data.drop('Survived', axis=1)
# The target variable is Survived. 
y = data['Survived']

In the code below, I'm using train_test_split to create training and testing sets. The random_state parameter is used for reproducibility.

In [None]:
# Split the data into separate training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
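
Two details of this call are easy to miss: when no test_size is given, scikit-learn holds out 25% of the rows, and a fixed random_state makes the shuffle repeatable. The sketch below checks both on stand-in data; `X_toy` and `y_toy` are made-up names and values, not the Titanic features.

```python
from sklearn.model_selection import train_test_split

# Stand-in data: 100 rows with one feature each, and alternating 0/1 labels.
X_toy = [[i] for i in range(100)]
y_toy = [i % 2 for i in range(100)]

# With no test_size given, scikit-learn holds out 25% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(len(X_tr), len(X_te))

# Same random_state, same split: rerunning gives identical test rows.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, random_state=1)
print(X_te == X_te_again)
```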

In [None]:
# Define the model. In scikit-learn, models are called estimators;
# a classifier is an estimator that predicts class labels.
# I'm using XGBClassifier, the scikit-learn-compatible interface from the xgboost library.
model = XGBClassifier(max_depth=4,n_estimators=50)
model.fit(X_train, y_train)

In [None]:
# Predict on the test set and store the results.
y_pred = model.predict(X_test)
# predict() already returns 0/1 class labels, so rounding leaves them unchanged.
predictions = [round(value) for value in y_pred]

In [None]:
# Importing metrics and using accuracy as the metric for this project
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
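
Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on made-up labels (`y_true` and `y_hat` are illustrative names, not the notebook's variables) to show what accuracy_score is measuring.

```python
# Made-up labels and predictions for eight passengers.
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_hat = [1, 0, 1, 1, 0, 0, 1, 0]

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy_by_hand = correct / len(y_true)
print("Accuracy: %.2f%%" % (accuracy_by_hand * 100.0))
```

As a point of reference, a model that always predicted 0 (did not survive) would score about 61.6% on the full training set, since 38.4% of those passengers survived.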