# Predicting Survival on the Titanic (Classification)

<b>Dataset:</b> <i>Titanic_Clean.csv</i> <br>

<b>Description:</b><br>
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

<b>Features:</b>

- Survived: Survival (0 = No, 1 = Yes)
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex: Sex (0 = male, 1 = female)
- AgeGroup: Age group (1: "Young Adult", 2: "Student", 3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult")
- SibSp: Number of siblings / spouses aboard the Titanic
- Parch: Number of parents / children aboard the Titanic
- FareBand: Passenger fare band
- Title: Title based on name (1 = Mr, 2 = Miss, 3 = Mrs, 4 = Master, 5 = Royal, 6 = Rare)
- Embarked: Port of embarkation (1 = Southampton, 2 = Cherbourg, 3 = Queenstown)

<b>Objectives:</b>
- Load and Explore the Dataset
- Split into Training and Test Set (as per instructions)
- Build the following models using the Training Set:
    - KNN
    - Logistic Regression
    - Gaussian Naive Bayes
    - Decision Tree
- Print the Accuracy Score of each model using the Test Set

## Import Libraries

### Standard Libraries

In [2]:
#Data analysis libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

%matplotlib inline

#Visualization libraries
from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

#ignore warnings (Optional)
import warnings
warnings.filterwarnings('ignore')

### Additional Libraries

In [3]:
#Train Test Split
from sklearn.model_selection import train_test_split

#Accuracy Score Metric
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Required Algorithms
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 

from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score

## Load and Explore the Dataset

In [4]:
df = pd.read_csv("Titanic_Clean.csv")
df.describe()

| | PassengerId | Survived | Pclass | Sex | SibSp | Parch | Embarked | Title | AgeGroup | FareBand |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 0.352413 | 0.523008 | 0.381594 | 1.361392 | 1.751964 | 4.636364 | 2.497194 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.47799 | 1.102743 | 0.806057 | 0.635673 | 1.112838 | 1.35339 | 1.118156 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 223.5 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 | 1.5 |
| 50% | 446.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 5.0 | 2.0 |
| 75% | 668.5 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 6.0 | 3.0 |
| max | 891.0 | 1.0 | 3.0 | 1.0 | 8.0 | 6.0 | 3.0 | 6.0 | 7.0 | 4.0 |


### Inspect Data

In [5]:
df.head()

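| | PassengerId | Survived | Pclass | Sex | SibSp | Parch | Embarked | Title | AgeGroup | FareBand |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 1 | 4 | 1 |
| 1 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 3 | 6 | 4 |
| 2 | 3 | 1 | 3 | 1 | 0 | 0 | 1 | 2 | 5 | 2 |
| 3 | 4 | 1 | 1 | 1 | 1 | 0 | 1 | 3 | 5 | 4 |
| 4 | 5 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 5 | 2 |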


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
SibSp          891 non-null int64
Parch          891 non-null int64
Embarked       891 non-null int64
Title          891 non-null int64
AgeGroup       891 non-null int64
FareBand       891 non-null int64
dtypes: int64(10)
memory usage: 69.7 KB


In [7]:
df.shape

(891, 10)

In [8]:
df["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64

## Prepare Train and Test data

### Separate y (target) from x (predictor) columns
*Note: for the predictor columns, review the features to determine if any of the features should not be included in building the model*

In [9]:
X = df.drop(columns=["Survived"])
y = df["Survived"]
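The note above asks whether any feature should be left out of the model. One candidate is `PassengerId`, which is a row identifier rather than a property of the passenger. The sketch below shows how it could be dropped alongside the target. It uses a small made-up frame (`toy`, `X_alt` and `y_alt` are hypothetical names); the notebook's own `X` keeps `PassengerId`, so every result reported here includes it.

```python
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [0, 1, 1],
})

# Drop the target and the identifier column in one call
X_alt = toy.drop(columns=["Survived", "PassengerId"])
y_alt = toy["Survived"]

print(list(X_alt.columns))
```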

### Split into train and test partitions using the train_test_split function
Use `test_size=0.22` (22% of the rows) and `random_state=20`.

In [10]:
#Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.22, random_state=20)

In [11]:
#Check shape to make sure it is all in order
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((694, 9), (197, 9), (694L,), (197L,))
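The 694/197 split can be checked by hand. For a fractional `test_size`, `train_test_split` rounds the test count up and gives the remaining rows to the training set. A minimal sketch of that arithmetic, using the row count from `df.shape`:

```python
import math

n_samples = 891   # rows in the dataset (from df.shape)
test_size = 0.22  # fraction requested for the test set

# The test count is rounded up; the training set receives the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```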

### Scaling the Features

##### StandardScaler 

In [12]:
#Instantiate the Standard Scaler
scaler = StandardScaler()

#Fit the scaler to the training set
scaler.fit(X_train)

#Transform the training set
X_train_scaled = scaler.transform(X_train)

#Transform the test set
X_test_scaled = scaler.transform(X_test)

In [13]:
#Change to Pandas dataframe for easier viewing and manipulation of the data
X_train_sdf = pd.DataFrame(X_train_scaled, index=X_train.index, columns=X_train.columns)
X_test_sdf = pd.DataFrame(X_test_scaled, index=X_test.index, columns=X_test.columns)

In [14]:
X_train_sdf.head()

| | PassengerId | Pclass | Sex | SibSp | Parch | Embarked | Title | AgeGroup | FareBand |
|---|---|---|---|---|---|---|---|---|---|
| 479 | 0.117633 | 0.839195 | 1.328511 | -0.460332 | 0.755825 | -0.559283 | 0.211146 | -2.675759 | -0.453706 |
| 248 | -0.774685 | -1.552597 | -0.752723 | 0.427086 | 0.755825 | -0.559283 | -0.687844 | 0.986917 | 1.345562 |
| 504 | 0.214204 | -1.552597 | 1.328511 | -0.460332 | -0.469742 | -0.559283 | 0.211146 | -1.210688 | 1.345562 |
| 1 | -1.728808 | -1.552597 | 1.328511 | 0.427086 | -0.469742 | 1.024973 | 1.110136 | 0.986917 | 1.345562 |
| 885 | 1.685949 | 0.839195 | 1.328511 | -0.460332 | 5.658093 | 2.609229 | 1.110136 | 0.986917 | 0.445928 |
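For reference, the transformation behind these values is `z = (x - mean) / std`, where the mean and the population standard deviation (`ddof=0`) come from the training set only. The sketch below reproduces it with NumPy on a made-up column; `x_train` and `x_test` are hypothetical stand-ins for a single feature.

```python
import numpy as np

# Made-up values standing in for one feature column
x_train = np.array([1.0, 2.0, 3.0, 3.0, 3.0])
x_test = np.array([1.0, 3.0])

# Statistics are learned from the training data only
mean = x_train.mean()
std = x_train.std()  # population standard deviation (ddof=0)

z_train = (x_train - mean) / std
z_test = (x_test - mean) / std  # the test data reuses the training statistics

print(z_train)
print(z_test)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is shifted and scaled by the training statistics.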


##### MinMaxScaler

In [15]:
#Instantiate the MinMax Scaler
minmax = MinMaxScaler()

#Fit the scaler to the training set only, so the test set stays unseen
minmax.fit(X_train)

#Transform the training set
X_train_scaled_mm = minmax.transform(X_train)

#Transform the test set
X_test_scaled_mm = minmax.transform(X_test)

In [16]:
#Change to Pandas dataframe for easier viewing and manipulation of the data
X_train_smm = pd.DataFrame(X_train_scaled_mm, index=X_train.index, columns=X_train.columns)
X_test_smm = pd.DataFrame(X_test_scaled_mm, index=X_test.index, columns=X_test.columns)

In [17]:
X_train_smm.head()

| | PassengerId | Pclass | Sex | SibSp | Parch | Embarked | Title | AgeGroup | FareBand |
|---|---|---|---|---|---|---|---|---|---|
| 479 | 0.538808 | 1.0 | 1.0 | 0.0 | 0.166667 | 0.0 | 0.2 | 0.0 | 0.333333 |
| 248 | 0.278965 | 0.0 | 0.0 | 0.125 | 0.166667 | 0.0 | 0.0 | 0.833333 | 1.0 |
| 504 | 0.566929 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.333333 | 1.0 |
| 1 | 0.001125 | 0.0 | 1.0 | 0.125 | 0.0 | 0.5 | 0.4 | 0.833333 | 1.0 |
| 885 | 0.995501 | 1.0 | 1.0 | 0.0 | 0.833333 | 1.0 | 0.4 | 0.833333 | 0.666667 |
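The scaled values follow `(x - min) / (max - min)`. The sketch below reproduces three entries of row 248, assuming the training-set minimum and maximum of each column match the full-dataset values reported by `df.describe()`. The raw inputs (1, 1 and 6) are inferred from the scaled output, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(value, col_min, col_max):
    """Map a value from [col_min, col_max] onto [0, 1]."""
    return (value - col_min) / (col_max - col_min)

# Column ranges taken from df.describe(): SibSp 0-8, Parch 0-6, AgeGroup 1-7
sibsp_scaled = min_max_scale(1, 0, 8)
parch_scaled = min_max_scale(1, 0, 6)
agegroup_scaled = min_max_scale(6, 1, 7)

print(round(sibsp_scaled, 6), round(parch_scaled, 6), round(agegroup_scaled, 6))
```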


## Build and Validate Models

#### Build models on the following algorithms and report ACCURACY SCORE on the test dataset
1. KNN 
2. Logistic Regression
3. Gaussian Naive Bayes
4. Decision Tree Classifier

*Note: Accuracy Score should be presented as a percentage*<br>
*Note: For models that have a random_state parameter, set random_state = 20*

### 1. KNN (k-Nearest Neighbors)

In [18]:
#Set the value of K
k = 4

#Instantiate the model
knn = KNeighborsClassifier(n_neighbors=k)

#Fit the model to the training set (unscaled features)
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform')

##### Validating the Model

##### Cross Validation: KNN

In [19]:
print(cross_val_score(knn, X_train, y_train, cv=5))

[0.59285714 0.5971223  0.58273381 0.61594203 0.60869565]


In [20]:
print(np.mean(cross_val_score(knn, X_train, y_train, cv=5)))

0.5994701878248954
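One plausible reason for scores this low is that the model was fit on the unscaled features, where `PassengerId` ranges from 1 to 891 while every other column spans a single-digit range. Euclidean distance is then driven almost entirely by the identifier. The sketch below illustrates this with three hypothetical passengers described by `(PassengerId, Pclass, Sex)` only.

```python
import math

# Hypothetical passengers: (PassengerId, Pclass, Sex)
a = (100, 3, 0)
b = (101, 1, 1)  # neighboring ID, but different class and sex
c = (500, 3, 0)  # same class and sex, but a distant ID

d_ab = math.dist(a, b)
d_ac = math.dist(a, c)

# b counts as far "nearer" to a than c does, purely because of the ID
print(d_ab, d_ac)
```

Fitting on one of the scaled frames, or leaving the identifier out, would remove this effect; neither variant was run here.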


In [21]:
#Predict on the Test Set
y_pred_k = knn.predict(X_test)

In [22]:
#Get the Confusion Matrix and other metrics to test performance
print("Classification report for classifier %s:\n%s\n"
      % (knn, metrics.classification_report(y_test, y_pred_k)))

Classification report for classifier KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=4, p=2,
           weights='uniform'):
              precision    recall  f1-score   support

           0       0.62      0.83      0.71       123
           1       0.36      0.16      0.22        74

   micro avg       0.58      0.58      0.58       197
   macro avg       0.49      0.50      0.47       197
weighted avg       0.52      0.58      0.53       197




In [23]:
print("Confusion matrix:\n%s" % metrics.confusion_matrix(y_test, y_pred_k))

Confusion matrix:
[[102  21]
 [ 62  12]]
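The figures in the classification report can be recovered from this matrix. In scikit-learn's layout, rows are actual classes and columns are predicted classes, both in sorted label order. A short sketch of the arithmetic for class 1 (survived), using the four counts printed above:

```python
# Confusion matrix for KNN: rows are actual classes, columns are predicted classes
tn, fp = 102, 21  # actual 0: predicted 0, predicted 1
fn, tp = 62, 12   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)     # of those who survived, how many were found

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2))
```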


In [24]:
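#Tabulate the confusion matrix (rows: actual class, columns: predicted class)
#Labels are sorted so that they match the order used by confusion_matrix
labels_knn = sorted(y_test.unique())
cmk = metrics.confusion_matrix(y_test, y_pred_k, labels=labels_knn)
cm_dfk = pd.DataFrame(cmk,index=labels_knn, columns=labels_knn)
cm_dfk

| | 0 | 1 |
|---|---|---|
| 0 | 102 | 21 |
| 1 | 62 | 12 |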


In [66]:
ak = accuracy_score(y_test,y_pred_k) * 100

In [67]:
print("Percentage: %s %%\n" % ak)

Percentage: 57.868020304568525 %



### 2. Logistic Regression

In [26]:
#Instantiate the Algorithm 
logreg = LogisticRegression(C=1e9, class_weight="balanced", solver='liblinear', random_state=20)

#Train/Fit the model
logreg.fit(X_train_smm, y_train)

LogisticRegression(C=1000000000.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=20,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [27]:
#Check the Trained Model Coefficients
print(logreg.coef_)

[[ 0.03447937 -1.57412503  2.40782536 -4.13148783 -2.26437637  0.34781113
   1.39922283 -2.55158554  1.40118895]]


In [28]:
#Pair each feature name with its coefficient
coef = pd.DataFrame({"Feature": X_train_smm.columns, "Coef": logreg.coef_.ravel()})
coef.head(10)

| | Feature | Coef |
|---|---|---|
| 0 | PassengerId | 0.034479 |
| 1 | Pclass | -1.574125 |
| 2 | Sex | 2.407825 |
| 3 | SibSp | -4.131488 |
| 4 | Parch | -2.264376 |
| 5 | Embarked | 0.347811 |
| 6 | Title | 1.399223 |
| 7 | AgeGroup | -2.551586 |
| 8 | FareBand | 1.401189 |
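Because the model was fit on min-max scaled features, each coefficient is the change in log-odds of survival as that feature moves from its training minimum to its training maximum, with the other features held fixed. Exponentiating gives an odds ratio. The sketch below does this for two of the coefficients above; it is an interpretation aid, not part of the original analysis.

```python
import math

# Coefficients copied from the table above
coef_sex = 2.407825      # Sex: 0 = male, 1 = female
coef_pclass = -1.574125  # Pclass: 1st class scales to 0, 3rd class to 1

# exp(coefficient) is the multiplicative change in the odds of survival
odds_ratio_sex = math.exp(coef_sex)
odds_ratio_pclass = math.exp(coef_pclass)

print(round(odds_ratio_sex, 2), round(odds_ratio_pclass, 2))
```

Under this model, the odds of survival for a female passenger are roughly eleven times those of an otherwise identical male passenger, and third-class odds are roughly a fifth of first-class odds.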


##### Validating the Model

In [29]:
y_pred_lgr = logreg.predict(X_test_smm)

In [30]:
#Get the Confusion Matrix and other metrics to test performance (model precision)
print("Classification report for classifier %s:\n%s\n"
      % (logreg, classification_report(y_test, y_pred_lgr)))

Classification report for classifier LogisticRegression(C=1000000000.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=20,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False):
              precision    recall  f1-score   support

           0       0.87      0.80      0.83       123
           1       0.70      0.80      0.75        74

   micro avg       0.80      0.80      0.80       197
   macro avg       0.78      0.80      0.79       197
weighted avg       0.81      0.80      0.80       197




In [31]:
print("Confusion matrix:\n%s" % confusion_matrix(y_test, y_pred_lgr))

Confusion matrix:
[[98 25]
 [15 59]]


In [32]:
#Predict the Probabilities
pred_prob_0 = logreg.predict_proba(X_test_smm)[:,0]
pred_prob_1 = logreg.predict_proba(X_test_smm)[:,1]

In [33]:
#Put all information on a DataFrame for analysis
df_results = X_test.copy()

df_results["Predicted_Class"] = y_pred_lgr
df_results["Predicted_Prob(0)"] = pred_prob_0
df_results["Predicted_Prob(1)"] = pred_prob_1

In [34]:
df_results.head()

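| | PassengerId | Pclass | Sex | SibSp | Parch | Embarked | Title | AgeGroup | FareBand | Predicted_Class | Predicted_Prob(0) | Predicted_Prob(1) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 347 | 348 | 3 | 1 | 1 | 0 | 1 | 3 | 6 | 3 | 1 | 0.355262 | 0.644738 |
| 674 | 675 | 2 | 0 | 0 | 0 | 1 | 1 | 5 | 1 | 0 | 0.826956 | 0.173044 |
| 791 | 792 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 3 | 1 | 0.44399 | 0.55601 |
| 836 | 837 | 3 | 0 | 0 | 0 | 1 | 1 | 4 | 2 | 0 | 0.810407 | 0.189593 |
| 56 | 57 | 2 | 1 | 0 | 0 | 1 | 2 | 4 | 2 | 1 | 0.120064 | 0.879936 |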
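The predicted class is simply the class with the higher probability, which for two classes means `Predicted_Prob(1) > 0.5`. The sketch below checks this against the five rows shown above; `rows` and `derived` are hypothetical names.

```python
# (Predicted_Prob(1), Predicted_Class) for the five rows shown above
rows = [
    (0.644738, 1),
    (0.173044, 0),
    (0.556010, 1),
    (0.189593, 0),
    (0.879936, 1),
]

# Apply the 0.5 threshold to the class-1 probability
derived = [int(prob_1 > 0.5) for prob_1, _ in rows]

print(derived)
```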


In [64]:
alg= accuracy_score(y_test,y_pred_lgr) * 100

In [65]:
print("Percentage: %s %%\n" % alg)

Percentage: 79.69543147208121 %



### 3. Gaussian Naive Bayes

#### Training the Model

In [36]:
#Instantiate the Algorithm
gnb = GaussianNB() #priors=None: class priors are estimated from the training data

#Train the model
gnb.fit(X_train_scaled,y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
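As a rough illustration of the decision rule behind Gaussian Naive Bayes: each class gets a score equal to its log prior plus the sum of per-feature Gaussian log-likelihoods, and the highest score wins. The sketch below uses made-up class statistics for two features; `params`, `gaussian_log_pdf` and `predict` are hypothetical and do not come from the fitted `gnb` model.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Made-up per-class priors, feature means and feature variances
params = {
    0: {"prior": 0.6, "mean": [-0.5, 0.4], "var": [0.8, 1.0]},
    1: {"prior": 0.4, "mean": [0.8, -0.6], "var": [0.9, 1.1]},
}

def predict(sample):
    """Return the class with the highest log prior plus log likelihood."""
    scores = {}
    for label, p in params.items():
        log_likelihood = sum(
            gaussian_log_pdf(x, m, v)
            for x, m, v in zip(sample, p["mean"], p["var"])
        )
        scores[label] = math.log(p["prior"]) + log_likelihood
    return max(scores, key=scores.get)

predicted = predict([1.0, -1.0])
print(predicted)
```

Treating the features as independent within each class is what lets the per-feature log-likelihoods be summed.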

#### Validating the Model

In [37]:
#Predict on the Test Set
#predict returns a NumPy array of class labels (0 or 1)
y_pred_nb = gnb.predict(X_test_scaled)
y_pred_nb

array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
      dtype=int64)

In [38]:
print(classification_report(y_test,y_pred_nb))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87       123
           1       0.78      0.81      0.79        74

   micro avg       0.84      0.84      0.84       197
   macro avg       0.83      0.84      0.83       197
weighted avg       0.84      0.84      0.84       197



In [39]:
print(confusion_matrix(y_test, y_pred_nb))

[[106  17]
 [ 14  60]]


In [40]:
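#Tabulate the confusion matrix (rows: actual class, columns: predicted class)
#Labels are sorted so that they match the order used by confusion_matrix
labels_nb = sorted(y_test.unique())
cmb = metrics.confusion_matrix(y_test, y_pred_nb, labels=labels_nb)
cm_dfb = pd.DataFrame(cmb,index=labels_nb, columns=labels_nb)
cm_dfb

| | 0 | 1 |
|---|---|---|
| 0 | 106 | 17 |
| 1 | 14 | 60 |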


In [62]:
#Accuracy on the test set, as a percentage
anb = accuracy_score(y_test,y_pred_nb) * 100

In [63]:
print("Percentage: %s %%\n" % anb)

Percentage: 84.26395939086294 %



### 4. Decision Tree

In [42]:
#Instantiate the Algorithm
clf = tree.DecisionTreeClassifier(criterion="gini", min_samples_split=4, min_samples_leaf=5,
            max_depth=10, random_state=20)

#Train the model
clf.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=4,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best')

#####  Cross Validation: Decision Tree

In [43]:
#Note: cross-validated on the full dataset (X, y), not only the training set
print(cross_val_score(clf, X, y, cv=5))

[0.81005587 0.77094972 0.84831461 0.7752809  0.84745763]


In [44]:
print(np.mean(cross_val_score(clf, X, y, cv=5)))

0.81041174386576
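The cross-validation above runs on the full dataset, so its folds include rows that later serve as the test set. A variant that keeps the test rows out would pass only the training partition, as the KNN section does. The sketch below shows that pattern on synthetic data from `make_classification`; `X_demo`, `y_demo` and `tree_demo` are hypothetical names, and the scores it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with nine features
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=20)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.22, random_state=20
)

tree_demo = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, random_state=20)

# Cross-validate on the training partition only; the test rows stay untouched
cv_scores = cross_val_score(tree_demo, X_tr, y_tr, cv=5)
print(cv_scores.mean())
```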


#### Validating the Model

In [45]:
y_pred_dt = clf.predict(X_test)

In [46]:
#Check the performance metrics
print("{:.2f}".format(metrics.accuracy_score(y_test,y_pred_dt)))

0.78


In [47]:
print("Classification report for classifier %s:\n%s\n"
      % (clf, metrics.classification_report(y_test, y_pred_dt)))

Classification report for classifier DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=4,
            min_weight_fraction_leaf=0.0, presort=False, random_state=20,
            splitter='best'):
              precision    recall  f1-score   support

           0       0.82      0.82      0.82       123
           1       0.70      0.70      0.70        74

   micro avg       0.78      0.78      0.78       197
   macro avg       0.76      0.76      0.76       197
weighted avg       0.78      0.78      0.78       197




In [48]:
print("Confusion Matrix: \n%s" % metrics.confusion_matrix(y_test,y_pred_dt))

Confusion Matrix: 
[[101  22]
 [ 22  52]]


In [49]:
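#Encode Confusion Matrix into a DataFrame (rows: actual class, columns: predicted class)
#Labels are sorted so that they match the order used by confusion_matrix
labels_dt = sorted(y_test.unique())
cm_dt = metrics.confusion_matrix(y_test, y_pred_dt, labels=labels_dt)
cm_df_dt = pd.DataFrame(cm_dt,index=labels_dt, columns=labels_dt)
cm_df_dt

| | 0 | 1 |
|---|---|---|
| 0 | 101 | 22 |
| 1 | 22 | 52 |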


In [60]:
#Accuracy on the test set, as a percentage
a = accuracy_score(y_test, y_pred_dt) * 100
print("Percentage: %s %%\n" % a)

Percentage: 77.66497461928934 %
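To compare the four models side by side, the test-set accuracies printed in each section can be gathered into one place. The sketch below hard-codes those reported values rather than recomputing them; `accuracy_pct` and `best_model` are hypothetical names.

```python
# Test-set accuracy (%) as printed in each model's section
accuracy_pct = {
    "KNN": 57.868020304568525,
    "Logistic Regression": 79.69543147208121,
    "Gaussian Naive Bayes": 84.26395939086294,
    "Decision Tree": 77.66497461928934,
}

# Print the models from most to least accurate
for name, score in sorted(accuracy_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22}{score:6.2f} %")

best_model = max(accuracy_pct, key=accuracy_pct.get)
```

The models were not trained on the same inputs (KNN and the decision tree used unscaled features, logistic regression used min-max scaling, and Gaussian Naive Bayes used standard scaling), so the ranking reflects those choices as well as the algorithms.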

