<img src="https://github.com/djp840/MSDS_458_Public/blob/master/images/NorthwesternHeader.png?raw=1">

## MSDS422 Assignment 03:

<div class="alert alert-block alert-success">
    <b>More Technical</b>: Throughout the notebook, boxes of this type provide more technical details and extra references about what you are seeing. They contain helpful tips, but you can safely skip them the first time you run through the code.
</div>

### Data Dictionary Titanic Dataset


<br>
   
| Variable | Description                       | Details                                        |
| :---     | :----:                            | :----:                                         |
| Survival | Survival                          | 0 = No; 1 = Yes                                |
| pclass   | Passenger Class                   | 1 = 1st; 2 = 2nd; 3 = 3rd                      |
| name     | First and Last Name               |                                                |
| sex      | Sex                               |                                                |
| age      | Age                               |                                                |
| sibsp    | Number of Siblings/Spouses Aboard |                                                |
| parch    | Number of Parents/Children Aboard |                                                |
| ticket   | Ticket Number                     |                                                |
| fare     | Passenger Fare                    |                                                |
| cabin    | Cabin                             |                                                |
| embarked | Port of Embarkation               | C = Cherbourg; Q = Queenstown; S = Southampton |
    

## Import packages 



In [None]:
import numpy as np 
import pandas as pd 

import statsmodels.formula.api as sm
from xgboost import XGBClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score, roc_curve 
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import scikitplot as skplt
import seaborn as sns
from matplotlib import pyplot as plt

sns.set_style("whitegrid")
sns.set(style="whitegrid", color_codes=True)
plt.rc("font", size=14)

In [None]:
%matplotlib inline

<div class="alert alert-block alert-info">
    <b>Suppress warning messages</b></div>

In [None]:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

### Mount Google Drive to Colab Environment

In [None]:
#from google.colab import drive
#drive.mount('/content/gdrive')

### Load Data (Local Directory)

In [None]:
training_df = pd.read_csv("./data/MSDS422_03_train.csv")
test_df = pd.read_csv("./data/MSDS422_03_test.csv")

### Data Quality Review 

In [None]:
print("Shape:", training_df.shape,"\n")
print("Variable Types:") 
print(training_df.dtypes)

In [None]:
print("Shape:", test_df.shape,"\n")
print("Variable Types:") 
print(test_df.dtypes)

In [None]:
training_df.head()

## Exploratory Data Analysis (EDA) 

### Review Training and Test Dataset for Missing Values

In [None]:
training_df.isnull().sum()

In [None]:
test_df.isnull().sum()

<div class="alert alert-block alert-warning">
Pay attention to the <b>count</b> row: it shows the number of non-missing records in each column
</div> 

In [None]:
training_df.describe()

In [None]:
sns.barplot(x="Sex", y="Survived", data=training_df)
plt.title("Distribution of Survival based on Gender")
plt.show()

total_survived_females = training_df[training_df.Sex == "female"]["Survived"].sum()
total_survived_males = training_df[training_df.Sex == "male"]["Survived"].sum()

# These ratios are each group's share of all survivors, not the survival rate within the group
print("Total number of survivors: " + str((total_survived_females + total_survived_males)))
print("Share of survivors who were female:") 
print(total_survived_females/(total_survived_females + total_survived_males))
print("Share of survivors who were male:")
print(total_survived_males/(total_survived_females + total_survived_males))
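
The bar plot shows the survival rate within each group, which is a different quantity from each group's share of all survivors. A minimal sketch on a hypothetical six-row frame (the `toy` data and variable names are illustrative, not taken from the Titanic files) contrasts the two:

```python
import pandas as pd

# Hypothetical toy data: 2 women (both survive), 4 men (one survives)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Survival rate within each group: survivors in the group / size of the group
survival_rate = toy.groupby("Sex")["Survived"].mean()

# Share of all survivors that belongs to each group
share_of_survivors = toy.groupby("Sex")["Survived"].sum() / toy["Survived"].sum()

print(survival_rate)
print(share_of_survivors)
```

In this toy frame the male survival rate is 1 in 4, while men account for 1 in 3 survivors, so the two numbers should not be read interchangeably.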

In [None]:
sns.barplot(x="Pclass", y="Survived", data=training_df)
plt.ylabel("Survival Rate")
plt.title("Distribution of Survival Based on Class")
plt.show()

total_survived_one = training_df[training_df.Pclass == 1]["Survived"].sum()
total_survived_two = training_df[training_df.Pclass == 2]["Survived"].sum()
total_survived_three = training_df[training_df.Pclass == 3]["Survived"].sum()
total_survived_class = total_survived_one + total_survived_two + total_survived_three

# These ratios are each class's share of all survivors, not the survival rate within the class
print("Total number of survivors: " + str(total_survived_class))
print("Share of survivors who were Class 1 passengers:") 
print(total_survived_one/total_survived_class)
print("Share of survivors who were Class 2 passengers:")
print(total_survived_two/total_survived_class)
print("Share of survivors who were Class 3 passengers:")
print(total_survived_three/total_survived_class)

In [None]:
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=training_df)
plt.ylabel("Survival Rate")
plt.title("Survival Rates Based on Gender and Class")

### Correlation Heatmap

<div class="alert alert-block alert-info">
<b>Seaborn Heatmap Documentation</b><br>
https://seaborn.pydata.org/generated/seaborn.heatmap.html
</div>

In [None]:
f,ax = plt.subplots(figsize=(10, 10))
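# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(training_df.select_dtypes(include="number").corr(), annot=True, linewidths=0.5, fmt= '.2f',ax=ax)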

## Preprocess Data for Analysis

### Fix Missing Values, Feature Engineering

#### Encode Gender Male = 1, Female = 0

In [None]:
sex_map={"male":1,"female":0}
training_df["Sex"]=training_df["Sex"].map(sex_map)
test_df["Sex"]=test_df["Sex"].map(sex_map)

In [None]:
training_df["Sex"].head()

Passenger Class<b> (Pclass)</b> and Number of Siblings/Spouses<b> (SibSp)</b> provide insight into socio-economic status differences. Missing values of <b>Age</b> will be imputed with the median age of each (Pclass, SibSp) sub-group.
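
As a minimal sketch of group-wise median imputation, assuming a hypothetical toy frame (the `toy` data and the `Age_imputed` column are illustrative only), `groupby(...).transform("median")` returns one median per row, aligned with the original index, which `fillna` can then use row by row:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age in each (Pclass, SibSp) group
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "SibSp":  [0, 0, 0, 1, 1, 1],
    "Age":    [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# Median age of each row's own group (missing values are ignored by median)
group_median = toy.groupby(["Pclass", "SibSp"])["Age"].transform("median")

# Fill only the missing ages; known ages are left untouched
toy["Age_imputed"] = toy["Age"].fillna(group_median)
print(toy)
```

The key point is that each missing value receives the median of its own group rather than a single value shared by all rows.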

### Training Dataset

In [None]:
training_df1=training_df.groupby(["Pclass","SibSp"])
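# Restrict the median to numeric columns (Name, Ticket, Cabin, etc. have no median)
training_df1_median=training_df1.median(numeric_only=True)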
training_df1_median

### Test Dataset

In [None]:
test_df1=test_df.groupby(["Pclass","SibSp"])
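# Restrict the median to numeric columns (Name, Ticket, Cabin, etc. have no median)
test_df1_median=test_df1.median(numeric_only=True)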
test_df1_median

### Function - Impute Age

In [None]:
def impute_age(dataset,dataset_med):
    # For every row, look up the median Age of that row's (Pclass, SibSp) group
    group_median = dataset[["Pclass","SibSp"]].join(
        dataset_med["Age"], on=["Pclass","SibSp"]
    )["Age"]
    # Groups with no recorded ages (e.g. Pclass 3, SibSp 8) fall back to the
    # median of the group medians within the same passenger class
    class_median = dataset["Pclass"].map(
        dataset_med["Age"].groupby(level="Pclass").median()
    )
    # The result shares the dataset's index, so fillna() aligns it row by row
    return group_median.fillna(class_median)

In [None]:
training_df["Age"]=training_df["Age"].fillna(impute_age(training_df,training_df1_median))
test_df["Age"]=test_df["Age"].fillna(impute_age(test_df,test_df1_median))

In [None]:
print(training_df.isnull().sum())
print(test_df.isnull().sum())

### Fix Missing Values Cabin

Missing values are filled with "U" (Unknown), and each cabin is then reduced to its first letter.

In [None]:
training_df["Cabin"]=training_df["Cabin"].fillna("U")
test_df["Cabin"]=test_df["Cabin"].fillna("U")
training_df["Cabin"]=training_df["Cabin"].map(lambda x: x[0])
test_df["Cabin"]=test_df["Cabin"].map(lambda x: x[0])

In [None]:
training_df["Cabin"].head()

In [None]:
training_df.head()

In [None]:
training_df.shape

In [None]:
training_df.info()

In [None]:
test_df.info()

### Function - Cabin Assignment 
Each cabin letter becomes an indicator column: "1" if the passenger is assigned to a cabin with that letter and "0" otherwise
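
For comparison, a minimal sketch of the same idea with `pd.get_dummies`, assuming a hypothetical four-row frame (the `cabins` data is illustrative only). This is an alternative to the explicit function used in this notebook, not a replacement for it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three known cabins and one missing value
cabins = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", "B28"]})

# Same preprocessing as above: fill missing with "U", keep the first letter
cabins["Cabin"] = cabins["Cabin"].fillna("U").str[0]

# One 0/1 column per letter that actually occurs in the data
indicators = pd.get_dummies(cabins["Cabin"], prefix="Cabin", prefix_sep=" ").astype(int)
print(indicators)
```

Note the design trade-off: `get_dummies` only creates columns for letters present in the frame it is given, so training and test data can end up with different columns. Listing the letters explicitly, as the function below does, keeps both datasets on the same set of columns.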

In [None]:
def cabin_assignment(dataset):
    dataset["Cabin A"]=np.where(dataset["Cabin"]=="A",1,0)
    dataset["Cabin B"]=np.where(dataset["Cabin"]=="B",1,0)
    dataset["Cabin C"]=np.where(dataset["Cabin"]=="C",1,0)
    dataset["Cabin D"]=np.where(dataset["Cabin"]=="D",1,0)
    dataset["Cabin E"]=np.where(dataset["Cabin"]=="E",1,0)
    dataset["Cabin F"]=np.where(dataset["Cabin"]=="F",1,0)
    dataset["Cabin G"]=np.where(dataset["Cabin"]=="G",1,0)
    dataset["Cabin T"]=np.where(dataset["Cabin"]=="T",1,0) 

In [None]:
cabin_assignment(training_df)
cabin_assignment(test_df)

In [None]:
training_df.info()

In [None]:
test_df.info()

### Fix Missing Values Embarked

Embarked has 2 missing values in the training dataset; they are filled with "S" (Southampton)

In [None]:
training_df["Embarked"]=training_df["Embarked"].fillna("S")

In [None]:
training_df["Embarked"].isnull().sum()

### Function - Embarked Location 
Each indicator column is "1" if the passenger embarked from that location and "0" if the passenger did not

In [None]:
def embark_assignment(dataset):
    dataset["Embarked S"]=np.where(dataset["Embarked"]=="S",1,0)
    dataset["Embarked C"]=np.where(dataset["Embarked"]=="C",1,0) 

NOTE: Queenstown (Q) is the reference category: a passenger embarked at Q when both Embarked S and Embarked C are 0
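
Two indicator columns are enough to encode three ports because the third is implied by the other two. A minimal sketch on a hypothetical four-row frame (the `ports` data and the `recovered_q` name are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering all three ports
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Same encoding as embark_assignment: indicators for S and C only
ports["Embarked S"] = np.where(ports["Embarked"] == "S", 1, 0)
ports["Embarked C"] = np.where(ports["Embarked"] == "C", 1, 0)

# Q is the reference category: it is 1 exactly when both indicators are 0
recovered_q = 1 - ports["Embarked S"] - ports["Embarked C"]
print(recovered_q.tolist())
```

Leaving one category out avoids a redundant column that would be perfectly determined by the others.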

In [None]:
embark_assignment(training_df)
embark_assignment(test_df)

### Fix Missing Values Fare

In [None]:
test_df["Fare"]=test_df["Fare"].fillna(np.mean(test_df["Fare"]))

In [None]:
test_df["Fare"].isnull().sum()

### Feature Engineering - Family Size

<b>Family Size</b> = SibSp + Parch + 1<br>
Total number of family members on board for each passenger, counting the passenger

| Variable | Description                       |
| :---     | :----:                            |
| sibsp    | Number of Siblings/Spouses Aboard |
| parch    | Number of Parents/Children Aboard |

In [None]:
training_df["FamilySize"] = training_df["SibSp"] + training_df["Parch"] + 1
test_df["FamilySize"] = test_df["SibSp"] + test_df["Parch"] + 1

In [None]:
training_df.head()

In [None]:
training_df.info()

### Drop Columns from Preprocessing 

In [None]:
training_df.drop(["Name","Ticket","PassengerId","Embarked","Cabin","SibSp","Parch"],inplace=True,axis=1)
test_df.drop(["Name","Ticket","Embarked","Cabin","SibSp","Parch"],inplace=True,axis=1)

In [None]:
training_df.tail()

## Model Development 

The label is <b>Survived</b>. It is removed from the feature matrix and kept as a separate target.

In [None]:
training_df1=training_df.copy()
test_df1=test_df.copy()

### Create Features and Label

In [None]:
x=training_df1.drop(["Survived"],axis=1)
y=training_df1["Survived"]

In [None]:
print(x.shape, y.shape)

### Split Training Dataset 

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

In [None]:
print(x_train.shape, x_test.shape)

In [None]:
x_train.head()

<div class="alert alert-block alert-success">
    <b>K-Folds cross-validator</b>: Provides train/test indices to split data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default; this notebook sets <code>shuffle=True</code>).<br>
Each fold is then used once as the validation set while the k - 1 remaining folds form the training set.
</div>

<div class="alert alert-block alert-info">
<b>sklearn.model_selection.KFold</b><br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
</div>
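
To make the fold mechanics concrete, here is a minimal sketch on hypothetical data (10 samples; `X_demo` and `demo_folds` are illustrative names). It prints the index split for each fold and collects the validation indices to show that every sample is held out exactly once:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # hypothetical data: 10 samples, 1 feature
demo_folds = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
for fold_number, (train_idx, val_idx) in enumerate(demo_folds.split(X_demo), start=1):
    # Each fold holds out 2 samples and trains on the remaining 8
    print(f"Fold {fold_number}: train={train_idx}, validation={val_idx}")
    validation_indices.extend(val_idx.tolist())

# Every sample index appears in exactly one validation fold
print(sorted(validation_indices))
```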

In [None]:
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

### Functions - Support Model Performance Analysis 

In [None]:
def acc_score(model, input_data, output_data):
    return np.mean(cross_val_score(model,input_data,output_data,cv=k_fold,scoring="accuracy"))

In [None]:
def confusion_matrix_model(model_used, input_data, output_data):
    cm=confusion_matrix(output_data,model_used.predict(input_data))
    col=["Predicted Dead","Predicted Survived"]
    cm=pd.DataFrame(cm)
    cm.columns=["Predicted Dead","Predicted Survived"]
    cm.index=["Actual Dead","Actual Survived"]
    cm[col]=np.around(cm[col].div(cm[col].sum(axis=1),axis=0),decimals=2)
    return cm

In [None]:
def importance_of_features(model, input_data):
    features = pd.DataFrame()
    features['feature'] = input_data.columns
    features['importance'] = model.feature_importances_
    features.sort_values(by=['importance'], ascending=True, inplace=True)
    features.set_index('feature', inplace=True)
    return features.plot(kind='barh', figsize=(10,10))

In [None]:
def aucscore(model, input_data, output_data, has_proba=True):
    # Score with class-1 probabilities when available, otherwise the decision-function margin
    if has_proba:
        scores=model.predict_proba(input_data)[:,1]
    else:
        scores=model.decision_function(input_data)
    # roc_auc_score (imported from sklearn.metrics above) gives the area under the ROC curve
    return roc_auc_score(output_data,scores)

In [None]:
def plt_roc_curve(name, model, input_data, output_data, has_proba=True):
    # Score with class-1 probabilities when available, otherwise the decision-function margin
    if has_proba:
        scores=model.predict_proba(input_data)[:,1]
    else:
        scores=model.decision_function(input_data)
    # roc_curve and roc_auc_score are imported from sklearn.metrics above
    fpr,tpr,thresh=roc_curve(output_data,scores)
    auc=roc_auc_score(output_data,scores)
    plt.plot(fpr,tpr,label='ROC curve for %s (AUC = %0.2f)' % (name, auc))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim((0,1))
    plt.ylim((0,1))
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    # The function is called on both training and test data, so the title stays generic
    plt.title("ROC Curve")
    plt.legend(loc="lower right")
    plt.show()

### Create Logistic Regression Model

<div class="alert alert-block alert-success">
    <b>Logistic Regression</b>: In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’
</div>

<div class="alert alert-block alert-info">
<b>sklearn.linear_model.LogisticRegression</b><br>
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
</div>

In [None]:
log_reg=LogisticRegression()
log_reg.fit(x_train,y_train)

print("Accuracy: " + str(acc_score(log_reg, x_train, y_train)))
confusion_matrix_model(log_reg, x_train, y_train)

In [None]:
print(log_reg.coef_)

In [None]:
print(log_reg.intercept_)
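
The coefficients and intercept fully determine the predicted probabilities of a binary logistic regression: the model computes a linear score and passes it through the sigmoid function. A minimal sketch on synthetic data (the `X_demo`, `y_demo`, and `demo_clf` names are illustrative and unrelated to the Titanic features) checks this against `predict_proba`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: one coefficient per feature plus the intercept
z = X_demo @ demo_clf.coef_.ravel() + demo_clf.intercept_[0]

# The sigmoid turns the score into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_proba, demo_clf.predict_proba(X_demo)[:, 1]))
```

A positive coefficient therefore raises the predicted probability of survival as the feature increases, and a negative one lowers it.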

In [None]:
plt_roc_curve("Logistic Regression",log_reg, x_train, y_train, has_proba=True)

<div class="alert alert-block alert-info">
<b>sklearn.metrics.plot_confusion_matrix</b><br>
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html
</div>

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, log_reg.predict(x_test),normalize=True, figsize=(10,10))

In [None]:
logit_roc_auc = roc_auc_score(y_test, log_reg.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, log_reg.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

### Create Support Vector Machine

<div class="alert alert-block alert-success">
<b>Support Vector Machines</b><br>
The advantages of support vector machines are:<br>
1. Effective in high dimensional spaces<br>
2. Still effective in cases where number of dimensions is greater than the number of samples<br>
3. Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient<br>
4. Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels<br>
</div>

<div class="alert alert-block alert-info">
<b>Support Vector Machines</b><br>
https://scikit-learn.org/stable/modules/svm.html <br>
    <b>Radial Basis Function</b><br>
    https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#sphx-glr-auto-examples-svm-plot-rbf-parameters-py
</div>
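
As a small illustration of the third point, the sketch below fits an RBF-kernel `SVC` on synthetic data (the `X_demo`, `y_demo`, and `svc_demo` names are illustrative) and counts how many training points are kept as support vectors. It also calls `decision_function`, which is what the ROC helpers in this notebook rely on for the SVM, since an `SVC` created with default settings does not provide `predict_proba`:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
svc_demo = SVC(kernel="rbf").fit(X_demo, y_demo)

# Only the support vectors are needed to evaluate the decision function
n_support_vectors = svc_demo.support_vectors_.shape[0]
print(n_support_vectors, "of", len(X_demo), "training points are support vectors")

# Signed distance-like score per sample; its sign determines the predicted class
margins = svc_demo.decision_function(X_demo)
print(margins[:5])
```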

In [None]:
SVC_rbf=SVC(kernel="rbf")
SVC_rbf.fit(x_train,y_train)

In [None]:
print("Accuracy: " + str(acc_score(SVC_rbf, x_train, y_train)))
confusion_matrix_model(SVC_rbf, x_train, y_train)

In [None]:
plt_roc_curve("SVM (RBF)",SVC_rbf, x_train, y_train, has_proba=False)

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, SVC_rbf.predict(x_test),normalize=True, figsize=(10,10))

In [None]:
plt_roc_curve("SVM (RBF)",SVC_rbf, x_test, y_test, has_proba=False)

### Create K-Nearest Neighbour

<div class="alert alert-block alert-success">
    <b>Nearest Neighbors</b>: provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels
</div>

<div class="alert alert-block alert-info">
<b>Nearest Neighbors</b><br>
https://scikit-learn.org/stable/modules/neighbors.html
</div>
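
For the classification flavor, a minimal sketch on a hypothetical one-dimensional dataset (the `X_toy`, `y_toy`, and `knn_demo` names are illustrative) shows that the prediction is a majority vote among the k closest training points, and that `predict_proba` reports the fraction of neighbors in each class:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: six points on a line, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [5.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# The 3 nearest neighbors of 3.4 are 2.0, 5.0 and 1.0 (classes 0, 1, 0)
# The 3 nearest neighbors of 8.0 are 9.0, 10.0 and 5.0 (all class 1)
predictions = knn_demo.predict([[3.4], [8.0]])
vote_shares = knn_demo.predict_proba([[3.4]])

print(predictions)
print(vote_shares)
```

Because the vote is based on raw distances, features on larger scales (such as Fare) can dominate the neighbor search unless the inputs are standardized first.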

In [None]:
KNN=KNeighborsClassifier(n_neighbors=5)
KNN.fit(x_train,y_train)

print("Accuracy: " + str(acc_score(KNN, x_train, y_train)))
confusion_matrix_model(KNN, x_train, y_train)

In [None]:
print("Accuracy: " + str(acc_score(KNN, x_test, y_test)))
confusion_matrix_model(KNN, x_test, y_test)

In [None]:
plt_roc_curve("KNN (5)",KNN, x_train, y_train, has_proba=True)

In [None]:
skplt.metrics.plot_confusion_matrix(y_test, KNN.predict(x_test),normalize=True, figsize=(10,10))

In [None]:
knn_roc_auc = roc_auc_score(y_test, KNN.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, KNN.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='KNN (5) (area = %0.2f)' % knn_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
# Save under a separate name so the logistic regression plot is not overwritten
plt.savefig('KNN_ROC')
plt.show()

### Training Summary Statistics - Logistic Regression / SVM / K-Nearest Neighbours

In [None]:
Classifiers=["Logistic Regression","K-Nearest Neighbours","Support Vector Machine"]
Acc=[acc_score(x, x_train, y_train)for x in [log_reg,KNN,SVC_rbf]]
auc_scores_prob=[aucscore(x, x_train,y_train, has_proba=True) for x in [log_reg,KNN]]
auc_scores_noprob=[aucscore(x, x_train,y_train, has_proba=False) for x in [SVC_rbf]]
auc_scores=auc_scores_prob + auc_scores_noprob
cols=["Classifier","Accuracy","AUC"]
results = pd.DataFrame(columns=cols)
results["Classifier"]=Classifiers
results["Accuracy"]=Acc
results["AUC"]=auc_scores
results

### Test Summary Statistics - Logistic Regression / SVM / K-Nearest Neighbours

In [None]:
Classifiers=["Logistic Regression","K-Nearest Neighbours","Support Vector Machine"]
Acc=[acc_score(x, x_test, y_test)for x in [log_reg,KNN,SVC_rbf]]
auc_scores_prob=[aucscore(x, x_test, y_test, has_proba=True) for x in [log_reg,KNN]]
auc_scores_noprob=[aucscore(x, x_test, y_test, has_proba=False) for x in [SVC_rbf]]
auc_scores=auc_scores_prob + auc_scores_noprob
cols=["Classifier","Accuracy","AUC"]
results = pd.DataFrame(columns=cols)
results["Classifier"]=Classifiers
results["Accuracy"]=Acc
results["AUC"]=auc_scores
results

#### Tuning Parameters with GridSearchCV

<div class="alert alert-block alert-info">
<b>sklearn.model_selection.GridSearchCV</b><br>
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
</div>

#### HyperParameter Search 

<div class="alert alert-block alert-info">
<b>sklearn.linear_model.LogisticRegression</b><br>
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
</div>
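
Before the full search, here is a minimal sketch of the `GridSearchCV` workflow on synthetic data. The grid (`demo_grid`), the data, and the choice of `C` as a tuned parameter are assumptions made for illustration; they are not the grid used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Hypothetical grid: 4 values of C x 2 class weights = 8 candidates
demo_grid = {"C": [0.01, 0.1, 1.0, 10.0], "class_weight": [None, "balanced"]}

# Each candidate is scored by 5-fold cross-validated accuracy
search = GridSearchCV(LogisticRegression(max_iter=1000), demo_grid,
                      scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True, best_estimator_ is already refit on all of X_demo
best_model = search.best_estimator_
```

Since `best_estimator_` is a fitted model, it can be used for prediction directly instead of copying the winning parameters into a new estimator by hand.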

In [None]:
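# Use the Python value None (not the string "None") for unweighted classes.
# Not every solver supports every penalty; with GridSearchCV's default
# error_score, combinations that fail to fit are scored as NaN instead of
# stopping the search.
parameters = {"class_weight": [None, "balanced"], 
              "max_iter": [25,50,75,100],
              "penalty": ["l1", "l2", "elasticnet", "none"], 
              "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
             }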

In [None]:
grid_cv = GridSearchCV(log_reg, parameters, scoring = make_scorer(accuracy_score))
grid_cv = grid_cv.fit(x_train, y_train)

In [None]:
print("Our optimized Logistic Regression model is:")
grid_cv.best_estimator_

#### Build Logistic Regression Model with Parameter Tuning

In [None]:
# Rebuild the model with the tuned settings; all other arguments keep their defaults
logreg_clf_GSCV = LogisticRegression(class_weight=None, max_iter=50,
                                     penalty='l2', solver='lbfgs')
logreg_clf_GSCV.fit(x_train, y_train)

In [None]:
print("Accuracy: " + str(acc_score(logreg_clf_GSCV, x_train, y_train)))
confusion_matrix_model(logreg_clf_GSCV, x_train, y_train)