# 🛳 Titanic EDA + Elementary Machine Learning Models
## Work in Progress
This is my first proper notebook uploaded to this site. Any insight and feedback is appreciated. All visualizations will be produced with the Plotly library. I am a beginner at this and this is my first comprehensive entry to a Kaggle competition, but I hope it helps somewhat.

First, as a preprocessing measure, I decided to combine the `SibSp` and `Parch` columns into one that describes the number of family members an individual had aboard. Missing/`NaN` values were also taken care of with the `interpolate()` function provided by pandas, and the method used is `linear`.

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

from sklearn.feature_selection import VarianceThreshold

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

train = pd.read_csv("../input/titanic/train.csv")
family_column = train['SibSp'] + train['Parch']
train['Family'] = family_column
train = train[['Survived', 'Pclass', 'Sex', 'Age', 'Family', 'Embarked', 'Fare']]
train['Age'] = train['Age'].interpolate()
train['Fare'] = train['Fare'].interpolate()
train.head(5)

Unnamed: 0,Survived,Pclass,Sex,Age,Family,Embarked,Fare
0,0,3,male,22.0,1,S,7.25
1,1,1,female,38.0,1,C,71.2833
2,1,3,female,26.0,0,S,7.925
3,1,1,female,35.0,1,S,53.1
4,0,3,male,35.0,0,S,8.05


Let's look at some summary statistics for our data.

In [2]:
train.describe()

Unnamed: 0,Survived,Pclass,Age,Family,Fare
count,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.726061,0.904602,32.204208
std,0.486592,0.836071,13.902353,1.613459,49.693429
min,0.0,1.0,0.42,0.0,0.0
25%,0.0,2.0,21.0,0.0,7.9104
50%,0.0,3.0,28.5,0.0,14.4542
75%,1.0,3.0,38.0,1.0,31.0
max,1.0,3.0,80.0,10.0,512.3292


# Some Essential Info About the Survivors
This may help discover relationships and will familiarize us with the data.

In [3]:
print(str(round(np.mean(train['Survived']) * 100)) + "% of the passengers on the RMS Titanic survived.\n")
print(str(round((sum((train[train['Sex'] == 'female'])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were female.\n")
print(str(round((sum((train[train['Pclass'] == 1])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were first class.")
print(str(round((sum((train[train['Pclass'] == 2])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were second class.")
print(str(round((sum((train[train['Pclass'] == 3])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were third class.\n")
print(str(round((sum((train[train['Age'] <= 20])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were 20 or younger.")
print(str(round((sum((train[(train['Age'] > 20) & (train['Age'] < 50)])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were between 20 and 50.")
print(str(round((sum((train[train['Age'] >= 50])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors were 50 or older.\n")
print(str(round((sum((train[train['Family'] == 0])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors had no family members aboard.")
print(str(round((sum((train[train['Family'] >= 3])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors had three or more family members aboard.\n")
print(str(round((sum((train[train['Embarked'] == 'S'])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors embarked from Southampton.")
print(str(round((sum((train[train['Embarked'] == 'C'])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors embarked from Cherbourg.")
print(str(round((sum((train[train['Embarked'] == 'Q'])['Survived']) / sum(train['Survived'])) * 100)) + "% of the survivors embarked from Queenstown.")

38% of the passengers on the RMS Titanic survived.

68% of the survivors were female.

40% of the survivors were first class.
25% of the survivors were second class.
35% of the survivors were third class.

27% of the survivors were 20 or younger.
64% of the survivors were between 20 and 50.
9% of the survivors were 50 or older.

48% of the survivors had no family members aboard.
9% of the survivors had three or more family members aboard.

63% of the survivors embarked from Southampton.
27% of the survivors embarked from Cherbourg.
9% of the survivors embarked from Queenstown.


# Visualizing this Info
First, I'll create a stacked bar graph displaying the class and sex of those who survived. It is no surprise that the majority of survivors were first class, and that there is an even distribution between males and females. 

In [4]:
survivors = train[train['Survived'] == 1]
female_survivors = survivors[survivors['Sex'] == 'female']
male_survivors = survivors[survivors['Sex'] == 'male']
classes = ['First Class', 'Second Class', 'Third Class']
female_classes = female_survivors['Pclass'].value_counts(sort=False, normalize=True).to_list()
male_classes = male_survivors['Pclass'].value_counts(sort=False, normalize=True).to_list()
fig = go.Figure(data=[
    go.Bar(name='Female', x=classes, y=female_classes),
    go.Bar(name='Male', x=classes, y=male_classes)])
fig.update_layout(barmode='stack', width=400, height=400, title="Class and Sex of Survivors Ratios")
fig.show()

This one is a bit more interesting. Queenstown embarkers make up an overwhelming amount of the third class. Queenstown was one of the major transatlantic ports. Most of the first class members are from Cherbourg, located in France. The distribution of classes among Southampton embarkers is relatively uniform.

In [5]:
s_port = survivors[survivors['Embarked'] == 'S']
c_port = survivors[survivors['Embarked'] == 'C']
q_port = survivors[survivors['Embarked'] == 'Q']

s_classes = s_port['Pclass'].value_counts(sort=False, normalize=True).to_list()
c_classes = c_port['Pclass'].value_counts(sort=False, normalize=True).to_list()
q_classes = q_port['Pclass'].value_counts(sort=False, normalize=True).to_list()

fig = go.Figure(data=[
    go.Bar(name='Southampton', x=classes, y=s_classes),
    go.Bar(name='Cherbourg', x=classes, y=c_classes),
    go.Bar(name='Queenstown', x=classes, y=q_classes)])
fig.update_layout(barmode='stack', width=450, height=400, title="Class and Embarking Port of Survivors Ratios")
fig.show()

Plotly overlaid histograms don't seem to be supported on Kaggle, but the shape of the age distribution for non-survivors is lower but pretty similar to the one for survivors.

In [6]:
fig = px.histogram(train, x='Age', y='Survived', color='Survived', marginal='box', opacity=0.75, 
                   hover_data=train.columns, title='Ages of Survived and Dead Groups')
fig.update_layout(width=700, height=400)
fig.show()

There also don't seem to be any key discrepancies between the number of family members on board for survivors and non-survivors, which is mildly surprising.

In [7]:
fig = px.histogram(train, x='Survived', y='Family', color='Survived', marginal='box', opacity=0.75, 
                   hover_data=train.columns, orientation='h', title='Number of Family Members Aboard for Survived and Dead Groups')
fig.update_layout(width=700, height=400)
fig.show()

We can expect that the one guy who paid way more for this than everybody else did survived.

In [8]:
fig = px.histogram(train, x='Fare', y='Survived', color='Survived', marginal='box', opacity=0.75,
                  hover_data=train.columns, title='Fare Distribution Among Survivors and Non-Survivors')
fig.update_layout(width=700, height=400)
fig.show()

# Preparing the Data for a Machine Learning Model and Feature Selection
## Dummy Data
Now, we want to get our data in a proper format before we feed it into a model. Clearly this will be cumbersome to tackle with all the categorical features we have. One approach to address this is generating **dummy data**. The pandas `get_dummies()` function will use one-hot encoding to convert the categorical data, so new columns will be generated. We will add prefixes to make our columns more readable.

In [9]:
titanic_dummies = pd.get_dummies(train, columns=['Pclass', 'Sex', 'Embarked'], prefix=['Class', 'Sex', 'Port'])
titanic_dummies.head(5)

Unnamed: 0,Survived,Age,Family,Fare,Class_1,Class_2,Class_3,Sex_female,Sex_male,Port_C,Port_Q,Port_S
0,0,22.0,1,7.25,0,0,1,0,1,0,0,1
1,1,38.0,1,71.2833,1,0,0,1,0,1,0,0
2,1,26.0,0,7.925,0,0,1,1,0,0,0,1
3,1,35.0,1,53.1,1,0,0,1,0,0,0,1
4,0,35.0,0,8.05,0,0,1,0,1,0,0,1


## Feature Selection
We will use this feature selector to remove all low-variance features, i.e. features whose variance doesn't meet a given threshold. This is good because we don't want unwanted noise prior to our prediction. By default, VarianceThreshold removes all features that have 0 variance, or have the same values in all samples. Since we have all boolean features, we can remove all features that are either one or zero in 80% or more in all of the samples. The variance of boolean features is given by: *Var(x) = p(1 - p)*, and this is the threshold we will use for our selector. 
`get_support(indices=True)` will maintain the column names in our new generated data.

In [10]:
sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
sel.fit_transform(titanic_dummies)
fitted = titanic_dummies[titanic_dummies.columns[sel.get_support(indices=True)]]
fitted.head(5)

Unnamed: 0,Survived,Age,Family,Fare,Class_1,Class_2,Class_3,Sex_female,Sex_male,Port_S
0,0,22.0,1,7.25,0,0,1,0,1,1
1,1,38.0,1,71.2833,1,0,0,1,0,0
2,1,26.0,0,7.925,0,0,1,1,0,1
3,1,35.0,1,53.1,1,0,0,1,0,1
4,0,35.0,0,8.05,0,0,1,0,1,1


Below, we can see that our feature selector removed two columns, namely `Port_Q` and `Port_C`.

In [11]:
print('Original DF shape vs feature-selected DF shape: ' + str(titanic_dummies.shape) + ', ' + str(fitted.shape))

Original DF shape vs feature-selected DF shape: (891, 12), (891, 10)


# Model 1
## Linear Support Vector Classifier
Now we're ready to pass our data into classification models. The first one we will employ is the linear SVC, since we know that it is effective for high-dimensional data. It's also memory-efficient! We will be going with a simple binary linear classifier.

The underlying idea of the SVM is to find the hyperplane that best differentiates two classes. Support vectors are the coordinates of each observation in the *n*-dimensional space. We also want the distance between the hyperplane and the nearest data point (called the margin) to be maximized (note that the SVM selects the hyperplane that **best segregates the two classes** prior to maximizing the margin). 
### Example SVC Visual
![scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_0011.png](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_svc_0011.png)

Finally, we can split our training and testing data and fit our model.

In [12]:
SVC_classifier = SVC(kernel='linear')
features = fitted[fitted.columns[1:]]
label = fitted[fitted.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(features, label, test_size=0.2)
SVC_classifier.fit(X_train, Y_train)

SVC(kernel='linear')

Predict the label for the testing features. This `predict` function will return an array.

In [13]:
y_pred = SVC_classifier.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0])

### Evaluating the Model
We will be utilizing model evaluation metrics to quantify the performance of our linear SVC on the testing dataset we created.
#### K-Fold Cross Validation to Get Accuracy
Systemically create *k* train/test splits and average the results. This is similar to bootstrapping. 

In [14]:
def cross_val(model, X_test, Y_test, cv):
    cross_val_scores = cross_val_score(model, X_test, Y_test, cv=cv)
    print("10-Fold Cross Validation Scores: " + str(list(cross_val_scores)))
    print("Accuracy: %0.2f (+/- %0.2f)" % (cross_val_scores.mean(), cross_val_scores.std() * 2))
    
cross_val(SVC_classifier, X_test, Y_test, 10)

10-Fold Cross Validation Scores: [0.6666666666666666, 0.7222222222222222, 0.7777777777777778, 0.7777777777777778, 0.7777777777777778, 1.0, 0.7222222222222222, 0.8333333333333334, 0.8333333333333334, 0.8235294117647058]
Accuracy: 0.79 (+/- 0.17)


#### Plot Confusion Matrix
Finally, let's plot the confusion matrix for our model. 
* Every observation in our dataset is presented in exactly one box.
* It is a 2x2 matrix because there are two response classes. **

In [15]:
tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
print((tn, fp, fn, tp))

def plot_confusion_matrix(Y_true, Y_pred):
    cm = list(confusion_matrix(Y_true, y_pred))
    x = ['Pred. Not Survived', 'Pred. Survived']
    y = ['Not Survived', 'Survived']
    cm_text = [['TN', 'FP'], ['FN', 'TP']]
    fig = ff.create_annotated_heatmap(cm, x=x, y=y, annotation_text=cm_text, colorscale="aggrnyl")
    fig.update_layout(title="Confusion Matrix", width=400, height=400)
    fig.show()

plot_confusion_matrix(Y_test, y_pred)

(96, 15, 21, 47)


#### Compute Metrics from this Confusion Matrix

In [16]:
print("Model accuracy score from Conf. Matrix: " + str((tp + tn) / float(tp + tn + fp + fn)))
print("True accuracy score: " + str(accuracy_score(Y_test, y_pred)))

Model accuracy score from Conf. Matrix: 0.7988826815642458
True accuracy score: 0.7988826815642458


##### True Positive Rate

This is a straightfoward metric. Now, we'll look at something a little more interesting: **sensitivity**, which we want to maximize. When the true value is positive, how often is the prediction correct? This is also known as the "true positive rate" or "recall". It is calculated as follows:

In [17]:
sensitivity = tp / float(fn + tp) # These are all positive.

print("Recall score: " + str(sensitivity))

Recall score: 0.6911764705882353


##### True Negative Rate
Next, calculate the **specificity score**. This is the opposite of the sensitivity score. When the true value is negative, how often is the prediction correct? How "specific" is the classifier in predicting for positive true values?

In [18]:
specificity = tn / float(tn + fp) # These are all negative.

print("Specificty score: " + str(specificity))

Specificty score: 0.8648648648648649


Our classifier is moderately sensitive and highly specific. 

# Model 2
## K-Nearest-Neighbors Classifier
Like the SVC, K-Nearest-Neighbors is more commonly used in classification, and it's quite simple to understand. It makes its estimation for a the class a data point belongs to based on the classes of the data points that surround it. When all the closest data points to an observation belong to a certain class, we can say with a good confidence level that observation also belongs to that class.

However, a drawback of this model is that its calculation time increases significantly with a greater number of dimensions. Linear SVC, on the other hand, has the benefit of working well with high-dimensional data.

![KNN](https://www.analyticsvidhya.com/wp-content/uploads/2014/10/K-judgement.png)

As far as choosing *k* goes, a common practice is to either choose an odd number or *sqrt(n)*, where *n* is the total number of observations.


In [19]:
optimal_k = int(round(np.sqrt(len(X_train))))

neigh = KNeighborsClassifier(n_neighbors=optimal_k)
neigh.fit(X_train, Y_train)

y_pred_knn = neigh.predict(X_test)
y_pred_knn

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0])

### It was that easy. Now let's do our evaluations.

In [20]:
cross_val(neigh, X_test, Y_test, 10)

10-Fold Cross Validation Scores: [0.5555555555555556, 0.6111111111111112, 0.5555555555555556, 0.7222222222222222, 0.6111111111111112, 0.6111111111111112, 0.6666666666666666, 0.7222222222222222, 0.7777777777777778, 0.7058823529411765]
Accuracy: 0.65 (+/- 0.14)


In [21]:
plot_confusion_matrix(Y_test, y_pred_knn)

In [22]:
def get_confusion_metrics(Y_true, Y_pred):
    tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
    print("Model accuracy score from Conf. Matrix: " + str((tp + tn) / float(tp + tn + fp + fn)))
    print("True accuracy score: " + str(accuracy_score(Y_test, y_pred)))
    
    sensitivity = tp / float(fn + tp)
    print("Recall score: " + str(sensitivity))
    
    specificity = tn / float(tn + fp)
    print("Specificty score: " + str(specificity))
    
get_confusion_metrics(Y_test, y_pred_knn)

Model accuracy score from Conf. Matrix: 0.7988826815642458
True accuracy score: 0.7988826815642458
Recall score: 0.6911764705882353
Specificty score: 0.8648648648648649


Our KNN model is moderately sensitive and highly specific. Let's try decreasing our number of neighbors. We'll use 5.

In [23]:
neigh5 = KNeighborsClassifier(n_neighbors=5)
neigh.fit(X_train, Y_train)

y_pred_knn5 = neigh.predict(X_test)
y_pred_knn5

array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0])

In [24]:
cross_val(neigh5, X_test, Y_test, 10)

10-Fold Cross Validation Scores: [0.6666666666666666, 0.4444444444444444, 0.7777777777777778, 0.7777777777777778, 0.6666666666666666, 0.5555555555555556, 0.6111111111111112, 0.6666666666666666, 0.6666666666666666, 0.7058823529411765]
Accuracy: 0.65 (+/- 0.19)


We seem to lose accuracy when we go with a value for *k* that is not the optimal value. Since we know now that KNN doesn't work with high-dimensional data, let's try reducing the dimensions of our training dataset. We will accomplish this through PCA. Standardization will be accomplished within that step, so we don't have to worry about that.

### Dimensionality Reduction
#### Standardization
We must begin by **standardizing** the data, since standardization will project our original data onto directions which maximize the variance. 

In [25]:
features_list = list(fitted.columns[1:])
vals = fitted.loc[:, features_list].values
vals = StandardScaler().fit_transform(vals)

# Now each value in the feature dataset is standardized!
vals

array([[-0.55604973,  0.05915988, -0.50244517, ..., -0.73769513,
         0.73769513,  0.61930636],
       [ 0.59548094,  0.05915988,  0.78684529, ...,  1.35557354,
        -1.35557354, -1.61470971],
       [-0.26816707, -0.56097483, -0.48885426, ...,  1.35557354,
        -1.35557354,  0.61930636],
       ...,
       [-0.5200644 ,  1.29942929, -0.17626324, ...,  1.35557354,
        -1.35557354,  0.61930636],
       [-0.26816707, -0.56097483, -0.04438104, ..., -0.73769513,
         0.73769513, -1.61470971],
       [ 0.16365693, -0.56097483, -0.49237783, ..., -0.73769513,
         0.73769513, -1.61470971]])

In [26]:
standardized_fitted = pd.DataFrame(vals, columns=features_list)
standardized_fitted.head(5)

Unnamed: 0,Age,Family,Fare,Class_1,Class_2,Class_3,Sex_female,Sex_male,Port_S
0,-0.55605,0.05916,-0.502445,-0.565685,-0.510152,0.902587,-0.737695,0.737695,0.619306
1,0.595481,0.05916,0.786845,1.767767,-0.510152,-1.107926,1.355574,-1.355574,-1.61471
2,-0.268167,-0.560975,-0.488854,-0.565685,-0.510152,0.902587,1.355574,-1.355574,0.619306
3,0.379569,0.05916,0.42073,1.767767,-0.510152,-1.107926,1.355574,-1.355574,0.619306
4,0.379569,-0.560975,-0.486337,-0.565685,-0.510152,0.902587,-0.737695,0.737695,0.619306


Now, we have our standardized features in a tabular format. We now need to project our 9-dimensional data to *n*-dimensional **principal components**. Before we do that, however, we need to pick *n*, which should be the value at which the inflection point occurs between the principal components accounting for a large amount of the variance and 100% of the variance. We will use `explained_variance_ratio_` to do this, and we'll plot it.

In [27]:
pca = PCA().fit(standardized_fitted)

xi = np.arange(1, 10, step=1)
y = np.cumsum(pca.explained_variance_ratio_)

fig = px.line(x=xi, y=y,)
fig.add_trace(
    go.Scatter(
        mode='markers',
        x=[6],
        y=[0.9565056],
        marker=dict(
            color='red',
            size=10,
            opacity=0.5
        ),
        showlegend=False
    )
)
fig.update_layout(width=600, height=400, xaxis_title='# Components', yaxis_title='PC Variance for Whole Dataset', title='PCA Explained Variance Ratio for Fitted Data')
fig.show()

There we have it. Our ideal number of components for PCA will be 6.

In [28]:
pca_titanic = PCA(n_components=6)
principal_components_titanic = pca_titanic.fit_transform(standardized_fitted)

pc_columns = ['PC' + str(i + 1) for i in range(6)]

principal_components_df = pd.DataFrame(data=principal_components_titanic,
                                      columns=pc_columns)
principal_components_df.insert(0, 'Survived', label)
principal_components_df

Unnamed: 0,Survived,PC1,PC2,PC3,PC4,PC5,PC6
0,0,-1.684986,-0.176241,0.388269,0.502101,-0.251362,-0.309237
1,1,3.082300,0.173417,0.957498,-0.901569,0.409824,0.157361
2,1,0.128093,-1.854545,0.091560,-0.990977,-0.953629,-0.804506
3,1,2.540324,-0.006222,0.047839,-0.096122,-1.139241,-0.822386
4,0,-1.626427,0.374577,0.302398,-0.268082,-0.708341,-0.092730
...,...,...,...,...,...,...,...
886,0,-0.680532,0.893417,-2.350930,0.246004,0.697930,-0.117974
887,1,2.112929,-0.343750,-0.072645,-0.232006,-0.528676,-1.886115
888,0,0.448402,-2.462327,0.341742,0.459740,-0.911467,0.120457
889,1,0.692788,1.858278,0.979534,-0.309567,1.422249,-0.406185


Now, we have a 6-dimensional DataFrame, and our KNN classifier should perform better. For the sake of curiosity, I'm going to get the explained variance per principal component.

In [29]:
print('Explained variation per principal component: {}'.format(pca_titanic.explained_variance_ratio_))

Explained variation per principal component: [0.27732709 0.21337444 0.17131804 0.12682745 0.09461536 0.07304328]


Since this number of dimensions is still pretty high for KNN, I'm going to predict that we won't see a marked increase in the accuracy, but let's test it out! We'll have to go through the process of train-test splitting again.

In [30]:
features_reduced_dim = principal_components_df[principal_components_df.columns[1:]]
label_reduced_dim = principal_components_df[principal_components_df.columns[0]]

X_train, X_test, Y_train, Y_test = train_test_split(features_reduced_dim, label_reduced_dim, test_size=0.2)

neigh = KNeighborsClassifier(n_neighbors=optimal_k)
neigh.fit(X_train, Y_train)

y_pred_knn = neigh.predict(X_test)
y_pred_knn

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1])

### Evaluate Again

In [31]:
cross_val(neigh, X_test, Y_test, 10)

10-Fold Cross Validation Scores: [0.8333333333333334, 0.9444444444444444, 0.7777777777777778, 0.8888888888888888, 0.7777777777777778, 0.6666666666666666, 0.7222222222222222, 0.7777777777777778, 0.8888888888888888, 0.8235294117647058]
Accuracy: 0.81 (+/- 0.16)


In [32]:
get_confusion_metrics(Y_test, y_pred_knn)

Model accuracy score from Conf. Matrix: 0.6145251396648045
True accuracy score: 0.6145251396648045
Recall score: 0.4426229508196721
Specificty score: 0.7033898305084746


That's a little bit of an improvement, even though it didn't do as well as our linear SVC. It wasn't a **necessity** since we already applied feature selection before, but since our accuracy, and more notably, our recall score, increased, we shall proceed with our 6-dimensional data. We've already cut the number of dimensions in half since we started.