## Machine Learning - Titanic

The Titanic dataset is a popular toy dataset.

In this notebook, we will explore the basic concepts of machine learning using the Python library [scikit-learn](https://scikit-learn.org/stable/index.html) and the Titanic dataset. You will find many tutorials online that use this dataset and library to explore machine learning concepts.



In [None]:
import pandas as pd
import numpy as np 
from plotnine import *

import warnings
warnings.filterwarnings('ignore')


### Load the data

Read in the `titanic.csv` data set again.

In [None]:
# Load titanic.csv
df = pd.read_csv('titanic.csv')
df

The first thing we need to do is code the `pclass` and `gender` variables numerically. Let's use the following scheme:
- pclass: 1st=1, 2nd=2, 3rd=3
- gender: 0=male, 1=female, and let's name the new column "female" to remind us which is which

In [None]:
# recode the pclass and gender variables so they are numeric
df['pclass'] = df.pclass.replace({'1st': 1, '2nd': 2, '3rd': 3})
df['female'] = df.gender.replace({'male': 0, 'female': 1})
df.head(3)
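
A small, made-up frame makes the recoding easy to check by eye. This is only a sketch: the `toy` frame and its values are assumptions for illustration, not rows from `titanic.csv`, and it uses `Series.map` as an alternative to `replace`.

```python
import pandas as pd

# Toy frame using the same labels as the recoding cell (values are made up)
toy = pd.DataFrame({
    "pclass": ["1st", "3rd", "2nd"],
    "gender": ["female", "male", "female"],
})

# Series.map translates each label through the dictionary
toy["pclass"] = toy["pclass"].map({"1st": 1, "2nd": 2, "3rd": 3})
toy["female"] = toy["gender"].map({"male": 0, "female": 1})

# Both recoded columns should now have an integer dtype
print(toy.dtypes)
```

`map` turns any label missing from the dictionary into `NaN`, so a quick `toy.isna().sum()` would reveal unexpected categories.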

## 1. Logistic Regression with Scikit-Learn

Let's look at the documentation and use various functions from there!
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

If the sklearn documentation seems overwhelming, investigate.ai is also helpful:
- https://investigate.ai/classification/intro-to-classification/#Logistic-Classifier


In [None]:
# Import the classifier from scikit-learn
from sklearn.linear_model import LogisticRegression

# Create a new classifier (in this case it is just a logistic regression)
# C is the inverse regularization strength, so C=1e9 effectively turns regularization off
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)

In [None]:
# Fit the data to the model

X = df[['pclass', 'female']]
y = df['survived']

clf.fit(X, y)

In [None]:
# predictions of who survived
clf.predict(X)

In [None]:
# predictions as probabilities
clf.predict_proba(X)

In [None]:
# probabilities of what? (survival)
# this helps interpret the results above
clf.classes_
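
To see how `predict`, `predict_proba` and `classes_` fit together, here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are assumptions, not the Titanic data). The columns of `predict_proba` follow the order of `classes_`, and the predicted label is the class with the highest probability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one binary feature and a binary label
X_toy = np.array([[0], [0], [1], [1], [1], [0], [1], [0]])
y_toy = np.array([0, 0, 1, 1, 0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)

# One column per class, in the order given by model.classes_
print(model.classes_)
print(proba.round(2))

# Picking the most probable class for each row reproduces predict()
manual = model.classes_[proba.argmax(axis=1)]
print((manual == model.predict(X_toy)).all())
```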

In [None]:
# coefficients (logs of odds ratios)
clf.coef_

In [None]:
# coefficients of what?
# ...coefficients of features
clf.feature_names_in_

In [None]:
# accuracy...
# how is this calculated?
clf.score(X,y)
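
For a classifier, `score` returns the mean accuracy: the fraction of rows where the predicted label equals the true label. A minimal sketch on made-up data (`X_toy` and `y_toy` are assumptions) checks this by hand.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one binary feature and a binary label
X_toy = np.array([[0], [0], [1], [1], [1], [0], [1], [0]])
y_toy = np.array([0, 0, 1, 1, 0, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# Accuracy by hand: share of predictions that match the true labels
manual_accuracy = np.mean(model.predict(X_toy) == y_toy)

print(manual_accuracy, model.score(X_toy, y_toy))
```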

## 2. Metrics for what makes a good model

In [None]:
from sklearn.metrics import confusion_matrix, recall_score, precision_score, \
                            accuracy_score, f1_score

In [None]:
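# confusion matrix: rows are true labels, columns are predicted labels
# (the scores below can all be calculated from it)
confusion_matrix(y, clf.predict(X))

In [None]:
# scikit-learn metrics expect the true labels first, then the predictions
accuracy_score(y, clf.predict(X))

In [None]:
precision_score(y, clf.predict(X))

In [None]:
recall_score(y, clf.predict(X))

In [None]:
f1_score(y, clf.predict(X))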
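
Each of these scores can be derived from the four cells of the confusion matrix. The sketch below works through the arithmetic on made-up labels (`y_true` and `y_pred` are assumptions, not model output) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = survived, 0 = did not survive
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted survivors who survived
recall = tp / (tp + fn)                      # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```

Precision and recall are not symmetric in their arguments, which is why the true labels need to come first.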

## 3. Test-Train Split

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X = df[['pclass', 'female']]
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# define model
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000) 

# fit model on training data
clf.fit(X_train, y_train)

# scores on the held-out test data
# (built-in round() works whether the metric returns a Python or a NumPy float)
y_pred = clf.predict(X_test)
print("accuracy_score", round(accuracy_score(y_test, y_pred), 2))
print("precision_score", round(precision_score(y_test, y_pred), 2))
print("recall_score", round(recall_score(y_test, y_pred), 2))
print("f1_score", round(f1_score(y_test, y_pred), 2))
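
Because the split is random, the scores change a little on every run. If repeatable numbers are wanted, one option is to pass `random_state`, and `stratify` keeps the class balance the same in both parts. The sketch below uses made-up arrays (`X_toy`, `y_toy`) and an arbitrary seed of 42, both assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, perfectly balanced labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify keeps the 50/50 label balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```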


## 4. Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score(clf, X, y, cv=10)

# Cross validation on accuracy scores
scores

In [None]:
print(f"{scores.mean().round(2)} accuracy with a standard deviation of {scores.std().round(2)}")
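
Under the hood, cross validation repeats the fit-and-score step once per fold. The sketch below writes that loop out by hand on made-up data (`X_toy` and `y_toy` are assumptions) and compares it with `cross_val_score`. For a classifier and an integer `cv`, scikit-learn uses stratified folds, so `StratifiedKFold` is used here as well.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up data: (pclass, female) pairs with a noisy survival rule
rows = [(i % 3 + 1, i % 2) for i in range(60)]
X_toy = np.array(rows)
y_toy = np.array([1 if (female == 1 and pclass < 3) or i % 10 == 0 else 0
                  for i, (pclass, female) in enumerate(rows)])

model = LogisticRegression()

# Manual loop: fit a fresh copy on each training part, score it on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

print(np.round(manual_scores, 2))
print(cross_val_score(model, X_toy, y_toy, cv=5).round(2))
```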

#### Cross Validation with other scores

In [None]:
from sklearn.model_selection import cross_validate

In [None]:
# 5-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(clf, X, y, scoring=scoring)
pd.DataFrame(scores).round(2)

## Comparing models

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X,y)
scores = cross_val_score(clf, X, y, cv=10)
print(f"{scores.mean().round(2)} accuracy with a standard deviation of {scores.std().round(2)}")


In [None]:
# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,y)
scores = cross_val_score(clf, X, y, cv=10)
print(f"{scores.mean().round(2)} accuracy with a standard deviation of {scores.std().round(2)}")


In [None]:
# Multi-layer perceptron (a type of Neural Network ¯\_(ツ)_/¯)
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier()
clf.fit(X,y)
scores = cross_val_score(clf, X, y, cv=10)
print(f"{scores.mean().round(2)} accuracy with a standard deviation of {scores.std().round(2)}")
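
The three cells above repeat the same steps, so the comparison can also be written as a loop over a dictionary of models. This is a sketch on made-up data (`X_toy`, `y_toy` and the `models` dictionary are assumptions); it includes only two of the models to stay quick, and `MLPClassifier` could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Made-up data: (pclass, female) pairs with a noisy survival rule
rows = [(i % 3 + 1, i % 2) for i in range(60)]
X_toy = np.array(rows)
y_toy = np.array([1 if (female == 1 and pclass < 3) or i % 10 == 0 else 0
                  for i, (pclass, female) in enumerate(rows)])

# Hypothetical dictionary of models to compare
models = {
    "logistic regression": LogisticRegression(C=1e9, solver="lbfgs", max_iter=4000),
    "multinomial naive Bayes": MultinomialNB(),
}

# Cross-validate every model on the same data and keep the mean accuracy
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.2f} accuracy with a standard deviation of {scores.std():.2f}")
```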