# MICROCREDENTIAL PROGRAM MACHINE LEARNING ASSESSMENT

The Titanic departed from England in April 1912, bound for New York City, but it sank before it arrived. In this assessment you will implement a machine learning analysis of data about the passengers on board to predict their survival.

The data is from Kaggle:
https://www.kaggle.com/c/titanic

Survival is a **binary outcome**, meaning there are two possible discrete outcomes (exactly 0 or exactly 1), unlike, for example, a house price, which can be any positive value. There are many different approaches to predicting binary outcomes.

Most of the analysis is filled out, but throughout the notebook there are questions for you to answer and some code for you to fill in. To answer the written questions, just click on the markdown cells and type your answers there.

Towards the end you will decide which machine learning algorithm to implement in order to predict survival of the Titanic passengers.

This assessment mirrors the house regression code from the ML3 session, so looking through that notebook may help you here:\
https://github.com/akshayghosh-acenet/IntroMachineLearning3

And of course, if you have any questions, whether conceptual or about coding, feel free to send me a message (:

In [None]:
# IMPORTS

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# you can import more libraries here or throughout the code

In [None]:
# LOAD DATA

fn_train = 'train_titanic.csv'
fn_test = 'test_titanic.csv'

train_data = pd.read_csv(fn_train) # this has known outcomes, used for training and tuning
test_data = pd.read_csv(fn_test) # this does not have known outcomes

## EDA

With the data loaded, we now want to do the very important first step of any ML analysis: exploratory data analysis or EDA.

There are many different things that can be done in EDA. Here we will look at the basic statistics (e.g. mean, median, etc.) along with histograms of some of the columns.

In [None]:
# a good first step is to do dataframe.head() to see generally what the data looks like

train_data.head(15)

In [None]:
# here we use dataframe.describe() to get statistics on each of the columns

train_data.describe()

### Variables explained:

**Below are explanations of the variables/parameters in the dataset. Think about which ones you think should be included in the analysis, you will need to decide this later.**

| Variable | Definition | Key |
|---|---|---|
| Survived | Did they survive | 0 = no, 1 = yes |
| Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| Sex | Sex | - |
| Age | Age in years | - |
| SibSp | # of siblings/spouses on board | - |
| Parch | # of parents/children on board | - |
| Ticket | Ticket number | - |
| Fare | Passenger fare | - |
| Cabin | Cabin number | - |
| Embarked | Port left from | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
# EDA

# here we use a for loop to plot some histograms

for parameter in ['Pclass','Sex','Age','Fare','SibSp','Parch']:
    plt.figure()
    plt.hist(train_data[parameter],bins = 20)
    plt.xlabel(parameter)
    plt.show()

## QUESTION 1: _How many NaNs?_

An important part of analysis is determining what percentage of your dataset are NaN values (not a number, i.e. a missing value). In the cell below do the following:

**i. Print the name of each column along with its number of NaN values.**

**ii. Out of the total number of datapoints ($n_{rows} \times n_{cols}$), what percentage of these are NaNs? Calculate the total number of NaNs and divide by the total number of datapoints and multiply this result by 100 to get it in a percentage.**
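
As a rough illustration of the counting involved, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its columns are hypothetical and are not part of the Titanic data). It uses `DataFrame.isna()` to flag missing values and `DataFrame.size` for the total number of datapoints; this is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# hypothetical toy data for illustration only (NOT the Titanic dataset)
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', None, 'z', None],
    'c': [10, 20, 30, 40],
})

# isna() marks each missing entry as True; sum() then counts them per column
nan_per_column = toy.isna().sum()
for column, n_missing in nan_per_column.items():
    print(column, n_missing)

# toy.size is n_rows * n_cols, i.e. the total number of datapoints
total_nans = int(nan_per_column.sum())
percent_nan = 100 * total_nans / toy.size
print(f"{percent_nan:.1f}% of the datapoints are NaN")
```

The same pattern should carry over to any DataFrame, since neither call depends on the column names.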

In [None]:
# assess NaNs

# print each column name along with its number of NaN values here:



# calculate percentage of NaNs in the dataset here:

## QUESTION 2: _Feature selection_

Here you are to select the features (i.e. columns or variables) that you would like to include in the analysis. You will also split the dataset into training and validation data (in this case the "testing" dataset is the data with unknown outcomes, just like in the housing regression from the ML3 session).

**i. Select the features by creating a list of strings of the feature names that you would like to include. In the cell below the code, write a sentence about why you decided to include or exclude each feature (there are some that do not need to be included, but this is up to your discretion!).**

**ii. Split the feature array and target array ($X$ and $y$ respectively) into training and validation subsets.**
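
For reference, a minimal sketch of how `train_test_split` behaves on made-up arrays. The names `X_toy` and `y_toy` are hypothetical, and the values chosen for `test_size`, `random_state`, and `stratify` are assumptions for illustration, not required settings for this assessment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy arrays: 10 samples with 2 features each, and a binary target
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# hold out 20% of the rows for validation;
# random_state makes the shuffle reproducible,
# stratify keeps the class proportions similar in both subsets
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_va.shape)
```

The function returns the pieces in the order training features, validation features, training targets, validation targets, which matches the order of the variables on the left-hand side.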

In [None]:
# split for training and validation
from sklearn.model_selection import train_test_split

features = # PUT LIST OF FEATURES HERE

X = train_data[features] # set the features to be the columns in the list
y = train_data['Survived'] # set the target to be Survived

# split the training data into training and validation subsets
X_train, X_val, y_train, y_val = # SPLIT DATA HERE

# this is just calling the testing features X_test for symmetry with the training data. 
# there is no y_test, that is for us to calculate
X_test = test_data[features]

### Question 2 continued:

Feature justification:

**iii. After the dash, write a sentence or two on why you decided to include/exclude that variable for predicting the survival outcome. Think about whether this information would have an effect on surviving the Titanic sinking.**

PassengerId -

Pclass -

Name -

Sex -

Age -

SibSp -

Parch -

Ticket -

Fare -

Cabin -

Embarked -

## QUESTION 3: _Data preprocessing_

Here the data preprocessing is set up just like how we did in the ML3 session.

It is mostly filled out, but here is what you need to do:

**i: Research and select a method to impute the numerical data. Here is a starting point:**
https://scikit-learn.org/stable/modules/impute.html

**ii: Research and select a method to scale the numerical data.**
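
To show the mechanics of chaining an imputer and a scaler, here is a small sketch on a made-up single-column array. The choice of `SimpleImputer(strategy='median')` and `StandardScaler()` is only an assumption for illustration; it is one of several valid combinations and is not meant as the expected answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical toy column with one missing value
toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

# illustrative choices only: median imputation followed by standardization
pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fills the NaN with the column median
    ('scaler', StandardScaler()),                   # rescales to mean 0, standard deviation 1
])

out = pipe.fit_transform(toy)
print(out.ravel())
```

After the transform the column has no missing values, and its mean and standard deviation are 0 and 1 respectively.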

In [None]:
# PREPROCESSING STEPS FOR NUMERICAL AND CATEGORICAL FEATURES

from sklearn.pipeline import Pipeline
from sklearn.impute import # import imputing functions here (SimpleImputer is used below for the categorical data)
from sklearn.preprocessing import # import scaling and encoding functions here (OneHotEncoder is used below)
from sklearn.compose import ColumnTransformer

# this gets the numeric features as a list, these will either be an integer or a float
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns

# similarly this gets the categorical features as a list. these are strings and not numbers, for example could be 'yes' or 'no'
categorical_features = X.select_dtypes(include=['object']).columns

# this defines how we want to transform the numerical features

# !!!PUT YOUR IMPUTER AND SCALERS FOR THE NUMERICAL DATA HERE!!!:
NUMERICAL_IMPUTER = # find a sklearn imputer and put it here like NUMERICAL_IMPUTER = ImputerMethod(Parameters)
NUMERICAL_SCALER = # find a sklearn scaler and put it here like NUMERICAL_SCALER = ScalerMethod(Parameters)

numeric_transformer = Pipeline(steps=[
    ('imputer', NUMERICAL_IMPUTER), # here we replace every missing or NaN value using the imputer you chose (e.g. with the mean of the feature)
    ('scaler', NUMERICAL_SCALER) # here we rescale every feature using the scaler you chose (e.g. so that its mean is 0 and standard deviation is 1)
])


# this defines how we want to transform the categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # here we replace every missing or NaN value with the most commonly occurring label
    ('onehot', OneHotEncoder(handle_unknown='ignore')) # this converts the categorical variables into one-hot (0/1) columns
])


# this now applies the column transforms we just defined
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features), # name this step 'num', and tell sklearn to apply the numeric transformer we defined to the numeric features we collected in the list
        ('cat', categorical_transformer, categorical_features) # same with the categorical variables
    ])

### Question 3 continued

Answer the following questions in a couple of sentences:

**iii. What is imputation and why is it useful/important in data preprocessing?**

**iv. Explain what one-hot encoding is and when you would implement it in data preprocessing.**

**v. Explain why scaling the numerical data is important in data preprocessing.**

## QUESTION 4: _Model selection_

Now for the fun part! Here you are to determine which sklearn ML model to implement for the analysis.

Some example libraries to find useful models are `sklearn.ensemble` or `sklearn.linear_model`.

This link could be a useful start for determining a model: https://scikit-learn.org/stable/tutorial/machine_learning_map/

Note that you do not have to be limited to `sklearn` for your model! There are other Python ML libraries if you want to venture out and choose another model/algorithm.

**i. Choose and implement a machine learning model to predict survival (remember it is a _binary outcome_) for the passengers on board the Titanic.**

In [None]:
# build model

# IMPORT MODEL HERE

model = # DEFINE MODEL HERE
model_name = '' # put a name for your model to input to the Pipeline function


### Question 4 continued:

**ii. Write a short paragraph justifying why you chose the model that you did.**

In [None]:
# define full model

full_model = Pipeline(steps=[('preprocessor', preprocessor),
                              (model_name, model)])

In [None]:
# fit model to data

full_model.fit(X_train, y_train)
None # I just put this here to suppress the output of full_model.fit(X_train, y_train). if you want you can delete this line and see what happens

In [None]:
# make predictions

y_val_pred = full_model.predict(X_val)

In the cell below, the accuracy metrics are calculated by comparing what is called the "ground truth" (the known values for survival in `y_val`) to the predicted values that were obtained by applying the model to `X_val` (we call this `y_val_pred`).

Feel free to go back and select a different model, or adjust the parameters, and see how the accuracy metrics and confusion matrix change.
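
To make the relationship between the confusion matrix and the three scores concrete, here is a short sketch on made-up labels (the `y_true_toy` and `y_pred_toy` arrays are hypothetical). It reads the four counts out of the matrix and checks them against the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# hypothetical ground truth and predictions for 8 samples
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 1])

# for binary labels sklearn lays the matrix out as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

acc = (tp + tn) / cm.sum()   # fraction of all predictions that are correct
prec = tp / (tp + fp)        # of the predicted positives, how many are truly positive
rec = tp / (tp + fn)         # of the true positives, how many were found

print(cm)
print(acc, prec, rec)

# the hand-computed values agree with the library functions
assert np.isclose(acc, accuracy_score(y_true_toy, y_pred_toy))
assert np.isclose(prec, precision_score(y_true_toy, y_pred_toy))
assert np.isclose(rec, recall_score(y_true_toy, y_pred_toy))
```

Here the rows of the matrix correspond to the true labels and the columns to the predicted labels.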

In [None]:
# accuracy metrics

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.metrics import ConfusionMatrixDisplay


y_true = y_val
y_pred = y_val_pred

# Calculate metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Calculate Confusion Matrix
conf_matrix = confusion_matrix(y_true, y_pred) 

# Print the results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Confusion matrix:")

# this plots the confusion matrix
cmd = ConfusionMatrixDisplay(conf_matrix)
fig, ax = plt.subplots(dpi = 300)
cmd.plot(ax=ax)

In [None]:
# here we use the model to make predictions on the test data!

y_unknown_predict = full_model.predict(X_test)

### QUESTION 5:

**i. Write a sentence or two explaining each of the following:**

Accuracy - 

Precision - 

Recall - 

Confusion matrix -

## BONUS QUESTION 6 (OPTIONAL): _Hyperparameter tuning_

**i. In the cells below, tune the parameters of the model you selected and recalculate the accuracy metrics. Comment on the method for hyperparameter tuning (e.g. grid search, random search, etc.) that you chose.**
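
As a starting point, here is a minimal sketch of a grid search over a pipeline, run on synthetic data from `make_classification`. The step name `'clf'`, the use of `LogisticRegression`, and the grid of `C` values are all placeholder assumptions for illustration; they are not a recommendation for which model or parameters to use here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic binary classification data (NOT the Titanic dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# placeholder pipeline; 'clf' is a hypothetical step name
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# parameters of a pipeline step are addressed as '<step name>__<parameter name>'
param_grid = {'clf__C': [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation over every value in the grid, scored by accuracy
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```

With the notebook's own pipeline, the prefix in the parameter names would be whatever string was stored in `model_name`, since that is the step name given to `Pipeline`.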

In [None]:
# hyperparameter tuning goes here and below