# Whale and Dolphin Classification Project

Authors:
- Victor Möslein
- Maren Rieker
- Reed Garvin
- Dinah Rabe

This notebook is one of three core notebooks of the Whale and Dolphin Classification Project for the "Machine Learning" class at the Hertie School of Governance. It applies classic machine learning models to the classification task. A second notebook covers data preprocessing, and a third applies a deep learning model.

The code in this notebook partly follows the chapter on classification from the book "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron.

In [None]:
## Setup: System settings and packages

In [1]:
# Python ≥3.5 is required

import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

import random
import numpy as np
from numpy import load
import pandas as pd
from numpy import savez_compressed
import os
import timeit
import seaborn as sns
import pickle
import PIL


# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# to make this notebook's output stable
np.random.seed(42)

In [None]:
full_data_switch_on = False # set to True to use the full data set

## Define paths to data and for output

In [3]:
# path to clean data folder
ROOT_PATH_DATA = "input/04_cleaned/"

# where to save figures
ROOT_PATH_FIG = "output/ml_models/01_figures"
os.makedirs(ROOT_PATH_FIG, exist_ok=True)

# where to save output

ROOT_OUTPUT = "output/ml_models/"
OUTPUT_PATH_TRAIN_EVAL = os.path.join(ROOT_OUTPUT, "02_training_set_evaluation")
OUTPUT_PATH_TEST_EVAL = os.path.join(ROOT_OUTPUT, "03_test_set_evaluation")
OUTPUT_PATH_HYPPAR_TUN = os.path.join(ROOT_OUTPUT, "04_hyperparamter_tuning")
OUTPUT_PATH_RUN_TIME = os.path.join(ROOT_OUTPUT, "05_runtime_stats")

# function to save figures

def save_fig(fig_id, SAVE_PATH=ROOT_PATH_FIG, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(SAVE_PATH, fig_id + "." + fig_extension)
    print(">... Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Loading and splitting training data

In [29]:
labels_df = pd.read_csv(ROOT_PATH_DATA + "train/clean_sample_train.csv", sep = ';')

In [30]:
labels_full = labels_df["species"]

In [31]:
pic_ids_full = labels_df["image"]

In [32]:
# load npz files ## has to be adjusted to the actual one
img_data = np.load("input/04_cleaned/train/img_data_sample_224.npz")
img_data_full = img_data["arr_0"]

In [33]:
# Split into training and test set - 10,000 test set / 40,000 full training set

## Victor: this does not work with our sample set because it contains species with only 1 image,
## but stratify=labels should split proportionally to the species

from sklearn.model_selection import train_test_split
img_data_train_full, img_data_test, labels_train_full, labels_test, pic_ids_train_full, pic_ids_test = train_test_split(img_data_full, labels_full, pic_ids_full, stratify=labels_full, test_size=0.2, random_state=42)


ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
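
The error above comes from species that occur only once in the sample set. One possible workaround, shown here as a minimal sketch on made-up toy data, is to drop classes below a minimum count before the stratified split. The names `labels_demo`, `img_demo` and `MIN_SAMPLES_PER_CLASS` are hypothetical and not part of the project code, and the threshold of 2 is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the image array and the species labels (hypothetical data).
rng = np.random.default_rng(42)
labels_demo = pd.Series(["beluga"] * 10 + ["orca"] * 9 + ["narwhal"] * 1)
img_demo = rng.random((len(labels_demo), 12))

# Keep only species with enough samples; the threshold is an assumption.
MIN_SAMPLES_PER_CLASS = 2
class_counts = labels_demo.value_counts()
keep_mask = labels_demo.map(class_counts).ge(MIN_SAMPLES_PER_CLASS).to_numpy()

img_kept = img_demo[keep_mask]
labels_kept = labels_demo[keep_mask]

# With the singleton class removed, the stratified split no longer raises.
X_train, X_test, y_train, y_test = train_test_split(
    img_kept, labels_kept, stratify=labels_kept, test_size=0.2, random_state=42
)
print("kept classes:", sorted(labels_kept.unique()))
print("train/test sizes:", len(X_train), len(X_test))
```

Dropping rare species changes the label set, so whether this is acceptable depends on the full data set, where the issue may not occur at all.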

In [None]:
# Split into training and validation set - 30,000 training set / 10,000 validation set
from sklearn.model_selection import train_test_split
img_data_train, img_data_val, labels_train, labels_val, pic_ids_train, pic_ids_val = train_test_split(img_data_train_full , labels_train_full, pic_ids_train_full, train_size=30000, random_state=42)


## Implementing baseline model

In [None]:
def train_clasf(classifier_x, img_data_train, labels_train):        
    # set name of classifier
    classifier_name = classifier_x.__class__.__name__
    
    # train model
    print(">... Starting training of", classifier_name)
    start_time = timeit.default_timer()
    classifier_x.fit(img_data_train, labels_train)
    time_elapsed = timeit.default_timer() - start_time
    
    print(">... Classifier {} successfully trained in {} seconds.".format(classifier_name, round(time_elapsed,3)))
        

In [None]:
from sklearn.linear_model import LogisticRegression

classifier_LR = LogisticRegression(random_state=42)
train_clasf(classifier_LR, img_data_train, labels_train)

In [None]:
## do we have a preference here?

# In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
# scheme if the 'multi_class' option is set to 'ovr',
# and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
# 'auto' selects 'ovr' if the data is binary, or if solver='liblinear', and otherwise selects 'multinomial'.
# The default is "auto".

## Evaluating baseline model ("compute metrics on train AND dev")
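
Because the species labels form a multiclass target, `precision_score`, `recall_score` and `f1_score` need an explicit `average` argument. The sketch below uses a tiny made-up label vector (`y_true` and `y_pred` are hypothetical) to illustrate how `average=None`, `"macro"` and `"weighted"` differ, and that the default `"binary"` setting raises an error for more than two classes.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score

# Hypothetical true and predicted species for six images.
y_true = np.array(["beluga", "beluga", "orca", "orca", "orca", "narwhal"])
y_pred = np.array(["beluga", "orca", "orca", "orca", "beluga", "narwhal"])

# average=None returns one score per class (classes in sorted order).
per_class_f1 = f1_score(y_true, y_pred, average=None)

# "macro" is the unweighted mean of the per-class scores,
# "weighted" weights each class by its number of true samples.
macro_f1 = f1_score(y_true, y_pred, average="macro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print("per class:", np.round(per_class_f1, 3))
print("macro:", round(float(macro_f1), 3), "weighted:", round(float(weighted_f1), 3))

# The default average="binary" is not defined for multiclass targets.
try:
    precision_score(y_true, y_pred)
    binary_default_failed = False
except ValueError as err:
    binary_default_failed = True
    print("ValueError:", err)
```

Macro averaging treats rare and frequent species equally, while weighted averaging follows the class frequencies. Which one fits the project better is a judgment call.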

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score

In [None]:
# store predictions of classifier ## Victor: I was not quite sure whether this is correct
classifier_name = classifier_LR.__class__.__name__  # used in the file names of the stored results
pred_train = classifier_LR.predict(img_data_train)
pred_val = classifier_LR.predict(img_data_val)

# evaluate classifier and store metrics
# precision and recall need an explicit average for multiclass labels (the default "binary" raises a ValueError);
# "macro" is the unweighted mean over all species
evaluation_scores = {}
evaluation_scores["Precision Score Train"] = np.round(precision_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Precision Score Validation"] = np.round(precision_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["Recall Score Train"] = np.round(recall_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Recall Score Validation"] = np.round(recall_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["F1 Score Train"] = np.round(f1_score(labels_train, pred_train, average=None), 3) ## average=None returns one F1 score per class; please double-check that this is what we want for multiclass
evaluation_scores["F1 Score Validation"] = np.round(f1_score(labels_val, pred_val, average=None), 3)


In [None]:
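# save evaluation scores
# note: expects a global `classifier_name` to be defined before it is called
def store_eval_score(scores_dict):
    os.makedirs(OUTPUT_PATH_TRAIN_EVAL, exist_ok=True)
    # the dictionary is stored as a pickled object array, so reload it with np.load(..., allow_pickle=True)
    savez_compressed(OUTPUT_PATH_TRAIN_EVAL + "/evaluation_scores"+str(classifier_name)+".npz", scores_dict)
    print("file successfully stored in:", OUTPUT_PATH_TRAIN_EVAL)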


In [None]:
store_eval_score(evaluation_scores)

In [None]:
## Multiclass Confusion Matrix

## Victor: I do not know whether this helps us, but it could theoretically show whether there are certain classes
## that are harder to predict or lead to more errors

In [None]:
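from sklearn.metrics import multilabel_confusion_matrix
multilabel_confusion_matrix(labels_val, pred_val, labels=np.unique(labels_val))  # `labels` expects the unique class labels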
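
To see which species are hardest to predict, the per-class 2x2 matrices can be compared with a single class-by-class confusion matrix. The following sketch uses made-up labels (`y_true`, `y_pred` are hypothetical) and derives a per-species error rate. It illustrates the two scikit-learn functions and is not the project's evaluation code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix

# Hypothetical true and predicted species for six images.
y_true = np.array(["beluga", "beluga", "orca", "orca", "orca", "narwhal"])
y_pred = np.array(["beluga", "orca", "orca", "orca", "beluga", "narwhal"])
class_names = np.unique(y_true)

# One row per true class, one column per predicted class.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# One 2x2 matrix [[TN, FP], [FN, TP]] per class (one-vs-rest view).
mcm = multilabel_confusion_matrix(y_true, y_pred, labels=class_names)

# Share of each true class that was misclassified (1 - recall).
error_rate = 1 - np.diag(cm) / cm.sum(axis=1)

for name, rate in zip(class_names, error_rate):
    print(name, round(float(rate), 3))
```

Species with a high error rate in this table are candidates for a closer look in the error inspection.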

In [None]:
# Inspecting the errors
output_dict = {}
output_array = np.c_[pic_ids_val, labels_val, pred_val] ## adjust name of pic_ids and labels depending on train or val

# Flag every row whose predicted label differs from the true label
err_type_arr = np.where(output_array[:, 1] != output_array[:, 2], "error", "No error")

error_table_pd = pd.DataFrame(output_array)
error_table_pd.rename(columns = {0:'Picture ID', 1:"Label", 2:"Predicted"}, inplace = True)
error_table_pd["Error Check"] = err_type_arr

# print filtered error table
print(error_table_pd.loc[error_table_pd["Error Check"].isin(["error"])].sort_values(by=["Label", "Picture ID"]))


In [None]:
# def function for saving the filtered error table

def store_error_table(image_df):
    savez_compressed(OUTPUT_PATH_TRAIN_EVAL + "/error_table"+str(classifier_name)+".npz",image_df)
    print("file successfully stored in: output/ml_models/02_training_set_evaluation")


In [None]:
# transform pd frame into dictionary for saving
output_dict["error_table"] = error_table_pd

# saving the error table
store_error_table(output_dict)

## Implementing RandomForest Classifier as advanced model

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### Victor: do we set max_depth= and n_estimators=500 right away?
## there are so many hyperparameters for RF, e.g. also class_weight - I was not sure whether we use that either
## or whether we generally only look at all of these variables during hyperparameter tuning

# max_depth is still undecided; None is scikit-learn's default (no depth limit)
classifier_RF = RandomForestClassifier(max_depth=None, n_jobs=-1, random_state=42)
train_clasf(classifier_RF, img_data_train, labels_train)

## Evaluating RandomForest Classifier 

In [None]:
# store predictions of classifier ## Victor: I was not quite sure whether this is correct
classifier_name = classifier_RF.__class__.__name__  # used in the file names of the stored results
pred_train = classifier_RF.predict(img_data_train)
pred_val = classifier_RF.predict(img_data_val)

# evaluate classifier and store metrics
# precision and recall need an explicit average for multiclass labels (the default "binary" raises a ValueError);
# "macro" is the unweighted mean over all species
evaluation_scores = {}
evaluation_scores["Precision Score Train"] = np.round(precision_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Precision Score Validation"] = np.round(precision_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["Recall Score Train"] = np.round(recall_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Recall Score Validation"] = np.round(recall_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["F1 Score Train"] = np.round(f1_score(labels_train, pred_train, average=None), 3) ## average=None returns one F1 score per class; please double-check that this is what we want for multiclass
evaluation_scores["F1 Score Validation"] = np.round(f1_score(labels_val, pred_val, average=None), 3)

In [None]:
store_eval_score(evaluation_scores)

In [None]:
## Multiclass Confusion Matrix

## Victor: I do not know whether this helps us, but it could theoretically show whether there are certain classes
## that are harder to predict or lead to more errors

In [None]:
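from sklearn.metrics import multilabel_confusion_matrix
multilabel_confusion_matrix(labels_val, pred_val, labels=np.unique(labels_val))  # `labels` expects the unique class labels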

In [None]:
## Victor - I do not know whether we can do this, but maybe you can try it
## https://github.com/harsh1kumar/learning/blob/master/machine_learning/santander_trxn_prediction/02_trxn_pred_rf_basics.ipynb
## in that notebook, under Step 5, he visualizes the ROC-AUC for the individual trees
## to show the model performance in relation to the number of trees - I think that is quite cool
## I am not quite sure whether this also works with multi-class
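
The idea of tracking performance against the number of trees can also be applied to a multiclass problem if accuracy is used in place of ROC-AUC. The sketch below is an assumption about how this could look and not the method of the linked notebook: it trains a small forest on synthetic data from `make_classification` and averages the class probabilities of the first k trees for each k.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data as a stand-in for the image features (hypothetical).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, test_size=0.3,
                                          random_state=42)

forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_tr, y_tr)

# Class probabilities of every single tree: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_va) for tree in forest.estimators_])

# Summing the first k trees gives the same argmax as averaging them.
cumulative_probas = np.cumsum(tree_probas, axis=0)
accuracy_by_n_trees = [
    accuracy_score(y_va, forest.classes_[probas.argmax(axis=1)])
    for probas in cumulative_probas
]

print("accuracy with 1 tree:", round(accuracy_by_n_trees[0], 3))
print("accuracy with all trees:", round(accuracy_by_n_trees[-1], 3))
```

Plotting `accuracy_by_n_trees` with `plt.plot` would give the curve over the number of trees. For the real data, `forest` would be replaced by the trained random forest and `X_va`, `y_va` by the validation set.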

## Hyperparameter Tuning RandomForest Classifier

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Selecting Parameters of Random Forest Classifier for Grid Search

## hey Victor, I was not sure here which parameters we should select
## for now I took the ones that I read are important
# I read this article: https://towardsdatascience.com/random-forest-hyperparameters-and-how-to-fine-tune-them-17aee785ee0d

## for max_features I was not sure what we should restrict it to (None uses all features)
## "sqrt" replaces "auto", which meant the same for classifiers and was removed in newer scikit-learn versions

params_grid_RF = [
    {"n_estimators": [10, 100, 250],
     "criterion": ["gini", "entropy"],
     "max_features": ["sqrt", "log2", None],
     "bootstrap": [True, False]}
]


In [None]:
## Victor: we have to fix max_depth here again
## max_depth is still undecided; None is scikit-learn's default (no depth limit)

classifier_RF_hyper = RandomForestClassifier(max_depth=None, n_jobs=-1, random_state=42)

In [None]:
# Perform Grid-Search

## Victor: I was unsure about scoring and refit here - please check
## the plain "precision" and "recall" scorers are binary-only, so the macro-averaged variants are used for our multiclass labels

grid_search = GridSearchCV(estimator = classifier_RF_hyper,
                           param_grid = params_grid_RF,
                           cv=3,
                           scoring=["precision_macro", "recall_macro", "accuracy"],
                           refit = "precision_macro",
                           n_jobs = 4,
                           verbose = 3,
                           return_train_score = True)

grid_search.fit(img_data_train, labels_train)
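
A grid search with several metrics stores one result column per scorer and refits the final model according to the metric named in `refit`. The following minimal sketch runs on synthetic data with a deliberately tiny grid. The data, the grid values and the name `search` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data as a stand-in for the image features (hypothetical).
X, y = make_classification(n_samples=150, n_features=10, n_informative=4,
                           n_classes=3, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [5, 10], "max_features": ["sqrt", None]},
    cv=3,
    # Macro-averaged scorers work for multiclass targets.
    scoring=["precision_macro", "recall_macro", "accuracy"],
    # The best parameters are chosen by this metric.
    refit="precision_macro",
)
search.fit(X, y)

print(search.best_params_)
# One "mean_test_<metric>" column per scorer is available for comparison.
print(sorted(key for key in search.cv_results_ if key.startswith("mean_test_")))
```

Passing `search.cv_results_` to `pd.DataFrame` gives a table in which the three metrics can be compared for every parameter combination.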

In [None]:
print(grid_search.best_params_)
best_hyppar_dict = {}
best_hyppar_dict["Grid_Search_Best_Params"] = grid_search.best_params_

In [None]:
def store_best_hyperpar(image_df):
    savez_compressed(OUTPUT_PATH_HYPPAR_TUN + "/best_hyperpar"+str(classifier_name)+".npz",image_df)
    print("file successfully stored in: output/ml_models/04_hyperparamter_tuning")

In [None]:
store_best_hyperpar(best_hyppar_dict)

## Evaluate tuned classifier

In [None]:
## define the tuned classifier with the best hyperparameters ## the values here are still arbitrary placeholders

classifier_RF_tuned = RandomForestClassifier(class_weight='balanced',
                                      criterion='gini',
                                      max_depth=55,
                                      max_features='log2',
                                      min_samples_leaf=0.005,
                                      min_samples_split=0.005,
                                      n_estimators=190)

classifier_RF_tuned.fit(img_data_train, labels_train)

In [None]:
# store predictions of classifier ## Victor: I was not quite sure whether this is correct
classifier_name = classifier_RF_tuned.__class__.__name__ + "_tuned"  # separate file name from the untuned forest
pred_train = classifier_RF_tuned.predict(img_data_train)
pred_val = classifier_RF_tuned.predict(img_data_val)

# evaluate classifier and store metrics
# precision and recall need an explicit average for multiclass labels (the default "binary" raises a ValueError);
# "macro" is the unweighted mean over all species
evaluation_scores = {}
evaluation_scores["Precision Score Train"] = np.round(precision_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Precision Score Validation"] = np.round(precision_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["Recall Score Train"] = np.round(recall_score(labels_train, pred_train, average="macro"), 3)
evaluation_scores["Recall Score Validation"] = np.round(recall_score(labels_val, pred_val, average="macro"), 3)
evaluation_scores["F1 Score Train"] = np.round(f1_score(labels_train, pred_train, average=None), 3) ## average=None returns one F1 score per class; please double-check that this is what we want for multiclass
evaluation_scores["F1 Score Validation"] = np.round(f1_score(labels_val, pred_val, average=None), 3)

In [None]:
store_eval_score(evaluation_scores)

## Speed improvements through dimensionality reduction

### Checking Feature importance of RF

In [None]:
# Defining the function to plot the digits of feature importance
# Adapted from Aurelien Geron:

pix_res = 224

def plot_digit(data):
    image = data.reshape(pix_res, pix_res)
    plt.imshow(image, cmap = mpl.cm.hot,
               interpolation="nearest")
    plt.axis("off")

In [None]:
# Adding feature importances from 3 RGB values to one pixel
# (as before, this assumes the flattened features are ordered pixel by pixel as R, G, B)
feature_imp_sum = classifier_RF_tuned.feature_importances_.reshape(-1, 3).sum(axis=1)


In [None]:
# Plotting feature importance sum for every pixel to a plot and save it
plot_digit(feature_imp_sum)

cbar = plt.colorbar(ticks=[feature_imp_sum.min(), feature_imp_sum.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

In [None]:
save_fig("RandomForest_feature_importance_plot_full_data")

### Performing Principal Component Analysis on the training data

In [None]:
from sklearn.decomposition import PCA

In [None]:
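img_data_train.shape

In [None]:
# defining to keep 99% of the variance of the data
# fit the PCA on the training data only and reuse the fitted projection for the validation data
pca = PCA(.99)
img_data_train_red = pca.fit_transform(img_data_train)
img_data_val_red = pca.transform(img_data_val)

In [None]:
# checking how many features are left
img_data_train_red.shape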
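
When PCA is used ahead of a classifier, the projection is usually fitted on the training split and then applied unchanged to validation and test data, so that no information from held-out images leaks into the features. The sketch below shows this pattern on synthetic low-rank data. All names ending in `_demo` are hypothetical, and `svd_solver="full"` is set explicitly as an assumption to make the variance-based component selection independent of the scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with 5 underlying directions plus a little noise (hypothetical).
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))
X_train_demo, X_val_demo = X_demo[:150], X_demo[150:]

# Keep as many components as needed to explain 99% of the training variance.
pca_demo = PCA(n_components=0.99, svd_solver="full")
X_train_red = pca_demo.fit_transform(X_train_demo)

# The validation data is only transformed, never used for fitting.
X_val_red = pca_demo.transform(X_val_demo)

print("components kept:", pca_demo.n_components_)
print("shapes:", X_train_red.shape, X_val_red.shape)
print("explained variance:", round(float(pca_demo.explained_variance_ratio_.sum()), 4))
```

The same fitted object would later transform the test set before the final evaluation.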

### Evaluate speed tuned RF classifier

In [None]:
# Victor: do we then have to redo the whole training (directly with the hyperparameters found) and
# the evaluation here? I think so, right?
# there is then a risk that the hyperparameters are no longer ideal and have to be
# fine-tuned again (that was the case for Thilo), but if the predictions are not too bad we can also argue
# that we do not do this due to run-time and scope limitations of our project

## Final Evaluation on Test Set

In [None]:
## test baseline and untuned RF on test set

In [None]:
## test tuned RF on test set