## Machine Learning Workshop

This workshop is a demo for machine learning in Python.  
The goal for this session is not to train an amazing ML model, but to explain the typical steps in a machine learning project and to provide some background and further reading.

### Setup

- Install conda: https://docs.conda.io/projects/conda/en/latest/user-guide/install/download.html
- Create a conda environment with the required packages (in console): `conda create -n ml_workshop scikit-learn matplotlib ipykernel`
- Activate conda environment and start a Jupyter notebook with the right environment:

        conda activate ml_workshop
        python -m ipykernel install --user --name ml_workshop --display-name "ML Workshop"
        jupyter notebook
- Alternative: run in VSCode with Python and Jupyter notebook extensions
- If you don't want to set up a local version, you can try this browser-based notebook: https://jupyter.org/try-jupyter/retro/notebooks/?path=notebooks/Intro.ipynb

### Other popular Python libraries for ML
- `numpy`, `pandas`: for data manipulation
- `scikit-learn` for ML algorithms, model evaluation, resampling
- `tensorflow`, `keras`, `pytorch` for Deep Learning
- `matplotlib` for visualization

### The iris dataset

The iris dataset contains measurements of 3 different species of iris flowers. It contains 150 data points (50 of each class).
For more background, see https://en.wikipedia.org/wiki/Iris_flower_data_set

Sepals and Petals | Iris setosa | Iris versicolor | Iris virginica
- | - | - | -
![Sepals and Petals](iris_figs/220px-Petal-sepal.jpg) | ![Iris setosa flower](iris_figs/Kosaciec_szczecinkowaty_Iris_setosa.jpg) | ![Iris versicolor flower](iris_figs/220px-Iris_versicolor_3.jpg) |![Iris virginica flower](iris_figs/Iris_virginica.jpg) 

4 features: length and width of sepals and petals (different parts of the flower)  
3 classes: setosa, versicolor and virginica

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
iris
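
The object returned by `load_iris()` bundles several arrays. As a minimal sketch (the variable name `class_counts` is only illustrative), the following cell checks the numbers stated above: 150 rows, 4 features and 50 data points per class.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
print(iris.feature_names)

# Class labels are stored as integers 0, 1, 2; target_names maps them to species names
print(iris.target_names)

# Count how many data points belong to each class
class_counts = np.bincount(iris.target)
print({str(name): int(count) for name, count in zip(iris.target_names, class_counts)})
```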

## Resampling

The goal for a classification model is to learn the relationship between the features and the class label (so we can apply it to new data points later). At the same time, we want to understand how well our model performs.    
If we evaluate the model on the data it was trained on, a model that simply memorizes the training data will get a perfect score - this is useless! We need to find a better way to evaluate our model's performance.  

One method for this evaluation is resampling. The general idea here is to train the model on a subset of the available data and evaluate it on a different subset. There are many methods for this; see https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets and also https://en.wikipedia.org/wiki/Cross-validation_(statistics).

In this tutorial, we will use a simple train/test split.

![Dataset split in training and test set](iris_figs/train_test.png) 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.33)
print(f"Number of rows / features in training data: {X_train.shape}")
print(f"Number of training labels: {y_train.shape}")

print(f"Number of rows / features in test data: {X_test.shape}")
print(f"Number of test labels: {y_test.shape}")
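
For comparison with the single split above, here is a hedged sketch of k-fold cross-validation, one of the other resampling methods mentioned earlier. The choice of 5 folds and of a k-nearest neighbors classifier is an assumption for illustration, not part of the workshop's own pipeline.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# 5-fold cross-validation: the data is split into 5 parts; each part is used
# once as the evaluation set while the model is trained on the other 4 parts
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), iris.data, iris.target, cv=5)

print(scores)  # one accuracy value per fold
print(f"Mean accuracy: {scores.mean()*100:.2f}%")
```

Every data point is used for evaluation exactly once, so the estimate depends less on one lucky or unlucky split, at the cost of training the model 5 times.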

## Model selection

There are tons of different options for ML algorithms to choose from - see for example https://scikit-learn.org/stable/supervised_learning.html. \
Selecting the right algorithm can be crucial for the performance of the system, but how do we decide which algorithm works best? \
Often, this comes down to a few factors:
- previous experience on a task / similar dataset
- current state of the art (based on scientific literature)
- experimentation (try a bunch of things and see what works best)  

For this tutorial, we will try three classification algorithms: [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree), [Support Vector Machine](https://en.wikipedia.org/wiki/Support-vector_machine) and [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

clf1 = DecisionTreeClassifier()
clf2 = SVC()
clf3 = KNeighborsClassifier(n_neighbors=3)
classifiers = [clf1, clf2, clf3]
clf_names = ["Decision Tree", "Support Vector Machine", "k-nearest neighbors"]

for classifier in classifiers:
    classifier.fit(X_train, y_train)

In [None]:
# Model evaluation - for more information about the confusion matrix, see https://en.wikipedia.org/wiki/Confusion_matrix
# scikit-learn offers many metrics out of the box - see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics for an overview

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

for classifier, name in zip(classifiers, clf_names):
    y_pred = classifier.predict(X_test)

    print(f"Accuracy on test data for {name}: {accuracy_score(y_true=y_test, y_pred=y_pred)*100:.2f}%")
    print(confusion_matrix(y_true=y_test, y_pred=y_pred))
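
Accuracy and the confusion matrix can be complemented with per-class metrics. The sketch below uses `classification_report` from `sklearn.metrics`. It recreates the split with a fixed `random_state=0` (an assumption, added only so the cell runs on its own and gives repeatable numbers) and uses a Support Vector Machine as the example model.

```python
from sklearn import datasets
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()

# Fixed random_state is an assumption for reproducibility; the workshop split is random
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0
)

clf = SVC().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Precision, recall and F1 score for each class, labelled with the species names
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```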


## Retraining on full dataset after model selection

Once we are happy with the estimated performance of our selected model, we can retrain it on the entire dataset.  
In many cases having more training data can improve model performance, so it would be a waste not to use the test data.  
You may skip this step in some instances (for example if model training is very expensive computationally) and use the model you trained previously on the training data.

In [None]:
# Retrain the Decision Tree (clf1) on the full dataset (training and test data)
clf1 = clf1.fit(iris.data, iris.target)

In [None]:
# Visualize the trained Decision Tree and save the plot as an image

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
%matplotlib inline  

plot = plot_tree(clf1, feature_names=iris.feature_names, class_names=iris.target_names)
plt.savefig("iris_figs/decision_tree.png", format="png", dpi=300, facecolor='white', transparent=False)
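
If a plot is inconvenient (for example in a console), the same tree can be printed as text rules with `export_text` from `sklearn.tree`. This is a small sketch that trains its own tree so it runs on its own; `random_state=0` is an assumption to make the printed tree repeatable.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()

# Train a Decision Tree on the full dataset, as in the retraining step
tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Each line is a threshold test on one feature; leaves show the predicted class index
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```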

## Hyperparameter tuning

Many ML algorithms have hyperparameters: settings that are chosen before training rather than learned from the data, and that can greatly improve (or degrade) model performance. As part of the model selection process, we often train models with different hyperparameters to see what works best. \
There are many techniques for selecting hyperparameter configurations (see https://en.wikipedia.org/wiki/Hyperparameter_optimization). These are very common:
- Grid Search (exhaustive sweep over pre-defined hyperparameter values; the number of configurations grows exponentially with the number of hyperparameters, so it quickly becomes inefficient)
- Random Search (sampling random hyperparameter configurations)
- Bayesian Optimization (sampling "promising" hyperparameter configurations based on previous evaluations)  

Personal recommendation: I'm a big fan of Bayesian Optimization, for example with the `optuna` package. \
The following example showcases how a bad choice of hyperparameter can destroy model performance.

In [None]:
# k-nearest neighbors predicts the majority class of the k closest training data points (based on a distance measure)
bad_classifier = KNeighborsClassifier(n_neighbors=100)
bad_classifier.fit(X_train, y_train)
y_pred = bad_classifier.predict(X_test)
print(f"Accuracy on test data for kNN: {accuracy_score(y_true=y_test, y_pred=y_pred)*100:.2f}%")
# Because we consider 100 neighbors and the training dataset only contains 100 rows, every prediction is the most common class in the training data
print(confusion_matrix(y_true=y_test, y_pred=y_pred))
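
In contrast to the bad choice above, a systematic search can pick a reasonable value for `n_neighbors`. The following is a minimal Grid Search sketch using scikit-learn's `GridSearchCV`. The candidate values, the 5-fold cross-validation and the fixed `random_state=0` are assumptions chosen for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Fixed random_state is an assumption for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=0
)

# Candidate values for the hyperparameter (illustrative choice)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 15, 21]}

# Each candidate is scored with 5-fold cross-validation on the training data only,
# so the test set stays untouched until the final evaluation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_*100:.2f}%")
print(f"Test accuracy: {search.score(X_test, y_test)*100:.2f}%")
```

Tuning on the training data and reporting the test score only once keeps the test set an honest estimate of performance on new data.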

## Related topics not covered in this session

- Data engineering - getting high quality training data is the most important step in any machine learning project
- Over-/Underfitting, Bias-Variance tradeoff
- Deployment - or if you like buzzwords, [MLOps](https://en.wikipedia.org/wiki/MLOps)
- Online experimentation - using statistical techniques like [A/B testing](https://en.wikipedia.org/wiki/A/B_testing)
- Monitoring - model performance may change / degrade over time (e.g. due to changing usage patterns following a global pandemic)
- Privacy / Responsible AI / Compliance - especially important in large organizations