# Hands On 3.1 - Machine learning and classification

In this laboratory you will learn about classification problems and how they can be approached using a
category of tree-based models. In particular, you will use a decision tree from scikit-learn. You will see
it in action on different datasets and understand its strengths and weaknesses. Then, you will
work with your own version of a random forest (an implementation is already given), built on top of scikit-learn’s decision trees.

## 1. Wine classification

In [None]:
from sklearn.datasets import load_wine
import numpy as np

dataset = load_wine()

print(dataset.keys())
X = dataset["data"]
y = dataset["target"]
feature_names = dataset["feature_names"]

In [None]:
print(dataset['DESCR'])

In this exercise, you will use sklearn’s `DecisionTreeClassifier` to build a decision tree for the wine dataset. You can read more about this class in the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).


1. Based on your $X$ and $y$, answer the following questions:

- How many records are available?
- Are there missing values?
- How many elements does each class contain?
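
As a hedged illustration of the kind of NumPy calls that can answer these questions, the sketch below works on a tiny made-up array. `X_toy` and `y_toy` are hypothetical and are not the wine data; the same calls should carry over to your own $X$ and $y$.

```python
import numpy as np

# Hypothetical toy data: 4 records, 2 features, one missing value (NaN)
X_toy = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [5.0, 6.0],
                  [7.0, 8.0]])
y_toy = np.array([0, 1, 1, 0])

num_records = X_toy.shape[0]                 # one row per record
num_missing = int(np.isnan(X_toy).sum())     # number of NaN entries
classes, counts = np.unique(y_toy, return_counts=True)
class_sizes = {int(c): int(n) for c, n in zip(classes, counts)}  # elements per class

print(num_records, num_missing, class_sizes)
```

Note that `np.isnan` only detects missing values encoded as NaN; other encodings (for example a sentinel such as `-1`) would need a different check.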

In [None]:
## Write your code here ##
num_records = 
num_missing = 

print(f"Number of available records: {num_records}")
print(f"Number of missing values: {num_missing}")

print(f"Number of elements in each class:") # print a list or whatever you want


2. Create a `DecisionTreeClassifier` object with the default configuration (i.e. without passing any
parameters to the constructor). Train the classifier using your $X$ and $y$.

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = 

3. Now that you have created a tree, you can visualize it. Sklearn offers two functions to visualize decision trees. The first one, `plot_tree()`, plots the tree in a matplotlib-based, interactive window.
An alternative is `export_graphviz()`, which exports the tree as a DOT file. DOT
is a language for describing graphs (and, as a consequence, trees). From the DOT code, you can
generate the visual representation using dedicated Python libraries. We recommend the first approach, which is faster.
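
To give an idea of what DOT code looks like, here is a minimal sketch of the `export_graphviz()` route. The names `wine` and `small_tree`, and the choice of `max_depth=2` (used only to keep the output short), are assumptions made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_graphviz

wine = load_wine()

# A shallow tree keeps the DOT output short (max_depth=2 is an arbitrary choice)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(wine["data"], wine["target"])

# With out_file=None the DOT source is returned as a string instead of being written to a file
dot_source = export_graphviz(small_tree, out_file=None,
                             feature_names=wine["feature_names"])
print(dot_source[:200])
```

The returned string can then be rendered by any tool that understands DOT.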

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Idea: can you increase the plot size? ax can be something you already used

#####
p = plot_tree(clf, ax=None)

After you successfully plotted a tree, you can take a closer look at the result and draw some conclusions. In particular, what information is contained in each node? Take a closer look at the leaf
nodes. Based on what you know about overfitting, what can you learn from these nodes?

4. Given the dataset $X$, you can get the predictions of the classifier (one for each entry in $X$) by calling the `predict()` method of `DecisionTreeClassifier`. Then, use the `accuracy_score()` function (which you can import from `sklearn.metrics`) to compute the accuracy between two lists of values (`y_true`,
the list of “correct” labels, and `y_pred`, the list of predictions made by the classifier). Since you
already have both lists ($y$ for the ground truth, and the result of the `predict()` method for the predictions), you can compute the accuracy of your classifier right away. What result do you get? Does
this result seem particularly **high/low**? Why do you think that is?
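
As a reminder of what accuracy measures, the sketch below computes it by hand on made-up labels: the fraction of samples whose prediction matches the ground truth. The two lists are hypothetical and unrelated to the wine dataset.

```python
# Hypothetical labels, chosen only to illustrate the computation
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]

# Accuracy = number of correctly predicted samples / total number of samples
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 4 correct out of 5 -> 0.8
```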

In [None]:
from sklearn.metrics import accuracy_score


### ⚠️ Try to answer this question before going on

5. Now, we can split our dataset into a training set and a test set. We will use the training set to train a model, and the test set to assess its performance. Sklearn offers the `train_test_split()`
function to split any number of arrays (all having the same length on the first dimension) into two
sets. You can refer to the [official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to understand how it can be used. You can use an
80/20 train/test split. If used correctly, you will get 4 arrays: `X_train`, `X_test`, `y_train`, `y_test`.

    Try to compute the balance between the classes for the train and test sets. Is it acceptable?
> **Hint:** Consider parameters such as `shuffle` and `stratify` to answer this question.
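
To see what `stratify` does in isolation, here is a small sketch on made-up, imbalanced labels. `X_toy`, `y_toy` and the 50/30/20 class proportions are assumptions for illustration, not properties of the wine dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50% class 0, 30% class 1, 20% class 2
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Class counts in the test set: with stratify, proportions follow the full dataset
test_counts = np.bincount(y_te)
print(test_counts / len(y_te))
```

Removing `stratify=y_toy` and rerunning with different `random_state` values shows how much the class proportions can drift on a purely random split.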


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
print("train")
for c in [0,1,2]:
    print(c, len(y_train[y_train==c])/len(y_train))
 
print("\ntest")
for c in [0,1,2]:
    print(c, len(y_test[y_test==c])/len(y_test))

6. Now, train a new model using (`X_train`, `y_train`) and compute the accuracy with (`X_test`,
`y_test`). How does this value compare to the previously computed one? Is this a more reasonable
value? Why? 

    This should give you a good idea as to why training and testing on the same dataset returns meaningless results. Try also to use the `classification_report` function, which returns various metrics (including the accuracy) for each of the classes of the problem.
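
As a hedged sketch of what `classification_report` returns, the example below runs it on made-up labels and asks for a dictionary (`output_dict=True`) so that individual metrics can be read programmatically. The label lists are hypothetical.

```python
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# output_dict=True returns nested dictionaries instead of a formatted string
report = classification_report(y_true, y_pred, output_dict=True)

print(report["1"])          # precision, recall, f1-score and support for class 1
print(report["accuracy"])   # overall accuracy (4 correct out of 6)
```

Here class 1 has perfect recall (both true 1s are found) but a precision of 2/3, because one sample of class 0 is also predicted as 1: per-class metrics can reveal errors that a single accuracy value hides.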


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

clf = DecisionTreeClassifier()


#######

print(classification_report(y_test, y_pred))
g = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, square=True)


## Model selection

7. So far, you have only used decision trees with the default configuration. The “default” decision tree might not be the best-performing option for our dataset. Let's now perform a “grid search”: you will define a set of possible configurations and, for each configuration, build a classifier. The idea is to test the performance of each classifier and identify the configuration that produces the best model.
Based on the documentation and on our theoretical knowledge, we can pick a list of parameters to tune and the values to try for each of them.
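
To make the idea of a "set of possible configurations" concrete, the sketch below shows how `ParameterGrid` expands a dictionary of candidate values into every combination. The reduced dictionary `small_params` is an assumption used to keep the output short.

```python
from sklearn.model_selection import ParameterGrid

# A reduced, hypothetical grid: 2 values x 2 values = 4 configurations
small_params = {"max_depth": [None, 2], "min_impurity_decrease": [0, 0.01]}
small_grid = ParameterGrid(small_params)

# Each element is a plain dict mapping parameter names to one chosen value
configs = list(small_grid)
for config in configs:
    print(config)
```

Since each configuration is a dictionary, it can be unpacked into a constructor with `**config`.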

In [None]:
from sklearn.model_selection import ParameterGrid
import pandas as pd  # used below to build the accuracy heatmap

params = {
    "max_depth": [None, 2, 3, 4, 5, 6],
    "min_impurity_decrease": [0, .01, .03, .07, .09, .11]
}

accuracies = []
grid = ParameterGrid(params)
    
###################



print(f"The best model is made with {grid[np.argmax(accuracies)]}")
print(f"getting accuracy of {max(accuracies)}")

g = sns.heatmap(pd.DataFrame(grid).assign(acc=accuracies).pivot(columns='max_depth', index='min_impurity_decrease', values='acc'), square=True)

8. With the previous exercise, we have identified what the **best hyperparameter** configuration should be. We also have an estimate of how good the best-performing configuration is (the one we obtained on the test set). The problem is that this estimate is, in a way, biased: **we have selected the configuration that performs best on our test set**. Much like there is a risk of overfitting on the training data because we learn it "too well", there is also a risk of overfitting on the data we use for validation, if we keep tweaking and adjusting our model to optimize the performance on that dataset.

    For this reason, we typically use **three** separate datasets: the *training*, the *validation* and the *test* set. We use the first one for training the model, the second one to select which model to use (of the many we can create) and the last one to get a final estimate of how good our classifier is.

    Holding out both a validation and a test set leaves less data for training. This is why, when data is limited, we typically use the so-called **k-fold cross validation**. In short, we use the training set for both training and validation, by rotating the data we use for either operation.

    We can now split our original dataset (X, y) into a test set (`X_test`, which we will set aside and use only for the final evaluation) and a set (`X_train_valid`) that we will use for both training and validation (through k-fold cross validation).

    At each iteration, a different fold is used for validation, while the rest of `X_train_valid` is used for training. This means that, for each classifier, we will get $k$ accuracies, which we then need to aggregate to obtain a single, **overall accuracy**.
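
The rotation of folds can be sketched with `KFold` on a made-up array of 10 samples. `X_toy`, the choice of 5 folds and the values in `fold_scores` are all assumptions for illustration; averaging is one common way to aggregate the per-fold accuracies, not the only one.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # hypothetical data: 10 samples, 1 feature

kf = KFold(n_splits=5, shuffle=True, random_state=0)
validation_folds = []
for train_idx, valid_idx in kf.split(X_toy):
    # Each iteration holds out a different fold for validation
    validation_folds.append(valid_idx)
    print("train:", train_idx, "valid:", valid_idx)

# Hypothetical per-fold accuracies, aggregated here with a simple mean
fold_scores = [0.90, 0.85, 0.95, 0.90, 0.80]
overall = float(np.mean(fold_scores))
print(overall)
```

Every sample ends up in a validation fold exactly once, so all of the available data contributes to the overall estimate.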

In [None]:
from sklearn.model_selection import KFold

X_train_valid, X_test,\
y_train_valid, y_test = train_test_split()


In [None]:
best_config = list(ParameterGrid(params))[np.argmax(accuracies)]
clf = DecisionTreeClassifier(**best_config)
clf.fit(X_train_valid, y_train_valid)
accuracy_score(y_test, clf.predict(X_test))

## 2. More challenging tasks (MNIST)

Our decision tree is not enough for the problem we are now facing, so we will use something stronger: your own version of a random forest, built from the trees available in `scikit-learn`. You will then train the random forest on the MNIST dataset and assess its performance
compared to decision trees.

1. Load the MNIST dataset into memory. It contains a total of 70,000 digit images. You have to split it into a training set (60,000 digits) and a test set (10,000 digits).

In [None]:
from sklearn.datasets import fetch_openml

dataset = fetch_openml("mnist_784")
X = dataset["data"].values
y = dataset["target"].values

In [None]:
fig, ax = plt.subplots(1,3, figsize=(14, 4))
for i, rnd in enumerate(np.random.randint(0, len(X), 3)):
    ax[i].imshow(X[rnd].reshape(28, 28))
    ax[i].set_title(f"Label: {y[rnd]}")

In [None]:
X_train, X_test, y_train, y_test = 

2. Train a single decision tree (with the default parameters) on the training set, then compute its
accuracy on the test set.

In [None]:
clf = 

Let's try to implement a random forest. A random forest is an ensemble approach: it trains multiple trees on different portions of the dataset. This **lowers**
the chance of overfitting on the dataset (the single tree might overfit its portion of data, but the overall “forest” will likely not).
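
One common way to give each tree a different portion of the data is bootstrap sampling (drawing rows with replacement), which is what the `fit()` method of `MyRandomForestClassifier`, given further down, does. The sketch below only illustrates the sampling step; `n_rows` and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000   # hypothetical dataset size

# Bootstrap sample: draw n_rows row indices with replacement
sample = rng.integers(0, n_rows, size=n_rows)

# Some rows are drawn several times, others never: roughly 63% are distinct
unique_fraction = len(np.unique(sample)) / n_rows
print(unique_fraction)
```

The expected fraction of distinct rows approaches $1 - 1/e \approx 0.632$ for large datasets, so each tree sees a noticeably different view of the data.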

Each tree bases each split decision on a subset of all the features. The size of this subset is often chosen to be the square root of the total number of features available, but different values can be used. This size can be set for each decision tree through the **`max_features`** parameter. When building a tree, a random sample of `max_features` features is extracted and used to select each split. Another important parameter for random forests is the number of trees used (here `n_estimators`).
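
As a quick worked example of the square-root rule, assuming the 28x28 MNIST images used in this section (784 pixel features each):

```python
import math

n_features = 28 * 28                   # each MNIST image has 784 pixel features
subset_size = math.isqrt(n_features)   # integer part of the square root
print(subset_size)  # 28
```

So, under this rule, each split would be chosen among 28 randomly drawn features rather than all 784.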

During the prediction of a new list of points, each tree of the random forest makes its own prediction. Then, through majority voting, the overall label assignment is made. _Majority voting is just a fancy way of saying that the class predicted by the highest number of trees is selected._

![](https://www.memecreator.org/static/images/memes/5153568.jpg)
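
Majority voting can be sketched in plain Python with `collections.Counter`. The predictions below are made up (3 trees, 4 samples) and `majority_vote` is a hypothetical helper written for this illustration, not part of any library.

```python
from collections import Counter

# Hypothetical predictions: one row per tree, one column per sample
tree_predictions = [
    ["3", "7", "1", "0"],
    ["3", "1", "1", "0"],
    ["8", "7", "1", "5"],
]

def majority_vote(predictions):
    # zip(*predictions) regroups the votes sample by sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

voted = majority_vote(tree_predictions)
print(voted)  # ['3', '7', '1', '0']
```

This sketch does not treat ties specially: when two classes receive the same number of votes, the one seen first wins, which may differ from how `scipy.stats.mode` resolves ties.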

In [None]:
from scipy.stats import mode # performs the majority vote strategy

class MyRandomForestClassifier:
    def __init__(self, n_estimators=10, max_features='sqrt'):
        self.trees = [ DecisionTreeClassifier(max_features=max_features)\
                       for _ in range(n_estimators) ]
    
    def fit(self, X, y):
        for tree in self.trees:
            # Bootstrap sample: draw X.shape[0] rows with replacement,
            # keeping ALL the features (feature subsampling is left to each tree)
            subset = np.random.choice(X.shape[0],
                                      size=X.shape[0],
                                      replace=True)
            tree.fit(X[subset], y[subset])
    
    def predict(self, X):
        # get predictions of all trees for X 
        predictions = [ tree.predict(X).astype(int) for tree in self.trees ]
        result = mode(predictions, axis=0, keepdims=True)[0][0]
        return [str(v) for v in result] # label is a string

3. Now train your random forest with the 60,000 points of the training set and compute its accuracy against the test set. How does the random forest behave? How does it compare to a decision tree? How does this performance vary as the number of estimators grows? Try values from 10 to 100 (with steps of 10) for `n_estimators`.

In [None]:
clf = MyRandomForestClassifier(??)

## How many instruments do we have?

Concerning classification, there are many algorithms we can test, some of which may perform better depending on the task. In the [official documentation](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) of `sklearn` you can find a comparison of the most used ones, which will be useful for the next part of the Hands On.