In [None]:
from IPython.display import Image, display

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## Degree in Data Science and Engineering, group 96
## Machine Learning 2
### Fall 2023

&nbsp;
&nbsp;
&nbsp;
# Lab 1. Neighbors & Trees

&nbsp;
&nbsp;
&nbsp;

**Emilio Parrado Hernández**

Dept. of Signal Processing and Communications

&nbsp;
&nbsp;
&nbsp;




<img src='http://www.tsc.uc3m.es/~emipar/BBVA/INTRO/img/logo_uc3m_foot.jpg' width=400 />

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.target[[10, 50, 85]]

In [None]:
data.data.std(0)

# A real world problem: Breast Cancer

## UCI repository of ML datasets
[Breast Cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) is one of the classic benchmark problems in the [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php). The usual way to compare different machine learning algorithms is to evaluate their performance on benchmark problems. There is another approach, the field of *Statistical Learning Theory* (SLT), which analyzes performance through **bounds** on the generalization capabilities of these algorithms. Pragmatically speaking, although SLT looks like the more rigorous approach, experience shows that performance estimates based on *benchmarks* predict more accurately what practitioners observe when they put these algorithms to work with real data.

The UCI repository is a key reference for the design and development of general-purpose machine learning algorithms, as it makes it easy to build intuition about how such algorithms perform in different situations.




# 1. Loading data
 
1. Read data (from file, database)
2. Separate observations from *targets*
3. Divide data into two sets: **training** and **test**



## 1.1. Read data

This dataset can be directly loaded from sklearn. First read the description of the task and features here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [None]:
from sklearn.datasets import load_breast_cancer
# read data from sklearn and create a pandas dataframe
data_dic = load_breast_cancer()
data = data_dic['data'] # the observations
print("The data set is formed by {0:d} observations in {1:d} dimensions".format(data.shape[0], data.shape[1]))
targets = data_dic['target'] # the targets
columns = [cc for cc in data_dic['feature_names']] +['target']
data_df = pd.DataFrame(np.vstack((data.T, targets)).T, columns=columns) 
data_df.head() # print 5 rows of the dataframe

In [None]:
data_df.describe() # print summary statistics of every column in the dataframe

## 1.2. Separate observations from targets

In this case this is already done by sklearn. The function [`load_breast_cancer`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) returns a dictionary-like object where key `data` contains the observations and key `target` their corresponding targets.

## 1.3. Divide data into training and test sets

The key indicator of the good performance of a machine learning model is its **generalization capability**. This means that the model outputs correct inferences about data not used during the training phase. A common way of addressing this point is to split the data into two disjoint partitions:
- the **training set**, the observations used by the **training algorithm** to optimize the model (that is, to fix the values of the free parameters of the model)

- the **test set**, a separate set that is processed with an **already trained model**. We use the test set to assess the **generalization capabilities** of the model. A model generalizes well when its performance on the test set is similar to its performance on the training set.




Anyway, don't forget that:
- **the test set is just another set**. When we eventually put the model **in production** we still need to monitor its performance on the new test sets that keep arriving (every hour, day, week, etc.)
- Sometimes we try to refine a model **using its performance on the test set**. This all too common practice biases the estimate of the model's performance, because we are **feeding back information from the test set**. In effect the test set takes part in the training, so it can no longer be considered 100% independent from the training process.




In some datasets the training/test split is already defined. In cases where the data comes as a single set (like this Breast Cancer problem) the data scientist needs to propose a division. It is common to leave at least as much data for training as for testing. Usual sizes for these sets are
- 50% training, 50% test
- 70% training, 30% test
- 80% training, 20% test

The trade-off you have to take into account is the following:
- A larger training set means your optimizer has more information to find suitable values for the parameters. For learning to be robust, the number of observations must be significantly larger than the number of free parameters of the model.
- A larger test set means that your estimate of the generalization capabilities of the model on new, unseen data will be more reliable.

Sklearn has a built-in function to split a data set into training and test sets: [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Explore it to split your data into a partition with 70% for training and 30% for testing. Fix the value of parameter `random_state` to ease reproducibility (I usually use 42).
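
As a starting point, a 70/30 split could look like the minimal sketch below. It reloads the data with `return_X_y=True` only to be self-contained; the names `x_train`, `x_test`, `y_train` and `y_test` are chosen to match the cells that follow, and leaving out options such as `stratify` is an assumption of this sketch, not a requirement of the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load observations (X) and targets (y) directly as NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

# 70% training / 30% test; random_state fixed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("Training set:", x_train.shape, "Test set:", x_test.shape)
```

The test fraction is controlled by `test_size`; changing it to 0.2 or 0.5 gives the other common partitions listed above.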

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


# 2. First models

The first step is to start exploring solutions by learning simple models. We use the 3 families that were reviewed in the lecture: kNN, Decision Trees and Random Forest.

Before training any model we must decide on a metric to evaluate the performance of the algorithms. We will use the **classification accuracy** (also discussed in the lecture), as it is the default score for classifiers in sklearn.


## 2.1 First model with $k$NN

Explore different combinations of the **number of neighbors** $k$ **with and without weighting** the vote of each neighbor. 
- Use plots of accuracy versus $k$ to present the results
- Decide the configuration of the best model and print the accuracy achieved by this best model in the training and in the test set.

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############

from sklearn.neighbors import KNeighborsClassifier

#range for the number of neighbors to be explored
v_nn = [1,2,3,4,5,6,7,8,9,10,20,30,40,50,60,70,80,90,100,200,350]

# store the accuracy predicting the training set with uniform weighting of votes
acc_entr = np.empty(len(v_nn))

# store the accuracy predicting the test set with uniform weighting of votes
acc_test = np.empty(len(v_nn))

# store the accuracy predicting the training set with votes weighted by inverse distance
acc_entr_w = np.empty(len(v_nn))

# store the accuracy predicting the test set with votes weighted by inverse distance
acc_test_w = np.empty(len(v_nn))


#main loop
for inn, n_neighbors in enumerate(v_nn):
    #instantiate model with uniform voting
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    #train model
    knn.fit(x_train, y_train)
    #compute scores
    acc_entr[inn] = knn.score(x_train, y_train)
    acc_test[inn] = knn.score(x_test, y_test)
    #instantiate model with weighted voting
    knn_w = KNeighborsClassifier(n_neighbors=n_neighbors, weights='distance')
    #train model
    knn_w.fit(x_train, y_train)
    #compute scores
    acc_entr_w[inn] = knn_w.score(x_train, y_train)
    acc_test_w[inn] = knn_w.score(x_test, y_test)
    


In [None]:
#####################
# CODE FOR PLOTTING #
#####################
plt.figure()
plt.plot(v_nn, acc_entr, label='Acc. train')
plt.plot(v_nn, acc_test, label='Acc. test')
plt.plot(v_nn, acc_entr_w, label='Acc. train, weighted')
plt.plot(v_nn, acc_test_w, label='Acc. test, weighted')
_ = plt.xlabel('k')
_ = plt.ylabel('Acc.')
_ = plt.legend()
plt.grid()
plt.show()

# best k for each voting scheme, chosen by accuracy on the test set
best_k = v_nn[np.argmax(acc_test)]
print("Best k, uniform vote, for the test set is {0:d}, Acc of {1:.2f}".format(best_k, np.max(acc_test)))
best_k_w = v_nn[np.argmax(acc_test_w)]
print("Best k, weighted vote, for the test set is {0:d}, Acc of {1:.2f}".format(best_k_w, np.max(acc_test_w)))


## 2.2 First model with Decision Trees

Explore different combinations of the **maximum number of leaf nodes** $k$ 
- Use plots of accuracy versus the maximum number of leaves to present the results
- Decide the configuration of the best model and print the accuracy achieved by this best model in the training and in the test set.
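
One possible way to organize this study is sketched below, following the same loop pattern as section 2.1. The grid `v_leaves`, the array names and the fixed `random_state` are assumptions made for illustration; the split is repeated only so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# hypothetical grid for the maximum number of leaf nodes (must be >= 2)
v_leaves = [2, 3, 4, 5, 6, 8, 10, 15, 20, 30, 50]
acc_train_dt = np.empty(len(v_leaves))
acc_test_dt = np.empty(len(v_leaves))

for i, n_leaves in enumerate(v_leaves):
    # random_state fixed so that ties between splits are broken reproducibly
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=42)
    tree.fit(x_train, y_train)
    acc_train_dt[i] = tree.score(x_train, y_train)
    acc_test_dt[i] = tree.score(x_test, y_test)

best_leaves = v_leaves[np.argmax(acc_test_dt)]
print("Best max_leaf_nodes on the test set: {0:d}".format(best_leaves))
```

The two accuracy arrays can be plotted against `v_leaves` with the same plotting code used for $k$NN.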

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


## 2.3. First model with Random Forests

Explore different combinations of the **number of trees in the forest** and of the **maximum number of leaves in each tree**. 
- For the maximum number of leaves, use just 3 values: 
  - 3
  - the number you selected as the best choice for the decision tree
  - A reasonable number between 3 and the second choice (look at the accuracy vs number of leaves plot to get insights for this choice).

- Use plots of accuracy versus the number of trees to present the results
- Hint: explore the number of trees using logarithmic jumps between 1 and 1000
- Decide the configuration of the best model and print the accuracy achieved by this best model in the training and in the test set.
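
The double exploration can be sketched with two nested loops, as below. The three leaf values in `v_leaves_rf` are placeholders (the second and third should come from your own results in section 2.2), and the grid `v_trees` is just one hypothetical set of roughly logarithmic jumps between 1 and 1000.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

v_trees = [1, 3, 10, 30, 100, 300, 1000]  # roughly logarithmic jumps
v_leaves_rf = [3, 5, 8]                   # placeholder values, replace with your own

# one row per value of max_leaf_nodes, one column per forest size
acc_train_rf = np.empty((len(v_leaves_rf), len(v_trees)))
acc_test_rf = np.empty((len(v_leaves_rf), len(v_trees)))

for i, n_leaves in enumerate(v_leaves_rf):
    for j, n_trees in enumerate(v_trees):
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_leaf_nodes=n_leaves,
            random_state=42,
            n_jobs=-1,  # use all available cores
        )
        rf.fit(x_train, y_train)
        acc_train_rf[i, j] = rf.score(x_train, y_train)
        acc_test_rf[i, j] = rf.score(x_test, y_test)

# position of the best test accuracy in the grid
i_best, j_best = np.unravel_index(np.argmax(acc_test_rf), acc_test_rf.shape)
print("Best: {0:d} trees, {1:d} leaves".format(v_trees[j_best], v_leaves_rf[i_best]))
```

Each row of `acc_test_rf` gives one accuracy-versus-number-of-trees curve; a logarithmic x axis (`plt.xscale('log')`) tends to make such plots easier to read.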

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


# 3. Simple transformations of features

Normalization of the features is one of the most used pre-processing techniques in machine learning. If we consider each feature a **random variable**, the normalization transforms it into a random variable with **zero mean** and **unit variance**.
$$
x_i \longrightarrow \frac{x_i - \mathbb E\{x\}}{\mbox{std dev}\{x\}}
$$
There is a sklearn module that implements normalization for us: [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
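
To see what the formula does, the normalization can also be carried out by hand with NumPy. The sketch below estimates the mean and standard deviation of each feature on the training set only, applies them to both sets, and checks that the result agrees with `StandardScaler`; the names `mu` and `sigma` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# per-feature statistics estimated with training data only
mu = x_train.mean(axis=0)
sigma = x_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# the same transformation is applied to training and test data
x_train_manual = (x_train - mu) / sigma
x_test_manual = (x_test - mu) / sigma

# compare with the sklearn implementation
scaler_check = StandardScaler().fit(x_train)
print(np.allclose(x_train_manual, scaler_check.transform(x_train)))
print(np.allclose(x_test_manual, scaler_check.transform(x_test)))
```

Note that the scaled test set will generally not have exactly zero mean and unit variance, because the statistics come from the training set.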

The basic methods of `StandardScaler`, `fit` and `transform`, will do the work for us:



In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #instantiate
scaler.fit(x_train)  # fit with training data
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)
print("Training set")
print("Means before -> Means after")
for ii in range(x_train.shape[1]):
  print("{0:.3f}  ->  {1:.3f}".format(x_train[:,ii].mean(0),x_train_s[:,ii].mean(0)))
print("")
print("Std dev before -> Std dev after")
for ii in range(x_train.shape[1]):
  print("{0:.3f}  ->  {1:.3f}".format(x_train[:,ii].std(0),x_train_s[:,ii].std(0)))
print("")

print("Test set")
print("Means before -> Means after")
for ii in range(x_test.shape[1]):
  print("{0:.3f}  ->  {1:.3f}".format(x_test[:,ii].mean(0),x_test_s[:,ii].mean(0)))
print("")
print("Std dev before -> Std dev after")
for ii in range(x_test.shape[1]):
  print("{0:.3f}  ->  {1:.3f}".format(x_test[:,ii].std(0),x_test_s[:,ii].std(0)))
print("")



# REMEMBER TO FIT YOUR SCALER ONLY WITH TRAINING DATA
# DO NOT USE TEST DATA TO FIT THE SCALER

## 3.1 $k$NN with normalized data
Repeat the study in section 2.1 using the scaled data

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


## 3.2 Decision Trees with normalized data
Repeat the study in section 2.2 using the scaled data

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


## 3.3 Random Forests with normalized data
Repeat the study in section 2.3 using the scaled data

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############


# 4. Pipelines

Scikit-learn provides an easy and clean way to automate the scaling step before applying a machine learning method: [**pipelines**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). Read the documentation about the use of pipelines and make sure you understand the following example.

In [None]:
from sklearn.pipeline import Pipeline

# definition of the pipeline with the list of methods to be connected
# each member of the list is a tuple: ('name of the stage', StageConstructorMethod(arg1, arg2, ...))
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('kNN', KNeighborsClassifier(n_neighbors=2, weights='distance'))])
"""
Fitting the pipeline performs a sequential invocation of the fit methods of 
all the connected stages. The output of the previous stage serves as input for
the next one.
"""
pipe.fit(x_train, y_train)

# evaluation of the scaler + classifier (score returns the accuracy)
train_acc = pipe.score(x_train, y_train)
test_acc = pipe.score(x_test, y_test)
print("Acc in the training set after scaling: {0:.2f}".format(train_acc))
print("Acc in the test set after scaling: {0:.2f}".format(test_acc))

# 5. MinMax Scaler
An alternative to normalization is to scale each feature so that its range of values lies between 0 and 1. The scikit-learn class [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) does this job for us.

Repeat sections 3.1, 3.2 and 3.3 but using a `MinMaxScaler` instead of a `StandardScaler`, and connecting scaler and classifier with a pipeline.
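
As an illustration of the pattern, the sketch below connects a `MinMaxScaler` with a $k$NN classifier for a single, arbitrarily chosen value `n_neighbors=5`. The stage names and the variable `pipe_mm` are assumptions of this sketch; the full study still requires looping over the hyperparameter grids as in the previous sections.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# scaler and classifier chained; the scaler is fitted with training data only
pipe_mm = Pipeline([('scaler', MinMaxScaler()),
                    ('kNN', KNeighborsClassifier(n_neighbors=5))])
pipe_mm.fit(x_train, y_train)

print("Acc in the training set: {0:.2f}".format(pipe_mm.score(x_train, y_train)))
print("Acc in the test set: {0:.2f}".format(pipe_mm.score(x_test, y_test)))

# after fitting, each training feature spans exactly the interval [0, 1]
x_train_mm = pipe_mm.named_steps['scaler'].transform(x_train)
print(x_train_mm.min(axis=0).min(), x_train_mm.max(axis=0).max())
```

Test features may fall slightly outside [0, 1], since the minimum and maximum are taken from the training set.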

In [None]:
#############
#           #
# YOUR CODE #
#           #
#############



# 6. Wrapping up
Discussion, general results
- What was the best model to solve the Breast Cancer problem?
- How significant are the differences in performance?
- What is the impact of scaling the features on each of the three methods?
- Is there any significant difference in performance between the two scalers?

About $k$NN
- Discuss the impact of scaling.
- How does $k$NN behave as $k$ increases?

About Decision Trees
- Impact of scaling.
- Grow and draw a tree with just 4 or 5 leaf nodes and discuss if the features used for these first splits make sense
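
A small tree can be grown and inspected as in the sketch below, which prints the splits as text with `export_text`; `sklearn.tree.plot_tree` is an alternative if a graphical drawing is preferred. The choice of `max_leaf_nodes=4` and the name `small_tree` are assumptions of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data_bc = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data_bc.data, data_bc.target, test_size=0.3, random_state=42
)

# a deliberately small tree so that every split can be read and discussed
small_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=42)
small_tree.fit(x_train, y_train)

# text rendering of the tree, with the real feature names at each split
print(export_text(small_tree, feature_names=list(data_bc.feature_names)))
```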

About Random Forest
- Impact of scaling the features
- Discuss how the performance varies with the number of leaf nodes per tree and the size of the forest.