# Supervised Learning - Part 1

Assume there is an unknown function $F^*$ that maps $X_n$'s to $Y$'s such that $Y=F^*(X)$. 

Let's call $F^*$ our *Target* function.

We generally can't hope to ever observe $F^*$ directly, but we have examples (or samples) of its inputs and its output.

![](./images/some_data_annotated.png)

The goal of a **Supervised Learning** algorithm is to find a good *Model* $\hat{F}$ of the unknown *Target* $F^*$ based on these examples.


## How do algorithms search for models?


Algorithms **learn** from data by computing an *estimate* $\hat{F}$ from the data examples. This can be thought of as *fitting* a function to the data examples. Mathematically, this amounts to solving an optimization problem of the following form:

$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} L(Y,f(X;\theta)), \qquad \hat{F}(X) = f(X;\hat{\theta})$


Where $L(Y,f(X;\theta))$ is a function that measures the error made by some function $f(X;\theta)$ in approximating $Y$. In other words, algorithms will try to find optimal values for a set of parameters $\theta$ such that the error in using $f$ evaluated at example inputs $X$ to approximate example outputs $Y$ is minimized. This process is called **Training** a model and the examples used to train the model are called the **Training Set**. 
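To make the idea of "finding the parameters that minimize the error" concrete, here is a minimal sketch that simply tries many candidate values of a single parameter and keeps the best one. The toy data ($Y = 2X$), the one-parameter model $f(X;\theta)=\theta X$, the mean-squared-error loss, and the grid of candidate values are all assumptions chosen for illustration; real algorithms typically search far more cleverly than this.

```python
import numpy as np

# Hypothetical toy data: the outputs follow Y = 2 * X exactly.
X = np.linspace(0, 1, 20)
Y = 2 * X

def f(X, theta):
    """Candidate model with a single parameter theta."""
    return theta * X

def L(Y, Y_pred):
    """Error measure (here: mean squared error, an assumed choice)."""
    return np.mean((Y - Y_pred) ** 2)

# Brute-force "argmin": evaluate the loss for many candidate thetas
# and keep the candidate with the smallest loss.
candidate_thetas = np.linspace(-5, 5, 1001)
losses = np.array([L(Y, f(X, theta)) for theta in candidate_thetas])
theta_hat = candidate_thetas[np.argmin(losses)]

print("Best theta found on the grid:", theta_hat)
```

The fitted model is then $f(X;\hat{\theta})$ with the value of `theta_hat` plugged in.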

There are many options of error measure $L$. The choice of $L$ depends on the type of problem (or type of Task) and the choice of algorithm. $L$ is called the **Loss** function, and it's also sometimes called the **Energy** or **Cost** function.

Different algorithms will focus on different classes of function $f$ with a different number of parameters $\theta$ to approximate $Y$.

Here is a visual representation of these ideas.

## Imagine a space inhabited by functions...

![](./images/model_space.gif)

$\hat{F}_1(x)$ and $\hat{F}_2(x)$ are models of $F^*(x)$. Both are close to the real thing, but neither is exactly it. There is an idea of **error**.

A good estimate of $F^*(x)$, given a dataset, is one that **minimizes** this error.


Different algorithms will be most effective at finding models within different parts of this space. Some algorithms are more restrictive and are ONLY able to search in very limited parts of the space.

![](./images/function_space.png)

**DISCLAIMER**: This diagram is merely illustrative and does not reflect the real overlaps and boundaries of different classes of functions.

## How do we measure error?

An error measure is a quantity that denotes how close two functions are to one another. A straightforward way of measuring this distance between functions is to look at how different their Y coordinates are for the same X:

![](./images/Error1.png)

Now we need to turn these point-wise distances into a single measure that represents the distance between two whole functions. One idea would be to simply take the **average** of the point-wise distances:

![](./images/Error2.png)

Notice, however, that sometimes the point-wise distance is positive ($F^*$ is *above* $\hat{F}$), and sometimes it is negative ($F^*$ is *below* $\hat{F}$). This can be very misleading: by summing negative and positive numbers together, one might get a measure of the total distance that is quite small, even exactly **zero**, when the two functions are clearly not equal to each other for all X's.

To solve this problem, we would like a measure that avoids summing positive and negative numbers together. We would like something that makes the point-wise distances always positive, no matter which function is above or below the other for a given X. Here, we will use the average of the *squared* point-wise distances, or the **mean squared error**:

![](./images/Error3.png)
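The cancellation problem described above is easy to reproduce with a handful of numbers. The values below are made up for illustration: the model is off by exactly 1 at every point, sometimes above and sometimes below the target.

```python
import numpy as np

# Hypothetical values of the target and of a model at the same four X's.
target_values = np.array([1.0, 2.0, 3.0, 4.0])
model_values = np.array([2.0, 1.0, 4.0, 3.0])

pointwise = target_values - model_values   # [-1, 1, -1, 1]

mean_signed = np.mean(pointwise)           # positives and negatives cancel out
mean_squared = np.mean(pointwise ** 2)     # every squared term is non-negative

print("Average of signed distances:", mean_signed)   # 0.0
print("Mean squared error:", mean_squared)           # 1.0
```

The plain average reports zero error even though the model is wrong everywhere, while the mean squared error does not.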

There are many other valid error measures that are useful in different scenarios in Machine Learning. We will see some of them later in this workshop, but for now, let's direct our attention to the last missing piece of the puzzle...

## How do we minimize the error measure?


The error measure is itself a function $L(Y,f(X;\theta))$ called the **Loss Function**. It takes in a function $f(X)$ and its parameters $\theta$ and outputs the error we incur when using this $f(X;\theta)$ to approximate $Y$. Remember that here $X$ and $Y$ are the examples/observed values in your dataset.

When we say we want to minimize the error, what we are really saying is we want to find what values of the parameters $\theta$, when plugged into the function $f$ evaluated at our data $X$, get us closest to our data $Y$.

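In other words, we want to find the value of $\theta$ that minimizes $L$. Where $L$ is differentiable, such a minimum is a *critical point* of $L$ with respect to $\theta$: a point where the derivative of $L$ with respect to $\theta$ is zero.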

Sometimes, depending on the choice of function $f(X;\theta)$, it is possible to solve this problem analytically by taking the derivative of $L(Y,f(X;\theta))$ with respect to $\theta$ and setting it to zero. However, for many choices of $f(X;\theta)$, calculating the derivative by hand is impractical, so we use **numerical methods** instead.
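As a small worked case of the analytical route, assume a one-parameter linear model $f(X;\theta)=\theta X$ and the mean squared error over $N$ examples $(x_i, y_i)$ (both are choices made here for illustration). The loss is

$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \theta x_i)^2$

Taking the derivative with respect to $\theta$ and setting it to zero gives

$\frac{dL}{d\theta} = -\frac{2}{N}\sum_{i=1}^{N} x_i\,(y_i - \theta x_i) = 0 \quad\Rightarrow\quad \hat{\theta} = \frac{\sum_{i=1}^{N} x_i y_i}{\sum_{i=1}^{N} x_i^2}$

provided at least one $x_i$ is non-zero. For instance, if the data satisfy $y_i = 2x_i$ exactly, this formula returns $\hat{\theta} = 2$.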

In general, these numerical methods consist of choosing an initial value for the parameters $\theta$, and then slightly changing them in a direction that hopefully decreases the error. Then rinse and repeat a large number of times until the method converges to a critical point, or until we get close enough to one that we are satisfied with the results.

Visually, here is what we get when we plot different values of $\theta$ against the error in using a linear function with just one parameter $f(X)=\theta X$ to approximate $Y=2 X$. If we start with $\theta=0$, the error goes down as we move $\theta$ closer to the true value, hits zero when $\theta=2$ and starts increasing when we move the value away from the true value:

![](./images/convex.png)
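The "start somewhere, nudge the parameter, repeat" procedure can be sketched in a few lines for this same example, $f(X)=\theta X$ approximating $Y=2X$. This is a bare-bones gradient descent written for illustration only; the mean-squared-error loss, the learning rate, and the number of steps are assumptions, and library implementations are considerably more sophisticated.

```python
import numpy as np

# Toy data from the example in the text: Y = 2 * X.
X = np.linspace(0, 1, 20)
Y = 2 * X

def loss(theta):
    """Mean squared error of the model f(X) = theta * X."""
    return np.mean((Y - theta * X) ** 2)

def loss_gradient(theta):
    """Derivative of the loss with respect to theta."""
    return -2 * np.mean(X * (Y - theta * X))

theta = 0.0            # initial value, as in the plot above
learning_rate = 0.5    # size of each nudge (assumed value)
n_steps = 200          # number of repetitions (assumed value)

for step in range(n_steps):
    # Move theta a little in the direction that decreases the loss.
    theta = theta - learning_rate * loss_gradient(theta)

print("theta after training:", theta)
print("loss after training:", loss(theta))
```

With these settings the parameter moves from 0 toward the true value of 2, following the curve in the plot downhill.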

## Also, about that first assumption we made...

We started this section by assuming there was an unknown function $F^*$ that maps $X_n$'s to $Y$'s such that $Y=F^*(X)$, remember that?

But in real life, things are messy. Whether you are taking measurements in a physical experiment, observing animal behaviour, or whatever your scientific data collection endeavour may be, you will rarely get data where this assumption holds exactly.

In other words, for Machine Learning to be useful in real life, we have to loosen that assumption a little bit. We will instead assume the following:

There is an unknown function $F^*$ that **approximately** maps $X_n$'s to $Y$'s such that $Y=F^*(X) + \epsilon$, where $\epsilon$ is a random variable that follows some unknown probability distribution.

All this means is our Y's will fluctuate around some function $F^*(X)$ instead of following it exactly. 

That is, instead of seeing data like this:

![](./images/fofx_exact.png)

You will (much) more often than not see data like this:

![](./images/fofx_noisy.png)


Now let's see how these ideas work in practice.

First, let's pick an arbitrary function of a single variable $x$ to use as our target function, say, $F^*(x)=\cos(\frac{3}{2}\pi x)$

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

f = lambda x: np.cos(1.5 * np.pi * x)

Next, let's generate 50 data samples consisting of pairs of values: a random value for $x$ between 0 and 1, and $F^*(x)$ plus random noise. 

In [None]:
samples = [(x,f(x) + np.random.normal(0,1) * 0.1) for x in np.sort(np.random.uniform(0,1,50))]

Let's plot both the target function and the samples we've generated to see what this looks like.

In [None]:
X,Y = zip(*samples) 

X_test = np.linspace(0,1,100).reshape(-1,1)

plt.scatter(X,Y,label='Samples')
plt.plot(X_test,f(X_test),color='r',label='Target')

plt.legend(loc='upper right')

Now imagine we didn't know what the target function is. Let's try out different machine learning algorithms and see if we can figure it out using only our samples (blue dots).

First, let's try out an algorithm called Linear Regression. Using this algorithm means we will restrict our search for models to the space of functions of the form $Y = ax + b$.

In [None]:
# LINEAR REGRESSION

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

X = np.array(X).reshape(-1,1) # Scikit-learn needs this to work with only one input/feature

linear_model.fit(X,Y)

In [None]:
Y_hat = linear_model.predict(X_test)

In [None]:
plt.scatter(X,Y,label='Samples')
plt.ylim((-2,2))
plt.plot(X_test, Y_hat, color='g',label="Linear Fit")
plt.legend(loc="upper right")
print("In-sample Performance:", linear_model.score(X,Y))

The green line looks nothing like the red line from before, as expected since we know the target function is not a straight line. Since our data displays a "curvy" pattern, we should try an algorithm that searches in part of the function space that includes curvy functions.

Let's try Linear Regression again, but this time we'll widen our search space to include functions of the form $Y=ax^2 + bx + c$.
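It may seem odd to call this "Linear Regression" when the fitted curve is a parabola. The trick is that the model is still linear in its parameters $a$, $b$ and $c$: we simply hand the algorithm extra input columns containing powers of $x$. The small sketch below, using two made-up input values, shows what that expansion produces with scikit-learn's `PolynomialFeatures`.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up inputs, shaped as a column (one feature per row).
x = np.array([[2.0], [3.0]])

# Each row becomes [1, x, x^2]: a constant column, x itself, and x squared.
expanded = PolynomialFeatures(degree=2).fit_transform(x)

print(expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]
```

A linear regression fitted on these three columns learns one coefficient per column, which corresponds to the terms of $ax^2 + bx + c$.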

In [None]:
# LINEAR REGRESSION WITH POLYNOMIAL BASIS EXPANSIONS

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2) # Set the Degree of the polynomial to 2

square_model = LinearRegression()

X_poly = poly.fit_transform(X)  # Transform X into [1, X, X^2]

square_model.fit(X_poly,Y)

X_poly_test = poly.transform(X_test)  # Reuse the transformer already fitted on X

plt.scatter(X,Y,label='Samples')
plt.ylim((-2,2))
plt.plot(X_test, square_model.predict(X_poly_test), color='g',label='Square Fit')
plt.legend(loc="upper right")
print("In-sample Performance:", square_model.score(X_poly,Y))

This looks better!

However, the shape of the curve is still not quite right. Let's try an algorithm that widens the search space even more to include pretty much any kind of curvy function imaginable: Neural Networks.

In [None]:
# NOW LET'S TRY A NEURAL NETWORK

from sklearn.neural_network import MLPRegressor

mlp_model = MLPRegressor(hidden_layer_sizes=(100,50,25,10), max_iter=2000) #Neural Networks are flexible, Try different numbers of layers and neurons....

mlp_model.fit(X,Y)

Y_hat = mlp_model.predict(X_test)

plt.scatter(X,Y,label='Samples')
plt.ylim((-2,2))
plt.plot(X_test, Y_hat, color='g',label="MLP")
plt.legend(loc="upper right")
print("In-sample Performance:", mlp_model.score(X,Y))

## How do you know if your Model is any good?

In the first example above, the model was obviously bad... the model is not even trying to (because it can't) catch the "curvy" pattern in the data. This phenomenon is called **Underfitting**.  

In the second example, the model looks close!

In the third example you have to try different parameters, but you can get very, very close to the target!

What we did above was, borrowing from statistical jargon, training models by minimizing the error on **in-sample** data... but what we are actually interested in is how our model performs on **out-of-sample** data! We care about predicting $Y$ in the population!

So are these good models of $F^*$? 

The answer depends on what you are trying to predict. If you are concerned with predicting only points inside the $[0,1]$ interval, then yes, these models seem to approximate well points that were not included in the original sample:

In [None]:
plt.plot(X_test, f(X_test), color='r',label="Target")
plt.scatter(X,Y,label='Samples')
plt.ylim((-2,2))
plt.plot(X_test, square_model.predict(X_poly_test), color='g',label='Square Fit')
plt.legend(loc="upper right")
print("Out-of-sample Performance:", square_model.score(X_poly_test,f(X_test)))

In [None]:
plt.plot(X_test, f(X_test), color='r',label="Target")
plt.scatter(X,Y,label='Samples')
plt.ylim((-2,2))
plt.plot(X_test, Y_hat, color='g',label="MLP")
plt.legend(loc="upper right")
print("Out-of-sample Performance:", mlp_model.score(X_test,f(X_test)))

But here's what our models look like when we extend the range of the plot a bit to show points farther away from the sample we used to train:

In [None]:
new_X = np.linspace(-1,2,100).reshape(-1,1)

new_Y = f(new_X)

new_X_poly = poly.transform(new_X)  # Reuse the degree-2 transformer fitted earlier

In [None]:
plt.scatter(new_X,new_Y,label="Target")

plt.plot(new_X, linear_model.predict(new_X), color='r',label="Linear Fit")
plt.plot(new_X, square_model.predict(new_X_poly), color='g',label="Square Fit")
plt.plot(new_X, mlp_model.predict(new_X), color='y',label="MLP Fit")

plt.legend(loc="upper right")

In general, the in-sample error, or **Training Error**, is not a very good measure of how well the model will do on out-of-sample data. In fact, models can have an arbitrarily small **Training Error** and be very bad at predicting out-of-sample values. To see this, consider what would happen if you were to try the example above with a very wiggly function:

In [None]:
#LET'S TRY FITTING A DEGREE 100 POLYNOMIAL TO OUR SAMPLES

poly_too_high = PolynomialFeatures(degree=100)

too_high_model = LinearRegression()

X_poly = poly_too_high.fit_transform(X)  # TRANSFORM X INTO [1, X, X^2, ..., X^100]

too_high_model.fit(X_poly,Y)

X_poly_test = poly_too_high.transform(X_test)  # REUSE THE TRANSFORMER FITTED ON X

plt.scatter(X,Y,label="Samples")
plt.ylim((-2,2))
plt.plot(X_test, too_high_model.predict(X_poly_test), color='g',label="Degree 100 Fit")
plt.legend(loc="upper right")
print("In-sample Performance:", too_high_model.score(X_poly,Y))

In [None]:
plt.plot(X_test, f(X_test), color='r',label="Target")
plt.scatter(X,Y,label="Samples")
plt.ylim((-2,2))
plt.plot(X_test, too_high_model.predict(X_poly_test), color='g',label="Degree 100 Fit")
plt.legend(loc="upper right")
print("Out-of-sample Performance:", too_high_model.score(X_poly_test,f(X_test)))

In [None]:
#AND THEN ZOOMING OUT

new_X_too_high = poly_too_high.transform(new_X)

plt.scatter(new_X,new_Y,label="Target")
plt.plot(new_X, too_high_model.predict(new_X_too_high), color='r',label="Degree 100 Fit")
plt.ylim((-5,5))

As you can see, the ***training*** error here is low - the model actually passes right through most of the points in the sample! But it is also obvious that this model is not very accurate at predicting points outside the sample, whether that's close to the range of the sample or not. 

So in practice you will need to have not one, but TWO datasets. One to train your model on and another to validate it. This second dataset is called a **Test Set** and it is a collection of inputs and outputs you obtained from the same source as your training data, but that you **did not use to train your model.**

It is the error in this dataset, called the **Test Error** that matters when we talk about the quality of our predictions, and hence the quality of our model.

Concretely, you will train your model on the **Training Set**, looking to minimize the **Training Error**. Then you will pick up the model you trained and plug in the inputs from your **Test Set**. You will then use the outputs you get from your model and compare them to the outputs in your **Test Set**. This will allow you to compute your **Test Error**.
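The procedure just described can be sketched end to end as follows. This is an illustrative setup, not a prescription: the synthetic data (noisy samples of the cosine target used earlier), the 80/20 split, the fixed random seeds, the two polynomial degrees being compared, and the use of mean squared error are all assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic data: noisy samples of cos(1.5 * pi * x) on [0, 1].
X_all = rng.uniform(0, 1, size=(100, 1))
Y_all = np.cos(1.5 * np.pi * X_all[:, 0]) + rng.normal(0, 0.1, size=100)

# Hold out part of the data; it plays no role in training.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_all, Y_all, test_size=0.2, random_state=0
)

errors = {}
for degree in (2, 9):
    expansion = PolynomialFeatures(degree=degree)
    # Train on the training set only.
    model = LinearRegression().fit(expansion.fit_transform(X_tr), Y_tr)
    # Compare model outputs with known outputs on each set.
    train_error = mean_squared_error(Y_tr, model.predict(expansion.transform(X_tr)))
    test_error = mean_squared_error(Y_te, model.predict(expansion.transform(X_te)))
    errors[degree] = (train_error, test_error)
    print(f"degree {degree}: training error {train_error:.4f}, test error {test_error:.4f}")
```

The more flexible model will fit the training set at least as closely as the simpler one; whether that advantage carries over to the held-out examples is exactly what the test error tells you.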

Rinse and repeat until you're satisfied with the performance of your model on the **Test Set**!

A common rule-of-thumb is to break your initial data set in two chunks: about 80% of all examples go into your **Training Set**, the remaining 20% are set aside for your **Test Set**.

Another popular approach is to break your initial data set in THREE chunks: a **Training Set**, a **Validation Set** and a **Test Set**, where the **Validation Set** is not used directly to train the model, but the error in this set is used as the yardstick to *fine-tune* a model, so it participates indirectly in the training.
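One simple way to obtain three chunks with scikit-learn is to call `train_test_split` twice. The sketch below uses placeholder data and a 60/20/20 split; those proportions and the fixed `random_state` are assumptions for illustration, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 examples with one feature each.
X_all = np.arange(100).reshape(-1, 1)
Y_all = np.arange(100)

# Step 1: set aside 20% of all examples as the test set.
X_rest, X_test_part, Y_rest, Y_test_part = train_test_split(
    X_all, Y_all, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into training (75%) and validation (25%),
# which is 60% and 20% of the original data.
X_train_part, X_val_part, Y_train_part, Y_val_part = train_test_split(
    X_rest, Y_rest, test_size=0.25, random_state=0
)

print(len(X_train_part), len(X_val_part), len(X_test_part))  # 60 20 20
```

You would fit on the training part, compare candidate models on the validation part, and touch the test part only at the very end.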


Let's look into some examples with real datasets to see how this works:

## Real Example 1: Flower Species Classification

In the toy examples we've seen above, we used a mathematical function to generate points and used different algorithms to try and approximate it. In that case, both our input X and output Y were quantitative (a real number representing a quantity). In Machine Learning this type of task, i.e. predicting a quantitative output, is called **Regression**.

Now let's look at an example where the output is no longer quantitative, but *categorical*, meaning that the output variable represents categories of things - a task called **Classification**.

We will use a popular Python Machine Learning library called **scikit-learn** to show how it works in practice.

### The Iris dataset

This example uses a classic dataset called the "Iris" dataset. It contains a number of measurements of different species of Iris flowers along with a label indicating which of 3 species of Iris the measurements came from.

We will train a model to take in measurements as inputs and predict the species.

Let's take a look at the data:

In [None]:
from pandas import read_csv

headers = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']

iris_dataset = read_csv('./data/iris.csv', names = headers)

In [None]:
print(iris_dataset) # A GOOD FIRST STEP IS TO LOOK AT THE ACTUAL DATA

In [None]:
print(iris_dataset.describe()) # THEN COMPUTE SUMMARY STATISTICS

In [None]:
print(iris_dataset.groupby('species').size()) # ARE THERE IMBALANCES IN THE OUTPUT CATEGORIES?

In [None]:
from pandas.plotting import scatter_matrix

scatter_matrix(iris_dataset) # PLOT ALL VARIABLES 2 BY 2... ARE THERE ANY VISIBLE PATTERNS?

In [None]:
# LET'S CREATE A TRAINING SET AND A TEST SET

from sklearn.model_selection import train_test_split

X = iris_dataset.values[:,0:4]
Y = iris_dataset.values[:,4]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

In [None]:
# NOW LET'S TRY DIFFERENT ALGORITHMS - FIRST A LINEAR MODEL: LOGISTIC REGRESSION

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

lr_model = LogisticRegression(solver='liblinear', multi_class='ovr')

lr_model

In [None]:
lr_model.fit(X_train, Y_train)

In [None]:
Y_hat = lr_model.predict(X_test)

Y_hat

In [None]:
print("This model got", accuracy_score(Y_test, Y_hat)*100, "% of predictions right.")

In [None]:
cm = confusion_matrix(Y_test, Y_hat)  # scikit-learn expects (y_true, y_pred)
ConfusionMatrixDisplay(cm).plot()

In [None]:
# ANOTHER MODEL: TREE CLASSIFIERS

from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier()

tree_model.fit(X_train, Y_train)

Y_hat = tree_model.predict(X_test)

cm = confusion_matrix(Y_test, Y_hat)  # scikit-learn expects (y_true, y_pred)
ConfusionMatrixDisplay(cm).plot()

print("This model got", accuracy_score(Y_test, Y_hat)*100, "% of predictions right.")


In [None]:
# ONE MORE: SUPPORT VECTOR MACHINES (SVM)

from sklearn.svm import SVC

svm_model = SVC(kernel = 'linear')

svm_model.fit(X_train, Y_train)

Y_hat = svm_model.predict(X_test)

cm = confusion_matrix(Y_test, Y_hat)  # scikit-learn expects (y_true, y_pred)
ConfusionMatrixDisplay(cm).plot()

print("This model got", accuracy_score(Y_test, Y_hat)*100, "% of predictions right.")

For more options of Algorithms, see: https://scikit-learn.org/stable/supervised_learning.html

## Exercise 1: Wine Classification

Now it's your turn. This next dataset contains a number of measurements done in a chemical analysis of 3 different types of wine.

You will train a Support Vector Machine model to predict the type of wine based on the measurements. Use the example above as inspiration.

In [None]:
from pandas import read_csv
from sklearn.svm import SVC

headers = ['wine_type','alcohol', 'malic_acid','ash','alcalinity_of_ash','magnesium',
           'total_phenols','flavanoids','nonflavanoid_phenols','proanthocyanins','color_intensity','hue','OD280_OD315','proline']

wine_dataset = read_csv('./data/wine.csv', names = headers)

# PRINT SUMMARY STATISTICS

# PRINT THE DATASET

# LOOK FOR IMBALANCES - THE OUTPUT VARIABLE IS 'wine_type'

# LOOK FOR VISUAL PATTERNS

# CREATE TRAINING AND TEST SETS - THE OUTPUT VARIABLE 'wine_type' IS ON COLUMN 0

# FIT A MODEL TO THE TRAINING SET

# CHECK ITS PERFORMANCE ON THE TEST SET


## Real Example 2: Image Classification

In the Flower Species Classification example we had a dataset with numerical features/inputs and a categorical response/output. The categories in our example appeared as text, but the code transformed them into numbers under the hood. What other types of data can be encoded as numbers?

The answer is: pretty much anything can. 
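As a quick illustration of text categories becoming numbers, here is a minimal sketch using scikit-learn's `LabelEncoder`. The species names below are made up for the example and need not match the labels in the CSV file; scikit-learn classifiers handle this kind of encoding for you, so this is only meant to show the idea.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels, one per example.
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)  # each distinct label gets an integer

print(encoder.classes_.tolist())  # ['setosa', 'versicolor', 'virginica']
print(codes.tolist())             # [0, 2, 1, 0]

# The mapping can be reversed to recover the original text.
print(encoder.inverse_transform(codes).tolist())
```

The integers carry no meaning beyond identifying the category; they are just a convenient numeric stand-in.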

Let's look at a Classification problem using images. We will train a model to classify images as either Apples or Oranges:
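Before loading real pictures, it may help to see on a tiny made-up example how an image becomes a row of numbers that a model can consume. The 2x2 "image" below is invented for illustration: a colour image is a grid of pixels with three channels (red, green, blue), each holding a value from 0 to 255. Flattening turns that grid into a single vector, and dividing by 255 rescales the values to the range 0 to 1.

```python
import numpy as np

# A made-up 2x2 RGB image: shape is (height, width, channels).
tiny_image = np.array(
    [[[255, 0, 0], [0, 255, 0]],
     [[0, 0, 255], [255, 255, 255]]],
    dtype=np.uint8,
)

# Flatten the 3-D grid into one long vector and rescale to [0, 1].
features = tiny_image.flatten() / 255

print(tiny_image.shape)  # (2, 2, 3)
print(features.shape)    # (12,)
print(features)
```

Every image of the same size produces a vector of the same length, so a collection of images can be stacked into a matrix with one row per image.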

In [None]:
import numpy as np
import os
from PIL import Image 
import matplotlib.pyplot as plt

an_apple = Image.open('./Fruit-Images-Dataset/Training/apples/0_100.jpg')

print("This is how the computer 'sees' the image: \n")

print(np.array(an_apple)[:,:,0], "\n")

In [None]:
plt.imshow(np.array(an_apple)[:,:,:]) # TRY PRINTING THE THREE DIFFERENT CHANNELS - [:,:,0], [:,:,1], [:,:,2]
print("This is how we see it:")

In [None]:
an_orange = Image.open('./Fruit-Images-Dataset/Training/oranges/0_100.jpg')

plt.imshow(an_orange)

In [None]:
# LET'S CREATE OUR TRAINING SET!

apples_dir = './Fruit-Images-Dataset/Training/apples/'
oranges_dir = './Fruit-Images-Dataset/Training/oranges/'

#os.listdir(apples_dir)

In [None]:
# LOAD APPLES STEP BY STEP - WITH ILLUSTRATIVE VARIABLE NAMES

apples = []

for filename in os.listdir(apples_dir):
    
    apple_full_path = apples_dir + filename
    
    apple = Image.open(apple_full_path)
    
    apple_flattened = np.array(apple).flatten()
    
    apple_rescaled = apple_flattened/255  # rescale pixel values from 0-255 to 0-1
    
    a_training_example = (apple_rescaled, "apple")
    
    apples.append(a_training_example)

apples

In [None]:
# LOAD ORANGES WITH A PYTHON ONE-LINER... CAN YOU SEE HOW THIS DOES THE SAME THING AS THE FOR-LOOP BLOCK ABOVE?

oranges = [ (np.array(Image.open(oranges_dir + img)).flatten()/255,"orange") for img in os.listdir(oranges_dir) ]

In [None]:
# ZIP SEPARATES ARRAYS FROM LABELS

X_train,Y_train = zip(*(apples + oranges))

In [None]:
# HERE'S HOW ZIP WORKS:

my_list = [(np.array([1,2,3]), "one"),(np.array([4,5,6]), "two")]

a,b = zip(*my_list)

a

In [None]:
# STACK "STACKS" VECTORS INTO A MATRIX

X_train = np.stack(X_train)

In [None]:
# HERE'S HOW STACK WORKS

print(a)

np.stack(a)

In [None]:
# NOW THE TEST SET

apples_dir = './Fruit-Images-Dataset/Test/apples/'
oranges_dir = './Fruit-Images-Dataset/Test/oranges/'

apples = [ (np.array(Image.open(apples_dir + img)).flatten()/255,"apple") for img in os.listdir(apples_dir) ]

oranges = [ (np.array(Image.open(oranges_dir + img)).flatten()/255,"orange") for img in os.listdir(oranges_dir) ]

X_test,Y_test = zip(*(apples + oranges))

X_test = np.stack(X_test)


In [None]:
# THEN TRAIN A LOGISTIC REGRESSION MODEL AND SEE HOW IT DOES

lr_model = LogisticRegression(solver='liblinear', multi_class='ovr')

lr_model.fit(X_train, Y_train)

Y_hat = lr_model.predict(X_test)

cm = confusion_matrix(Y_test, Y_hat)  # scikit-learn expects (y_true, y_pred)
ConfusionMatrixDisplay(cm).plot()

print("This model got", accuracy_score(Y_test, Y_hat)*100, "% of predictions right.")


## Exercise 2: Handwritten Digit Classification

Your turn again. This next dataset is another classic: the MNIST dataset of handwritten digits.

In this exercise, you will train a model of your choice on images of handwritten numbers that sometimes look alike when people write them: 0, 6 and 8.

Use all the examples we've seen so far as inspiration.

In [None]:
# START BY CREATING YOUR TRAINING SET...

zero_dir = './MNIST-Dataset/Training/0/'
six_dir = './MNIST-Dataset/Training/6/'
eight_dir = './MNIST-Dataset/Training/8/'


X_train,Y_train = ####

In [None]:
# ...AND YOUR TEST SET

zero_dir = './MNIST-Dataset/Test/0/'
six_dir = './MNIST-Dataset/Test/6/'
eight_dir = './MNIST-Dataset/Test/8/'

X_test, Y_test = ####

In [None]:
# TRAIN YOUR MODEL AND EVALUATE ITS PERFORMANCE ON THE TEST SET

my_model = ####

Y_hat = my_model.predict(####)

ConfusionMatrixDisplay.from_estimator(my_model, X_test, Y_test)  # plot_confusion_matrix was removed from recent scikit-learn versions

print("This model got", accuracy_score(Y_test, Y_hat)*100, "% of predictions right.")

## Summarizing what we've seen so far

Based on what we've covered so far, a Machine Learning program will generally have the following elements:

1. A class of functions $f(x;\theta)$ that the algorithm will use to try and approximate the target $F^*$. 

    a. $f(x;\theta)$ can be a restrictive class of functions such as linear functions ($\beta_0 + \sum{\beta_i X_i}$) or a very flexible one, such as Neural Networks (we will see what they look like in the next notebook)
    
2. A Training Set: a dataset containing examples of inputs and outputs of interest.  

3. A Test Set: another dataset with examples of inputs and outputs that are not used to train the model.

4. A measure of the error in using $f(x;\theta)$ to approximate $F^*$, called the Loss function $L(Y,f(X;\theta))$. $L$ is used to train the model on the training set (see point 5) and can be used to measure performance on the Test Set. 

5. A way of solving the optimization problem: $\hat{\theta} = \underset{\theta}{\operatorname{argmin}} L(Y,f(X;\theta))$, which gives the model $\hat{F}(X) = f(X;\hat{\theta})$

In the **scikit-learn** examples above, you will notice that 4, 5 and certain aspects of 1 are done mostly under the hood, leaving very little control up to you.

Next we turn to a more modern Machine Learning library that is better suited for high performance and solving difficult problems: **PyTorch**.
