As always, we start with loading necessary packages and defining some helper functions. Please evaluate the cells below.

In [None]:
import numpy as np
import cv2
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Set figsize so that images are large enough
plt.rcParams['figure.figsize'] = [20, 10]

In [None]:
# Function to show an RGB image
def imshow_rgb(img_bgr):
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    plt.imshow(img_rgb)
    
# Function to make colorbars appear nicer
def colorbar(mappable):
    ax = mappable.axes
    fig = ax.figure
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="5%", pad=0.05)
    return fig.colorbar(mappable, cax=cax)

def plotHistogram(hist):
    plt.bar(list(range(len(hist))),hist[:])                  # Plot the histogram as a bar plot
    plt.title('histogram')
    plt.xlabel("histogram bin"), plt.ylabel("pdf")


## Basic concepts of machine learning and object recognition

Why is it difficult for a computer to recognize images given to it? The human visual system can function so effortlessly that it is easy to misjudge the difficulty of replicating its functions using a computer. There's a legend about the famous computer scientist Marvin Minsky who, in 1966 at MIT, asked his undergraduate student Gerald Jay Sussman to "spend the summer linking a camera to a computer and getting the computer to describe what it saw". Needless to say, it took several decades of work before computers could approach or match human-level performance on some of these tasks (most recently due to advances in the field of deep learning since about 2010).

The essential problem is that, although computers are excellent at storing and reproducing images, they treat raw images as simply a jumble of numbers. To illustrate, in the following piece of code, we display an image of a shoe.

In [None]:
img_bgr = cv2.imread('Data_Tutorial6/object_images/9_r45.png')

imshow_rgb(img_bgr)

By displaying the data as an image, we simply see a shoe, and we get a false impression of how difficult the problem is. Instead, we should realise that the machine encodes images as a sequence of bits (0's and 1's). Below we show the first 1000 bits (of a total of 663552 bits) used to encode the above image.


In [None]:
#print(len(np.unpackbits(img_bgr))) # Uncomment this line if you'd like to confirm the number of bits in the image.

np.unpackbits(img_bgr)[:1000]

If we see the problem from this perspective, it becomes clear how far we have to go. The machine does not even have an 
inbuilt concept of space or color, let alone high-level ideas like "shoe-ness".
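
To make the "jumble of numbers" concrete, here is a minimal sketch using a tiny synthetic image (an assumption for illustration, not the shoe image above). It suggests that an 8-bit colour image is simply a three-dimensional array of integers, and that its bit count is rows × columns × channels × 8.

```python
import numpy as np

# A hypothetical 2x2 colour image with 3 channels and 8 bits per channel
tiny_img = np.array([[[255, 0, 0], [0, 255, 0]],
                     [[0, 0, 255], [128, 128, 128]]], dtype=np.uint8)

bits = np.unpackbits(tiny_img)        # Flatten the image into its raw bits
expected_bits = tiny_img.size * 8     # rows * columns * channels * 8 bits

print(tiny_img.shape)                 # (rows, columns, channels)
print(len(bits), expected_bits)       # Both counts should agree
print(bits[:8])                       # The first byte (value 255) as eight bits
```

Nothing in this array says "this pixel is next to that one" or "this number means red"; that interpretation is something we have to supply ourselves.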

Fortunately, the field of machine learning comes to the rescue (at least partially). In this module, we will discuss
some of the key concepts from this field that will allow us to:
* Summarise the content of an image using more "meaningful" numbers. These are known as *features*.
* Make a decision about what class of object an image represents by comparing these features against each other. This method is called the Nearest-Neighbour classifier, and is the first of many classifiers we will look at. It is especially nice to start with, since it is simple to understand, but can also give good results, as long as we choose good features!
* Test a model to see how well it performs.

# Features

The idea behind "features" is actually quite simple. We saw that the numbers that represent a raw image are not particularly meaningful for decision-making. So, why don't we *summarise* the contents of the image using more informative numbers? The process of summarisation is called *feature extraction*. 

To illustrate this idea, let's start with a toy problem. Suppose we want to decide whether or not an apple is fully ripe. Also assume that the particular variety of apple we are considering should be as red as possible, and as large as possible, for it to be considered ripe. We would like a system like the following:

![direct](figures/direct.svg) 

But, we just saw that it is really difficult to use the raw image directly to make decisions. So, why don't we add another stage, a feature extraction stage? This feature extraction stage takes the raw image, and outputs values summarising its contents. In this toy problem, there are two obvious choices. We can extract the area occupied by the apple in an image, as well as the average degree of redness in the apple's image pixels.

![indirect](figures/indirect.svg)  

Suddenly, the process seems a lot easier. We need two things. One is a piece of code that extracts the features. The other is some decider that looks at the area and degree of redness, and outputs a decision.
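
The two pieces can be sketched as two small functions. This is only an illustration under stated assumptions: the names `extract_features` and `decide_ripe`, the tiny synthetic image, and both thresholds are hypothetical and chosen purely for this example.

```python
import numpy as np

def extract_features(image_bgr):
    """Summarise an image as (area in pixels, mean redness of the object)."""
    mask = image_bgr.sum(axis=2) > 0              # Foreground = any non-black pixel
    area = int(mask.sum())
    redness = float(image_bgr[:, :, 2][mask].mean()) if area > 0 else 0.0
    return area, redness

def decide_ripe(area, redness, min_area=6, min_redness=0.5):
    """A hand-written decider; both thresholds are made up for this sketch."""
    return area >= min_area and redness >= min_redness

# A hypothetical 4x4 image (BGR order, values 0.0 to 1.0) with a 3x3 reddish patch
image = np.zeros((4, 4, 3), dtype=np.float32)
image[0:3, 0:3] = (0.0, 0.2, 0.8)

area, redness = extract_features(image)
print(area, redness, decide_ripe(area, redness))
```

Note that `decide_ripe` never sees the image, only the two features. That separation is what lets us swap either stage for a different one.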

What's more, we actually just did something very interesting. We broke the machine learning problem into choosing features, and choosing a decision process. In principle, we can select features independently of the decision process, and the decision process independently of the features. It shouldn't surprise you that over the course of decades, researchers have suggested a great variety of feature extraction systems, and a great variety of decision processes (classifiers) that can, to a large extent, be mixed and matched. Your challenge is to learn about features that may be useful to your task, and pair those features with a suitable classifier. Think about this as a toolbox of options, and we will be adding to your toolbox with each exercise that we do in the coming modules. 

In reality, not all feature / classifier combinations work well, so you do have to have some familiarity with your tools in order to use them effectively. Fortunately, with experience you will develop a feel for which features / classifier combinations might work on your task, and we'll provide some guidance to get you started. It then becomes a case of testing your ideas, and seeing what works best. That last sentence is important. Because we have all these options, we need to be able to compare their performance to choose the best option. One of our tasks will be to learn how to perform such comparisons fairly, especially with respect to predicting how well a particular approach will perform in real life. 

# Toy Apples   

Let's illustrate the previous concepts more concretely by actually implementing those ideas. In reality, quality grading on pictures of apples is actually much more complicated. Consider just the following issues:
* The lighting conditions under which pictures are taken can vary greatly if taken in the field. Think for example of taking a picture at dawn vs. noon. Not only do the direction of light and brightness change greatly, but the dawn light's *color* is also a bit different. So, the "degree of redness" is a bit of a difficult concept if not thought through more carefully.
* Quality grading is actually usually done by humans, and humans are inherently subjective in their judgements. People often neglect to take into account that people make mistakes, or at least can differ substantially in their opinion (especially in boundary cases).
* The positioning of the camera affects how big the apple looks in the image (and therefore the image area).
* Finding apples in amongst the leaves is maybe a bit easier for red apples, but imagine you need to do this for Granny Smith apples, a much more difficult problem!

But, fortunately we are going to drop these details in favour of giving you a taste of how to break down the machine learning problem. We will then gradually build up your ability to deal with conditions like those mentioned above.

## Dataset

Let's write a function that will make simplified fake images that we can work on. The block below defines a function that creates a fake image of an apple. Conveniently, these apples are perfect circles in the image, and the background is absolute black. We also pretend that the camera is equally close to the core of each apple, so that the area of the apple in the image is not affected by camera position.

The function `make_fake_image` starts by choosing a random number between 0 and 1, and calls this the "ripeness" ($\rho$ in the equations below). The closer to 1 this number is, the redder and larger the apple will be. The function then chooses a random center for the apple (close to the image center). A radius and colour are chosen for the apple using the relationships:

$$ Red = \rho $$
$$ Green = 1 - \rho $$
$$ Blue = 0 $$
$$ Radius = 20\rho + 50 $$

We also add some noise to the redness, greenness and radius (blueness remains 0). The exact details of the noise aren't important for now, just keep in mind that the above relationships aren't exact due to this noise.
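
As a rough sketch, the noise-free relationships can be written as a small function. The name `ideal_apple_parameters` is hypothetical and is not part of the notebook's code; `make_fake_image` adds noise on top of these values.

```python
def ideal_apple_parameters(ripeness):
    """Noise-free colour and radius for a ripeness value between 0 and 1."""
    red = ripeness
    green = 1 - ripeness
    blue = 0
    radius = 20 * ripeness + 50
    return red, green, blue, radius

# Evaluate the relationships for an unripe, a half-ripe and a fully ripe apple
for rho in (0.0, 0.5, 1.0):
    print(rho, ideal_apple_parameters(rho))
```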

The function then draws the apple given the above noisy characteristics. It returns that image, along with the *EXACT* ripeness level. This ripeness level represents the true state of the apple, and is therefore called the "ground truth" in machine learning. 

In our case, the ground truth is known because we are generating images based on these values. In real life, you might collect images of a lot of apples, and also have a team of human quality graders assign a ripeness score to each apple. Those ripeness scores would then be called the ground truth.

In the next code cell, we define the function for generating fake apple images, and also display 5 randomly generated images.

In [None]:
def make_fake_image():
    ripeness = np.random.uniform()
    image = np.zeros((256,256,3), dtype=np.float32) # This makes a black image with resolution 256x256. The type is float32, so brightnesses range from 0.0 to 1.0
    cx = int(np.random.uniform() * 30 - 15 + 127)
    cy = int(np.random.uniform() * 30 - 15 + 127)
    radius = int(np.random.uniform() * 10 + ripeness * 20 + 50)
    
    redness = np.clip(ripeness + np.random.uniform() * 0.2, 0, 1)
    greenness = 1 - np.clip(ripeness + np.random.uniform() * 0.2, 0, 1)
    color = (0, greenness, redness)
    
    #cv2.circle(img, center, radius, color[, thickness[, lineType[, shift]]]) 
    cv2.circle(image, (cx, cy), radius, color, -1)

    return image, ripeness
    
plt.rcParams["figure.figsize"] = (20,20)
    
for n in range(5):
    plt.subplot(1,5,n+1)
    fake_image, fake_ripeness = make_fake_image()
    imshow_rgb(fake_image)
    plt.title("Fake Apple with Ripeness %f" % fake_ripeness)

## Feature Extraction

Now that we have a dataset, we can start writing the feature extraction stage. We need to extract two pieces of information: the area of the apple within the image (in pixels), and the degree of redness of the apple. The key to measuring these two quantities is to differentiate the foreground (the apple) from the background (in this case, perfect blackness). To do this, we have to extract a binary mask of the apple. In the toy example, this is much easier than a real case, but there are ways of extracting a foreground mask in more complicated cases also.

We will run through the analysis step by step to help you understand the entire process. Afterwards we will package the entire feature extraction process in a convenient function.

Firstly, let's generate one fake image to work with.

In [None]:
plt.rcParams["figure.figsize"] = (5,5)

fake_image, fake_ripeness = make_fake_image()
imshow_rgb(fake_image)

Our first job is to separate the background (black pixels) from foreground (apple). The easiest way to do this is to look at the brightness of a pixel. We first convert the color image to a grayscale image.

In [None]:
plt.rcParams["figure.figsize"] = (5,5)

grayscale = cv2.cvtColor(fake_image, cv2.COLOR_BGR2GRAY) # Convert the color image to grayscale
plt.imshow(grayscale, cmap="gray", vmax = 1.0)

You can see that the apple is now a gray blob. The problem is that this is not a mask (grayscale images in general aren't). We can print out which values are in the grayscale using the following command.

In [None]:
print(set(grayscale.flatten()))

So, in the grayscale image we have 0.0's and some other value between 0.0 and 1.0. Instead we would like the values to be False for background and True for foreground (apple). To convert this to a mask, we can threshold on the gray value. Anything greater than 0.0 should be foreground, so we can write.

In [None]:
mask = grayscale > 0.0
plt.imshow(mask, cmap="gray", vmax = 1.0)

We can check the values in mask using

In [None]:
print(set(mask.flatten()))

Finally, we are in a position to get the area and the average color. If we use True as a number, it is interpreted as 1.0, and if we use False it is interpreted as 0.0. So, to get the number of foreground pixels, we only need to take the sum over all the pixels.

In [None]:
area = np.sum(mask)
print(area)

We can find the degree of redness by averaging over the red channel values where the apple is located. We do this by taking the red channel, then selecting the apple pixels using `[mask]`, and then averaging over the red values in the mask (this is actually a constant anyway in our toy example). We need to select the apple pixels, because averaging over the background as well would make the apple redness appear less than it actually is (since the background is entirely black).

In [None]:
red_channel = fake_image[:,:,2]
redness = np.mean(red_channel[mask])
print(redness)
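
To see why selecting with the mask matters, here is a minimal sketch on a tiny synthetic red channel (the array and its values are assumptions for illustration only). Averaging over the whole image dilutes the redness with background zeros.

```python
import numpy as np

# Hypothetical 4x4 red channel: a 2x2 "apple" of redness 0.8 on a black background
red_channel = np.zeros((4, 4), dtype=np.float32)
red_channel[1:3, 1:3] = 0.8
mask = red_channel > 0

mean_over_apple = red_channel[mask].mean()   # Only the 4 apple pixels
mean_over_image = red_channel.mean()         # All 16 pixels, background included

print(mean_over_apple, mean_over_image)
```

The masked mean recovers the apple's redness, while the unmasked mean also depends on how much of the image the apple covers, which would mix the two features together.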

Now, let's package the above procedure in a function of its own, and demonstrate the results on some test images.

In [None]:
def toy_extract_features(image):
    grayscale = image.mean(axis=2) # This takes the average over the color channels, giving one brightness value per pixel
    apple_mask = grayscale != 0  # This creates a mask of all parts of the image that are non-zero (this works in the toy problem, because the background is perfectly black)
                
    area = np.sum(apple_mask)      # This sums the amount of pixels that are "on" in the mask, which is equivalent to the number of apple pixels (and thus the apple's area in pixels)

    red_channel = image[:, :, 2]   # This takes the red channel of the image (remember, in OpenCV, the channels are numbered 0 for blue, 1 for green and 2 for red)
    redness = np.mean(red_channel[apple_mask]) # This takes the mean of the red component over all pixels in the mask, ignoring the background pixels
    
    return area, redness

plt.rcParams["figure.figsize"] = (20,20)
for n in range(5):
    plt.subplot(1,5,n+1)
    fake_image, fake_ripeness = make_fake_image()
    fake_area, fake_redness = toy_extract_features(fake_image)
    imshow_rgb(fake_image)
    plt.title("Area=%.0f Redness=%.3f" % (fake_area, fake_redness))

So, we now have a feature extraction stage defined in the function `toy_extract_features`. Let's have a look at the features themselves. Below we generate 300 images, and then perform feature extraction on each image. We then plot a point at the location `(area, redness)` on a two-dimensional plane. We also color the point according to the ground truth ripeness.

In [None]:
np.random.seed(2) # This line makes the random numbers predictable, so that we always get the same dataset back

plt.rcParams["figure.figsize"] = (8,8)
all_areas = []
all_rednesses = []
all_ripenesses = []

for n in range(300):
    fake_image, fake_ripeness = make_fake_image()
    fake_area, fake_redness = toy_extract_features(fake_image)
    all_areas.append(fake_area)
    all_rednesses.append(fake_redness)
    all_ripenesses.append(fake_ripeness)

all_areas = np.array(all_areas)
all_rednesses = np.array(all_rednesses)
all_ripenesses = np.array(all_ripenesses)
    
plt.scatter(all_areas, all_rednesses, c=all_ripenesses)
plt.xlabel("Measured area")
plt.ylabel("Measured redness")
cbar = plt.colorbar()
cbar.set_label("True ripeness")

What we've done above is a visualization of the "feature space". There are two features, so our feature space is two-dimensional. Each image produces one pair of features (area / redness), so each image corresponds to a single point in the feature space.

Feature spaces can have any number of dimensions (and usually there are a lot!), based simply on how many features we decide to calculate.  

Now let's suppose we want to predict whether or not an apple is "sufficiently ripe". Each apple has a "ripeness score" between 0 and 1, and we decide that an apple is sufficiently ripe if it has a ripeness score greater than 0.7. 

Now, notice something in the above scatterplot. There is a clear trend in the colors of each point. You can see that the ripeness increases towards the top right of the scatter plot. This is good, because we are trying to make a prediction about the ripeness! When we see a new image, we can calculate what the area and redness of the apple is, and use its position in the feature space to decide something about the apple's ripeness.

Let's actually go through this process. We get a new "unseen" image. We then extract the features from this image and plot it in the feature space using a big red cross, along with all the points we previously plotted. Rerun this cell a few times to see what happens with different input images. 

In [None]:
unseen_image, unseen_ripeness = make_fake_image()
unseen_area, unseen_redness = toy_extract_features(unseen_image)
plt.subplot(1,2,1)
imshow_rgb(unseen_image)
plt.title("Unseen Input Image")

plt.subplot(1,2,2)
plt.scatter(all_areas, all_rednesses, c=all_ripenesses)
plt.xlabel("Measured area")
plt.ylabel("Measured redness")
cbar = plt.colorbar()
cbar.set_label("True ripeness")
plt.scatter([unseen_area], [unseen_redness],marker="x",c=[[1.0,0.0,0.0]], s=[100],linewidth=5)


__Exercise:__ Write down criteria for the features on which you would decide whether an apple is ripe enough to pick. That is, what region in the feature space contains the ripe apples?

As you can see, even though we are dealing with a previously unseen image, after feature extraction the point in feature space is more or less where we would expect. If the apple is small and green, we see the cross on the bottom left. If it is big and red, we see it at the top right. This demonstrates that we can use the features to make predictions about ripeness!

Now, recall we said that a ripeness greater than 0.7 will be classified as "sufficiently ripe". Let's plot the points that have known ripenesses greater than 0.7:

In [None]:
sufficient = np.array(all_ripenesses)>0.7
plt.scatter(all_areas[np.logical_not(sufficient)], all_rednesses[np.logical_not(sufficient)])
plt.scatter(all_areas[sufficient], all_rednesses[sufficient])
plt.xlabel("Measured area")
plt.ylabel("Measured redness")

plt.legend(["Unripe", "Ripe"])

This scatter plot now contrasts ripe apples with unripe apples, rather than displaying the degree to which each apple is ripe. Assigning an image to one of two (or more) classes is known as a classification problem. In this case, there are two classes: ripe and unripe. In the next section we will discuss how to build a simple classifier that can nevertheless give good performance on certain kinds of problem.

## Classifiers

So, suppose we have an unseen image. It seems fairly easy to decide whether it contains a ripe or unripe apple. If the unseen image has features in amongst the "ripe" points, the unseen image is probably of a ripe apple. If the unseen image has features in amongst the "unripe" points, the unseen image is probably of an unripe apple.

Now, what do we mean by "in amongst the ripe/unripe" points? Remember, we have to be mathematically precise. 

Also, the answer is not necessarily that clearcut. Note that the boundary between the unripe and ripe points in the scatter plot is not well defined, and you see some ripe points amongst otherwise unripe points, and vice versa.

This is the point where we begin to talk about __classifiers__. Classifiers operate on features and decide whether or not the features belong to one of a group of possible classes. For example, we want a classifier that decides whether an unseen image's features fall under the class "ripe" or the class "unripe".

There are a huge number of potential classifiers available you could use, and we'll introduce more and more as the course progresses.

### Nearest Neighbour Classifiers

We start with probably the simplest classifier, known as the Nearest Neighbour classifier. Before becoming a bit more technical, let's discuss how a Nearest Neighbour classifier would work in the toy example. First we would collect a set of apple images known as the training set. We need to provide a known label (ground truth) for each apple (whether or not it's ripe). Now we imagine plotting each training apple in the feature space, as demonstrated in the previous scatter plots and repeated below.

In [None]:
sufficient = np.array(all_ripenesses)>0.7
plt.scatter(all_areas[np.logical_not(sufficient)], all_rednesses[np.logical_not(sufficient)])
plt.scatter(all_areas[sufficient], all_rednesses[sufficient])
plt.xlabel("Measured area")
plt.ylabel("Measured redness")

plt.legend(["Unripe", "Ripe"])

The nearest neighbour system is now ready for use in the field. When we get an unseen image, we would extract features from it. We would then plot these features in the feature space along with the training apple points. We then find the training example closest to the unseen image's point, and assign the class of that training example to the new unseen image. This is basically a formal version of our earlier intuition. New apples amongst the ripe ones in the feature space are probably ripe, and vice versa.

Let's review. The idea behind Nearest Neighbour is that, when we get a new image, we calculate its features, and then check those features against a set of features in a database. We calculate the distance between the new image's features and each individual entry in the database, and choose the database item with the smallest distance. We then say it is probable that the input image belongs to the same class as the database point, and output the database point's class as our decision.

Just for clarity, if $A_n$ and $A_d$ are the areas of the new image and the database item respectively, and $R_n$ and $R_d$ are the new image and database item's rednesses respectively, we calculate the distance $D$ using the theorem of Pythagoras. In the literature this is called the Euclidean distance:

$$ D = \sqrt{(A_n-A_d)^2+(R_n-R_d)^2}. $$

The nearest neighbour to the new image $(A_n, R_n)$ is then the database item for which $D$ is the lowest.
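
The search can be sketched by hand in NumPy. This is a brute-force illustration on a hypothetical three-item database, not how Scikit Learn necessarily implements it internally.

```python
import numpy as np

# Hypothetical database: one row per item, columns are (area, redness)
database = np.array([[10000.0, 0.20],
                     [12000.0, 0.55],
                     [15000.0, 0.90]])
database_labels = np.array([False, False, True])   # True means "ripe"

new_point = np.array([14500.0, 0.85])              # (A_n, R_n) of the new image

# Euclidean distance D from the new point to every database item
distances = np.sqrt(np.sum((database - new_point) ** 2, axis=1))
nearest = np.argmin(distances)                     # Index of the smallest D

print(distances)
print(nearest, database_labels[nearest])
```

The predicted class is simply the label stored alongside the nearest database item.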

We could also perform what's called a K nearest neighbour search, where we find, for example, the five (if $K=5$) nearest neighbours and classify the new image by majority vote from these 5 neighbours.
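
A majority vote over the $K$ nearest items might look like the following sketch. The helper `knn_predict` and the small dataset are hypothetical, and ties are resolved here by requiring strictly more than half of the votes for "ripe".

```python
import numpy as np

def knn_predict(database, labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest database items."""
    distances = np.sqrt(np.sum((database - new_point) ** 2, axis=1))
    nearest_k = np.argsort(distances)[:k]          # Indices of the k smallest distances
    votes_for_ripe = np.sum(labels[nearest_k])     # True counts as 1, False as 0
    return votes_for_ripe > k / 2                  # Ripe if more than half vote ripe

# Hypothetical features (area, redness) on comparable scales, with ripe/unripe labels
database = np.array([[0.1, 0.1], [0.2, 0.3], [0.3, 0.2],
                     [0.7, 0.8], [0.8, 0.7], [0.9, 0.9], [0.6, 0.7]])
labels = np.array([False, False, False, True, True, True, True])

print(knn_predict(database, labels, np.array([0.75, 0.75]), k=5))
print(knn_predict(database, labels, np.array([0.15, 0.20]), k=3))
```

Using an odd $K$ avoids ties in a two-class problem.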

In this module, we will use the [Scikit Learn machine learning library](https://scikit-learn.org), in which several classifiers are implemented (amongst other kinds of machine learning models).

Let's build a nearest neighbour model.

In [None]:
import sklearn
import sklearn.neighbors
classifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
print(classifier)

The variable `classifier` now contains a nearest neighbour classifier.  We set n_neighbors equal to 1, because we are searching just for the closest nearest neighbor.

The classifier still doesn't have anything in its database of points, so we need to provide it some. This database of points is called the training set, because these points are used to train the classifier. Later, when we encounter unseen points, those will be part of the test set that we use to see how well the classifier performs on previously unseen data.

Scikit Learn expects you to provide the training database in a particular format. You need to provide the features in the form of an N-by-M matrix, where N is the number of entries in the dataset, and M is the number of features in each of those dataset points. The following code packages our features in such an array.

In [None]:
X = np.array([all_areas, all_rednesses]).T
X.shape
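
Getting rows and columns the wrong way round is a common source of errors, so here is a small sketch on made-up feature values. It compares the transpose approach used above with `np.column_stack`, which should give the same N-by-M layout.

```python
import numpy as np

# Hypothetical features for N = 4 images
areas = np.array([9000, 11000, 13000, 15000])
rednesses = np.array([0.1, 0.4, 0.7, 0.9])

X_transposed = np.array([areas, rednesses]).T      # The approach used above
X_stacked = np.column_stack([areas, rednesses])    # An equivalent alternative

print(X_transposed.shape)                          # N rows, M = 2 feature columns
print(np.array_equal(X_transposed, X_stacked))
print(X_stacked[0])                                # Both features of the first image
```

Each row holds all the features of one image, and each column holds one feature across all images.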

We also need to provide the class labels for each of the points in X using a vector of length $N$.

In [None]:
y = sufficient
y.shape

Finally, we are ready to "train" the classifier. In Scikit Learn, this is performed using the "fit" method on your classifier.

In [None]:
classifier.fit(X,y)

After training, it's really easy to get a prediction for a given image area and redness. We just execute:

In [None]:
pred_label = classifier.predict([[unseen_area,unseen_redness]])
print(pred_label)

The predicted label is False if the point is seen as unripe, and True if it is seen as ripe. In the following block of code, we generate an unseen image. We then plot a red X if the classifier predicts a ripe apple, and a green X if the classifier thinks it is unripe. Run the block a few times and see how the color changes depending on where in the feature space the unseen image's features lie.

In [None]:
unseen_image, unseen_ripeness = make_fake_image()
unseen_area, unseen_redness = toy_extract_features(unseen_image)
plt.subplot(1,2,1)
imshow_rgb(unseen_image)
plt.title("Unseen Input Image")

plt.subplot(1,2,2)
plt.scatter(all_areas, all_rednesses, c=y)
plt.xlabel("Measured area")
plt.ylabel("Measured redness")

pred_label = classifier.predict([[unseen_area,unseen_redness]])
if pred_label:
    pred_color = [1.0, 0.0, 0.0]
else:
    pred_color = [0.0, 1.0, 0.0]
    
plt.scatter([unseen_area], [unseen_redness],marker="x",c=[pred_color], s=[100],linewidth=5) # Wrap the RGB triple in a list so it is read as one color

Let's look at this in more detail. Suppose we had direct control over the input area and redness. Which point is the nearest neighbour used to make the classifier's decision? Run the cell below. Two sliders should appear, allowing you to adjust the unseen area and redness independently. This will move around the X representing the unseen image. In addition, a blue line is drawn between the X and the item in the dataset that is considered the closest match.

In [None]:
classifier.fit(X,y)
def interact_nn(unseen_area=(np.min(all_areas), np.max(all_areas)), unseen_redness=(0.0, 1.0)):
    plt.scatter(all_areas, all_rednesses, c=y)
    plt.xlabel("Measured area")
    plt.ylabel("Measured redness")

    pred_label = classifier.predict([[unseen_area,unseen_redness]])
    if pred_label:
        pred_color = [1.0, 0.0, 0.0]
    else:
        pred_color = [0.0, 1.0, 0.0]
    
    plt.scatter([unseen_area], [unseen_redness],marker="x",c=[pred_color], s=[100],linewidth=5)
    
    closest_distance, closest_dataset_item = classifier.kneighbors([[unseen_area,unseen_redness]])

    # kneighbors returns arrays of shape (1, 1); take the single index so that we plot plain scalars
    closest_index = closest_dataset_item[0, 0]
    closest_area = X[closest_index,0]
    closest_redness = X[closest_index,1]
    
    plt.plot([unseen_area, closest_area], [unseen_redness, closest_redness], 'b-')
    
interact(interact_nn)

__Question (match direction):__
There is something strange in the match being made: can you see that the match is almost always a nearly vertical line? Why is this the case? How do we fix this?

Solution:

Look at the x-axis and y-axis. Note that the scales are very different. The x-axis scale is in the range 8000 to 16000 (more or less). The y-axis scale is from 0 to 1. Think a bit about what that means for a straight line plotted on this figure. Distances travelled parallel to the x-axis are greater than those parallel to the y-axis by several orders of magnitude. So, we get vertical matches, because vertical lines "cost less" in terms of distance travelled.
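
A quick numerical sketch makes the effect visible. The three points below are made up for illustration: one candidate has almost the same area but a very different redness, the other has identical redness but a slightly different area.

```python
import numpy as np

# Hypothetical unseen point and two database candidates, as (area, redness)
unseen = np.array([12000.0, 0.50])
same_area_wrong_colour = np.array([12010.0, 0.95])   # Nearly the same area, very different redness
same_colour_other_area = np.array([12300.0, 0.50])   # Identical redness, slightly different area

d_colour = np.linalg.norm(unseen - same_area_wrong_colour)
d_area = np.linalg.norm(unseen - same_colour_other_area)

print(d_colour, d_area)
print("Closest:", "wrong colour" if d_colour < d_area else "other area")
```

Even though the second candidate matches the redness exactly, the unscaled Euclidean distance prefers the first, because a difference of a few hundred pixels in area swamps any possible difference in redness.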

The way to fix this issue is to rescale and center the dataset so that both axes have more or less the same scale, and so that the data is centered around more or less the same value. We can use sklearn's `RobustScaler` to perform centering and scaling. Note that the scatter plot below now lies roughly between -1 and 1 on both axes, and the center point is more or less at (0, 0).

In [None]:
import sklearn.preprocessing # Make sure the preprocessing submodule is loaded before we use it

scaler = sklearn.preprocessing.RobustScaler()
X = np.array([all_areas, all_rednesses]).T
scaler.fit(X)

X2 = scaler.transform(X)

plt.scatter(X2[:,0],X2[:,1])
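
To get a feel for what the scaler does, the sketch below compares `RobustScaler` (with its default settings) against a manual calculation on a small made-up dataset. With defaults, it should subtract each feature's median and divide by its interquartile range. The sketch also shows that a new, unseen point has to pass through the same fitted scaler before it is compared with scaled training data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical raw features (area, redness) on very different scales
X_raw = np.array([[9000.0, 0.10],
                  [11000.0, 0.40],
                  [12000.0, 0.55],
                  [13000.0, 0.70],
                  [15000.0, 0.90]])

scaler = RobustScaler().fit(X_raw)
X_scaled = scaler.transform(X_raw)

# Manual version: (x - median) / interquartile range, per feature column
median = np.median(X_raw, axis=0)
q25, q75 = np.percentile(X_raw, [25, 75], axis=0)
X_manual = (X_raw - median) / (q75 - q25)
print(np.allclose(X_scaled, X_manual))

# A new, unseen point must go through the same fitted scaler before prediction
unseen_scaled = scaler.transform([[12500.0, 0.60]])
print(unseen_scaled)
```

In the notebook's terms, a classifier trained on `X2` would be queried with `scaler.transform([[unseen_area, unseen_redness]])` rather than with the raw features.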

Now experiment again with the nearest neighbour system. Can you see how it is now behaving differently?

In [None]:
classifier.fit(X2,y)

def interact_nn2(unseen_area=(np.min(X2[:,0]), np.max(X2[:,0])), unseen_redness=(np.min(X2[:,1]), np.max(X2[:,1]))):
    plt.scatter(X2[:,0], X2[:,1], c=y)
    plt.xlabel("Normalized area")
    plt.ylabel("Normalized redness")

    pred_label = classifier.predict([[unseen_area,unseen_redness]])
    if pred_label:
        pred_color = [1.0, 0.0, 0.0]
    else:
        pred_color = [0.0, 1.0, 0.0]
    
    plt.scatter([unseen_area], [unseen_redness],marker="x",c=[pred_color], s=[100],linewidth=5)
    
    closest_distance, closest_dataset_item = classifier.kneighbors([[unseen_area,unseen_redness]])

    # kneighbors returns arrays of shape (1, 1); take the single index so that we plot plain scalars
    closest_index = closest_dataset_item[0, 0]
    closest_area = X2[closest_index,0]
    closest_redness = X2[closest_index,1]
    
    plt.plot([unseen_area, closest_area], [unseen_redness, closest_redness], 'b-')

interact(interact_nn2)

## Decision Boundary

In your experiments moving the cross around, you'll have noticed that there are points where the predicted class switches around. The set of points in the feature space where the decision changes from one class to another is called the decision boundary.

The following code looks at every location on the scatter plot, and classifies it into the ripe / unripe class, plotting the location as blue (unripe) or red (ripe) depending on the outcome. Take a careful look at the boundary between red and blue regions. That line is called the decision boundary.

In [None]:
classifier.fit(X2,y)
grid_X, grid_Y = np.meshgrid(np.arange(-1.5, 1.5, 0.01), np.arange(-1.5, 1.5, 0.01))

pred_label = classifier.predict(np.array([grid_X.flatten(),grid_Y.flatten()]).T)
pred_label = pred_label.reshape(grid_X.shape)

plt.pcolormesh(grid_X, grid_Y, pred_label, cmap="bwr")

plt.scatter(X2[:,0], X2[:,1], c=y)
plt.xlabel("Normalized area")
plt.ylabel("Normalized redness")
plt.show()


Notice how, in the region where the ripe and unripe points meet, the boundary between red and blue snakes around the dataset items, depending on which class of item is closest.

The capacity of the model to capture complex decisions is determined by how complex a decision boundary it can learn. Does that mean we should aim to use a model that can learn incredibly complex decision boundaries? Unfortunately, this is not the case. The problem is a decision boundary might seem great when you plot your original dataset on the same plot, but might look terrible if we instead plot a set of previously unseen data! We will return to this point another time, when we have the chance to compare different classifiers.
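
The gap between seen and unseen data can be sketched with a 1-nearest-neighbour model on synthetic points. The data generator `make_points` and its noise level are assumptions made up for this illustration; they are not the apple dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_points(n):
    """Hypothetical noisy two-feature data; the label depends on a hidden score."""
    score = rng.uniform(size=n)
    features = np.column_stack([score + rng.normal(scale=0.15, size=n),
                                score + rng.normal(scale=0.15, size=n)])
    return features, score > 0.7

seen_X, seen_y = make_points(300)
unseen_X, unseen_y = make_points(300)

model = KNeighborsClassifier(n_neighbors=1).fit(seen_X, seen_y)
seen_accuracy = model.score(seen_X, seen_y)        # Each point is its own nearest neighbour
unseen_accuracy = model.score(unseen_X, unseen_y)  # Points the model has never stored

print(seen_accuracy, unseen_accuracy)
```

On the data it has stored, a 1-nearest-neighbour model looks perfect, because every point matches itself. The accuracy on fresh points is typically lower, and it is that number that tells us what to expect in practice.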

Have a look at [Scikit learn's comparison of different classifiers.](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py) You can see that different classifiers can give radically different decision boundaries, even though they are operating on the same dataset.


## Testing Models

We can now:
* Generate data for training the model, as well as testing it.
* Extract simple features from a given image.
* Train a nearest neighbour model based on a database of features with ground truth. 
* We know that we usually need to rescale the features so that the axes have roughly the same inherent scale.

But we haven't really tested the model properly. We just sent in single examples to get an idea of how the model is working. This is not enough to get an idea of how well the system would perform on previously unseen data. Proper testing is essential, because it is really easy to get an overly optimistic impression of a model's performance by just looking at how it operates on data it has already seen.

In this section, we will look at the most basic kind of test procedure. Let's generate a dataset of 300 apple images and their associated features and class labels.

In [None]:
np.random.seed(3)
def generate_dataset(num_items):
    images = []
    all_areas = []
    all_rednesses = []
    all_ripenesses = []
    all_labels = []
    for n in range(num_items):
        fake_image, fake_ripeness = make_fake_image()
        images.append(fake_image)
        fake_area, fake_redness = toy_extract_features(fake_image)
        all_areas.append(fake_area)
        all_rednesses.append(fake_redness)
        all_ripenesses.append(fake_ripeness)
        all_labels.append(fake_ripeness > 0.7)
    
    
    
    return np.array(images), np.array([all_areas, all_rednesses]).T, np.array(all_labels)

dataset_images, dataset_X, dataset_y = generate_dataset(300)

`dataset_images` now contains 300 images. You can look at the shape of the numpy array below. 

In [None]:
dataset_images.shape

You can request an image by using the array indexing `[n]`. In the following, we ask for the dataset image at index 40 (so, the 41st image).

In [None]:
plt.imshow(dataset_images[40])

`dataset_X` contains the features corresponding to each image. So it has the shape `(300,2)` (300 images with 2 features from each image).

In [None]:
dataset_X.shape

`dataset_y` contains the class label for each image. `dataset_y[n]` is `True` if the corresponding apple is ripe, and `False` if not.

So to see if the apple above is ripe we execute:

In [None]:
dataset_y[40]

Now to perform our experiments, we have to separate the dataset into a training set and a test set. The training set will be used to create the model, and the test set will be used to check how well the model performs on data it has not seen yet. Fortunately, sklearn provides us a handy function that will split our dataset into training and test sets for us. In the following code block, we use `sklearn.model_selection.train_test_split` to split `dataset_images`, `dataset_X` and `dataset_y` into training data (`train_images`, `train_X` and `train_y`) and test data (`test_images`, `test_X` and `test_y`). 

__Question:__ Write down why you think it is important to split your data into a training and a test set.


We decide to use 200 images for training and 100 for testing by passing the `test_size=100` argument to `train_test_split`. Typically you want to balance the training set and test set so that there are enough items in the training set to create a good model, and enough test data to demonstrate that the model works well on unseen data. This is always a difficult balance, but usually for large datasets the majority of dataset items are used for training. A rule of thumb you can use is to select 80%-90% of the data for training and 10%-20% for testing. In this experiment, because we can generate as much fake data as we'd like, we can afford to use a lot of data for testing.
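To make the two ways of specifying the split concrete, here is a minimal sketch on stand-in data. The `demo_*`, `a_*` and `b_*` names are made up for this illustration and are not part of the notebook's dataset. It assumes only that `train_test_split` accepts either an integer (an absolute number of test items) or a float (a fraction of the dataset) for `test_size`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 300 items with 2 features each, plus a boolean label
demo_X = np.arange(600).reshape(300, 2)
demo_y = np.arange(300) % 2 == 0

# An integer test_size is an absolute number of test items (as used in this notebook)
a_train_X, a_test_X, a_train_y, a_test_y = train_test_split(
    demo_X, demo_y, test_size=100, random_state=0)

# A float test_size is a fraction of the dataset (here roughly the 80%/20% rule of thumb)
b_train_X, b_test_X, b_train_y, b_test_y = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0)

print("integer test_size: ", a_train_X.shape, a_test_X.shape)
print("fraction test_size:", b_train_X.shape, b_test_X.shape)
```

With 300 items, `test_size=100` leaves 200 items for training. Fixing `random_state` is optional; it is used here only so that the sketch gives the same split every time it is run.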

Note that we also use `RobustScaler` again to rescale the image features. As an extra detail, note that we `fit` the scaler using only the training features, and then use it to rescale (`transform`) both the training and test features. This is a basic illustration of what we are going to do to build our model: we `fit` using the training data, and we then `predict` on the test data.
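The fit-on-training-only idea can be sketched with plain NumPy. This is a simplified illustration rather than scikit-learn's actual implementation: it assumes the scaler centres each feature on its median and divides by its interquartile range, which is what `RobustScaler` does with its default settings. The `demo_*` arrays are random stand-ins, not the apple features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in features (200 training items, 100 test items, 2 features)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# "fit": the statistics are computed from the training features only
center = np.median(demo_train, axis=0)
q25, q75 = np.percentile(demo_train, [25, 75], axis=0)
spread = q75 - q25

# "transform": the same training statistics are applied to both sets
demo_train_scaled = (demo_train - center) / spread
demo_test_scaled = (demo_test - center) / spread

print("median of scaled training features:", np.median(demo_train_scaled, axis=0))
print("median of scaled test features:    ", np.median(demo_test_scaled, axis=0))
```

The scaled training features have a median of exactly zero per feature, while the scaled test features are only approximately centred. That is expected: the test data played no part in choosing the scaling, just as it plays no part in training the model.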

In [None]:
import sklearn.model_selection
import sklearn.preprocessing

train_images, test_images, train_X, test_X, train_y, test_y = sklearn.model_selection.train_test_split(dataset_images, dataset_X, dataset_y, test_size=100)

scaler = sklearn.preprocessing.RobustScaler()
scaler.fit(train_X)

train_X = scaler.transform(train_X)
test_X = scaler.transform(test_X)


Before we continue, let's make scatter plots of the training and test data separately.

In [None]:
plt.subplot(1,2,1)
plt.scatter(train_X[:,0], train_X[:,1], c=train_y)
plt.title("Training Data")
plt.xlabel("Area")
plt.ylabel("Redness")
plt.subplot(1,2,2)
plt.scatter(test_X[:,0], test_X[:,1], c=test_y)
plt.title("Test Data")
plt.xlabel("Area")
plt.ylabel("Redness")


As you can see, the scatter plots look similar, but the points are different. Let's now train a nearest neighbour classifier on just the training data.

In [None]:
import sklearn.linear_model
import sklearn.neighbors

classifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1)
#classifier = sklearn.neighbors.KNeighborsClassifier(n_neighbors=5)
#classifier = sklearn.linear_model.LogisticRegression()

classifier.fit(train_X, train_y)

Now, let's have a look at how the model is doing. The following code displays two scatter plots, one for the training and one for the test set. We color the points by their true class (not their predicted class). 

The decision boundary is revealed as in our previous discussion. Points in regions colored blue are classified as "unripe" and points in regions colored red are classified as "ripe". So, if a point lies inside a region that does not match its color, it has been incorrectly predicted (remember, the point is colored by the true class, and the background is colored by the predicted class). In those cases, a yellow 'x' is plotted to indicate that the prediction was incorrect.

First have a look at the results below.

In [None]:
plt.rcParams["figure.figsize"] = (20,10)
train_predict = classifier.predict(train_X)
test_predict = classifier.predict(test_X)

def plot_decision_boundary(classifier, x0_min, x0_max, x1_min, x1_max, step=0.01):
    grid_X, grid_Y = np.meshgrid(np.arange(x0_min, x0_max, step), np.arange(x1_min, x1_max, step))

    pred_label = classifier.predict(np.array([grid_X.flatten(),grid_Y.flatten()]).T) 
    pred_label = pred_label.reshape(grid_X.shape)

    plt.pcolormesh(grid_X, grid_Y, pred_label, cmap="bwr", vmax=1.75, vmin=-0.75)

def scatter_predictions(X, y, prediction):
    base_size = plt.rcParams['lines.markersize'] ** 2
    
    class_colors = ["b", "r"]
    
    y_cols = [class_colors[n*1] for n in y]
    #pred_cols = [class_colors[n*1] for n in prediction]
    
    plt.scatter(X[:,0], X[:,1], c=y_cols, s=base_size*3, marker="o")
    #plt.scatter(X[:,0], X[:,1], c=pred_cols, s=base_size*3)
    wrong_cases = y != prediction
    plt.scatter(X[wrong_cases,0], X[wrong_cases,1], marker="x", s=base_size * 2, c="y")
    
plt.subplot(1,2,1)
plot_decision_boundary(classifier, -2, 2, -1.3, 1.3, 0.01)
scatter_predictions(train_X, train_y, train_predict)
plt.title("Training Predictions")
plt.xlabel("Area")
plt.ylabel("Redness")

plt.subplot(1,2,2)
plot_decision_boundary(classifier, -2, 2, -1.3, 1.3, 0.01)
scatter_predictions(test_X, test_y, test_predict)
plt.title("Test Predictions")
plt.xlabel("Area")
plt.ylabel("Redness")


The first thing you'll see is that, for nearest neighbours, there should be no incorrect predictions in the training set scatter plot (technically, a mistake is possible when two training images have exactly the same features but different labels, but in this experiment that is exceedingly unlikely).

__Question__: Why are there no mistakes in the training cases for nearest neighbour classification (looking at 1 neighbour)?

Secondly, it is highly likely that there are incorrect predictions in the test set scatter plot (in the unlikely event there aren't any mistakes, regenerate the dataset so that you can see what this looks like). Notice that they are colored differently from the background?

__Question__: Having seen mistakes being made on the test set, comment on the shape of the decision boundary and how it helps or hinders the task.

Now, go back to the cell which created the classifier and uncomment the case where 5 nearest neighbours are checked (comment out the other classifiers). Then rerun the cells above.

__Question__: How does the situation change now, in relation to the previous questions? Can you explain why?

Go back again to where the classifier was created, and uncomment the logistic regression classifier (comment out the other classifiers). How does the decision boundary look?

---

Before continuing, change back to a nearest neighbour classifier (look at 1 neighbour). 

By visualizing the decision boundary and labelling correct / incorrect predictions, we can get a great intuitive feeling for how the classifier is performing. But, in the end, intuition can get us only so far in evaluating a classifier in an objective way. We need to find a way of scoring the performance of a classifier, a bit like scoring a student's work in a homework assignment.

It turns out that there is not just one way to do this. There is a large variety of ways to use numbers to quantify classifier performance. Each of these scoring methods has its own advantages and disadvantages. These all form part of your machine learning toolbox. There are, however, some really common scoring procedures that you should become familiar with. Scikit-learn can help us here, since it implements a wide variety of such tests. 

The first scoring method we will look at is the so-called *confusion matrix*. For this problem, the confusion matrix looks as follows:

$$ \begin{array}{|c|c|c|}
\hline
& \mathrm{Predict\ Unripe} & \mathrm{Predict\ Ripe} \\ 
\hline
\hline
\mathrm{True\ Unripe} &\mathrm {Unripe\ Predicted\ Correctly} & \mathrm{Unripe\ Predicted\ Incorrectly}   \\
\hline
\mathrm{True\ Ripe} &\mathrm {Ripe\ Predicted\ Incorrectly} & \mathrm {Ripe\ Predicted\ Correctly}  \\
\hline
\end{array} $$

We calculate the confusion matrix by beginning with all the apples that we know are really unripe (hence the "True Unripe" label of the row). We then look at the model's predictions for these apples. We count the correct predictions, and place them under the column "Predict Unripe" (so, the top left cell). We count the incorrect predictions (unripe apples predicted as ripe), and place them under the column "Predict Ripe" (so, the top right cell).

For the "True Ripe" row, we follow a similar procedure. Keep in mind, though, that correct predictions now go under "Predict Ripe" (so, the bottom right cell), since we are looking at the truly ripe apples. The incorrectly predicted apples fall under "Predict Unripe" (bottom left cell).
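The counting procedure described above can be sketched by hand for a handful of apples. The `demo_true` and `demo_pred` arrays are invented for this illustration. Rows are indexed by the true class and columns by the predicted class, with `False` (unripe) at index 0 and `True` (ripe) at index 1.

```python
import numpy as np

# Hypothetical ground truth and predictions for 8 apples (True = ripe)
demo_true = np.array([False, False, False, True, True, False, True, False])
demo_pred = np.array([False, True, False, True, False, False, True, False])

# Rows: true class, columns: predicted class (index 0 = unripe, index 1 = ripe)
demo_cm = np.zeros((2, 2), dtype=int)
for true_label, predicted_label in zip(demo_true, demo_pred):
    demo_cm[int(true_label), int(predicted_label)] += 1

print(demo_cm)
```

Each apple adds exactly one count to exactly one cell, so the entries of the matrix always sum to the number of test cases.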

__Question:__ Which numbers in the confusion matrix would you like to be high, and which would you like to be low? What kind of matrix is the confusion matrix of a perfect classifier?

The confusion matrix allows us to get a quick idea of how well our classifier is doing. Correct predictions are counted on the diagonal of the matrix (here the matrix is 2-by-2, so the diagonal is the top left and bottom right cells). Incorrect predictions are in the off-diagonal cells. The nice thing about the confusion matrix is that it is easy to extend to problems where there are more than two classes (perhaps "ripe", "unripe" and "damaged"?). Correct predictions are still on the diagonal in those cases.

We can ask Scikit-learn to calculate the confusion matrix for us. Note that we just need the ground truth (correct classes) in `test_y` and the predictions given by the model in `test_predict`.

In [None]:
import sklearn.metrics

cf_matrix = sklearn.metrics.confusion_matrix(test_y, test_predict)
cf_matrix

Conveniently, we can even plot the confusion matrix as if it were an image. This is especially useful if a problem has a great many classes (10+). If only the diagonal pixels in the image are bright, then we know that the classifier is doing something useful.

In [None]:
plt.imshow(cf_matrix, cmap="gray", vmin=0)

Another handy score is the *accuracy*, that is, the proportion of classifications that are correct. It is defined as:

$$ \mathrm{Accuracy} = \frac{\mathrm{Correct\ Predictions}}{\mathrm{Total\ Number\ of\ Test\ Cases}} $$

That gives us a fraction between 0 and 1. An accuracy of 0 means that none of the predictions were correct. An accuracy of 1 means all the predictions were correct. We can ask Scikit-learn to calculate the accuracy for us:

In [None]:
sklearn.metrics.accuracy_score(test_y, test_predict)

The accuracy can also be read off from the confusion matrix. In an imaginary experiment where the confusion matrix is:

$$ \begin{array}{|c|c|c|}
\hline
& \mathrm{Predict\ Unripe} & \mathrm{Predict\ Ripe} \\ 
\hline
\hline
\mathrm{True\ Unripe} &60 &5  \\
\hline
\mathrm{True\ Ripe} &3 & 30  \\
\hline
\end{array} \rightarrow \mathrm{Accuracy} = \frac{\mathrm{Correct\ Predictions}}{\mathrm{Total\ Number\ of\ Test\ Cases}} = \frac{60+30}{60+5+3+30} = \frac{90}{98} \approx 0.918 $$

You can also present the accuracy as a percentage by multiplying it by 100. For example, an accuracy of 0.9 means that 90% of test cases were correctly predicted.
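Reading the accuracy off a confusion matrix amounts to dividing the sum of the diagonal by the sum of all cells. Here is a small sketch using the imaginary matrix from the example above; `demo_cm` and `demo_accuracy` are names made up for this illustration.

```python
import numpy as np

# The imaginary confusion matrix from the text (rows: true class, columns: predicted class)
demo_cm = np.array([[60, 5],
                    [3, 30]])

correct_predictions = np.trace(demo_cm)   # sum of the diagonal cells
total_test_cases = demo_cm.sum()          # sum of all cells
demo_accuracy = correct_predictions / total_test_cases

print(correct_predictions, total_test_cases, round(demo_accuracy, 3))
```

The same two lines work unchanged for a confusion matrix with more than two classes, since correct predictions are always on the diagonal.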

__Question__: Typically, a lower accuracy should mean that our classifier is doing poorly. However, in our current problem, there are only two classes (ripe and unripe). This is a special case of classification known as "binary classification". For this special case, why is an accuracy of 0 actually great news?

Solution: For a binary classifier, an accuracy of 0.0 means that all predictions are incorrect. But, because there are only two classes, we can then just create a new classifier that always predicts the *opposite* of the 0.0 accuracy classifier. This new classifier would then have an accuracy of 1.0. Note that this does not work when there are more than two classes. Naturally, you would be rightly suspicious of such a result, so it would warrant checking your work to see if everything is correct.
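The flipping argument can be checked on a small invented example. The `demo_true` and `demo_pred` arrays below are hypothetical, with every prediction deliberately wrong.

```python
import numpy as np

# Hypothetical ground truth, and predictions that are wrong for every apple
demo_true = np.array([True, False, False, True, False, True, False, False])
demo_pred = np.array([False, True, True, False, True, False, True, True])

demo_accuracy = np.mean(demo_pred == demo_true)

# With only two classes, "the opposite class" is well defined: negate the prediction
flipped_pred = np.logical_not(demo_pred)
flipped_accuracy = np.mean(flipped_pred == demo_true)

print("original accuracy:", demo_accuracy)
print("flipped accuracy: ", flipped_accuracy)
```

More generally, for any binary predictions the flipped accuracy equals one minus the original accuracy, because every case that was wrong becomes right and vice versa.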

__Question__: What is the worst case accuracy?

The worst case is obtained when our model performs as poorly as a model that just gives random predictions (we say such a model performs only as well as chance). For a binary classifier, this is an accuracy of 0.5 (we guess correctly half of the time). For three classes, the worst case is $\frac{1}{3}=0.33\dot{3}$, and so forth. It is important to be mindful of the number of test cases for each class though, as discussed in the following paragraphs.
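A quick simulation illustrates these chance levels. It is a sketch under the assumption that both the true classes and the guesses are drawn uniformly at random, so the classes are roughly balanced. All names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_cases = 100_000

chance_accuracy = {}
for num_classes in (2, 3):
    # Roughly balanced true classes, and guesses that ignore the input entirely
    true_classes = rng.integers(0, num_classes, size=num_cases)
    random_guesses = rng.integers(0, num_classes, size=num_cases)
    chance_accuracy[num_classes] = np.mean(random_guesses == true_classes)
    print(num_classes, "classes:", round(chance_accuracy[num_classes], 3))
```

The simulated accuracies land close to 1/2 and 1/3, not exactly on them, because a finite number of random guesses never matches the expected value perfectly.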

One final point. You should keep in mind how many test cases are available for each class. Why is this important? Let's say we have a confusion matrix that looks like this:

$$ \begin{array}{|c|c|c|}
\hline
& \mathrm{Predict\ Unripe} & \mathrm{Predict\ Ripe} \\ 
\hline
\hline
\mathrm{True\ Unripe} &99 & 0  \\
\hline
\mathrm{True\ Ripe} &1 & 0  \\
\hline
\end{array} \rightarrow \mathrm{Accuracy} = \frac{\mathrm{Correct\ Predictions}}{\mathrm{Total\ Number\ of\ Test\ Cases}} = \frac{99}{99+1} = \frac{99}{100} = 0.99 $$

99% of cases are predicted correctly. Sounds great, right? The problem here is that almost all of our test cases are known unripe apples. There is only one test case for a known ripe apple, and that prediction is incorrect! So, perhaps the true situation is that the classifier is predicting everything as unripe. If this is the case, and we also had 99 test cases of known ripe apples, the confusion matrix and accuracy would be as given below:

$$ \begin{array}{|c|c|c|}
\hline
& \mathrm{Predict\ Unripe} & \mathrm{Predict\ Ripe} \\ 
\hline
\hline
\mathrm{True\ Unripe} &99 & 0  \\
\hline
\mathrm{True\ Ripe} &99 & 0  \\
\hline
\end{array} \rightarrow \mathrm{Accuracy} = \frac{\mathrm{Correct\ Predictions}}{\mathrm{Total\ Number\ of\ Test\ Cases}} = \frac{99}{99+99} = \frac{99}{198} = 0.5 $$

That's exactly the worst possible result for a balanced set of test cases (0.5), while our earlier accuracy for the imbalanced dataset was 0.99!

So, it is very important to make sure your dataset is balanced (that is, there are enough test cases of each class, preferably more or less the same number). 
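The effect of imbalance can be reproduced with a stand-in "classifier" that predicts unripe for every apple. The helper `always_unripe_accuracy` is a hypothetical function written for this sketch, not something from the notebook or from scikit-learn.

```python
import numpy as np

def always_unripe_accuracy(num_unripe, num_ripe):
    """Accuracy of a classifier that predicts 'unripe' (False) for every test case."""
    true_labels = np.array([False] * num_unripe + [True] * num_ripe)
    predictions = np.zeros_like(true_labels)  # False everywhere, i.e. always "unripe"
    return np.mean(predictions == true_labels)

imbalanced_accuracy = always_unripe_accuracy(num_unripe=99, num_ripe=1)
balanced_accuracy = always_unripe_accuracy(num_unripe=99, num_ripe=99)

print("imbalanced test set:", imbalanced_accuracy)
print("balanced test set:  ", balanced_accuracy)
```

The classifier is identical in both cases. Only the make-up of the test set changes, which is why the accuracy on its own can be misleading.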

__Question__: Look at the confusion matrix that scikit learn calculated from our toy dataset. How well balanced is the dataset? Why is this?

Solution: The generated dataset is not well balanced: about 70% of the cases are unripe apples. This is due to the way we generated the images. We selected a ripeness at random from 0.0 to 1.0 (with even probability). Because the threshold for a ripe apple is set at 0.7, we generate an unripe apple 70% of the time. That doesn't mean we end up with exactly 70% unripe apples, but the more apples we generate, the closer to 70% we'll get in the end.
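The last point can be checked with a small simulation. It is a sketch that assumes the ripeness is drawn uniformly between 0.0 and 1.0, as described in the solution, and it uses stand-in random draws instead of the notebook's image generator. The names are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ripe_threshold = 0.7  # an apple is labelled ripe when its ripeness exceeds this value

unripe_fractions = {}
for num_apples in (30, 300, 100_000):
    # Stand-in for the generator: one uniformly distributed ripeness value per apple
    ripeness = rng.uniform(0.0, 1.0, size=num_apples)
    unripe_fractions[num_apples] = np.mean(ripeness <= ripe_threshold)
    print(num_apples, "apples:", round(unripe_fractions[num_apples], 3), "unripe")
```

For small datasets the fraction of unripe apples can be noticeably off from 0.7. For the largest one it settles very close to it.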