# Practical Week 6: Introduction to Machine Learning

In this practical you will familiarise yourself with some of the Python libraries that we will be using for machine learning in the coming weeks. You will learn how to use [`numpy`](https://numpy.org/) and [`pandas`](https://pandas.pydata.org/) for manipulating data, and [`seaborn`](https://seaborn.pydata.org/) for visualising data.


## Tabular Data

First, we will learn how to manipulate tabular data. Most machine learning systems represent data as multi-dimensional arrays. The `numpy` library provides efficient ways to access and compute with such multi-dimensional arrays. The `pandas` library augments this with human-readable indexes and convenience functions for displaying and filtering data.

Let us explore how multi-dimensional arrays work in these libraries.

We start by importing `numpy`:

In [None]:
import numpy as np

Let's create a one-dimensional array containing some numbers:

In [None]:
my_array = np.array([1,3,4,7,-1,2])
my_array

We can select elements from that array by specifying an index, or a set of indices, or a slice:

In [None]:
print(my_array[0])
print(my_array[2])

In [None]:
my_array[[0,2,3]]

In [None]:
my_array[:3]

In [None]:
my_array[3:]

We can also define functions that operate on all elements of the array:

In [None]:
def add_one(xs):
    return xs + 1

In [None]:
add_one(my_array)
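As a minimal sketch of why this works, the snippet below compares the vectorised addition with an explicit loop. The names `values`, `vectorised`, and `looped` are illustrative and not part of the practical.

```python
import numpy as np

values = np.array([1, 3, 4, 7, -1, 2])

# Vectorised: numpy applies the addition to every element at once.
vectorised = values + 1

# Equivalent plain-Python loop, shown only for comparison.
looped = np.array([v + 1 for v in values])

print(vectorised)
print(np.array_equal(vectorised, looped))
```

Both approaches produce the same array; the vectorised form is shorter and avoids a Python-level loop.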

Now, let us create a 2-dimensional array (a "matrix"):

In [None]:
matrix = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
matrix

We can select individual elements of the matrix via their index. We use the convention that the first index refers to the *row*, and the second index refers to the *column*.

In [None]:
matrix[0,2] # first row, third element

In [None]:
matrix[0,:] # entire first row

In [None]:
matrix[:,2] # entire third column

We can slice arrays into smaller arrays. Let's extract all but the last column from our matrix:

In [None]:
matrix[:,:-1]

Similarly, we can get the last column:

In [None]:
matrix[:,-1]

We can also slice based on conditions. For example, let's extract all rows where the element in the third column (index 2) is greater than 3. First, we'll find which rows are relevant (True or False for each row), then we'll extract the rows.

In [None]:
selected = matrix[:,2]>3
selected

In [None]:
matrix[selected]
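Conditions can also be combined. The following sketch, which uses an illustrative array named `grid` rather than the practical's `matrix`, selects rows that satisfy two conditions at once using the element-wise `&` operator.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Each comparison yields one True/False value per row.
# Parentheses are needed because & binds more tightly than > and <.
mask = (grid[:, 0] > 1) & (grid[:, 2] < 12)

print(mask)
print(grid[mask])  # only the rows where both conditions hold
```

Use `|` in place of `&` to select rows where at least one of the conditions holds.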

We can also calculate with the entire matrix or columns/rows. Let's sum all the numbers in the full matrix, in each column, and in each row:

In [None]:
matrix.sum() # sums all the elements in the array

In [None]:
matrix.sum(axis=0) # sums each column

In [None]:
matrix.sum(axis=1) # sums each row
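The `axis` argument names the dimension that is collapsed by the operation. The sketch below illustrates this with two other reductions, `mean` and `max`, on an illustrative array named `grid`.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# axis=0 collapses the 4 rows, leaving one value per column.
col_means = grid.mean(axis=0)

# axis=1 collapses the 3 columns, leaving one value per row.
row_max = grid.max(axis=1)

print(col_means, col_means.shape)
print(row_max, row_max.shape)
```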

Finally, let us change the shape of the array. Initially, we have a 4-by-3 matrix:

In [None]:
matrix.shape

Let's turn this into a one-dimensional array:

In [None]:
np.reshape(matrix, (12,)) # simple array of 12 elements

In [None]:
np.reshape(matrix, (1,12)) # 2-dimensional array with 1 row and 12 columns

In [None]:
np.reshape(matrix, (12,1)) # 2-dimensional array with 12 rows and 1 column
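When reshaping, one dimension may be given as `-1`, in which case numpy infers it from the total number of elements. The sketch below uses an illustrative array `grid` and also shows what happens when the requested shape does not fit.

```python
import numpy as np

grid = np.arange(1, 13).reshape(4, 3)  # the numbers 1..12 as a 4-by-3 array

flat = grid.reshape(-1)     # length inferred: 12
wide = grid.reshape(2, -1)  # 2 rows; 6 columns inferred

print(flat.shape, wide.shape)

# 12 elements cannot be arranged into 5 equal rows, so numpy raises an error.
try:
    grid.reshape(5, -1)
except ValueError as err:
    print("cannot reshape:", err)
```

Reshaping never changes the number or order of the elements, only how they are arranged into rows and columns.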

Manipulating data structures in this way is really powerful and fast -- much faster than if we looped over the elements using plain Python constructs. However, it would be more convenient if we could attach labels to each dimension so that we can more easily keep track of what the values mean. The [`pandas`](https://pandas.pydata.org/) library provides us with classes and functions that enable us to do precisely that.

Let's import [`pandas`](https://pandas.pydata.org/) and create a [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) from the matrix. A DataFrame is essentially a 2-dimensional array containing values, where each column and row can be associated with labels and other descriptive information.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(matrix)
df

Let's add some column labels:

In [None]:
df.columns = ['A','B','C']
df

We can also look at the first few rows of the data frame (this won't do much here since we have only a few rows), and obtain a summary of the content of each column:

In [None]:
df.head() # show the first 5 rows

In the following, we will call the rows "samples" and the columns "attributes", "features", or "variables". In the above table, there are 4 samples (0,..,3) and 3 variables (A,B,C).

Let's summarise the values in the data frame.

In [None]:
df.describe()

We see that there are 4 samples in the data frame (and that there are no missing values, since each column has the same count of 4), along with some descriptive statistics including the minimum value (`min`), maximum value (`max`), mean value (`mean`), quartiles (25%, 50%, 75%), and standard deviation (`std`). Some of these may not make sense to you unless you have learned some basic statistics.
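As a small illustration of how to read this summary, the sketch below reproduces two of the statistics by hand on an illustrative single-column frame. Note that the `std` reported by pandas is the sample standard deviation (dividing by n - 1), whereas `np.std` divides by n unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 4, 7, 10]})
summary = frame.describe()

pandas_std = summary.loc['std', 'A']
sample_std = np.std(frame['A'].to_numpy(), ddof=1)      # divides by n - 1
population_std = np.std(frame['A'].to_numpy(), ddof=0)  # numpy default: divides by n

median = summary.loc['50%', 'A']  # the 50% quartile is the median

print(pandas_std, sample_std, population_std, median)
```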

We can access the data in the same way as we did earlier, and we can use the column labels to do this:

In [None]:
df['A'] # we want the entire column

We can also select the rows and columns we are interested in. The [`loc`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) property enables us to access (and also assign) parts of a data frame by label.

In [None]:
df.loc[:,'A'] # all samples, only column A

In [None]:
df.loc[1:2,'A']


In [None]:
df.loc[1:2,'A':'B'] # samples indexed 1 and 2, attributes A and B
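Note that slices in `loc` are label-based and include both end points, unlike ordinary Python slices. The sketch below contrasts this with the position-based `iloc` property, using an illustrative data frame named `frame`.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                     columns=['A', 'B', 'C'])

# loc: labels 1 through 2 and columns A through B, both ends included.
by_label = frame.loc[1:2, 'A':'B']

# iloc: positions, end excluded, so only row 1 and columns 0 and 1.
by_position = frame.iloc[1:2, 0:2]

print(by_label.shape)     # two rows, two columns
print(by_position.shape)  # one row, two columns
```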

Sometimes, we may want to remove one attribute:

In [None]:
df.drop('C', axis=1) # axis 1 is the column axis; drop returns a copy. The original data frame remains unaffected.

In [None]:
df

We can create additional attributes in the data frame.

In [None]:
df2 = df.assign(CPLUS1=lambda d: d['C'] + 1) # this does not modify the original data frame
df2

We can also destructively assign values to the data frame cells. However, we should use this sparingly.

For example, let's change the cell in row 2 column CPLUS1 from 10 to 20.

In [None]:
df2.loc[2,'CPLUS1'] = 20
df2

Assignment works in a similar way to selecting. We can use slices and conditions.

In [None]:
df2.loc[0:2,['A','CPLUS1']] = 1234
df2

Assign the value `9876` to all cells of `df2` in column `CPLUS1` in rows where `B>3` and `C<10`.

In [None]:
# TODO

## Explore some data

Let us now load some data and explore it.

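We will use the *Boston House Pricing Data Set* that comes with the *scikit-learn* (`sklearn`) library as an example. This data set contains data about dwellings and prices by township in Boston. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cell below requires an older scikit-learn release.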

In [None]:
from sklearn import datasets
boston_housing = datasets.load_boston()

The dataset comes in two parts: the features describing the properties of a town, and the target capturing the median sales price of houses in that town. The target is what we are interested in predicting. Let's store these two in separate variables: `X` represents the matrix of features, and `y` the array of median prices. For convenience, we'll wrap each in a `pandas.DataFrame`.

In [None]:
X = pd.DataFrame(boston_housing['data'], columns=boston_housing['feature_names'])
y = pd.DataFrame(boston_housing['target'], columns=['MEDV'])

In [None]:
X

In [None]:
y

We see that there are 506 samples and 13 features. Let's see what they mean:

In [None]:
from IPython.display import Markdown
Markdown(boston_housing['DESCR'])

Next, let's explore the data set further. Can you determine the minimum, maximum, and mean values of each feature?

In [None]:
X.describe()

Can you determine the lowest/highest/median average number of rooms per dwelling (`RM`)?

TODO

For convenience, let's create a single `DataFrame` that combines `X` and `y`:

In [None]:
boston_df = pd.concat([X,y],axis=1)
boston_df

Next, let's see what the properties are of the townships where `MEDV` exceeds 40.

Use `loc` to select the relevant rows and use `describe()` to see their properties.

In [None]:
# TODO

We may be interested in finding out which feature is most strongly associated with the target `MEDV`. We can compute the *[correlation coefficient](https://en.wikipedia.org/wiki/Correlation)* between each feature and `MEDV`. Correlation is a measure between -1 and 1, where values near +1 indicate that there is a strong (linear) increasing association between two attributes, values near -1 indicate that there is a strong inverse (decreasing) relationship between two attributes, and values near 0 indicate that there is no (linear) relationship between the two attributes.

In [None]:
boston_df.corrwith(boston_df['MEDV']).drop('MEDV').sort_values() # compute correlation, drop the MEDV entry, and sort by value

The above statement computed the correlation between each column in the data frame and the `MEDV` column, then it removed the `MEDV` entry from the result (the correlation of `MEDV` with itself is always `1.0`), and then it sorted the results by correlation.
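To make the correlation coefficient more concrete, the sketch below computes the Pearson correlation by hand and compares it with the value pandas reports. The data and the helper function `pearson` are made up for illustration and are not part of the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical, nearly linear data.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])

def pearson(a, b):
    """Pearson correlation: co-variation of a and b, scaled to the range [-1, 1]."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

manual = pearson(x, y)
builtin = x.corr(y)  # pandas uses the Pearson method by default

print(manual, builtin)
print(pearson(x, -y))  # flipping the sign of y flips the sign of the correlation
```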
 
Can you determine which attribute has the strongest direct relationship with the target `MEDV`? What do we learn from this?

TODO

## Plotting Data

Visualising data can tell us a lot about the data set we are investigating. As part of any machine learning activity, we must first understand the data. Exploring the data and visualising it play an important role in this.

We will use the [`matplotlib`](https://matplotlib.org/) and [`seaborn`](https://seaborn.pydata.org/) libraries to visualise the data.

First, we import the libraries.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

We already know that the average number of rooms per dwelling falls in the range `[3.561,8.78]`. However, we may want to learn more about the distribution of values within this range. We can investigate this by plotting the distribution as a histogram.

In [None]:
sns.displot(data=boston_df, x='RM');

In [None]:
sns.displot(data=boston_df, x='CHAS');

The second plot shows that the feature `CHAS` has only two distinct values (0 and 1), and that the zeros vastly outnumber the ones. This situation is called *imbalance*.
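Imbalance can also be quantified without a plot. The sketch below counts how often each value occurs in a hypothetical binary feature named `FLAG`, which is made up for illustration and is not taken from the Boston data.

```python
import pandas as pd

# Hypothetical binary feature: nine zeros and a single one.
flag = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name='FLAG')

counts = flag.value_counts()                # number of samples per value
shares = flag.value_counts(normalize=True)  # fraction of samples per value

print(counts)
print(shares)
```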

Let's do the same for the target, `MEDV`:

In [None]:
sns.displot(data=boston_df, x='MEDV');

We see that most townships have median prices between 10 and 40.

Can you determine which is the most frequently occurring median price? What does the spike near 50 indicate?

TODO

In addition to understanding individual attributes, it is valuable to understand the relationship among variables. Let us visualise the relationship between `RM` and `MEDV`:

In [None]:
sns.scatterplot(data=boston_df, x='RM',y='MEDV', alpha=0.5);

We can see that most towns where `RM` is between 5 and 7 have median house prices in the range [10,30]. However, there are some outliers where the price is much higher. Let's find them in the data by selecting all samples where `RM` is in the range [5,7] and `MEDV` is 40 or higher. We do this by building an expression that evaluates to `True` for each sample where the condition we are interested in is satisfied. Then, we use the resulting array of True/False values as an index when selecting the samples.

In [None]:
condition = boston_df['RM'].between(5,7) & (boston_df['MEDV'] >= 40) # parentheses are required: & binds more tightly than >=
boston_df.loc[condition]

How many such outliers are in the data set?

TODO

Next, we may want to look at the relationship between more pairs of the attributes. Suppose we are interested in the mutual relationships among the attributes `LSTAT`, `CRIM`, `ZN`, `RM`, and `MEDV`.

As before, we can compute the correlation coefficient for each pair of attributes. This time, let us visualise the result as a [heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

In [None]:
interesting_vars = ['LSTAT','CRIM','ZN','RM','MEDV']
correlation_matrix = boston_df[interesting_vars].corr().round(2) # compute pairwise correlation, rounded to two digits
# annot = True to print the values inside the square
sns.heatmap(data=correlation_matrix, annot=True);

As before, we see that `RM` and `MEDV` have the strongest positive correlation and `LSTAT` and `MEDV` have the strongest inverse correlation. We can also see that some other pairs of attributes have a relatively strong relationship. For example, `RM` and `LSTAT` have a quite strong inverse relationship.

Understanding how the attributes relate to each other and to the target can be important when deciding which attributes to use for machine learning.

We can use [`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html) to plot each pair of attributes, so that we can visually inspect the relationships that are indicated in the heatmap above.

In [None]:
sns.pairplot(data=boston_df, vars=interesting_vars);  # this may take a few seconds

The resulting plot shows us the distribution of each attribute as histograms on the diagonals, and the pairwise relationships among the variables as scatterplots on the off-diagonal cells.

We can confirm that there is a strong increasing relationship between `RM` and `MEDV`. The scatterplot in the last column of the fourth row shows that `RM` and `MEDV` tend to increase together (the scatterplot slopes upwards). Similarly, we can confirm the negative relationship between `LSTAT` and `MEDV` that we obtained when looking at the correlation coefficient (the scatterplot slopes downwards). We can also see that there is no obvious strong relationship between `ZN` and `MEDV` (there is no obvious slope in the scatterplot).

Suppose we wish to visualise the "expensive" properties, where `MEDV` exceeds `6*RM`.

We can do this by color-coding each sample. We will do this by introducing an additional feature, named `EXPENSIVE`, that is `True` when `MEDV>6*RM` and `False` otherwise.

In [None]:
boston_df_exp = boston_df.assign(EXPENSIVE=lambda s: s['MEDV']>6*s['RM'])
boston_df_exp

How many "expensive" townships are there? You can compute this by summing `EXPENSIVE`.

In [None]:
# TODO

ANSWER: There are 17 such townships.

Next, let's plot the relationship between `RM` and `MEDV` while color-coding the expensive properties:

In [None]:
sns.scatterplot(data=boston_df_exp, x='RM', y='MEDV', hue='EXPENSIVE');

As you may have anticipated, most townships where dwellings are expensive relative to the number of rooms per dwelling have price levels in the top range, although not all high-priced townships are classified as expensive. There is one township that is considered expensive although its `MEDV` is not among the highest.

Out of curiosity, let us see how the expensive townships fare in terms of distance from Boston business centres.

Create a scatterplot showing `RM` against `DIS`, and distinguish each sample by `EXPENSIVE`.

In [None]:
# TODO

Can you characterise which townships are considered `EXPENSIVE` only based on `RM` and `DIS`?

TODO

Suppose we have developed a formula to compute `MEDV` based on `RM` only.

Let us define a function `boston_model_RM` which attempts to compute `MEDV` from a given `RM` (and ignores all the other features). This function takes a 1-dimensional array of `RM` values as input and returns a 1-dimensional array of predicted `MEDV` values.

Use the formula `MEDV = 10*RM - 40` to compute the predicted MEDV values from the given RM values.

In [None]:
def boston_model_RM(rm):
    # TODO: replace the placeholder below with your implementation
    pass

Now, let's create an array with hypothetical values for `RM` and invoke the function to predict the expected `MEDV`. We will generate 40 values for `RM` from 5.0 to 8.9 in increments of 0.1 (the stop value 9 is not included). The numpy function [`np.arange`](https://numpy.org/doc/stable/reference/generated/numpy.arange.html) does this for us.

In [None]:
rm_range = np.arange(5, 9, 0.1)
medv_pred = boston_model_RM(rm_range)
medv_pred
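Because `np.arange` excludes the stop value and accumulates floating-point steps, `np.linspace` can be a convenient alternative when the number of points and both end points should be fixed. The sketch below compares the two; the variable names `stepped` and `spaced` are illustrative.

```python
import numpy as np

# arange: fixed step size, stop value excluded.
stepped = np.arange(5, 9, 0.1)

# linspace: fixed number of points, both end points included.
spaced = np.linspace(5, 9, num=41)

print(len(stepped), stepped[-1])
print(len(spaced), spaced[-1])
```

Which one to use is a matter of preference here; `np.arange` is natural when the step size matters, `np.linspace` when the end points matter.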

To see if this works well, let's overlay the predictions on top of the Boston data.

In [None]:
sns.scatterplot(data=boston_df, x='RM', y='MEDV')
sns.scatterplot(x=rm_range, y=medv_pred, color='red');

We see that although the predictions appear to be in the midst of the data, there are many large deviations. This tells us that the model does not fit very well. Perhaps unsurprisingly, the average number of rooms alone is not enough to build a good predictor.

In a later lecture we will see how we may quantify the performance of a predictor, so that we can compare and optimise its performance.

## Classification

To finish this introduction, let's distinguish small, medium, and large dwellings.

Amend the data frame to distinguish among the three sizes. We'll classify dwellings as follows:

* SMALL: `RM <= 5`
* MEDIUM: `5 < RM <= 7`
* LARGE: `7 < RM`

Add an attribute `SIZE` to the data frame that indicates the size of the dwelling as per above classification.

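Use [`pd.cut()`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) to compute the values for this attribute and store them in a variable named `df_sizes`, which is used when the attribute is added to the data frame below.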
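As a hint on how `pd.cut` behaves, here is a minimal sketch on unrelated, made-up data: temperatures binned into three labelled bands. The series `temps`, the bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical temperatures in degrees Celsius (not part of the practical's data).
temps = pd.Series([-3.0, 0.0, 12.5, 20.0, 31.0])

# Bins: (-inf, 0], (0, 20], (20, inf). Each bin includes its right edge by default.
bands = pd.cut(temps,
               bins=[float('-inf'), 0, 20, float('inf')],
               labels=['COLD', 'MILD', 'HOT'])

print(bands.tolist())
```

The number of labels must be one less than the number of bin edges, since each label names the interval between two neighbouring edges.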

In [None]:
# TODO

In [None]:
boston_df_size = boston_df.assign(SIZE=df_sizes)

Plot `RM` against `MEDV` while using hue to distinguish different values of `SIZE`.

In [None]:
# TODO

Plot a histogram showing the distribution of `SIZE`.

In [None]:
sns.displot(data=boston_df_size, x='SIZE');

You have reached the end of this practical. You are now familiar with the basics of manipulating and interrogating data sets using `numpy` and `pandas`, and you can use `seaborn` to visualise data and inspect its properties. We will build on these essential skills in subsequent practicals, where we will train and test machine learning systems.