# Practical 5: Supervised Learning 


Upon completion of this session you should be able to:
- understand data dependency, linear regression and distances.
- apply linear regression in Python.

---
- Materials in this module include resources collected from various open-source online repositories.
- Jupyter source file can be downloaded from https://github.com/gaoshangdeakin/SIT384-Jupyter
- If you find any issue or bug in this document, please submit an issue at [https://github.com/gaoshangdeakin/SIT384/issues](https://github.com/gaoshangdeakin/SIT384/issues)


---



This practical session demonstrates different correlation coefficients, linear regression and distance measures.


## Background

### Machine Learning 

### Part 1 Data Dependency

1.1 [Pearson's-r Correlation coefficient](#pearson)

1.2 [Spearman's rank coefficient](#spearman)


### Part 2 Linear Regression

2.1 [Multiple Linear Regression](#mlr)

2.2 [Regression for Median House Price](#rmhp)

### Part 3 Distances

3.1 [Euclidean Distance](#euclidean)

3.2 [Cosine Distance](#cosine)

3.3 [Term-by-Document Matrix](#t2d)


## Tasks

## Summary

---

## <span style="color:#0b486b">Machine Learning</span>

<a id = "machinelearning"></a>

Machine learning (ML) has been described as "Machines imitating and adapting human like behavior". In other words, we try to teach machines to “learn from experience”.

Machine learning algorithms use computational methods to “learn” information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases. ML algorithms find natural patterns within the data, extract insights and predict unknown values to support better decisions.

There are basically two types of ML Techniques:

   1. Supervised Learning
   2. Unsupervised Learning

### Supervised Learning:

Finds patterns (and develops predictive models) using both input data and output data. All supervised learning techniques are a form of either classification or regression.

* Classification: used for predicting discrete responses. E.g. Whether India will WIN or LOSE a cricket match? Whether an email is SPAM or GENUINE? WIN, LOSE, SPAM and GENUINE are the predefined classes, and the output has to fall into one of them depending on the input.
* Regression: used for predicting continuous responses. E.g. trends in stock market prices, weather forecasts, etc.

### Unsupervised Learning:

Finds patterns based only on input data. This technique is useful when you’re not quite sure what to look for, and it is often used for exploratory analysis of raw data. Most unsupervised learning techniques are a form of cluster analysis.

* Cluster Analysis: you group data items that have some measure of similarity based on characteristic values. At the end you have a set of different groups (let’s assume groups A to Z). A data item (d1) in one group (A) is very similar to the other data items (d2 to dx) in the same group (A), but d1 is significantly different from data items belonging to the other groups (B to Z).

SciPy and scikit-learn will be used for machine learning in this session. For a quick review, see the [.html version](practical5-review.html) or the [.ipynb version](https://github.com/gaoshangdeakin/SIT384-Jupyter/practical5-review.ipynb).

Before introducing regression, let's talk about data dependency first. 

---
## <span style="color:#0b486b">1. Data Dependency</span>

<a id = "pearson"></a>


### <span style="color:#0b486b">1.1 Pearson's-r Correlation coefficient</span>

The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. The symbol for Pearson's correlation is "ρ" when it is measured in the population and "r" when it is measured in a sample. More detail can be found at [statistics.laerd.com](https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php) or [wikipedia.org](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

We assume $X=\left\{ X_{1},\ldots,X_{n}\right\}$ 
and $Y=\left\{ Y_{1},\ldots,Y_{n}\right\}$. Then the Pearson's-r correlation coefficient is defined as 

$$ \rho(X,Y) = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y} =  \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2} \sqrt{\sum_{i=1}^n(Y_i-\bar{Y})^2}} $$
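
As a quick sanity check of the formula, the sketch below evaluates it term by term on a small made-up sample (the values of `X` and `Y` are illustrative assumptions, not taken from the car data) and compares the result with NumPy's `np.corrcoef`.

```python
import numpy as np

# Small made-up sample (illustrative values, not from the car data)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Deviations from the sample means
dx = X - X.mean()
dy = Y - Y.mean()

# Numerator and denominator of the formula, written out explicitly
r_manual = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# NumPy's built-in estimate, for comparison
r_numpy = np.corrcoef(X, Y)[0, 1]

print("manual:", r_manual)
print("numpy: ", r_numpy)
```

The two numbers should agree up to floating-point rounding, because the normalisation factors in the covariance and in the standard deviations cancel.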

Use the car data and find the Pearson's-r correlation coefficient between car weights and fuel consumption.

In [None]:
import numpy as np
import csv
import matplotlib.pyplot as plt
import scipy.stats
import pandas as pd

%matplotlib inline

In [None]:
#No need to use wget if you've retrieved file from clouddeakin.
!pip install wget

In [None]:
import wget

link_to_data = 'https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/Auto.csv'
DataSet = wget.download(link_to_data)

In [None]:
data = pd.read_csv('Auto.csv')

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
miles = data['miles']
weights = data['Weight']

In [None]:
print("miles[:10]:", miles[:10])
print("weights[:10]:", weights[:10])

In [None]:
pearson_r = np.cov(miles, weights)[0, 1] / (miles.std() * weights.std())
print("pearson_r:", pearson_r)

In [None]:
np.corrcoef(miles,weights)

In [None]:
horse = data['Horse power']

In [None]:
np.corrcoef(weights,horse)

In [None]:
# plotting
fig, ax = plt.subplots(figsize=(7, 5), dpi=100)
ax.scatter(weights,miles, alpha=0.6, edgecolor='none', s=100)
ax.set_xlabel('Car Weight (tons)')
ax.set_ylabel('Miles Per Gallon')

line_coef = np.polyfit(weights, miles, 1)
xx = np.arange(1, 5, 0.1)
yy = line_coef[0]*xx + line_coef[1]

ax.plot(xx, yy, 'r', lw=2)

**Exercise 1**: 

1. Find the Pearson's-r coefficient for two linearly dependent variables. Add some noise and see the effect of varying the noise. 
2. Simulate and visualize some data with positive linear correlation
3. Simulate and visualize some data with negative linear correlation. 

In [None]:
xx = np.arange(-5, 5, 0.1)
pp = 1.5  # level of noise
yy = xx + np.random.normal(0, pp, size=len(xx))

# visualize the data
fig, ax = plt.subplots()
ax.scatter(xx, yy, c='r', edgecolor='none')
ax.set_xlabel('X data')
ax.set_ylabel('Y data')

line_coef = np.polyfit(xx, yy, 1)
line_xx = np.arange(-5, 5, 0.1)
line_yy = line_coef[0]*line_xx + line_coef[1]

ax.plot(line_xx, line_yy, 'b', lw=2)

print(scipy.stats.pearsonr(xx, yy))

Pearson's r coefficient only measures the linear correlation between two variables. It cannot reveal a non-linear dependency. Investigate the Pearson's r coefficient between two variables that are correlated non-linearly.

In [None]:
# generate some data, first for X
xx = np.arange(-5, 5, 0.1)

# assume Y = X^2 + some perturbation
pp = 1.1  # level of noise
yy = xx**2 + np.random.normal(0, pp, size=len(xx))

# visualize the data
fig, ax = plt.subplots()
ax.scatter(xx, yy, c='r', edgecolor='b')
ax.set_xlabel('X data')
ax.set_ylabel('Y data')
ax.set_title(r'$Y = X^2+\epsilon$', size=16)

Pearson's r is near zero here, which means there is almost no linear correlation. Yet the two variables are clearly dependent, since $y = x^2$ plus noise.

In [None]:
np.corrcoef(xx,yy)
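
To see why the coefficient vanishes, the following sketch uses a noise-free parabola on a grid that is symmetric around zero (both choices are assumptions made for illustration). The positive and negative halves cancel in the covariance, while each half on its own is strongly correlated.

```python
import numpy as np

# Noise-free, symmetric grid so that the result is easy to reason about
x = np.linspace(-5, 5, 101)
y = x ** 2

# Whole range: the contributions from x < 0 and x > 0 cancel out
r_full = np.corrcoef(x, y)[0, 1]

# Right half only: y increases with x, so the linear correlation is strong
mask = x >= 0
r_right = np.corrcoef(x[mask], y[mask])[0, 1]

print("full range:", r_full)
print("x >= 0 only:", r_right)
```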

<a id = "spearman"></a>


### <span style="color:#0b486b">1.2 Spearman's rank coefficient</span>

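Spearman's rank coefficient measures how well the relationship between two variables can be described by a monotonic function. Because it works on ranks rather than raw values, it is suitable for discrete or ordinal data. Find the Spearman's rank coefficient between horse power and number of cylinders in the car data.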

In [None]:
data.head()

In [None]:
#horse = np.array([float(dd[4]) for dd in data[1:]])
#cylinder = np.array([float(dd[2]) for dd in data[1:]])
horse = data['Horse power']
cylinder = data['cylinder number']


fig, ax = plt.subplots(figsize=(7, 5), dpi=100)
ax.scatter(horse, cylinder, alpha=0.6, edgecolor='none', s=100)
ax.set_xlabel('Horse power')
ax.set_ylabel('#Cylinders')

print (scipy.stats.spearmanr(horse, cylinder))
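
Spearman's coefficient can be viewed as Pearson's r applied to the ranks of the data. The sketch below illustrates this on made-up values without ties (an assumption; the helper `ranks` is hypothetical and does not handle tied values the way `scipy.stats.spearmanr` does) and checks it against the shortcut formula $1 - \frac{6\sum d_i^2}{n(n^2-1)}$ for untied data.

```python
import numpy as np

# Made-up data without ties; y mostly grows with x, but not linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.7, 7.4, 20.1, 15.0, 148.4, 403.4])

def ranks(values):
    """Return 1-based ranks; valid only when there are no tied values."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

rx, ry = ranks(x), ranks(y)

pearson = np.corrcoef(x, y)[0, 1]       # on the raw values
spearman = np.corrcoef(rx, ry)[0, 1]    # Pearson's r on the ranks

# Shortcut formula for untied data
d = rx - ry
n = len(x)
spearman_shortcut = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

print("Pearson:", pearson)
print("Spearman (ranks):", spearman)
print("Spearman (shortcut):", spearman_shortcut)
```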

**Exercise 2**. 
Compute the Spearman rank correlation between "Horse power" and "Engine displacement".

---
## <span style="color:#0b486b">2. Linear Regression</span>


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

We start with a simple case by fitting a linear regression to four data points. First we simulate the data:

In [None]:
# simulating the data

x = np.c_[0, 1, 2, 1.5].T
y  = [1, 1.5, 3.1, 1.5]

print ("x:",x)
print ("y:",y)

In [None]:
#plotting the data

fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
ax.scatter(x, y, c='r')
ax.set_title('simulated data')
ax.set_xlabel('x')
ax.set_ylabel('y')

Now we fit the linear regression:

In [None]:
from sklearn import linear_model

# instantiate the model
lr = linear_model.LinearRegression()

# fit the model
lr.fit(x, y)

In [None]:
print ("Coefficients:", lr.coef_)
print ("   Intercept:", lr.intercept_)
# print ("    Residues:", lr.residues_)

Let's plot the line to see how it estimates our data:

In [None]:
yhat = lr.predict(x)

fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
ax.scatter(x, y, c='r')
ax.plot(x, yhat)

ax.set_title('simulated data and the estimated line')
ax.set_xlabel('x')
ax.set_ylabel('y')

We can use the method `predict()` to predict `y` for a new `x`:

In [None]:
x_test = np.c_[4, 2.3].T
y_test = lr.predict(x_test)

print ("x_test.T:", x_test.T)
print ("y_test:", y_test)
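
For a single explanatory variable, the ordinary least-squares line has a closed form: the slope is the covariance of `x` and `y` divided by the variance of `x`, and the intercept makes the line pass through the point of means. The sketch below applies this to the same four points using NumPy only; the variable names are illustrative, and `np.polyfit` is used purely as a cross-check.

```python
import numpy as np

# Same four points as in the simulated data
x = np.array([0.0, 1.0, 2.0, 1.5])
y = np.array([1.0, 1.5, 3.1, 1.5])

# Closed-form least-squares estimates
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Predictions for new inputs follow directly from the line equation
x_new = np.array([4.0, 2.3])
y_new = slope * x_new + intercept

print("slope:", slope, "intercept:", intercept)
print("predictions:", y_new)
print("np.polyfit cross-check:", np.polyfit(x, y, 1))
```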

<a id = "mlr"></a>


### <span style="color:#0b486b">2.1 Multiple Linear Regression</span>


Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of the independent variable x is associated with a value of the dependent variable y. For example if we have two explanatory variables (attributes, features), our data has such a form:

$$
D=\left\{ \left(\left(x_{1,1},x_{2,1}\right),y_{1}\right),\left(\left(x_{1,2},x_{2,2}\right),y_{2}\right),\ldots,\left(\left(x_{1,n},x_{2,n}\right),y_{n}\right)\right\} 
$$

Now we fit a multiple linear regression to four points that lie close to the plane $y = x_1 + 2x_2 + 1$.


In [None]:
# simulate the data

x = np.c_[[0, 0], [0, 1], [1, 1], [1, 0]].T
y = [1.5, 3.2, 4, 2]

print (x)
print (y)

In [None]:
mlr = linear_model.LinearRegression(fit_intercept=True)
mlr.fit(x,y)

In [None]:
print ("mlr.coef_:", mlr.coef_)
print ("mlr.intercept_:", mlr.intercept_)

In [None]:
print ("mlr.predict(x):", mlr.predict(x))
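
The same fit can be reproduced with plain linear algebra, which makes the role of the intercept explicit. This is a sketch using `np.linalg.lstsq` on the four points above; the names `A`, `w` and `r2` are illustrative assumptions.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
y = np.array([1.5, 3.2, 4.0, 2.0])

# Prepend a column of ones so that the first weight acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ w ~= y
w, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef = w[0], w[1:]

# Coefficient of determination (R^2) on the training points
y_hat = A @ w
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("coef:", coef, "intercept:", intercept)
print("R^2:", r2)
```

Because the four targets do not lie exactly on a plane, the residuals are small but not zero, so R² is close to, yet below, 1.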

**Exercise 3**: 

Compare `mlr.predict(x)` with `y` (or compute `mlr.score(x, y)`) to see how closely the model fits the data. Then change the values of $y$ slightly and see what effect this has on `mlr`.

<a id = "rmhp"></a>


### <span style="color:#0b486b">2.2 Regression for median house prices</span>


We are going to use the package `pandas` for reading and storing the data.

In [None]:
#No need to use wget if you've retrieved file from clouddeakin.
wget.download('https://raw.githubusercontent.com/gaoshangdeakin/SIT384/master/housing_300.csv')

data = pd.read_csv('housing_300.csv')

In [None]:
data.head()

In [None]:
data.describe()

Plot the scatter plot of the number of rooms vs the median house prices.

In [None]:
fig, ax = plt.subplots(figsize=(7, 7), dpi=100)
median_prices = data['MEDV']
avg_rooms = data['RM']
scales = 50*np.ones(len(median_prices))
ax.scatter(avg_rooms, median_prices, color='b',s=scales, alpha=0.7, edgecolor='r')
plt.xlabel('$X$ (number of rooms)')
plt.ylabel('$Y$ (median house prices)')

In [None]:
print ("avg_rooms.shape:", avg_rooms.shape)
print ("median_prices.shape:", median_prices.shape)

How correlated are the number of rooms and the price of the house?

In [None]:
np.corrcoef(avg_rooms, median_prices)

Now we want to fit a linear regression model to the data.

In [None]:
# prepare the data

x = np.c_[avg_rooms.values]
y = median_prices.tolist()

In [None]:
from sklearn import linear_model
lr = linear_model.LinearRegression()

In [None]:
lr.fit(x,y)

In [None]:
print (lr.coef_)
print (lr.intercept_)
# print lr.residues_

In [None]:
# obtain the model parameters

print (lr.coef_, lr.intercept_)

In [None]:
# predict 

yhat = lr.predict(x)

In [None]:
print ("x[:10]:", x[:10])
print ("yhat[:10]:", yhat[:10])

In [None]:
#plot the result

fig,ax = plt.subplots(figsize=(7,7),dpi=100)

scales = 20*np.ones(len(median_prices))
ax.scatter(avg_rooms,median_prices,color='b',s=scales,alpha=0.7,edgecolor='r')
plt.xlabel('$X$ (number of rooms)')
plt.ylabel('$Y$ (median house prices)')

# plot the learned regression line
ax.plot(x,yhat)

# visualize the residuals as vertical segments between each observation and the line
tmp = np.reshape(x,[1,len(x)])[0]
for i in range(len(x)):
    ax.plot([tmp[i], tmp[i]], [y[i], yhat[i]], color='g', linewidth=0.5)

In [None]:
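# residual sum of squares, computed directly
# (the old `lr.residues_` attribute is not available in recent scikit-learn releases)
print("RSS:", np.sum((np.array(y) - yhat) ** 2))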

It is customary to test your model on **unseen** data. So we divide our data into two parts. We use 70% of it to train the model and 30% to evaluate its performance on unseen data.

In [None]:
split = 0.7
split_idx = int(np.round(split * len(data)))
split_idx

In [None]:
# first 70% of the rows for training
train_data = data[:split_idx]
train_data.head()

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True)
train_data.plot(kind='scatter', x='RM', y='MEDV', ax=axs[0], figsize=(7, 7))
train_data.plot(kind='scatter', x='AGE', y='MEDV', ax=axs[1], figsize=(7, 7))

In [None]:
# remaining 30% of the rows for testing
test_data = data[split_idx:]
test_data.head()

In [None]:
train_X = train_data['RM'].values
train_X = np.c_[train_X]
train_Y = train_data['MEDV'].tolist()

test_X = test_data['RM'].values
test_X = np.c_[test_X]
test_Y = test_data['MEDV'].tolist()

In [None]:
print (type(train_X))
print (train_X.shape)
print (type(train_Y))

In [None]:
'''
Build a linear regression model from training data
'''
from sklearn import linear_model
lr = linear_model.LinearRegression()

lr.fit(train_X, train_Y)

In [None]:
print (lr.coef_)
print (lr.intercept_)

Now we plot the linear regression result and the data to see how it fits the training data:

In [None]:
fig,ax = plt.subplots(figsize=(7,7),dpi=100)

# plot training data
scales = 20*np.ones(len(train_Y))
ax.scatter(train_X,train_Y,color='b',s=scales,alpha=0.7,edgecolor='r')
plt.xlabel('$X$ (number of rooms)')
plt.ylabel('$Y$ (median house prices)')
plt.title('Training a simple linear regression model')

# plot the regression line
train_Yhat = lr.predict(train_X)
plt.plot(train_X,train_Yhat)

Now that we have obtained the model parameters, we can use the model to predict for unseen data:

In [None]:
yhat_test = lr.predict(test_X)

In [None]:
fig,ax = plt.subplots(figsize=(7,7),dpi=100)

# plot the predicted points along the prediction line
scales = 30*np.ones(len(test_X))
ax.scatter(test_X,yhat_test,s=scales,color='b',edgecolor='r')
ax.plot(test_X,yhat_test,color='b',linewidth=.2)

# plot the true values
scales = 30*np.ones(len(test_X))
ax.scatter(test_X,test_Y,s=scales,color='g',edgecolor='b')

# plot the residual lines (vertical segments between predicted and true values)
tmp = np.reshape(test_X,[1,len(test_X)])[0]
for i in range(len(test_X)):
    ax.plot([tmp[i], tmp[i]], [yhat_test[i], test_Y[i]], color='red', linewidth=0.5)

In [None]:
data.head()
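
A plot gives a visual impression of the fit, but performance on unseen data is usually summarised with a number. The sketch below computes the mean squared error (MSE) and R² on a held-out 30% split. It uses synthetic data as a stand-in for the housing file (the slope, intercept and noise level are arbitrary assumptions) and `np.polyfit` in place of scikit-learn so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for (rooms, price); these are NOT the housing data
x = rng.uniform(4, 9, size=300)
y = 9.0 * x - 34.0 + rng.normal(0, 3.0, size=300)

# 70% / 30% split by position, as in the cells above
split_idx = int(round(0.7 * len(x)))
x_train, y_train = x[:split_idx], y[:split_idx]
x_test, y_test = x[split_idx:], y[split_idx:]

# Fit on the training part only
slope, intercept = np.polyfit(x_train, y_train, 1)

# Evaluate on the held-out part
y_pred = slope * x_test + intercept
mse = np.mean((y_test - y_pred) ** 2)
r2 = 1 - np.sum((y_test - y_pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

print("test MSE:", mse)
print("test R^2:", r2)
```

With the scikit-learn model from this section, `lr.score(test_X, test_Y)` returns the same R² quantity for the real data.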

---
## <span style="color:#0b486b">3. Distances</span>

`Distance` is a numerical description of how far apart objects are. It is a concrete way of describing what it means for elements of some space to be close to or far away from each other, for example the distance between two vectors in a 2-dimensional space.

Now that you know how to represent an n-dimensional vector in Python with NumPy arrays, we will write a function as a metric to measure the distance between two vectors. There are multiple ways to measure the distance between two vectors. We will discuss Euclidean distance and cosine distance.

<a id = "euclidean"></a>


### <span style="color:#0b486b">3.1 Euclidean Distance</span>

Euclidean distance comes from Geometry. If we assume $\mathbf{x}_{1}=\left[x_{11},x_{12},\ldots,x_{1n}\right]$ and $\mathbf{x}_{2}=\left[x_{21},x_{22},\ldots,x_{2n}\right]$, then the Euclidean distance between $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ is defined as:

$$d\left(\mathbf{x}_{1},\mathbf{x}_{2}\right)=\sqrt{\left(x_{11}-x_{21}\right)^{2}+\left(x_{12}-x_{22}\right)^{2}+\ldots+\left(x_{1n}-x_{2n}\right)^{2}}
$$

We can use array operators for this task.

In [None]:
x1 = np.array([2, 5, 4, 6, 8])
x2 = np.array([3, 5, 6, 8, 6])

print (x1 - x2)
print ((x1 - x2) ** 2)
print (np.sqrt(np.sum((x1 - x2) ** 2)))
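
The Euclidean distance is the 2-norm of the difference vector, so NumPy's `np.linalg.norm` gives the same value as the step-by-step computation. A short sketch with the same two vectors:

```python
import numpy as np

x1 = np.array([2, 5, 4, 6, 8])
x2 = np.array([3, 5, 6, 8, 6])

# Step-by-step version of the formula
d_manual = np.sqrt(np.sum((x1 - x2) ** 2))

# Same quantity via the vector 2-norm
d_norm = np.linalg.norm(x1 - x2)

print(d_manual, d_norm)
```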

In [None]:
import numpy as np

In [None]:
def euclidean_distance1(x1, x2):
    d = x1 - x2
    d = d ** 2
    return np.sqrt(d.sum())

In [None]:
x1 = np.array([-1, 2, 0, 5])
x2 = np.array([4, 2, 1, 0])

print (euclidean_distance1(x1, x2))

Since the two vectors passed to the function should be the same size, it is better to perform a sanity check before applying the subtraction; otherwise the subtraction will raise an error. We can do this with an `if`/`else` statement or, as a better practice, with `try`/`except`.

In [None]:
import sys

def euclidean_distance2(x1, x2):
    if x1.shape[0] != x2.shape[0]:
        sys.exit('x1 and x2 are not the same size')
    else:
        d = x1 - x2
        d = d ** 2
        return np.sqrt(d.sum())

In [None]:
# fix this cell

x1 = np.array([-1, 2, 0, 5, 9])
x2 = np.array([4, 2, 1, 0, 1])
euclidean_distance2(x1, x2)

In [None]:
def euclidean_distance3(x1, x2):
    try:
        d = x1 - x2
        d = np.power(d, 2)
        return np.sqrt(d.sum())
    except ValueError as e:
        print ("Vectors passed to the function are not the same size")
        # you can return a default value
        return None

In [None]:
# fix this cell

x1 = np.array([-1, 2, 0, 5, 9])
x2 = np.array([4, 2, 1, 2])
a = euclidean_distance3(x1, x2)

In [None]:
def euclidean_distance4(x1, x2):
    try:
        d = np.array(x1) - np.array(x2)
        d = np.power(d, 2)
        return np.sqrt(d.sum())
    except ValueError as e:
        print ("Vectors passed to the function are not the same size")
        # you can return a default value
        return None
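
One further variant, offered as a sketch rather than as part of the practical: compare the shapes explicitly and raise a `ValueError`, leaving it to the caller to decide how to react. The explicit check also catches pairs such as a length-5 and a length-1 vector, which NumPy would otherwise broadcast without raising any error. The function name `euclidean_distance_checked` is hypothetical.

```python
import numpy as np

def euclidean_distance_checked(x1, x2):
    """Euclidean distance that refuses inputs of different shapes."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    if x1.shape != x2.shape:
        # Raise instead of printing, so the caller can handle the problem
        raise ValueError(f"shape mismatch: {x1.shape} vs {x2.shape}")
    return float(np.sqrt(np.sum((x1 - x2) ** 2)))

print(euclidean_distance_checked([-1, 2, 0, 5], [4, 2, 1, 0]))

try:
    euclidean_distance_checked([-1, 2, 0, 5, 9], [4, 2, 1, 2])
except ValueError as err:
    print("caught:", err)
```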

<a id = "cosine"></a>

### <span style="color:#0b486b">3.2 Cosine Similarity and Distance</span>

Cosine similarity is a measure of similarity between two vectors based on the angle between them. Cosine similarity is widely used in information retrieval and text mining as a measure of similarity between documents and is defined as:

$$S_{c}\left(\mathbf{x}_{1},\mathbf{x}_{2}\right)=\frac{\mathbf{x}_{1}\cdot\mathbf{x}_{2}}{\parallel\mathbf{x}_{1}\parallel\,\parallel\mathbf{x}_{2}\parallel}$$


Cosine similarity is particularly used in positive space, where the outcome is bounded in [0, 1]. The cosine distance is defined as the complement of cosine similarity in positive space, that is, $D_{c}\left(x_{1},x_{2}\right)=1-S_{c}\left(x_1,x_2\right)$, where $D_c$ is the cosine distance and $S_c$ is the cosine similarity.

In [None]:
x1 = np.array([1,2,3])
x2 = np.array([3,4,6])

# element-wise product; its sum is the squared norm of x1
print(x1 * x1)

In [None]:
def cosine_distance(x1, x2):
    try:
        # dot product of the two vectors
        num = (x1*x2).sum()
        # product of the two vector norms
        denom = np.sqrt((x1*x1).sum()) * np.sqrt((x2*x2).sum())
        return 1 - num/denom
    except ValueError as e:
        print ("Vectors passed to the function are not the same size")
        return None
    

In [None]:
x1 = np.array([2, 0, 5, 9])
x2 = np.array([4, 2, 1, 0])
cosine_distance(x1, x2)
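
Two limiting cases help to build intuition for the angle-based definition $\frac{\mathbf{x}_1\cdot\mathbf{x}_2}{\parallel\mathbf{x}_1\parallel\,\parallel\mathbf{x}_2\parallel}$: vectors pointing in the same direction have similarity 1 regardless of their lengths, and perpendicular vectors have similarity 0. The helper `cosine_similarity` below is a hypothetical name used only for this sketch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (standard definition)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different length: the angle is 0
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: the angle is 90 degrees
orthogonal = cosine_similarity([1, 0], [0, 1])

print("same direction:", same_direction, "distance:", 1 - same_direction)
print("orthogonal:", orthogonal, "distance:", 1 - orthogonal)
```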

# <span style="color:#0b486b">Tasks</span>

Try the provided examples and familiarise yourself with the sample plotting code before attempting the portfolio tasks.

Please show your attempt to your tutor before you leave the lab, or email your files to your coordinator if you are an off-campus student.

# <span style="color:#0b486b">Summary</span>

In this session we have covered: 
 - data dependency, linear regression and distances.
 - how to apply linear regression in Python.

Reference:
1. Yash Soni, "Machine Learning for dummies — explained in 3 mins!", https://becominghuman.ai/machine-learning-for-dummies-explained-in-2-mins-e83fbc55ac6d, accessed 31/03/2019.