
# MATHS7027 Mathematical Foundations of Data Science
## Computer Exercise 4
In this computer exercise, we will learn how to use `numpy` to do linear algebra, and show some applications.
## Part 1: Getting Started with Matrices in `numpy`
To start, let's solve a typical linear algebra problem.

We want to solve:
$$
\begin{align*}
2x + y - z = & 2\\
x + 3y +2z = & 1\\
x + y + z = & 2.
\end{align*}
$$

We can write this system in the compact matrix form $$A\boldsymbol{x} = \boldsymbol{b},$$ where  
$$A = \begin{bmatrix}2 & 1 & -1 \\ 1 & 3 & 2\\ 1 & 1 & 1\end{bmatrix}, \quad \boldsymbol{b} = 
\begin{bmatrix}2 \\ 1 \\ 2\end{bmatrix}, \quad \text{and} \quad \boldsymbol{x} = \begin{bmatrix}x \\ y \\ z\end{bmatrix}.$$

We can define $A$ and $\boldsymbol{b}$ in `numpy` as follows:

In [19]:
import numpy as np

A = np.array([[2,1,-1], [1,3,2], [1,1,1]])
b = np.array([2,1,2])

print(A)
print(b)

[[ 2  1 -1]
 [ 1  3  2]
 [ 1  1  1]]
[2 1 2]


We enter $A$ as a *list of lists* (one inner list per row), and wrapping `np.array()` around the lists for $A$ and $\boldsymbol{b}$ allows us to use `numpy` functions on them. For example, find the order (dimensions) of $\boldsymbol{b}$ by appending `.shape` to the end of `b`:

In [3]:
# try using .shape here

b.shape

(3,)

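This means that `b` is a one-dimensional array with 3 entries. Strictly speaking, `numpy` does not treat a 1-D array as either a row or a column vector, although `np.matmul()` treats it as a column vector when it appears on the right of a product such as $A\boldsymbol{x}$.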

Find the shapes of the different vectors below:

In [6]:
c = np.array([1,2,3])
d = np.array([[1,2,3]])

# find the shapes of the vectors c and d

print(c.shape)
print(d.shape)

(3,)
(1, 3)


Notice the difference: `c` is a one-dimensional array with 3 entries, while the extra pair of brackets makes `d` a two-dimensional array with 1 row and 3 columns (a row vector).
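
If we need an explicit row or column vector, one option is `.reshape()`. The snippet below is a small sketch (the names `row` and `col` are only for illustration) showing how the shapes differ, and that transposing a 1-D array does not change it:

```python
import numpy as np

c = np.array([1, 2, 3])   # 1-D array, shape (3,)
row = c.reshape(1, 3)     # explicit row vector, shape (1, 3)
col = c.reshape(3, 1)     # explicit column vector, shape (3, 1)

print(c.shape, row.shape, col.shape)

# Transposing a 1-D array leaves its shape unchanged, while
# transposing a row vector gives a column vector.
print(c.T.shape, row.T.shape)
```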

Now, let's do a bit more with matrices in numpy. Try the following exercise:
- Define the following matrices with `numpy`:
$$X = \begin{bmatrix}1 & 6 \\ 3 & -1\end{bmatrix}, \quad Y = \begin{bmatrix}1 & 1 & 1 \\ 1 & -2 & 0\end{bmatrix}, \quad Z = \begin{bmatrix}2 & 3 & 1 \\ 0 & 1 & -2\end{bmatrix}$$
- Where possible, find the following (where the matrix operation is not possible, pay close attention to the result/output `numpy` gives you):
$$X + Y; \quad Y + Z; \quad Y - Z$$

In [12]:
# define the matrices X, Y, and Z
X = np.array([[1,6],
              [3,-1]])

Y = np.array([[1,1,1],
              [1,-2,0]])

Z = np.array([[2,3,1],
              [0,1,-2]])

# try X + Y, Y + Z, and Y - Z here
# X + Y   # raises a ValueError: X is 2x2 but Y is 2x3, so the shapes do not match
Y + Z
Y - Z

array([[-1, -2,  0],
       [ 1, -3,  2]])
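
To see what `numpy` reports for the sum that is not defined, without stopping the rest of the cell, we could catch the error. This is only a sketch; the `try`/`except` wrapper and the names `message` and `total` are our own additions, not part of the exercise:

```python
import numpy as np

X = np.array([[1, 6], [3, -1]])          # 2 x 2
Y = np.array([[1, 1, 1], [1, -2, 0]])    # 2 x 3
Z = np.array([[2, 3, 1], [0, 1, -2]])    # 2 x 3

message = None
try:
    X + Y                                # shapes (2, 2) and (2, 3) do not match
except ValueError as err:
    message = str(err)
    print("X + Y failed:", message)

total = Y + Z                            # same shape, so the sum is entrywise
print(total)
```

Matrices can only be added or subtracted when they have the same order, which is why `Y + Z` and `Y - Z` work but `X + Y` does not.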

`np.matmul()` does [matrix multiplication](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html#numpy.matmul), and `.T` takes the [matrix transpose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.transpose.html). Look at the syntax for each operation, then find the following where possible (again, where the matrix operation is not possible, pay close attention to the error message):
- $XY$
- $ZX$
- $Z^TX$
- Confirm that $\boldsymbol{x} = (2,-1,1)$ is a solution of $A\boldsymbol{x} = \boldsymbol{b}$ from the top of the page.

In [31]:
# try XY, ZX, and Z^T X
XY = np.matmul(X,Y)
# ZX = np.matmul(Z,X)   # not defined: Z is 2x3 and X is 2x2, so matmul raises a ValueError
Z_TX = np.matmul(Z.T,X)

# check that (2,-1,1) is a solution of Ax = b
x = np.array([2,-1,1])
np.matmul(A,x) == b

# cross-check with numpy's solver: count the entries that differ from x
list(np.linalg.solve(A,b) == x).count(False)

0
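
Comparing floating-point results with `==` can fail because of rounding, so a more robust check is `np.allclose()`. The sketch below assumes the same `A`, `b`, and `x` as above and uses the `@` operator, which is shorthand for `np.matmul()`; the name `x_solved` is ours:

```python
import numpy as np

A = np.array([[2, 1, -1], [1, 3, 2], [1, 1, 1]])
b = np.array([2, 1, 2])
x = np.array([2, -1, 1])

# Check the proposed solution by substituting it into Ax = b.
print(np.allclose(A @ x, b))

# Solve the system directly and compare with x up to rounding error.
x_solved = np.linalg.solve(A, b)
print(np.allclose(x_solved, x))
```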

## Part 2: Linear Regression
Let's take what we've just learned and use it to perform a linear regression.

In an Australian Federal Election, some votes do not count because they are informal (meaning they do not follow the requirements for a correct vote). Australia uses ranked-choice voting, so voters must put a number next to each candidate listed on their ballot paper. If there are five candidates, but a voter only puts a number next to three candidates, their vote may be treated as informal.

Suppose we want to construct a (linear) model to predict the informal vote rate based on how many candidates are running in an electorate. Our hypothesis is that if a ballot paper contains more candidates then there may be more informal votes.

The data below shows the informal vote rate and number of candidates in each electorate in South Australia, for the 2019 Australian Federal Election (source here: https://results.aec.gov.au/24310/Website/HouseDivisionalResults-24310.htm)

| Electorate | Informal Rate (%) | Candidates |
| --- | --- | --- |
|Adelaide | 3.70 | 6 |
| Barker| 5.57 | 7 |
| Boothby| 4.70| 8 |
| Grey | 6.91 | 8| 
| Hindmarsh | 4.32 |6 |
| Kingston | 4.11 | 5 | 
| Makin | 4.49 | 5 | 
| Mayo | 3.05 | 6 |
| Spence | 5.98 | 6 | 
| Sturt | 5.37 | 8 | 

Let's start by converting this into a design matrix and a response variable vector. If that sentence doesn't make a lot of sense, you should review the **[course material about linear regression](https://myuni.adelaide.edu.au/courses/81010/pages/week-6-matrices-in-linear-regression)**.

In [46]:
X = np.array([[1, 6], 
              [1, 7], 
              [1, 8],
              [1, 8],
              [1, 6],
              [1, 5], 
              [1, 5], 
              [1, 6],
              [1, 6],
              [1, 8],])

y = np.array([3.70,5.57,4.70,6.91,4.32,4.11,4.49,3.05,5.98,5.37])

Why is there a column of 1's in our design matrix? Remember, the linear regression model is $y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}$. Without the column of 1's, we wouldn't have a $\beta_{0}$ term in our model.
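
Rather than typing the column of 1's by hand, the design matrix could also be assembled from the raw data. This is a sketch only; the variable name `candidates` is ours, and `np.column_stack()` is one of several ways to glue columns together:

```python
import numpy as np

# Number of candidates in each electorate, in the same order as the table.
candidates = np.array([6, 7, 8, 8, 6, 5, 5, 6, 6, 8])

# First column of 1's (for beta_0), second column the predictor (for beta_1).
X = np.column_stack([np.ones(len(candidates)), candidates])

print(X.shape)
print(X)
```

Note that this version of `X` holds floating-point numbers rather than integers, which makes no difference to the regression.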

Now, here comes the fun part: let's fit the linear regression! Use the equation below to find our regression coefficients, $\hat{\boldsymbol{\beta}}$. You might need to search online, or review the `numpy` documentation, to figure out how to find the inverse of a matrix using `numpy`.

$$\hat{\boldsymbol{\beta}}=\left(X^{T} X\right)^{-1} X^{T} \boldsymbol{y}$$

In [49]:
# try to calculate beta-hat here
M = np.linalg.inv(np.matmul(X.transpose(),X))
N = np.matmul(X.transpose(),y)
beta = np.matmul(M,N)
beta


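# cross-check with scikit-learn; it fits its own intercept, so we compare
# intercept_ and the slope coef[1] with our beta
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

reg.fit(X, y)
coef = reg.coef_
intercept = reg.intercept_
print(beta)
print([intercept,coef[1]])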

[1.2684 0.5464]
[1.2684000000000006, 0.5463999999999998]
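
Forming the inverse explicitly matches the formula, but it is not the only way to get $\hat{\boldsymbol{\beta}}$. As a sketch, we could instead solve the normal equations $X^{T}X\hat{\boldsymbol{\beta}} = X^{T}\boldsymbol{y}$ with `np.linalg.solve()`, or hand the problem to `np.linalg.lstsq()`. The names `beta_inv`, `beta_solve`, and `beta_lstsq` are ours:

```python
import numpy as np

X = np.array([[1, 6], [1, 7], [1, 8], [1, 8], [1, 6],
              [1, 5], [1, 5], [1, 6], [1, 6], [1, 8]])
y = np.array([3.70, 5.57, 4.70, 6.91, 4.32, 4.11, 4.49, 3.05, 5.98, 5.37])

# 1. The formula as written: (X^T X)^{-1} X^T y.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. Solve (X^T X) beta = X^T y without forming the inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3. Let numpy minimise ||X beta - y||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

All three should agree up to rounding error on a small, well-behaved problem like this one.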


These are the coefficients of the linear regression, and they describe the relationship between our variables. The fitted model is $y = \hat{\beta}_{0} + \hat{\beta}_{1}x$, where $x$ is the number of candidates and $y$ is the informal voting rate, so each additional candidate is associated with an increase of about 0.55 in the predicted informal rate.

We can compare the values our model predicts with the actual/observed values by calculating $X\hat{\boldsymbol{\beta}}$ and comparing it to $\boldsymbol{y}$. We might want to compare these side-by-side, and also look at their difference. We can do this with `np.c_`, which collects columns together into an array. For example, `np.c_[a,b,c]` would create an array with column vectors `a`, `b`, and `c` as its columns.

Make an array with $X\hat{\boldsymbol{\beta}}$, $\boldsymbol{y}$, and their difference as columns, so you can directly compare your model with the real data.

In [None]:
# try making an array that includes your model output, the real data, and their
# difference as columns
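
One possible way to lay out this comparison is sketched below. It recomputes $\hat{\boldsymbol{\beta}}$ so that it runs on its own, and the names `beta_hat`, `y_hat`, and `comparison` are our own choices:

```python
import numpy as np

X = np.array([[1, 6], [1, 7], [1, 8], [1, 8], [1, 6],
              [1, 5], [1, 5], [1, 6], [1, 6], [1, 8]])
y = np.array([3.70, 5.57, 4.70, 6.91, 4.32, 4.11, 4.49, 3.05, 5.98, 5.37])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # regression coefficients
y_hat = X @ beta_hat                           # model predictions

# Columns: prediction, observed value, difference (observed - predicted).
comparison = np.c_[y_hat, y, y - y_hat]
print(np.round(comparison, 2))
```

Because the model includes an intercept, the differences in the last column sum to zero (up to rounding), so it is their size, not their total, that tells us how well the line fits.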



## Part 3: Linear Regression with Multiple Variables
We might be able to improve our model if we incorporate some other factors. Some socio-economic data might be relevant; electorate-level data is available here: https://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/0/DE95D3A11C2436F9CA2583AF0071AB19/$File/south%20australia%20profiles.pdf

Let's include the proportion of people within each electorate who are recent migrants (as they may be less familiar with Australia's voting system), and the proportion with year 12 (finished high school) or equivalent education.

| Electorate | Informal Rate (%) | Candidates | Migrants (%) | Education (%) |
| --- | --- | --- | --- | --- | 
|Adelaide | 3.70 |  6 | 17.6 | 82.4 |
| Barker| 5.57 |  7 | 3.2 | 57.9 | 
| Boothby| 4.70|  8 | 9.8 | 82.8 | 
| Grey | 6.91 |  8 | 1.9 | 53.5 | 
| Hindmarsh | 4.32 | 6 | 8.4 | 76.7 | 
| Kingston | 4.11 |  5 |6.2| 69.1 | 
| Makin | 4.49 |  5 | 9.0 | 73.1 | 
| Mayo | 3.05 |  6 | 3.1 | 70.8 | 
| Spence | 5.98 |  6 |  7.2 | 58.1 | 
| Sturt | 5.37 |  8 | 12.7 | 85.3 |

We can now use exactly the same approach as before; we just need to add in more columns to our matrix:

In [51]:
X2 = np.array([[1, 6, 17.6, 82.4], 
              [1, 7, 3.2, 57.9], 
              [1, 8, 9.8, 82.8],
              [1, 8, 1.9, 53.5],
              [1, 6, 8.4, 76.7],
              [1, 5, 6.2, 69.1], 
              [1, 5, 9.0, 73.1], 
              [1, 6, 3.1, 70.8],
              [1, 6, 7.2, 58.1],
              [1, 8, 12.7, 85.3]])

y2 = np.array([3.70,5.57,4.70,6.91,4.32,4.11,4.49,3.05,5.98,5.37])

# try to calculate the estimate of our model coefficients (beta-hat) using the
# new data
M = np.linalg.inv(np.matmul(X2.transpose(),X2))
N = np.matmul(X2.transpose(),y2)
beta = np.matmul(M,N)
beta


array([ 7.20554166,  0.6064278 ,  0.12159395, -0.10270721])

Next, you can see if the new model performs better than our old model did:

In [None]:
# create an array showing the model output, actual data, and difference
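
As well as comparing the columns by eye, we could summarise each model with its sum of squared differences. The sketch below is one way to do this; the helper functions `fit` and `sse` are hypothetical names of ours, not `numpy` functions:

```python
import numpy as np

candidates = np.array([6, 7, 8, 8, 6, 5, 5, 6, 6, 8])
migrants = np.array([17.6, 3.2, 9.8, 1.9, 8.4, 6.2, 9.0, 3.1, 7.2, 12.7])
education = np.array([82.4, 57.9, 82.8, 53.5, 76.7, 69.1, 73.1, 70.8, 58.1, 85.3])
y = np.array([3.70, 5.57, 4.70, 6.91, 4.32, 4.11, 4.49, 3.05, 5.98, 5.37])

ones = np.ones(len(y))
X1 = np.column_stack([ones, candidates])                        # old model
X2 = np.column_stack([ones, candidates, migrants, education])   # new model

def fit(X, y):
    """Return beta-hat by solving the normal equations (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def sse(X, y, beta):
    """Return the sum of squared differences between y and X beta."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

sse1 = sse(X1, y, fit(X1, y))
sse2 = sse(X2, y, fit(X2, y))
print(sse1, sse2)
```

Adding columns to the design matrix can never increase this quantity on the data used for fitting, so a smaller value for the new model does not by itself show that the extra variables are useful.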



We could also test our model on some electorates from other states. 

Here are a few electorates I have chosen from around Australia:

| Electorate | Informal Rate (%) | Candidates | Migrants (%) | Education (%) |
| --- | --- | --- | --- | --- | 
| Aston | 3.68| 5 | 8.8 | 79.7 |
| Ballarat | 4.37 | 7 | 3.0 | 68.6 | 
| Banks | 7.20 | 6 | 12.8 | 83.4 | 
| Barton | 9.53 | 6 | 19.2 | 83.9 |

We want to use the $\hat{\boldsymbol{\beta}}$ that we obtained from the South Australian data and test how well it predicts the informal voting rate for these new electorates:

In [None]:
# test our model on the new data here
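
A sketch of one way to run this test is given below. It refits $\hat{\boldsymbol{\beta}}$ on the South Australian data so that it runs on its own; the names `X_new`, `y_new`, `y_pred`, and `comparison` are ours:

```python
import numpy as np

# South Australian data used to fit the model.
X2 = np.array([[1, 6, 17.6, 82.4], [1, 7, 3.2, 57.9], [1, 8, 9.8, 82.8],
               [1, 8, 1.9, 53.5], [1, 6, 8.4, 76.7], [1, 5, 6.2, 69.1],
               [1, 5, 9.0, 73.1], [1, 6, 3.1, 70.8], [1, 6, 7.2, 58.1],
               [1, 8, 12.7, 85.3]])
y2 = np.array([3.70, 5.57, 4.70, 6.91, 4.32, 4.11, 4.49, 3.05, 5.98, 5.37])
beta_hat = np.linalg.solve(X2.T @ X2, X2.T @ y2)

# New electorates: Aston, Ballarat, Banks, Barton (same column order as X2).
X_new = np.array([[1, 5, 8.8, 79.7],
                  [1, 7, 3.0, 68.6],
                  [1, 6, 12.8, 83.4],
                  [1, 6, 19.2, 83.9]])
y_new = np.array([3.68, 4.37, 7.20, 9.53])

# Predict with the South Australian coefficients; do not refit on the new data.
y_pred = X_new @ beta_hat
comparison = np.c_[y_pred, y_new, y_new - y_pred]
print(np.round(comparison, 2))
```

The important point is that `beta_hat` comes only from the South Australian data, so the last column shows how well the model carries over to electorates it has not seen.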



Describe your findings here:

...

One last thing to note: in practice, packages such as scikit-learn (used above as a cross-check) perform linear regression in Python. There are also many statistical tools for determining how well a model matches the data, and which variables best predict behaviour, but these go well beyond the topic at hand, which is simply an application of matrices.