# MATHS1004 Mathematics for Data Science I
## Computer Lab 3

The aim of this lab is to delve into using `numpy` for doing linear algebra, and to show some applications. Let's start with a problem from lectures: Solve
$$
\begin{align*}
2x + y - z &= 2\\
x + 3y +2z & = 1\\
x + y + z & = 2.
\end{align*}
$$
As we showed (or will show, depending on when you take this lab) in lectures, the solution to this system of equations is $x = 2, y = -1, z = 1$.

We can write this system in the compact matrix form 
$$
A\mathbf{x} = \mathbf{b},
$$
where
$$
A = \begin{bmatrix}
2 & 1 & -1 \\
1 & 3 & 2\\
1 & 1 & 1
\end{bmatrix}
$$

$$\mathbf{b} = 
\begin{bmatrix}
2\\
1\\
2
\end{bmatrix}
$$
and
$$
\mathbf{x} = 
\begin{bmatrix}
x\\
y\\
z
\end{bmatrix}
$$
(A good pen-and-paper exercise: verify that 
$
A\mathbf{x} = \mathbf{b}
$
corresponds to the system of equations at the top of the page.)

We can define $A$ and $\mathbf{b}$ in `numpy` as follows.

In [None]:
import numpy as np

A = np.array([[2,1,-1], [1,3,2], [1,1,1]])
b = np.array([2,1,2])

print(A)

The argument we pass to `np.array()` for $A$ is a *list of lists* (one inner list per row). Wrapping it in `np.array()`, as we also do for $\mathbf{b}$, lets us use `numpy` functions on the result. For example, find the order (dimensions) of $\mathbf{b}$ by appending `.shape` to `b`:

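`b.shape` returns `(3,)`, which means that `b` is a one-dimensional array with 3 entries. Strictly, `numpy` does not label a one-dimensional array as a row or a column vector, but in a product such as `np.matmul(A, b)` it is treated as a *column vector* (3 rows and one column), not a row vector.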
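
As a quick reference, here is a minimal sketch of inspecting shapes. It redefines `A` and `b` so that it runs on its own; the name `b_col` is introduced here purely for illustration and is not used elsewhere in the lab.

```python
import numpy as np

A = np.array([[2, 1, -1], [1, 3, 2], [1, 1, 1]])
b = np.array([2, 1, 2])

print(A.shape)  # (rows, columns) of the two-dimensional array
print(b.shape)  # a one-dimensional array reports a single length

# If an explicit 3x1 column is needed, reshape; -1 lets numpy infer the row count.
b_col = b.reshape(-1, 1)
print(b_col.shape)
```

Most `numpy` linear algebra functions accept either the one-dimensional `b` or the explicit column `b_col`; the shape of the result follows the shape of the input.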

Some exercises:
- Define the following matrices with `numpy`:
$$
X = 
\begin{bmatrix}
1 & 6\\
3 & -1
\end{bmatrix}
\quad
Y = 
\begin{bmatrix}
1 & 1 & 1\\
1 & -2 & 0
\end{bmatrix}
\quad
Z = 
\begin{bmatrix}
2 & 3 & 1\\
0 & 1 & -2
\end{bmatrix}
$$
- Find (where possible, and look at what `numpy` gives you when not):
$$
X + Y; \quad Y + Z; \quad Y - Z
$$
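
To see what "not possible" looks like in practice, here is a small sketch using made-up matrices `P`, `Q` and `R` (hypothetical names and values, chosen so the exercise above is still yours to do). It assumes nothing beyond `numpy` itself.

```python
import numpy as np

# Hypothetical matrices, chosen only to illustrate matching and mismatched orders.
P = np.array([[1, 2], [3, 4]])          # 2x2
Q = np.array([[1, 0, 2], [0, 1, 3]])    # 2x3
R = np.array([[5, 5, 5], [1, 1, 1]])    # 2x3

print(Q + R)  # same order, so entries are added one by one
print(Q - R)

try:
    P + Q     # orders 2x2 and 2x3 do not match
except ValueError as err:
    print("numpy refused:", err)
```

One caution: `numpy` will sometimes add arrays of different shapes by *broadcasting* (for example, a 2x3 array plus a one-dimensional array of length 3), so the absence of an error does not by itself prove that a sum is a valid matrix sum. Checking `.shape` first is the safer habit.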

`np.matmul` does [matrix multiplication](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html#numpy.matmul), and `np.transpose` takes the [matrix transpose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.transpose.html). Look at the syntax in those links, and then find (where possible, or look at what the error message gives you):
- $XY$
- $ZX$
- $Z^TX$
- Confirm that $\mathbf{x}$ from the top of the page is a solution of $A\mathbf{x} = \mathbf{b}$.
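
The calling pattern for both functions can be sketched on two made-up matrices `P` and `Q` (hypothetical values, not the ones from the exercises):

```python
import numpy as np

# Hypothetical matrices used only to show the syntax.
P = np.array([[1, 2], [3, 4]])          # 2x2
Q = np.array([[1, 0, 2], [0, 1, 3]])    # 2x3

PQ = np.matmul(P, Q)                    # (2x2)(2x3) gives a 2x3 matrix
print(PQ)

Qt = np.transpose(Q)                    # 3x2
QtP = np.matmul(Qt, P)                  # (3x2)(2x2) gives a 3x2 matrix
print(QtP)

try:
    np.matmul(Q, P)                     # (2x3)(2x2): inner dimensions 3 and 2 differ
except ValueError as err:
    print("numpy refused:", err)
```

The rule to keep in mind is that the number of columns of the left matrix must equal the number of rows of the right matrix.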



## Solving systems of linear equations using matrices

A major aim of the next few lectures is to teach a fundamental algorithm for *solving* $A\mathbf{x} = \mathbf{b}$. It is important to understand what is happening at that level, but in practice `numpy` lets us solve linear equations quickly in a few different ways. Try the following methods:
1. Solve $A\mathbf{x} = \mathbf{b}$ by using the `np.linalg.solve` command (see docs [here](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.linalg.solve.html)).
2. Find the matrix *inverse* $A^{-1}$ of $A$ and then pre-multiply $\mathbf{b}$ by $A^{-1}$ to get $\mathbf{x} = A^{-1}\mathbf{b}$. The syntax for the inverse is in the docs [here](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.linalg.inv.html).

Try both of these in the cell below for $A$ and $\mathbf{b}$ defined at the top of the page, and make sure you get $\mathbf{x}$ as defined at the top.
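
If the syntax is unclear, here is a sketch of both methods on a smaller, made-up system ($2x + y = 5$, $x + 3y = 10$), so that the 3x3 problem is still left for you. The names `M`, `v`, `x_solve`, `M_inv` and `x_inv` are illustrative only.

```python
import numpy as np

# Hypothetical 2x2 system: 2x + y = 5 and x + 3y = 10.
M = np.array([[2, 1], [1, 3]])
v = np.array([5, 10])

# Method 1: solve the system directly.
x_solve = np.linalg.solve(M, v)

# Method 2: compute the inverse, then pre-multiply v by it.
M_inv = np.linalg.inv(M)
x_inv = np.matmul(M_inv, v)

print(x_solve)
print(x_inv)

# Results are floating-point numbers, so compare with a tolerance
# rather than expecting exact equality.
print(np.allclose(x_solve, x_inv))
print(np.allclose(np.matmul(M, x_solve), v))
```

The two answers can differ in the last few decimal places because of floating-point rounding, which is why the comparison uses `np.allclose` instead of `==`.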

## Applications of matrices

Why care about matrix multiplication? One of many applications is *sentiment analysis*, which takes a piece of text and gives it a score based on the words used. *Dictionary-based* methods use a list of scores for different words to calculate an average sentiment for a piece of text.

For example, one such list gives words a score from 1 to 9, where 1 is very sad and 9 is very happy (so that 5 is neutral). If we have the following scores $h$ for two words:
$$
\begin{align*}
h(\text{the}) &= 4.98\\
h(\text{dark}) &= 3.82
\end{align*}
$$

we might calculate the sentiment of the phrase "the dark" by the mean score of the words: $(4.98+3.82)/2 = 4.4$.

For you:
1. Put the scores for "the" and "dark" into a 2x1 column vector $h$.
2. Put the counts of each word in the phrase "the dark" (i.e., 1 and 1) into a 1x2 row vector of word counts $c$.
3. Matrix multiply $ch$, and divide by the sum of the word counts `np.sum(c)` to check the calculation above.
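
The same three steps can be sketched for a slightly different, hypothetical set of counts (2 for "the" and 1 for "dark"), which shows how repeated words are weighted. The scores are the two given above; the counts are an assumption made for illustration.

```python
import numpy as np

# Scores for "the" and "dark" as a 2x1 column vector.
h = np.array([[4.98],
              [3.82]])

# Hypothetical word counts as a 1x2 row vector: "the" twice, "dark" once.
c = np.array([[2, 1]])

total = np.matmul(c, h)     # (1x2)(2x1) gives a 1x1 matrix: 2*4.98 + 1*3.82
mean = total / np.sum(c)    # divide by the number of words, here 3

print(total.shape)
print(mean)
```

The product $ch$ is a weighted sum of scores, so dividing by the total word count gives a count-weighted mean.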

Now, if we want to score lots of phrases at once we can do it efficiently with linear algebra! Let's demonstrate using a really simple example from Dr. Seuss. Consider the couplet from *Green Eggs and Ham*:

"Would you, could you, in the dark?"

"I would not, could not, in the dark!"

Which sentence is happier? We can compute the mean score for each sentence by building a matrix of word counts for each sentence, and then using a bit of matrix multiplication. The following code creates a matrix $X$, where the first row holds the word counts for sentence 1 and the second row holds the word counts for sentence 2, with one column per entry of `words`. (The sentences are stored in lower case with the punctuation removed so that each word matches an entry of `words`.) You don't have to understand the code, but it uses some neat Python functions if you're interested (`enumerate`, `split`, `index`):

In [None]:
words = [ 'would', 'you', 'could', 'in', 'the', 'dark', 'i', 'not']
scores = np.array([ 5.38, 6.24, 5.52, 5.50, 4.98, 3.82, 5.92, 3.86])

t1 = "would you could you in the dark"
t2 = "i would not could not in the dark"

texts = [t1,t2]

X = np.zeros((2,len(words)))
# first, loop over the two texts
for i,text in enumerate(texts):
    # then loop over the words in each text
    for word in text.split():
        j = words.index(word)
        # +1 to the count matrix for the current word
        X[i,j] += 1 
    

In the cell below, look at the count matrix X, and check that the first row corresponds to the counts of words in sentence 1, and that the second row corresponds to the counts of words in sentence 2.

Now, to get the number of words in each sentence, we can just sum `X` along each row using `np.sum(X,axis=1)` (do this below and check):
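
If the `axis` argument is unfamiliar, this sketch shows what each choice does on a small, made-up count matrix `C` (a hypothetical stand-in for `X`):

```python
import numpy as np

# Hypothetical count matrix: 2 texts (rows) by 3 words (columns).
C = np.array([[1, 2, 0],
              [0, 1, 3]])

row_sums = np.sum(C, axis=1)  # add across each row: number of words in each text
col_sums = np.sum(C, axis=0)  # add down each column: total uses of each word
total = np.sum(C)             # no axis given: add up every entry

print(row_sums)
print(col_sums)
print(total)
```

With `axis=1` the result has one entry per row, which is what is needed to divide each text's total score by its own word count.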

And the mean sentiment scores for each text are calculated by matrix multiplying `X` by `scores` and dividing each row by the sum above:

In [None]:
np.matmul(X,scores)/np.sum(X,axis=1)

So which sentence is happier? Does this answer make sense?

### Notes:
- That was a tiny example, but it scales up an extremely long way. We used a dictionary of scores for just 8 words, but I selected them out of a list of 10,000 scores, from [here](http://hedonometer.org/words.html). That dictionary gets used to calculate a daily mean "happiness" of the entire Twittersphere as part of the so-called [hedonometer](http://hedonometer.org/index.html) project. (Full disclosure: I worked on this project, which is why this particular example is in here!) Every day this website scans through the text of ~50 million tweets, builds a giant count matrix $X$ **exactly** like you did, but with 10,000 columns from the words in those tweets, and does **exactly** the matrix multiplication you did to calculate the mean happiness score chart displayed on the website. The moral: Big data analysis is impossible without linear algebra!!
- In truth the dictionary of word scores above would be much better stored in a *Python dictionary* structure, which is actually one of the greatest features of this programming language. We will hopefully get a chance to talk about these explicitly in coming weeks!
- The full list of word scores we used above lives in a text file [here](https://github.com/andyreagan/labMT-simple/blob/master/labMTsimple/data/LabMT/labMT1.txt). If you're interested you might like to code up your own sentiment analysis tool!
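
As a starting point for the dictionary idea in the notes above, here is a rough sketch that stores the eight scores from this lab in a Python dictionary. The names `score_of` and `mean_sentiment` are made up for this example, and skipping words that have no score is an assumption of this sketch rather than a description of how the hedonometer project handles them.

```python
from collections import Counter

# The eight word scores used earlier in this lab, keyed by word.
score_of = {
    'would': 5.38, 'you': 6.24, 'could': 5.52, 'in': 5.50,
    'the': 4.98, 'dark': 3.82, 'i': 5.92, 'not': 3.86,
}

def mean_sentiment(text):
    """Mean score of the words in `text` that have an entry in `score_of`."""
    counts = Counter(text.split())
    # Assumption: words without a score are simply ignored.
    scored = {word: n for word, n in counts.items() if word in score_of}
    total_words = sum(scored.values())
    if total_words == 0:
        return None  # no scored words, so the mean is undefined
    return sum(score_of[word] * n for word, n in scored.items()) / total_words

print(mean_sentiment("would you could you in the dark"))
print(mean_sentiment("i would not could not in the dark"))
```

A dictionary lookup such as `score_of['dark']` replaces the `words.index(word)` search used earlier, which matters once the word list has thousands of entries.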

