# DSCI 521: Methods for analysis and interpretation <br> Chapter 1: Processing numeric data

## Exercises
Note: numberings refer to the main notes.

#### 1.0.2.3 Exercise: Creating a matrix 
Using the above matrix-generation techniques, create a 2x2 identity matrix.

In [1]:
import numpy as np

identit2 = np.zeros((2,2))

for i in range(len(identit2)):
    identit2[i,i] = 1

print(identit2)
print(np.identity(2))

[[1. 0.]
 [0. 1.]]
[[1. 0.]
 [0. 1.]]
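
As a cross-check, the same matrix can be built without an explicit loop. The sketch below uses `np.eye` and `np.fill_diagonal`, both standard NumPy functions; the variable names `eye2` and `filled2` are illustrative only and do not come from the main notes.

```python
import numpy as np

# Option 1: np.eye builds an identity matrix directly
eye2 = np.eye(2)

# Option 2: start from zeros and fill the main diagonal in place
filled2 = np.zeros((2, 2))
np.fill_diagonal(filled2, 1)

# Both should agree with np.identity(2)
print(np.array_equal(eye2, np.identity(2)))
print(np.array_equal(filled2, np.identity(2)))
```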


#### 1.0.2.5 Exercise: broadcast operations 
Square each element in the array `A` without using any loops.

In [2]:
A = [1, 2, 3, 4, 5]

A = np.array(A)

A_squares = A ** 2

print(A_squares)

A_squares = np.power(A, 2)

print(A_squares)

[ 1  4  9 16 25]
[ 1  4  9 16 25]


#### 1.1.2.2 Exercise: vector arithmetic 
Calculate `A` - `B` for the provided vectors, without using a loop.

In [3]:
A = [i for i in range(5)]
B = [-i for i in range(5)]

print(A)
print(B)

A = np.array(A)
B = np.array(B)

print(A - B)

[0, 1, 2, 3, 4]
[0, -1, -2, -3, -4]
[0 2 4 6 8]


In [4]:
## compute the Euclidean norms of A - B and of A (square, sum, square root):
print(np.power(np.sum(np.power(A - B, 2)), 0.5), np.power(np.sum(np.power(A, 2)), 0.5))

10.954451150103322 5.477225575051661
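
The manual computation above can be compared against NumPy's built-in norm. This is a minimal sketch that rebuilds `A` and `B` with `np.arange` so that it runs on its own; `manual` and `builtin` are illustrative names.

```python
import numpy as np

A = np.arange(5)     # [0, 1, 2, 3, 4]
B = -np.arange(5)    # [0, -1, -2, -3, -4]

# Manual Euclidean norm: square each entry, sum, take the square root
manual = np.sqrt(np.sum((A - B) ** 2))

# Built-in equivalent: for 1-D input, np.linalg.norm defaults to the 2-norm
builtin = np.linalg.norm(A - B)

print(np.isclose(manual, builtin))
```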


#### 1.1.2.5 Exercise: scalar multiplication
Divide each element in the above vector `v` by 4.

In [5]:
v = np.array([9., 2., 4., 8., 1.])

v / 4

array([2.25, 0.5 , 1.  , 2.  , 0.25])

#### 1.1.2.7 Exercise: pointwise vector multiplication 
Perform pointwise multiplication between `v` divided by 4 which you calculated above, and `u`.

In [6]:
u = np.array([-5., 3.7, 2., 10., 0.])

(v / 4) * u

array([-11.25,   1.85,   2.  ,  20.  ,   0.  ])

#### 1.1.2.9 Exercise: inner products 
Find the dot product of the provided array `z` with `u` and `v`, respectively, from above.

In [7]:
z = np.array([1, 2, 3, 4, 5])

u = np.array([-5., 3.7, 2., 10., 0.])
v = np.array([9., 2., 4., 8., 1.])

print(z.dot(u))
print(z.dot(v))

48.4
62.0


#### 1.1.2.11 Exercise: cosine similarity 
Find the cosine similarity of the vectors `a` and `b`.

In [8]:
a = np.array([1, 0, 0])
b = np.array([0, 1, 0])

a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

0.0
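
To see how the value behaves beyond the orthogonal case, the expression can be wrapped in a small helper. The function name `cosine_similarity` and the extra vectors `c` and `d` are assumptions made for illustration. Orthogonal vectors should score 0, parallel vectors 1, and opposite vectors -1.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero 1-D vectors."""
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])    # orthogonal to a
c = np.array([2, 0, 0])    # parallel to a, different length
d = np.array([-1, 0, 0])   # opposite direction to a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
print(cosine_similarity(a, d))
```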

#### 1.1.3.3 Exercise: Matrix indexing 
Print the value in the above matrix `A` located in the bottom-right entry.

In [9]:
A = np.array([
    [ 1,  2,  3],
    [ 4,  5,  6],
    [ 7,  8,  9],
    [10, 11, 12]
])

print(A[-1,-1])

12


#### 1.1.4.2 Exercise: matrix addition 
Using `A` and `B` from above, find the sum `A` + `B` + `A`.

In [10]:
## define a 4-row by 3-column matrix
A = np.array([
    [ 1,  2,  3],
    [ 4,  5,  6],
    [ 7,  8,  9],
    [10, 11, 12]
])

print(A, '\n')

## define another 4-row by 3-column matrix
B = np.array([
    [ 11,  12,  13],
    [ 14,  15,  16],
    [ 17,  18,  19],
    [ 20,  21,  22]
])

print(B, '\n')

## take the matrix sum of the two
print(2*A + B, "\n\n", A + A + B)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]] 

[[11 12 13]
 [14 15 16]
 [17 18 19]
 [20 21 22]] 

[[13 16 19]
 [22 25 28]
 [31 34 37]
 [40 43 46]] 

 [[13 16 19]
 [22 25 28]
 [31 34 37]
 [40 43 46]]


#### 1.1.4.4 Exercise: Scalar matrix multiplication
Divide each element in the above matrix `A` by 4.

In [11]:
A = np.array([
    [ 1,  2,  3],
    [ 4,  5,  6],
    [ 7,  8,  9],
    [10, 11, 12]
])

print(A.shape)
print(A/4)

(4, 3)
[[0.25 0.5  0.75]
 [1.   1.25 1.5 ]
 [1.75 2.   2.25]
 [2.5  2.75 3.  ]]


## Additional In-depth Exercises

### A. Getting to know numpy and word vectors
While we won't begin with text analysis formally until next chapter, we can take some pre-computed data to start exploring language with linear algebraic methods.

#### Main data
The main data is the numpy array file at `path = './01-numeric/data/vectors-' + str(dim) + '-.npy'`, which holds a `dim = 10` dimensional semantic representation of the words in Mary Shelley's Frankenstein, Or the Modern Prometheus. The object has three linear-algebraic components, $U$, $V$, and $b$, each with row dimension $N$, the vocabulary size. Once loaded, these are laid out side by side as columns, with $V$ first, then $U$, then $b$:

$$  
\begin{bmatrix} 
v_{1,1} & v_{1,2} & \dots & v_{1,10} \\ 
v_{2,1} & v_{2,2} & \dots & v_{2,10} \\ 
\vdots & \vdots & \ddots & \vdots \\ 
v_{N,1} & v_{N,2} & \dots & v_{N,10} 
\end{bmatrix} 
\begin{bmatrix} 
u_{1,1} & u_{1,2} & \dots & u_{1,10} \\ 
u_{2,1} & u_{2,2} & \dots & u_{2,10} \\ 
\vdots & \vdots & \ddots & \vdots \\ 
u_{N,1} & u_{N,2} & \dots & u_{N,10} 
\end{bmatrix} 
\begin{bmatrix} 
 b_{1} \\ 
 b_{2} \\ 
 \vdots \\ 
 b_{N} 
\end{bmatrix} 
$$

#### A.1 Load a `numpy` matrix with pre-set values from disk.

Load the vectors from disk using `np.load(path)` and slice the resulting object to produce `U`, `V`, and `b` arrays. Report the dimensions of each array to confirm their structure (and the size of the vocabulary).

In [12]:
import numpy as np

dim = 10
vectors = np.load('./data/vectors-10.npy')
V = vectors[:,:dim]
U = vectors[:,dim:2*dim]
b = vectors[:,-1]

vectors.shape, V.shape, U.shape, b.shape

((7273, 21), (7273, 10), (7273, 10), (7273,))

#### A.2 Save the `numpy` matrices to disk.
Save the `U`, `V`, and `b` arrays to disk under the paths:

- `./01-numeric/data/U.npy`, 
- `./01-numeric/data/V.npy`, and 
- `./01-numeric/data/b.npy`.

using the `np.save(path, array)` command.

In [13]:
np.save("./data/V.npy", V)
np.save("./data/U.npy", U)
np.save("./data/b.npy", b)
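
A quick way to confirm that a save worked is to reload the file and compare it with the original. The sketch below does this round trip on a made-up array in a temporary directory, so it does not touch the course data; `toy` and the file name `toy.npy` are hypothetical.

```python
import os
import tempfile
import numpy as np

# Small stand-in array (not the real U, V, or b)
toy = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.npy")
    np.save(path, toy)          # write the array in .npy format
    reloaded = np.load(path)    # read it back into memory

# The reloaded array should match the original exactly
print(np.array_equal(toy, reloaded))
```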

#### A.3 Load the linked data for the vectors and build the word index
Load `data = json.load(open('./data/vectors-linked_data.json'))` and inspect the resulting object. For now, you'll only need one of its values, `data['counts']`, which is a dictionary, keyed by words, with count-occurrence values (frequencies).

Using `data['counts']`, construct a `word_index` object, which is a dictionary of the form:
```
word_index = {
    w: representation_index,
    ...
}
```

such that `w` is any word string keying `data['counts']` and `representation_index` is the row index of the `U`-`V`-`b` representation of `w`. Hint: Python (3.7 and later) dictionaries hold keys in the order in which they were inserted, i.e., the order presented in the JSON string serialization. Use this and the fact that the words were counted in the book's reading order.

In [14]:
import json
data = json.load(open('./data/vectors-linked_data.json'))
print(data.keys())
word_index = {w: i for i, w in enumerate(data['counts'])}

dict_keys(['counts', 'm_map', 'position', 'sentences'])
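
The ordering hint can be checked on a tiny example. The JSON string below is a made-up stand-in for the linked data (the words and counts are invented for illustration), and `toy_index` is a hypothetical name. Each key receives the position at which it appears in the serialized string.

```python
import json

# Invented stand-in for the 'counts' entry of the linked data
raw = '{"counts": {"letter": 4, "to": 4000, "mrs": 12}}'
toy_data = json.loads(raw)

# Enumerating a dict walks its keys in insertion (here: file) order
toy_index = {w: i for i, w in enumerate(toy_data["counts"])}

print(toy_index)  # {'letter': 0, 'to': 1, 'mrs': 2}
```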


#### A.4 Write a flexible concatenation function
We'll want to conditionally concatenate the different representation components `U`, `V`, and `b`, but will assume that we're always going to use `U`. So write a function called `concatenate(U, V = 0, b = 0)` that stacks whichever of the arguments are non-zero side by side (as columns).

[Hint. Use the `np.column_stack()` function!]

In [15]:
def concatenate(U, V = 0, b = 0):
    # keep only the components passed as arrays (an int such as 0 means "skip")
    to_concatenate = [thing for thing in (U, V, b) if not isinstance(thing, int)]
    if not to_concatenate:
        print("at least one array must be non-empty")
        return np.array([])
    # stack the remaining components side by side as columns
    return np.column_stack(to_concatenate)
concatenate(U, V, b).shape

(7273, 21)
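
The reason `np.column_stack` suits this task is that it accepts a mix of 2-D arrays and 1-D arrays, turning each 1-D array into a single column. A minimal sketch with made-up arrays (`U_toy`, `V_toy`, and `b_toy` are illustrative, not the real data):

```python
import numpy as np

U_toy = np.ones((4, 3))     # 4 "words", 3 dimensions
V_toy = np.zeros((4, 3))    # same shape as U_toy
b_toy = np.arange(4)        # 1-D array, one value per "word"

# 1-D inputs become single columns; 2-D inputs are stacked as-is
stacked = np.column_stack((U_toy, V_toy, b_toy))

print(stacked.shape)  # (4, 7)
```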

#### A.5 Write a cosine similarity function that determines the most similar (word) vectors
Use the `concatenate()` function from the previous cell to flexibly stack the different columns. This will allow us to explore how/where the representation stores semantics.

_Accepts_:
- `w`: the target string to measure similarity against
- `word_index`: the word-to-row dictionary from __A.3__
- `U`, `V (= 0)`, `b (= 0)`: the semantic arrays from the representation, or an integer (null) to leave a component out
- `top (= 10)`: an integer giving the number of 'most similar' words/scores to return, and
- `v (= 0)`: an optional vector, with the same dimension as the stacked `U` (plus `V` and/or `b`) rows, to compare against the vocabulary's vectors instead of the target word's.

_Returns_:
- `w_sims`: a sorted list of `top` tuples of the form `(v, w_v_similarity)`, sorted high to low by the `w_v_similarity` (cosine) values between the vectors for words `w` and `v`.

Note: vectors must be unit-normed in order to compute these similarities!

In [16]:
def most_similar(w, word_index, U, V=0, b=0, top=10, v = 0):
    vec = concatenate(U, V, b)
    # unit-norm each row of the stacked representation
    vec = vec / np.linalg.norm(vec, axis=1)[:, np.newaxis] # broadcasting
    # default to the target word's own (normed) vector when no `v` is supplied
    if isinstance(v, int): v = vec[word_index[w],:]
    similar = sorted(enumerate(list(vec.dot(v))), 
                          key = lambda x: x[1], reverse = True)
    types = list(word_index.keys())
    if not top: top = vec.shape[0] # top = 0 returns the full vocabulary
    word_sims = [(types[ix], sim) for ix, sim in similar[:top]]
    return word_sims
most_similar('she', word_index, U, V, b, top=10)

[('she', 1.0),
 ('her', 0.9738238945046376),
 ('very', 0.9727472319478015),
 ('we', 0.9719620112096093),
 ('he', 0.9655729080775022),
 ('justine', 0.964113201603028),
 ('dear', 0.9623722968823758),
 ('his', 0.961025308751945),
 ('poor', 0.949050515881002),
 ('a', 0.9445971362834037)]
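
The core of the function is that, once every row has unit length, a single matrix-vector product yields all cosine similarities at once. The sketch below shows this on three made-up 2-D vectors; `vecs`, `unit`, `sims`, and `order` are illustrative names, and `keepdims=True` is used here as an alternative to the `np.newaxis` indexing above.

```python
import numpy as np

# Three invented row vectors standing in for word representations
vecs = np.array([[ 3.0, 4.0],
                 [ 4.0, 3.0],
                 [-4.0, 3.0]])

# Divide each row by its own length so that every row has norm 1
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Dot every unit row with the first unit row: cosine similarity to row 0
sims = unit.dot(unit[0])

# Row indices sorted from most to least similar
order = np.argsort(-sims)

print(sims)
print(order)
```

Row 0 compared with itself scores 1, the nearby row 1 scores 0.96, and the perpendicular row 2 scores 0 (up to floating-point rounding).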

#### A.6 Build an analogy generator
The classic motivation for the utility of word2vec is its capacity to represent semantic constructs that 'locally' obey linear relationships. The best-known of these constructs are, loosely, analogies, e.g., across gender:
$$
\hat{v}_\text{queen}\approx \frac{1}{3}\left(v_\text{king} - v_\text{man} + v_\text{woman}\right).
$$

Here, build an analogy generator by writing a function:
- `analogy(positive, negative, word_index, U, V=0, b=0, top=10)`,

which accepts a list of two words called `positive` and a string called `negative` (in addition to the other arguments from __A.5__), computes the above linear combination, and passes it as `v` to the function from __A.5__ to find the words in the vocabulary most similar to that combination.

In [17]:
def analogy(positive, negative, word_index, U, V=0, b=0, top=10):
    vec = concatenate(U, V, b)
    v_hat = (vec[[word_index[w] for w in positive],:].sum(axis = 0) - vec[word_index[negative],:])/3
    return most_similar('', word_index, U, V, b, top, v_hat)
analogy(['father', 'creation'], 'monster', word_index, U, V, b, 25)

[('greatest', 0.1826567889499377),
 ('mother', 0.16515019232053824),
 ('of', 0.13401675657011436),
 ('m', 0.12988934133664037),
 ('to', 0.12635711683792133),
 ('child', 0.12585030505433098),
 ('years', 0.1234544843702866),
 ('time', 0.12307770986396062),
 ('and', 0.1207347591632278),
 ('very', 0.11414261424913746),
 ('natural', 0.11145783399374459),
 ('in', 0.11083148324067152),
 ('had', 0.10471441583413287),
 ('which', 0.10229660669490491),
 ('was', 0.09951253650731141),
 ('a', 0.0986087505470151),
 ('science', 0.09816803333510984),
 ('her', 0.0976394831690304),
 ('i', 0.0953172936524621),
 ('most', 0.09388143215089532),
 ('waldman', 0.08835233157641761),
 ('agrippa', 0.08767361614651456),
 ('dear', 0.08749062401763066),
 ('on', 0.08683379748941522),
 ('been', 0.08601971701558453)]
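
One property worth noting about the analogy vector is that the $\frac{1}{3}$ factor rescales its length but not its direction, so it cannot change a cosine-based ranking. The sketch below checks this on random made-up vectors, assuming the query vector is itself divided by its norm; `vocab`, `combo`, and `cosine_to_all` are hypothetical names, not part of the solution above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = rng.normal(size=(6, 4))    # invented "word vectors"

# Unit-norm each row, as in most_similar
unit = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)

# Analogy-style combination: positive + positive - negative
combo = vocab[0] + vocab[1] - vocab[2]

def cosine_to_all(v):
    """Cosine similarity between v and every row of the vocabulary."""
    return unit.dot(v) / np.linalg.norm(v)

# Scaling the query by 1/3 leaves every cosine similarity unchanged
print(np.allclose(cosine_to_all(combo), cosine_to_all(combo / 3)))
```

If the query vector is not normed, every score is scaled by the same positive constant (the query's length), so the order of the results stays the same even though the scores are no longer true cosines.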