In this notebook, I want to wrap up some loose ends from last time.

## The two cultures

This "debate" captures the tension between two approaches:

- modeling the underlying mechanism of a phenomenon
- using machine learning to predict outputs (without necessarily understanding the mechanisms that create them)

<img src="https://github.com/fastai/course-nlp/blob/master/images/glutathione.jpg?raw=1" alt="One carbon cell metabolism" style="width: 80%"/>

I was part of a research project (in 2007) that involved manually coding each of the above reactions.  We were determining whether the final system could generate the same outputs (in this case, levels in the blood of various substrates) as were observed in clinical studies.  

The equation for each reaction could be quite complex:
<img src="https://github.com/fastai/course-nlp/blob/master/images/vcbs.png?raw=1" alt="reaction equation" style="width: 80%"/>

This is an example of modeling the underlying mechanism, and is very different from a machine learning approach.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2391141/

## The most popular word in each state

<img src="https://github.com/fastai/course-nlp/blob/master/images/map-popular-word.png?raw=1" alt="The" style="width: 80%"/>

A time to remove stop words

## Factorization is analogous to matrix decomposition

### With Integers

Multiplication: 
	$$2 * 2 * 3 * 3 * 2 * 2 \rightarrow 144$$
    
<img src="https://github.com/fastai/course-nlp/blob/master/images/factorization.png?raw=1" alt="factorization" style="width: 50%"/>

Factorization is the “opposite” of multiplication: 
	 $$144 \rightarrow 2 * 2 * 3 * 3 * 2 * 2$$
     
Here, the factors have the nice property of being prime.

Prime factorization is much harder than multiplication (which is good, because it’s the heart of encryption).
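
As a small illustration of the integer case, here is a minimal sketch of prime factorization by trial division. The helper name `prime_factors` is made up for this example, and trial division is only practical for small numbers; it is not how the large numbers used in encryption are attacked.

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity) using trial division."""
    factors = []
    d = 2
    while d * d <= n:
        # Divide out each factor d as many times as it appears
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever is left over is itself prime
        factors.append(n)
    return factors

factors = prime_factors(144)
print(factors)  # [2, 2, 2, 2, 3, 3]

# Multiplying the factors back together recovers the original number
product = 1
for f in factors:
    product *= f
print(product)  # 144
```

Note how cheap the final multiplication loop is compared with the search needed to find the factors in the first place.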

### With Matrices

Matrix decompositions are a way of taking matrices apart (the "opposite" of matrix multiplication).

Similarly, we use matrix decompositions to come up with matrices with nice properties.

Taking matrices apart is harder than putting them together.

[One application](https://github.com/fastai/numerical-linear-algebra/blob/master/nbs/3.%20Background%20Removal%20with%20Robust%20PCA.ipynb):

<img src="https://github.com/fastai/course-nlp/blob/master/images/grid1.jpg?raw=1" alt="The" style="width: 100%"/>

What are the nice properties that matrices in an SVD decomposition have?

$$A = USV$$
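
One way to explore this question is to check the factors numerically. The sketch below uses a small random matrix (an assumption for illustration, not the course data) and `np.linalg.svd`, which returns the third factor already in the orientation used in $A = USV$ (NumPy calls it `Vh`).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # toy matrix, purely illustrative

U, s, Vh = np.linalg.svd(A, full_matrices=False)

# Columns of U are orthonormal: U^T U is the identity
orthonormal_U = np.allclose(U.T @ U, np.eye(4))

# Rows of the third factor are orthonormal as well
orthonormal_V = np.allclose(Vh @ Vh.T, np.eye(4))

# Singular values are non-negative and sorted from largest to smallest
sorted_nonneg = bool(np.all(s >= 0) and np.all(np.diff(s) <= 0))

# Multiplying the three factors back together recovers A
reconstructs = np.allclose(U @ np.diag(s) @ Vh, A)

print(orthonormal_U, orthonormal_V, sorted_nonneg, reconstructs)
```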

## Some Linear Algebra Review

### Matrix-vector multiplication

$Ax = b$ takes a linear combination of the columns of $A$, using coefficients $x$

http://matrixmultiplication.xyz/
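
The column view can be checked directly in NumPy. The matrix and vector below are arbitrary values chosen for illustration.

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
x = np.array([10., -1.])

# Standard matrix-vector product
b = A @ x

# Same result, written as a linear combination of the columns of A
b_columns = x[0] * A[:, 0] + x[1] * A[:, 1]

print(b)                          # entries are 8, 26, 44
print(np.allclose(b, b_columns))  # True
```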

### Matrix-matrix multiplication

$A B = C$: each column of $C$ is a linear combination of the columns of $A$, where the coefficients come from the corresponding column of $B$

<img src="https://github.com/fastai/course-nlp/blob/master/images/face_nmf.png?raw=1" alt="NMF on faces" style="width: 80%"/>

(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))
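
The same column view of $AB = C$ can be verified numerically. This is a minimal sketch with small random matrices (illustrative values only).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 2))

C = A @ B

columns_match = True
for j in range(B.shape[1]):
    # Column j of C combines the columns of A, weighted by column j of B
    combo = sum(B[k, j] * A[:, k] for k in range(A.shape[1]))
    columns_match = columns_match and np.allclose(C[:, j], combo)

print(columns_match)  # True
```

This is the picture behind the faces example: each reconstructed face (a column of $C$) is a weighted mix of a shared set of basis images (the columns of $A$).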

### Matrices as Transformations

The 3Blue1Brown [Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) videos are fantastic.  They give a much more visual & geometric perspective on linear algebra than how it is typically taught.  These videos are a great resource if you are a linear algebra beginner, or feel uncomfortable or rusty with the material.

Even if you are a linear algebra pro, I still recommend these videos for a new perspective, and they are very well made.

## British Literature SVD & NMF in Excel

Data was downloaded from [here](https://de.dariah.eu/tatom/datasets.html)

The code below was used to create the matrices which are displayed in the SVD and NMF of British Literature Excel workbook. The data is intended to be viewed in Excel; I've just included the code here for thoroughness.

### Initializing, create document-term matrix

In [3]:
!wget https://liferay.de.dariah.eu/tatom/_downloads/datasets.zip
!unzip datasets.zip

--2019-07-16 19:39:55--  https://liferay.de.dariah.eu/tatom/_downloads/datasets.zip
Resolving liferay.de.dariah.eu (liferay.de.dariah.eu)... 134.76.30.131
Connecting to liferay.de.dariah.eu (liferay.de.dariah.eu)|134.76.30.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57659024 (55M) [application/zip]
Saving to: ‘datasets.zip’


2019-07-16 19:39:59 (17.8 MB/s) - ‘datasets.zip’ saved [57659024/57659024]



In [16]:
!ls ./data/british-fiction-corpus/

ABronte_Agnes.txt      Dickens_David.txt	Richardson_Pamela.txt
ABronte_Tenant.txt     Dickens_Hard.txt		Sterne_Sentimental.txt
Austen_Emma.txt        EBronte_Wuthering.txt	Sterne_Tristram.txt
Austen_Pride.txt       Eliot_Adam.txt		Thackeray_Barry.txt
Austen_Sense.txt       Eliot_Middlemarch.txt	Thackeray_Pendennis.txt
CBronte_Jane.txt       Eliot_Mill.txt		Thackeray_Vanity.txt
CBronte_Professor.txt  Fielding_Joseph.txt	Trollope_Barchester.txt
CBronte_Villette.txt   Fielding_Tom.txt		Trollope_Phineas.txt
Dickens_Bleak.txt      Richardson_Clarissa.txt	Trollope_Prime.txt


In [0]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import decomposition
from glob import glob
import os

In [0]:
np.set_printoptions(suppress=True)

In [0]:
# Collect the paths of every text file in the corpus folder(s)
filenames = []
for folder in ["british-fiction-corpus"]: #, "french-plays", "hugo-les-misérables"]:
    filenames.extend(glob("data/" + folder + "/*.txt"))

In [21]:
filenames

['data/british-fiction-corpus/CBronte_Professor.txt',
 'data/british-fiction-corpus/Thackeray_Pendennis.txt',
 'data/british-fiction-corpus/Sterne_Sentimental.txt',
 'data/british-fiction-corpus/Trollope_Prime.txt',
 'data/british-fiction-corpus/ABronte_Agnes.txt',
 'data/british-fiction-corpus/CBronte_Jane.txt',
 'data/british-fiction-corpus/ABronte_Tenant.txt',
 'data/british-fiction-corpus/Dickens_Hard.txt',
 'data/british-fiction-corpus/Eliot_Adam.txt',
 'data/british-fiction-corpus/Austen_Pride.txt',
 'data/british-fiction-corpus/Eliot_Mill.txt',
 'data/british-fiction-corpus/CBronte_Villette.txt',
 'data/british-fiction-corpus/Eliot_Middlemarch.txt',
 'data/british-fiction-corpus/Sterne_Tristram.txt',
 'data/british-fiction-corpus/Richardson_Clarissa.txt',
 'data/british-fiction-corpus/Austen_Sense.txt',
 'data/british-fiction-corpus/Fielding_Tom.txt',
 'data/british-fiction-corpus/EBronte_Wuthering.txt',
 'data/british-fiction-corpus/Dickens_David.txt',
 'data/british-fiction-co

In [20]:
len(filenames)

27

In [22]:
vectorizer = TfidfVectorizer(input='filename', stop_words='english')
dtm = vectorizer.fit_transform(filenames).toarray()
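# get_feature_names() was removed in scikit-learn 1.2; on older releases use get_feature_names()
vocab = np.array(vectorizer.get_feature_names_out())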
dtm.shape, len(vocab)

((27, 55035), 55035)

In [23]:
# Paths look like data/british-fiction-corpus/<name>.txt, so take the last path component
[os.path.basename(f) for f in filenames]

### NMF

In [0]:
clf = decomposition.NMF(n_components=10, random_state=1)

W1 = clf.fit_transform(dtm)
H1 = clf.components_

In [0]:
num_top_words=8

def show_topics(a):
    # For each topic (row of a), join its num_top_words highest-weighted vocabulary words
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = ([top_words(t) for t in a])
    return [' '.join(t) for t in topic_words]

In [0]:
def get_all_topic_words(H):
    # Sorted vocabulary indices of every word that is a top word in at least one topic
    top_indices = lambda t: {i for i in np.argsort(t)[:-num_top_words-1:-1]}
    topic_indices = [top_indices(t) for t in H]
    return sorted(set.union(*topic_indices))

In [0]:
ind = get_all_topic_words(H1)

In [28]:
vocab[ind]

array(['adams', 'allworthy', 'amelia', 'barry', 'becky', 'bounderby',
       'catherine', 'cathy', 'clavering', 'corporal', 'crawley', 'darcy',
       'did', 'dobbin', 'dorothea', 'earnshaw', 'edgar', 'elinor', 'emma',
       'father', 'finn', 'glegg', 'good', 'hareton', 'hath', 'heathcliff',
       'jones', 'jos', 'joseph', 'know', 'lady', 'laura', 'like',
       'linton', 'little', 'll', 'lopez', 'lydgate', 'lyndon', 'maggie',
       'man', 'marianne', 'miss', 'mr', 'mrs', 'osborne', 'pen',
       'pendennis', 'philip', 'phineas', 'quoth', 'rawdon', 'said',
       'sophia', 'thought', 'time', 'tis', 'toby', 'tom', 'trim',
       'tulliver', 'uncle', 'wakem', 'weston', 'wharton'], dtype='<U31')

In [29]:
show_topics(H1)

['mr said mrs bounderby lydgate know little dorothea',
 'said little like did time know thought good',
 'adams jones joseph allworthy sophia hath said lady',
 'mr elinor emma darcy mrs weston marianne miss',
 'toby said uncle father corporal quoth tis trim',
 'heathcliff linton hareton catherine earnshaw cathy edgar ll',
 'said lyndon pendennis barry man clavering lady pen',
 'phineas said mr lopez finn man wharton laura',
 'crawley osborne rawdon dobbin amelia jos becky said',
 'maggie tulliver said tom glegg philip mr wakem']

In [0]:
W1.shape, H1[:, ind].shape

((27, 10), (10, 64))
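
To make the roles of the two NMF factors concrete, here is a minimal NumPy-only sketch on a made-up toy matrix. It uses the classic Lee and Seung multiplicative updates, which is an assumption chosen for brevity and is not the solver scikit-learn's `NMF` uses by default (that is coordinate descent). All names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "document-term" matrix: 6 documents, 8 terms, 2 hidden topics
X = rng.random((6, 2)) @ rng.random((2, 8))

k = 2
W = rng.random((6, k))   # document-by-topic weights
H = rng.random((k, 8))   # topic-by-term weights
eps = 1e-10              # guards against division by zero

initial_error = np.linalg.norm(X - W @ H)

for _ in range(500):
    # Multiplicative updates keep every entry of W and H non-negative
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(X - W @ H)

print(W.shape, H.shape)
print(initial_error, final_error)
```

As with `W1` and `H1` above, each row of `W` says how much of each topic a document contains, and each row of `H` says how strongly each term belongs to a topic.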

#### Export to CSVs

In [0]:
from IPython.display import FileLink, FileLinks

In [0]:
np.savetxt("britlit_W.csv", W1, delimiter=",", fmt='%.14f')
FileLink('britlit_W.csv')

In [0]:
np.savetxt("britlit_H.csv", H1[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_H.csv')

In [0]:
np.savetxt("britlit_raw.csv", dtm[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_raw.csv')

In [0]:
[str(word) for word in vocab[ind]]

['adams',
 'allworthy',
 'bounderby',
 'brandon',
 'catherine',
 'cathy',
 'corporal',
 'crawley',
 'darcy',
 'dashwood',
 'did',
 'earnshaw',
 'edgar',
 'elinor',
 'emma',
 'father',
 'ferrars',
 'finn',
 'glegg',
 'good',
 'gradgrind',
 'hareton',
 'heathcliff',
 'jennings',
 'jones',
 'joseph',
 'know',
 'lady',
 'laura',
 'like',
 'linton',
 'little',
 'll',
 'lopez',
 'louisa',
 'lyndon',
 'maggie',
 'man',
 'marianne',
 'miss',
 'mr',
 'mrs',
 'old',
 'osborne',
 'pendennis',
 'philip',
 'phineas',
 'quoth',
 'said',
 'sissy',
 'sophia',
 'sparsit',
 'stephen',
 'thought',
 'time',
 'tis',
 'toby',
 'tom',
 'trim',
 'tulliver',
 'uncle',
 'wakem',
 'wharton',
 'willoughby']

### SVD

In [0]:
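# randomized_svd lives in sklearn.utils.extmath in current scikit-learn releases
from sklearn.utils.extmath import randomized_svd
U, s, V = randomized_svd(dtm, n_components=10)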

In [0]:
ind = get_all_topic_words(V)

In [0]:
len(ind)

52

In [0]:
vocab[ind]

array(['adams', 'allworthy', 'bounderby', 'bretton', 'catherine',
       'crimsworth', 'darcy', 'dashwood', 'did', 'elinor', 'elton', 'emma',
       'finn', 'fleur', 'glegg', 'good', 'gradgrind', 'hareton', 'hath',
       'heathcliff', 'hunsden', 'jennings', 'jones', 'joseph', 'knightley',
       'know', 'lady', 'linton', 'little', 'lopez', 'louisa', 'lydgate',
       'madame', 'maggie', 'man', 'marianne', 'miss', 'monsieur', 'mr',
       'mrs', 'pelet', 'philip', 'phineas', 'said', 'sissy', 'sophia',
       'sparsit', 'toby', 'tom', 'tulliver', 'uncle', 'weston'], 
      dtype='<U31')

In [0]:
show_topics(H1)

['mr said mrs miss emma darcy little know',
 'said little like did time know thought good',
 'adams jones said lady allworthy sophia joseph mr',
 'elinor marianne dashwood jennings willoughby mrs brandon ferrars',
 'maggie tulliver said tom glegg philip mr wakem',
 'heathcliff linton hareton catherine earnshaw cathy edgar ll',
 'toby said uncle father corporal quoth tis trim',
 'phineas said mr lopez finn man wharton laura',
 'said crawley lyndon pendennis old little osborne lady',
 'bounderby gradgrind sparsit said mr sissy louisa stephen']

In [0]:
np.savetxt("britlit_U.csv", U, delimiter=",", fmt='%.14f')
FileLink('britlit_U.csv')

In [0]:
np.savetxt("britlit_V.csv", V[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_V.csv')

In [0]:
np.savetxt("britlit_raw_svd.csv", dtm[:,ind], delimiter=",", fmt='%.14f')
FileLink('britlit_raw_svd.csv')

In [0]:
np.savetxt("britlit_S.csv", np.diag(s), delimiter=",", fmt='%.14f')
FileLink('britlit_S.csv')

In [0]:
[str(word) for word in vocab[ind]]

['adams',
 'allworthy',
 'bounderby',
 'bretton',
 'catherine',
 'crimsworth',
 'darcy',
 'dashwood',
 'did',
 'elinor',
 'elton',
 'emma',
 'finn',
 'fleur',
 'glegg',
 'good',
 'gradgrind',
 'hareton',
 'hath',
 'heathcliff',
 'hunsden',
 'jennings',
 'jones',
 'joseph',
 'knightley',
 'know',
 'lady',
 'linton',
 'little',
 'lopez',
 'louisa',
 'lydgate',
 'madame',
 'maggie',
 'man',
 'marianne',
 'miss',
 'monsieur',
 'mr',
 'mrs',
 'pelet',
 'philip',
 'phineas',
 'said',
 'sissy',
 'sophia',
 'sparsit',
 'toby',
 'tom',
 'tulliver',
 'uncle',
 'weston']

## Randomized SVD offers a speedup

<img src="https://github.com/fastai/course-nlp/blob/master/images/svd_slow.png?raw=1" alt="" style="width: 80%"/>

One way to address the slowness of a full SVD is to use randomized SVD.  In the chart below, the error is the difference A - U * S * V, that is, what you've failed to capture in your decomposition:

<img src="https://github.com/fastai/course-nlp/blob/master/images/svd_speed.png?raw=1" alt="" style="width: 60%"/>

For more on randomized SVD, check out my [PyBay 2017 talk](https://www.youtube.com/watch?v=7i6kBz1kZ-A&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=7).

For significantly more on randomized SVD, check out the [Computational Linear Algebra course](https://github.com/fastai/numerical-linear-algebra).
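
To give a feel for the idea, here is a simplified NumPy sketch of a randomized SVD: project the matrix onto a small random subspace, then run an exact SVD on the much smaller projected matrix. The function name `randomized_svd_sketch`, the oversampling parameter, and the test matrix are all assumptions for illustration; this is not scikit-learn's implementation, which adds refinements such as power iterations.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, seed=0):
    """Approximate rank-k SVD of A using a random projection (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Multiply A by a random matrix to sample its column space
    Omega = rng.standard_normal((n, k + n_oversamples))
    Y = A @ Omega
    # Orthonormal basis for the sampled column space
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the much smaller matrix Q^T A
    B = Q.T @ A
    U_small, s, Vh = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vh[:k]

rng = np.random.default_rng(1)
# Hypothetical test matrix: rank 5 plus a little noise
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
A += 0.01 * rng.standard_normal(A.shape)

U, s, Vh = randomized_svd_sketch(A, k=5)
error = np.linalg.norm(A - U @ np.diag(s) @ Vh)

# Compare with the best possible rank-5 error from an exact SVD
U_f, s_f, Vh_f = np.linalg.svd(A, full_matrices=False)
best_error = np.linalg.norm(A - U_f[:, :5] @ np.diag(s_f[:5]) @ Vh_f[:5])

print(error, best_error)
```

The exact truncated SVD gives the smallest achievable error for a given rank, so the interesting comparison is how close the randomized version gets while only decomposing a small matrix.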

## Full vs Reduced SVD

Remember how we were calling `np.linalg.svd(vectors, full_matrices=False)`?  We set `full_matrices=False` to calculate the reduced SVD.  For the full SVD, both U and V are **square** matrices: the extra columns of U extend its columns to an orthonormal basis of the whole space, but they contribute nothing to the product because they are multiplied by the extra rows of zeros in S.

Diagrams from Trefethen:

<img src="https://github.com/fastai/course-nlp/blob/master/images/full_svd.JPG?raw=1" alt="" style="width: 80%"/>

<img src="https://github.com/fastai/course-nlp/blob/master/images/reduced_svd.JPG?raw=1" alt="" style="width: 70%"/>
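
The difference is easy to see from the shapes NumPy returns. This is a minimal sketch on a small random tall matrix (illustrative values only).

```python
import numpy as np

A = np.random.default_rng(0).standard_normal((5, 3))

U_full, s_full, Vh_full = np.linalg.svd(A, full_matrices=True)
U_red, s_red, Vh_red = np.linalg.svd(A, full_matrices=False)

print(U_full.shape, s_full.shape, Vh_full.shape)  # (5, 5) (3,) (3, 3)
print(U_red.shape, s_red.shape, Vh_red.shape)     # (5, 3) (3,) (3, 3)

# For the full SVD, S is 5x3: singular values on the diagonal, then rows of zeros
S_full = np.zeros((5, 3))
S_full[:3, :3] = np.diag(s_full)

# Both versions reconstruct A; the extra columns of U_full are zeroed out by S_full
full_ok = np.allclose(U_full @ S_full @ Vh_full, A)
reduced_ok = np.allclose(U_red @ np.diag(s_red) @ Vh_red, A)
print(full_ok, reduced_ok)  # True True
```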