Quick implementation of a Term Document Matrix from fast.ai Machine Learning 1 [Lesson 10 (1:06:51)](https://youtu.be/37sFIak42Sc?t=4011). FastAI library can be downloaded from: https://github.com/fastai/fastai/tree/master/fastai

***NOTE*** this is a toy implementation that only records *whether* a word is in a line instead of counting *how many times* the word appears.

-- Wayne Nixalo

---

## Term-Document Matrix example

In [1]:
def unique_words(corpus):
    bag = dict()
    for text in corpus:
        for word in text.split(' '):
            bag[word] = 1
    return list(bag.keys())

def print_TD_matrix():
    
    bow = bag_of_words

    # Print out formatted Term-Document Matrix
    for i in range(len(corpus)+2):
        if i == 0:
            print(f'{"TEXT":^18} | LABEL | {"FEATURES (f)":^27}')
        elif i == 1:
            print(f'{" ":^18} | {" ":<5} | {bow[0]:5}{bow[1]:^5}{bow[2]:^4}{bow[3]:}{bow[4]:^7}{bow[5]:^4}')
        else:
            print(f'{corpus[i-2]:<18} | {labels[i-2]:^6}| '
                  f'{TD_matrix[i-2][0]:^3}{TD_matrix[i-2][1]:^7}{TD_matrix[i-2][2]:^3}'
                  f'{TD_matrix[i-2][3]:^7}{TD_matrix[i-2][4]:^4}{TD_matrix[i-2][5]:^5}')
        if i == len(corpus)+1:
            print(f'{"-"*26:<26} | '
                  f'{"1":^3}{"1":^7}{"1":^3}'
                  f'{"1":^7}{"1":^4}{"1":^5}')
            for f in range(len(frequencies)):
                print(f'{"p(c="+str(1-f)+")":<18} | {probabilities[1-f]:^5} | '
                      f'{frequencies[f][0]:^3.2f}{frequencies[f][1]:^6.2f}{frequencies[f][2]:^5.2f}'
                      f'{frequencies[f][3]:^4.2f}{frequencies[f][4]:^6.2f}{frequencies[f][5]:^2.2f}')

def print_NB():
    # p_doc[c][d] = p(d|c): product of p(f|c) over the words present in document d
    p_doc = [[1 for _ in range(n_docs)] for _ in range(len(probabilities))]
    for d in range(n_docs):
        for w in range(n_wrds):
            if TD_matrix[d][w]:
                for c in (0, 1):
                    # frequencies[0] holds class 1 and frequencies[1] holds class 0
                    p_doc[c][d] *= frequencies[1 - c][w]

    print(f'{"p(d|1)":>13}{"p(d|0)":>7}{"ratio":>6}')
    for d in range(n_docs):
        print(f'Doc {d+1:}: {p_doc[1][d]:^6.3f}{p_doc[0][d]:^7.3f}{p_doc[1][d]/p_doc[0][d]:^7.3f}')

In [2]:
corpus = ['this movie is good', 'the movie is good', 
        'this movie is bad', 'the movie is bad']
bag_of_words = unique_words(corpus)

n_docs = len(corpus)
n_wrds = len(bag_of_words)

labels = [1,1,0,0]
labels_avg = sum(labels)/n_docs
probabilities = {0:1-labels_avg, 1:labels_avg}

# split each text so we test whole words, not substrings (e.g. 'is' inside 'this')
TD_matrix = [[int(word in text.split(' ')) for word in bag_of_words] for text in corpus]
frequencies = [[(sum([TD_matrix[r+(f*2)][c] 
                      for r in range(n_docs//2)]) + 1)/(n_docs//2 + 1) 
                for c in range(n_wrds)] for f in range(2)]


# Display Bag of Words representation of reviews
print_TD_matrix()
print()
print_NB()

       TEXT        | LABEL |        FEATURES (f)        
                   |       | this movie is good  the  bad 
this movie is good |   1   |  1    1    1    1    0    0  
the movie is good  |   1   |  0    1    1    1    1    0  
this movie is bad  |   0   |  1    1    1    0    0    1  
the movie is bad   |   0   |  0    1    1    0    1    1  
-------------------------- |  1    1    1    1    1    1  
p(c=1)             |  0.5  | 0.67 1.00 1.00 1.00 0.67 0.33
p(c=0)             |  0.5  | 0.67 1.00 1.00 0.33 0.67 1.00

       p(d|1) p(d|0) ratio
Doc 1: 0.667  0.222  3.000 
Doc 2: 0.667  0.222  3.000 
Doc 3: 0.222  0.667  0.333 
Doc 4: 0.222  0.667  0.333 


Sklearn has a function to create a Term-Document matrix:

`from sklearn.feature_extraction.text import CountVectorizer`
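
For comparison with the toy binary version above, here is a minimal plain-Python sketch of a count-based matrix, which is the kind of result `CountVectorizer` produces by default. The helper name `count_matrix` and the whitespace split are assumptions for illustration; `CountVectorizer` has its own tokenization and returns a sparse matrix.

```python
from collections import Counter

def count_matrix(corpus):
    """Return (vocab, matrix) where matrix[d][w] is how often vocab[w] occurs in corpus[d]."""
    # Alphabetically sorted vocabulary over all documents
    vocab = sorted({word for text in corpus for word in text.split()})
    counts = [Counter(text.split()) for text in corpus]
    # A Counter returns 0 for words that do not occur in a document
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = count_matrix(['good good movie', 'bad movie'])
print(vocab)   # ['bad', 'good', 'movie']
print(matrix)  # [[0, 2, 1], [1, 0, 1]]
```

Unlike the toy matrix, the entry for `good` in the first document is 2 rather than 1.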

---

## Tokenizing and Term-Document Matrix Creation

This section uses the IMDB movie reviews dataset for sentiment classification.

The dataset can be downloaded from a terminal via:

```
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

gunzip aclImdb_v1.tar.gz

tar -xvf aclImdb_v1.tar
```

In [3]:
from fastai.nlp import *
from sklearn.linear_model import LogisticRegression

import pathlib

In [4]:
PATH = pathlib.Path('../../data/aclImdb')
names = ['neg', 'pos']

In [5]:
trn, trn_y = texts_labels_from_folders(PATH / 'train', names)
val, val_y = texts_labels_from_folders(PATH / 'test', names)

`pathlib`: [docs](https://docs.python.org/3/library/pathlib.html) | [cheatsheet](http://pbpython.com/pathlib-intro.html)

```
tokenize: <function fastai.text.tokenize>
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
```
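
To make the one-liner above concrete, here is a small sketch of the same idea. The names `re_punct` and `simple_tokenize` are hypothetical, and the pattern (ASCII punctuation only) is an assumption standing in for fastai's `re_tok`, which may cover more characters.

```python
import re
import string

# Hypothetical stand-in for fastai's `re_tok`: capture any single ASCII punctuation mark
re_punct = re.compile(f'([{re.escape(string.punctuation)}])')

def simple_tokenize(s):
    # Pad each punctuation mark with spaces, then split on whitespace
    return re_punct.sub(r' \1 ', s).split()

tokens = simple_tokenize("It's good, isn't it?")
print(tokens)
# ['It', "'", 's', 'good', ',', 'isn', "'", 't', 'it', '?']
```

Punctuation becomes its own token, so `good,` and `good` map to the same vocabulary entry.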


[`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) converts a collection of text documents to a matrix of token counts (part of `sklearn.feature_extraction.text`)

In [6]:
vectorizer = CountVectorizer(tokenizer=tokenize) # the 'model'

`fit_transform(trn)` finds the vocabulary in the training set and also transforms the training set into a Term-Document Matrix. Since we have to apply the *same transformation* to our validation set, the second line uses just the method `transform(val)`.

`trn_term_doc` and `val_term_doc` are sparse matrices. `trn_term_doc[i]` represents training document `i` and contains the count of each vocabulary word in that document.

In [7]:
trn_term_doc = vectorizer.fit_transform(trn) # create TD mat & transform
val_term_doc = vectorizer.transform(val)     # apply same transformation (vocab)
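
The split between fitting and transforming can be sketched in plain Python. The helpers `fit_vocab` and `transform` below are hypothetical and only mimic the idea; they are not how sklearn implements it.

```python
def fit_vocab(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({w for doc in train_docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count vocabulary words per document; words missing from vocab are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

vocab = fit_vocab(['good movie', 'bad movie'])   # fit on training data only
val_rows = transform(['good good film'], vocab)  # 'film' is unseen, so it is dropped
print(val_rows)  # [[0, 2, 0]]
```

Because the validation set reuses the training vocabulary, both matrices have the same columns in the same order.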

In [8]:
trn_term_doc  # 75,132 unique words in vocab for 25,000 reviews

<25000x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 3749745 stored elements in Compressed Sparse Row format>

In [9]:
trn_term_doc[0]  # 104 unique words in this review

<1x75132 sparse matrix of type '<class 'numpy.int64'>'
	with 104 stored elements in Compressed Sparse Row format>

A sample of words in the vocabulary:

In [10]:
vocab = vectorizer.get_feature_names(); n = 10000; vocab[n:n+5]

['cabbage', 'cabbages', 'cabbie', 'cabby', 'cabel']

In [11]:
# looking up vocab index of a word
vectorizer.vocabulary_['cabbage']

10000

In [12]:
# checking the count of term 10000 ('cabbage') in doc 0
trn_term_doc[0,10000]

0

## Naïve Bayes 

A **log-count ratio** $r$ for each word $f$ is defined as:

$r = \log \frac{\text{ratio of feature }\ f \ \text{ in positive documents}}{\text{ratio of feature }\ f \ \text{ in negative documents}}$

The ratio of feature $f$ in positive documents is the **number of times a positive document has feature $\boldsymbol{f}$** divided by the **number of positive documents**.
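
As a worked example, take the word "good" in the toy corpus from the first section, assuming the same add-one smoothing used in its table (one extra document containing every word is added to each class). Both positive documents and neither negative document contain "good":

$$
r_{\text{good}} = \log \frac{(2 + 1)/(2 + 1)}{(0 + 1)/(2 + 1)} = \log \frac{1}{1/3} = \log 3 \approx 1.10
$$

By the same arithmetic, "bad" gets $r = \log \frac{1}{3} \approx -1.10$, and a word such as "movie" that appears in every document gets $r = \log 1 = 0$.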

[Latex reference](https://www.sharelatex.com/learn/Fractions_and_Binomials)

Taking the log allows us to add and subtract instead of multiplying and dividing. It also helps numerical stability: a product of many probabilities near zero can become too small to represent in floating point and underflow to zero.
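
A quick illustration of the underflow problem, using made-up probabilities (100 words, each with probability `1e-5`):

```python
import math

probs = [1e-5] * 100          # 100 small per-word probabilities (made-up values)

product = 1.0
for p in probs:
    product *= p              # the true value, 1e-500, is too small for a float
print(product)                # 0.0

# Summing logs keeps the same information in a comfortable numeric range
log_sum = sum(math.log(p) for p in probs)
print(round(log_sum, 2))      # -1151.29
```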

In [13]:
x = trn_term_doc # indep var
y = trn_y        # dep var

p = x[y==1].sum(0) + 1  # sum positives
q = x[y==0].sum(0) + 1  # sum negatives
r = np.log((p/p.sum())/(q/q.sum())) # log of ratios
b = np.log((y==1).mean() / (y==0).mean()) # log ratio of class probabilities

Formula for Naïve Bayes:

* for each document, multiply the word counts by the log-count ratios `r`: a matrix multiply

* add the log of the class ratio: add `b`

* compare the result to zero (not 1, since we're in log space now) to get a prediction

* take the mean of the correct predictions to get the accuracy
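
The steps above can be sketched in plain Python on the toy corpus from the first section. This is a minimal sketch under stated assumptions: it uses binarized features and the add-one-smoothed per-document fraction for each word, and it scores the training documents themselves, so the accuracy is only a sanity check. The helper name `feature_ratio` is hypothetical.

```python
import math

corpus = ['this movie is good', 'the movie is good',
          'this movie is bad', 'the movie is bad']
labels = [1, 1, 0, 0]

vocab = sorted({w for text in corpus for w in text.split()})
# Binarized term-document matrix: 1 if the word occurs in the document
x = [[int(w in text.split()) for w in vocab] for text in corpus]

def feature_ratio(cls):
    """Smoothed fraction of documents of class `cls` that contain each word."""
    rows = [row for row, y in zip(x, labels) if y == cls]
    return [(sum(col) + 1) / (len(rows) + 1) for col in zip(*rows)]

p, q = feature_ratio(1), feature_ratio(0)
r = [math.log(pi / qi) for pi, qi in zip(p, q)]      # log-count ratio per word
b = math.log(labels.count(1) / labels.count(0))      # log of the class ratio

# Matrix-vector product (x @ r), plus b, compared against zero
scores = [sum(xi * ri for xi, ri in zip(row, r)) + b for row in x]
preds = [int(s > 0) for s in scores]
accuracy = sum(pred == y for pred, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)  # [1, 1, 0, 0] 1.0
```

Only "good" and "bad" end up with nonzero ratios here, so they alone decide each prediction.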

In [14]:
pre_preds = val_term_doc @ r.T + b
preds = pre_preds.T > 0
(preds == val_y).mean()

0.8074

80.7% accuracy.

Binarized Naïve Bayes (`.sign()` maps every nonzero count to 1, so only the presence of a word matters):

In [15]:
pre_preds = val_term_doc.sign() @ r.T + b
preds = pre_preds.T > 0
(preds == val_y).mean()

0.82624

## Logistic Regression

Instead of assuming that the coefficients in $r$ are the ones to use, we can learn them. Here's how we can fit logistic regression where the features are the unigrams.

This'll give us something with the same functional form as before (`val_term_doc @ r.T + b`), but instead of using a theoretical fixed $r$ and $b$, they'll be calculated based on logistic regression.

`dual=True` makes LogisticRegression run faster when the matrix is wider than it is tall by using a mathematically equivalent reformulation.

In [16]:
model = LogisticRegression(C=1e8, dual=True)
model.fit(x, y)
preds = model.predict(val_term_doc)
(preds == val_y).mean()

0.85728

Binarized Logistic Regression:

In [17]:
model = LogisticRegression(C=1e8, dual=True)
model.fit(trn_term_doc.sign(), y)
preds = model.predict(val_term_doc.sign())
(preds == val_y).mean()

0.85504

LogReg is going to give us a coefficient for every term in our vocabulary. This gives us about 75,000 coefficients, which seems high given we have only about 25,000 reviews.

We can regularize this with sklearn's built-in regularization via the `C` parameter. Smaller `C` means more regularization (so a large value like `C=1e8` effectively turns it off).
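
To see why a smaller `C` means more regularization, it helps to look at the objective being minimized. As I understand the scikit-learn user guide, the L2-penalized form is roughly the following, with labels $y_i \in \{-1, +1\}$, weights $w$ and intercept $c$; treat the exact scaling as an assumption:

$$
\min_{w,\, c} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \,(x_i^T w + c)\right)\right)
$$

`C` multiplies the data-fit term, so a huge `C` makes the penalty $\frac{1}{2} w^T w$ negligible, while a small `C` lets the penalty dominate and pulls the weights toward zero.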

In [18]:
model = LogisticRegression(C=0.1, dual=True)
model.fit(x, y)
preds = model.predict(val_term_doc)
(preds == val_y).mean()

0.88236

In [19]:
model = LogisticRegression(C=0.1, dual=True)
model.fit(x.sign(), y)
preds = model.predict(val_term_doc.sign())
(preds == val_y).mean()

0.88404

* L1 regularization penalizes the absolute values of the weights
* L2 regularization penalizes the squares of the weights

L1 regularization tries to make as many weights as possible exactly zero. L2 regularization tries to make *every* weight smaller.
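
One way to see this difference is to look at the shrinkage step each penalty induces on a single weight. The sketch below uses the standard proximal steps for a penalty of strength `lam` (soft-thresholding for L1, proportional scaling for L2). It illustrates the penalties only and is not what sklearn's solver does internally; the function names are hypothetical.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: the shrinkage step for the penalty lam * |w|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0   # small weights are set exactly to zero

def l2_shrink(w, lam):
    """Proportional shrinkage: the step for the penalty (lam / 2) * w**2."""
    return w / (1 + lam)

weights = [2.0, 0.3, -0.1, -1.5]
l1_result = [l1_shrink(w, 0.5) for w in weights]
l2_result = [l2_shrink(w, 1.0) for w in weights]
print(l1_result)   # [1.5, 0.0, 0.0, -1.0]
print(l2_result)   # [1.0, 0.15, -0.05, -0.75]
```

The L1 step zeroes the two small weights outright, while the L2 step halves every weight and leaves none of them at zero.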

For our purposes here, L1 regularization is appropriate, though L2 is the default in sklearn's `LogisticRegression` (and the only penalty that works with `dual=True`).

The difference between L1/L2 isn't so important in modern ML since we rarely try to directly interpret the coefficients; instead we try to understand our models through interrogation.

## Trigram with NB features

Our next model is a version of Logistic Regression with Naïve Bayes features described [here](https://www.aclweb.org/anthology/P12-2018). For every document we compute binarized features as described above, but this time we use bigrams and trigrams too. Each feature is a log-count ratio. A Logistic Regression model is then trained to predict sentiment.

When you initialize the `CountVectorizer`, you can also ask for n-grams -- by default we get unigrams (single words). If we specify `ngram_range=(1,3)` we'll get bigrams and trigrams in addition to unigrams.
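
As a rough picture of what `ngram_range=(1,3)` produces for one tokenized document, here is a plain-Python sketch. The helper `ngrams` is hypothetical and is not sklearn's implementation.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """Return all n-grams (joined by spaces) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

grams = ngrams('this movie is not good'.split())
print(grams)
# ['this', 'movie', 'is', 'not', 'good',
#  'this movie', 'movie is', 'is not', 'not good',
#  'this movie is', 'movie is not', 'is not good']
```

The bigram `not good` now exists as its own feature, which a unigram model cannot represent.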

In [20]:
vectorizer = CountVectorizer(ngram_range=(1,3), 
                             tokenizer=tokenize, max_features=800000)
trn_term_doc = vectorizer.fit_transform(trn)
val_term_doc = vectorizer.transform(val)

In [21]:
trn_term_doc.shape

(25000, 800000)

In [22]:
vocab = vectorizer.get_feature_names()

In [23]:
n = 600000; vocab[n:n+5]

['simplicity and',
 'simplicity of',
 'simplicity of the',
 'simplicity that',
 'simplification']

Trigram features dramatically improve Naïve Bayes and Logistic Regression. This technique is very helpful for getting more out of Bag of Words approaches because it allows us to see the difference between [not good] vs [not bad], or [good] vs [" good "], etc.

The `CountVectorizer` will sort n-grams by how often they appear and cut off at the `max_features` most common n-grams. `max_features` can be specified as some 'large' number that loads reasonably quickly and provides good results.
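
The cutoff can be pictured with a small plain-Python sketch. The helper `top_features` is hypothetical, uses single words instead of n-grams for brevity, and does not reproduce how sklearn breaks ties.

```python
from collections import Counter

def top_features(docs, max_features):
    """Keep only the max_features most frequent tokens across all documents."""
    totals = Counter(tok for doc in docs for tok in doc.split())
    kept = [tok for tok, _ in totals.most_common(max_features)]
    return sorted(kept)

docs = ['good movie', 'good plot', 'bad movie', 'good cast']
kept = top_features(docs, 2)
print(kept)  # ['good', 'movie']
```

Rare tokens such as `plot` and `cast` are dropped, which is what keeps the trigram vocabulary at a manageable size.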

In [28]:
def pr(y_i):
    p = x[y == y_i].sum(0)
    return (p + 1)/((y == y_i).sum() + 1)

In [29]:
y = trn_y
x = trn_term_doc.sign()
val_x = val_term_doc.sign()

In [30]:
r = np.log(pr(1) / pr(0)) # log-count ratio, normalized per document rather than by p.sum() and q.sum()
b = np.log((y==1).mean() / (y==0).mean()) # log ratio of class probabilities

In [31]:
model = LogisticRegression(C=0.1, dual=True)
model.fit(x, y)

preds = model.predict(val_x)
(preds.T == val_y).mean()

0.905

Log-count ratio r:

In [32]:
r.shape, r

((1, 800000),
 matrix([[-0.05468, -0.161  , -0.24784, ...,  1.09861, -0.69315, -0.69315]]))

In [33]:
np.exp(r)

matrix([[0.94678, 0.85129, 0.78049, ..., 3.     , 0.5    , 0.5    ]])

Fitting regularized Logistic Regression where features are the trigrams' Log-count ratios.

In [34]:
x_nb = x.multiply(r)
model = LogisticRegression(dual=True, C=0.1)
model.fit(x_nb, y)

val_x_nb = val_x.multiply(r)
preds = model.predict(val_x_nb)
(preds.T == val_y).mean()

0.91768
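
What `x.multiply(r)` does to the features can be shown on a tiny dense example. This sketch uses made-up values for `r` and plain lists in place of the sparse matrix:

```python
import math

x = [[1, 0, 1],    # binarized features: one row per document
     [0, 1, 1]]
r = [math.log(3), math.log(0.5), 0.0]   # one log-count ratio per feature (made-up values)

# Broadcast r across the rows: present features take their ratio, absent ones stay zero
x_nb = [[ri if xi else 0.0 for xi, ri in zip(row, r)] for row in x]
rounded = [[round(v, 3) for v in row] for row in x_nb]
print(rounded)
# [[1.099, 0.0, 0.0], [0.0, -0.693, 0.0]]
```

The logistic regression then learns one weight per column on top of these ratios, so it starts from the Naïve Bayes scaling of each feature and adjusts it.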

## Misc

In [None]:
# bag_of_words to string for printing
# bag = ''
# for word in bag_of_words:
#     bag += word + ' '

In [192]:
# probabilities = {0:0.5, 1:0.5}
# bow_dict = {word:i for i, word in enumerate(bag_of_words)}

# frequency = [[0 for i in range(len(bag_of_words))] for i in range(len(probabilities.keys()))]
# for i,text in enumerate(corpus):
#     for word in text.split(' '):
#         frequency[labels[i]][bow_dict[word]] += 1
# frequency = [list(map(lambda x : x /(1 + len(TD_matrix)//2), frequency[i])) for i in range(len(frequency))]


#                   f'{(sum([TD_matrix[i][0] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^3}'
#                   f'{(sum([TD_matrix[i][1] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^7}'
#                   f'{(sum([TD_matrix[i][2] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^3}'
#                   f'{(sum([TD_matrix[i][3] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^5}'
#                   f'{(sum([TD_matrix[i][4] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^5}'
#                   f'{(sum([TD_matrix[i][5] for i in range(len(TD_matrix))]) + 1)/(len(TD_matrix)+1):^3}'

In [182]:
## for-loop way
# frequencies = [[],[]]
# for f in range(len(frequencies)):
#     for c in range(len(TD_matrix[0])):
#         tot = 0
#         for r in range(len(TD_matrix)//2):
#             tot += TD_matrix[r+(f*2)][c]
#         frequencies[f].append((tot + 1)/(len(TD_matrix)//2 + 1))       

# list-comprehension way
# frequencies = [[(sum([TD_matrix[r+(f*2)][c] for r in range(len(TD_matrix)//2)]) + 1)/(len(TD_matrix)//2 + 1) for c in range(len(TD_matrix[0]))] for f in range(2)]