# 6. Intro to Natural Language Processing (NLP)
#### By the end of this day, you'll be able to 
- explain the differences between bag of words and n-grams
- apply the TFIDF transformation
- explain the difference between topic modeling and LDA
- run through a simple NLP pipeline, from cleaning and vectorizing text to mining topics

## 6.1 N-grams
- an n-gram is a contiguous sequence of n items from a given sample of text or speech
- the items can be phonemes, syllables, letters, words, or base pairs (according to the application)

- we can generalize bag of words to phrases of *n* words
- bag of words is a unigram representation of text
- we can have unigrams, bigrams, 3-grams, 4-grams, etc.

#### Our corpus:
- It was the best of times, it was the worst of times, it was the Age of Wisdom, it was the Age of Foolishness,

#### Bigrams
['it was', 'was the', 'the best', 'best of', ...]
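As a minimal sketch of how such a list can be produced, the snippet below slides a window of size n over a token list. The helper name `ngrams_from_tokens` and the whitespace tokenizer are assumptions made for illustration; the exercise solution further down uses `nltk.util.ngrams` instead.

```python
def ngrams_from_tokens(tokens, n):
    """Return all contiguous n-grams in `tokens` as tuples."""
    # zip over n staggered views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

text = "it was the best of times it was the worst of times"
tokens = text.split()  # naive whitespace tokenizer, for illustration only
bigrams = ngrams_from_tokens(tokens, 2)
print(bigrams[:4])
# [('it', 'was'), ('was', 'the'), ('the', 'best'), ('best', 'of')]
```

A text with k tokens yields k - n + 1 n-grams, so the list shrinks by one item for every increase in n.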

# Exercise 1 
## Prepare the bigrams for this corpus (be sure to normalize/lemmatize the corpus first)
- It was the best of times,
- it was the worst of times,
- it was the Age of Wisdom,
- it was the Age of Foolishness,

In [74]:
# Solution
import re
import contractions
import inflect
import nltk
from nltk.util import ngrams
from nltk.corpus.reader import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def normalize(doc, stopwords):
    doc = doc.lower()
    doc = doc.strip()
    # contractions.fix returns a new string, so keep the result
    doc = contractions.fix(doc)
    doc = re.sub(r'[^\w\s]', '', doc)
    words = nltk.word_tokenize(doc)
    p = inflect.engine()
    # spell out digit tokens in place so the word order is preserved for n-grams
    words = [p.number_to_words(word) if word.isdigit() else word for word in words]
    words = [word for word in words if word not in stopwords]
    
    return words

In [75]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    else:
        return None

In [76]:
def lemmatize(doc):
    tagged_doc = nltk.pos_tag(doc)
    lemmas = []
    for word, tag in tagged_doc:
        wn_tag = get_wordnet_pos(tag)
        if wn_tag is None:
            lemma = lemmatizer.lemmatize(word) 
        else:
            lemma = lemmatizer.lemmatize(word, pos=wn_tag) 
        lemmas.append(lemma)
        
    return lemmas

In [87]:
def ngramize(doc, n):
    tokens = nltk.word_tokenize(doc)
    return list(ngrams(tokens, n))

In [90]:
corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']

sw = set(stopwords.words('english'))
norm_corpus = [normalize(doc, sw) for doc in corpus]

lemmatizer = WordNetLemmatizer()
lemmed_corpus  = [lemmatize(doc) for doc in norm_corpus]

clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
print(clean_corpus)
print()

bigrams = [ngramize(doc, 2) for doc in clean_corpus]
print(bigrams)

['character say best time', 'character say bad time', 'character say age wisdom', 'age foolishness charles dickens']

[[('character', 'say'), ('say', 'best'), ('best', 'time')], [('character', 'say'), ('say', 'bad'), ('bad', 'time')], [('character', 'say'), ('say', 'age'), ('age', 'wisdom')], [('age', 'foolishness'), ('foolishness', 'charles'), ('charles', 'dickens')]]



## 6.2 Bag of words vs. N-grams

- bag of words is simple, but it is computationally inefficient
- n-grams can create an even larger count matrix
- n > 3 is rarely used
- a corpus of 1 billion ($10^9$) words contains roughly $10^5$ 1-grams, $3 \times 10^5$ 2-grams, and over $10^6$ 3-grams
- one counter-example: a large n can reveal plagiarism
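To make the growth of the vocabulary with n concrete, here is a small sketch that counts the distinct n-grams in the Dickens sentence from section 6.1. The helper `count_distinct_ngrams` is a hypothetical name, and the text is lowercased with punctuation removed by hand.

```python
def count_distinct_ngrams(tokens, n):
    """Number of unique contiguous n-grams in a token list."""
    return len({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

text = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness")
tokens = text.split()

# distinct n-gram counts for n = 1, 2, 3
distinct = {n: count_distinct_ngrams(tokens, n) for n in (1, 2, 3)}
print(distinct)
```

Even on 24 tokens the number of distinct n-grams rises with n, which is why the count matrix gets wider as n increases.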

# Exercise 2 

From `clean_corpus`, construct the bigram bag-of-words matrix described in the previous slide.

In [17]:
# Solution
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(clean_corpus)
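# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(vectorizer.get_feature_names_out().tolist())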
print(X.toarray())

['age foolishness', 'age wisdom', 'bad time', 'best time', 'character say', 'charles dickens', 'foolishness charles', 'say age', 'say bad', 'say best']
[[0 0 0 1 1 0 0 0 0 1]
 [0 0 1 0 1 0 0 0 1 0]
 [0 1 0 0 1 0 0 1 0 0]
 [1 0 0 0 0 1 1 0 0 0]]
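The same matrix can be rebuilt by hand, which may help show what the vectorizer is doing. This is only a sketch using `collections.Counter`: the helper `bigram_strings` is a made-up name, the cleaned corpus is copied from the output of Exercise 1, and details of `CountVectorizer` such as its default token pattern are not reproduced.

```python
from collections import Counter

docs = ['character say best time', 'character say bad time',
        'character say age wisdom', 'age foolishness charles dickens']

def bigram_strings(doc):
    """Return the bigrams of a whitespace-tokenized document as strings."""
    tokens = doc.split()
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

# one Counter of bigram counts per document
counts = [Counter(bigram_strings(doc)) for doc in docs]

# columns are the alphabetically sorted distinct bigrams
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```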


## 6.3 Towards the TFIDF transformation:

### 6.3.1 Another look at the count matrix

|c|term_1|term_2|...|term_j|...|term_m|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|
|__...__|...|...|...|...|...|...|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|
|__...__|...|...|...|...|...|...|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|

- c is our count matrix
- c_ij = number of times term_j appears in document_i

### 6.3.2 The term count vector

|-|term_1|term_2|...|term_j|...|term_m|__<font color='red'>T</font>__|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|__<font color='red'>sum_j(c_1j)</font>__|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|__<font color='red'>sum_j(c_2j)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|__<font color='red'>sum_j(c_ij)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|__<font color='red'>sum_j(c_nj)</font>__|

- T is the term count vector
- T_i is the number of terms in document_i

### 6.3.3 The document count vector

|-|term_1|term_2|...|term_j|...|term_m|__<font color='red'>T</font>__|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|__<font color='red'>sum_j(c_1j)</font>__|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|__<font color='red'>sum_j(c_2j)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|__<font color='red'>sum_j(c_ij)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|__<font color='red'>sum_j(c_nj)</font>__|
|__<font color='blue'>D</font>__|__<font color='blue'>sum_i(c_i1 > 0)</font>__|__<font color='blue'>sum_i(c_i2 > 0)</font>__|<font color='blue'>...</font>|__<font color='blue'>sum_i(c_ij > 0)</font>__|<font color='blue'>...</font>|__<font color='blue'>sum_i(c_im > 0)</font>__||

- D is the document count vector
- D_j is the number of documents that contain term_j at least once.

# Exercise 3
## Calculate the term count and document counts of the following count matrix:

|-|term_1|term_2|term_3|term_4|term_5|term_6|term_7|term_8|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc1__|2|3|1|0|3|3|1|2|
|__Doc2__|1|3|1|0|0|0|3|2|
|__Doc3__|1|0|1|2|0|1|0|1|
|__Doc4__|0|2|1|3|3|0|0|3|
|__Doc5__|2|2|2|0|1|0|3|2|
|__Doc6__|1|0|3|3|1|2|3|1|
|__Doc7__|2|2|0|2|0|2|3|2|


### Solution
- T = [15, 10, 6, 12, 12, 14, 13]
- D = [6, 5, 6, 4, 4, 4, 5, 7]
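One way to check the solution is to compute both vectors directly from the count matrix. The sketch below uses plain Python lists; the variable names `counts`, `T`, and `D` simply mirror the notation of sections 6.3.2 and 6.3.3.

```python
counts = [
    [2, 3, 1, 0, 3, 3, 1, 2],  # Doc1
    [1, 3, 1, 0, 0, 0, 3, 2],  # Doc2
    [1, 0, 1, 2, 0, 1, 0, 1],  # Doc3
    [0, 2, 1, 3, 3, 0, 0, 3],  # Doc4
    [2, 2, 2, 0, 1, 0, 3, 2],  # Doc5
    [1, 0, 3, 3, 1, 2, 3, 1],  # Doc6
    [2, 2, 0, 2, 0, 2, 3, 2],  # Doc7
]

# T_i: total number of terms in document i (row sums)
T = [sum(row) for row in counts]

# D_j: number of documents containing term j at least once (non-zero entries per column)
D = [sum(1 for value in column if value > 0) for column in zip(*counts)]

print(T)  # [15, 10, 6, 12, 12, 14, 13]
print(D)  # [6, 5, 6, 4, 4, 4, 5, 7]
```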

## 6.4 The TFIDF transformation

#### We need to normalize or de-bias the count matrix!
- Some documents are shorter, others are longer => there is a bias towards longer documents
- Some terms appear in most of the documents => there is bias towards frequent terms

### 6.4.1 TFIDF - Term Frequency times Inverse Document Frequency
- number of documents: n
- term frequency: c_ij/T_i (share of the terms in document_i that are term_j)
- document frequency: D_j/n (fraction of the documents in the corpus that contain term_j)
- inverse document frequency (textbook): ln(n/D_j)
- inverse document frequency (scikit-learn): ln[(n+1)/(D_j+1)]+1
    - adding 1 to the numerator and denominator prevents division by zero (as if an extra document containing every term in the corpus exactly once had been seen)
    - adding 1 to the idf ensures that terms with zero idf (terms that occur in every document of the training set) are not ignored entirely
- tfidf = tf * idf
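The two IDF variants listed above can be compared numerically. The following is a small sketch with hypothetical helper names `idf_textbook` and `idf_smoothed`, using n = 3 documents as an arbitrary example.

```python
import math

def idf_textbook(n, d_j):
    """Textbook IDF: ln(n / D_j); zero when the term is in every document."""
    return math.log(n / d_j)

def idf_smoothed(n, d_j):
    """Smoothed IDF as listed above for scikit-learn: ln[(n + 1) / (D_j + 1)] + 1."""
    return math.log((n + 1) / (d_j + 1)) + 1

n = 3  # number of documents, chosen for illustration
for d_j in (1, 2, 3):
    print(d_j, round(idf_textbook(n, d_j), 4), round(idf_smoothed(n, d_j), 4))
```

For a term that appears in all three documents the textbook IDF is 0, so the term would vanish from the matrix, while the smoothed version still gives it a weight of 1.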
### <center>W_ij = c_ij / T_i * ln(n / D_j)</center>
#### <center>or</center>
### <center>W_ij = c_ij / T_i * (ln[(n + 1)/(D_j + 1)] + 1)</center>

### W_ij = c_ij / T_i * (ln[(n + 1)/(D_j + 1)] + 1)


|__W__ |term_1|term_2|...|term_j|...|term_m|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11/T_1 * (ln[(n+1)/(D_1+1)]+1)|c_12/T_1 * (ln[(n+1)/(D_2+1)]+1)|...|c_1j/T_1 * (ln[(n+1)/(D_j+1)]+1)|...|c_1m/T_1 * (ln[(n+1)/(D_m+1)]+1)|
|__Doc_2__|c_21/T_2 * (ln[(n+1)/(D_1+1)]+1)|c_22/T_2 * (ln[(n+1)/(D_2+1)]+1)|...|c_2j/T_2 * (ln[(n+1)/(D_j+1)]+1)|...|c_2m/T_2 * (ln[(n+1)/(D_m+1)]+1)|
|__...__|...|...|...|...|...|...|
|__Doc_i__|c_i1/T_i * (ln[(n+1)/(D_1+1)]+1)|c_i2/T_i * (ln[(n+1)/(D_2+1)]+1)|...|c_ij/T_i * (ln[(n+1)/(D_j+1)]+1)|...|c_im/T_i * (ln[(n+1)/(D_m+1)]+1)|
|__...__|...|...|...|...|...|...|
|__Doc_n__|c_n1/T_n * (ln[(n+1)/(D_1+1)]+1)|c_n2/T_n * (ln[(n+1)/(D_2+1)]+1)|...|c_nj/T_n * (ln[(n+1)/(D_j+1)]+1)|...|c_nm/T_n * (ln[(n+1)/(D_m+1)]+1)|


### then, normalize each row so the weights are between 0 and 1 (L2)
### TFIDF_ij = W_ij / sqrt( sum_k( W_ik^2 ) ) 

# Exercise 4
### Calculate the TFIDF matrix for the following count matrix:


|c |term_1|term_2|term_3|term_4|term_5|term_6|<font color='red'>T</font>|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|0|1|1|0|1|1|<font color='red'>4</font>|
|__Doc_2__|0|2|1|0|1|1|<font color='red'>5</font>|
|__Doc_3__|1|0|1|1|1|1|<font color='red'>5</font>|
|__<font color='blue'>D</font>__|<font color='blue'>1</font>|<font color='blue'>2</font>|<font color='blue'>3</font>|<font color='blue'>1</font>|<font color='blue'>3</font>|<font color='blue'>3</font>||

### First...
## Calculate W_ij = c_ij / T_i * (ln[(n + 1)/(D_j + 1)] + 1)
|W |term_1|term_2|term_3|term_4|term_5|term_6|<font color='red'>T</font>|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|0/4*(ln(4/2)+1)|1/4*(ln(4/3)+1)|1/4*(ln(4/4)+1)|0/4*(ln(4/2)+1)|1/4*(ln(4/4)+1)|1/4*(ln(4/4)+1)|<font color='red'>4</font>|
|__Doc_2__|0/5*(ln(4/2)+1)|2/5*(ln(4/3)+1)|1/5*(ln(4/4)+1)|0/5*(ln(4/2)+1)|1/5*(ln(4/4)+1)|1/5*(ln(4/4)+1)|<font color='red'>5</font>|
|__Doc_3__|1/5*(ln(4/2)+1)|0/5*(ln(4/3)+1)|1/5*(ln(4/4)+1)|1/5*(ln(4/2)+1)|1/5*(ln(4/4)+1)|1/5*(ln(4/4)+1)|<font color='red'>5</font>|
|__<font color='blue'>D</font>__|<font color='blue'>1</font>|<font color='blue'>2</font>|<font color='blue'>3</font>|<font color='blue'>1</font>|<font color='blue'>3</font>|<font color='blue'>3</font>||

|W |term_1|term_2|term_3|term_4|term_5|term_6|<font color='red'>T</font>|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|0|0.3219|0.25|0|0.25|0.25|<font color='red'>4</font>|
|__Doc_2__|0|0.51507|0.2|0|0.2|0.2|<font color='red'>5</font>|
|__Doc_3__|0.3386|0|0.2|0.3386|0.2|0.2|<font color='red'>5</font>|
|__<font color='blue'>D</font>__|<font color='blue'>1</font>|<font color='blue'>2</font>|<font color='blue'>3</font>|<font color='blue'>1</font>|<font color='blue'>3</font>|<font color='blue'>3</font>||

### Then...

## TFIDF_ij = W_ij / sqrt( sum_k( W_ik^2 ) ) 
normalize each row (L2) so the weights are between 0 and 1

$\sqrt{ 0.3219^2+0.25^2+0.25^2+0.25^2 } = .539555$

$\sqrt{ 0.51507^2+0.2^2+0.2^2+0.2^2 } = .620723$

$\sqrt{ 0.3386^2+0.2^2+0.3386^2+0.2^2+0.2^2 } = .591033$

|TFIDF |term_1|term_2|term_3|term_4|term_5|term_6|<font color='red'>T</font>|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|0|0.59662724|0.46333427|0|0.46333427|0.46333427|<font color='red'>4</font>|
|__Doc_2__|0|0.82979177|0.32220367|0|0.32220367|0.32220367|<font color='red'>5</font>|
|__Doc_3__|0.57292883|0|0.338381|0.57292883|0.338381|0.338381|<font color='red'>5</font>|
|__<font color='blue'>D</font>__|<font color='blue'>1</font>|<font color='blue'>2</font>|<font color='blue'>3</font>|<font color='blue'>1</font>|<font color='blue'>3</font>|<font color='blue'>3</font>||
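The hand calculation can be checked with a few lines of plain Python. This is a sketch of the two steps of the exercise (smoothed weights, then row-wise L2 normalization); the function name `tfidf` is an assumption, and the code expects every document to contain at least one term.

```python
import math

counts = [[0, 1, 1, 0, 1, 1],   # Doc_1
          [0, 2, 1, 0, 1, 1],   # Doc_2
          [1, 0, 1, 1, 1, 1]]   # Doc_3

def tfidf(counts):
    """Smoothed TF-IDF followed by row-wise L2 normalization.

    Assumes no document is empty, so T_i > 0 and every row norm is > 0.
    """
    n = len(counts)
    T = [sum(row) for row in counts]                            # term counts
    D = [sum(1 for v in col if v > 0) for col in zip(*counts)]  # document counts
    idf = [math.log((n + 1) / (d + 1)) + 1 for d in D]
    result = []
    for row, t in zip(counts, T):
        w = [c / t * i for c, i in zip(row, idf)]               # W_ij
        norm = math.sqrt(sum(x * x for x in w))                 # L2 norm of the row
        result.append([x / norm for x in w])
    return result

weights = tfidf(counts)
for row in weights:
    print([round(x, 4) for x in row])
```

Because each row is divided by its own norm at the end, dividing by T_i first does not change the final values; it only affects the intermediate W table.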

### 6.4.2 TFIDF Transformation in Corpus Preparation

In [91]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the document.',
          'This document is the document.',
          'And this is the paper.']

c_vectorizer = CountVectorizer()
X = c_vectorizer.fit_transform(corpus)
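# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(c_vectorizer.get_feature_names_out().tolist())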
print(X.toarray())

t_vectorizer = TfidfVectorizer()
tfidf = t_vectorizer.fit_transform(corpus)
print(tfidf.toarray())

['and', 'document', 'is', 'paper', 'the', 'this']
[[0 1 1 0 1 1]
 [0 2 1 0 1 1]
 [1 0 1 1 1 1]]
[[0.         0.59662724 0.46333427 0.         0.46333427 0.46333427]
 [0.         0.82979177 0.32220367 0.         0.32220367 0.32220367]
 [0.57292883 0.         0.338381   0.57292883 0.338381   0.338381  ]]


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']
norm_corpus = [normalize(doc, sw) for doc in corpus]
lemmed_corpus = [lemmatize(doc) for doc in norm_corpus]
clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
clean_corpus

In [18]:
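# the unigram TFIDF matrix (pass ngram_range=(2, 2) to get the bigram version)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean_corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
print(vectorizer.get_feature_names_out().tolist())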
print(X.toarray())

['age', 'bad', 'best', 'character', 'charles', 'dickens', 'foolishness', 'say', 'time', 'wisdom']
[[0.         0.         0.64065543 0.40892206 0.         0.
  0.         0.40892206 0.5051001  0.        ]
 [0.         0.64065543 0.         0.40892206 0.         0.
  0.         0.40892206 0.5051001  0.        ]
 [0.5051001  0.         0.         0.40892206 0.         0.
  0.         0.40892206 0.         0.64065543]
 [0.41428875 0.         0.         0.         0.52547275 0.52547275
  0.52547275 0.         0.         0.        ]]


## 6.5 Topic Models

- A type of statistical model for discovering the abstract “topics” that occur in a collection of documents
- Can think of it as a form of dimensionality reduction

### 6.5.1 Latent Dirichlet Allocation (LDA)
- A flavor of topic modeling that can be used to classify text in a document to a particular topic. 
- The LDA model discovers the different topics that the documents represent and how much of each topic is present in a document
- Very popular since it was introduced in 2003 by David Blei, Andrew Ng, and Michael Jordan.


#### A Toy Example... 
- Document 1: I had a peanut butter sandwich for breakfast.
- Document 2: I like to eat almonds, peanuts and walnuts.
- Document 3: My neighbor got a little dog yesterday.
- Document 4: Cats and dogs are mortal enemies.
- Document 5: You mustn’t feed peanuts to your dog.

```
Topic 1: 30% peanuts, 15% almonds, 10% breakfast… (you can interpret that this topic deals with food)
Topic 2: 20% dogs, 10% cats, 5% peanuts… ( you can interpret that this topic deals with pets or animals)

Documents 1 and 2: 100% Topic 1
Documents 3 and 4: 100% Topic 2
Document 5: 70% Topic 1, 30% Topic 2
```

### 6.5.2 Example LDA Code Using `scikit-learn`

#### ABC News Headlines Corpus from Kaggle

Format: CSV

- publish_date: Date of publishing for the article in yyyyMMdd format
- headline_text: Text of the headline in ASCII, English, lowercase
- Start Date: 2003-02-19 End Date: 2017-12-31
- Total Records: 1,103,663

Rohit Kulkarni (2017), A Million News Headlines [CSV Data file], doi:10.7910/DVN/SYBGZL, Retrieved from: https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/8

In [103]:
import pandas as pd
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) is the replacement
data = pd.read_csv('./abcnews-date-text.csv', on_bad_lines='skip')
corpus = list(data['headline_text'])
corpus = corpus[:1000]
print(corpus[:5])

['aba decides against community broadcasting licence', 'act fire witnesses must be aware of defamation', 'a g calls for infrastructure protection summit', 'air nz staff in aust strike for pay rise', 'air nz strike to affect australian travellers']


In [104]:
from sklearn.decomposition import LatentDirichletAllocation

# set some parameters
n_topics = 30
n_top_words = 10
ngram = 2

# prepare the corpus
norm_corpus = [normalize(doc, sw) for doc in corpus]
lemmed_corpus = [lemmatize(doc) for doc in norm_corpus]
clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean_corpus)

# model the cleaned corpus
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
lda



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=30, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=0,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [105]:
help(LatentDirichletAllocation)

Help on class LatentDirichletAllocation in module sklearn.decomposition.online_lda:

class LatentDirichletAllocation(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
 |  LatentDirichletAllocation(n_components=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None, n_topics=None)
 |  
 |  Latent Dirichlet Allocation with online variational Bayes algorithm
 |  
 |  .. versionadded:: 0.17
 |  
 |  Read more in the :ref:`User Guide <LatentDirichletAllocation>`.
 |  
 |  Parameters
 |  ----------
 |  n_components : int, optional (default=10)
 |      Number of topics.
 |  
 |  doc_topic_prior : float, optional (default=None)
 |      Prior of document topic distribution `theta`. If the value is None,
 |      defaults to `1 / n_components`.
 |      In the 

In [106]:
lda.components_

array([[0.03518048, 0.03494565, 0.03519915, ..., 0.03476496, 0.03479979,
        0.03467066],
       [0.03491272, 0.03461502, 0.0347583 , ..., 0.0349228 , 0.03526037,
        0.03476048],
       [0.03488117, 0.03493359, 0.62683389, ..., 0.03522469, 0.03471125,
        0.03505534],
       ...,
       [0.03506836, 0.03486589, 0.03495162, ..., 0.03496106, 0.03486878,
        0.03478744],
       [0.0349356 , 0.03492402, 0.03498623, ..., 0.034816  , 0.03486217,
        0.03478007],
       [0.03476832, 0.03476624, 0.03476583, ..., 0.0352323 , 0.03477653,
        0.03521766]])

In [107]:
lda.components_.shape

(30, 2313)
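Each row of `lda.components_` holds one weight per vocabulary term, so a topic is usually read by listing its highest-weighted terms. The sketch below shows one way to do that with a hypothetical helper, `top_words_per_topic`; the toy matrix and term list are made up for illustration and stand in for the fitted model.

```python
def top_words_per_topic(components, feature_names, n_top_words):
    """Return the n_top_words highest-weighted terms for each topic row."""
    topics = []
    for weights in components:
        # term indices sorted by descending weight within this topic
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([feature_names[j] for j in ranked[:n_top_words]])
    return topics

# toy stand-in for lda.components_ (2 topics x 4 terms); the values are made up
toy_components = [[0.1, 2.0, 0.3, 1.5],
                  [1.2, 0.1, 0.9, 0.2]]
toy_features = ['cat', 'peanut', 'dog', 'almond']

print(top_words_per_topic(toy_components, toy_features, 2))
# [['peanut', 'almond'], ['cat', 'dog']]
```

With the fitted model from the cell above, a call along the lines of `top_words_per_topic(lda.components_, vectorizer.get_feature_names_out(), n_top_words)` should give the top terms for all 30 topics, assuming a scikit-learn version that provides `get_feature_names_out()`.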

#### LDA
- Pro: The number of topics, K, is one of the only tuneable parameters in the model, making it simple and easy to use
- Con: Evaluation is subjective and requires subject matter expertise

## 6.6 Other popular NLP packages

- spaCy
- Gensim

# Recap
- The main goal of NLP in ML is to convert variable-length documents into fixed-length numeric vectors
- Stemming and lemmatization are two attempts to reduce derived words to their bases
- Bag of words counts how many times each unique word appears in a document
- n-grams count how many times unique phrases of length n appear in a document
- n > 3 is rarely used
- While n-grams are simple to calculate, the count matrix is sparse and requires a lot of memory to store => computationally inefficient 

# Recap
- TFIDF stands for Term Frequency times Inverse Document Frequency
- The goal of TFIDF is to de-bias the count matrix
- W_ij = c_ij / T_i * ln( n/D_j )
- TFIDF_ij = W_ij / sqrt( sum_k( W_ik^2 ) )

# Recap
- LDA is one flavor of many different topic modeling algorithms
- It can be used to summarize what a large, prohibitively long corpus of documents is about
- The number of topics, K, is one of the only tuneable parameters in the model, making it easy to use
- Evaluation is subjective and requires subject matter expertise 