# Please go to https://ccv.jupyter.brown.edu

## What we learned so far...
- The main goal of NLP in machine learning
- Text normalization
- The differences between stemming and lemmatization
- Bag of Words

# 5. Intro to Natural Language Processing (NLP)
#### By the end of this day, you'll be able to 
- explain the differences between bag of words and n-grams
- apply the TFIDF transformation
- explain the difference between topic modeling and LDA
- run through a simple NLP pipeline, from cleaning and vectorizing text to mining topics

## 5.1 N-grams
- an n-gram is a contiguous sequence of n items from a given sample of text or speech
- the items can be phonemes, syllables, letters, words, or base pairs (according to the application)

- we can generalize bag of words to phrases of *n* words
- bag of words is a unigram representation of text
- we can have unigrams, bigrams, 3-grams, 4-grams, etc.

#### Our corpus:
- It was the best of times, it was the worst of times, it was the Age of Wisdom, it was the Age of Foolishness,

#### Bigrams
['it was', 'was the', 'the best', 'best of', ...]
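
The bigram list above can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming crude whitespace tokenization; `make_ngrams` is a hypothetical helper written for illustration, not the `nltk` function.

```python
def make_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    # slide a window of length n over the tokens
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# crude tokenization for illustration only: lowercase, drop commas, split on spaces
text = 'It was the best of times, it was the worst of times,'
tokens = text.lower().replace(',', '').split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams[:4])   # ['it was', 'was the', 'the best', 'best of']
print(trigrams[:2])  # ['it was the', 'was the best']
```

A text with k tokens yields k - n + 1 n-grams, so the 12 tokens here give 11 bigrams and 10 trigrams.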

In [None]:
import nltk
help(nltk.ngrams)

# Exercise 1 
## Prepare the bigrams for each document in this corpus 
- It was the best of times,
- it was the worst of times,
- it was the Age of Wisdom,
- it was the Age of Foolishness,

In [None]:
# Solution



## 5.2 Bag of words using n-grams

- changing the unit of analysis from words to n-grams can help to encode some contextual information in NLP applications
- `CountVectorizer()` has a built-in `ngram_range=(min_n, max_n)` option for computing a count matrix over all n-grams with min_n <= n <= max_n

In [None]:
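# Solution
from sklearn.feature_extraction.text import CountVectorizer

# the four documents from Exercise 1
corpus = ['It was the best of times,',
          'it was the worst of times,',
          'it was the Age of Wisdom,',
          'it was the Age of Foolishness,']

# ngram_range=(2, 2) keeps bigrams only
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2
print(vectorizer.get_feature_names_out())
print(X.toarray())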

## 5.2.1 Bag of words vs. N-grams

- bag of words is simple, but its count matrix is large and sparse, which makes it computationally inefficient
- n-grams can create an even larger count matrix
- n > 3 is rarely used
- a corpus of 1 billion ($10^9$) words contains roughly $10^5$ distinct 1-grams, $3 \times 10^5$ 2-grams, and over $10^6$ 3-grams
- One counter-example: a large n can reveal plagiarism
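
To see how the vocabulary (the number of columns in the count matrix) changes with n, here is a small sketch that counts the distinct n-grams in the four-line Dickens corpus. It assumes crude whitespace tokenization, and `make_ngrams` is a hypothetical helper. On such a tiny, repetitive corpus the growth is modest; on a real corpus the number of distinct n-grams rises much faster.

```python
def make_ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ['It was the best of times,',
          'it was the worst of times,',
          'it was the Age of Wisdom,',
          'it was the Age of Foolishness,']
# crude tokenization for illustration only
tokenized = [doc.lower().replace(',', '').split() for doc in corpus]

# number of distinct n-grams across the whole corpus, for n = 1, 2, 3
vocab_sizes = {}
for n in (1, 2, 3):
    vocab = set()
    for tokens in tokenized:
        vocab.update(make_ngrams(tokens, n))
    vocab_sizes[n] = len(vocab)

print(vocab_sizes)                # {1: 10, 2: 11, 3: 11}
print(sum(vocab_sizes.values()))  # 32 columns if 1-, 2- and 3-grams are combined
```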

# Exercise 2 

### From the corpus below, construct a count matrix with a vocabulary made up of both bigrams and trigrams. Print the vocabulary and the count array.

```
corpus = ['It was the best of times,',
          'it was the worst of times,',
          'it was the Age of Wisdom,',
          'it was the Age of Foolishness,']
```

In [None]:
# Solution



## 5.3 Towards the TFIDF transformation:

### 5.3.1 Another look at the count matrix

|c|term_1|term_2|...|term_j|...|term_m|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|
|__...__|...|...|...|...|...|...|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|
|__...__|...|...|...|...|...|...|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|

- c is our count matrix
- c_ij = number of times term_j appears in document_i
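
As a rough illustration of how such a count matrix comes about, the sketch below builds one by hand for a three-sentence toy corpus. The tokenization is deliberately crude and is an assumption of this example; `CountVectorizer` applies its own rules.

```python
from collections import Counter

corpus = ['This is the document.',
          'This document is the document.',
          'And this is the paper.']
# crude tokenization for illustration only: lowercase, drop periods, split on spaces
tokenized = [doc.lower().replace('.', '').split() for doc in corpus]

# the vocabulary (columns of c): all distinct terms, in alphabetical order
vocab = sorted({term for doc in tokenized for term in doc})

# c[i][j] = number of times vocab[j] appears in document i
doc_counters = [Counter(doc) for doc in tokenized]
c = [[counter[term] for term in vocab] for counter in doc_counters]

print(vocab)  # ['and', 'document', 'is', 'paper', 'the', 'this']
for row in c:
    print(row)
```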

### 5.3.2 The term count vector

|-|term_1|term_2|...|term_j|...|term_m|__<font color='red'>T</font>__|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|__<font color='red'>sum_j(c_1j)</font>__|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|__<font color='red'>sum_j(c_2j)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|__<font color='red'>sum_j(c_ij)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|__<font color='red'>sum_j(c_nj)</font>__|

- T is the term count vector
- T_i is the total number of term occurrences in document_i (the sum of row i)

### 5.3.3 The document count vector

|-|term_1|term_2|...|term_j|...|term_m|__<font color='red'>T</font>__|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|c_11|c_12|...|c_1j|...|c_1m|__<font color='red'>sum_j(c_1j)</font>__|
|__Doc_2__|c_21|c_22|...|c_2j|...|c_2m|__<font color='red'>sum_j(c_2j)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_i__|c_i1|c_i2|...|c_ij|...|c_im|__<font color='red'>sum_j(c_ij)</font>__|
|__...__|...|...|...|...|...|...|<font color='red'>...</font>|
|__Doc_n__|c_n1|c_n2|...|c_nj|...|c_nm|__<font color='red'>sum_j(c_nj)</font>__|
|__<font color='blue'>D</font>__|__<font color='blue'>sum_i(c_i1 > 0)</font>__|__<font color='blue'>sum_i(c_i2 > 0)</font>__|<font color='blue'>...</font>|__<font color='blue'>sum_i(c_ij > 0)</font>__|<font color='blue'>...</font>|__<font color='blue'>sum_i(c_im > 0)</font>__||

- D is the document count vector
- D_j is the number of documents that contain term_j at least once.
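
The two vectors are just row sums and column-wise "non-zero" counts. Here is a minimal sketch on a small, made-up count matrix (the numbers are hypothetical and unrelated to the exercises):

```python
# hypothetical count matrix: 3 documents (rows) x 4 terms (columns)
counts = [[2, 0, 1, 1],
          [0, 1, 1, 0],
          [0, 3, 1, 0]]
n_terms = len(counts[0])

# T_i: total number of term occurrences in document i (row sum)
T = [sum(row) for row in counts]

# D_j: number of documents in which term j appears at least once
D = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

print(T)  # [4, 2, 4]
print(D)  # [1, 2, 3, 1]
```

Note that D counts documents, not occurrences: the second term appears 4 times in total but only in 2 documents.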

# Exercise 3
## Calculate the term count and document counts of the following count matrix:

|-|term_1|term_2|term_3|term_4|term_5|term_6|term_7|term_8|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc1__|2|3|1|0|3|3|1|2|
|__Doc2__|1|3|1|0|0|0|3|2|
|__Doc3__|1|0|1|2|0|1|0|1|
|__Doc4__|0|2|1|3|3|0|0|3|
|__Doc5__|2|2|2|0|1|0|3|2|
|__Doc6__|1|0|3|3|1|2|3|1|
|__Doc7__|2|2|0|2|0|2|3|2|


### Solution



## 5.4 The TFIDF transformation

#### We need to normalize or de-bias the count matrix!
- Some documents are shorter, others are longer => there is a bias towards longer documents
- Some terms appear in most of the documents => there is bias towards frequent terms

### 5.4.1 TFIDF - Term Frequency times Inverse Document Frequency
- number of documents: n
- term frequency: c_ij / T_i (the share of document_i's term occurrences that are term_j)
- document frequency: D_j / n (the fraction of documents in the corpus that contain term_j)
- inverse document frequency (textbook): ln(n / D_j)
- inverse document frequency (scikit-learn): ln[(n + 1) / (D_j + 1)] + 1
    - adding 1 to the numerator and the denominator prevents division by zero (as if one extra document containing every term in the corpus exactly once had been seen)
    - adding 1 to the idf ensures that terms with zero idf (terms that occur in every document of the training set) are not ignored entirely

- W_ij = tf * idf
### <center>W_ij = (c_ij / T_i) * ln(n / D_j)</center>
#### <center>or</center>
### <center>W_ij = (c_ij / T_i) * (ln[(n + 1) / (D_j + 1)] + 1)</center>
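
To get a feel for the two idf definitions listed above, the sketch below evaluates both for a corpus of n = 3 documents. The function names are hypothetical and chosen for this example only.

```python
import math

def idf_textbook(n, D_j):
    """Textbook idf: ln(n / D_j)."""
    return math.log(n / D_j)

def idf_smoothed(n, D_j):
    """Smoothed idf as written above for scikit-learn: ln[(n + 1) / (D_j + 1)] + 1."""
    return math.log((n + 1) / (D_j + 1)) + 1

n = 3  # number of documents
for D_j in (1, 2, 3):
    print(D_j, round(idf_textbook(n, D_j), 4), round(idf_smoothed(n, D_j), 4))
```

A term that occurs in every document (D_j = n) gets a textbook idf of exactly 0, so its weight vanishes, while the smoothed variant gives it an idf of 1. In both variants, rarer terms receive larger idf values.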


- finally, normalize each document's weights so that they lie between 0 and 1

### W_ij = (c_ij / T_i) * (ln[(n + 1) / (D_j + 1)] + 1)


|__W__ |term_1|term_2|...|term_j|...|term_m|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|( c_11/T_1 )*( ln[(n+1)/(D_1+1)]+1 )|( c_12/T_1 )*( ln[(n+1)/(D_2+1)]+1 )|...|( c_1j/T_1 )*( ln[(n+1)/(D_j+1)]+1 )|...|( c_1m/T_1 )*( ln[(n+1)/(D_m+1)]+1 )|
|__Doc_2__|( c_21/T_2 )*( ln[(n+1)/(D_1+1)]+1 )|( c_22/T_2 )*( ln[(n+1)/(D_2+1)]+1 )|...|( c_2j/T_2 )*( ln[(n+1)/(D_j+1)]+1 )|...|( c_2m/T_2 )*( ln[(n+1)/(D_m+1)]+1 )|
|__...__|...|...|...|...|...|...|
|__Doc_i__|( c_i1/T_i )*( ln[(n+1)/(D_1+1)]+1 )|( c_i2/T_i )*( ln[(n+1)/(D_2+1)]+1 )|...|( c_ij/T_i )*( ln[(n+1)/(D_j+1)]+1 )|...|( c_im/T_i )*( ln[(n+1)/(D_m+1)]+1 )|
|__...__|...|...|...|...|...|...|
|__Doc_n__|( c_n1/T_n )*( ln[(n+1)/(D_1+1)]+1 )|( c_n2/T_n )*( ln[(n+1)/(D_2+1)]+1 )|...|( c_nj/T_n )*( ln[(n+1)/(D_j+1)]+1 )|...|( c_nm/T_n )*( ln[(n+1)/(D_m+1)]+1 )|


### then, L2-normalize each row so the weights are between 0 and 1
### TFIDF_ij = W_ij / sqrt( sum_k( W_ik^2 ) )
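
The whole transformation can be traced step by step in plain Python. This is a sketch on a small, made-up count matrix (hypothetical numbers, not the exercise data), using the smoothed idf and row-wise L2 normalization described above.

```python
import math

# hypothetical count matrix: 3 documents (rows) x 4 terms (columns)
counts = [[2, 0, 1, 1],
          [0, 1, 1, 0],
          [0, 3, 1, 0]]
n = len(counts)       # number of documents
m = len(counts[0])    # number of terms

# term counts (row sums) and document counts (non-zero entries per column)
T = [sum(row) for row in counts]
D = [sum(1 for row in counts if row[j] > 0) for j in range(m)]

# smoothed idf: ln[(n + 1) / (D_j + 1)] + 1
idf = [math.log((n + 1) / (D[j] + 1)) + 1 for j in range(m)]

# unnormalized weights: W_ij = (c_ij / T_i) * idf_j
W = [[counts[i][j] / T[i] * idf[j] for j in range(m)] for i in range(n)]

# L2-normalize each row: TFIDF_ij = W_ij / sqrt(sum_k W_ik^2)
# (assumes no document is empty, otherwise the norm would be zero)
tfidf = []
for row in W:
    norm = math.sqrt(sum(w * w for w in row))
    tfidf.append([w / norm for w in row])

for row in tfidf:
    print([round(v, 3) for v in row])
```

Because the L2 step rescales every row to unit length, the division by T_i cancels out in the final matrix: scaling a row by any positive constant leaves its normalized version unchanged.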

# Exercise 4
### Calculate the TFIDF matrix for the following count matrix:


|c |term_1|term_2|term_3|term_4|term_5|term_6|<font color='red'>T</font>|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__Doc_1__|0|1|1|0|1|1|<font color='red'>4</font>|
|__Doc_2__|0|2|1|0|1|1|<font color='red'>5</font>|
|__Doc_3__|1|0|1|1|1|1|<font color='red'>5</font>|
|__<font color='blue'>D</font>__|<font color='blue'>1</font>|<font color='blue'>2</font>|<font color='blue'>3</font>|<font color='blue'>1</font>|<font color='blue'>3</font>|<font color='blue'>3</font>||

### 5.4.2 TFIDF Transformation in Corpus Preparation

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is the document.',
          'This document is the document.',
          'And this is the paper.']

c_vectorizer = CountVectorizer()
X = c_vectorizer.fit_transform(corpus)
print(c_vectorizer.get_feature_names_out())
print(X.toarray())

t_vectorizer = TfidfVectorizer()
tfidf = t_vectorizer.fit_transform(corpus)
print(tfidf.toarray())

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['The character said: It was the best of times, ', 
          'The character said: it was the worst of times, ', 
          'The character said: it was the Age of Wisdom, ', 
          'it was the Age of Foolishness. - Charles Dickens']
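# normalize() and lemmatize() are assumed to be the helper functions defined earlier in the workshop
norm_corpus = [normalize(doc) for doc in corpus]
lemmed_corpus = [lemmatize(doc) for doc in norm_corpus]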
clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
clean_corpus

In [None]:
# the bigram TFIDF matrix
vectorizer = TfidfVectorizer(ngram_range=(2,2))
X = vectorizer.fit_transform(clean_corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

## 5.5 Topic Models

- A type of statistical model for discovering the abstract “topics” that occur in a collection of documents
- Can think of it as a form of dimensionality reduction

### 5.5.1 Latent Dirichlet Allocation (LDA)
- A flavor of topic modeling that can be used to assign the text of a document to particular topics.
- The LDA model discovers the different topics that the documents represent and how much of each topic is present in a document
- Very popular since it was introduced in 2003 by David Blei, Andrew Ng, and Michael Jordan.

![title](./lda.jpg)
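
LDA rests on a simple generative story: each topic is a distribution over words, each document is a mixture of topics, and every word is produced by first picking a topic and then picking a word from that topic. The sketch below simulates that story with the standard library only. It illustrates the modeling assumption, not how scikit-learn fits the model; the vocabulary, the number of topics, and the Dirichlet parameter 0.5 are all made up for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Pick an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
vocab = ['police', 'court', 'rain', 'flood', 'match', 'win']  # hypothetical vocabulary
n_topics = 3                                                  # hypothetical K

# words-per-topic model: one distribution over the vocabulary for each topic
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

def generate_document(n_words, rng):
    # topics-per-document model: this document's mixture over the K topics
    doc_topic = sample_dirichlet([0.5] * n_topics, rng)
    words = []
    for _ in range(n_words):
        k = sample_index(doc_topic, rng)                       # pick a topic
        words.append(vocab[sample_index(topic_word[k], rng)])  # pick a word from it
    return doc_topic, words

doc_topic, words = generate_document(8, rng)
print([round(p, 2) for p in doc_topic])
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and each document's topic mixture.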

### 5.5.2 Example LDA Code Using `scikit-learn`

#### ABC News Headlines Corpus from Kaggle

Format: CSV

- publish_date: Date of publishing for the article in yyyyMMdd format
- headline_text: Text of the headline in ASCII, English, lowercase
- Start Date: 2003-02-19 End Date: 2017-12-31
- Total Records: 1,103,663

Rohit Kulkarni (2017), A Million News Headlines [CSV Data file], doi:10.7910/DVN/SYBGZL, Retrieved from: https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/8

In [None]:
import pandas as pd
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' drops malformed rows instead
data = pd.read_csv('./abcnews-date-text.csv', on_bad_lines='skip')
corpus = list(data['headline_text'])
corpus = corpus[:10000]
print(corpus[:5])

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# set some parameters
n_topics = 30
n_top_words = 10
ngram = 2

# prepare the corpus
norm_corpus = [normalize(doc) for doc in corpus]
lemmed_corpus = [lemmatize(doc) for doc in norm_corpus]
clean_corpus = [' '.join(doc) for doc in lemmed_corpus]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(clean_corpus)

# model the cleaned corpus
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online', 
                                learning_offset=50., random_state=0).fit(X)
lda

In [None]:
help(LatentDirichletAllocation)

In [None]:
# the topic word weights
print(lda.components_)
print(lda.components_.shape)

In [None]:
# the document topic weights
lda_X = LatentDirichletAllocation(n_components=n_topics, max_iter=5, learning_method='online', 
                                learning_offset=50., random_state=0).fit_transform(X)
print(lda_X)
print(lda_X.shape)
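
Each row of the document-topic matrix holds one document's weights over the topics, so a common next step is to read off the dominant topic per document. Here is a minimal sketch on made-up weights; the numbers are hypothetical and stand in for a few rows of `lda_X`.

```python
# hypothetical document-topic weights: 3 documents (rows) x 3 topics (columns)
doc_topic = [[0.70, 0.20, 0.10],
             [0.05, 0.15, 0.80],
             [0.30, 0.40, 0.30]]

# index of the largest weight in each row (the first one wins in case of a tie)
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
print(dominant)  # [0, 2, 1]
```

The third document shows why the full row is worth keeping: its weights are nearly uniform, so the "dominant" topic says little about it.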

#### LDA
- Pro: The number of topics, K, is one of the only tuneable parameters in the model, making it simple and easy to use
- Con: Evaluation is subjective and requires subject matter expertise

In [None]:
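import numpy as np

def display_topics(model, feature_names, n_top_words=10):
    # print the n_top_words highest-weighted terms of each topic
    for topic_idx, topic in enumerate(model.components_):
        print('Topic ' + str(topic_idx) + ':')
        # argsort is ascending, so the reversed slice yields the largest weights first
        print('|'.join([feature_names[i] for i in np.argsort(topic)[:-n_top_words-1:-1]]))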

In [None]:
X_feature_names = vectorizer.get_feature_names_out()
display_topics(lda, X_feature_names, n_top_words)
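
The slice `[:-n_top_words-1:-1]` in `display_topics` is compact but easy to misread. The sketch below does the same ranking in plain Python on one made-up topic row: sort the term indices by weight, largest first, and keep the first few. The helper name, the feature names, and the weights are hypothetical.

```python
def top_terms(weights, feature_names, n_top_words=3):
    """Return the n_top_words terms with the largest weights, in descending order."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in order[:n_top_words]]

feature_names = ['court', 'flood', 'police', 'rain', 'win']  # hypothetical terms
topic = [0.1, 2.5, 0.3, 1.7, 0.2]                            # hypothetical topic-word weights

print(top_terms(topic, feature_names, 2))  # ['flood', 'rain']
```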

## 5.6 Other popular NLP packages

- spaCy
- Gensim

# Recap
- Main goal of NLP in ML is to convert variable-length documents into fixed-length numeric vectors
- Stemming and lemmatization are two attempts to reduce derived words to their bases
- Bag of words counts how many times each unique word appears in a document
- n-grams count how many times each unique phrase of length n appears in a document
- n > 3 is rarely used
- While n-grams are simple to calculate, the count matrix is sparse and requires a lot of memory to store => computationally inefficient

# Recap
- TFIDF stands for Term Frequency times Inverse Document Frequency
- The goal of TFIDF is to de-bias the count matrix
- W_ij = (c_ij / T_i) * ln( n / D_j )
- TFIDF_ij = W_ij / sqrt( sum_k( W_ik^2 ) )

# Recap
- LDA is one flavor of many different topic modeling algorithms
- It can be used to summarize what a large, prohibitively long corpus of documents is about
- It builds a topic per document model and words per topic model, modeled as Dirichlet distributions
- The number of topics, K, is one of the only tuneable parameters in the model
- Evaluation is subjective and requires subject matter expertise 