To run this notebook, install the gensim and nltk packages:
```
conda install -c anaconda nltk
conda install -c anaconda gensim
```

## Describing topics from a cluster using LDA

In [1]:
import pandas as pd
import numpy as np

We use the sample dataset as a stand-in for a cluster that we might find:

In [2]:
data = pd.read_json('./data/quotes-2019-nytimes.json.bz2', lines=True)

In [5]:
# Only the quotation text is useful to us
quotes = data['quotation']

### Useful libraries
We will use Gensim and the Natural Language Toolkit (NLTK) to process the quotes' text.

In [19]:
import nltk
nltk.download('wordnet')

from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import EnglishStemmer


lemmatizer = WordNetLemmatizer()
stemmer = EnglishStemmer()

[nltk_data] Downloading package wordnet to C:\Users\Julian
[nltk_data]     Blackwell\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


### Text preprocessing
We first preprocess the quotes:
- Split text into words, apply lowercase, and remove punctuation
- Ignore words of length < 3
- Remove stopwords
- Lemmatize words 

In [75]:
def lemmatize(word):
    'Apply lemmatization to a word'
    # Important note: For the moment a default tag 'n' for nouns is used
    # To improve: find and set the correct tag for each word
    return lemmatizer.lemmatize(word, pos='n')

def stem(word):
    'Apply stemming to a word'
    return stemmer.stem(word)

def preprocess_quotes(qs):
    'Split quotes into words, apply lowercase, remove punctuation, ignore words of length < 3, remove stopwords, lemmatize'
    processed_quotes = []
    for q in qs:
        processed = []
        # Convert quote to list of lowercase tokens, ignoring those w/ length < 3
        for token in simple_preprocess(q, min_len=3):
            # Ignore stopwords
            if token not in STOPWORDS:
                # Lemmatize token (stem() is defined above but not applied here)
                processed.append(lemmatize(token))
        processed_quotes.append(processed)
                
    return processed_quotes
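
The comment in `lemmatize` notes that every word is currently treated as a noun. One possible way to lift that restriction, sketched here under assumptions, is to tag each token with NLTK's `pos_tag` (which emits Penn Treebank tags and needs its tagger data downloaded) and translate each tag to the single-letter code WordNet expects. The helper name `penn_to_wordnet` is hypothetical and not part of NLTK. Only the mapping is shown, so the block runs without any downloads.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code (hypothetical helper).

    Falls back to 'n' (noun), the same default the notebook currently uses.
    """
    prefix_to_pos = {
        'JJ': 'a',  # adjectives: JJ, JJR, JJS
        'VB': 'v',  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
        'NN': 'n',  # nouns: NN, NNS, NNP, NNPS
        'RB': 'r',  # adverbs: RB, RBR, RBS
    }
    return prefix_to_pos.get(tag[:2], 'n')


print(penn_to_wordnet('VBG'))  # v
print(penn_to_wordnet('JJR'))  # a
print(penn_to_wordnet('CD'))   # n (fallback)
```

The result could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, so that a verb form such as "going" should be reduced to "go" instead of being left unchanged.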

**Note**: We could add n-grams (bigrams at least) to group words that frequently appear together.
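
To illustrate the idea, here is a deliberately simplified plain-Python sketch that joins adjacent tokens whenever the pair occurs often enough in the corpus. The function name `merge_frequent_bigrams` and the raw-count rule with `min_count` are assumptions made for this example. In practice, gensim's `Phrases` model would be the more natural choice, since it scores pairs statistically instead of using a raw count.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times (simplified sketch)."""
    # Count every adjacent pair across all documents
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + '_' + doc[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [['new', 'york', 'city'], ['live', 'new', 'york'], ['new', 'deal']]
print(merge_frequent_bigrams(docs))
# [['new_york', 'city'], ['live', 'new_york'], ['new', 'deal']]
```

Applied to the output of `preprocess_quotes`, a step like this would run before the bag-of-words corpus is built, so that a merged token such as `new_york` gets its own dictionary entry.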

### Produce bag of words
We then map our preprocessed quotes to a "bag of words corpus": each quote becomes a list of (word ID, count) pairs, and a dictionary maps every word ID back to its word.

In [13]:
def produce_bow_corpus(processed_qs):
    'Produce a bag of words corpus given processed quotes'
    dictionary = corpora.Dictionary()
    return [dictionary.doc2bow(q, allow_update=True) for q in processed_qs], dictionary
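
To make the representation concrete, the following plain-Python sketch imitates the shape of what `produce_bow_corpus` returns on a tiny input. It is an illustration only: `toy_bow_corpus` is a hypothetical name, and the integer IDs gensim assigns to words may differ from the ones produced here.

```python
from collections import Counter


def toy_bow_corpus(processed_qs):
    """Plain-Python imitation of a bag-of-words corpus (illustrative only)."""
    token2id = {}
    corpus = []
    for tokens in processed_qs:
        for token in tokens:
            # Assign the next free integer ID to tokens not seen before
            token2id.setdefault(token, len(token2id))
        # Count how often each word ID occurs in this document
        counts = Counter(token2id[t] for t in tokens)
        corpus.append(sorted(counts.items()))
    return corpus, token2id


corpus, token2id = toy_bow_corpus([['trump', 'vote', 'vote'], ['vote', 'market']])
print(token2id)  # {'trump': 0, 'vote': 1, 'market': 2}
print(corpus)    # [[(0, 1), (1, 2)], [(1, 1), (2, 1)]]
```

Word order within a quote is discarded. Only the per-quote counts remain, which is all LDA needs.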

### Saving and loading BOW corpus
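The helpers below save and load a corpus so that it does not have to be recomputed. Note that the Matrix Market (`.mm`) file holds only the word IDs and counts; the dictionary has to be saved separately (for example with its `save` method).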

In [11]:
def save_bow_corpus(bow_corpus, file_id='0'):
    'Save a bag of words corpus with a given identifier'
    corpora.MmCorpus.serialize('./data/BOW_corpus_{}.mm'.format(file_id), bow_corpus)
    
def load_bow_corpus(file_id='0'):
    'Load a bag of words corpus with a given identifier'
    return corpora.MmCorpus('./data/BOW_corpus_{}.mm'.format(file_id))

### Use LDA to extract topics
We now run LDA on our bag of words corpus.  
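**Note**: We will have to tune some parameters for the LDA model (for now, $\alpha$ and $\eta$ are left at gensim's defaults), mainly:
- $\alpha$: The prior belief about each document's distribution over topics
- $\eta$: The prior belief about each topic's distribution over words
- `n_topics`: The number of topics to model (passed to gensim as `num_topics`)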
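
For reference, the roles of the two priors can be read off the standard LDA generative model, written here with $K$ for the number of topics (`n_topics`). The symbols $\theta_d$, $\beta_k$, $z_{d,i}$ and $w_{d,i}$ are notation introduced for this summary, not names used elsewhere in the notebook:

$$
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\beta_k &\sim \mathrm{Dirichlet}(\eta) && \text{word distribution of topic } k,\ k = 1, \dots, K \\
z_{d,i} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } i\text{-th word in document } d \\
w_{d,i} &\sim \mathrm{Categorical}(\beta_{z_{d,i}}) && \text{the observed word}
\end{aligned}
$$

Roughly speaking, a smaller $\alpha$ pushes each quote towards fewer topics, and a smaller $\eta$ pushes each topic towards fewer characteristic words.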

In [72]:
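def lda_model(bow_corpus, id2word, n_topics):
    'Train an LDA model with n_topics topics on a bag of words corpus (alpha and eta keep their defaults)'
    return LdaModel(corpus=bow_corpus, num_topics=n_topics, id2word=id2word)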

### Putting everything together

In [73]:
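def lda_quotes_topics(quotes, n_topics):
    'Preprocess quotes, fit an LDA model with n_topics topics, and print the top words of the displayed topics'
    processed_quotes = preprocess_quotes(quotes)
    bow, wordmap = produce_bow_corpus(processed_quotes)
    model = lda_model(bow, wordmap, n_topics)
    # show_topics returns at most 10 topics by default (num_topics=10),
    # so only a subset is printed when n_topics > 10
    for topic_id, words in model.show_topics(formatted=False):
        print(topic_id, [x[0] for x in words])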

In [82]:
lda_quotes_topics(quotes, 12)

10 ['want', 'year', 'actually', 'election', 'seen', 'number', 'people', 'market', 'talking', 'chance']
5 ['right', 'talk', 'trump', 'art', 'far', 'china', 'people', 'left', 'happened', 'human']
4 ['got', 'wanted', 'coming', 'knew', 'percent', 'hand', 'help', 'open', 'kind', 'going']
9 ['american', 'history', 'little', 'great', 'moment', 'opportunity', 'hope', 'bit', 'situation', 'best']
2 ['person', 'going', 'come', 'win', 'wasn', 'question', 'like', 'mind', 'took', 'vote']
8 ['state', 'president', 'house', 'love', 'united', 'fact', 'country', 'white', 'support', 'went']
3 ['like', 'time', 'day', 'said', 'look', 'end', 'tell', 'year', 'money', 'felt']
6 ['long', 'change', 'care', 'time', 'future', 'business', 'start', 'term', 'health', 'stop']
7 ['new', 'trump', 'job', 'deal', 'high', 'york', 'night', 'level', 'plan', 'justice']
1 ['woman', 'political', 'trying', 'problem', 'case', 'today', 'decision', 'law', 'better', 'campaign']


Above are some of the topics that LDA extracted, which we can use to try to label our cluster. In this case there seems to be a bias towards political and election news, along with some business and justice content. One observation is that adding bigrams would likely have surfaced "New York" as a prevalent term: topic 7 contains both "new" and "york". The number of topics will still have to be tuned so that the topics describe a cluster accurately.
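
One way to make the choice of topic count less subjective is to score each candidate model with a topic coherence measure and compare the scores. The block below is a simplified plain-Python sketch of a UMass-style coherence (the average log co-occurrence of a topic's top words across documents), not the implementation gensim uses. The function name `umass_style_coherence` and the smoothing constant `eps` are assumptions made for illustration. For a trained model, gensim provides a `CoherenceModel` class that computes this kind of measure.

```python
import math


def umass_style_coherence(topic_words, documents, eps=1.0):
    """Average log co-occurrence score of a topic's top words (simplified sketch)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            denom = doc_freq(w_j)
            if denom == 0:
                continue  # skip words that never occur in the corpus
            scores.append(math.log((doc_freq(w_i, w_j) + eps) / denom))
    return sum(scores) / len(scores) if scores else 0.0


docs = [
    ['trump', 'election', 'vote'],
    ['trump', 'vote', 'campaign'],
    ['market', 'money', 'business'],
    ['market', 'business', 'deal'],
]
coherent = umass_style_coherence(['trump', 'vote'], docs)
mixed = umass_style_coherence(['trump', 'market'], docs)
print(coherent > mixed)  # True
```

Higher scores mean that a topic's top words tend to appear in the same quotes. Running such a score over models trained with different values of `n_topics` gives a rough basis for comparing them.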

For reference, see the gensim documentation on LDA: https://radimrehurek.com/gensim/models/ldamodel.html