To run this notebook, first install the gensim and nltk packages (the notebook also imports wordcloud, which must be available):
```
conda install -c anaconda nltk
conda install -c anaconda gensim
```

## Describing topics from a cluster using LDA

### Useful libraries
We will use Gensim and the Natural Language Toolkit (NLTK) to process the text of the quotes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from wordcloud import WordCloud
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import LdaModel, LdaMulticore
from gensim.models.phrases import Phrases
from gensim.parsing.preprocessing import STOPWORDS
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import EnglishStemmer


lemmatizer = WordNetLemmatizer()
stemmer = EnglishStemmer()

### Text preprocessing
We first preprocess the quotes:
- Split text into words, apply lowercase, and remove punctuation
- Ignore words of length < 3
- Remove stopwords
- Lemmatize words 
- Add bigrams
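
The first three steps can be sketched without any third-party package. The snippet below is only an illustration: the regex tokenizer and the tiny `TOY_STOPWORDS` set are assumptions standing in for gensim's `simple_preprocess` and `STOPWORDS`, and lemmatization and bigrams are left out.

```python
import re

# Hypothetical, tiny stopword list; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = {"the", "and", "are", "for", "that", "this", "with"}

def toy_preprocess(text, min_len=3):
    """Lowercase, keep alphabetic tokens, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]

print(toy_preprocess("The players are ready for the new Game!"))
# ['players', 'ready', 'new', 'game']
```

Punctuation disappears because only runs of letters are kept, which is roughly the effect we rely on in the real pipeline.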

In [15]:
def wordnet_pos(tag):
    'Convert a POS tag to its equivalent wordnet tag'
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('V'):
        return wn.VERB
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN # default option if all else fails

def lemmatize(tagged):
    'Apply lemmatization to a word'
    # Decompose word and its tag
    word, tag = tagged   
    return lemmatizer.lemmatize(word, pos=wordnet_pos(tag))

def stem(word):
    'Apply stemming to a word'
    return stemmer.stem(word)

def preprocess_quotes(qs):
    'Split quotes into lowercase words, drop punctuation, words of length < 3 and stopwords, lemmatize, then append detected bigrams'
    processed_quotes = []
    for q in qs:
        processed = []
        # Convert quote to list of lowercase tokens, ignoring those w/ length < 3
        # Apply POS tagging
        for token in pos_tag(simple_preprocess(q, min_len=3)):
            # Ignore stopwords
            if token[0] not in STOPWORDS:
                # Lemmatize token
                processed.append(lemmatize(token))
        processed_quotes.append(processed)
    
    # Add bigrams
    bigram = Phrases(processed_quotes, min_count=10, delimiter='_', threshold=5)
    for i in range(len(processed_quotes)):
        for token in bigram[processed_quotes[i]]:
            if '_' in token:
                # Add token to quote if it is a bigram
                processed_quotes[i].append(token)
    
    return processed_quotes

**Note**: Bigrams are appended to each quote so that words that frequently appear together (e.g. `nintendo_switch`) are also available as single tokens. Higher-order n-grams could be considered as well.
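
To give some intuition for the `min_count` and `threshold` arguments passed to `Phrases`, here is a small stdlib-only sketch of a bigram score. We assume the scoring formula of Mikolov et al., which, as far as we understand, is what gensim uses by default; the function name and toy documents are made up for illustration.

```python
from collections import Counter

def bigram_scores(docs, min_count=1):
    """Score adjacent pairs as (count(a, b) - min_count) * |V| / (count(a) * count(b))."""
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Made-up documents, already tokenized
docs = [
    ["nintendo", "switch", "game"],
    ["nintendo", "switch", "console"],
    ["new", "game", "console"],
]
scores = bigram_scores(docs, min_count=1)
print(max(scores, key=scores.get), scores[("nintendo", "switch")])
# ('nintendo', 'switch') 1.25
```

A pair would be merged into a single `a_b` token only when its score exceeds the chosen threshold, so a higher `threshold` yields fewer bigrams.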

### Produce bag of words
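We then map the preprocessed quotes to a bag-of-words corpus: each quote becomes a list of (word ID, count) pairs, with the IDs taken from a dictionary built over all quotes. Words that appear in fewer than 5 quotes or in more than 50% of the quotes are dropped from the dictionary.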
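
As a rough illustration of this representation, the stdlib-only sketch below builds (token ID, count) pairs and applies a document-frequency filter. `toy_bow` is a hypothetical helper, not gensim's implementation; we assume a token is kept when it appears in at least `no_below` documents and in at most a fraction `no_above` of them.

```python
from collections import Counter

def toy_bow(docs, no_below=1, no_above=1.0):
    """Return (bow_corpus, token2id); each document becomes sorted (token_id, count) pairs."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = sorted(t for t, df in doc_freq.items()
                  if df >= no_below and df / len(docs) <= no_above)
    token2id = {tok: i for i, tok in enumerate(kept)}
    # Tokens that were filtered out are simply skipped when counting
    bow = [sorted(Counter(token2id[t] for t in doc if t in token2id).items())
           for doc in docs]
    return bow, token2id

# Made-up documents, already tokenized
docs = [["game", "play", "game"], ["church", "pray"], ["game", "church"]]
bow, token2id = toy_bow(docs, no_below=2)
print(token2id)  # {'church': 0, 'game': 1}
print(bow)       # [[(1, 2)], [(0, 1)], [(0, 1), (1, 1)]]
```

Word order is lost in this representation; only the counts per quote remain.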

In [4]:
def produce_bow_corpus(processed_qs):
    'Produce a bag of words corpus given processed quotes'
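    dictionary = corpora.Dictionary(processed_qs)
    # Drop words that appear in fewer than 5 quotes or in more than 50% of the quotes
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    # allow_update is left at its default (False): enabling it would re-add the filtered words
    return [dictionary.doc2bow(q) for q in processed_qs], dictionary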

### Saving and loading BOW corpus
We provide helpers to save and load a corpus so that it does not have to be recomputed.

In [5]:
def save_bow_corpus(bow_corpus, file_id='0'):
    'Save a bag of words corpus with a given identifier'
    corpora.MmCorpus.serialize('./data/BOW_corpus_{}.mm'.format(file_id), bow_corpus)
    
def load_bow_corpus(file_id='0'):
    'Load a bag of words corpus with a given identifier'
    return corpora.MmCorpus('./data/BOW_corpus_{}.mm'.format(file_id))

### Use LDA to extract topics
We now run LDA on our bag of words corpus.
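
For intuition, LDA assumes that every quote is a mixture of topics and that every topic is a distribution over the vocabulary. One standard way to write this generative model is given below. The symbols are our own notation, not gensim parameter names: $K$ is the number of topics (`n_topics`), $D$ the number of quotes, $w_{d,n}$ the $n$-th word of quote $d$, and $\alpha$, $\beta$ are Dirichlet priors.

$$
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
$$

Training estimates the topic-word distributions $\phi_k$, whose most probable words are what we print for each cluster later on.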

In [6]:
def lda_model(bow_corpus, id2word, n_topics):
    'Train an LDA model with n_topics topics on a bag of words corpus'
    return LdaMulticore(corpus=bow_corpus, num_topics=n_topics, id2word=id2word,
                        workers=6, passes=10, random_state=123)

### Putting everything together

In [7]:
# LDA Pipeline
def get_lda_model(quotes, n_topics):
    '''Return LDA model'''
    processed_quotes = preprocess_quotes(quotes)
    bow, wordmap = produce_bow_corpus(processed_quotes)
    model = lda_model(bow, wordmap, n_topics)
    return model
        
def model_topics(model):
    '''Show top 10 words for each topic'''
    for topic_id, words in model.show_topics(formatted=False):
        print(topic_id, [x[0] for x in words])
        
# --- SAVING AND LOADING LDA MODEL --- #
def save_model(model, file_name):
    model.save('../datasets/lda/'+file_name)
    
def load_model(file_name):
    return LdaModel.load('../datasets/lda/'+file_name)

For reference, see the gensim documentation on LDA: https://radimrehurek.com/gensim/models/ldamodel.html

### Load data

In [8]:
CLEAN_QUOTES = '../data/clean_quotes.csv.bz2'
CLUSTERS = '../data/clusters.csv.bz2'
QUOTES_PATH = '../data/quotes-2020.json.bz2'

In [9]:
clean_quotes = pd.read_csv(CLEAN_QUOTES).drop_duplicates()[['quoteID', 'journal']]
clean_quotes.head(2)

             quoteID     journal
0  2020-01-24-000168  people.com
3  2020-01-21-031706  people.com


In [10]:
cluster_assignments = pd.read_csv(CLUSTERS, index_col=0)['cluster_id']
cluster_assignments.head(2)

journal
1011now.com      -1.0
1070thefan.com   -1.0
Name: cluster_id, dtype: float64

In [11]:
clustered = cluster_assignments.groupby(cluster_assignments)

In [12]:
n_clusters = len(cluster_assignments.unique()) - 1  # ignore the noise cluster (assignment -1)
print(f'Number of clusters: {n_clusters}')

Number of clusters: 15


In [13]:
groups = [clustered.get_group(n) for n in range(n_clusters)]
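
The grouping above relies on the cluster IDs being exactly `0 .. n_clusters - 1` plus a noise label. As a hedged, stdlib-only sketch of the same idea that does not depend on that, the snippet below groups journals by cluster and skips the noise label explicitly; the journal names and the `group_by_cluster` helper are made up for illustration.

```python
from collections import defaultdict

# Made-up journal -> cluster assignments; -1 marks noise, as in the notebook
assignments = {"a.com": 0, "b.com": 1, "c.com": -1, "d.com": 0, "e.com": 1}

def group_by_cluster(assignments, noise_label=-1):
    """Return {cluster_id: [journals]} for every non-noise cluster."""
    groups = defaultdict(list)
    for journal, cluster_id in assignments.items():
        if cluster_id != noise_label:
            groups[cluster_id].append(journal)
    return dict(groups)

toy_groups = group_by_cluster(assignments)
print(toy_groups)       # {0: ['a.com', 'd.com'], 1: ['b.com', 'e.com']}
print(len(toy_groups))  # 2
```

Counting the keys of the result gives the number of clusters directly, whether or not any journal was labelled as noise.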

In [14]:
quotes = []

def process_chunk(chunk):
    print('Processing chunk')
    quotes.append(chunk[['quoteID', 'quotation']])

with pd.read_json(QUOTES_PATH, lines=True, compression='bz2', chunksize=1000000) as df_reader:
    for chunk in df_reader:
        process_chunk(chunk)
print('Done processing!')
        
quotes = pd.concat(quotes)

Processing chunk
Processing chunk
Processing chunk
Processing chunk
Processing chunk
Processing chunk
Done processing!
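
Reading the file in chunks keeps memory bounded, because only the two columns we need are retained from each chunk. The same pattern can be sketched with the standard library alone. The file written below is a made-up toy, and `iter_chunks` is a hypothetical helper, not part of pandas.

```python
import bz2
import json
import os
import tempfile

def iter_chunks(path, chunksize):
    """Yield lists of at most `chunksize` records from a bz2-compressed JSON Lines file."""
    chunk = []
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) == chunksize:
                yield chunk
                chunk = []
    if chunk:  # last, possibly smaller, chunk
        yield chunk

# Build a small, made-up file so the sketch is self-contained
records = [{"quoteID": f"id-{i}", "quotation": f"quote {i}", "extra": i} for i in range(5)]
path = os.path.join(tempfile.mkdtemp(), "toy-quotes.json.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

quotes_by_id = {}
chunk_sizes = []
for chunk in iter_chunks(path, chunksize=2):
    chunk_sizes.append(len(chunk))
    # Keep only the two fields we need from each record
    quotes_by_id.update((r["quoteID"], r["quotation"]) for r in chunk)

print(chunk_sizes)           # [2, 2, 1]
print(quotes_by_id["id-3"])  # quote 3
```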


In [16]:
# {Quotation ID -> Quotation} dictionary for fast retrieval
quotes_dict = dict(quotes.values)

In [17]:
# Match quote IDs to their clusters
ids = [clean_quotes.merge(groups[i], on='journal')['quoteID'] for i in range(len(groups))]
ids[0][:2]

0    2020-04-06-037825
1    2020-04-06-060329
Name: quoteID, dtype: object

In [18]:
# Obtain cluster quotes
cluster_quotes = [[quotes_dict[id_] for id_ in ids[i]] for i in range(len(ids))]
cluster_quotes[0][:2]

["Right now we're just reacting to... it's a different retail chain, whether we could get physical copies to people, is the internet infrastructure there to support all countries... We're right now looking at all sorts of different options,",
 "We'd rather put our focus on finishing the actual game and getting it to people,"]

In [19]:
for i in range(len(cluster_quotes)):
    print(f'Cluster {i}: {len(cluster_quotes[i])} quotes')

Cluster 0: 2602 quotes
Cluster 1: 15305 quotes
Cluster 2: 44992 quotes
Cluster 3: 85009 quotes
Cluster 4: 24309 quotes
Cluster 5: 25059 quotes
Cluster 6: 196466 quotes
Cluster 7: 504899 quotes
Cluster 8: 1309483 quotes
Cluster 9: 9287 quotes
Cluster 10: 36923 quotes
Cluster 11: 335119 quotes
Cluster 12: 239235 quotes
Cluster 13: 12086 quotes
Cluster 14: 2163368 quotes


In [21]:
# Run LDA on each cluster
for i in range(n_clusters):
    print(f'Processing cluster {i}')
    model = get_lda_model(cluster_quotes[i], 4)
    print('Saving LDA model...')
    save_model(model, f'cluster-{i}')

Processing cluster 0
Saving LDA model...
Processing cluster 1
Saving LDA model...
Processing cluster 2
Saving LDA model...
Processing cluster 3
Saving LDA model...
Processing cluster 4
Saving LDA model...
Processing cluster 5
Saving LDA model...
Processing cluster 6
Saving LDA model...
Processing cluster 7
Saving LDA model...
Processing cluster 8
Saving LDA model...
Processing cluster 9
Saving LDA model...
Processing cluster 10
Saving LDA model...
Processing cluster 11
Saving LDA model...
Processing cluster 12
Saving LDA model...
Processing cluster 13
Saving LDA model...
Processing cluster 14
Saving LDA model...


In [22]:
for i in range(n_clusters):
    print(f'Cluster {i} topics:')
    model_topics(load_model(f'cluster-{i}'))

Cluster 0 topics:
0 ['like', 'thing', 'think', 'go', 'year', 'look', 'different', 'good', 'new', 'console']
1 ['game', 'want', 'work', 'time', 'experience', 'new', 'people', 'like', 'go', 'year']
2 ['game', 'people', 'new', 'play', 'year', 'lot', 'video', 'work', 'like', 'know']
3 ['switch', 'nintendo', 'time', 'story', 'nintendo_switch', 'like', 'world', 'new', 'people', 'team']
Cluster 1 topics:
0 ['life', 'people', 'god', 'time', 'know', 'new', 'think', 'go', 'world', 'human']
1 ['church', 'people', 'lord', 'god', 'good', 'catholic', 'pray', 'community', 'woman', 'day']
2 ['god', 'life', 'love', 'time', 'human', 'way', 'need', 'jesus', 'church', 'christ']
3 ['people', 'come', 'world', 'right', 'holy', 'want', 'god', 'country', 'today', 'heart']
Cluster 2 topics:
0 ['people', 'need', 'time', 'family', 'health', 'public', 'work', 'support', 'country', 'virus']
1 ['want', 'come', 'year', 'play', 'level', 'great', 'club', 'time', 'game', 'continue']
2 ['government', 'people', 'party', '

#### Wordclouds
We generate a wordcloud for every topic of every cluster to visualize the LDA output.

In [30]:
# Generate wordclouds for each cluster
for i in range(n_clusters):
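    model = load_model(f'cluster-{i}')  # named `model` so that the lda_model() function is not shadowed
    for t in range(model.num_topics):
        plt.figure()
        # Size each of the topic's top 200 words by its probability
        plt.imshow(WordCloud(background_color='white').fit_words(dict(model.show_topic(t, 200))))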
        plt.axis('off')
        plt.savefig(f'../datasets/wordclouds/cluster{i}_wordcloud{t}')
        plt.close()