The objective of this script is to take the output from scraping the Green Party media release page and build a topic model using Latent Dirichlet Allocation (LDA).
*   Author: Colin MacDonald for RA2, June 2020
*   Resources: This script borrows heavily from the following:
     - https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
     - https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

The input is the CSV output file from the Greenparty Web Scraper. In this case, a data set of the 50 most recent articles was collected.

In [None]:
!pip install pyldavis # Install the required Library
!pip install numpy
!pip install gensim

Import the required modules

In [41]:
import pandas as pd
import os
import re
import gensim
import nltk
import spacy
import numpy as np
import tqdm
import gensim.corpora as corpora
import pyLDAvis
import pyLDAvis.gensim  # note: newer pyLDAvis releases (3.3+) rename this module to pyLDAvis.gensim_models
import pickle 
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from pprint import pprint
from gensim.models import CoherenceModel
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Upload the Greenparty CSV file to the Colab folder, content/sample_data. For this instance we'll drop the Date, Header, and Link columns and work only with the "Content" column.



In [42]:
greenparty_full = pd.read_csv(r'/content/sample_data/Copy of greenpartywebscrape50.csv')
greenparty = greenparty_full.drop(columns=['Date', 'Header', 'Link'])
greenparty.head()

Unnamed: 0.1,Unnamed: 0,Content
0,0,"TORONTO – On June 23, 2020, TVO will host the ..."
1,1,OTTAWA – The 10 contenders for the leadership ...
2,2,"OTTAWA – Last week, Green Party parliamentary ..."
3,3,OTTAWA – The Green Party of Canada has announc...
4,4,OTTAWA – The Green Party of Canada has accepte...


Initial cleaning. The first round will be:


1.   Reducing all words to lower case
2.   Eliminating common punctuation, in preparation for tokenizing the sentences into lists of words



In [43]:
greenparty['content_processed'] = greenparty['Content'].map(lambda x: re.sub('[,.!?-]', '', x))
greenparty['content_processed'] = greenparty['content_processed'].map(lambda x: x.lower())
greenparty['content_processed'].head()

0    toronto – on june 23 2020 tvo will host the fi...
1    ottawa – the 10 contenders for the leadership ...
2    ottawa – last week green party parliamentary l...
3    ottawa – the green party of canada has announc...
4    ottawa – the green party of canada has accepte...
Name: content_processed, dtype: object
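
As a small illustration of the cleaning step above, the sketch below applies the same two operations to a single made-up sentence (the sample text is hypothetical, not taken from the data set). It suggests two side effects worth knowing about: the en dash used in the datelines is not in the character class, so it survives, and hyphenated words are fused together.

```python
import re

# Hypothetical sample sentence, modelled on the press-release openings above
sample = "OTTAWA – Last week, a two-debate format was announced!"

# Same two steps as the cell above: strip , . ! ? - and then lower-case
cleaned = re.sub(r"[,.!?-]", "", sample).lower()
print(cleaned)
```

The fused token "twodebate" is the same kind of artifact that shows up later in the lemmatized output.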

A function that iterates through the documents and renders each one as a list of words.

In [44]:
%%time
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself lower-cases and drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
        
data = greenparty.content_processed.values.tolist()
data_words = list(sent_to_words(data))
#print(data_words[:1])

CPU times: user 52.4 ms, sys: 1.59 ms, total: 54 ms
Wall time: 56.5 ms
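
To make the tokenizing step more concrete, here is a rough standard-library approximation of what `simple_preprocess` with `deacc=True` appears to do: lower-case the text, strip accent marks, split on anything that is not a letter, and keep tokens of 2 to 15 characters. The function name `rough_preprocess` and the sample sentence are assumptions made for illustration; this is a sketch of the idea, not gensim's actual implementation.

```python
import re
import unicodedata

def rough_preprocess(text, min_len=2, max_len=15):
    """Hypothetical stand-in for gensim's simple_preprocess(text, deacc=True)."""
    # Strip accent marks: decompose characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    no_accents = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Keep runs of letters/underscores only, so digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d]+", no_accents)
    # Drop very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

tokens = rough_preprocess("Montréal – on June 23 2020 TVO will host a debate")
print(tokens)
```

Note that the numbers and the one-letter word "a" drop out, which is why dates no longer appear in the tokenized documents.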


Use the Gensim module to create the bigram and trigram models, i.e. groups of two and three words that commonly occur together.

In [45]:
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
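
For intuition about `min_count` and `threshold`, the sketch below scores adjacent word pairs with a simplified version of the default scoring used for phrase detection, as I understand it: (pair count - min_count) * vocabulary size / (count of first word * count of second word). The helper name `score_bigrams`, the toy documents, and the use of the unigram count as the vocabulary size are all assumptions for illustration; gensim's own bookkeeping differs in the details.

```python
from collections import Counter

def score_bigrams(docs, min_count=5, threshold=100.0):
    """Return {(word_a, word_b): score} for pairs scoring above the threshold."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: unigrams only
    kept = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept

# Toy corpus (hypothetical): "leadership election" always occurs together
docs = [["leadership", "election", "vote"]] * 4 + [
    ["vote", "party", "member"],
    ["member", "vote", "party"],
]
phrases = score_bigrams(docs, min_count=2, threshold=0.5)
print(phrases)
```

With a corpus of only 50 articles, the relatively high `threshold=100` used above means few pairs are likely to be merged; lowering it would produce more phrases.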



Define the stop words (words that do not significantly contribute to topic meaning), along with the functions for removing them, building bigrams/trigrams, and lemmatizing.

In [46]:
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'say', 'need', 'new', 'must', 'first', 'would', 'time', 'many', 'small', 
                   'also', 'call', 'year', 'may', 'step', 'also', 'use', 'pronoun', 'know', 'hold', 'even', 'live', 'learn', 'fair',
                   'approach', 'exciting', 'field', 'meet', 'range', 'able', 'feature', 'page', 'thrill', 'night', 'say'])
                  
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

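def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only the allowed parts of speech (see https://spacy.io/api/annotation).

    Relies on the spaCy pipeline `nlp` being loaded before this function is called.
    """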
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [47]:
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:1])

[['official', 'leadership', 'contestant', 'participate', 'twodebate', 'format', 'moderate', 'courtney', 'thrill', 'host', 'official', 'leadership', 'debate', 'say', 'stream', 'broadcast', 'event', 'allow', 'people', 'opportunity', 'tune', 'diverse', 'contestant', 'respect', 'journalist', 'bring', 'encyclopedic', 'knowledge', 'canadian', 'politic', 'issue', 'role', 'moderator', 'contestant', 'discuss', 'broad', 'topic', 'debate', 'provide', 'viewer', 'unique', 'opportunity', 'leadership', 'candidate', 'connect', 'remotely', 'home', 'debate', 'stream', 'consecutively', 'agenda', 'twitter', 'periscope', 'link', 'follow', 'twitt', 'broadcast', 'edtcontestant', 'pauldebate', 'edtcontestant', 'contestant', 'leadership', 'contest']]


This next section creates first the dictionary, then the corpus. The dictionary gives each unique word an id; the corpus represents each document as a list of (word id, count) pairs.

In [48]:
id2word = corpora.Dictionary(data_lemmatized)

texts = data_lemmatized

corpus = [id2word.doc2bow(text) for text in texts]

print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 4), (11, 1), (12, 3), (13, 1), (14, 1), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 4), (26, 1), (27, 1), (28, 1), (29, 2), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1)]]
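
The (id, count) pairs printed above follow the bag-of-words idea, which can be sketched with the standard library alone. The helper names `build_dictionary` and `to_bow` and the toy documents are hypothetical; gensim's `Dictionary` may assign ids in a different order, so only the structure of the result should be compared.

```python
from collections import Counter

def build_dictionary(docs):
    """Map each unique token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Toy documents (hypothetical)
docs = [["debate", "leadership", "debate"], ["leadership", "vote"]]
token2id = build_dictionary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow_corpus)
```

Word order is discarded at this point: LDA only sees how often each word id occurs in each document.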


Next we'll take the corpus and dictionary from above and train the LDA model. The number of topics, chunk size, and number of passes are just the suggested defaults.

In [49]:
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=10, random_state=100, chunksize=100,
                                       passes=10, per_word_topics=True)

  score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc)

A quick check of the output, showing the weight of the top words within each topic. This provides a quick scan of:


*   any additional stop words that need to be removed (e.g. a high occurrence of 'now' or 'also')
*   the highest-weighted words in each topic, which can aid in defining the topic



In [37]:
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.022*"say" + 0.011*"community" + 0.009*"covid" + 0.008*"municipality" + '
  '0.006*"include" + 0.006*"territory" + 0.006*"could" + 0.006*"risk" + '
  '0.006*"member" + 0.006*"health"'),
 (1,
  '0.024*"emergency" + 0.014*"government" + 0.011*"public" + 0.011*"say" + '
  '0.011*"climate" + 0.009*"welfare" + 0.009*"emission" + 0.008*"canadian" + '
  '0.007*"crisis" + 0.007*"provide"'),
 (2,
  '0.018*"information" + 0.016*"year" + 0.015*"leader" + 0.013*"work" + '
  '0.013*"member" + 0.012*"include" + 0.011*"party" + 0.010*"vote" + '
  '0.010*"leadership_election" + 0.010*"vote_go"'),
 (3,
  '0.016*"say" + 0.012*"leadership" + 0.011*"canadian" + 0.010*"contestant" + '
  '0.009*"government" + 0.009*"debate" + 0.007*"member" + 0.006*"party" + '
  '0.006*"allow" + 0.006*"opportunity"'),
 (4,
  '0.016*"airline" + 0.011*"say" + 0.011*"pandemic" + 0.011*"passenger" + '
  '0.010*"voucher" + 0.010*"canadian" + 0.008*"green" + 0.008*"travel" + '
  '0.008*"refund" + 0.005*"government"'),
 (
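
One way to scan for candidate stop words is to count how many topics each top word appears in. The sketch below parses topic strings in the `print_topics()` format; the strings are shortened copies of the first five topics printed above, and the cut-off of three topics is an arbitrary assumption.

```python
import re
from collections import Counter

# Shortened copies of the first five topic strings from the output above
topics = {
    0: '0.022*"say" + 0.011*"community" + 0.009*"covid"',
    1: '0.024*"emergency" + 0.014*"government" + 0.011*"public" + 0.011*"say"',
    2: '0.018*"information" + 0.016*"year" + 0.015*"leader"',
    3: '0.016*"say" + 0.012*"leadership" + 0.011*"canadian"',
    4: '0.016*"airline" + 0.011*"say" + 0.011*"pandemic"',
}

word_re = re.compile(r'\*"([^"]+)"')  # captures the word in each weight*"word" term

# Count the number of topics in which each word appears at least once
topic_counts = Counter(
    word for text in topics.values() for word in set(word_re.findall(text))
)
shared = [word for word, n in topic_counts.items() if n >= 3]
print(shared)
```

Words such as "say" show up even though they are in the stop word list. A likely reason is that stop words are removed before lemmatization, so inflected forms like "said" pass the filter and are only reduced to "say" afterwards; re-running the stop word removal on the lemmatized text would be one way to test this.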

The following calculates a baseline coherence score: roughly, a measure of how contextually related the top words in each topic are. We can use it later when tuning the model, since adjustments to the other parameters (alpha/beta) should give a score higher than this baseline.

In [17]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.35618398842540977


The next sections run a set of simulations to see how coherence changes with the following variables:
*   Number of Topics (K) 
*   Dirichlet hyperparameter alpha: Document-Topic Density
*   Dirichlet hyperparameter beta: Word-Topic Density

The output is a pandas DataFrame and a CSV file that can be quickly opened in Excel. The example output for a=0.91, b=0.91 shows a C_v score that peaks at about 0.48, against a baseline of 0.39, at the 8-topic point.

In [19]:
def compute_coherence_values(corpus, dictionary, k, a, b):
    # Train one LDA model for the given (k, alpha, beta) and return its C_v coherence.
    # The seed is passed directly: np.random.seed(0) returns None, so it cannot serve as random_state.
    lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0,
                                           chunksize=100, passes=10, alpha=a, eta=b, per_word_topics=True)

    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')

    return coherence_model_lda.get_coherence()

In [None]:
# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets (only the full corpus is used here; the clipped subsets are commented out)
num_of_docs = len(corpus)
corpus_sets = [# gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.25)),
               # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.5)),
               # gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)),
               corpus]

# One title per entry in corpus_sets
corpus_title = ['100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Can take a long time to run: one model is trained per (corpus set, k, alpha, beta) combination
pbar = tqdm.tqdm(total=len(corpus_sets) * len(topics_range) * len(alpha) * len(beta))

# iterate through validation corpuses
for i in range(len(corpus_sets)):
    # iterate through number of topics
    for k in topics_range:
        # iterate through alpha values
        for a in alpha:
            # iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word,
                                              k=k, a=a, b=b)
                # Save the model results
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)

                pbar.update(1)
pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
pbar.close()
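
Since every combination trains a full model, it helps to know the size of the grid before starting. The sketch below counts the combinations for the ranges used above, assuming a single corpus set; the alpha and beta values are written out as rounded literals in place of the `np.arange` call.

```python
from itertools import product

topics_range = range(2, 11)  # k = 2 .. 10
alpha = [0.01, 0.31, 0.61, 0.91, "symmetric", "asymmetric"]
beta = [0.01, 0.31, 0.61, 0.91, "symmetric"]
n_corpus_sets = 1  # assumption: only the full corpus is evaluated

# One entry per (corpus set, k, alpha, beta) combination, i.e. one model to train
grid = list(product(range(n_corpus_sets), topics_range, alpha, beta))
print(len(grid))  # 9 * 6 * 5 = 270
```

Adding a second corpus set doubles this to 540 model fits, so the run time grows quickly with each extra validation subset.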

Based on a comparison of the k, a, and b parameters, a best fit is chosen: in this case 8 topics (k=8), a=0.91, b=0.91. The LDA model is run again with these parameters.
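
A simple starting point for that comparison is to pull the highest-coherence row out of the tuning results. The sketch below does this with the standard library on a small in-memory CSV; the three rows and their coherence values are made up purely for illustration and are not results from this analysis. With the real file, `open('lda_tuning_results.csv')` would replace the `io.StringIO` object.

```python
import csv
import io

# Made-up rows in the same column layout as lda_tuning_results.csv
toy_csv = """Validation_Set,Topics,Alpha,Beta,Coherence
100% Corpus,2,0.01,0.01,0.10
100% Corpus,2,symmetric,0.31,0.30
100% Corpus,3,0.31,symmetric,0.20
"""

rows = list(csv.DictReader(io.StringIO(toy_csv)))

# Pick the row with the highest coherence score
best = max(rows, key=lambda row: float(row["Coherence"]))
print(best["Topics"], best["Alpha"], best["Beta"], best["Coherence"])
```

The top-scoring row is only a guide: it is still worth checking that the resulting topics are interpretable before settling on a parameter set.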

In [None]:
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=8, random_state=100, chunksize=100, passes=10,
                                      alpha=0.91, eta=0.91)

                                           

The pyLDAvis module visualizes the output: the clustering of topics and the top 30 words/terms for each topic, with their frequencies.

In [None]:

pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
LDAvis_prepared