# Building a Basic Topic Model from the (Generated) ATS Document Database

Here I explore the ATS document database --- which should have already been scraped, cleaned, and compiled by scripts held in `/src/`, and managed via `make scrape` and `make data` --- and then train a basic topic model on the corpus. I explore the resulting topics briefly, and show changes by year of meeting.

## Requirements

Let's start by meeting some basic requirements:

In [1]:
!pip install pandas
!pip install numpy
!pip install wordcloud
!pip install gensim==3.8.3
!pip install seaborn
!pip install matplotlib

We'll also be using `mallet` (via the `gensim` wrapper), so we need to make sure that's installed for use:

In [None]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip

In [None]:
!unzip -o mallet-2.0.8.zip

In [None]:
mallet_path = './mallet-2.0.8/bin/mallet' # for use when actually passing an LDAMallet wrapper to gensim

Import the requirements for loading the data and visualising the initial exploration:

In [None]:
import pandas as pd
import numpy as np
import os
import sys
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

## Loading Data

In [None]:
df = pd.read_pickle('../data/processed/documents/ats_documents.pkl')

This dataframe was generated by a feature processing script (which should have been executed via `make`, see above). It looks like this:

In [None]:
df.head()

Key info:

In [None]:
df.info()

Let's confirm the number of empty text fields seen above:

In [None]:
print(f'{len(df[df.raw_text.isnull()])} of {len(df)}')

Remove the rows with null text:

In [None]:
clean_df = df[df.raw_text.notnull()]

In [None]:
print(len(clean_df))

Document counts by language:

In [None]:
for language in ['e','f','s','r']:
    print(f'{language}: {len(clean_df[clean_df.paper_language_abbreviation == language])}')

We'll use the English documents for this model:

In [None]:
# take an explicit copy so that columns can be added later without a SettingWithCopyWarning
english_clean_df = clean_df[clean_df.paper_language_abbreviation == 'e'].copy()

## Cleaning Data

Return a lower case version:

In [None]:
# lower-case every document with the vectorised string accessor
english_clean_df['processed_text'] = english_clean_df['raw_text'].str.lower()

## Exploring Data

We can generate an obligatory wordcloud for the whole document corpus as follows:

In [None]:
from wordcloud import WordCloud

# list of texts:
all_text = ','.join(list(english_clean_df['processed_text'].values))

# wordcloud object:
wordcloud = WordCloud(background_color='white', max_words=1000, contour_width=3, contour_color='steelblue')

# generate:
wordcloud.generate(all_text)

# viz:
wordcloud.to_image()

(This is, predictably, uninformative.)

We can, similarly, get a sense of the total number of papers per year:

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='meeting_year', data=english_clean_df, orient='h')

Filtering for working papers only is slightly more informative here:

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='meeting_year', data=english_clean_df[english_clean_df.paper_type_abbreviation == 'wp'], orient='h')

## Preparing for the LDA Model

We start by importing `nltk` and downloading its stopwords:

In [None]:
import nltk
nltk.download('stopwords')

We also need Levenshtein distance for later visualisation:

In [None]:
!pip install python-Levenshtein

Import the necessary pre-processing and model-building util wrappers from `gensim`:

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.models.wrappers import LdaMallet  # import the class, so the `ldamallet` variable used later does not shadow a module
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

Import `spacy` for lemmatization utils:

In [None]:
!pip install spacy

In [None]:
import spacy
import spacy.cli
spacy.cli.download("en_core_web_sm")

## Preprocessing for LDA

Build a list of all the document texts:

In [None]:
data = list(english_clean_df.processed_text)

Build trigram and bigram models from those texts:

In [None]:
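# Phrases expects tokenised documents (lists of tokens), not raw strings
data_words = [simple_preprocess(doc, deacc=True, min_len=3) for doc in data]

bigram = gensim.models.Phrases(data_words, min_count=20, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)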
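
To give a feel for what `min_count` and `threshold` control, here is a minimal pure-Python sketch of a collocation score of the kind phrase detection relies on. It assumes the scoring formula (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)), which is my understanding of the default scorer; the function name `phrase_scores` and the toy documents are invented for illustration and are not part of `gensim`.

```python
from collections import Counter

def phrase_scores(docs, min_count=1):
    """Score each adjacent word pair in tokenised documents (hypothetical helper)."""
    unigrams = Counter(word for doc in docs for word in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)
    # pairs that co-occur more often than their individual frequencies suggest score higher
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# toy corpus: each document is already a list of tokens
toy_docs = [
    ["antarctic", "treaty", "meeting"],
    ["antarctic", "treaty", "parties"],
    ["antarctic", "treaty", "system"],
    ["treaty", "meeting", "report"],
]
scores = phrase_scores(toy_docs, min_count=1)
print(sorted(scores.items(), key=lambda item: -item[1])[:2])
```

A pair is merged into a single token only when its score exceeds the threshold, which is why a high threshold such as 100 keeps only very strongly associated pairs.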

Define a preprocessing function:

In [None]:
# we only need the spacy tagger; there's no need for parser and named entity recognizer, for faster implementation
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# we load the english-language stopword corpus from nltk library
stop_words = nltk.corpus.stopwords.words('english')

def process_words(texts, stop_words=stop_words, allowed_tags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Convert a document into a list of lowercase tokens, build bigrams-trigrams, implement lemmatization"""
    # remove stopwords, short tokens and letter accents 
    texts = [[word for word in simple_preprocess(str(doc), deacc=True, min_len=3) if word not in stop_words] for doc in texts]
    
    # bi-gram and tri-gram implementation (apply the bigram model once, then the trigram model on its output)
    texts = [trigram_mod[bigram_mod[doc]] for doc in texts]
    
    texts_out = []
    
    # implement lemmatization and filter out unwanted part of speech tags
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_tags])
    
    # remove stopwords and short tokens again after lemmatization
    texts_out = [[word for word in simple_preprocess(" ".join(doc), deacc=True, min_len=3) if word not in stop_words] for doc in texts_out]
    
    return texts_out

With this defined, we can actually call the preprocessing function on our list of texts:

In [None]:
data_ready = process_words(data)
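
To make the first step of `process_words` more concrete, here is a rough standard-library approximation of the tokenising and filtering done there. It is only a sketch: `rough_preprocess` is a hypothetical name, the regular expression is my assumption about what counts as a token, and it does not attempt phrase detection or lemmatization.

```python
import re
import unicodedata

def rough_preprocess(text, stop_words=frozenset(), min_len=3, max_len=15):
    """Lowercase, strip accents, and keep alphabetic tokens of a sensible length."""
    # decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # runs of word characters that contain no digits
    tokens = re.findall(r"[^\W\d]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len and t not in stop_words]

example = rough_preprocess(
    "The Comité agreed to 3 new Measures on tourism.",
    stop_words={"the", "new"},
)
print(example)
```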

Build a full dictionary for the vocabulary of the corpus:

In [None]:
id2word = corpora.Dictionary(data_ready) 
print('Total Vocabulary Size:', len(id2word))

Note that this contains a large amount of repetition and noise, which we'll deal with below (via thresholds).

The corpus, implemented as a bag of words for each text in the above (preprocessed) data list:

In [None]:
corpus = [id2word.doc2bow(text) for text in data_ready]
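
For readers unfamiliar with the bag-of-words representation, the idea can be sketched without `gensim`. The helpers `build_vocab` and `to_bow` are hypothetical; ids here are assigned in order of first appearance, which need not match the ids that `corpora.Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from the vocabulary."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [["krill", "fishing", "krill"], ["tourism", "fishing"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only frequencies remain.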

Now we can view the words and their frequencies as a dataframe (as a sanity check):

In [None]:
dict_corpus = {}

# accumulate the total frequency of each word across all documents
for doc in corpus:
    for idx, freq in doc:
        dict_corpus[id2word[idx]] = dict_corpus.get(id2word[idx], 0) + freq

dict_df = pd.DataFrame.from_dict(dict_corpus, orient='index', columns=['freq'])

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(dict_df['freq'], bins=100)  # distplot is deprecated in recent seaborn releases

In [None]:
dict_df.sort_values('freq', ascending=False).head(10)

Filter out the extremely common and extremely uncommon words (those appearing in more than 50% of all documents, or in fewer than 10 documents):

In [None]:
id2word.filter_extremes(no_below=10, no_above=0.5)

In [None]:
corpus = [id2word.doc2bow(text) for text in data_ready]
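
The thresholding logic behind the filter above can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: `filter_by_doc_freq` is a hypothetical helper, and it ignores any further options the real method offers.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=10, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and at most a `no_above` fraction of them."""
    # document frequency: count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {token for token, n in doc_freq.items() if no_below <= n <= max_docs}

toy_docs = [
    ["treaty", "krill", "fishing"],
    ["treaty", "krill", "tourism"],
    ["treaty", "tourism", "station"],
    ["treaty", "inspection"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In the toy corpus, "treaty" is dropped for appearing in every document and the singletons are dropped for being too rare. Because the filter changes the token ids, the bag-of-words corpus has to be rebuilt afterwards, as done above.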

We've now got a much more believable vocab size:

In [None]:
print(len(id2word))


## LDA Model

Let's generate a MALLET LDA topic model with 100 topics and 500 iterations. **This takes quite a while**.

In [None]:
n_topics = 100
n_iterations = 500

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=n_topics, iterations=n_iterations, id2word=id2word)

We calculate a coherence score for the model:

In [None]:
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_ready, dictionary=id2word, coherence='c_v')

coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('Coherence Score: ', coherence_ldamallet)

Save progress to file:

In [None]:
import pickle
with open("../models/ldamallet.pkl", "wb") as f:
    pickle.dump(ldamallet, f)

Check topic distributions for the corpus: 

In [None]:
tm_results = ldamallet[corpus]

The most dominant topic of each document in the corpus:

In [None]:
corpus_topics = [sorted(topics, key=lambda record: -record[1])[0] for topics in tm_results]
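
The selection above can be illustrated on toy data. This sketch assumes each document's result is a list of (topic_id, weight) pairs, as in `tm_results`; `dominant_topic` is a hypothetical helper that uses `max` rather than a full sort.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(doc_topics, key=lambda pair: pair[1])

# two toy documents, each with a weight for three topics
toy_results = [
    [(0, 0.10), (1, 0.70), (2, 0.20)],
    [(0, 0.55), (1, 0.05), (2, 0.40)],
]
toy_dominant = [dominant_topic(doc) for doc in toy_results]
print(toy_dominant)
```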

The top 20 significant terms and their probabilities for each topic:

In [None]:
topics = [[(term, round(wt, 3)) for term, wt in ldamallet.show_topic(n, topn=20)] for n in range(0, ldamallet.num_topics)]

A term-topic matrix:

In [None]:
topics_df = pd.DataFrame([[term for term, wt in topic] for topic in topics], columns = ['Term'+str(i) for i in range(1, 21)], index=['Topic '+str(t) for t in range(1, ldamallet.num_topics+1)]).T

topics_df.head()

## Visualising the LDA Model

In [None]:
from gensim.models.ldamodel import LdaModel

def convertldaMalletToldaGen(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha) 
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

In [None]:
ldagensim = convertldaMalletToldaGen(ldamallet)

In [None]:
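import pyLDAvis  # needed for pyLDAvis.display below
import pyLDAvis.gensim as gensimvis  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models

vis_data = gensimvis.prepare(ldagensim, corpus, id2word, sort_topics=False)

pyLDAvis.display(vis_data)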

### Dominant Topics for Each Document

In [None]:
# create a dataframe
corpus_topic_df = pd.DataFrame()

# get the Titles from the original dataframe
corpus_topic_df['Title'] = english_clean_df['paper_title']

corpus_topic_df['Dominant Topic'] = [item[0]+1 for item in corpus_topics]

corpus_topic_df['Contribution %'] = [round(item[1]*100, 2) for item in corpus_topics]

# join the top terms of each document's dominant topic, using the `topics` list built earlier
corpus_topic_df['Topic Terms'] = [', '.join(term for term, wt in topics[t[0]]) for t in corpus_topics]

corpus_topic_df.head()

Now we can show the document counts for each topic and its percentage in the overall corpus:

In [None]:
dominant_topic_df = corpus_topic_df.groupby('Dominant Topic').agg(
                                  Doc_Count = ('Dominant Topic', np.size),
                                  Total_Docs_Perc = ('Dominant Topic', np.size)).reset_index()

dominant_topic_df['Total_Docs_Perc'] = dominant_topic_df['Total_Docs_Perc'].apply(lambda row: round((row*100) / len(corpus), 2))

dominant_topic_df
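
The same count-and-percentage summary can be expressed in a few lines of plain Python, which may help when checking the dataframe output by hand. The helper `topic_shares` and the toy list of dominant topics are illustrative only.

```python
from collections import Counter

def topic_shares(dominant_topics):
    """Map each topic to (document count, percentage of all documents)."""
    counts = Counter(dominant_topics)
    total = len(dominant_topics)
    return {topic: (n, round(n * 100 / total, 2)) for topic, n in sorted(counts.items())}

# dominant topic number for each of eight toy documents
toy_shares = topic_shares([3, 1, 3, 2, 3, 1, 3, 3])
print(toy_shares)
```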

And also which document makes the highest contribution to each topic:

In [None]:
corpus_topic_df.groupby('Dominant Topic').apply(lambda topic_set: (topic_set.sort_values(by=['Contribution %'], ascending=False).iloc[0])).reset_index(drop=True)

## Topics Over Time

The basic topic weights, using the `tm_results` object defined earlier:

In [None]:
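df_weights = pd.DataFrame.from_records([{topic: weight for topic, weight in row} for row in tm_results])
# order the columns by topic id and label them 1..n_topics, matching the numbering used above
df_weights = df_weights.reindex(columns=range(n_topics)).fillna(0)
df_weights.columns = ['Topic ' + str(i) for i in range(1, n_topics + 1)]
df_weights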

We can add the year column from the original dataframe:

In [None]:
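# use .values so the years are assigned by position rather than aligned on the (filtered) index
df_weights['Year'] = english_clean_df.meeting_year.values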

And now we can get an average of yearly topic weights:

In [None]:
df_weights.groupby('Year').mean()

In [None]:
df_weights['Dominant'] = df_weights.drop('Year', axis=1).idxmax(axis=1)
df_weights.head()

In [None]:
df_dominance = df_weights.groupby('Year')['Dominant'].value_counts(normalize=True).unstack()
df_dominance
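
What the grouped, normalised value counts compute can be spelled out in plain Python: for each year, the fraction of documents whose dominant topic is each topic. The helper name `dominance_by_year` and the toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict

def dominance_by_year(years, dominant_topics):
    """For each year, return the fraction of documents dominated by each topic."""
    per_year = defaultdict(Counter)
    for year, topic in zip(years, dominant_topics):
        per_year[year][topic] += 1
    return {
        year: {topic: n / sum(counts.values()) for topic, n in counts.items()}
        for year, counts in per_year.items()
    }

toy_years = [2001, 2001, 2001, 2001, 2002, 2002]
toy_topics = ["Topic 1", "Topic 1", "Topic 2", "Topic 1", "Topic 2", "Topic 2"]
shares = dominance_by_year(toy_years, toy_topics)
print(shares)
```

Topics that are never dominant in a given year are simply absent from that year's entry, which corresponds to the missing values in the unstacked dataframe.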

In [None]:
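# attach the meeting type by position (this assumes the source dataframe has a `meeting_type` column)
df_weights['meeting_type'] = english_clean_df['meeting_type'].values

df_meetings = df_weights.groupby(['meeting_type', 'Year'])['Dominant'].value_counts(normalize=True).unstack()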

df_meetings.head(15)

In [None]:
df_meetings.reset_index(inplace=True)

# melt every topic column that is present (topics that are never dominant have no column)
df_melted = df_meetings.melt(id_vars=['meeting_type', 'Year'], var_name='Topic', value_name='Prevalence')

df_melted

In [None]:
# create multiindex dataframe
df_meetings.set_index(['meeting_type', 'Year'], inplace=True)

# set the figure size
plt.rcParams['figure.figsize'] = [10, 6]

# loop over each meeting type
for j in df_meetings.index.levels[0]:
  
  # get cross-section and plot
  df_meetings.xs(j, level=0).plot.area()
  
  plt.title(j)
  plt.legend(loc='upper left')

plt.show()