# Topic Modelling

In this notebook, you'll do some topic modelling on a selection of 144 papers about the use of fire in the Paleolithic in Europe. We'll look for clusters of papers with a similar topic, to see whether there are any running themes in this research field.

As you might expect from the word 'clusters', topic modelling is an unsupervised learning method, so we don't have any labels for the texts, just the texts themselves. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for discovering hidden semantic structures in a body of text. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "pyramid" and "pharaoh" will appear more often in documents about Egyptian archaeology, "Stonehenge" and "Beaker Culture" will appear in documents about British archaeology, and "the" and "is" will appear approximately equally in both.
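
To make that intuition concrete, here is a minimal sketch that simply counts words in two tiny groups of documents. The sentences and the helper `word_counts` are made up for illustration and are not part of the fire-papers data; a real topic model does much more than counting, but it builds on this same signal.

```python
from collections import Counter

# Invented toy documents, only for illustration
egypt_docs = [
    "the pyramid of the pharaoh is old",
    "the pharaoh built the pyramid",
]
britain_docs = [
    "stonehenge is the oldest monument",
    "the beaker culture reached stonehenge",
]

def word_counts(docs):
    """Count how often each word occurs across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

egypt = word_counts(egypt_docs)
britain = word_counts(britain_docs)

# Topic-specific words show up in only one group, common words in both
for word in ["pharaoh", "stonehenge", "the"]:
    print(word, "| Egypt:", egypt[word], "| Britain:", britain[word])
```

Words like "pharaoh" separate the two groups, while "the" tells us nothing about the topic, which is why such words are removed as stop words later on.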

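We'll use scikit-learn's LDA (Latent Dirichlet Allocation) implementation to fit the model and the pyLDAvis package to visualise it. These do all the heavy lifting; we just need to preprocess the data and feed it into a function. But let's start with installing and importing some packages: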

In [None]:
# install packages, if needed
#!pip install git+https://github.com/bmabey/pyLDAvis.git
!pip install pandas wordcloud scikit-learn seaborn pyLDAvis nltk

In [None]:
# Importing modules
import pandas as pd
import os
import re
from IPython.display import HTML
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.decomposition import LatentDirichletAllocation as LDA
#from pyLDAvis import sklearn as sklearn_lda
import pyLDAvis
import pyLDAvis.lda_model
import pickle 
from sklearn.preprocessing import normalize
import operator
import nltk
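nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')      # tokenizer models needed by word_tokenize
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead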
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords

# set settings
warnings.simplefilter("ignore", DeprecationWarning)
sns.set_style('whitegrid')
%matplotlib inline


We'll also need some functions later on in this notebook, so let's define those here:

In [None]:
# set up some functions we'll use later

# Create and display a wordcloud based on text
def display_wordcloud(text):
    # Create a WordCloud object
    wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', width = 800, height = 600)
    
    # Generate a word cloud
    wordcloud.generate(text)

    # Set figure size
    plt.figure(figsize=(20, 15))
    
    # Visualize the word cloud
    plt.imshow(wordcloud) 

# Display the `number` most common words as a bar plot
def plot_most_common_words(count_data, count_vectorizer, number):
    words = count_vectorizer.get_feature_names_out()
    # sum each column of the sparse document-term matrix to get the count per word over all papers
    total_counts = np.asarray(count_data.sum(axis=0)).ravel()
    
    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x:x[1], reverse=True)[0:number]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words)) 
    
    plt.figure(2, figsize=(15, 15/1.6180))
    plt.subplot(title='Most common words')
    sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})
    sns.barplot(x=x_pos, y=counts, hue=x_pos, legend=False)
    plt.xticks(x_pos, words, rotation=90) 
    plt.xlabel('words')
    plt.ylabel('counts')
    plt.show()

# show topics
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        number = topic_idx+1
        print("\nTopic #%d:" % number)
        print(" ".join([words[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

## Load Data

OK, time to load the texts. They're in the data/fire-papers/ folder if you want to have a look at the data itself.

In [None]:
# set location of txt files
txt_path = 'data/fire-papers/'

# load the content of each paper, save in a dataframe
data = []
# only pick up .txt files, in a fixed (alphabetical) order
txtfiles = sorted(f for f in os.listdir(txt_path) if f.endswith('.txt'))
for txtfile in txtfiles:
    filelocation = os.path.join(txt_path, txtfile)
    # read the file, and close it again when we're done
    with open(filelocation, "r", encoding='utf-8') as f:
        txt = f.read().replace("ﬁ", "fi")  # replace joined 'fi' character with separate f and i
    filename = txtfile.replace('.txt', '')
    data.append([filename, txt])

papers = pd.DataFrame.from_records(data, columns=['filename', 'txt'])

# check what that looks like, the head() function shows the top 5 rows
papers.head()

Before we do anything, it's worth making a wordcloud showing the most common words. That way we can get a quick overview of the contents of the data.

In [None]:
# Join the texts of all papers together, and display wordcloud
all_text = ','.join(list(papers['txt'].values))
display_wordcloud(all_text)

## Preprocessing

Looking pretty good already, but let's do some basic preprocessing:

In [None]:
# Remove punctuation
papers['paper_text_processed'] = papers['txt'].map(lambda x: re.sub(r'[,.!?]', '', x))

# Remove numbers
papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: re.sub('[0-9]+', '', x))

# Convert to lowercase
papers['paper_text_processed'] = papers['paper_text_processed'].map(lambda x: x.lower())

# Print out the first 5 preprocessed texts
print(papers['paper_text_processed'].head())
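
To see what these three steps do to a single piece of text, here is a small sketch that applies the same regular expressions to one made-up sentence. The function name `basic_clean` and the sample sentence are only for illustration. The last step, collapsing repeated spaces, is an optional extra that the notebook itself does not do.

```python
import re

def basic_clean(text):
    """Apply the same three steps as above to a single string."""
    text = re.sub(r'[,.!?]', '', text)   # remove punctuation
    text = re.sub(r'[0-9]+', '', text)   # remove numbers
    return text.lower()                  # convert to lowercase

sample = "Hearths at Site 12, dated to 400.000 BP!"
cleaned = basic_clean(sample)
print(repr(cleaned))

# Removing numbers leaves double spaces behind; this optional step tidies them up
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

The double spaces are harmless for the word counts later on, because the tokenizer splits on any amount of whitespace.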


In [None]:
# Join the processed texts together, and display wordcloud again
processed_text = ','.join(list(papers['paper_text_processed'].values))
display_wordcloud(processed_text)


Looking better, but there are still some words in here that are pretty useless when trying to find sub-themes of fire research: 'et al' is super common, and 'fig' is not useful either. Let's delete these.

In [None]:
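# use word boundaries (\b) so that only the stand-alone words are removed, not parts of longer words such as 'figure'
papers['paper_text_processed'] = papers['paper_text_processed'].str.replace(r'\bet al\b', '', regex=True) # remove et al
papers['paper_text_processed'] = papers['paper_text_processed'].str.replace(r'\bfig\b', '', regex=True) # remove fig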


# Join the processed texts together, and display wordcloud again
processed_text = ','.join(list(papers['paper_text_processed'].values))
display_wordcloud(processed_text)
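
A word of caution when deleting words: a plain substring replacement also removes the letters when they occur inside longer words. The sketch below compares it with a regular expression that uses word boundaries (`\b`). The sample sentence and the helper `remove_words` are invented for illustration.

```python
import re

sample = "wet alluvial layers in figures and fig 3"

# Plain substring replacement: also hits 'figures' and 'wet alluvial'
naive = sample.replace('fig', '').replace('et al', '')
print(naive)

def remove_words(text, words):
    """Remove only stand-alone occurrences of each word or phrase."""
    for word in words:
        pattern = r'\b' + re.escape(word) + r'\b'
        text = re.sub(pattern, '', text)
    return text

careful = remove_words(sample, ['et al', 'fig'])
print(careful)
```

If you add your own words to delete, the word-boundary version is the safer choice for short strings that are likely to appear inside other words.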


Not bad! Do you see any other words that you think might not be useful for topic modelling? You can go back to the previous cell and add code to delete those words, if you wish.

Once you're done, continue with the next cell to set up a so-called 'vectoriser'. This turns each text into a vector of word counts, which can then be used for machine learning; this is the format that LDA expects. We also lemmatize the words and remove any remaining punctuation.

In [None]:
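# the wordcloud automatically filters out punctuation, but LDA doesn't, so let's filter those out as well
# (regex=True is needed: recent pandas versions treat the pattern as literal text by default)
papers['paper_text_processed'] = papers['paper_text_processed'].str.replace(r'[^\w\s]', '', regex=True)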

# set up lemmatizer
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles) if len(t) > 2 ]


# Initialise the count vectorizer with the English stop words
count_vectorizer = CountVectorizer(
    tokenizer=LemmaTokenizer(),
    stop_words=stopwords.words('english'),
)

# Fit and transform the processed texts
count_data = count_vectorizer.fit_transform(papers['paper_text_processed'])

# don't worry about the warning below

Now the words are counted, we can easily display the most common words:

In [None]:
# Visualise the most common words
plot_most_common_words(count_data, count_vectorizer,20)

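Looking good, but there are some words in there that do not make a lot of sense, for example 'wa' (which is what the lemmatizer turns 'was' into), 'also', and 'journal'. The standard English stop words are already being removed, so let's extend that list with some words of our own that we want to delete: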

In [None]:
# create custom stop words list, by starting with the default NLTK one, then adding our own terms
custom_stopwords = stopwords.words('english')
extra_stopwords = ['wa','also','journal']
custom_stopwords += extra_stopwords

# Initialise the count vectorizer with our custom stop words
count_vectorizer = CountVectorizer(
    tokenizer=LemmaTokenizer(),
    stop_words=custom_stopwords,
)

# Fit and transform the processed texts
count_data = count_vectorizer.fit_transform(papers['paper_text_processed'])

# Visualise the 20 most common words
plot_most_common_words(count_data, count_vectorizer, 20)


Do you see any other words that should be deleted? Go back to the previous cell and add more words to the extra_stopwords list. You can also change the number (20) when we call the function 'plot_most_common_words' if you want to see more words. Once you think the data is looking good, continue below.

## Train

Now we've done all the preprocessing, it's time to actually train the model!

In [None]:
# Set the parameters (come back here later!)
number_topics = 3
number_words = 30

# Create and fit the LDA model
lda = LDA(n_components=number_topics, n_jobs=-1)
lda.fit(count_data)

# Print the topics found by the LDA model
print("Topics found via LDA:")
print_topics(lda, count_vectorizer, number_words)
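
To get a feel for what the fitted model contains, here is a minimal sketch on a made-up toy corpus. It assumes a fixed `random_state` so the run is repeatable; the toy documents and the names `toy_lda`, `doc_topic`, and `topic_word` are only for illustration. With so little data the topics themselves are not meaningful, so the point is the shape of the output.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents, only for illustration
toy_docs = [
    "fire hearth charcoal ash fire",
    "hearth ash charcoal burning",
    "bone tool flint tool",
    "flint bone scraper tool",
    "charcoal fire burning hearth",
    "scraper flint tool bone",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

# random_state makes the result repeatable between runs
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per document, one column per topic; each row sums to 1
doc_topic = toy_lda.fit_transform(toy_counts)
print(np.round(doc_topic, 2))

# components_ holds unnormalised topic-word weights; dividing each row by
# its sum turns them into a probability distribution over words per topic
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)
print(topic_word.shape)
```

So every document is a mixture of topics, and every topic is a distribution over words. The `print_topics` function above shows the highest-weighted words of each topic.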


Ok, we've got 3 clusters as found by LDA, now comes the hard part: trying to think of a topic name for each of the clusters. By looking at the top words for each topic, can you define or guess what the theme is? 

To make this a bit easier and more intuitive, we can also create an interactive topic explorer:

In [None]:
# Export to interactive HTML file
os.makedirs('lda-output', exist_ok=True)  # create the output folder if it doesn't exist yet
LDAvis_prepared = pyLDAvis.lda_model.prepare(lda, count_data, count_vectorizer)
output_location = 'lda-output/ldavis_'+ str(number_topics) +'_topics.html'
pyLDAvis.save_html(LDAvis_prepared, output_location)


The HTML file is saved under lda-output/ldavis_3_topics.html, which you can open with your browser, but we can also display it within the Jupyter notebook:

In [None]:
# display interactive HTML
HTML(filename=output_location)

Now we don't just see the words per topic, but also how much each word contributes to a topic, and a cluster visualisation on the left. The size of each circle indicates how prevalent that topic is across the whole collection, and the distance between the circles indicates how different the topics are. So currently, with 3 topics, they're all fairly different.

Now comes the second hard part of topic modelling: figuring out the optimal number of topics. Go back to the cell where we set the number of topics, and change it from 3 to any other number. Play around and see if you can find a number where the topics are distinct, but where each topic doesn't contain too many different words/themes.
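
Besides looking at the topics yourself, you can also compare models numerically. The sketch below fits LDA with several topic counts on a made-up toy corpus and reports the perplexity of each model (lower is better). Treat this as one possible heuristic rather than the answer: the helper `perplexity_by_topic_count` and the toy documents are invented for illustration, perplexity measured on the training data tends to favour more topics, and a low score does not guarantee topics that make sense to a human reader.

```python
import math
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy documents, only for illustration
toy_docs = [
    "fire hearth charcoal ash fire",
    "hearth ash charcoal burning",
    "bone tool flint tool",
    "flint bone scraper tool",
    "charcoal fire burning hearth",
    "scraper flint tool bone",
    "cave sediment layer cave",
    "sediment layer deposit cave",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

def perplexity_by_topic_count(counts, candidates, seed=0):
    """Fit one LDA model per candidate topic count and return its perplexity."""
    scores = {}
    for k in candidates:
        model = LatentDirichletAllocation(n_components=k, random_state=seed)
        model.fit(counts)
        scores[k] = model.perplexity(counts)
    return scores

scores = perplexity_by_topic_count(toy_counts, candidates=[2, 3, 4])
for k, score in scores.items():
    print(f"{k} topics: perplexity {score:.1f}")
```

In this notebook you could pass `count_data` instead of `toy_counts`, ideally keeping some papers aside to measure perplexity on texts the model has not seen.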

## Export the topics

Optional: once you're happy with what the topics look like, we can export a CSV so we can see which topic has been assigned to each paper. Once you have run the cell below, you can find the file at lda-output/topics_per_paper.csv

In [None]:
# Create list of topics per document
# count_data already holds the word counts for every paper, so we reuse it
# instead of fitting the vectorizer again
doc_topic = lda.transform(count_data)

topics = []
for n in range(doc_topic.shape[0]):
    
    rounded = np.around(doc_topic[n], decimals=3)
    ordered = {}
    i = 1
    for j in rounded:
        ordered[i] = j
        i += 1
    ordered = sorted(ordered.items(), key=operator.itemgetter(1))
    ordered.reverse()
    
    doctopics = ''
    for k in ordered:
        if k[1] > 0.20:
            doctopics += '|{}|{}'.format(k[0],k[1])
            
    doctopics = doctopics[1:]
    topics.append(doctopics)
 

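# add topic numbers to df, leave out the text columns, save to csv
papers['topic_numbers'] = topics
os.makedirs('lda-output', exist_ok=True)  # create the output folder if it doesn't exist yet
# drop the text columns on a copy, so this cell can be re-run after changing the number of topics
papers.drop(columns=['paper_text_processed', 'txt']).to_csv('lda-output/topics_per_paper.csv')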
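
The values in the topic_numbers column have the form topic|probability, repeated for every topic with a probability above 0.20. As a sketch of the same idea in a more compact form, the snippet below formats a made-up document-topic matrix and also picks the single most likely topic per document. The function `format_topics` and the example numbers are only for illustration.

```python
import numpy as np

# Invented document-topic probabilities: 2 documents, 3 topics
toy_doc_topic = np.array([
    [0.75, 0.0, 0.25],
    [0.125, 0.875, 0.0],
])

def format_topics(row, threshold=0.20, decimals=3):
    """Return 'topic|probability' pairs above the threshold, most likely first."""
    rounded = np.around(row, decimals=decimals)
    order = np.argsort(-rounded, kind="stable")  # highest probability first
    parts = [f"{i + 1}|{float(rounded[i])}" for i in order if rounded[i] > threshold]
    return "|".join(parts)

formatted = [format_topics(row) for row in toy_doc_topic]
print(formatted)

# The single most likely topic per document (topics are numbered from 1)
dominant = (toy_doc_topic.argmax(axis=1) + 1).tolist()
print(dominant)
```

A single dominant topic is easier to work with in a spreadsheet, while the thresholded list keeps the information that a paper can belong to more than one topic.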

And that's all for this assignment!