# Week 4: Topic modelling nursery rhymes with Bag of Words features and Latent Dirichlet Allocation (LDA)

In this notebook we are going to look at how to perform topic modelling with Bag of Words as the input features. There is another notebook **very similar** to this one, except that it uses [**TF-IDF**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term Frequency - Inverse Document Frequency) as the features for topic modelling, along with a different topic modelling algorithm. Compare the results from this notebook to the TF-IDF one and see how the code and results differ.

First, let's do some imports:

In [None]:
import os
import nltk
import pandas as pd

from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Let's download a list of English stop words and the semantic word database [WordNet](https://wordnet.princeton.edu/), which we will use for lemmatisation. We also download the part-of-speech tagger that the next function relies on.

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

First we need to define this function which gets us the [Part of Speech tag](https://en.wikipedia.org/wiki/Part-of-speech_tagging) (POS), to tell us what type of word each word in our dataset is, such as whether a word is a [noun](https://www.merriam-webster.com/dictionary/noun), a [verb](https://www.merriam-webster.com/dictionary/verb), an [adjective](https://www.merriam-webster.com/dictionary/adjective) or an [adverb](https://www.merriam-webster.com/dictionary/adverb). There are other POS tags, but these are the four we need for the NLTK lemmatiser.

This will help us when we come to perform lemmatisation, as this gives us more context about each word and makes our lemmatisation algorithm more effective:

In [None]:
# Function originally from: https://www.programcreek.com/python/?CodeExample=get%20wordnet%20pos
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    
    # Return the tag, if the tag is not found return noun. 
    return tag_dict.get(tag, wordnet.NOUN)
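
To make the mapping inside `get_wordnet_pos` more concrete, here is a minimal sketch of what happens to a few tags. It assumes the tagger returns Penn Treebank style tags (such as `JJ` or `VBD`), and it hard-codes WordNet's single-letter POS constants so that it runs without any NLTK data. The helper name `penn_to_wordnet` and the list of tags are illustrative only and are not part of the notebook.

```python
# WordNet's POS constants are single-character strings; they are
# hard-coded here so the sketch runs without downloading any NLTK data.
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = "a", "n", "v", "r"

TAG_DICT = {"J": WORDNET_ADJ, "N": WORDNET_NOUN, "V": WORDNET_VERB, "R": WORDNET_ADV}

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return TAG_DICT.get(penn_tag[0].upper(), WORDNET_NOUN)

# A few hand-picked Penn Treebank tags (not real tagger output)
for penn_tag in ["JJ", "NNS", "VBD", "RB", "IN"]:
    print(penn_tag, "->", penn_to_wordnet(penn_tag))
```

Only the first letter of the tag is used, so every kind of verb tag (`VB`, `VBD`, `VBG`, ...) ends up as a verb, and anything outside the four groups (such as the preposition tag `IN`) falls back to noun.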

This function goes through every text document in a folder and performs lemmatisation on the contents:

In [None]:
def load_text_documents(folder_path):
    document_texts = []
    document_labels = []
    # Create the lemmatizer once and reuse it for every document
    lemmatizer = WordNetLemmatizer()

    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(".txt"):
                with open(os.path.join(root, file), 'r', encoding='utf-8') as f:
                    text = f.read()

                # Apply the lemmatizer to each word in the nursery rhyme
                lemmatized_text = " ".join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()])
                document_texts.append(lemmatized_text)
                # Use the file name without its .txt extension as the label
                document_labels.append(os.path.splitext(file)[0])

    return document_texts, document_labels

Put in the path for the nursery rhyme dataset and load in the documents:

<a id='load-data'></a>

In [None]:
folder_path = "../data/nursery-rhymes"
document_texts, document_labels = load_text_documents(folder_path)
print(f'loaded {len(document_labels)} documents')

Let's look at the first document and check that it has loaded correctly:

In [None]:
print(f'The first document is {document_labels[0]}, which goes:')
print(document_texts[0])

Now let's define our stop words. We are combining generic English stop words with stop words specific to our dataset of nursery rhymes (if you adapt this code to another dataset, **make sure to modify these stop words**):

In [None]:
english_stop_words = stopwords.words('english')
nursery_rhyme_stop_words = ['chorus', 'repeat', '3x', 'verse', 'version', 'versions', 'intro', 'finale', 'lyrics']
stop_words = english_stop_words + nursery_rhyme_stop_words
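
As a rough illustration of what stop word removal does, here is a small standard-library sketch. The stop word set and the example line are made up for the illustration and are much shorter than the real lists above. Note that `CountVectorizer` lowercases text by default before it removes stop words, so entries in the list are best written in lowercase.

```python
# Toy stop-word set standing in for the NLTK list plus the dataset-specific extras
toy_stop_words = {"the", "a", "on", "had", "chorus", "repeat"}

line = "Chorus: Humpty Dumpty sat on a wall, Humpty Dumpty had a great fall"

# Lowercase and strip surrounding punctuation before comparing against the set
tokens = [word.strip(".,:;!?").lower() for word in line.split()]
kept = [token for token in tokens if token not in toy_stop_words]
print(kept)
```

The words that carry little meaning on their own ("on", "a", "had") and the dataset-specific marker ("chorus") are dropped, leaving the words that are more likely to characterise a topic.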

Now let's use the `CountVectorizer` class to get our bag of words features for each document:

<a id='vectorizer'></a>

In [None]:
vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=(1,1))
bag_of_words = vectorizer.fit_transform(document_texts)
vocab = vectorizer.get_feature_names_out()
print(f'Our bag of words is a matrix of the shape and size {bag_of_words.shape}')
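
To see what a bag of words matrix contains, here is a minimal sketch that builds one by hand with the standard library. The two mini documents are invented for the example, and the sketch skips the lowercasing, tokenisation rules and stop word removal that `CountVectorizer` applies, so treat it as an approximation of the idea rather than a reimplementation.

```python
from collections import Counter

# Two made-up mini documents (not taken from the dataset)
toy_documents = {
    "rhyme_a": "star light star bright",
    "rhyme_b": "bright moon light",
}

# The vocabulary is every distinct word across all documents, sorted alphabetically
toy_vocab = sorted({word for text in toy_documents.values() for word in text.split()})

# Each document becomes one row of counts, with one column per vocabulary word
toy_bag_of_words = {
    label: [Counter(text.split())[word] for word in toy_vocab]
    for label, text in toy_documents.items()
}

print(toy_vocab)
for label, counts in toy_bag_of_words.items():
    print(label, counts)
```

Word order is discarded: each document is described only by how many times each vocabulary word occurs in it, which is why most entries in the full matrix are zero.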

Let's look at our bag of words feature matrix (i.e. a table) for all documents as a pandas dataframe:

In [None]:
bow_df = pd.DataFrame(bag_of_words.toarray(), columns=vocab, index=document_labels)
bow_df

Now let's look at the bag of words features for the first nursery rhyme. We will remove all of the words with zero counts to make it easier to read:

In [None]:
single_row_df = bow_df.iloc[0]
# Keep only the words that actually occur in this rhyme (count greater than zero)
single_row_df = single_row_df[single_row_df > 0]
single_row_df

Here we define the number of topics we are using. Come back to the following cell later on to change the number of topics:

<a id='num-topics'></a>

In [None]:
num_topics = 16
pd.options.display.max_columns=num_topics #Make sure we display them all
labels = ['topic{}'.format(i) for i in range(num_topics)]

Here we will define our Latent Dirichlet Allocation (LDA) algorithm:

In [None]:
lda = LatentDirichletAllocation(n_components=num_topics,random_state=123, learning_method='batch')

Now let's fit our LDA model to our data:

In [None]:
lda_topics = lda.fit_transform(bag_of_words)

Let's see some of the weightings between our topics and our words.

(Note that `lda.components_` is a 2D array in which each element is higher if the association between the given topic and word is stronger, and lower if the association is weaker. This is not a normalised probability distribution, so the elements within a topic won't sum to 1. Note also that `lda.components_` has topics in the rows and words in the columns, so we can use `.T` to get the "transposed" version in which rows and columns are swapped.)

In [None]:
topic_weights = pd.DataFrame(lda.components_.T, index=vocab, columns=labels)
topic_weights.sample(20)
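
If you would like each topic's weights as a probability distribution over words, one option is to divide every row by its own sum. The sketch below shows this on a small made-up array standing in for `lda.components_` (the name `toy_components` and its values are assumptions for illustration, not output from the model).

```python
import numpy as np

# Toy stand-in for lda.components_: 2 topics (rows) x 3 words (columns)
toy_components = np.array([[2.0, 1.0, 1.0],
                           [0.5, 0.5, 4.0]])

# Divide each row by its own total so every topic's weights sum to 1
toy_topic_word_probs = toy_components / toy_components.sum(axis=1, keepdims=True)

print(toy_topic_word_probs)
print(toy_topic_word_probs.sum(axis=1))
```

The same expression should work on the fitted model by swapping `toy_components` for `lda.components_`. The ranking of words within a topic is unchanged by this step, only the scale differs.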

And the most relevant words for each topic:

In [None]:
num_terms = 20
for i in range(num_topics):
    print(f"___topic {i}___")
    topic_name = f"topic{i}"
    # Sort ascending and keep the last num_terms entries, so the strongest word is printed last
    weighted_list = topic_weights[topic_name].sort_values()[-num_terms:]
    print(weighted_list.index.values)

And the association between our documents (individual nursery rhymes or other data samples) and our topics:

In [None]:
lda_topic_vectors_df = pd.DataFrame(lda_topics, index=document_labels, columns=labels)
lda_topic_vectors_df.sample(10)

And we can sort by importance for a particular topic.

Try changing the topic that you are sorting by and see if you can spot a correspondence between the most important words in the topic and the lyrics of the nursery rhyme:

In [None]:
lda_topic_vectors_df.sort_values(by=['topic1'], ascending=False)
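
Another way to read the document-topic table is to ask which single topic has the highest weight for each document. Here is a minimal sketch on a made-up array standing in for `lda_topics`; the values and the document names are hypothetical.

```python
import numpy as np

# Toy stand-in for lda_topics: 3 documents (rows) x 4 topics (columns)
toy_doc_topics = np.array([[0.70, 0.10, 0.10, 0.10],
                           [0.05, 0.05, 0.85, 0.05],
                           [0.25, 0.40, 0.25, 0.10]])
toy_labels = ["rhyme_a", "rhyme_b", "rhyme_c"]  # hypothetical document names

# argmax along each row gives the index of the highest-weighted topic
dominant_topics = toy_doc_topics.argmax(axis=1)

for label, topic_index in zip(toy_labels, dominant_topics):
    print(f"{label}: topic{topic_index}")
```

On the dataframe built above, `lda_topic_vectors_df.idxmax(axis=1)` should give the equivalent result as topic names. Bear in mind that a document such as `rhyme_c` can be spread fairly evenly across topics, so the dominant topic alone may hide part of the picture.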

## Tasks

**Task 1:** Compare this notebook to the TF-IDF + LSA topic modelling notebook. What differences do you see? Are the topics any better when using the other algorithm?

**Task 2:** Change the [number of topics](#num-topics). How does that affect the topics? Is using more or fewer topics better?

**Task 3:** Adjust the n-gram parameters [in the cell that defines the bag of words vectorizer](#vectorizer), i.e. set `ngram_range=(1,2)` if you want to include individual words and bi-grams, or `ngram_range=(2,3)` if you want to use bi-grams and tri-grams. How does that affect the topics?
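
If it is unclear what the two n-gram ranges in Task 3 produce, the following standard-library sketch lists the word n-grams for a single line. The helper `word_ngrams` is a hypothetical function written for this illustration; it mimics the idea behind the `ngram_range` parameter but not the full behaviour of `CountVectorizer`.

```python
def word_ngrams(text, ngram_range):
    """Return all word n-grams of text for n between the two ends of ngram_range, inclusive."""
    words = text.split()
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            ngrams.append(" ".join(words[start:start + n]))
    return ngrams

line = "twinkle twinkle little star"
print(word_ngrams(line, (1, 2)))  # single words and bi-grams
print(word_ngrams(line, (2, 3)))  # bi-grams and tri-grams
```

Each distinct n-gram becomes its own column in the feature matrix, so widening the range makes the vocabulary considerably larger.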

**Task 4:** Once you have done that, try loading in a different dataset and try out topic modelling on that. There is a [dataset of limericks](https://git.arts.ac.uk/tbroad/limerick-dataset), a [dataset of haikus](https://git.arts.ac.uk/tbroad/haiku-dataset), and a [dataset of EPL fan chants](https://git.arts.ac.uk/tbroad/SFW-EPL-fan-chants-dataset) (nursery rhymes for grown men) that have been created to be in the same format as the nursery rhymes dataset. Simply download them (unzip if you need to) and move the dataset folder into the folder `../data/my-data` and [edit the path](#load-data) for the new dataset. 