# Week 4: Topic modelling nursery rhymes with TF-IDF features and Latent Semantic Analysis (LSA)

In this notebook we are going to look at how to perform topic modelling with [**TF-IDF**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term Frequency - Inverse Document Frequency) as the input features and Latent Semantic Analysis (LSA) as the algorithm. There is another notebook **very similar** to this one, except that it uses Bag of Words (BoW) features and the LDA algorithm. Compare the results from this notebook to the BoW one and see how the code and results differ.

First, let's do some imports:

In [1]:
import os
import nltk
import pandas as pd

from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


Let's download a list of English stop words, the semantic word database [WordNet](https://wordnet.princeton.edu/) that we will use for lemmatisation, and the tagger that we will use for part-of-speech tagging.

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ROG\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ROG\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ROG\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

First we need to define this function which gets us the [Part of Speech tag](https://en.wikipedia.org/wiki/Part-of-speech_tagging) (POS), to tell us what type of word each word in our dataset is, such as whether a word is a [noun](https://www.merriam-webster.com/dictionary/noun), a [verb](https://www.merriam-webster.com/dictionary/verb), an [adjective](https://www.merriam-webster.com/dictionary/adjective) or an [adverb](https://www.merriam-webster.com/dictionary/adverb). There are other POS tags, but these are the four we need for the NLTK lemmatiser.

This will help us when we come to perform lemmatisation, as the POS tag gives the lemmatiser more context about each word and makes it more effective:

In [3]:
# Function originally from: https://www.programcreek.com/python/?CodeExample=get%20wordnet%20pos
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
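
To make the mapping in `get_wordnet_pos` concrete, here is a minimal sketch that does the same first-letter lookup on Penn Treebank tags without calling the tagger. The helper name `penn_to_wordnet` and the sample tags are illustrative assumptions; the single-letter values are the ones held by NLTK's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV` constants.

```python
# Hypothetical helper: maps a Penn Treebank tag (as returned by nltk.pos_tag)
# to the single-letter POS code that WordNetLemmatizer.lemmatize expects.
WORDNET_POS = {"J": "a",  # adjective (wordnet.ADJ)
               "N": "n",  # noun      (wordnet.NOUN)
               "V": "v",  # verb      (wordnet.VERB)
               "R": "r"}  # adverb    (wordnet.ADV)

def penn_to_wordnet(penn_tag):
    # Only the first letter matters: "VBG", "VBD" and "VB" are all verbs
    return WORDNET_POS.get(penn_tag[0].upper(), "n")  # default to noun

for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Any tag outside the four groups (such as the determiner tag `DT`) falls back to noun, which mirrors the default in the function above.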

This function goes through every text document in a folder and performs lemmatisation on the contents:

In [4]:
def load_text_documents(folder_path):
    document_texts = []
    document_labels = []

    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(".txt"):
                with open(os.path.join(root, file), 'r', encoding='utf-8') as f:
                    text = f.read()
                
                lemmatizer = WordNetLemmatizer()
                # Lemmatise each whitespace-separated word using its POS tag
                lemmatized_text = " ".join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in text.split()])
                document_texts.append(lemmatized_text)
                # Use the file name without its ".txt" extension as the label
                document_labels.append(os.path.basename(file[:-4]))
    
    return document_texts, document_labels

Put in the path for the nursery rhyme dataset and load in the documents:

<a id='load-data'></a>

In [5]:
# folder_path = r"D:\OneDrive - University of the Arts London\NLP-23-24\data\test"
folder_path = "../data/haikus"
document_texts, document_labels = load_text_documents(folder_path)
print(f'loaded {len(document_labels)} documents')


Let's look at the first document and check that it has loaded correctly:

In [None]:
print(f'The first document is {document_labels[0]}, which goes:')
print(document_texts[0])

The first document is , which goes:
Age how sure!


Now let's define our stop words. We combine generic English stop words with stop words specific to our dataset of nursery rhymes (if you adapt this code to another dataset, **make sure to modify these stop words**):

In [None]:
english_stop_words = stopwords.words('english')
nursery_rhyme_stop_words = ['chorus', 'repeat', '3x', 'verse', 'version', 'versions', 'intro', 'finale', 'lyrics']
stop_words = english_stop_words + nursery_rhyme_stop_words
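
As a rough illustration of what the combined list does, the sketch below filters a made-up line of lyrics in plain Python. The short `generic_stop_words` list is a stand-in for NLTK's much longer English list, and the sample line is invented; `TfidfVectorizer` applies the same idea internally after lower-casing the text.

```python
# Stand-in for stopwords.words('english'), shortened for illustration
generic_stop_words = ["the", "a", "and", "of", "to", "in"]
# Dataset-specific markers such as "chorus" say nothing about a rhyme's topic
dataset_stop_words = ["chorus", "repeat", "verse"]
stop_words_demo = set(generic_stop_words + dataset_stop_words)

line = "Chorus repeat the wheels of the bus go round and round"
# Lower-case first so that "Chorus" matches the stop word "chorus"
kept = [w for w in line.lower().split() if w not in stop_words_demo]
print(kept)
```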

Now let's use the `TfidfVectorizer` class to get our TF-IDF features for each document:

<a id='vectorizer'></a>

In [None]:
vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1,1))
tf_idf = vectorizer.fit_transform(document_texts)
vocab = vectorizer.get_feature_names_out()
print(f'Our TF-IDF matrix has the shape {tf_idf.shape}')

Our TF-IDF matrix has the shape (27081, 19256)
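
To see where the numbers in this matrix come from, here is a minimal NumPy sketch of the weighting that `TfidfVectorizer` uses with its default settings (smoothed IDF followed by L2 normalisation of each row). The three-document count matrix is a made-up example.

```python
import numpy as np

# Toy term counts: 3 documents (rows) x 3 terms (columns), invented for illustration
counts = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [0, 2, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed inverse document frequency
weighted = counts * idf                            # term frequency x IDF
# Scale each row to unit length so long and short documents are comparable
tfidf_demo = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf_demo, 3))
```

The middle term appears in every document, so it gets the lowest IDF and the rarer terms dominate the rows they appear in.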


Let's look at our TF-IDF feature matrix (i.e. a table) for all documents as a pandas dataframe:

In [None]:
tfidf_df = pd.DataFrame(tf_idf.toarray(), columns=vocab, index=document_labels)
tfidf_df

Unnamed: 0,00,01,10,100,11,...,zucchini,zuckerberg,zuleika,ēn,ēng
,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
1-11-11,0.0,0.0,0.00000,0.00000,0.819225,...,0.0,0.0,0.0,0.0,0.0
1-a-yellow-band-of-light-upon-the-street,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
10-my-ship-has-tasted,0.0,0.0,0.37681,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
100-degree-heat,0.0,0.0,0.00000,0.44783,0.000000,...,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
gives-hope-to-the-valiant-and-promise-of-war,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
gives-them-extra-time,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
gives-too-late-whats-not-believed,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0
gives-too-soon-into-weak-hands-whats,0.0,0.0,0.00000,0.00000,0.000000,...,0.0,0.0,0.0,0.0,0.0


Now let's look at the TF-IDF features for the first nursery rhyme. We will remove all of the words with a weight of zero to make it easier to make sense of:

In [None]:
single_row_df = tfidf_df.iloc[0]
single_row_df = single_row_df.replace(0.0,None)
single_row_df = single_row_df.dropna()
single_row_df

age     0.730377
sure    0.683045
Name: , dtype: object
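
An alternative way to drop the zeros, sketched here on a small made-up Series, is a boolean mask. It keeps the values as floats rather than converting them to `object`, and sidesteps any ambiguity about how `replace` handles `None`. The name `row_demo` and the zero-weight terms are illustrative.

```python
import pandas as pd

# Toy stand-in for one row of the TF-IDF dataframe
row_demo = pd.Series({"age": 0.730377, "moon": 0.0, "sure": 0.683045, "zebra": 0.0})

# Keep only the terms with a non-zero weight
non_zero = row_demo[row_demo != 0]
print(non_zero)
```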

### Implementing LSA (TruncatedSVD)

Subtract the mean of each column (i.e. each term) from every value in that column of the matrix/dataframe/table:

In [None]:
tfidf_df = tfidf_df - tfidf_df.mean()
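
Because `tfidf_df.mean()` returns one mean per column, the subtraction above centres each term (column) on zero. A minimal sketch on a made-up two-term table shows the effect; `demo_df` and its values are illustrative.

```python
import pandas as pd

# Toy TF-IDF-like table: 3 documents x 2 terms, invented for illustration
demo_df = pd.DataFrame({"star": [0.2, 0.4, 0.6], "moon": [0.0, 0.0, 0.9]},
                       index=["doc_a", "doc_b", "doc_c"])

centred = demo_df - demo_df.mean()   # broadcasts the column means across rows
print(centred)
print(centred.mean().round(10))      # each column now averages to (approximately) zero
```

Centring replaces most zeros with small negative numbers, so the centred table is dense even though the original TF-IDF matrix was mostly zeros.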

You can set the number of topics you want in the following cell:

<a id='num-topics'></a>

In [None]:
num_topics = 11
pd.options.display.max_columns=num_topics #Make sure we display them all
labels = ['topic{}'.format(i) for i in range(num_topics)] 

Now let's calculate our topics using the [TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) (aka LSA) algorithm from scikit-learn:

In [None]:
svd = TruncatedSVD(n_components = num_topics, n_iter = 11) #You can change n_iter: Higher numbers will take longer but may (or may not) give you better results
svd_topic_vectors = svd.fit_transform(tfidf_df.values)
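
For intuition about what `TruncatedSVD` returns, the following NumPy sketch performs a full SVD on a tiny made-up matrix and keeps the first `k` components. It is an illustration of the underlying factorisation, not scikit-learn's implementation (which uses a randomised solver by default), and the signs of individual components may differ between the two.

```python
import numpy as np

# Toy document-term matrix: 4 documents x 3 terms, invented for illustration
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.8, 1.2]])

k = 2  # number of topics to keep
U, S, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :k] * S[:k]   # analogous to svd.fit_transform(X): documents x topics
topic_term = Vt[:k]            # analogous to svd.components_: topics x terms

print(doc_topic.shape, topic_term.shape)
```

Each row of `doc_topic` is a document expressed in topic space, and each row of `topic_term` gives one topic's weighting over the vocabulary.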

Let's see some of the weightings between our topics and our words:

In [None]:

topic_weights = pd.DataFrame(svd.components_.T, index=vocab, columns=labels)
topic_weights.sample(20)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
low,-0.003822,-0.002775,-6.9e-05,-0.001013,0.000996,-6e-05,0.002346,-0.004309417,-0.004937,0.003925,0.001115
overlord,-9.9e-05,-9.3e-05,3e-06,-0.000108,1.8e-05,6.6e-05,2.4e-05,-8.346027e-05,-5.8e-05,-5.9e-05,4.4e-05
raw,-0.00078,-0.00054,0.000141,-0.00067,5.8e-05,5.8e-05,-0.000196,-0.0006514606,-0.000274,0.001178,-0.000904
gasworks,0.000151,0.000628,-0.00109,-0.000324,-0.000173,-0.00012,-0.000271,7.763137e-05,-0.000321,-5.5e-05,-0.000175
corners,-0.000281,-0.000278,-8.2e-05,-0.000155,-0.000117,-7.7e-05,-0.000143,-1.449977e-05,-0.000113,3.7e-05,-4.7e-05
nemnan,-0.00012,-0.000104,-1.8e-05,-0.000102,-6e-06,4e-06,-1e-05,-5.003342e-06,-2.1e-05,1e-06,-5.4e-05
precise,-0.000249,-0.000243,-7.3e-05,-0.00014,-3.1e-05,-2.2e-05,-5.4e-05,-3.874976e-06,-0.000111,4.3e-05,-4.7e-05
blatantly,-0.00015,-0.000133,4.8e-05,-0.000191,0.000109,7.8e-05,-3.6e-05,2.648841e-05,3.5e-05,-3.2e-05,-2.6e-05
lightly,-0.000544,-0.000523,-6.5e-05,-0.000371,-3.9e-05,7.8e-05,-0.000146,3.713471e-05,-0.00027,-1e-06,-5.7e-05
slider,-0.00016,-0.000101,-0.000245,0.000871,-0.000353,-0.000125,-0.000558,-0.0004745958,0.00048,2.9e-05,4.9e-05


And the most relevant words for each topic:

In [None]:
num_terms = 20
for i in range(num_topics):
    print("___topic " + str(i) + "___")
    topicName = "topic" + str(i)
    weightedlist = topic_weights.get(topicName).sort_values()[-num_terms:]
    print(weightedlist.index.values)

___topic 0___
['wanna' 'way' 'one' 'people' 'make' 'someone' 'let' 'thing' 'everyone'
 'time' 'ever' 'feel' 'back' 'want' 'know' 'shit' 'fuck' 'like' 'go' 'get']
___topic 1___
['see' 'back' 'someone' 'ever' 'really' 'say' 'never' 'thing' 'always'
 'look' 'want' 'one' 'people' 'even' 'make' 'love' 'feel' 'know' 'like'
 'go']
___topic 2___
['good' 'thing' 'best' 'everyone' 'person' 'shit' 'bad' 'much' 'say'
 'always' 'really' 'ever' 'fuck' 'even' 'people' 'look' 'love' 'make'
 'feel' 'like']
___topic 3___
['ever' 'great' 'last' 'long' 'night' 'well' 'happy' 'feel' 'best' 'thing'
 'another' 'year' 'first' 'good' 'make' 'every' 'time' 'love' 'one' 'day']
___topic 4___
['think' 'well' 'thing' 'need' 'people' 'person' 'get' 'everyone' 'tell'
 'let' 'want' 'give' 'someone' 'say' 'always' 'never' 'even' 'much' 'know'
 'love']
___topic 5___
['cry' 'decision' 'way' 'happen' 'hurt' 'life' 'need' 'friend' 'sure'
 'good' 'know' 'bad' 'someone' 'really' 'well' 'happy' 'people' 'thing'
 'want' 'make'

And the association between our documents (individual nursery rhymes or other data samples) and our topics:

In [None]:
svd_topic_vectors_df = pd.DataFrame(svd_topic_vectors, index=document_labels, columns=labels)
svd_topic_vectors_df.sample(10)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
funny-how-i-use,-0.028158,-0.022191,0.002462,-0.022046,-0.001228,0.011051,0.01190752,-0.004582,0.005543,-0.007585,-0.008984
early-storm-warning,-0.029878,-0.029482,-0.009811,-0.017737,-0.007365,-0.002144,-0.0111006,-0.000906,-0.008606,0.001513,-0.003393
death,-0.037291,-0.031628,0.003428,-0.016238,0.000583,0.008248,0.003574383,-0.014906,-0.025326,-0.002405,0.004059
bringing,-0.047897,-0.034436,-0.008644,-0.020683,-0.003317,-0.012378,-0.02091547,-0.000568,-0.017016,0.001154,-0.005063
currently-using,-0.025684,-0.025513,-0.00179,-0.022443,-0.003708,-0.000975,6.374709e-07,-0.002958,-0.000923,-0.000617,-0.002917
cant-wait-to-bombard,-0.000959,0.035453,-0.007797,0.197418,-0.036704,0.002897,0.07349593,-0.134875,0.032211,-0.073422,-0.051058
bad-decisions-by,-0.02482,-0.018542,0.002438,-0.006362,-0.008628,0.013516,0.01003601,-0.011656,-0.008455,0.001252,-0.015133
accept-the-fact-that,-0.011093,0.007285,0.029115,-0.037244,0.011337,0.03631,0.04507096,-0.011213,0.087117,-0.037293,-0.084415
fake-or-genuine,-0.029604,-0.020347,-0.010615,-0.007916,-0.007046,-0.002101,-6.563337e-05,0.00525,-0.002543,0.008792,-0.005022
33-her-knittingneedles-clicked-and,-0.02557,-0.027297,-0.019508,0.031678,-0.025003,-0.036804,0.06669356,0.111116,0.027868,0.04559,-0.023453


And we can sort by importance for a particular topic. 

Try changing the topic that you are sorting by and see whether the most important words in the topic correspond to the lyrics of the highest-ranked nursery rhymes:

In [None]:
svd_topic_vectors_df.sort_values(by=['topic1'], ascending=False)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,topic10
aint-no-telling-when,0.194233,0.455978,-0.262186,-0.078836,-0.187010,-0.175662,0.040292,0.258332,-0.064488,0.024369,-0.076291
any-else-feel-like,0.148987,0.422396,0.087429,-0.086950,-0.193809,-0.113413,-0.119443,-0.031161,-0.134578,0.212705,-0.142474
aint-even-going,0.151649,0.413920,-0.335780,-0.142833,0.068669,0.000991,-0.002969,-0.128330,-0.004203,0.288331,-0.063719
everyone-is-not,0.172524,0.408512,0.046856,0.073188,-0.141290,0.169900,0.043685,0.019006,-0.290729,-0.106959,0.004850
as-mondays-go-feels,0.152911,0.398118,-0.117758,-0.012031,-0.171491,-0.092208,0.022053,-0.021861,-0.248067,-0.061756,-0.036767
...,...,...,...,...,...,...,...,...,...,...,...
finally-got-you,0.469727,-0.244443,-0.008688,-0.011306,0.009786,-0.016822,-0.063355,-0.020652,-0.065195,0.089631,-0.017027
after-that-i-pray,0.429810,-0.250555,-0.019274,0.026607,-0.007605,-0.036482,-0.061203,-0.000112,-0.045355,0.053723,0.003384
cant-get-enough-please,0.548134,-0.260125,0.025948,-0.015257,0.038900,0.005578,-0.040488,0.000605,-0.058717,0.053853,-0.015599
cause-now-im-getting,0.556160,-0.290220,0.021855,-0.011897,0.030056,-0.005711,-0.051629,-0.013786,-0.056356,0.054215,0.008182
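
Another way to read this table, sketched below on made-up numbers, is to ask which topic each document loads on most strongly. `idxmax` along the columns returns that topic's label; `demo_vectors` is an illustrative stand-in for `svd_topic_vectors_df`.

```python
import pandas as pd

# Toy document-by-topic scores, invented for illustration
demo_vectors = pd.DataFrame(
    {"topic0": [0.19, 0.55, -0.03],
     "topic1": [0.46, -0.29, -0.02],
     "topic2": [-0.26, 0.02, 0.11]},
    index=["doc_a", "doc_b", "doc_c"])

# Label of the highest-scoring topic for each document
dominant = demo_vectors.idxmax(axis=1)
print(dominant)

# Documents most associated with topic1, strongest first
ranked = demo_vectors.sort_values(by="topic1", ascending=False).index.tolist()
print(ranked)
```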


## Tasks

**Task 1:** Compare this notebook to the Bag of Words + LDA topic modelling notebook. What differences do you see? Are the topics any better or more intelligible using this notebook?

**Task 2:** Change the [number of topics](#num-topics) generated by the topic modelling algorithm. How does that affect the topics? Is using more or fewer topics better?

**Task 3:** Adjust the n-gram parameters [in the cell that defines the TF-IDF vectorizer](#vectorizer), i.e. make the range `1,2` if you want to include individual words and bi-grams, or `2,3` if you want to use bi-grams and tri-grams. How does that affect the topics?

**Task 4:** Once you have done that, try loading in a different dataset and try out topic modelling on that. There is a [dataset of limericks](https://git.arts.ac.uk/tbroad/limerick-dataset), a [dataset of haikus](https://git.arts.ac.uk/tbroad/haiku-dataset), and a [dataset of EPL fan chants](https://git.arts.ac.uk/tbroad/SFW-EPL-fan-chants-dataset) (nursery rhymes for grown men) that have been created to be in the same format as the nursery rhymes dataset. Simply download them (unzip if you need to) and move the dataset folder into the folder `../data/my-data` and [edit the path](#load-data) for the new dataset. 
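
For Task 3, it can help to see what an n-gram range produces before changing the vectorizer. This plain-Python sketch lists the word n-grams for a made-up line; the helper `word_ngrams` is hypothetical and only approximates what `TfidfVectorizer` does with `ngram_range` (it skips tokenisation details and stop-word removal).

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "twinkle twinkle little star".split()
print(word_ngrams(tokens, 1, 2))  # unigrams and bi-grams, like ngram_range=(1,2)
print(word_ngrams(tokens, 2, 3))  # bi-grams and tri-grams, like ngram_range=(2,3)
```

Wider ranges enlarge the vocabulary quickly, so expect the TF-IDF matrix to gain many more columns.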