# Topic Modeling using LDA and TF-IDF Models

In this project, we use a Latent Dirichlet Allocation (LDA) model to derive clusters of topics from song lyrics by the pop artist Khalid. LDA treats each document as a mixture of topics and each topic as a distribution over terms, so terms that tend to co-occur across documents end up grouped together. These groups, called topics, can give us insight into the overarching themes present in the lyrics. We then visualize the topics and examine how they relate to or differ from each other across Khalid's songs.

Additionally, we employ a Term Frequency - Inverse Document Frequency (TF-IDF) model to see how it compares with the LDA model trained on raw counts. TF-IDF scores the relevance of a word to its document: the weight increases with how often the word appears in that document and is offset by how many documents in the corpus contain the word.
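
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand on a toy corpus. It assumes the common formulation weight = tf × log2(N / df); the function name `tfidf_weights` and the toy documents are illustrative only, and a given library may apply different smoothing or normalization.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formulation: weight = tf * log2(N / df), where tf is the raw
    count in the document, N the number of documents, and df the number
    of documents that contain the term.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log2(n_docs / df[t]) for t, c in tf.items()})
    return weights

# Toy corpus (made up for illustration, not taken from the dataset).
toy_docs = [
    ["love", "love", "night"],
    ["love", "talk"],
    ["talk", "talk", "city"],
]
toy_weights = tfidf_weights(toy_docs)
print(toy_weights[0])
```

In the first toy document, "night" receives a higher weight than "love" even though "love" occurs twice, because "love" also appears in another document while "night" does not.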

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import nltk

In [2]:
df = pd.read_csv("Khalid.csv")

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Artist | Title | Album | Year | Date | Lyric |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Khalid | Young Dumb & Broke | American Teen | 2017.0 | 2017-03-03 | so you're still thinking of me just like i kno... |
| 1 | 1 | Khalid | Location | American Teen | 2016.0 | 2016-04-30 | send me your location let's focus on communica... |
| 2 | 2 | Khalid | Better | Suncity | 2018.0 | 2018-09-14 | better nothing baby nothing feels better i'm n... |
| 3 | 3 | Khalid | Talk | Free Spirit | 2019.0 | 2019-02-07 | can we just talk can we just talk talk about w... |
| 4 | 4 | Khalid | Saved | American Teen | 2017.0 | 2017-01-13 | 4 the hard part always seems to last forever... |


In [4]:
lyr = df['Lyric'].tolist()

## Cleaning and Preprocessing

We extend NLTK's default stopword list with terms that appear frequently in Khalid's songs but carry no thematic significance.

In [6]:
stopwords1 = nltk.corpus.stopwords.words('english')
newStopWords1 = ['i\'m','i\'ll','yeah','like','let','ooh','oh','keep','gotta','need','know','get','way','bring','we\'re','say','i\'ve']
stopwords1.extend(newStopWords1)

In [7]:
stop = stopwords1
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [8]:
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

In [31]:
doc_clean = [clean(doc).split() for doc in lyr]
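
The order of operations in `clean` matters: stopwords are matched against whitespace-split tokens before punctuation is stripped. That is why contractions such as `i'm` must appear in the stopword list with their apostrophe, and it also means a stopword with punctuation attached can slip through. The sketch below illustrates this with a small stand-in stopword set; `STOP` and `clean_tokens` are hypothetical names, and lemmatization is omitted so the example runs without NLTK data.

```python
import string

# Small stand-in stopword set (assumption: the notebook uses NLTK's full list).
STOP = {"i'm", "yeah", "the", "a", "me"}
PUNCT = set(string.punctuation)

def clean_tokens(doc, stop=STOP):
    """Mirror the notebook's order: drop stopwords, then strip punctuation."""
    kept = [w for w in doc.lower().split() if w not in stop]
    stripped = "".join(ch for ch in " ".join(kept) if ch not in PUNCT)
    return stripped.split()

# "i'm", "the" and the bare "yeah" are removed as expected.
print(clean_tokens("I'm sending the location, yeah"))
# "yeah," (with a comma) does not match "yeah", so it survives the filter.
print(clean_tokens("Yeah, I'm here"))
```

The same effect is possibly why a token such as "let" can still show up in the topics even though it is in the stopword list: "let's" is not matched as a stopword and is only reduced afterwards.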

## Preparing a Document-Term Matrix

We convert the text corpus into a document-term matrix via the Gensim package so that we can run the LDA model on it.

In [10]:
import gensim
from gensim import corpora

In [11]:
dictionary = corpora.Dictionary(doc_clean)

In [12]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
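
To show what the document-term matrix looks like, here is a plain-Python sketch of the bag-of-words representation. It is a stand-in, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the integer ids Gensim assigns may differ from the ones produced here. The shape of the result, a list of (token id, count) pairs per document, is the point.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())

# Toy tokenized documents (made up for illustration).
toy_docs = [["talk", "talk", "love"], ["love", "night"]]
vocab = build_vocab(toy_docs)
bow = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)  # {'talk': 0, 'love': 1, 'night': 2}
print(bow)    # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```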

In [13]:
Lda = gensim.models.ldamodel.LdaModel

In [14]:
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word = dictionary, passes=50)

Printing the topics found by the LDA model shows that Khalid's lyrics typically revolve around the emotions he feels, with terms such as "feel", "nothing", and "love". It is difficult to discern more specific topics because many of the terms are vague, which may be remedied by extending the stop-word list.

In [15]:
print(ldamodel.print_topics(num_topics=5, num_words=5))

[(0, '0.059*"young" + 0.042*"dumb" + 0.028*"broke" + 0.013*"high" + 0.013*"nowhere"'), (1, '0.028*"reason" + 0.024*"talk" + 0.021*"give" + 0.019*"turning" + 0.017*"go"'), (2, '0.022*"day" + 0.019*"llévame" + 0.016*"got" + 0.012*"put" + 0.012*"khalid"'), (3, '0.036*"feel" + 0.035*"nothing" + 0.032*"better" + 0.021*"love" + 0.014*"come"'), (4, '0.022*"back" + 0.020*"right" + 0.020*"love" + 0.016*"promise" + 0.013*"usin"')]
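
The printed topics are plain strings, which makes them awkward to compare or plot. As a small sketch, the helper below (a hypothetical `parse_topic`, not part of Gensim) turns one topic string into (term, weight) pairs. Gensim's `show_topics(formatted=False)` is meant to return such pairs directly, so the parser is mainly useful when only the printed text is available.

```python
import re

# Matches one weight*"term" pair in a printed topic string.
PAIR = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_str):
    """Turn '0.059*"young" + 0.042*"dumb"' into [('young', 0.059), ('dumb', 0.042)]."""
    return [(term, float(weight)) for weight, term in PAIR.findall(topic_str)]

# Topic 0 copied from the printed output above.
topic0 = '0.059*"young" + 0.042*"dumb" + 0.028*"broke" + 0.013*"high" + 0.013*"nowhere"'
parsed = parse_topic(topic0)
print(parsed)
```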


## Visualizing an LDA and TF-IDF Model

We use the pyLDAvis module to visualize both the LDA model and the TF-IDF-weighted LDA model, showing the topic clusters along with the term frequencies in each cluster.

In [33]:
import pyLDAvis
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # must be called; without parentheses the function is only referenced, not run

In [34]:
ldavis = pyLDAvis.gensim_models.prepare(ldamodel,doc_term_matrix, dictionary)
ldavis


In [19]:
from gensim.models import TfidfModel

In [22]:
tfidf = gensim.models.TfidfModel(doc_term_matrix)
corpus_tfidf = tfidf[doc_term_matrix]

In [23]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=4, id2word=dictionary, passes=2, workers=4)

In [30]:
print(lda_model_tfidf.print_topics(num_topics=4, num_words=5))  # the model was trained with 4 topics

[(0, '0.002*"angel" + 0.002*"nothin" + 0.002*"else" + 0.002*"dododo" + 0.002*"spirit"'), (1, '0.002*"reason" + 0.002*"llévame" + 0.002*"knocked" + 0.002*"wildflower" + 0.002*"let"'), (2, '0.004*"nothing" + 0.003*"better" + 0.003*"one" + 0.003*"feel" + 0.003*"released"'), (3, '0.004*"dumb" + 0.004*"young" + 0.003*"alive" + 0.002*"broke" + 0.002*"turning"')]
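
One rough way to compare the two models is to measure how much their top terms overlap. The sketch below uses the Jaccard index on the top-5 terms copied from the two printed outputs. This is an illustrative comparison chosen here, not something the notebook itself computes, and five terms per topic is a very small sample.

```python
def jaccard(a, b):
    """Share of distinct terms that two term lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Top-5 terms copied from the two printed outputs.
lda_topics = {
    0: ["young", "dumb", "broke", "high", "nowhere"],
    1: ["reason", "talk", "give", "turning", "go"],
    2: ["day", "llévame", "got", "put", "khalid"],
    3: ["feel", "nothing", "better", "love", "come"],
    4: ["back", "right", "love", "promise", "usin"],
}
tfidf_topics = {
    0: ["angel", "nothin", "else", "dododo", "spirit"],
    1: ["reason", "llévame", "knocked", "wildflower", "let"],
    2: ["nothing", "better", "one", "feel", "released"],
    3: ["dumb", "young", "alive", "broke", "turning"],
}

# For each TF-IDF topic, find the count-based topic with the largest overlap.
# Ties are resolved in favor of the lower topic id.
best_match = {}
for t_id, t_terms in tfidf_topics.items():
    scores = {l_id: jaccard(t_terms, l_terms) for l_id, l_terms in lda_topics.items()}
    best = max(scores, key=scores.get)
    best_match[t_id] = (best, round(scores[best], 3))
print(best_match)
```

On these terms, TF-IDF topics 2 and 3 line up with count-based topics 3 and 0 respectively (three shared terms each), while TF-IDF topic 0 shares no top term with any count-based topic.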


In [28]:
tfidf_vis = pyLDAvis.gensim_models.prepare(lda_model_tfidf, corpus_tfidf, dictionary)
tfidf_vis


To conclude, the visualization of the LDA model trained on raw counts gives us a clearer picture of the topic clusters than the TF-IDF variant. The estimated term frequency within a topic (shown in red) indicates how relevant a term is to that topic. As noted earlier, many of these terms are vague action words such as "feel", "talk", "go", and "give", so it is difficult to determine a meaningful connection between the terms and their topic.
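
As a possible next step for extending the stop-word list, the sketch below flags tokens that occur in a large share of the songs. The helper `stopword_candidates` and the 0.5 threshold are assumptions made for illustration, and the toy songs are made up; in the notebook the input would be `doc_clean`. Gensim's `Dictionary.filter_extremes` offers a related document-frequency filter.

```python
from collections import Counter

def stopword_candidates(docs, min_share=0.5):
    """Return tokens that occur in at least `min_share` of the documents.

    `min_share` is an arbitrary illustrative threshold, not a tuned value.
    """
    n_docs = len(docs)
    # Count each token once per document it appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return sorted(
        (tok for tok, df in doc_freq.items() if df / n_docs >= min_share),
        key=lambda tok: (-doc_freq[tok], tok),
    )

# Toy tokenized songs (made up for illustration).
toy_songs = [
    ["feel", "go", "night", "love"],
    ["feel", "talk", "go"],
    ["feel", "city", "give"],
    ["go", "young", "feel"],
]
print(stopword_candidates(toy_songs))  # ['feel', 'go']
```

Candidates found this way still need a manual check, since a frequent word such as "love" may be exactly the theme we want to keep.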