## Topic Modelling With LDA
This notebook uses Latent Dirichlet Allocation (LDA) to discover the topics in a collection of documents.

The dataset used here is a collection of articles published on "Medium".

In [56]:
import pandas as pd

In [2]:
dt = pd.read_csv("articles.csv")

In [3]:
dt

Unnamed: 0,author,claps,reading_time,link,title,text
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T..."
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...
...,...,...,...,...,...,...
332,Daniel Simmons,3.4K,8,https://itnext.io/you-can-build-a-neural-netwo...,You can build a neural network in JavaScript e...,Click here to share this article on LinkedIn »...
333,Eugenio Culurciello,2.8K,13,https://towardsdatascience.com/artificial-inte...,"Artificial Intelligence, AI in 2018 and beyond...",These are my opinions on where deep neural net...
334,Devin Soni,5.8K,4,https://towardsdatascience.com/spiking-neural-...,"Spiking Neural Networks, the Next Generation o...",Everyone who has been remotely tuned in to rec...
335,Carlos E. Perez,3.9K,7,https://medium.com/intuitionmachine/neurons-ar...,Surprise! Neurons are Now More Complex than We...,One of the biggest misconceptions around is th...


In [37]:
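import nltk

# Fetch the tokenizer models, stop-word list and WordNet data used below.
# Newer NLTK releases may additionally need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')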

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

The text is lemmatized with the NLTK library, and stop words are removed before LDA is run. Topic modelling needs the text column in vectorized form, so we also import TfidfVectorizer.


In [33]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
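# Note: this rebinds the name `stopwords` from the imported corpus module to a set of English stop words.
stopwords=set(nltk.corpus.stopwords.words('english'))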

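The author, title and text of each article are concatenated into a new column, "documents". The strings are joined directly with no separator, so the author's name runs into the first word of the title (for example "Justin LeeChatbots").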

In [13]:
dt['documents'] = dt['author']+dt['title']+dt['text']
dt

Unnamed: 0,author,claps,reading_time,link,title,text,documents
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T...",Justin LeeChatbots were the next big thing: wh...
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...,Conor DeweyPython for Data Science: 8 Concepts...
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...,William KoehrsenAutomated Feature Engineering ...
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...,Gant LabordeMachine Learning: how to go from Z...
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...,Emmanuel AmeisenReinforcement Learning from sc...
...,...,...,...,...,...,...,...
332,Daniel Simmons,3.4K,8,https://itnext.io/you-can-build-a-neural-netwo...,You can build a neural network in JavaScript e...,Click here to share this article on LinkedIn »...,Daniel SimmonsYou can build a neural network i...
333,Eugenio Culurciello,2.8K,13,https://towardsdatascience.com/artificial-inte...,"Artificial Intelligence, AI in 2018 and beyond...",These are my opinions on where deep neural net...,"Eugenio CulurcielloArtificial Intelligence, AI..."
334,Devin Soni,5.8K,4,https://towardsdatascience.com/spiking-neural-...,"Spiking Neural Networks, the Next Generation o...",Everyone who has been remotely tuned in to rec...,"Devin SoniSpiking Neural Networks, the Next Ge..."
335,Carlos E. Perez,3.9K,7,https://medium.com/intuitionmachine/neurons-ar...,Surprise! Neurons are Now More Complex than We...,One of the biggest misconceptions around is th...,Carlos E. PerezSurprise! Neurons are Now More ...
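Joining columns with `+` inserts no separator, and a missing value in any one column makes the whole concatenated value missing. The following is a minimal sketch of a more defensive variant, assuming a made-up toy DataFrame (`toy` and its rows are illustrative, not the articles dataset). Applying it to the real data would change the tokens and therefore the outputs shown in this notebook.

```python
import pandas as pd

# Toy frame standing in for the articles data; the values are illustrative only.
toy = pd.DataFrame({
    "author": ["Ann Author", "Bob Writer"],
    "title": ["First title", None],  # a missing title
    "text": ["Body one.", "Body two."],
})

# Replace missing values with empty strings so they do not propagate,
# then join the three columns with a single space between them.
author = toy["author"].fillna("")
title = toy["title"].fillna("")
text = toy["text"].fillna("")
toy["documents"] = author + " " + title + " " + text

print(toy["documents"].tolist())
```

The second row keeps its author and text even though the title is missing; without `fillna("")` that row's `documents` value would be missing as well.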


Lemmatization reduces each word to its base form so that derived forms are counted together. Stop words are removed, and only words longer than 3 characters are kept.

In [38]:
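def clean_text(text):
    """Tokenize, drop stop words and short tokens, then lemmatize."""
    le = WordNetLemmatizer()
    word_tokens = nltk.tokenize.word_tokenize(text)
    # Keep tokens longer than 3 characters that are not stop words.
    tokens = [le.lemmatize(w) for w in word_tokens if w not in stopwords and len(w) > 3]
    return " ".join(tokens)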

In [43]:
dt['cleaned_documents']=dt['documents'].apply(clean_text)
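One detail of the filter in `clean_text` is that the stop-word test is case-sensitive, while NLTK's English stop words are all lowercase. The sketch below reproduces only the filter's shape with a tiny hand-picked stop list (`stop_words` and `tokens` are assumptions for illustration, not the notebook's data) and compares it with a variant that lowercases first.

```python
# A tiny hand-picked stop list; the notebook uses NLTK's full English list.
stop_words = {"this", "that", "with", "have"}

tokens = ["This", "model", "learns", "with", "data"]

# Same filter shape as clean_text: drop stop words and tokens of length <= 3.
as_written = [t for t in tokens if t not in stop_words and len(t) > 3]

# Variant that lowercases first, so "This" matches the stop list.
lowered = [t.lower() for t in tokens]
case_folded = [t for t in lowered if t not in stop_words and len(t) > 3]

print(as_written)
print(case_folded)
```

In the first list the capitalized "This" survives; in the second it is removed. In this notebook the effect is likely small, because TfidfVectorizer lowercases by default and applies its own English stop list afterwards.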

Applying TF-IDF vectorization to the cleaned documents gives a document-term matrix on which the topic modelling can be carried out. TF-IDF weights each term by how often it appears in a document, scaled down by the number of documents that contain it, so words common to most documents receive low weights. The vocabulary is capped at the 1000 most frequent terms (max_features=1000).

In [47]:
vect = TfidfVectorizer(stop_words='english', max_features=1000)
vect_text = vect.fit_transform(dt['cleaned_documents'])
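To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a three-document toy corpus. It assumes scikit-learn's default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the corpus and the helper `tfidf_row` are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus; each string stands in for one cleaned document.
docs = [
    "neural network learns image",
    "neural network training",
    "policy agent reward",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf_row(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items()):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "image" and "learns" appear in only one document and so get a higher weight than "neural" and "network", which appear in two.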

The LDA model is given the number of topics (n_components), the learning method, the maximum number of iterations (max_iter) and a random state for reproducibility. With learning_method='online' the model is updated from mini-batches of documents rather than from the full dataset at every step, and max_iter=1 means a single pass over the data.

In [48]:
from sklearn.decomposition import LatentDirichletAllocation
lda_model=LatentDirichletAllocation(n_components=10, learning_method='online',random_state=42,max_iter=1)
lda_top=lda_model.fit_transform(vect_text)
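LDA is a model of word counts, so it is commonly fitted on a raw count matrix rather than on TF-IDF weights, and usually for more than one pass. The sketch below shows that variant on a toy corpus. The corpus, `n_components=2`, `learning_method="batch"` and `max_iter=20` are all assumptions chosen for illustration, not settings taken from this notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned documents (illustrative only).
toy_docs = [
    "neural network layer training image",
    "neural network gradient activation",
    "agent policy reward action",
    "policy agent environment reward",
]

# Raw term counts instead of TF-IDF weights.
counts = CountVectorizer().fit_transform(toy_docs)

# Batch updates with several passes over the (small) dataset.
lda = LatentDirichletAllocation(
    n_components=2, learning_method="batch", max_iter=20, random_state=0
)
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)        # one row per document, one column per topic
print(doc_topic.sum(axis=1))  # each row is a probability distribution
```

Swapping the notebook's vectorizer or iteration count in this way would change every output shown below, so it is best treated as an experiment to compare against the original run.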

We can check the topic proportions assigned to the first document. Each row of lda_top holds one document's distribution over the 10 topics.

In [54]:
print("Document 0: ")
for i,topic in enumerate(lda_top[0]):
  print("Topic ",i,": ",topic*100,"%")

Document 0: 
Topic  0 :  0.7559789688538099 %
Topic  1 :  0.755996113799946 %
Topic  2 :  0.7559862643689164 %
Topic  3 :  0.7559928294840177 %
Topic  4 :  0.7559757312908226 %
Topic  5 :  93.19613438684421 %
Topic  6 :  0.7559735067224205 %
Topic  7 :  0.7559847337551232 %
Topic  8 :  0.7559835291650461 %
Topic  9 :  0.7559939357156936 %
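The dominant topic of a document is simply the index of the largest proportion in its row. A minimal sketch using the percentages printed above for document 0, rounded to three decimals (`doc0` is a hand-copied list, not a variable from the notebook):

```python
# Percentages printed above for document 0, rounded to three decimals.
doc0 = [0.756, 0.756, 0.756, 0.756, 0.756, 93.196, 0.756, 0.756, 0.756, 0.756]

# Index of the largest proportion.
dominant = max(range(len(doc0)), key=doc0.__getitem__)

print(f"dominant topic: {dominant} ({doc0[dominant]:.1f}%)")
print(f"total: {sum(doc0):.1f}%")
```

For the full matrix, `lda_top.argmax(axis=1)` should give the dominant topic of every document at once, since `lda_top` is a NumPy array.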


Next, we list the ten highest-weighted words in each topic:

In [63]:
vocab = vect.get_feature_names_out()
for i, comp in enumerate(lda_model.components_):
  # Pair each vocabulary term with its weight in topic i.
  vocab_comp = zip(vocab, comp)
  # Keep the 10 highest-weighted terms.
  sorted_words = sorted(vocab_comp, key=lambda x: x[1], reverse=True)[:10]
  print("Topic "+str(i)+": ")
  for t in sorted_words:
    print(t[0],end=" ")
  print(" ")

Topic 0: 
innovation intelligence letter human machine data child policy generation company  
Topic 1: 
data learning model image table network machine reward capsule algorithm  
Topic 2: 
cnn region segmentation image bounding object box pixel mask proposal  
Topic 3: 
network data human voice game training google like sheet cheat  
Topic 4: 
function network relu activation tensorflow gradient sigmoid zero data graph  
Topic 5: 
network neural model data image learning neuron layer training like  
Topic 6: 
member learning machine image model function column method value self  
Topic 7: 
network trump neural learning data sheet 1080 woman sex interview  
Topic 8: 
learning machine course action agent network model deep algorithm policy  
Topic 9: 
data network face learning neural course machine neuron human like  
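Several words recur across the topic lists above. One rough way to quantify that is to count, for each word, how many topics list it among their top ten. The sketch below does this using the lists copied by hand from the output above (`top_words` and the threshold of 5 are assumptions for illustration).

```python
from collections import Counter

# Top-10 words per topic, copied from the output above.
top_words = {
    0: "innovation intelligence letter human machine data child policy generation company",
    1: "data learning model image table network machine reward capsule algorithm",
    2: "cnn region segmentation image bounding object box pixel mask proposal",
    3: "network data human voice game training google like sheet cheat",
    4: "function network relu activation tensorflow gradient sigmoid zero data graph",
    5: "network neural model data image learning neuron layer training like",
    6: "member learning machine image model function column method value self",
    7: "network trump neural learning data sheet 1080 woman sex interview",
    8: "learning machine course action agent network model deep algorithm policy",
    9: "data network face learning neural course machine neuron human like",
}

# For each word, count the number of topics that list it.
topic_count = Counter(w for words in top_words.values() for w in set(words.split()))

# Words shared by at least half of the topics.
shared = {w: n for w, n in topic_count.items() if n >= 5}
print(shared)
```

Words that appear in most topics do little to tell the topics apart. Heavy overlap of this kind may indicate that the model has not separated the topics well, which would be plausible after a single pass (max_iter=1), although it can also reflect a corpus that is narrowly focused on machine learning.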
