Introduction to topic modeling and its applications
---
Popular topic modeling algorithms: Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF)
---
Hands-on exercise: Implementing a topic modeling algorithm (e.g., LDA) using
Python libraries like Gensim

Topic modeling is a natural language processing (NLP) technique used to uncover hidden topics or themes within a collection of documents. It is a way to extract meaningful information from unstructured text data by identifying patterns and relationships among words and documents.

---



The goal of topic modeling is to automatically discover the main topics or themes that are present in a large corpus of text. It allows us to gain insights into the underlying structure of the text data and understand the different subjects or concepts that the documents cover.

---
Topic modeling can be applied in various domains and has numerous applications. Some examples include:

1. Document clustering: Grouping similar documents together based on their topics.
2. Document summarization: Identifying the most representative topics in a document collection to create concise summaries.
3. Information retrieval: Enhancing search engines by indexing documents based on their topics rather than just keywords.
4. Social media analysis: Analyzing trends and discussions on social media platforms by identifying the dominant topics.
5. Market research: Understanding customer opinions and feedback by extracting topics from customer reviews or surveys.



Latent Dirichlet Allocation (LDA):

- LDA is a generative probabilistic model used for topic modeling.
- It assumes that each document in a collection is a mixture of various topics, and each word in the document is attributable to one of the topics.
- LDA infers the underlying topics by analyzing the co-occurrence patterns of words across multiple documents.
- It models the distribution of topics in the corpus and the distribution of words within each topic.
- LDA treats **topics as probability distributions over words** and **documents as probability distributions over topics**.
- The algorithm iteratively updates the topic assignments for words and estimates the topic-word and document-topic distributions.
- By analyzing these distributions, LDA can identify the main topics present in a collection of documents.
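
The generative assumption described above can be illustrated with a small simulation. This is only a sketch using the standard library: the vocabulary, the two hand-written topics, and the prior `[0.5, 0.5]` are assumptions chosen for illustration, and the code shows how LDA assumes documents are produced, not how a library such as Gensim fits the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_topic_prior, vocabulary, length, rng):
    """Generate one document following LDA's generative story."""
    # Each document gets its own probability distribution over topics
    doc_topic = sample_dirichlet(doc_topic_prior, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic
        topic = rng.choices(range(len(topic_word)), weights=doc_topic)[0]
        word = rng.choices(vocabulary, weights=topic_word[topic])[0]
        words.append(word)
    return doc_topic, words

rng = random.Random(0)
vocabulary = ["cricket", "football", "sport", "science", "physics", "chemistry"]

# Two hypothetical topics: each row is a probability distribution over the vocabulary
topic_word = [
    [0.35, 0.35, 0.25, 0.02, 0.02, 0.01],  # a "sports"-like topic
    [0.02, 0.02, 0.01, 0.35, 0.30, 0.30],  # a "science"-like topic
]

doc_topic, words = generate_document(topic_word, [0.5, 0.5], vocabulary, 8, rng)
print("Topic mixture:", [round(p, 3) for p in doc_topic])
print("Generated words:", words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly produced them.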

Non-Negative Matrix Factorization (NMF):

- NMF is a matrix factorization technique commonly used for topic modeling.
- It represents a document-term matrix as the product of two non-negative matrices: a document-topic matrix and a topic-term matrix.
- NMF assumes that documents can be expressed as combinations of a fixed number of latent topics, and the topics can be represented by a fixed set of terms.
- The algorithm iteratively updates the entries of the document-topic and topic-term matrices, aiming to minimize the reconstruction error between the original matrix and the product of the two factors.
- NMF enforces non-negativity constraints, meaning that all the values in the matrices are non-negative, which helps in interpretability.
- The resulting **document-topic matrix** and **topic-term matrix** can be used to identify the prominent topics and their associated terms.
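
The factorization described above can be sketched in plain Python. The example below assumes the classic multiplicative update rules for squared reconstruction error, which is one common way to fit NMF and not necessarily what any particular library does internally. The small count matrix `v` and the names `nmf`, `matmul`, and `frobenius_error` are hypothetical and used only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between V and the product W x H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, num_topics, iterations=200, seed=0, eps=1e-9):
    """Factorize V (documents x terms) into W (documents x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    # Start from strictly positive random matrices
    w = [[rng.random() + 0.1 for _ in range(num_topics)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(num_topics)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(iterations):
        # Update H element-wise: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * numer[k][j] / (denom[k][j] + eps) for j in range(len(h[0]))]
             for k in range(num_topics)]
        # Update W element-wise: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * numer[i][k] / (denom[i][k] + eps) for k in range(num_topics)]
             for i in range(len(w))]
    return w, h, initial_error, frobenius_error(v, w, h)

# Toy document-term counts; columns: cricket, football, sport, science, physics
v = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [1, 1, 1, 0, 0],
]
w, h, initial_error, final_error = nmf(v, num_topics=2)
print("Error before and after:", round(initial_error, 4), round(final_error, 4))
print("Document-topic matrix:", [[round(x, 2) for x in row] for row in w])
```

Because every update multiplies a non-negative entry by a non-negative ratio, both factors stay non-negative throughout, which is the constraint that makes the resulting topics easy to read.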

Latent Dirichlet Allocation Video Lecture

https://www.youtube.com/watch?v=T05t-SqKArY

## Hands-on Exercise: Implementing a Topic Modeling Algorithm (e.g., LDA) using Python Libraries like Gensim

---


In this hands-on exercise, we will learn how to implement a topic modeling algorithm, such as Latent Dirichlet Allocation (LDA), using Python libraries like Gensim. Topic modeling is a powerful technique for discovering hidden themes or topics in a collection of documents. We will walk through the steps involved in preprocessing the text data, training an LDA model, and analyzing the results.

---



Note: Before proceeding with this exercise, make sure you have Gensim and the other necessary libraries installed. You can install Gensim using `pip install gensim`.

In [None]:
!pip install gensim

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Exercise 1: Preprocessing the Text Data
In this exercise, we will perform preprocessing on the text data to prepare it for topic modeling.

Task:

*   Load a collection of documents.
*   Preprocess the text data by performing the following steps:
    *   Convert the text to lowercase.
    *   Tokenize the text into individual words.
    *   Remove stopwords and punctuation.
    *   Perform stemming or lemmatization to reduce words to their base forms.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import string
# Load documents

documents = [
    "Cricket is a good sport. I like football also. What do you like to play?",
    "Science is about experimenting and discovering laws of nature. Physics, Chemistry and Biology are fundamental sciences.",
    "Which is a better sport, cricket or football?"
]

# Preprocess the text data
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords and punctuation
    stop_words = set(stopwords.words('english'))
    punctuation = set(string.punctuation)
    tokens = [word for word in tokens if word not in stop_words and word not in punctuation]

    # Perform lemmatization or stemming
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    #tokens = [stemmer.stem(word) for word in tokens]  # Uncomment this line for stemming

    return tokens

# Apply preprocessing to all documents
preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Print the preprocessed documents
for i, doc in enumerate(preprocessed_documents):
    print(f"Document {i+1}: {doc}")


Document 1: ['cricket', 'good', 'sport', 'like', 'football', 'also', 'like', 'play']
Document 2: ['science', 'experimenting', 'discovering', 'law', 'nature', 'physic', 'chemistry', 'biology', 'fundamental', 'science']
Document 3: ['better', 'sport', 'cricket', 'football']




## Exercise 2: Training an LDA Model
In this exercise, we will train an LDA model on the preprocessed text data.

Task:


*   Convert the preprocessed text data into a bag-of-words representation.
*   Train an LDA model on the bag-of-words representation with the desired number of topics.
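
To make the bag-of-words step concrete, here is a minimal sketch in plain Python of what such a representation contains: each document becomes a list of `(token_id, count)` pairs, and word order is discarded. The id assignment below (order of first appearance) is an assumption for illustration; Gensim's `Dictionary` chooses its own ids, so the numbers it produces may differ.

```python
from collections import Counter

# Two of the preprocessed documents from Exercise 1
docs = [
    ["cricket", "good", "sport", "like", "football", "also", "like", "play"],
    ["better", "sport", "cricket", "football"],
]

# Assign each distinct token an integer id in order of first appearance
token_to_id = {}
for doc in docs:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(token_to_id)
print(bow[0])  # "like" appears twice, so its pair has count 2
```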





In [None]:
from gensim import corpora
from gensim.models import LdaModel

# Create a dictionary from the preprocessed documents
dictionary = corpora.Dictionary(preprocessed_documents)

# Create a bag-of-words representation of the documents
bow_corpus = [dictionary.doc2bow(doc) for doc in preprocessed_documents]

# Train the LDA model
num_topics = 2
lda_model = LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Print the topics
topics = lda_model.print_topics(num_topics=num_topics)
for topic in topics:
    print(topic)


(0, '0.135*"science" + 0.081*"discovering" + 0.081*"law" + 0.081*"biology" + 0.081*"physic" + 0.081*"chemistry" + 0.081*"nature" + 0.081*"experimenting" + 0.081*"fundamental" + 0.027*"better"')
(1, '0.122*"like" + 0.122*"football" + 0.122*"cricket" + 0.122*"sport" + 0.073*"also" + 0.073*"play" + 0.073*"good" + 0.073*"better" + 0.025*"science" + 0.025*"fundamental"')
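
Each printed topic is a string of `weight*"word"` terms, where the weight is the word's probability within that topic. As a small sketch, such a string can be turned into `(word, weight)` pairs with a regular expression. The helper name `parse_topic` is hypothetical, and the input is a shortened copy of the first topic printed above. Gensim's `show_topic` method returns similar pairs directly from a trained model.

```python
import re

# First four terms of topic 0, copied from the output above
topic_string = '0.135*"science" + 0.081*"discovering" + 0.081*"law" + 0.081*"biology"'

def parse_topic(text):
    """Turn a 'weight*"word" + ...' string into (word, weight) pairs."""
    matches = re.findall(r'([0-9.]+)\*"([^"]+)"', text)
    return [(word, float(weight)) for weight, word in matches]

pairs = parse_topic(topic_string)
top_word, top_weight = max(pairs, key=lambda pair: pair[1])
print(pairs)
print("Highest-weighted word:", top_word, top_weight)
```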




## Exercise 3: Analyzing the Results
In this exercise, we will analyze the results of the trained LDA model.

Task:

*   Print the most representative document for each topic.
*   Visualize the topics using the pyLDAvis library.


In [None]:
!pip install pyLDAvis



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis  # module name in pyLDAvis 3.3 and later
except ImportError:
    import pyLDAvis.gensim as gensimvis  # module name in older pyLDAvis releases

# Infer the topic distribution of every document once, keeping all topics
topic_documents = list(lda_model.get_document_topics(bow_corpus, minimum_probability=0.0))

# Print the most representative document for each topic
for topic_id in range(num_topics):
    # Probability of this topic in each document (0.0 if the topic is missing)
    probabilities = [dict(dist).get(topic_id, 0.0) for dist in topic_documents]
    best_doc = max(range(len(probabilities)), key=probabilities.__getitem__)
    print(f"Topic {topic_id}: Document ID {best_doc} - Probability: {probabilities[best_doc]:.4f}")

# Visualize the topics
lda_visualization = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(lda_visualization)


### Exercise: Print the topic distribution for each of the 3 documents

In [None]:
for doc_id, topic_distribution in enumerate(topic_documents):
    print(f"Document ID: {doc_id}")

    if not topic_distribution:
        # An empty list means no topic passed the probability threshold
        print("No topics found for this document.")
    else:
        for topic, probability in topic_distribution:
            print(f"Topic {topic}: Probability {probability}")

    print()  # Print an empty line to separate documents


Document ID: 0
Topic 0: Probability 0.05736686661839485
Topic 1: Probability 0.9426330924034119

Document ID: 1
Topic 0: Probability 0.9529885649681091
Topic 1: Probability 0.047011446207761765

Document ID: 2
Topic 0: Probability 0.1029861643910408
Topic 1: Probability 0.8970138430595398
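
The distributions printed above can be reduced to a single dominant topic per document, which makes the grouping explicit. This is a minimal sketch in plain Python: the probabilities are copied from the output above and rounded to four decimals, and the helper name `dominant_topic` is hypothetical. A fresh training run may number the topics differently or produce slightly different values.

```python
# Topic probabilities copied from the output above, rounded to four decimals
doc_topic = {
    0: {0: 0.0574, 1: 0.9426},
    1: {0: 0.9530, 1: 0.0470},
    2: {0: 0.1030, 1: 0.8970},
}

def dominant_topic(distribution):
    """Return the topic id with the highest probability."""
    return max(distribution, key=distribution.get)

# Group document ids by their dominant topic
groups = {}
for doc_id, distribution in doc_topic.items():
    groups.setdefault(dominant_topic(distribution), []).append(doc_id)

print(groups)  # {1: [0, 2], 0: [1]}
```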





In [None]:
# From the output above, it is clear that the 1st and 3rd documents are about the same topic (sports), whereas the 2nd document is about a different topic (science).