# What is topic modeling?

### Summary

This section introduces topic modeling, an unsupervised machine learning technique used to identify patterns and group similar documents based on their content. It explains how topic modeling algorithms analyze text data to discover underlying themes and provides a practical example to illustrate the concept.

### Highlights

- 🔍 Topic modeling identifies patterns and groups similar documents.
- 🤖 It's an unsupervised learning technique, requiring no labeled data.
- 📝 Algorithms detect word patterns to determine document themes.
- 🧩 Grouping documents based on shared themes simplifies large text datasets.
- 💡 Example themes include equipment, media, and government regulations.
- 🚀 Topic modeling automates the identification of key themes in text.
- 🧠 It can uncover patterns that humans might miss in large datasets.
- 🛠️ Two common algorithms are Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).

# When to use topic modeling?

### Summary

This section provides practical examples of how topic modeling can be applied in various real-world scenarios, such as organizing news articles, analyzing customer feedback, and conducting social listening. It emphasizes the efficiency and time-saving benefits of using topic modeling to extract key themes from large volumes of text data.

### Highlights

- 📰 Grouping news articles or research papers under relevant topics.
- 🗣️ Analyzing customer feedback and reviews to understand consumer sentiment.
- 📱 Monitoring social media data for brand-related discussions.
- ⏱️ Topic modeling automates the process, saving time and effort.
- 🤖 Identifying key themes in text data more efficiently than manual methods.
- 📊 Organizing large datasets of text into meaningful categories.
- 💡 Discovering insights and patterns that may be missed by human analysis.

# Latent Dirichlet Allocation (LDA)

### Summary

This section explains the Latent Dirichlet Allocation (LDA) algorithm, a technique used for topic modeling. It details how LDA identifies latent topics within documents by analyzing word frequencies and distributions. The algorithm iteratively assigns words to topics, refining these assignments until a stable set of topics is achieved.

### Highlights

- 🧩 LDA identifies latent topics by analyzing word patterns in documents.
- 📝 Each document is modeled as a mixture of topics: it may lean heavily on one topic while still drawing some words from others.
- 🔄 LDA uses an iterative process to refine topic assignments.
- 📊 The algorithm considers word proportions within documents and across the corpus.
- 🔢 The number of topics (k) is pre-defined.
- 🎲 Initial word-to-topic assignments are random and are refined over iterations.
- 💡 Iteration continues until the assignments stabilize, which yields the final topics.
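
The iterative refinement described above can be sketched with collapsed Gibbs sampling, one common way to fit LDA. This is an assumption for illustration: the section does not say which inference method is meant, and Gensim's `LdaModel` uses variational Bayes instead. The function name `toy_lda_gibbs`, the `alpha` and `beta` smoothing values, and the toy documents are all hypothetical.

```python
import random
from collections import Counter


def toy_lda_gibbs(docs, k, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (toy scale only)."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * k for _ in docs]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k

    # Step 1: give every token a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            t = rng.randrange(k)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    # Step 2: repeatedly resample each token's topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Weight = (topic share in this document)
                #        * (word share in this topic across the corpus).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(k)
                ]
                new = rng.choices(range(k), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus: two "equipment" and two "regulation" documents.
docs = [
    ["camera", "lens", "tripod", "camera"],
    ["lens", "tripod", "camera"],
    ["law", "policy", "regulation", "law"],
    ["policy", "regulation", "law"],
]
doc_topic, topic_word = toy_lda_gibbs(docs, k=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)  # number of tokens per topic in document d
```

Each row of `doc_topic` shows how many of a document's tokens ended up in each topic. On a real corpus a library implementation is the better choice; this sketch only makes the "random start, then refine" loop concrete.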

# LDA in Python

### Summary

This section demonstrates how to implement Latent Dirichlet Allocation (LDA) for topic modeling using the Gensim library in Python. It covers the steps from data loading and preprocessing to model training and topic interpretation.

### Highlights

- 🐍 Utilizing Gensim for LDA topic modeling.
- 📝 Preprocessing text data: removing punctuation, lowercasing, tokenizing, removing stop words, and stemming.
- 📚 Creating a dictionary of unique words and a document-term matrix.
- 🤖 Training the LDA model with specified parameters.
- 📊 Printing and interpreting the top words for each identified topic.
- 🛠️ Adjusting the number of topics or refining data preprocessing for better results.
- 💡 Exploring alternative topic modeling methods for potentially more informative outcomes.

### Code Examples

```python
import re

import nltk
import pandas as pd
from gensim import corpora
from gensim.models import LdaModel

# One-time downloads of the tokenizer and stop word list.
# Resource names depend on the NLTK version; recent releases may need 'punkt_tab'.
nltk.download('punkt')
nltk.download('stopwords')

# Load data, skipping rows with missing article text
data = pd.read_csv("news_articles.csv")
articles = data['content'].dropna().astype(str).tolist()

# Build these once instead of on every call to preprocess()
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.stem.PorterStemmer()

# Preprocessing: strip punctuation, lowercase, tokenize, drop stop words, stem
def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text).lower()
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return [stemmer.stem(word) for word in tokens]

processed_articles = [preprocess(article) for article in articles]

# Create dictionary (token -> id) and bag-of-words document-term matrix
dictionary = corpora.Dictionary(processed_articles)
doc_term_matrix = [dictionary.doc2bow(article) for article in processed_articles]

# Train LDA model; random_state makes runs reproducible
num_topics = 2
lda_model = LdaModel(corpus=doc_term_matrix, id2word=dictionary,
                     num_topics=num_topics, random_state=42)

# Print the top 10 words for each topic
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
```

# Latent Semantic Analysis (LSA)

### Summary

This section introduces Latent Semantic Analysis (LSA), a topic modeling technique based on the distributional hypothesis and Singular Value Decomposition (SVD). It explains how LSA transforms text documents into vector representations, enabling the identification of similar words and documents.

### Highlights

- 🧠 LSA relies on the distributional hypothesis: words with similar meanings tend to appear in similar contexts.
- 🔢 Singular Value Decomposition (SVD) is used for dimensionality reduction.
- 📊 SVD factors the document-term matrix into a document-topic matrix, a diagonal matrix of singular values, and a term-topic matrix.
- 📉 LSA vectors capture different aspects of meaning in the text.
- 💡 LSA can identify similar words and documents through clustering and similarity scores.
- 🧮 The singular values from SVD indicate how much variance each latent topic explains.
- 🛠️ LSA provides a structured approach to analyze and interpret text data.
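
The decomposition described above can be sketched with NumPy on a tiny document-term matrix. This is a minimal illustration, not how Gensim implements LSA internally; the matrix values, the term list, and the `cosine` helper are made up for the example.

```python
import numpy as np

# Toy document-term matrix: rows are documents, columns are term counts.
terms = ["camera", "lens", "tripod", "law", "policy"]
X = np.array([
    [2, 1, 1, 0, 0],   # equipment document
    [1, 1, 0, 0, 0],   # equipment document
    [0, 0, 0, 2, 1],   # regulation document
    [0, 0, 0, 1, 2],   # regulation document
], dtype=float)

# SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k strongest latent topics (dimensionality reduction).
k = 2
doc_topic = U[:, :k] * s[:k]   # document coordinates in topic space
term_topic = Vt[:k, :].T       # term weights for each topic

# Share of total variance captured by each latent topic.
explained = s**2 / np.sum(s**2)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print("explained variance:", np.round(explained, 3))
print("doc 0 vs doc 1:", cosine(doc_topic[0], doc_topic[1]))
print("doc 0 vs doc 2:", cosine(doc_topic[0], doc_topic[2]))
```

In this toy matrix the two equipment documents point in nearly the same direction in topic space, while an equipment document and a regulation document are close to orthogonal. That is the sense in which LSA vectors support similarity scores.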

# LSA in Python

### Summary

This section demonstrates how to implement Latent Semantic Analysis (LSA) for topic modeling using the Gensim library. It highlights the similarities and differences between LSA and LDA, and shows how to interpret the resulting topics.

### Highlights

- 🐍 Utilizing Gensim's `LsiModel` for LSA topic modeling.
- 🔄 LSI (Latent Semantic Indexing) and LSA refer to the same technique; Gensim uses the name LSI.
- 📊 Comparing LSA and LDA results, noting slight differences in topic composition.
- 📝 Interpreting the most important words for each topic generated by LSA.
- 🛠️ Preparing for the next step: optimizing the number of topics for better results.

### Code Examples

```python
from gensim.models import LsiModel

# Assuming doc_term_matrix, dictionary, and num_topics are already defined

# Train LSA model
lsa_model = LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics)

# Print topics
topics = lsa_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
```

# How many topics?

### Summary

This section explains how to determine the optimal number of topics for LSA using coherence scores. It demonstrates how to iterate through different numbers of topics, calculate coherence scores, and visualize them to identify the best number of topics. It also emphasizes the importance of considering business context and intuition alongside mathematical metrics.

### Highlights

- 📊 Coherence scores help determine the optimal number of topics.
- 🐍 Utilizing matplotlib for visualizing coherence scores.
- 🔄 Iterating through different numbers of topics to calculate coherence.
- 📈 Identifying the number of topics with the highest coherence score.
- 🤖 Training the final LSA model with the optimal number of topics.
- 🧠 Balancing mathematical accuracy with business context and intuition.
- 💡 Manually inspecting topics to ensure they are meaningful and coherent.

### Code Examples

```python
import matplotlib.pyplot as plt
from gensim.models import LsiModel
from gensim.models.coherencemodel import CoherenceModel

# Assuming doc_term_matrix, dictionary, and processed_articles are already defined

coherence_values = []
model_list = []
min_topics = 2
max_topics = 11

for num_topics in range(min_topics, max_topics + 1):
    model = LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=num_topics)
    model_list.append(model)
    # 'c_v' coherence needs the tokenized documents, not the raw article strings
    coherence_model = CoherenceModel(model=model, texts=processed_articles,
                                     dictionary=dictionary, coherence='c_v')
    coherence_values.append(coherence_model.get_coherence())

# Plot coherence scores
plt.plot(range(min_topics, max_topics + 1), coherence_values)
plt.xlabel("Number of Topics")
plt.ylabel("Coherence Score")
plt.show()

# Train final LSA model with optimal number of topics
final_num_topics = 3  # Based on the plot
final_lsa_model = LsiModel(corpus=doc_term_matrix, id2word=dictionary, num_topics=final_num_topics)

# Print final topics
final_topics = final_lsa_model.print_topics(num_words=10)
for topic in final_topics:
    print(topic)
```
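
Reading the best topic count off the plot can also be done in code. The sketch below picks the count with the highest score; `best_num_topics` is a hypothetical helper, and the scores are made-up numbers for illustration, not results from a real model.

```python
def best_num_topics(topic_counts, coherence_values):
    """Return the topic count whose coherence score is highest."""
    best_index = max(range(len(coherence_values)),
                     key=coherence_values.__getitem__)
    return topic_counts[best_index]


# Made-up scores for illustration only.
topic_counts = list(range(2, 7))          # 2, 3, 4, 5, 6 topics
example_scores = [0.31, 0.42, 0.38, 0.35, 0.33]

print(best_num_topics(topic_counts, example_scores))  # prints 3
```

The highest score is a starting point, not a final answer: it is still worth inspecting the topics by hand and weighing them against the business context.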