# Section 5: Topic Modeling

This notebook covers topic modeling with Latent Dirichlet Allocation (LDA), first using scikit-learn and then using Gensim.

## 1. Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved (latent) groups, which account for why some parts of the data are similar. When the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word in a document is generated by one of that document's topics.
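
The generative story described above can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn or Gensim fit a model: the vocabulary, the two topic-word distributions, the prior values, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made up for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, doc_topic_prior, length, rng):
    """Generate one document by following LDA's generative story."""
    # 1. Draw this document's topic mixture (theta) from a Dirichlet prior.
    theta = sample_dirichlet(doc_topic_prior, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(42)

# Hypothetical vocabulary and topic-word distributions (illustration only).
vocab = ['election', 'vote', 'senate', 'goal', 'team', 'match']
topic_word_probs = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "politics"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports"-like topic
]

theta, words = generate_document(topic_word_probs, vocab, [0.5, 0.5], 8, rng)
print('Topic mixture:', [round(p, 3) for p in theta])
print('Document:', ' '.join(words))
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word distributions and each document's topic mixture.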

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample documents
documents = [
    'This is the first document about politics.',
    'This document is the second document about sports.',
    'And this is the third one about politics and sports.',
    'Is this the first document about sports?'
]

# Create a CountVectorizer object
vectorizer = CountVectorizer(stop_words='english')

X = vectorizer.fit_transform(documents)

# Create an LDA object
lda = LatentDirichletAllocation(n_components=2, random_state=42)

# Fit the LDA model
lda.fit(X)

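# Print the three highest-weighted words for each topic
feature_names = vectorizer.get_feature_names_out()
for index, topic in enumerate(lda.components_):
    print(f'Topic #{index}')
    # argsort sorts ascending, so the last three indices are the top three words
    print([feature_names[i] for i in topic.argsort()[-3:]])
    print('\n')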

## 2. Gensim for Topic Modeling

Gensim is a popular open-source Python library for unsupervised topic modeling and natural language processing. It is designed to be efficient and scalable: corpora can be streamed one document at a time, so the full dataset does not have to fit in memory.

One advantage of Gensim over scikit-learn for topic modeling is its broader, text-specific tooling. Gensim provides dedicated classes for building and manipulating dictionaries and corpora, and it implements several topic modeling algorithms, including LDA, Latent Semantic Indexing (LSI), and the Hierarchical Dirichlet Process (HDP).
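
To make the dictionary and corpus idea concrete, here is a minimal plain-Python sketch of the underlying data structures. It is not Gensim's implementation: the helper names `build_dictionary` and `doc_to_bow` are hypothetical, and the ids Gensim assigns to tokens may differ. It only mirrors the shape of a bag-of-words corpus, where each document becomes a list of `(token_id, count)` pairs.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc_to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [
    ['first', 'document', 'politics'],
    ['document', 'second', 'document', 'sports'],
]

token2id = build_dictionary(docs)
bow_corpus = [doc_to_bow(doc, token2id) for doc in docs]

print(token2id)
print(bow_corpus)
```

In the second document, `'document'` appears twice, so its id is paired with a count of 2. Word order is discarded, which is the bag-of-words assumption that LDA relies on.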

In [None]:
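import string

import gensim
import nltk
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (a no-op if they are already installed).
# 'punkt_tab' is the tokenizer resource expected by newer NLTK releases.
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)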

# Sample documents
documents = [
    'This is the first document about politics.',
    'This document is the second document about sports.',
    'And this is the third one about politics and sports.',
    'Is this the first document about sports?'
]

# Preprocess the documents
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def preprocess(doc):
    tokens = word_tokenize(doc.lower())
    return [token for token in tokens if token not in stop_words and token not in punctuation]

processed_docs = [preprocess(doc) for doc in documents]

# Create a dictionary and a corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Create an LDA model (random_state makes the topics reproducible across runs)
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15, random_state=42)

# Print the topics
for topic in ldamodel.print_topics():
    print(topic)