# Latent Dirichlet Allocation (LDA) Implementation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. As a topic model, it is used to discover abstract topics in a collection of documents: each document is modeled as a mixture of topics, and each topic as a probability distribution over words.
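
To make the word "generative" concrete, here is a minimal sketch of how LDA assumes a document is produced. The sizes, the prior values `alpha` and `eta`, and the helper `generate_document` are illustrative assumptions and are not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and priors (assumptions, not values used by the notebook).
n_topics, vocab_size, doc_length = 3, 8, 20
alpha = np.full(n_topics, 0.5)   # symmetric Dirichlet prior over topics per document
eta = np.full(vocab_size, 0.5)   # symmetric Dirichlet prior over words per topic

# One word distribution per topic: shape (n_topics, vocab_size).
topic_word = rng.dirichlet(eta, size=n_topics)

def generate_document(length):
    """Sample one document as an array of word ids (hypothetical helper)."""
    theta = rng.dirichlet(alpha)                           # topic mixture of this document
    topics = rng.choice(n_topics, size=length, p=theta)    # one topic per word slot
    words = [rng.choice(vocab_size, p=topic_word[z]) for z in topics]
    return theta, np.array(words)

theta, words = generate_document(doc_length)
print("topic mixture:", theta.round(2))
print("word ids:", words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the per-topic word distributions.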

# Import necessary libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

# Load a sample dataset (20 newsgroups)

In [3]:
data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
documents = data.data

# Text preprocessing and vectorization

In [4]:
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X = vectorizer.fit_transform(documents)

# LDA Initialization and Training

In [5]:
lda = LatentDirichletAllocation(n_components=10, random_state=42)

# Fit LDA on the document-term matrix
lda.fit(X)
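
Once a model is fitted, `transform` (or `fit_transform`) returns the topic mixture of each document. The sketch below uses a made-up corpus and two topics so that it runs in a moment; the `toy_` names are illustrative assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A made-up four-document corpus (assumption, for illustration only).
toy_docs = [
    "rocket launch orbit nasa rocket",
    "nasa orbit shuttle launch",
    "team game score season team",
    "game season player score",
]

toy_X = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Shape (n_documents, n_components); each row is a distribution over topics.
doc_topic = toy_lda.fit_transform(toy_X)

print(doc_topic.round(2))
print("dominant topic per document:", doc_topic.argmax(axis=1))
```

The same call on the notebook's data would be `lda.transform(X)`, which gives one row of 10 topic proportions per newsgroup post.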

# Display the top words for each topic

In [6]:
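# get_feature_names_out() returns a NumPy array, so it can be indexed with an array of positions
feature_names = vectorizer.get_feature_names_out()
n_top_words = 10

for topic_idx, topic in enumerate(lda.components_):
    # Positions of the n_top_words highest-weighted words, in descending order of weight
    top_words_idx = topic.argsort()[::-1][:n_top_words]
    top_words = feature_names[top_words_idx]
    print(f"Topic #{topic_idx + 1}: {', '.join(top_words)}")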

Topic #1: com, lines, subject, organization, writes, article, space, nasa, posting, gov
Topic #2: file, windows, use, dos, program, image, files, available, window, ftp
Topic #3: ax, max, g9v, b8f, a86, 145, pl, 1d9, 0d, db
Topic #4: com, writes, article, lines, subject, organization, just, edu, don, like
Topic #5: government, people, israel, gun, president, law, state, new, war, armenian
Topic #6: edu, subject, lines, organization, university, posting, host, nntp, uk, ac
Topic #7: god, people, don, think, know, say, just, like, does, believe
Topic #8: edu, writes, article, organization, subject, lines, university, cs, posting, host
Topic #9: drive, key, use, chip, card, scsi, bit, encryption, clipper, disk
Topic #10: 10, 00, game, team, 25, 15, 20, 12, 1993, games
