## Topic Modeling Assignment

Topic modeling is a natural language processing technique used to discover hidden themes in a collection of documents. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm that assumes each document is a mixture of topics and each topic is a mixture of words, using probabilistic methods to identify these structures. It helps in summarizing, organizing, and exploring large text datasets.
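To make the "mixture" idea concrete, here is a minimal sketch of the generative story LDA assumes, written with NumPy only. The vocabulary, the number of topics, and the Dirichlet parameters are all made-up values for illustration; this samples a toy document and does not fit a model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and sizes (hypothetical values, for illustration only)
vocab = ["game", "team", "orbit", "launch", "vote", "law"]
n_topics, doc_length = 3, 8

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(alpha=np.full(len(vocab), 0.5), size=n_topics)

# Each document is a probability distribution over topics
doc_topic = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# Generate the document: pick a topic for each position, then a word from that topic
words = []
for _ in range(doc_length):
    z = rng.choice(n_topics, p=doc_topic)
    w = rng.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])

print("topic mixture:", doc_topic.round(2))
print("document:", " ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic-word and document-topic distributions.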

In this assignment, we will apply LDA to the 20 Newsgroups dataset from scikit-learn.

The 20 Newsgroups dataset is a popular dataset provided by scikit-learn for text classification and topic modeling tasks. It contains newsgroup posts organized into 20 different categories, making it a valuable resource for experimenting with natural language processing (NLP) techniques.
The dataset is divided into 20 topics, including technology, politics, sports, religion, and more. Examples of categories include:
- comp.graphics
- rec.sport.hockey
- sci.space
- talk.politics.misc

The data consists of newsgroup posts (documents) with associated labels indicating their categories.


## Imports and Libraries

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import numpy as np
from tests import load_datasets_test, preprocess_text_test, check_vectorizer_test, check_train_lda_test

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource in word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')


## Load dataset

[This](https://scikit-learn.org/dev/modules/generated/sklearn.datasets.fetch_20newsgroups.html) is the official scikit-learn documentation for `fetch_20newsgroups`.

TODO:
- Load training data from fetch_20newsgroups
- Load testing data from fetch_20newsgroups

In [10]:
# Load 20 Newsgroups dataset
newsgroups_train = # TODO
newsgroups_test = # TODO

load_datasets_test(newsgroups_train, newsgroups_test)

## Pre-processing text

Here we clean each post before modeling: tokenize, lowercase, remove stop words, drop non-alphabetic tokens, and lemmatize.
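As a rough illustration of what such a cleaning pipeline does to a sentence, here is a standard-library sketch. It is not the NLTK-based function the assignment asks for: the regex tokenizer and the tiny `TOY_STOP_WORDS` set are stand-ins invented for this example, and lemmatization is left out.

```python
import re

# Hypothetical miniature stop-word list, for illustration only
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def toy_preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9']+", text)               # crude tokenization
    tokens = [t.lower() for t in tokens]                      # lowercase
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]   # remove stop words
    tokens = [t for t in tokens if t.isalpha()]               # drop numbers and mixed tokens
    return " ".join(tokens)

print(toy_preprocess("The launch of Apollo 11 is a milestone in space exploration."))
# launch apollo milestone space exploration
```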

In [None]:
# Text preprocessing function
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    tokens = # TODO: Tokenize

    tokens = # TODO: lowercase all the tokens

    tokens = # TODO: remove stop words

    # Keep only purely alphabetic tokens: str.isalpha() is True when every
    # character is a letter, so numbers and punctuation are dropped
    tokens = [token for token in tokens if token.isalpha()]

    tokens = # TODO: lemmatize
    return ' '.join(tokens)

# Apply preprocessing to the dataset
train_data = [preprocess_text(text) for text in newsgroups_train.data]
test_data = [preprocess_text(text) for text in newsgroups_test.data]

preprocess_text_test(train_data[0])

## Vectorization

Refer to the scikit-learn documentation on `CountVectorizer`, which converts a collection of texts into a matrix of token counts.

In [4]:

vectorizer = # TODO: CountVectorizer with max_features of 10000, unigrams and bigrams (ngram range 1 to 2), and English stop words
train_vectors = # TODO: fit vectorizer on train data

check_vectorizer_test(vectorizer)

## Modeling and training

In [5]:
# Train LDA model
def train_lda(n_topics, train_vectors):
    # TODO: Call LatentDirichletAllocation passing n_topics for components, learning decay of 0.7, random state 42 and number of jobs as -1
    # TODO: Fit LDA on train_vectors
    return lda



# Display top terms for each topic
def display_topics(model, feature_names, no_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
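The slice `argsort()[:-no_top_words - 1:-1]` in `display_topics` can look cryptic. The small NumPy sketch below, using made-up weights, shows that it returns the indices of the largest values in descending order.

```python
import numpy as np

# Hypothetical word weights for one topic
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
n_top = 3

# argsort() sorts ascending; the reversed slice walks back from the end
# and stops after n_top elements
top_idx = weights.argsort()[:-n_top - 1:-1]
print(top_idx)  # [1 3 4]

# An equivalent and arguably more readable form
same_idx = np.argsort(-weights)[:n_top]
print(same_idx)  # [1 3 4]
```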


In [None]:
# Train LDA with different numbers of topics
n_topics_list = [10, 20, 30]
lda_models = []
for n_topics in n_topics_list:
    model = # TODO: Call train_lda for each value in n_topics_list
    check_train_lda_test(model)
    lda_models.append(model)

# Display topics for each trained LDA model
for idx, lda_model in enumerate(lda_models):
    print(f"\nTop Terms for Model with {n_topics_list[idx]} Topics:\n")
    display_topics(lda_model, vectorizer.get_feature_names_out())


## Coherence score calculation

In [None]:
# Compute topic coherence using gensim
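def compute_coherence_score(lda_model, texts, vectorizer):
    feature_names = vectorizer.get_feature_names_out()
    # Top 10 terms per topic, highest weight first
    lda_topics = [[feature_names[i] for i in topic.argsort()[:-10 - 1:-1]] for topic in lda_model.components_]
    # Tokenize once and reuse; note that bigram features from the vectorizer
    # do not appear as tokens in this unigram dictionary
    tokenized_texts = [text.split() for text in texts]
    dictionary = Dictionary(tokenized_texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_texts]
    # u_mass coherence is estimated from document co-occurrence counts in the corpus
    coherence_model = CoherenceModel(topics=lda_topics, corpus=corpus,
                                     dictionary=dictionary, coherence='u_mass')
    return coherence_model.get_coherence()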

# Evaluate coherence scores
coherence_scores = []
for lda_model in lda_models:
    score = # TODO: call compute_coherence_score for each model
    coherence_scores.append(score)
    
# Plot coherence scores
plt.figure(figsize=(10, 6))
plt.plot(n_topics_list, coherence_scores, marker='o')
plt.title('Topic Coherence Scores vs. Number of Topics')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.grid()
plt.show()
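For intuition about what the `u_mass` score measures, the sketch below computes a UMass-style coherence by hand on a toy corpus, using the commonly cited definition: the sum over ranked word pairs of log((D(w_m, w_l) + 1) / D(w_l)), where D counts the documents that contain the given words. The function name and data are invented for this example. gensim's implementation may differ in smoothing and aggregation, so the values will not match its output exactly.

```python
import math

def toy_umass(topic_words, tokenized_docs):
    """UMass-style coherence for one topic; assumes every topic word occurs in some document."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            w_m, w_l = topic_words[m], topic_words[l]
            score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score

# Hypothetical tokenized documents
toy_corpus = [
    ["space", "nasa", "orbit"],
    ["space", "nasa"],
    ["hockey", "team"],
    ["hockey", "team", "space"],
]

print(toy_umass(["space", "nasa"], toy_corpus))    # words that often co-occur
print(toy_umass(["space", "hockey"], toy_corpus))  # words that rarely co-occur
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents, which is why higher (less negative) values on the plot suggest more coherent topics.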