Application of LDA and LSA:

Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) are both statistical techniques used for topic modeling in natural language processing (NLP).

LDA is a generative probabilistic model that assumes each document is a mixture of topics and each topic is a probability distribution over words. It is particularly effective for discovering hidden thematic structure in large text collections, which makes it useful for applications such as document classification, recommendation systems, and content summarization. Because each topic can be read as a ranked list of words, LDA topics are usually easy to interpret.
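
To make the generative assumption concrete, the following minimal sketch samples one synthetic document the way LDA assumes documents arise. The toy vocabulary, the number of topics, the Dirichlet parameters, and the helper name `sample_document` are illustrative assumptions and are not part of the analysis in this notebook.

```python
import numpy as np

def sample_document(topic_word, alpha, doc_length, rng):
    """Sample one document from the LDA generative process.

    topic_word: array of shape (num_topics, vocab_size); each row is a
                probability distribution over the vocabulary.
    alpha:      Dirichlet concentration for the document-topic mixture.
    """
    num_topics, vocab_size = topic_word.shape
    # 1. Draw this document's topic mixture from a Dirichlet prior
    theta = rng.dirichlet(np.full(num_topics, alpha))
    # 2. For each word position, draw a topic, then a word from that topic
    topic_ids = rng.choice(num_topics, size=doc_length, p=theta)
    word_ids = np.array([rng.choice(vocab_size, p=topic_word[z]) for z in topic_ids])
    return theta, topic_ids, word_ids

# Hypothetical toy setup: 2 topics over a 6-word vocabulary
rng = np.random.default_rng(0)
vocab = ["ship", "sea", "captain", "court", "judge", "law"]
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=2)  # one row per topic

theta, topic_ids, word_ids = sample_document(topic_word, alpha=0.5, doc_length=8, rng=rng)
print("Topic mixture:", np.round(theta, 3))
print("Document:", " ".join(vocab[w] for w in word_ids))
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.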

LSA, on the other hand, is based on singular value decomposition (SVD) and reduces the dimensionality of the term-document matrix. It identifies patterns in the relationships between terms and documents, capturing the underlying semantic structure. LSA is commonly used for information retrieval, document clustering, and improving search results by understanding the context of terms.
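
As a rough illustration of the dimensionality reduction described above, the sketch below applies a truncated SVD to a small term-document count matrix using NumPy only. The matrix, the term list, and the choice of `k = 2` are made-up assumptions for illustration. The notebook code further down works on the transposed, document-term orientation instead.

```python
import numpy as np

# Hypothetical toy term-document count matrix (rows: terms, columns: documents)
terms = ["ship", "sea", "captain", "court", "judge"]
X = np.array([
    [2, 1, 0, 0],   # ship
    [1, 2, 0, 0],   # sea
    [1, 1, 0, 0],   # captain
    [0, 0, 2, 1],   # court
    [0, 0, 1, 2],   # judge
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, with singular values in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values (the latent dimensions)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Each document is now described by k coordinates instead of len(terms) counts
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T

print("Singular values:", np.round(s, 3))
print("Document coordinates (k=2):")
print(np.round(doc_coords, 3))
print("Rank-2 reconstruction error:", round(float(np.linalg.norm(X - X_k)), 3))
```

The reconstruction error equals the energy in the discarded singular values, so inspecting `s` is a simple way to judge how many dimensions are worth keeping.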

In [None]:
!pip install gensim
!pip install nltk

In [None]:
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora
from gensim.models import LdaModel
import matplotlib.pyplot as plt

# Function to perform LDA
def perform_lda(documents, num_topics=5):
    """
    Perform Latent Dirichlet Allocation (LDA) on the provided documents.
    
    Parameters:
    - documents: List of lemmatized documents.
    - num_topics: Number of topics to extract.

    Returns:
    - lda_model: Trained LDA model.
    - corpus: Corpus for LDA.
    - dictionary: Dictionary for LDA.
    """
    # Prepare the documents for LDA
    texts = [doc.split() for doc in documents]
    
    # Create a dictionary and corpus
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    
    # Train the LDA model (fixed random_state so the topics are reproducible across runs)
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15, random_state=42)
    
    return lda_model, corpus, dictionary

# Function to perform LSA
def perform_lsa(documents, num_topics=5):
    """
    Perform Latent Semantic Analysis (LSA) on the provided documents.

    Parameters:
    - documents: List of lemmatized documents.
    - num_topics: Number of topics to extract.

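    Returns:
    - lsa_model: Document-topic matrix of shape (n_documents, num_topics).
    - svd: Fitted TruncatedSVD object.
    - vectorizer: Fitted CountVectorizer, needed to map topic components back to terms.
    """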
    # Convert documents to a document-term matrix
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(documents)

    # Perform SVD (the default randomized solver is seeded for reproducible components)
    svd = TruncatedSVD(n_components=num_topics, random_state=42)
    lsa_model = svd.fit_transform(X)
    
    return lsa_model, svd, vectorizer

# Example usage
def analyze_topics(global_attributes_dict):
    """
    Analyze topics using LDA and LSA and display results.

    Parameters:
    - global_attributes_dict: Global attributes dictionary for analysis.
    """
    # Prepare a list of lemmatized chapter texts for analysis
    documents = [' '.join(attrs) for attrs in global_attributes_dict.values() if attrs]

    # Perform LDA
    lda_model, corpus, dictionary = perform_lda(documents, num_topics=5)
    
    # Display the topics found by LDA
    print("LDA Topics:")
    for idx, topic in lda_model.print_topics(-1):
        print(f"Topic {idx + 1}: {topic}")

    # Perform LSA
    lsa_model, svd, vectorizer = perform_lsa(documents, num_topics=5)

    # Display the topics found by LSA (three highest-weighted terms, largest first)
    print("\nLSA Topics:")
    terms = vectorizer.get_feature_names_out()
    for i, topic in enumerate(svd.components_):
        top_indices = topic.argsort()[::-1][:3]
        print(f"Topic {i + 1}: ", end="")
        print(" + ".join([f"{terms[j]} * {topic[j]:.4f}" for j in top_indices]))

# Call the analyze_topics function with the global attributes dictionary
analyze_topics(global_attributes_dict)


How to Compare LDA and LSA:
1) Calculate Coherence Scores: We'll compute coherence scores for both LDA and LSA topics.
2) Visualize Results: Finally, we'll visualize the comparison.
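
One way to carry out the coherence step is sketched below: a simplified, standard-library-only version of UMass-style coherence. It scores a topic by how often its top terms occur in the same documents. The function name `umass_coherence`, the add-one smoothing, and the toy corpus are assumptions for illustration. gensim's `CoherenceModel` provides fuller implementations (for example `u_mass` and `c_v`).

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """Average UMass-style coherence of one topic.

    topic_words: top terms of the topic, ordered from highest to lowest weight.
    documents:   list of tokenized documents (each a list of tokens).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # combinations() keeps input order, so `earlier` always outranks `later`
    for earlier, later in combinations(topic_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # skip terms that never occur in the reference documents
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

# Hypothetical toy corpus and topics, for illustration only
docs = [
    "ship sea captain storm".split(),
    "sea captain ship crew".split(),
    "court judge law trial".split(),
    "judge law court jury".split(),
]
coherent_topic = ["ship", "sea", "captain"]
mixed_topic = ["ship", "judge", "sea"]

print("Coherent topic:", round(umass_coherence(coherent_topic, docs), 4))
print("Mixed topic:   ", round(umass_coherence(mixed_topic, docs), 4))
```

Higher scores indicate terms that tend to appear together. To compare the two models, the same function could be applied to the top terms of each LDA topic (from `lda_model.show_topic`) and of each LSA component (from sorting `svd.components_`), using documents tokenized the same way for both.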

In [None]:
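# Function to compare LDA and LSA topic models
def compare_lda_lsa(global_attributes_dict, num_topics=5, top_n=10):
    documents = [' '.join(attrs) for attrs in global_attributes_dict.values() if attrs]

    # Fit both models with the helper functions defined earlier
    lda_model, corpus, dictionary = perform_lda(documents, num_topics=num_topics)
    lsa_model, svd, vectorizer = perform_lsa(documents, num_topics=num_topics)

    # Assumption: each topic is represented as a space-separated string of its
    # top_n highest-weighted terms (top_n is an illustrative parameter)
    lda_topics = [' '.join(word for word, _ in lda_model.show_topic(i, topn=top_n))
                  for i in range(num_topics)]
    terms = vectorizer.get_feature_names_out()
    lsa_topics = [' '.join(terms[j] for j in component.argsort()[::-1][:top_n])
                  for component in svd.components_]

    # Visualization: Plot LDA and LSA side by side for comparison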
    fig, ax = plt.subplots(1, 2, figsize=(12, 6))
    
    # Plot LDA topics
    ax[0].barh(range(len(lda_topics)), [len(topic.split()) for topic in lda_topics], color='skyblue')
    ax[0].set_title('LDA Topic Word Count')
    ax[0].set_yticks(range(len(lda_topics)))
    ax[0].set_yticklabels([f"Topic {i+1}" for i in range(len(lda_topics))])
    
    # Plot LSA topics
    ax[1].barh(range(len(lsa_topics)), [len(topic.split()) for topic in lsa_topics], color='lightcoral')
    ax[1].set_title('LSA Topic Word Count')
    ax[1].set_yticks(range(len(lsa_topics)))
    ax[1].set_yticklabels([f"Topic {i+1}" for i in range(len(lsa_topics))])
    
    plt.tight_layout()
    plt.show()

# Call the function with the global attributes dictionary
compare_lda_lsa(global_attributes_dict)


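LDA and LSA add practical value by giving two distinct views of the dataset's underlying structure. The bar charts above only count the terms listed for each topic, so the differences described below show up mainly in the printed topic terms rather than in the plot.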

LDA models each document as a probability distribution over topics and each topic as a probability distribution over words. Because all weights are non-negative and sum to one, each topic reads as a ranked list of words, which makes it easier to tell the themes in the documents apart. LDA therefore helps identify the dominant topics in specific document sections, although how coherent and well separated those topics are depends on the number of topics and on preprocessing.

LSA, on the other hand, uses matrix factorization (SVD) to find latent patterns and relationships between words that are not immediately apparent. This adds value by surfacing subtle semantic relationships across documents: terms that appear in similar contexts end up close together in the reduced space even if they rarely occur in the same document. LSA components can carry both positive and negative weights, so they are harder to read as topics than LDA's, but they reveal patterns of term association that a ranked word list does not show.
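
The point about terms that never share a document can be checked on a toy example. In the sketch below, "car" and "automobile" never co-occur, but both appear alongside "engine". The corpus, the matrix, and `k = 2` are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical documents: d1 "car engine", d2 "automobile engine",
# d3 "flower petal", d4 "flower petal"
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 1],   # flower
    [0, 0, 1, 1],   # petal
], dtype=float)

# Project each term onto the k strongest latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

car, auto, flower = (terms.index(t) for t in ("car", "automobile", "flower"))
print("Raw counts,   car vs automobile:", round(cosine(X[car], X[auto]), 3))
print("Latent space, car vs automobile:", round(cosine(term_vectors[car], term_vectors[auto]), 3))
print("Latent space, car vs flower:    ", round(cosine(term_vectors[car], term_vectors[flower]), 3))
```

In the raw count matrix the two terms look unrelated, while in the reduced space they point in nearly the same direction because they share the context term "engine".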

Both methods enrich the analysis in complementary ways: LDA offers clear, interpretable topics, while LSA uncovers broader patterns of term association, helping to capture a more comprehensive picture of the dataset's semantic structure.