<a href="https://colab.research.google.com/github/epythonlab/PythonLab/blob/master/Topic_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling and Its Practical Uses

## Introduction

Hello everyone, and welcome to today's tutorial on topic modeling and its practical uses! I'm excited to have you join me as we explore this fascinating topic together.

Have you ever wondered how computers can understand and extract meaning from large volumes of text data? That's where topic modeling comes in! Today, we'll dive into the world of natural language processing and uncover the hidden thematic structures within documents.

## What is Topic Modeling?

Let's start by defining topic modeling. Imagine you have a collection of documents: articles, emails, or social media posts. Topic modeling is an unsupervised technique that groups words that tend to occur together into topics, so you can uncover the underlying themes in these documents without labeling them by hand.

## Understanding Latent Dirichlet Allocation (LDA)

One of the most popular algorithms for topic modeling is Latent Dirichlet Allocation, or LDA for short. LDA assumes that each document is a mixture of topics, and each topic is a distribution over words. It's like solving a puzzle to find the hidden themes!
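To make that assumption concrete, here is a minimal sketch of the generative story LDA assumes, written with only Python's standard library. The two toy topics, their word probabilities, and the `alpha` value are made up for illustration; nothing here is learned from data.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample by normalizing gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_words, alpha, rng):
    """Generate one document following the process LDA assumes."""
    # 1. Pick this document's mixture over topics.
    topic_mixture = sample_dirichlet(alpha, len(topics), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        topic = rng.choices(topics, weights=topic_mixture)[0]
        # 3. Pick a word according to that topic's word distribution.
        vocab, probs = zip(*topic.items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return topic_mixture, words


# Hypothetical topics: each one is a probability distribution over words.
toy_topics = [
    {"software": 0.5, "data": 0.3, "network": 0.2},
    {"exercise": 0.4, "sleep": 0.4, "nutrition": 0.2},
]

rng = random.Random(0)
mixture, words = generate_document(toy_topics, n_words=8, alpha=0.5, rng=rng)
print("Topic mixture:", [round(p, 2) for p in mixture])
print("Generated words:", words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions.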

## Implementing Topic Modeling with Python

Now, let's roll up our sleeves and dive into some Python coding! I'll use `nltk` for text preprocessing and `scikit-learn` for the LDA model, working through topic modeling step by step. Whether you're using sample data or your own dataset, the process remains the same.

## Generating Sample Data (Optional)

But before you dive into coding, let's have some fun generating sample data! Don't worry—if you already have your own dataset, you can skip this step and use your data instead. Let's explore the power of topic modeling with your very own simulated documents!

In [10]:
import os

def generate_sample_documents(doc_dir):
    # Create the directory if it doesn't exist
    os.makedirs(doc_dir, exist_ok=True)

    # Sample topics and corresponding text
    topics = ["Technology", "Health", "Finance"]
    topic_texts = [
        "In today's rapidly advancing world of technology, staying updated with the latest innovations is crucial for success. From artificial intelligence to blockchain technology, there are endless possibilities for businesses to leverage technology to gain a competitive edge.",
        "Maintaining good health is essential for a happy and fulfilling life. Regular exercise, balanced nutrition, and proper sleep are key factors in preventing illness and promoting overall well-being. Remember to prioritize your health and make healthy choices every day.",
        "Financial planning is an important aspect of securing your future. Whether it's saving for retirement, investing in the stock market, or managing debt, having a solid financial strategy can help you achieve your long-term goals. Make informed decisions and seek professional advice when necessary."
    ]

    # Generate and save sample documents
    for i, topic in enumerate(topics):
        filename = os.path.join(doc_dir, f"Document_{i}.txt")
        with open(filename, "w", encoding="utf-8") as file:
            file.write(topic_texts[i])

# Call the function to generate sample documents
generate_sample_documents("sample_documents")


## Preprocessing Text Data

Before diving into topic modeling, you need to prepare your text data. This includes tasks like **tokenization**, **removing stopwords**, and **stemming** or **lemmatization**. But fear not! I'll walk you through each step.

In [20]:
import os
import nltk # import nltk for preprocessing text data
from nltk.tokenize import word_tokenize  # Importing word_tokenize for tokenization
from nltk.corpus import stopwords  # Importing stopwords from NLTK
from nltk.stem import WordNetLemmatizer  # Importing WordNetLemmatizer for lemmatization

# Download NLTK resources (if not already downloaded)
nltk.download('punkt')  # Downloading the 'punkt' tokenizer
# Note: recent NLTK releases (3.9 and later) may require nltk.download('punkt_tab') instead
nltk.download('stopwords')  # Downloading the stopwords corpus
nltk.download('wordnet')  # Downloading the WordNet corpus for lemmatization

def preprocess_text(text):
    """
    Preprocesses a text by tokenizing, removing stopwords, and lemmatizing the tokens.

    Args:
        text (str): Input text to be preprocessed.

    Returns:
        str: Preprocessed text.
    """
    # Tokenize text
    tokens = word_tokenize(text)  # Tokenizing the input text

    # Remove stopwords
    stop_words = set(stopwords.words('english'))  # Getting English stopwords
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]  # Removing stopwords

    # Lemmatize tokens
    lemmatizer = WordNetLemmatizer()  # Initializing WordNetLemmatizer
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatizing tokens

    # Join tokens back into a single string
    preprocessed_text = ' '.join(lemmatized_tokens)  # Joining tokens into a single string

    return preprocessed_text

def preprocess_documents(doc_dir):
    """
    Preprocesses all documents in a directory.

    Args:
        doc_dir (str): Path to the directory containing documents.

    Returns:
        list: List of preprocessed documents.
    """
    preprocessed_documents = []
    for filename in os.listdir(doc_dir):
        if filename.endswith('.txt'):
            file_path = os.path.join(doc_dir, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                text = file.read()
                preprocessed_text = preprocess_text(text)
                preprocessed_documents.append(preprocessed_text)
    return preprocessed_documents

# Call the function to preprocess sample documents
preprocessed_docs = preprocess_documents("sample_documents")

# Print preprocessed documents
for idx, doc in enumerate(preprocessed_docs):
    print(f"Preprocessed Document {idx+1}:\n{doc}\n")


Preprocessed Document 1:
Financial planning important aspect securing future . Whether 's saving retirement , investing stock market , managing debt , solid financial strategy help achieve long-term goal . Make informed decision seek professional advice necessary .

Preprocessed Document 2:
today 's rapidly advancing world technology , staying updated latest innovation crucial success . artificial intelligence blockchain technology , endless possibility business leverage technology gain competitive edge .

Preprocessed Document 3:
Maintaining good health essential happy fulfilling life . Regular exercise , balanced nutrition , proper sleep key factor preventing illness promoting overall well-being . Remember prioritize health make healthy choice every day .



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
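The printed documents still contain punctuation tokens such as `.` and `,`, the clitic `'s`, and the original capitalization. One possible refinement, sketched here as a hypothetical helper named `clean_tokens` (it is not part of the notebook), is to lowercase each token and keep only tokens that start with a letter. scikit-learn's text vectorizers lowercase and drop punctuation by default, so this step mainly makes the intermediate text easier to read.

```python
def clean_tokens(tokens):
    """Lowercase tokens and keep only those that start with a letter."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        # Keeps words like "long-term"; drops ".", ",", and the clitic "'s".
        if lowered and lowered[0].isalpha():
            cleaned.append(lowered)
    return cleaned


# Tokens taken from the start of the finance document printed above
sample_tokens = ["Financial", "planning", "important", "aspect", "securing",
                 "future", ".", "Whether", "'s", "saving"]
print(clean_tokens(sample_tokens))
```

If you adopt something like this, it would be applied to the token list inside `preprocess_text`, before the tokens are joined back into a string.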


## Training the LDA Model

Next, I'll train the **LDA** model on the preprocessed text data. This involves setting parameters such as the **number of topics** (`n_components` in scikit-learn) and **the number of iterations** (`max_iter`, which the code below leaves at its default). Once trained, the model will reveal the underlying topics present in your documents.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd


def perform_topic_modeling(num_topics, documents_path, output_csv):
    """
    Performs topic modeling using LDA and generates a document-topic matrix CSV.

    Args:
        num_topics: Number of topics to model.
        documents_path: Path to a directory containing documents.
        output_csv: Path to the output CSV file.
    """
    # Preprocess documents
    documents = preprocess_documents(documents_path)

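    # Vectorize documents using TF-IDF (LDA is defined over word counts, so
    # CountVectorizer is the more conventional input).
    # min_df=1 keeps every term; the sample documents share almost no words,
    # so min_df=2 would discard nearly the whole vocabulary.
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=1, stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(documents)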

    # Train LDA model
    lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=1)
    lda_model.fit(tfidf_matrix)

    # Get document topic matrix
    document_topic_matrix = lda_model.transform(tfidf_matrix)

    # Create pandas dataframe
    df = pd.DataFrame(document_topic_matrix, columns=[f"Topic_{i}" for i in range(num_topics)])
    df.index = pd.Index(range(len(documents)))

    # Save to CSV
    df.to_csv(output_csv, index=True)

    print(f"Document-topic matrix saved to: {output_csv}")

# Example usage
num_topics = 3  # The sample corpus covers three themes; change this for your own data
documents_path = "sample_documents"  # Replace with your documents directory
output_csv = "document_topics.csv"

perform_topic_modeling(num_topics, documents_path, output_csv)

Document-topic matrix saved to: document_topics.csv



The code writes a CSV file containing the document-topic matrix produced by running Latent Dirichlet Allocation (LDA) on the preprocessed documents, then prints a message confirming the path given by the `output_csv` parameter.

Each row of the matrix corresponds to a document and each column to a topic. The values are the estimated proportion of each topic in that document, so each row sums to approximately 1. Rows follow the order in which `os.listdir` returned the files, which is not guaranteed to be alphabetical (the preprocessing output above lists the finance document first), so keep the filenames alongside the rows if you need to map results back to documents.
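As a rough illustration of how to read such a matrix, the sketch below parses a small CSV with the same layout (an unnamed index column followed by `Topic_0`, `Topic_1`, and so on) and reports the highest-weighted topic in each row. The numbers are invented for the example and are not the output of the model above; `dominant_topics` is an illustrative helper name.

```python
import csv
import io

# Hypothetical document-topic matrix in the same layout as document_topics.csv
csv_text = """,Topic_0,Topic_1,Topic_2
0,0.80,0.15,0.05
1,0.10,0.20,0.70
2,0.25,0.60,0.15
"""


def dominant_topics(csv_file):
    """Return {document index: (topic name, weight)} for the largest weight per row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    topic_names = header[1:]  # the first column holds the document index
    result = {}
    for row in reader:
        doc_id = int(row[0])
        weights = [float(value) for value in row[1:]]
        best = max(range(len(weights)), key=weights.__getitem__)
        result[doc_id] = (topic_names[best], weights[best])
    return result


dominant = dominant_topics(io.StringIO(csv_text))
for doc_id, (topic, weight) in dominant.items():
    print(f"Document {doc_id}: {topic} ({weight:.2f})")
```

To run it on the real file, pass `open("document_topics.csv", newline="")` in place of the `io.StringIO` object.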

## Recommendations

Now that I've successfully generated the document-topic matrix, let's dive into analyzing the results!

The document-topic matrix provides you with a wealth of information about the underlying themes present in your documents. Let's explore some next steps you can take to gain deeper insights.

1. **Topic Interpretation:** You'll begin by examining the topics identified by the model. Analyzing the top words associated with each topic will help you understand the main themes present in your documents.

2. **Document Classification:** Next, you'll explore how to classify or categorize your documents based on their dominant topics. This step will help you organize and label your documents for further analysis.

3. **Visualization:** Visualizing the document-topic matrix can provide you with a clearer understanding of the relationships between documents and topics. You'll explore techniques such as **heatmaps** or multidimensional scaling to bring your data to life.

4. **Evaluation:** It's important to assess the quality of your topic modeling results. You'll delve into various evaluation metrics, including **coherence scores and human judgment**, to determine how effective your model is at capturing the underlying themes.
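For the first item, topic interpretation usually means listing the highest-weighted words for each topic. In scikit-learn, the topic-word weights are stored in the fitted model's `components_` attribute and the vocabulary comes from the vectorizer's `get_feature_names_out()`. The sketch below shows only the ranking logic on a small hand-written matrix; the vocabulary and weights are hypothetical stand-ins, and `top_words` is an illustrative helper name.

```python
def top_words(topic_word_weights, vocabulary, n=2):
    """Return the n highest-weighted words for each topic, best first."""
    ranked = []
    for weights in topic_word_weights:
        # Sort vocabulary positions by descending weight for this topic.
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        ranked.append([vocabulary[i] for i in order[:n]])
    return ranked


# Hypothetical vocabulary and topic-word weights (rows = topics, columns = words)
vocabulary = ["technology", "health", "financial", "innovation", "exercise", "debt"]
topic_word_weights = [
    [2.1, 0.1, 0.2, 1.7, 0.15, 0.3],
    [0.2, 2.4, 0.1, 0.15, 1.9, 0.3],
    [0.1, 0.2, 2.2, 0.3, 0.15, 1.8],
]

for topic_id, words in enumerate(top_words(topic_word_weights, vocabulary)):
    print(f"Topic_{topic_id}: {', '.join(words)}")
```

With the notebook's objects, the call would look like `top_words(lda_model.components_, vectorizer.get_feature_names_out())`. Note that `perform_topic_modeling` keeps the model and vectorizer as local variables, so it would need to return them first.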

## Practical Applications of Topic Modeling

Topic modeling isn't just a theoretical concept—it has real-world applications across various domains. Whether it's recommending relevant content, analyzing customer feedback, or tracking trends on social media, topic modeling can unlock valuable insights for your projects.
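As one example of the content-recommendation use case, documents can be compared by how similar their topic distributions are. The sketch below ranks documents by cosine similarity of hand-written topic vectors. The titles and numbers are hypothetical, `most_similar` is an illustrative helper name, and cosine similarity is just one reasonable choice of measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_title, doc_topics):
    """Return the other documents ranked by topic similarity to the query."""
    query_vec = doc_topics[query_title]
    scores = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in doc_topics.items()
        if title != query_title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic distributions (columns: technology, health, finance)
doc_topics = {
    "AI in hospitals": [0.6, 0.4, 0.0],
    "Sleep and exercise": [0.1, 0.9, 0.0],
    "Fintech startups": [0.7, 0.0, 0.3],
    "Retirement planning": [0.0, 0.1, 0.9],
}

ranking = most_similar("AI in hospitals", doc_topics)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

In practice, each vector would be a row of the document-topic matrix, keyed by the filename or title of the document it came from.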

## Conclusion

As I conclude today's tutorial, I hope you've gained a deeper understanding of topic modeling and its practical uses. Whether you're a beginner or an experienced data scientist, topic modeling opens up a world of possibilities for extracting insights from text data.

## Closing Remark

Thank you all for joining me today! Whether you're using sample data or your own dataset, I encourage you to explore topic modeling further and apply it to your projects. If you enjoyed this tutorial, don't forget to like, comment, and subscribe for more exciting content on data science and machine learning. Until next time, happy learning!