## TASK 1 DATA EXPLORATION

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv("text_docs - text_docs.csv")

# Display the first 5 rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Compute basic statistics to understand the dataset structure:
# the total number of rows (shape[0]) and the number of unique document IDs (nunique())
total_rows = df.shape[0]
unique_docs = df['document_id'].nunique()

print(f"\nTotal number of rows: {total_rows}")
print(f"Unique documents: {unique_docs}")


First 5 rows of the dataset:
   document_id                                               text
0            1  The stock market has been experiencing volatil...
1            2  The economy is growing, and businesses are opt...
2            3  Climate change is a critical issue that needs ...
3            4  Advances in artificial intelligence have revol...
4            5  The rise of electric vehicles is shaping the f...

Total number of rows: 10
Unique documents: 10


## TASK 2 GENERATE TOPICS USING LDA

In [3]:
# STEP 1 PREPROCESS TEXT DATA

In [5]:
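# Import necessary libraries for text preprocessing
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch the tokenizer and stopword resources if they are not already installed
# (newer NLTK releases look for 'punkt_tab' rather than 'punkt')
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocess the text
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join(char for char in text if char not in string.punctuation)
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)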

# Apply preprocessing to the text data
df['processed_text'] = df['text'].apply(preprocess_text)

In [8]:
# STEP 2 CREATE DTM

from sklearn.feature_extraction.text import CountVectorizer

# Create a document-term matrix (DTM)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(df['processed_text'])

# Check the shape of the DTM
print(f"\nShape of the Document-Term Matrix: {dtm.shape}")



Shape of the Document-Term Matrix: (10, 62)


In [10]:
# STEP 3 APPLY LDA MODEL

from sklearn.decomposition import LatentDirichletAllocation

# Apply Latent Dirichlet Allocation (LDA)
lda = LatentDirichletAllocation(n_components=5, random_state=42)  # 5 topics
lda.fit(dtm)

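# Display the top words for each topic
n_top_words = 5
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # argsort is ascending, so take the last n_top_words indices in reverse order
    top_indices = topic.argsort()[:-n_top_words - 1:-1]
    print(f"\nTopic #{topic_idx+1}:")
    print(" ".join(feature_names[i] for i in top_indices))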



Topic #1:
introduction treatments healthcare technologies evolving

Topic #2:
platforms digital change climate critical

Topic #3:
industry future due experiencing stock

Topic #4:
renewable world energy projects investing

Topic #5:
worldwide revolutionized artificial industries intelligence


LDA Model: The model is trained with LatentDirichletAllocation from sklearn.decomposition. Setting n_components=5 extracts 5 topics, and random_state=42 makes the result reproducible across runs.
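
To make the role of n_components concrete, here is a minimal sketch on a hypothetical four-document toy corpus (toy_docs, toy_lda, and doc_topics are illustrative names, and 2 topics are assumed only to keep the example small). It shows the two arrays a fitted model yields: one row of word weights per topic, and one topic distribution per document.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the documents from the CSV file
toy_docs = [
    "stock market volatility investors",
    "renewable energy projects investing",
    "climate change critical issue",
    "energy market growth",
]

toy_dtm = CountVectorizer().fit_transform(toy_docs)

# n_components sets the number of topics; random_state fixes the initialization
toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(toy_dtm)

# components_ holds one row of word weights per topic
print(toy_lda.components_.shape)  # (number of topics, vocabulary size)
# Each document gets a probability distribution over the topics
print(doc_topics.shape)           # (number of documents, number of topics)
print(doc_topics.sum(axis=1))     # every row sums to 1
```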

Top Words per Topic: For each topic, we display the 5 words with the highest weights in lda.components_. These words characterize the topic. With only 10 documents a topic can mix themes, as Topic #2 does (digital platforms and climate change).
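
The top-word selection can be sketched on its own, independent of any fitted model. The helper below (top_words is a hypothetical name, and the weights are made-up numbers rather than values from the model above) sorts one topic's weights and returns the highest-weighted terms:

```python
import numpy as np

def top_words(topic_weights, feature_names, n_top_words=5):
    """Return the n_top_words terms with the largest weights, highest first."""
    # argsort is ascending, so reverse it before taking the first n entries
    order = np.argsort(topic_weights)[::-1][:n_top_words]
    return [feature_names[i] for i in order]

# Made-up weights for a four-term vocabulary
weights = np.array([0.2, 3.1, 0.9, 2.4])
names = ["climate", "energy", "market", "renewable"]

print(top_words(weights, names, n_top_words=2))  # ['energy', 'renewable']
```

The slice `[::-1][:n]` selects the same indices as the `[:-n - 1:-1]` form used in the notebook; it is just a more readable way to write the same selection.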