<a href="https://colab.research.google.com/github/ayush-a-r/Basic-NLP/blob/main/Untitled3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install gensim
import gensim                               # Gensim: a Python library for topic modeling, document similarity, and vector space modeling. It provides the LDA model used below.
from gensim import corpora                  # The corpora module provides corpus utilities, including Dictionary, which maps each word to a unique integer ID and builds bag-of-words vectors.
from gensim.models import LdaModel          # LdaModel trains a Latent Dirichlet Allocation (LDA) model on a corpus to discover topics in a collection of documents.
import nltk                                 # The top-level NLTK package is needed for nltk.download(), which fetches the data files used below.
from nltk.corpus import stopwords           # NLTK's stopwords corpus: lists of common words (e.g., "the", "is", "and") that are usually removed during preprocessing.
from nltk.tokenize import word_tokenize     # word_tokenize splits a string into individual word and punctuation tokens.

# Download the NLTK data files (only needed once per environment)
nltk.download('punkt_tab')   # Punkt tokenizer tables, which recent NLTK releases need for word_tokenize.
nltk.download('punkt')       # Older Punkt tokenizer models, kept for compatibility with earlier NLTK releases.
nltk.download('stopwords')   # Stopword lists, including the English list used in preprocess().

# Sample documents
documents = [
    "Artificial intelligence is transforming the technology industry.",
    "Machine learning and AI are shaping the future of automation.",
    "Deep learning algorithms are a subset of machine learning.",
    "Quantum computing will revolutionize industries like AI.",
    "Healthcare is benefiting from AI and machine learning advances.",
]

# The sample documents above are single sentences about artificial intelligence (AI) and machine learning (ML). They form the corpus for topic modeling.

# Preprocess the documents
def preprocess(doc):                                    # Prepares one document (sentence) for modeling: lowercases it, tokenizes it, and drops stopwords and non-alphabetic tokens.
    stop_words = set(stopwords.words('english'))        # Loads NLTK's English stopwords (e.g., "and", "the", "is") into a set for fast membership tests. These words carry little topical meaning.
    tokens = word_tokenize(doc.lower())                 # Lowercases the document so that "AI" and "ai" are treated as the same word, then splits it into tokens.
    return [word for word in tokens if word.isalpha() and word not in stop_words]    # Keeps only purely alphabetic tokens that are not stopwords, which removes punctuation and numbers.

processed_docs = [preprocess(doc) for doc in documents]               # Applies preprocess() to every document, giving one list of lowercased, stopword-free tokens per document.

# Create a dictionary and document-term matrix
dictionary = corpora.Dictionary(processed_docs)                          # Builds a dictionary that maps each unique token to an integer ID. The bag-of-words conversion on the next line depends on it.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in processed_docs]    # Converts each document to a sparse bag-of-words vector: a list of (word ID, count) tuples. Together these vectors form the document-term matrix.

# Train the LDA model (specifying 2 topics)
lda_model = LdaModel(doc_term_matrix, num_topics=2, id2word=dictionary, passes=15)   # num_topics=2: the model fits 2 topics. id2word=dictionary: maps word IDs back to words so the topics are readable. passes=15: the number of full passes over the corpus during training; more passes can improve the fit on a small corpus but take more time. No random_state is set, so the topics can differ from run to run.

# Print the topics with associated words
print("Topics discovered by LDA:")
topics = lda_model.print_topics(num_words=5)    # Returns a list of (topic ID, string) pairs. Each string lists the 5 highest-weighted words of that topic with their weights. Despite its name, print_topics() returns the topics (and logs them) rather than printing them, hence the loop below.
for topic in topics:
    print(topic)        # Prints each (topic ID, weighted words) pair on its own line.

# Document similarity (cosine similarity of bag-of-words vectors)
doc1_bow = dictionary.doc2bow(preprocess("AI and machine learning are advancing rapidly"))   # Preprocesses a new document and converts it to a bag-of-words vector with the existing dictionary. Words that are not in the dictionary (here "advancing" and "rapidly") are silently dropped.
doc2_bow = dictionary.doc2bow(preprocess("Healthcare is benefiting from AI advances"))       # Does the same for the second document. All of its remaining words already appear in the dictionary.

similarity = gensim.matutils.cossim(doc1_bow, doc2_bow)   # Computes the cosine similarity of the two sparse vectors: the cosine of the angle between them. For count vectors it ranges from 0 (no shared words) to 1 (same direction). This compares raw word counts and does not use the LDA model.
print("\nDocument Similarity (cosine):", similarity)   # Prints the cosine similarity of the two documents.


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   27.9/27.9 MB 66.5 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...


Topics discovered by LDA:
(0, '0.155*"learning" + 0.121*"machine" + 0.086*"ai" + 0.052*"subset" + 0.052*"algorithms"')
(1, '0.068*"artificial" + 0.068*"industry" + 0.068*"transforming" + 0.068*"technology" + 0.068*"intelligence"')
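
Each topic printed above is a string of `weight*"word"` terms joined by ` + `. As a minimal sketch of how to read that format, the block below parses the two strings from this run into `(word, weight)` pairs and sums the weights shown. The helper `parse_topic` and the `TERM` pattern are hypothetical names introduced for illustration; they are not part of Gensim.

```python
import re

# Topic strings copied from the output of this run.
topics = [
    (0, '0.155*"learning" + 0.121*"machine" + 0.086*"ai" + 0.052*"subset" + 0.052*"algorithms"'),
    (1, '0.068*"artificial" + 0.068*"industry" + 0.068*"transforming" + 0.068*"technology" + 0.068*"intelligence"'),
]

# Each term looks like 0.155*"learning": a weight, an asterisk, a quoted word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Hypothetical helper: return [(word, weight), ...] for one topic string."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

for topic_id, topic_string in topics:
    terms = parse_topic(topic_string)
    # Only the top 5 words are listed, so the weights shown sum to less than 1.
    shown_mass = sum(weight for _, weight in terms)
    print(topic_id, terms, round(shown_mass, 3))
```

The five identical weights in topic 1 suggest that the model barely separates those words, which is plausible for a corpus of only five sentences. When the fitted model is still in memory, `lda_model.show_topics(formatted=False)` should return the word and weight pairs directly, so string parsing is only needed for saved output like the above.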

Document Similarity (cosine): 0.2886751345948129
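
The similarity value above can be reproduced by hand. The sketch below is plain Python and does not call Gensim; it assumes that, after preprocessing and dictionary lookup, the first document keeps only "ai", "machine", and "learning" (the unseen words "advancing" and "rapidly" are dropped) and the second keeps "healthcare", "benefiting", "ai", and "advances". The function name `cosine` is introduced here for illustration.

```python
import math

# Bag-of-words vectors written as {word: count}.
# Assumption: words missing from the training dictionary were dropped by doc2bow.
doc1 = {"ai": 1, "machine": 1, "learning": 1}
doc2 = {"healthcare": 1, "benefiting": 1, "ai": 1, "advances": 1}

def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as dicts."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty vector has no direction, so treat it as dissimilar.
    return dot / (norm_a * norm_b)

# The documents share one word ("ai"), so this is 1 / (sqrt(3) * sqrt(4)).
print(cosine(doc1, doc2))
```

The two documents share a single word, so the score is 1 / (√3 · √4) ≈ 0.2887, which matches the value printed by the notebook.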


[nltk_data]   Unzipping corpora/stopwords.zip.
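
The dictionary and bag-of-words steps in the code cell can be pictured with a small stand-in written in plain Python. This is a sketch of the idea only, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the token lists are shortened for illustration, and the IDs Gensim assigns may be ordered differently.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Give each new token the next free integer ID, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token ID, count) pairs, skipping tokens the vocab has never seen."""
    counts = Counter(vocab[token] for token in tokens if token in vocab)
    return sorted(counts.items())

# Shortened, already-preprocessed token lists (illustrative, not the full corpus).
tokenized = [
    ["machine", "learning", "ai"],
    ["deep", "learning", "algorithms", "subset", "machine", "learning"],
]

vocab = build_vocab(tokenized)
print(vocab)

# "learning" occurs twice in the second document, so its count is 2.
print(to_bow(tokenized[1], vocab))

# Unseen tokens ("advancing", "rapidly") are dropped, leaving only "ai".
print(to_bow(["ai", "advancing", "rapidly"], vocab))
```

Dropping unseen tokens is why a new document can end up with a shorter vector than its word count suggests, as happens with the first similarity example in the notebook.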
