# 9.

A) Implement an n-gram model.

B) Implement a Latent Dirichlet Allocation (LDA) model and Latent Semantic Analysis (LSA) for topic modelling using the text: 'the quick brown fox',
       'the slow brown dog',
       'the quick red dog',
       'the lazy yellow fox'



9A) Implement N-gram Model (Unigram, Bigram, Trigram)
An n-gram is a contiguous sequence of n words, and an n-gram model predicts the next word from the previous n-1 words. Here's how we can generate the unigrams, bigrams, and trigrams for the given text.

In [1]:
import nltk
from nltk import ngrams

# Download the tokenizer data used by nltk.word_tokenize
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
nltk.download('punkt_tab')
nltk.download('punkt')


# Sample text
documents = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# Tokenize and prepare data
tokens = [nltk.word_tokenize(doc.lower()) for doc in documents]

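# Flatten all tokens for n-grams.
# Note: this joins the four documents into one sequence, so some n-grams
# (e.g. ('fox', 'the')) span a document boundary.
flat_tokens = [word for sublist in tokens for word in sublist]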

# Generate and print n-grams
def generate_ngrams(tokens, n):
    return list(ngrams(tokens, n))

print("Unigrams:", generate_ngrams(flat_tokens, 1))
print("Bigrams:", generate_ngrams(flat_tokens, 2))
print("Trigrams:", generate_ngrams(flat_tokens, 3))

Unigrams: [('the',), ('quick',), ('brown',), ('fox',), ('the',), ('slow',), ('brown',), ('dog',), ('the',), ('quick',), ('red',), ('dog',), ('the',), ('lazy',), ('yellow',), ('fox',)]
Bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'the'), ('the', 'slow'), ('slow', 'brown'), ('brown', 'dog'), ('dog', 'the'), ('the', 'quick'), ('quick', 'red'), ('red', 'dog'), ('dog', 'the'), ('the', 'lazy'), ('lazy', 'yellow'), ('yellow', 'fox')]
Trigrams: [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'the'), ('fox', 'the', 'slow'), ('the', 'slow', 'brown'), ('slow', 'brown', 'dog'), ('brown', 'dog', 'the'), ('dog', 'the', 'quick'), ('the', 'quick', 'red'), ('quick', 'red', 'dog'), ('red', 'dog', 'the'), ('dog', 'the', 'lazy'), ('the', 'lazy', 'yellow'), ('lazy', 'yellow', 'fox')]


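The cell above only lists n-grams. Turning them into a model that predicts the next word takes one more step: counting how often each word follows another. Below is a minimal sketch for the bigram case (n = 2), assuming plain whitespace tokenization and counting within each document so that no pair spans two documents. The helper names next_word_probs and predict_next are illustrative and not part of NLTK.

```python
from collections import Counter, defaultdict

documents = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox',
]

# Count bigrams one document at a time so that no pair spans two documents.
bigram_counts = defaultdict(Counter)
for doc in documents:
    words = doc.lower().split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Maximum-likelihood estimate of P(next | prev) from the bigram counts."""
    counts = bigram_counts.get(prev, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


def predict_next(prev):
    """Most frequent follower of `prev`, or None if `prev` was never followed."""
    counts = bigram_counts.get(prev)
    return counts.most_common(1)[0][0] if counts else None


print(next_word_probs('the'))  # {'quick': 0.5, 'slow': 0.25, 'lazy': 0.25}
print(predict_next('the'))     # quick
print(predict_next('fox'))     # None: 'fox' always ends a document
```

With only four short documents the estimates are coarse, and any word pair not seen in the text gets probability zero. A larger corpus or a smoothing scheme would be needed for practical use.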


9B) Topic Modeling using LDA and LSA

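We will use:

LDA (Latent Dirichlet Allocation): a probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words.

LSA (Latent Semantic Analysis): an SVD-based method that projects the term-document matrix onto a small number of components.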

In [2]:
!pip install nltk scikit-learn gensim




In [3]:
!pip install --upgrade gensim



In [4]:
# gensim 4.3.3 requires numpy<2.0,>=1.18.5, so keep NumPy on the 1.x line
# (upgrading to NumPy 2.x makes pip report a dependency conflict with gensim)
!pip install "numpy>=1.18.5,<2.0"
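Because gensim 4.3.3 declares numpy<2.0 as a requirement, it can help to check which versions are actually installed before importing it. The following is a small sketch using only the standard library. The helper names installed_version and numpy_is_pre_2 are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def numpy_is_pre_2(numpy_version):
    """True when a version string such as '1.26.4' has a major version below 2."""
    return int(numpy_version.split(".")[0]) < 2


# Report what is installed in the current environment.
for name in ("numpy", "gensim"):
    print(name, installed_version(name))

np_version = installed_version("numpy")
if np_version is not None and not numpy_is_pre_2(np_version):
    print("NumPy 2.x is installed; gensim 4.3.3 expects numpy<2.0")
```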


In [5]:
import pprint

import nltk
from gensim import corpora, models
from nltk.corpus import stopwords
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Stop-word list used to filter the documents for LDA
nltk.download('stopwords')

# Input documents
docs = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox'
]

# Preprocessing
stop_words = set(stopwords.words('english'))
texts = [[word for word in doc.lower().split() if word not in stop_words] for doc in docs]

# Create dictionary and corpus for gensim LDA
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# LDA Model (no random_state is set, so the topic weights vary between runs)
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print("\n--- LDA Topics ---")
pprint.pprint(lda_model.print_topics())

# LSA Model using Scikit-Learn
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)

lsa_model = TruncatedSVD(n_components=2)
lsa_topic_matrix = lsa_model.fit_transform(X)

terms = vectorizer.get_feature_names_out()

print("\n--- LSA Topics ---")
for i, comp in enumerate(lsa_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)
    print(f"Topic {i}:")
    print([t[0] for t in sorted_terms[:5]])



--- LDA Topics ---
[(0,
  '0.205*"fox" + 0.201*"brown" + 0.122*"yellow" + 0.122*"lazy" + 0.115*"slow" '
  '+ 0.113*"quick" + 0.081*"dog" + 0.042*"red"'),
 (1,
  '0.255*"dog" + 0.207*"quick" + 0.188*"red" + 0.078*"slow" + 0.072*"brown" + '
  '0.067*"fox" + 0.067*"lazy" + 0.066*"yellow"')]

--- LSA Topics ---
Topic 0:
['brown', 'quick', 'dog', 'fox', 'red']
Topic 1:
['fox', 'lazy', 'yellow', 'brown', 'quick']




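Output Summary:

The n-gram cell lists the contiguous word sequences of length 1, 2, and 3 in the text.

LDA gives each topic a probability distribution over words. In the run above, topic 0 weights "fox" and "brown" most heavily, while topic 1 weights "dog", "quick", and "red".

LSA reduces the TF-IDF matrix to two components with truncated SVD. The terms with the largest weights in each component are listed above.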
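To make the dimensionality reduction behind LSA concrete, here is a minimal sketch that runs the SVD directly with NumPy. It makes two assumptions that differ from the cell above: it uses raw term counts instead of TF-IDF weights, and it treats only "the" as a stop word. The variable names (doc_coords, term_loadings) are illustrative.

```python
import numpy as np

docs = [
    'the quick brown fox',
    'the slow brown dog',
    'the quick red dog',
    'the lazy yellow fox',
]
stop_words = {'the'}  # assumption: a one-word stop list is enough for this text

texts = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
vocab = sorted({w for t in texts for w in t})
index = {w: i for i, w in enumerate(vocab)}

# Document-term count matrix: one row per document, one column per term.
X = np.zeros((len(texts), len(vocab)))
for row, text in enumerate(texts):
    for w in text:
        X[row, index[w]] += 1

# Full SVD, then keep the k largest singular values (the truncation step of LSA).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_coords = U[:, :k] * S[:k]  # each document in the k-dimensional space
term_loadings = Vt[:k]         # each component as weights over the vocabulary

print("Singular values:", np.round(S, 3))
for i, row in enumerate(term_loadings):
    # Rank by absolute weight because the sign of an SVD component is arbitrary.
    top = [vocab[j] for j in np.argsort(-np.abs(row))[:3]]
    print(f"Component {i}: {top}")
print("Document coordinates:\n", np.round(doc_coords, 3))
```

Documents that share words end up with similar rows in doc_coords, which is the sense in which LSA groups related documents and terms.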