# LAB 5 - NLP


1. Extractive Summarization

Step 1-1 Import Libraries and define input sentences


In [145]:
# Import necessary libraries
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences
sentences = [
"Artificial intelligence is transforming industries.",
"Applications of AI include healthcare, finance, and education.",
"AI improves efficiency but raises ethical concerns like privacy.",
"Healthcare benefits from AI in diagnostics and patient care."
]

Step 1-2 Define function to calculate Cosine Similarity Matrix

In [146]:
# Function to calculate cosine similarity matrix
def build_similarity_matrix(sentences):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix)
    return similarity_matrix
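
As a rough illustration of what `cosine_similarity` computes, the sketch below applies the same formula to raw word-count vectors in plain Python. It deliberately skips the TF-IDF weighting that `TfidfVectorizer` applies, so its numbers will differ from the notebook's matrix. `bag_of_words` and `cosine` are hypothetical helper names, not part of the lab.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, split on whitespace, and strip commas/periods (a simplification).
    return Counter(word.strip(".,").lower() for word in sentence.split())

def cosine(a, b):
    # Dot product over shared words divided by the product of the vector lengths.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

s1 = bag_of_words("Artificial intelligence is transforming industries.")
s2 = bag_of_words("Applications of AI include healthcare, finance, and education.")
s4 = bag_of_words("Healthcare benefits from AI in diagnostics and patient care.")

print(round(cosine(s2, s4), 4))  # 0.3536: the sentences share "ai", "healthcare", "and"
print(cosine(s2, s1))            # 0.0: no shared words
```

Sentences that share no vocabulary get a similarity of zero, so they contribute nothing to each other's rank in the graph-based steps that follow.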

Step 1-3 Define function for TextRank Algorithm

In [147]:
# Function for TextRank Algorithm
def textrank(sentences, similarity_matrix, damping=0.85, max_iter=100, tol=1e-4):
    G = nx.Graph()
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                G.add_edge(i, j, weight=similarity_matrix[i][j])

    scores = nx.pagerank(G, alpha=damping, max_iter=max_iter, tol=tol)
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)),
        reverse=True
    )
    return ranked_sentences
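
To show roughly what `nx.pagerank` does with the weighted sentence graph, here is a plain-Python power-iteration sketch. It is a simplified stand-in, not networkx's implementation (the convergence test and other details differ), and `weighted_pagerank` and the toy weights are hypothetical.

```python
def weighted_pagerank(weights, damping=0.85, max_iter=100, tol=1e-6):
    # weights[i][j] is the symmetric edge weight between nodes i and j (0 means no edge).
    n = len(weights)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in weights]
    for _ in range(max_iter):
        # Nodes with no outgoing weight spread their score uniformly over all nodes.
        dangling = sum(scores[j] for j in range(n) if out_weight[j] == 0)
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * weights[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * (incoming + dangling / n))
        change = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break
    return scores

# Hypothetical 3-node graph: node 0 is strongly tied to both other nodes.
toy_weights = [
    [0.0, 0.6, 0.6],
    [0.6, 0.0, 0.1],
    [0.6, 0.1, 0.0],
]
toy_scores = weighted_pagerank(toy_weights)
print([round(s, 4) for s in toy_scores])
```

Node 0 ends up with the highest score because most of the edge weight in the graph points to it, which is the same intuition behind ranking sentences by how similar they are to the rest of the text.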


Step 1-4 Build Similarity Matrix

In [148]:
# Build similarity matrix
similarity_matrix = build_similarity_matrix(sentences)

Step 1-5 Apply TextRank

In [149]:
# Apply TextRank
textrank_results = textrank(sentences, similarity_matrix)

Step 1-6 Display the result

In [150]:
# Print TextRank Results (All sentences with their scores)
print("TextRank Sentence Scores:")
for score, sentence in textrank_results:
    print(f"Score: {score:.4f}, Sentence: {sentence}")

TextRank Sentence Scores:
Score: 0.3934, Sentence: Applications of AI include healthcare, finance, and education.
Score: 0.3882, Sentence: Healthcare benefits from AI in diagnostics and patient care.
Score: 0.1707, Sentence: AI improves efficiency but raises ethical concerns like privacy.
Score: 0.0476, Sentence: Artificial intelligence is transforming industries.


Step 1-7 Implement LexRank

In [151]:
# Function for LexRank Algorithm
def lexrank(sentences, similarity_matrix, threshold=0.2):
    n = len(sentences)
    G = nx.Graph()

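    # Build the graph, keeping only edges whose similarity exceeds the threshold.
    # Sentences with no such edge never enter the graph and receive no score.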
    for i in range(n):
        for j in range(n):
            if i != j and similarity_matrix[i][j] > threshold:
                G.add_edge(i, j, weight=similarity_matrix[i][j])

    # Compute PageRank
    scores = nx.pagerank(G, max_iter=100, tol=1e-6)
    # Collect all scores
    ranked_sentences = sorted(
        ((scores[node], sentences[node]) for node in scores),
        reverse=True
    )
    return scores, ranked_sentences


In [152]:
# Build similarity matrix
similarity_matrix = build_similarity_matrix(sentences)

# Apply LexRank
lexrank_scores, lexrank_results = lexrank(sentences, similarity_matrix)

# Print all LexRank Scores (All sentences with their scores)
print("LexRank Sentence Scores (All Sentences):")
for node, score in lexrank_scores.items():
    print(f"Sentence {node + 1}: Score: {score:.4f}")

# Print LexRank Summary (Ranked Sentences)
print("\nLexRank Summary:")
for score, sentence in lexrank_results:
    print(f"Score: {score:.4f}, Sentence: {sentence}")


LexRank Sentence Scores (All Sentences):
Sentence 2: Score: 0.5000
Sentence 4: Score: 0.5000

LexRank Summary:
Score: 0.5000, Sentence: Healthcare benefits from AI in diagnostics and patient care.
Score: 0.5000, Sentence: Applications of AI include healthcare, finance, and education.
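
The output above lists only sentences 2 and 4 because `lexrank` adds a node only when it has at least one edge above the threshold; the remaining sentences never enter the graph. The sketch below reproduces that effect on a hypothetical similarity matrix (the values are made up for illustration, not taken from the notebook), and `connected_and_isolated` is a hypothetical helper.

```python
def connected_and_isolated(similarity, threshold):
    # Return (nodes that keep at least one edge, nodes dropped from the graph).
    n = len(similarity)
    connected = set()
    for i in range(n):
        for j in range(n):
            if i != j and similarity[i][j] > threshold:
                connected.update((i, j))
    isolated = [i for i in range(n) if i not in connected]
    return sorted(connected), isolated

# Hypothetical 4x4 similarity matrix; only the pair (1, 3) exceeds 0.2.
toy_similarity = [
    [1.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.08, 0.25],
    [0.00, 0.08, 1.00, 0.07],
    [0.00, 0.25, 0.07, 1.00],
]
kept, dropped = connected_and_isolated(toy_similarity, threshold=0.2)
print("kept:", kept)        # kept: [1, 3]
print("dropped:", dropped)  # dropped: [0, 2]
```

If every sentence should receive a score, one option is to call `G.add_nodes_from(range(n))` before adding edges so that isolated sentences stay in the graph; the resulting scores would then differ from the ones shown above.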


2. Abstractive Summarization

Step 2-1 Import Libraries and Define the Source Text and Target Text

In [153]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample data: input sentence and target summary
source_text = ["artificial intelligence is transforming industries"]
target_text = ["ai transforms industries"]

# Vocabulary
vocab = ["<pad>", "<sos>", "<eos>", "artificial", "intelligence", "is",
"transforming", "industries", "ai", "transforms"]
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# Convert sentences to indices
def tokenize(text, word2idx):
    return [[word2idx["<sos>"]] + [word2idx[word]
                                  for word in sentence.split()] +
                                  [word2idx["<eos>"]] for sentence in text]

source_indices = tokenize(source_text, word2idx)
target_indices = tokenize(target_text, word2idx)

Step 2-2 Converting to PyTorch Tensors


In [154]:

# Convert to tensors
source_tensor = torch.tensor(source_indices, dtype=torch.long)
target_tensor = torch.tensor(target_indices, dtype=torch.long)
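
`torch.tensor` needs every row to have the same length, which holds here only because each list contains a single sentence. With several sentences of different lengths, the index lists would have to be padded first. A minimal sketch, assuming the `<pad>` index is 0 as in the vocabulary above; `pad_batch` is a hypothetical helper, not part of the lab.

```python
def pad_batch(batch, pad_idx=0):
    # Right-pad every sequence with pad_idx up to the longest sequence in the batch.
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in batch]

# Index sequences for the lab's source and target sentences under its vocabulary.
batch = [
    [1, 3, 4, 5, 6, 7, 2],  # <sos> artificial intelligence is transforming industries <eos>
    [1, 8, 9, 7, 2],        # <sos> ai transforms industries <eos>
]
padded = pad_batch(batch)
print(padded)
# [[1, 3, 4, 5, 6, 7, 2], [1, 8, 9, 7, 2, 0, 0]]
```

PyTorch also ships `torch.nn.utils.rnn.pad_sequence` for tensors; the plain-Python version is shown only to make the idea explicit.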

Step 2-3 Define Hyperparameters


In [155]:
# Hyperparameters
embedding_dim = 16
hidden_dim = 32
vocab_size = len(vocab)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Step 2-4 Define Encoder

In [156]:
# Encoder
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

Step 2-5 Define Decoder

In [157]:
# Decoder
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden, cell):
        x = x.unsqueeze(1)  # (batch,) -> (batch, 1): add a sequence-length dimension of 1
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))
        return prediction, hidden, cell

Step 2-6 Define Seq2seq Model

In [158]:
# Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = target.shape[0]
        target_len = target.shape[1]
        vocab_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, target_len, vocab_size).to(device)
        hidden, cell = self.encoder(source)

        x = target[:, 0]
        for t in range(1, target_len):
            output, hidden, cell = self.decoder(x, hidden, cell)
            outputs[:, t, :] = output

            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            x = target[:, t] if teacher_force else output.argmax(1)

        return outputs
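
The loop in `forward` pairs each decoder input with the token it should predict next. Under full teacher forcing (`teacher_forcing_ratio=1.0`), the inputs are simply the target sequence shifted by one position. A small plain-Python sketch of that alignment for the lab's target sentence:

```python
target_tokens = ["<sos>", "ai", "transforms", "industries", "<eos>"]

# With full teacher forcing, step t feeds target[t-1] and is scored against target[t].
decoder_inputs = target_tokens[:-1]
expected_outputs = target_tokens[1:]

for step, (fed, expected) in enumerate(zip(decoder_inputs, expected_outputs), start=1):
    print(f"t={step}: input={fed!r:14} expected={expected!r}")
```

Position 0 of `outputs` is never written, which is why the training loop drops it with `output[:, 1:]` before computing the loss. With a ratio below 1.0, some steps feed the model's own prediction instead of the ground-truth token.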

Step 2-7 Instantiate the Encoder and Decoder

In [159]:
# Instantiate the model
encoder = Encoder(vocab_size, embedding_dim, hidden_dim).to(device)
decoder = Decoder(vocab_size, embedding_dim, hidden_dim).to(device)
model = Seq2Seq(encoder, decoder).to(device)

Step 2-8 Define the Loss Function and Optimizer

In [160]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=word2idx["<pad>"])
optimizer = optim.Adam(model.parameters())
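
`nn.CrossEntropyLoss` applies a log-softmax to the raw scores and averages the negative log-probability of the correct token, skipping positions whose label equals `ignore_index`. A plain-Python sketch of that computation follows; the helper name `cross_entropy` is hypothetical, and the sketch omits options such as class weights and label smoothing.

```python
import math

def cross_entropy(logits, labels, ignore_index=0):
    # logits: one list of raw scores per position; labels: one class index per position.
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded positions do not contribute to the loss
        # Numerically stable log-sum-exp of the scores.
        peak = max(scores)
        log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum_exp - scores[label]
        counted += 1
    return total / counted if counted else 0.0

# A model that scores all 10 vocabulary entries equally.
uniform_logits = [[0.0] * 10 for _ in range(4)]
labels = [8, 9, 7, 2]  # ai transforms industries <eos>
print(round(cross_entropy(uniform_logits, labels), 4))  # 2.3026, i.e. ln(10)
```

That value is a useful baseline: a loss near ln(10) ≈ 2.30 means the model is guessing uniformly over the 10-word vocabulary, and anything well below it indicates that the model has started to learn the target sequence.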

Step 2-9 Perform Training

In [161]:
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    source = source_tensor.to(device)
    target = target_tensor.to(device)

    output = model(source, target)
    output = output[:, 1:].reshape(-1, vocab_size)
    target = target[:, 1:].reshape(-1)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

Epoch [10/100], Loss: 2.1773
Epoch [20/100], Loss: 2.0275
Epoch [30/100], Loss: 1.8136
Epoch [40/100], Loss: 1.5201
Epoch [50/100], Loss: 1.2592
Epoch [60/100], Loss: 0.9904
Epoch [70/100], Loss: 0.8126
Epoch [80/100], Loss: 0.6529
Epoch [90/100], Loss: 0.5140
Epoch [100/100], Loss: 0.3992


Step 2-10 Generate Summary

In [162]:
# Generate a summary
model.eval()
with torch.no_grad():
    source = source_tensor.to(device)
    hidden, cell = encoder(source)
    x = torch.tensor([word2idx["<sos>"]]).to(device)
    summary = []

    for _ in range(10):
        output, hidden, cell = decoder(x, hidden, cell)
        x = output.argmax(1)
        word = idx2word[x.item()]
        if word == "<eos>":
            break
        summary.append(word)

print("Generated Summary:", " ".join(summary))

Generated Summary: ai transforms industries
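
The generation loop is greedy decoding: at each step it takes the highest-scoring token, feeds it back in, and stops at `<eos>` or after a fixed number of steps. The sketch below isolates that control flow. It replaces the trained decoder with a hypothetical lookup table `next_token` so that it runs without PyTorch, and it ignores the hidden state that the real decoder carries between steps.

```python
# Hypothetical stand-in for the decoder: maps the previous token to the most likely next one.
next_token = {
    "<sos>": "ai",
    "ai": "transforms",
    "transforms": "industries",
    "industries": "<eos>",
}

def greedy_decode(step_fn, start="<sos>", stop="<eos>", max_steps=10):
    # Repeatedly ask step_fn for the next token until stop appears or max_steps is reached.
    tokens, current = [], start
    for _ in range(max_steps):
        current = step_fn(current)
        if current == stop:
            break
        tokens.append(current)
    return tokens

print(" ".join(greedy_decode(next_token.get)))  # ai transforms industries
```

The `max_steps` cap plays the same role as `range(10)` in the cell above: it guarantees termination even if the model never emits `<eos>`.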


# Laboratory Task

In [163]:
!pip install networkx rouge-score scikit-learn nltk
# Import necessary libraries
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import brown
from rouge_score import rouge_scorer
nltk.download('brown')

# Load built-in dataset (Brown Corpus)
sentences = [" ".join(sentence) for sentence in brown.sents(categories='news')[:100]]

# Reference summary: not manually written. As a simple stand-in,
# the first 5 sentences of the corpus are used as the reference.
reference_summary = " ".join(sentences[:5])



[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [164]:
similarity_matrix = build_similarity_matrix(sentences)
textrank_results = textrank(sentences, similarity_matrix)
# Print TextRank Results (All sentences with their scores)
print("TextRank Sentence Scores:")
for score, sentence in textrank_results:
    print(f"Score: {score:.4f}, Sentence: {sentence}")



TextRank Sentence Scores:
Score: 0.0205, Sentence: The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
Score: 0.0204, Sentence: `` This is one of the major items in the Fulton County general assistance program '' , the jury said , but the State Welfare Department `` has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of Fulton County , which receives none of this money .
Score: 0.0181, Sentence: `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
Score: 0.0170, Sentence: The jury also commented on the Fulton ordinary's court which has been under fire for its practices in the appointment of appraisers

In [165]:
textrank_results[0][1]

"The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted ."

In [166]:
# Build similarity matrix
similarity_matrix = build_similarity_matrix(sentences)

# Apply LexRank
lexrank_scores, lexrank_results = lexrank(sentences, similarity_matrix)

# Print all LexRank Scores (All sentences with their scores)
print("LexRank Sentence Scores (All Sentences):")
for node, score in lexrank_scores.items():
    print(f"Sentence {node + 1}: Score: {score:.4f}")

# Print LexRank Summary (Ranked Sentences)
print("\nLexRank Summary:")

for score, sentence in lexrank_results:
    print(f"Score: {score:.4f}, Sentence: {sentence}")

LexRank Sentence Scores (All Sentences):
Sentence 2: Score: 0.0286
Sentence 4: Score: 0.0179
Sentence 7: Score: 0.0206
Sentence 10: Score: 0.0070
Sentence 15: Score: 0.0491
Sentence 19: Score: 0.0328
Sentence 3: Score: 0.0115
Sentence 44: Score: 0.0325
Sentence 14: Score: 0.0086
Sentence 16: Score: 0.0121
Sentence 17: Score: 0.0119
Sentence 24: Score: 0.0065
Sentence 30: Score: 0.0117
Sentence 21: Score: 0.0068
Sentence 22: Score: 0.0069
Sentence 25: Score: 0.0169
Sentence 27: Score: 0.0169
Sentence 36: Score: 0.0281
Sentence 37: Score: 0.0153
Sentence 38: Score: 0.0069
Sentence 40: Score: 0.0175
Sentence 56: Score: 0.0125
Sentence 94: Score: 0.0113
Sentence 45: Score: 0.0169
Sentence 49: Score: 0.0169
Sentence 46: Score: 0.0169
Sentence 48: Score: 0.0169
Sentence 50: Score: 0.0122
Sentence 74: Score: 0.0247
Sentence 54: Score: 0.0114
Sentence 57: Score: 0.0261
Sentence 55: Score: 0.0095
Sentence 90: Score: 0.0137
Sentence 58: Score: 0.0183
Sentence 61: Score: 0.0163
Sentence 64: Score

In [167]:
# Using the top 5 sentences as summary

textrank_summary = ' '.join(sentence for _, sentence in textrank_results[:5])
lexrank_summary = ' '.join(sentence for _, sentence in lexrank_results[:5])
print("Text Rank Summary:")
print(textrank_summary)
print("Lex Rank Summary:")
print(lexrank_summary)

Text Rank Summary:
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . `` This is one of the major items in the Fulton County general assistance program '' , the jury said , but the State Welfare Department `` has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of Fulton County , which receives none of this money . `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' . The jury also commented on the Fulton ordinary's court which has been under fire for its practices in the appointment of appraisers , guardians and administrators and the awarding of fees and compensation . The jury did not elaborate , bu

In [168]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3'], use_stemmer=True)

lexrank_scores = scorer.score(reference_summary, lexrank_summary)
textrank_scores = scorer.score(reference_summary, textrank_summary)


# Print ROUGE comparison
print("\nROUGE Scores Comparison:")
print("\nLexRank ROUGE Scores:")
for metric, score in lexrank_scores.items():
    print(f"{metric.upper()} Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")

print("\nTextRank ROUGE Scores:")
for metric, score in textrank_scores.items():
    print(f"{metric.upper()} Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")


ROUGE Scores Comparison:

LexRank ROUGE Scores:
ROUGE1 Precision: 0.4481, Recall: 0.5503, F1: 0.4940
ROUGE2 Precision: 0.2747, Recall: 0.3378, F1: 0.3030
ROUGE3 Precision: 0.2265, Recall: 0.2789, F1: 0.2500

TextRank ROUGE Scores:
ROUGE1 Precision: 0.5137, Recall: 0.6309, F1: 0.5663
ROUGE2 Precision: 0.4011, Recall: 0.4932, F1: 0.4424
ROUGE3 Precision: 0.3812, Recall: 0.4694, F1: 0.4207
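
For reference, ROUGE-N counts overlapping n-grams between the candidate and the reference: precision divides the overlap by the candidate's n-gram count, recall divides it by the reference's, and F1 is their harmonic mean. A simplified plain-Python sketch follows. Unlike `rouge_score`, it tokenizes on whitespace only and applies no stemming, so it will not reproduce the numbers above exactly; `ngrams` and `rouge_n` are hypothetical helper names.

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    # Clipped n-gram overlap between a reference and a candidate summary.
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_n("ai transforms industries", "ai is transforming industries")
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")  # P=0.5000 R=0.6667 F1=0.5714
```

The same harmonic-mean relation can be checked against the reported scores: plugging the LexRank ROUGE-1 precision (0.4481) and recall (0.5503) into 2PR / (P + R) gives about 0.4940, matching the printed F1.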


# LexRank Performance
To find a good threshold for LexRank, I tested two values (0.5 and 0.2) and compared the resulting ROUGE scores. The threshold sets how similar two sentences must be before an edge connects them in the graph, so it determines which sentences can be ranked at all and directly affects summary quality.

**Threshold: 0.5**

With a threshold of 0.5, LexRank performed poorly:

*   ROUGE-1 F1: 0.1023
*   ROUGE-2 F1: 0.0000
*   ROUGE-3 F1: 0.0000

These scores suggest that very few sentence pairs cleared the threshold, leaving a sparse graph with little information for ranking and, as a result, a weak summary.



**Threshold: 0.2**

When the threshold was reduced to 0.2, LexRank’s performance improved substantially:

*   ROUGE-1 F1: 0.4940
*   ROUGE-2 F1: 0.3030
*   ROUGE-3 F1: 0.2500

A lower threshold keeps more edges between sentences, so more sentences take part in the ranking and the selected summary overlaps more with the reference.
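
A threshold comparison like this can be organized as a small sweep that builds a summary per threshold and records its score. The sketch below shows only the loop structure. `summarize` and `evaluate` are hypothetical callables standing in for the notebook's `lexrank` call and ROUGE scoring, and the toy stand-ins at the bottom exist purely so the example runs; they say nothing about the real data.

```python
def sweep_thresholds(thresholds, summarize, evaluate):
    # Return (best_threshold, results), where results maps threshold -> score.
    results = {}
    for threshold in thresholds:
        summary = summarize(threshold)
        results[threshold] = evaluate(summary)
    best = max(results, key=results.get)
    return best, results

# Toy stand-ins: fixed (similarity, sentence) pairs and a simple word-overlap score.
toy_pairs = [(0.6, "the jury said"), (0.3, "the election was conducted"), (0.1, "funds")]
toy_reference = {"the", "jury", "election", "was", "conducted"}

def toy_summarize(threshold):
    return " ".join(text for sim, text in toy_pairs if sim > threshold)

def toy_evaluate(summary):
    words = set(summary.split())
    return len(words & toy_reference) / len(toy_reference)

best, results = sweep_thresholds([0.5, 0.2], toy_summarize, toy_evaluate)
print(best, results)  # 0.2 {0.5: 0.4, 0.2: 1.0}
```

Extending the list of thresholds (for example 0.1, 0.3, 0.4) would show whether 0.2 is near the best value or simply the better of the two tested.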

# TextRank Performance
TextRank’s scores were identical in both runs:

*   ROUGE-1 F1: 0.5663
*   ROUGE-2 F1: 0.4424
*   ROUGE-3 F1: 0.4207

This is expected: the threshold is a parameter of `lexrank` only, and the `textrank` function keeps every edge regardless of its weight, so changing LexRank’s threshold cannot affect TextRank’s results.

# Key Takeaways
**1. LexRank’s Sensitivity:** LexRank’s performance depends heavily on the threshold. A high threshold (like 0.5) removes most connections from the graph, while a lower threshold (like 0.2) keeps enough of them to give much better results.

**2. TextRank’s Stability:** TextRank has no threshold to tune, so it produces the same summary and the same ROUGE scores in both runs.

**3. Best Threshold for LexRank:** Of the two thresholds tested, 0.2 gave the better LexRank performance, although it still trails TextRank (ROUGE-1 F1 of 0.4940 versus 0.5663).

# Conclusion
Of the two thresholds tested, 0.2 gives LexRank the higher ROUGE precision, recall, and F1, and therefore the better summary. TextRank still outperforms LexRank on every ROUGE metric reported here, but lowering the threshold from 0.5 to 0.2 narrows the gap considerably.