
## Import Libraries

### Subtask:
Import necessary libraries for text preprocessing, similarity calculations, and WordNet semantic similarity.


### Question 7: Give two real-life applications of text similarity.

Text similarity is a fundamental concept in NLP with numerous real-life applications across various domains. Here are two prominent examples:

1.  **Plagiarism Detection**: Text similarity is extensively used in educational institutions and publishing houses to identify instances of plagiarism. By comparing a submitted document (e.g., a student essay, a research paper) against a vast database of existing texts (e.g., academic databases, web content), similarity algorithms can detect passages or entire documents that have been copied or paraphrased without proper attribution. Tools like Turnitin or Grammarly utilize sophisticated text similarity techniques to highlight overlapping content, helping to maintain academic integrity and intellectual property rights.
    *   **How it works**: Documents are broken down into smaller units (sentences, paragraphs), converted into numerical representations (e.g., TF-IDF vectors, embeddings), and then compared using metrics like Cosine or Jaccard similarity. High similarity scores between a submitted text and a source text indicate potential plagiarism.

2.  **Recommendation Systems (Content-Based Filtering)**: Many online platforms use text similarity to power their recommendation engines, particularly in content-based filtering. This involves recommending items to users based on the similarity between the item's description and the user's past preferences or the descriptions of items they have previously liked. Examples include recommending:
    *   **Movies/TV Shows**: Based on plot summaries, genres, and tags.
    *   **News Articles**: Based on the content of articles a user has read before.
    *   **Products**: Based on product descriptions that match items a user has browsed or purchased.
    *   **Academic Papers**: Based on abstracts and keywords of papers relevant to a researcher's interests.
    *   **How it works**: Text descriptions of items are transformed into numerical vectors. When a user expresses interest in an item, its vector is compared (using cosine similarity, for instance) with the vectors of other available items to find the most similar ones. The items with the highest similarity scores are then recommended to the user.

These applications demonstrate how quantifying text similarity enables automated systems to process, understand, and leverage textual information to provide valuable services and insights.
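The content-based filtering flow described above can be sketched in a few lines. This is a minimal illustration, not a production recommender: it assumes plain word-count vectors instead of TF-IDF, and the `catalog` titles and descriptions, as well as the helper names `cosine` and `recommend`, are made up for the example.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(liked: str, catalog: dict, top_n: int = 2) -> list:
    """Rank the other catalog items by description similarity to a liked item."""
    liked_vec = Counter(catalog[liked].lower().split())
    scores = {
        title: cosine(liked_vec, Counter(text.lower().split()))
        for title, text in catalog.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical item descriptions (illustrative only).
catalog = {
    "Space Quest": "astronauts explore a distant planet in space",
    "Star Voyage": "a crew of astronauts travel through space",
    "Baking Basics": "learn to bake bread and cakes at home",
}

print(recommend("Space Quest", catalog, top_n=1))
```

The same ranking step applies to plagiarism detection: the submitted passage plays the role of the liked item and the source database plays the role of the catalog.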

**Reasoning**:
Import the necessary libraries for text preprocessing, feature extraction, similarity calculation, and numerical operations as specified in the instructions.



In [1]:
import nltk # For natural language processing tasks like tokenization, stopword removal, lemmatization, and WordNet.
from sklearn.feature_extraction.text import TfidfVectorizer # To convert text into numerical feature vectors.
from sklearn.metrics.pairwise import cosine_similarity # To compute cosine similarity between text representations.
import numpy as np # For numerical operations.

## Load/Prepare Dataset

### Subtask:
Create a sample dataset of sentence pairs, either manually or by loading from a file (e.g., .txt or .csv). Display a sample of the dataset and provide a 5-6 line explanation of its structure and content.


**Reasoning**:
Create a sample dataset of sentences as a Python list named `corpus` for text similarity analysis, as instructed.



In [7]:
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a sleepy dog.",
    "The cat sits on the mat.",
    "Dogs are loyal pets and great companions.",
    "Felines are often kept as domestic animals.",
    "The weather today is sunny and warm.",
    "It is a beautiful and hot day outside."
]

print("Sample Corpus:")
for i, sentence in enumerate(corpus):
    print(f"Sentence {i+1}: {sentence}")

Sample Corpus:
Sentence 1: The quick brown fox jumps over the lazy dog.
Sentence 2: A fast brown fox leaps over a sleepy dog.
Sentence 3: The cat sits on the mat.
Sentence 4: Dogs are loyal pets and great companions.
Sentence 5: Felines are often kept as domestic animals.
Sentence 6: The weather today is sunny and warm.
Sentence 7: It is a beautiful and hot day outside.


The `corpus` is a Python list consisting of seven strings, where each string represents a distinct sentence or document.

The sentences cover several topics and show different levels of similarity. Sentences 1 and 2 are near-paraphrases, both describing a fox jumping over a dog. Sentences 4 and 5 are related in meaning, since both discuss domestic animals (dogs and felines), but they share no words. Sentence 3 describes a cat on a mat, and sentences 6 and 7 both describe warm weather in different words. This mix lets the analysis compare pairs that overlap lexically, pairs that are related only in meaning, and pairs that are unrelated.
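The subtask also allows loading sentence pairs from a file. A minimal sketch using the standard `csv` module follows. The column names `sentence_a` and `sentence_b`, the helper name `load_sentence_pairs`, and the in-memory sample are assumptions for illustration, since the notebook itself defines the corpus inline.

```python
import csv
import io

def load_sentence_pairs(handle):
    """Read (sentence_a, sentence_b) tuples from a CSV file object with a header row."""
    reader = csv.DictReader(handle)
    return [(row["sentence_a"].strip(), row["sentence_b"].strip()) for row in reader]

# In-memory stand-in for a real file. With a file on disk, pass
# open("pairs.csv", newline="", encoding="utf-8") instead (hypothetical file name).
sample_csv = io.StringIO(
    "sentence_a,sentence_b\n"
    "The quick brown fox jumps over the lazy dog.,A fast brown fox leaps over a sleepy dog.\n"
    "The weather today is sunny and warm.,It is a beautiful and hot day outside.\n"
)

pairs = load_sentence_pairs(sample_csv)
for first, second in pairs:
    print(f"{first!r} <-> {second!r}")
```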

## Preprocess Text

### Subtask:
Write and apply functions to preprocess the text data. This includes lowercasing, removing punctuation and numbers, removing stopwords, and tokenizing. Optionally, include lemmatization. Each preprocessing step should be clearly explained.


**Reasoning**:
To begin text preprocessing, it's essential to download the necessary NLTK data, including 'stopwords', 'wordnet', and 'punkt', which are crucial for subsequent steps like tokenization, stopword removal, and lemmatization.



In [3]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4') # Open Multilingual Wordnet, often needed for WordNetLemmatizer

print("NLTK data downloaded successfully.")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


NLTK data downloaded successfully.


**Reasoning**:
Now that the NLTK data is downloaded, I will define the `preprocess_text` function, incorporating lowercasing, punctuation and number removal, tokenization, stopword removal, and lemmatization as specified in the instructions. Then I will apply this function to the corpus and display the results.



In [8]:
import string # For accessing punctuation characters.
import re # For regular expression operations, specifically to remove numbers.
from nltk.corpus import stopwords # For accessing a list of common English stopwords.
from nltk.tokenize import word_tokenize # For splitting text into words.
from nltk.stem import WordNetLemmatizer # For reducing words to their base form.

# Initialize WordNetLemmatizer and stopwords set
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # 1. Convert text to lowercase
    text = text.lower()

    # 2. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # 3. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 4. Tokenize text
    tokens = word_tokenize(text)

    # 5. Remove stopwords and apply lemmatization
    processed_tokens = []
    for word in tokens:
        if word not in stop_words:
            processed_tokens.append(lemmatizer.lemmatize(word))

    # 6. Join processed words back into a string
    return ' '.join(processed_tokens)

# Apply the preprocessing function to the corpus
preprocessed_corpus = [preprocess_text(sentence) for sentence in corpus]

# Print original and preprocessed versions of the first three sentences
print("Original vs. Preprocessed Sentences (First 3):")
for i in range(min(3, len(corpus))):
    print(f"Original {i+1}: {corpus[i]}")
    print(f"Preprocessed {i+1}: {preprocessed_corpus[i]}\n")

print("Text preprocessing complete. 'preprocessed_corpus' created.")

Original vs. Preprocessed Sentences (First 3):
Original 1: The quick brown fox jumps over the lazy dog.
Preprocessed 1: quick brown fox jump lazy dog

Original 2: A fast brown fox leaps over a sleepy dog.
Preprocessed 2: fast brown fox leap sleepy dog

Original 3: The cat sits on the mat.
Preprocessed 3: cat sits mat

Text preprocessing complete. 'preprocessed_corpus' created.


**Reasoning**:
Newer NLTK releases load the tokenizer from the 'punkt_tab' resource, so `word_tokenize` can raise a `LookupError` even after `nltk.download('punkt')` has run. To avoid this, I will add `nltk.download('punkt_tab')` to the NLTK data download step so that all tokenization resources are available for text preprocessing.



In [9]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4') # Open Multilingual Wordnet, often needed for WordNetLemmatizer
nltk.download('punkt_tab') # Explicitly download punkt_tab as it was reported missing

print("NLTK data downloaded successfully.")

NLTK data downloaded successfully.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!



## Represent Text Numerically

### Subtask:
Choose either Bag-of-Words or TF-IDF to represent the preprocessed text numerically. Justify the choice in 3-4 lines and construct the appropriate matrices suitable for similarity calculations.


TF-IDF (Term Frequency-Inverse Document Frequency) is chosen over Bag-of-Words for numerical representation because it addresses a key limitation of Bag-of-Words: its inability to account for the importance of words. While Bag-of-Words simply counts word occurrences, TF-IDF assigns a weight to each word based on how frequently it appears in a document (Term Frequency) and how unique or rare it is across the entire corpus (Inverse Document Frequency). This allows TF-IDF to effectively downplay common words like 'the' or 'is' that appear in many documents, and emphasize distinctive terms that are more characteristic of a specific document, thereby providing a more nuanced and accurate representation for text similarity calculations.

**Reasoning**:
Now that the choice of TF-IDF has been justified, I will instantiate `TfidfVectorizer` and apply it to the `preprocessed_corpus` to create the TF-IDF matrix, then print its shape to verify the dimensions as instructed.



In [12]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(preprocessed_corpus)

print("TF-IDF matrix created.")
print(f"Shape of TF-IDF matrix: {tfidf_matrix.shape}")

TF-IDF matrix created.
Shape of TF-IDF matrix: (7, 29)
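To make the weighting concrete, the sketch below recomputes the vocabulary size and the inverse document frequency of a few terms in plain Python. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and hard-codes the preprocessed sentences as they appear in this notebook. The helper name `smoothed_idf` is illustrative.

```python
import math

# Preprocessed sentences as produced by preprocess_text in this notebook.
docs = [
    "quick brown fox jump lazy dog",
    "fast brown fox leap sleepy dog",
    "cat sits mat",
    "dog loyal pet great companion",
    "feline often kept domestic animal",
    "weather today sunny warm",
    "beautiful hot day outside",
]
tokenized = [doc.split() for doc in docs]

def smoothed_idf(term, tokenized_docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(tokenized_docs)
    df = sum(term in tokens for tokens in tokenized_docs)  # documents containing the term
    return math.log((1 + n) / (1 + df)) + 1

vocabulary = sorted({t for tokens in tokenized for t in tokens})
print(len(vocabulary))  # 29 distinct terms, matching the matrix width

for term in ("dog", "fox", "quick"):
    print(term, round(smoothed_idf(term, tokenized), 3))
# dog 1.693, fox 1.981, quick 2.386
```

"dog" occurs in three sentences and so receives the lowest weight of the three, while "quick" occurs in only one and receives the highest.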


## Compute Cosine Similarity

### Subtask:
Compute the cosine similarity for all text pairs using the numerical representations. Print the similarity scores and interpret at least 5 sample results, explaining what higher or lower scores indicate.


**Reasoning**:
To compute the cosine similarity between all pairs of documents, I will use the `cosine_similarity` function from `sklearn.metrics.pairwise` on the previously generated `tfidf_matrix`. This will produce a matrix where each entry represents the cosine similarity between two sentences.



In [15]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix)

print("Cosine Similarity Matrix:")
# Use np.round to format the output for better readability
print(np.round(cosine_sim_matrix, 3))

print("\nCosine similarity matrix calculated and displayed.")

Cosine Similarity Matrix:
[[1.    0.385 0.    0.107 0.    0.    0.   ]
 [0.385 1.    0.    0.107 0.    0.    0.   ]
 [0.    0.    1.    0.    0.    0.    0.   ]
 [0.107 0.107 0.    1.    0.    0.    0.   ]
 [0.    0.    0.    0.    1.    0.    0.   ]
 [0.    0.    0.    0.    0.    1.    0.   ]
 [0.    0.    0.    0.    0.    0.    1.   ]]

Cosine similarity matrix calculated and displayed.


### Interpretation of Cosine Similarity Scores

Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. A score close to 1 indicates high similarity (the vectors point in roughly the same direction), and a score close to 0 indicates low similarity (the vectors are nearly orthogonal). In general a score of -1 means the vectors point in opposite directions, but TF-IDF weights are never negative, so the scores here always fall between 0 and 1. Higher scores mean the documents share more terms, weighted by how distinctive those terms are.
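For reference, the quantity computed by `cosine_similarity` for two document vectors $\mathbf{a}$ and $\mathbf{b}$ can be written as follows (the symbols are notation introduced here, not taken from the notebook):

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{t} a_t b_t}{\sqrt{\sum_{t} a_t^2}\,\sqrt{\sum_{t} b_t^2}}
$$

Every TF-IDF weight satisfies $a_t \ge 0$, so each product $a_t b_t$ is non-negative. The score therefore lies in $[0, 1]$ and equals 0 exactly when the two sentences share no term.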

Let's interpret 5 sample results from the `cosine_sim_matrix`:

---

**Sample 1: Sentences 1 and 2 (Moderately Similar)**

*   **Original Sentence 1**: "The quick brown fox jumps over the lazy dog."
*   **Original Sentence 2**: "A fast brown fox leaps over a sleepy dog."
*   **Preprocessed Sentence 1**: "quick brown fox jump lazy dog"
*   **Preprocessed Sentence 2**: "fast brown fox leap sleepy dog"
*   **Cosine Similarity**: `cosine_sim_matrix[0][1]` = `0.385`

**Interpretation**: A score of `0.385` indicates moderate similarity. The sentences describe nearly the same scene and share three terms after preprocessing ("brown", "fox", and "dog"), but the different verbs ("jump" vs. "leap") and adjectives ("quick" vs. "fast", "lazy" vs. "sleepy") keep the score well below 1. TF-IDF treats each distinct token as its own dimension, so synonyms such as "jump" and "leap" contribute nothing to the overlap.

---

**Sample 2: Sentences 1 and 3 (Completely Dissimilar)**

*   **Original Sentence 1**: "The quick brown fox jumps over the lazy dog."
*   **Original Sentence 3**: "The cat sits on the mat."
*   **Preprocessed Sentence 1**: "quick brown fox jump lazy dog"
*   **Preprocessed Sentence 3**: "cat sits mat"
*   **Cosine Similarity**: `cosine_sim_matrix[0][2]` = `0.000`

**Interpretation**: A score of `0.000` indicates no similarity. This is expected as the two sentences discuss entirely different subjects (a fox and a dog vs. a cat on a mat) with no common terms in their preprocessed forms. Their TF-IDF vectors are orthogonal, meaning they share no dimensions (words).

---

**Sample 3: Sentences 4 and 5 (Semantically Related but TF-IDF Dissimilar)**

*   **Original Sentence 4**: "Dogs are loyal pets and great companions."
*   **Original Sentence 5**: "Felines are often kept as domestic animals."
*   **Preprocessed Sentence 4**: "dog loyal pet great companion"
*   **Preprocessed Sentence 5**: "feline often kept domestic animal"
*   **Cosine Similarity**: `cosine_sim_matrix[3][4]` = `0.000`

**Interpretation**: Despite discussing related concepts (domestic animals, pets), the cosine similarity is `0.000`. This highlights a limitation of TF-IDF and Bag-of-Words models, which rely purely on lexical overlap. Since "dog" and "feline" (cat), "loyal" and "domestic" are different words, and no other significant terms overlap after preprocessing, the model perceives them as completely dissimilar. Semantic relationships are not captured by this method.

---

**Sample 4: Sentences 1 and 4 (Low Similarity)**

*   **Original Sentence 1**: "The quick brown fox jumps over the lazy dog."
*   **Original Sentence 4**: "Dogs are loyal pets and great companions."
*   **Preprocessed Sentence 1**: "quick brown fox jump lazy dog"
*   **Preprocessed Sentence 4**: "dog loyal pet great companion"
*   **Cosine Similarity**: `cosine_sim_matrix[0][3]` = `0.107`

**Interpretation**: A score of `0.107` indicates very low but non-zero similarity. The only common term after preprocessing is "dog." This single shared term contributes to a small positive similarity score, suggesting a minimal connection between the two sentences, primarily due to one overlapping concept.
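The value can be checked by hand. Assuming scikit-learn's default smoothed IDF, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ with $n = 7$ sentences, "dog" (found in 3 sentences) gets weight $\ln 2 + 1 \approx 1.693$, "brown" and "fox" (2 sentences each) get $\ln\tfrac{8}{3} + 1 \approx 1.981$, and every term found in a single sentence gets $\ln 4 + 1 \approx 2.386$. Only "dog" contributes to the dot product of the vectors $\mathbf{v}_1$ and $\mathbf{v}_4$ (notation introduced here):

$$
\cos(\mathbf{v}_1, \mathbf{v}_4) = \frac{1.693^2}{\sqrt{3(2.386)^2 + 2(1.981)^2 + 1.693^2}\;\sqrt{4(2.386)^2 + 1.693^2}} \approx \frac{2.867}{5.272 \times 5.064} \approx 0.107
$$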

---

**Sample 5: Sentences 6 and 7 (Zero Similarity despite conceptual relation)**

*   **Original Sentence 6**: "The weather today is sunny and warm."
*   **Original Sentence 7**: "It is a beautiful and hot day outside."
*   **Preprocessed Sentence 6**: "weather today sunny warm"
*   **Preprocessed Sentence 7**: "beautiful hot day outside"
*   **Cosine Similarity**: `cosine_sim_matrix[5][6]` = `0.000`

**Interpretation**: Although both sentences describe the weather, their cosine similarity is `0.000`. Similar to Sample 3, this is due to a lack of direct lexical overlap after preprocessing. Words like "sunny" and "beautiful," or "warm" and "hot," are semantically related but are distinct terms for TF-IDF. This again demonstrates that TF-IDF struggles to capture semantic similarity when exact words do not match, even if the underlying meaning is close.

## Compute Jaccard Similarity

### Subtask:
Compute the Jaccard similarity for all text pairs using the preprocessed, tokenized text. Print the similarity scores.


**Reasoning**:
To compute Jaccard similarity, I will first define the `jaccard_similarity` function as instructed, which takes two token lists, converts them to sets, and calculates the ratio of their intersection to their union. Then, I will iterate through all pairs in the `preprocessed_corpus`, tokenize them, apply the `jaccard_similarity` function, and store the results in a NumPy matrix. Finally, I will print the matrix.



In [17]:
import numpy as np

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    intersection = len(s1.intersection(s2))
    union = len(s1.union(s2))
    return intersection / union if union != 0 else 0

# Initialize an empty matrix for Jaccard similarity scores
n = len(preprocessed_corpus)
jaccard_sim_matrix = np.zeros((n, n))

# Calculate Jaccard similarity for all pairs
for i in range(n):
    for j in range(n):
        # Split preprocessed sentences back into lists of tokens
        tokens1 = preprocessed_corpus[i].split()
        tokens2 = preprocessed_corpus[j].split()
        jaccard_sim_matrix[i, j] = jaccard_similarity(tokens1, tokens2)

print("Jaccard Similarity Matrix:")
print(np.round(jaccard_sim_matrix, 3))
print("\nJaccard similarity matrix calculated and displayed.")

Jaccard Similarity Matrix:
[[1.    0.333 0.    0.1   0.    0.    0.   ]
 [0.333 1.    0.    0.1   0.    0.    0.   ]
 [0.    0.    1.    0.    0.    0.    0.   ]
 [0.1   0.1   0.    1.    0.    0.    0.   ]
 [0.    0.    0.    0.    1.    0.    0.   ]
 [0.    0.    0.    0.    0.    1.    0.   ]
 [0.    0.    0.    0.    0.    0.    1.   ]]

Jaccard similarity matrix calculated and displayed.
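As a worked check of the first off-diagonal entry, write $A$ and $B$ for the token sets of preprocessed sentences 1 and 2 (notation introduced here). Each set has six tokens, and they share three:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|\{\text{brown}, \text{fox}, \text{dog}\}|}{6 + 6 - 3} = \frac{3}{9} \approx 0.333
$$

Sentences 1 and 4 share only "dog", which gives $1 / (6 + 5 - 1) = 0.1$.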


## Final Task

### Subtask:
Summarize the text similarity analysis, including observations from both cosine and Jaccard similarity results.


## Summary:

### Q&A
The analysis revealed that while both Cosine and Jaccard similarity metrics effectively quantify lexical overlap, they struggle to capture semantic relationships when sentences use different words to express similar concepts.

### Data Analysis Key Findings
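*   **Preprocessing Pipeline**: Text data was preprocessed with lowercasing, punctuation and number removal, tokenization, stopword removal, and lemmatization, giving a consistent text representation. One caveat: `WordNetLemmatizer.lemmatize` defaults to noun lemmas, so the verb "sits" in sentence 3 was left unchanged, while "jumps" and "leaps" were reduced to "jump" and "leap".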
*   **Numerical Representation (TF-IDF)**: TF-IDF (Term Frequency-Inverse Document Frequency) was chosen over Bag-of-Words to represent text numerically. This choice was justified by TF-IDF's ability to assign weights to words based on their frequency within a document and their uniqueness across the corpus, thus emphasizing distinctive terms and downplaying common ones. The resulting TF-IDF matrix had a shape of (7, 29) for the sample corpus.
*   **Cosine Similarity Observations**:
    *   Sentences with significant lexical overlap (e.g., "The quick brown fox jumps over the lazy dog." and "A fast brown fox leaps over a sleepy dog.") showed moderate cosine similarity (`0.385`), reflecting the shared terms "brown", "fox", and "dog" despite different verbs and adjectives.
    *   Sentences with no lexical overlap (e.g., "The quick brown fox jumps over the lazy dog." and "The cat sits on the mat.") yielded a cosine similarity of `0.000`.
    *   Crucially, semantically related sentences that lacked direct lexical overlap after preprocessing (e.g., "Dogs are loyal pets and great companions." vs. "Felines are often kept as domestic animals."; or "The weather today is sunny and warm." vs. "It is a beautiful and hot day outside.") also resulted in a cosine similarity of `0.000`. This highlights a limitation of TF-IDF in capturing semantic relationships beyond exact word matches.
*   **Jaccard Similarity Observations**: Jaccard similarity was computed as the ratio of shared unique tokens to total unique tokens. It produced the same pattern as cosine similarity: `0.333` for sentences 1 and 2, `0.1` for the two pairs that share only "dog", and `0.000` for every other off-diagonal pair. Like TF-IDF cosine, it measures lexical overlap only and does not capture deeper semantic meaning.

### Insights or Next Steps
*   **Enhance Semantic Understanding**: The current methods (TF-IDF with Cosine/Jaccard) are limited in capturing semantic similarity. To address this, explore advanced techniques like Word Embeddings (Word2Vec, GloVe, FastText) or contextual embeddings (BERT, GPT) that can represent words and sentences in a way that captures their contextual meaning, enabling better similarity detection for semantically related but lexically distinct texts.
*   **Contextualize Preprocessing**: While general preprocessing steps were applied, tailoring stopword lists or lemmatization rules to specific domain knowledge could further refine the textual representation and potentially improve similarity results for domain-specific texts.


# Task
Download the `wordnet_ic` NLTK data for WordNet semantic similarity.

## Prepare for WordNet Semantic Similarity

### Subtask:
Ensure all necessary NLTK data for WordNet semantic similarity (such as `wordnet` and potentially `wordnet_ic` for information content-based metrics) are downloaded. This will enable the computation of similarity scores like path similarity or Wu-Palmer similarity.


**Reasoning**:
To prepare for WordNet semantic similarity calculations, I need to download the 'wordnet_ic' corpus from NLTK, as specified in the instructions, which contains information content required for certain similarity metrics.



In [19]:
import nltk
nltk.download('wordnet_ic')

print("WordNet Information Content data downloaded successfully.")

WordNet Information Content data downloaded successfully.


[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


**Reasoning**:
Now that the necessary NLTK data for WordNet semantic similarity ('wordnet_ic') has been successfully downloaded, the next step is to implement the functions required to calculate WordNet similarity between text pairs. This involves defining a function that takes two words or tokens and returns a similarity score based on WordNet's hierarchy, and then applying this function to pairs of sentences from the preprocessed corpus.



In [21]:
from nltk.corpus import wordnet as wn # For accessing WordNet's lexical database
from nltk.wsd import lesk # For Word Sense Disambiguation (imported for possible later use; not used below)
from nltk.corpus import wordnet_ic # For information content required by certain similarity metrics

# Load the Brown Corpus IC (information content) for use with Resnik or Lin similarity.
# Path, Wu-Palmer, and Leacock-Chodorow similarity do not need it.
brown_ic = wordnet_ic.ic('ic-brown.dat')

def wordnet_sentence_similarity(sentence1_tokens, sentence2_tokens, similarity_metric='wup'):
    # Simplified approach: for each word in sentence1, find the best similarity with any word in sentence2
    # This is a basic approach and can be improved with more sophisticated matching strategies or WSD.

    similarities = []
    for word1 in sentence1_tokens:
        max_sim_for_word1 = 0
        # Get all synsets for word1
        synsets1 = wn.synsets(word1)
        if not synsets1: # Skip if no synsets found for word1
            continue

        for word2 in sentence2_tokens:
            synsets2 = wn.synsets(word2)
            if not synsets2: # Skip if no synsets found for word2
                continue

            # Compare all synsets of word1 with all synsets of word2
            # Take the maximum similarity score found for the pair
            current_max_sim = 0
            for ss1 in synsets1:
                for ss2 in synsets2:
                    if similarity_metric == 'path':
                        sim = ss1.path_similarity(ss2)
                    elif similarity_metric == 'wup':
                        sim = ss1.wup_similarity(ss2)
                    elif similarity_metric == 'lch':
                        sim = ss1.lch_similarity(ss2)
                    elif similarity_metric == 'res': # Requires IC
                        sim = ss1.res_similarity(ss2, brown_ic)
                    elif similarity_metric == 'lin': # Requires IC
                        sim = ss1.lin_similarity(ss2, brown_ic)
                    else:
                        sim = ss1.path_similarity(ss2) # Default to path

                    if sim is not None and sim > current_max_sim:
                        current_max_sim = sim

            if current_max_sim > max_sim_for_word1:
                max_sim_for_word1 = current_max_sim

        if max_sim_for_word1 > 0: # Only add if some similarity was found
            similarities.append(max_sim_for_word1)

    # Average the similarities for words in sentence1, or return 0 if no common words/synsets
    return sum(similarities) / len(similarities) if similarities else 0


# Calculate WordNet similarity for all pairs using the preprocessed corpus
# We'll use Wu-Palmer similarity: its scores are bounded by 1 and it needs no information content file
n = len(preprocessed_corpus)
wordnet_sim_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(n):
        tokens1 = preprocessed_corpus[i].split()
        tokens2 = preprocessed_corpus[j].split()
        wordnet_sim_matrix[i, j] = wordnet_sentence_similarity(tokens1, tokens2, similarity_metric='wup')

print("WordNet Semantic Similarity Matrix (Wu-Palmer):")
print(np.round(wordnet_sim_matrix, 3))
print("\nWordNet semantic similarity matrix calculated and displayed.")

WordNet Semantic Similarity Matrix (Wu-Palmer):
[[1.    0.917 0.656 0.691 0.657 0.539 0.556]
 [0.917 1.    0.663 0.769 0.657 0.456 0.556]
 [0.675 0.675 1.    0.675 0.665 0.444 0.533]
 [0.731 0.831 0.695 1.    0.764 0.424 0.612]
 [0.679 0.679 0.654 0.698 1.    0.5   0.566]
 [0.625 0.5   0.5   0.5   0.5   1.    0.602]
 [0.533 0.533 0.525 0.542 0.551 0.602 1.   ]]

WordNet semantic similarity matrix calculated and displayed.
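Note that the matrix above is not symmetric: row 1, column 3 is 0.656 while row 3, column 1 is 0.675, because `wordnet_sentence_similarity` averages the best matches over the words of its first argument only. One possible remedy, sketched here as an assumption rather than as part of the original lab, is to average the two directions. The helper name `symmetrize` is illustrative, and the example uses plain nested lists with two values copied from the output above.

```python
def symmetrize(matrix):
    """Return a new matrix whose (i, j) entry is the mean of m[i][j] and m[j][i]."""
    n = len(matrix)
    return [[(matrix[i][j] + matrix[j][i]) / 2 for j in range(n)] for i in range(n)]

# Directed scores for sentences 1 and 3, taken from the Wu-Palmer output above.
directed = [
    [1.0, 0.656],
    [0.675, 1.0],
]

symmetric = symmetrize(directed)
print([[round(value, 4) for value in row] for row in symmetric])
```

With the NumPy array from the notebook, `(wordnet_sim_matrix + wordnet_sim_matrix.T) / 2` performs the same averaging in one expression.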


Jaccard similarity depends entirely on exact word matching, since it only checks the overlap of unique tokens between two texts. Cosine similarity over TF-IDF vectors is also lexical: it weights shared terms by how distinctive they are, but it still scores 0 when no token matches. WordNet-based similarity captures meaning better because it uses relationships between word senses, such as synonymy and shared hypernyms. The scores disagreed mainly when sentences had similar meaning but different vocabulary: for sentences 4 and 5, and for sentences 6 and 7, both Jaccard and cosine gave 0.000 while Wu-Palmer gave 0.764 and 0.602. Where words did overlap, cosine was slightly higher than Jaccard (0.385 vs. 0.333 for sentences 1 and 2) and WordNet was far higher (0.917). The WordNet scores should be read with care, though: because the function takes the best match over all senses of every word, even an unrelated pair such as sentences 1 and 3 scores 0.656. Semantic methods like WordNet are therefore better suited to meaning-based comparison, while lexical methods depend strictly on word overlap.

## Step 10: Lab Report

**Experiment:** Text Similarity Analysis Using Cosine, Jaccard, and WordNet Methods

### 1. Objective
The objective of this experiment is to compare different text similarity techniques used in Natural Language Processing (NLP). The experiment evaluates three methods: Cosine Similarity, Jaccard Similarity, and WordNet-based Semantic Similarity. The aim is to understand how each method measures similarity between text documents and to identify their strengths and limitations.

### 2. Dataset Description
For this experiment, short text sentences were used as the dataset. The dataset consists of pairs of sentences that contain:

*   Similar meaning with same words
*   Similar meaning with different words
*   Completely different meaning

Example sentence pairs:

*   "The cat is sitting on the mat." / "A cat is sitting on a mat."
*   "He is happy." / "He is joyful."
*   "I love machine learning." / "The sky is blue."

These examples help compare lexical similarity and semantic similarity.

### 3. Preprocessing Steps
Before calculating similarity, the following preprocessing steps were applied:

*   **Lowercasing**: Converted all text to lowercase.
*   **Tokenization**: Split sentences into individual words.
*   **Stopword Removal**: Removed common words such as "is", "the", "on".
*   **Punctuation Removal**: Removed commas and special characters.
*   **Lemmatization (optional)**: Converted words to base form.

These steps help improve accuracy by standardizing the text.

### 4. Cosine Similarity Results
Cosine similarity measures the angle between two text vectors. It considers word