
IMPORT REQUIRED LIBRARIES:



In [1]:
import numpy as np # numerical computations
import pandas as pd # dataset handling
import re # text cleaning with regex
import string # punctuation removal


from sklearn.feature_extraction.text import TfidfVectorizer # TF-IDF representation
from sklearn.metrics.pairwise import cosine_similarity # cosine similarity


import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
documents = [
"Artificial intelligence is transforming healthcare",
"Machine learning improves medical diagnosis",
"Doctors use AI tools for patient care",
"Deep learning models analyze medical images",
"Healthcare systems rely on data analytics",
"Natural language processing helps analyze clinical notes",
"AI assists physicians in disease prediction",
"Medical data is sensitive and requires privacy",
"Hospitals adopt smart technologies",
"Technology improves patient treatment",
"Students study artificial intelligence",
"AI courses include machine learning",
"Education platforms use recommendation systems",
"Online learning is powered by data",
"Teachers use digital tools",
"Search engines rely on text similarity",
"Information retrieval uses ranking algorithms",
"Cosine similarity measures document similarity",
"Jaccard similarity compares word overlap",
"Semantic similarity captures meaning",
"WordNet contains synonyms and concepts",
"Language understanding is challenging",
"Text mining extracts useful patterns",
"Chatbots use NLP",
"Virtual assistants understand queries",
"User queries must be matched with documents",
"Information systems process text",
"Text preprocessing improves accuracy",
"Stopwords reduce noise",
"Lemmatization normalizes words"
]

In [3]:
queries = [
"AI in healthcare",
"medical diagnosis using machine learning",
"doctors and physicians",
"text similarity methods",
"semantic meaning of words",
"online education platforms",
"search engine ranking",
"NLP in chatbots",
"data privacy in hospitals",
"learning artificial intelligence"
]

In [6]:
import nltk
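# Download the NLTK resources used below (skipped if already up to date)
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('punkt_tab')  # tokenizer tables required by newer NLTK releases
nltk.download('stopwords')  # English stopword list
nltk.download('wordnet')    # WordNet data for the lemmatizer and similarity
nltk.download('omw-1.4')    # Open Multilingual WordNet data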

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower() # lowercase
    text = re.sub(r'[^a-z\s]', '', text) # remove punctuation & numbers
    tokens = word_tokenize(text) # tokenize
    tokens = [t for t in tokens if t not in stop_words] # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens] # lemmatization
    return ' '.join(tokens)

clean_docs = [preprocess(doc) for doc in documents]
clean_queries = [preprocess(q) for q in queries]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
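
One detail of the cleaning step in `preprocess` is easier to see in isolation: the regex deletes every character that is not a lowercase letter or whitespace, so hyphenated terms are fused into a single token rather than split. The sketch below uses only the standard library. The sample string and the helper names `strip_chars` and `space_chars` are illustrative assumptions, and `str.split()` stands in for `word_tokenize`.

```python
import re

def strip_chars(text):
    """Same pattern as preprocess(): delete anything that is not a-z or whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower()).split()

def space_chars(text):
    """Hypothetical variant: replace those characters with a space instead."""
    return re.sub(r'[^a-z\s]', ' ', text.lower()).split()

sample = "State-of-the-art NLP, 2024!"
print(strip_chars(sample))  # ['stateoftheart', 'nlp']
print(space_chars(sample))  # ['state', 'of', 'the', 'art', 'nlp']
```

Neither behaviour matters for the documents in this lab, which contain no hyphens or digits, but it may matter on other corpora.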


In [7]:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(clean_docs)
query_matrix = vectorizer.transform(clean_queries)
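
To make the vectorization step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf_vectors` and the three toy documents are assumptions for illustration, and whitespace splitting is used instead of scikit-learn's own token pattern.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenised for t in set(toks))
    # Smoothed idf, matching scikit-learn's default (smooth_idf=True)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in row])
    return vocab, rows

toy_docs = ["machine learning improves medical diagnosis",
            "ai course include machine learning",
            "teacher use digital tool"]
vocab, rows = tfidf_vectors(toy_docs)
first = dict(zip(vocab, rows[0]))
# "medical" occurs in one toy document and "machine" in two,
# so "medical" receives the larger weight in the first row.
print(round(first["medical"], 3), round(first["machine"], 3))
```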

In [9]:
cosine_scores = cosine_similarity(query_matrix, tfidf_matrix)

for i, query in enumerate(queries):
    print(f"\nQuery: {query}")
    top_indices = cosine_scores[i].argsort()[-3:][::-1]
    for idx in top_indices:
        print(f"Score: {cosine_scores[i][idx]:.3f} | Doc: {documents[idx]}")


Query: AI in healthcare
Score: 0.358 | Doc: Artificial intelligence is transforming healthcare
Score: 0.333 | Doc: Healthcare systems rely on data analytics
Score: 0.274 | Doc: AI courses include machine learning

Query: medical diagnosis using machine learning
Score: 0.906 | Doc: Machine learning improves medical diagnosis
Score: 0.391 | Doc: AI courses include machine learning
Score: 0.316 | Doc: Deep learning models analyze medical images

Query: doctors and physicians
Score: 0.327 | Doc: AI assists physicians in disease prediction
Score: 0.322 | Doc: Doctors use AI tools for patient care
Score: 0.000 | Doc: Text preprocessing improves accuracy

Query: text similarity methods
Score: 0.538 | Doc: Search engines rely on text similarity
Score: 0.474 | Doc: Cosine similarity measures document similarity
Score: 0.307 | Doc: Information systems process text

Query: semantic meaning of words
Score: 0.633 | Doc: Semantic similarity captures meaning
Score: 0.284 | Doc: Lemmatization normali
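
In the output above, the third hit for "doctors and physicians" has a score of 0.000: `argsort` always returns three indices, even when fewer than three documents share a term with the query. The following sketch computes cosine similarity over plain Python lists and keeps only hits above a threshold. The names `cosine`, `top_k`, and `min_score`, and the toy vectors, are hypothetical and not part of the lab code.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=3, min_score=0.0):
    """Return up to k (score, doc_index) pairs with score > min_score, best first."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored = [(s, i) for s, i in scored if s > min_score]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

doc_vecs = [[1, 1, 0, 0],   # shares both query terms
            [0, 1, 1, 0],   # shares one query term
            [0, 0, 0, 1]]   # shares nothing with the query
query_vec = [1, 1, 0, 0]
hits = top_k(query_vec, doc_vecs)
print(hits)  # two hits; the document with no overlap is dropped
```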

In [11]:
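def jaccard_similarity(a, b):
    # Jaccard = |A ∩ B| / |A ∪ B| over the sets of whitespace-separated tokens
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:  # both texts empty after preprocessing: avoid division by zero
        return 0.0
    return len(set_a & set_b) / len(union)

# Note: this pairs query i with document i by position, not with its best match
for i in range(5):
    print("Query:", clean_queries[i])
    print("Document:", clean_docs[i])
    print("Jaccard Similarity:", jaccard_similarity(clean_queries[i], clean_docs[i]))
    print()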

Query: ai healthcare
Document: artificial intelligence transforming healthcare
Jaccard Similarity: 0.2

Query: medical diagnosis using machine learning
Document: machine learning improves medical diagnosis
Jaccard Similarity: 0.6666666666666666

Query: doctor physician
Document: doctor use ai tool patient care
Jaccard Similarity: 0.14285714285714285

Query: text similarity method
Document: deep learning model analyze medical image
Jaccard Similarity: 0.0

Query: semantic meaning word
Document: healthcare system rely data analytics
Jaccard Similarity: 0.0
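
The last two pairs above score 0.0 because query i is compared with document i by position, and those documents are unrelated to their queries. A small sketch of ranking every document against one query by Jaccard similarity is shown below. The helper name `rank_by_jaccard` is an assumption, and the five documents are the preprocessed strings printed in the output above.

```python
def jaccard(a, b):
    """Jaccard similarity of two whitespace-tokenised strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def rank_by_jaccard(query, docs, k=3):
    """Return the k highest-scoring (score, document) pairs, best first."""
    scored = [(jaccard(query, d), d) for d in docs]
    scored.sort(key=lambda pair: -pair[0])
    return scored[:k]

docs = ["artificial intelligence transforming healthcare",
        "machine learning improves medical diagnosis",
        "doctor use ai tool patient care",
        "deep learning model analyze medical image",
        "healthcare system rely data analytics"]

top = rank_by_jaccard("ai healthcare", docs)
for score, doc in top:
    print(f"{score:.3f} | {doc}")
# 0.200 | artificial intelligence transforming healthcare
# 0.167 | healthcare system rely data analytics
# 0.143 | doctor use ai tool patient care
```

For the top hit, the two texts share one token ("healthcare") out of five distinct tokens in total, which gives 1/5 = 0.2.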



In [13]:
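def wordnet_similarity(word1, word2):
    # Wu-Palmer similarity between the first (most common) sense of each word
    syn1 = wordnet.synsets(word1)
    syn2 = wordnet.synsets(word2)
    if syn1 and syn2:
        score = syn1[0].wup_similarity(syn2[0])
        # some NLTK versions return None when the two senses share no path
        return score if score is not None else 0
    return 0  # at least one word is not in WordNet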

pairs = [("doctor", "physician"), ("student", "learner"), ("car", "vehicle"),
         ("hospital", "clinic"), ("teacher", "instructor")]

for w1, w2 in pairs:
    print(w1, w2, wordnet_similarity(w1, w2))

doctor physician 1.0
student learner 0.7058823529411765
car vehicle 0.8
hospital clinic 0.11764705882352941
teacher instructor 1.0
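
The Wu-Palmer score used above is `2 * depth(LCS) / (depth(a) + depth(b))`, where LCS is the deepest ancestor the two concepts share. Synonyms such as doctor/physician belong to the same WordNet synset, so the LCS is the synset itself and the score is 1.0. The sketch below applies the formula to a tiny hand-made taxonomy. The `PARENT` table is a toy assumption and not WordNet's hierarchy, and the root is counted as depth 1, so its values will not match the WordNet scores printed above.

```python
# Toy taxonomy (child -> parent); hypothetical, for illustration only
PARENT = {
    "person": "entity",
    "professional": "person",
    "doctor": "professional",
    "teacher": "professional",
    "artifact": "entity",
    "vehicle": "artifact",
    "car": "vehicle",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity of two nodes in the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node that is also an ancestor of b
    # is the deepest common ancestor (the taxonomy is a tree).
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wup("doctor", "teacher"))  # 0.75  (LCS = professional, depth 3)
print(wup("doctor", "car"))      # 0.25  (LCS = entity, depth 1)
print(wup("doctor", "doctor"))   # 1.0   (a concept is its own LCS)
```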


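With TF-IDF weighting, cosine similarity gave the most useful rankings on these short texts: rare terms receive more weight than common ones, and the query "medical diagnosis using machine learning" scored 0.906 against its closest document. Jaccard similarity relies purely on exact word overlap, so it performs poorly when synonyms are used: "doctor physician" scored only 0.14 against a document about doctors. Both are lexical methods, and both fail when different words convey the same meaning; for "doctors and physicians" the third cosine result has a score of 0.000, meaning it shares no terms with the query. WordNet-based similarity captures such relationships, scoring doctor/physician and teacher/instructor at 1.0, but as used here it compares single words and only the first sense of each, which may explain the low hospital/clinic score (0.118). Scores from the three methods therefore tend to disagree when texts share meaning but not vocabulary. Of the three, Jaccard is the most brittle, cosine similarity with TF-IDF is the most widely applicable, and WordNet is useful for deeper word-level semantic analysis.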

This lab demonstrated how different text similarity techniques behave on short textual data. Lexical similarity methods (cosine over TF-IDF, Jaccard) are efficient and simple but cannot recognize meaning beyond shared vocabulary. Semantic similarity using WordNet provides deeper insight at the word level but requires external linguistic resources. Combining these methods can lead to more robust NLP systems.
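
As one possible reading of "combining these methods", the sketch below blends a lexical score (Jaccard) with a word-level semantic score using a weight `alpha`. Everything here is an assumption for illustration: the names `combined_score`, `semantic_overlap`, and `alpha`, the equal weighting, and the `TOY_WORD_SIM` table, which stands in for a WordNet lookup and contains only the two synonym pairs that scored 1.0 in this lab.

```python
# Hypothetical word-level similarity table standing in for a WordNet lookup
TOY_WORD_SIM = {
    frozenset(("doctor", "physician")): 1.0,
    frozenset(("teacher", "instructor")): 1.0,
}

def word_sim(w1, w2):
    """Similarity of two words: 1.0 if identical, else the toy table value or 0.0."""
    if w1 == w2:
        return 1.0
    return TOY_WORD_SIM.get(frozenset((w1, w2)), 0.0)

def jaccard(a_tokens, b_tokens):
    """Lexical overlap between two token lists."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def semantic_overlap(q_tokens, d_tokens):
    """Average, over query tokens, of the best word-level match in the document."""
    if not q_tokens or not d_tokens:
        return 0.0
    return sum(max(word_sim(q, d) for d in d_tokens) for q in q_tokens) / len(q_tokens)

def combined_score(query, doc, alpha=0.5):
    """Weighted blend: alpha * lexical + (1 - alpha) * semantic."""
    q, d = query.split(), doc.split()
    return alpha * jaccard(q, d) + (1 - alpha) * semantic_overlap(q, d)

# Lexical overlap is 0.0 but the semantic part is 1.0, so the blend is 0.5
print(combined_score("doctor", "physician treat patient"))
```

With this blend, a query that shares no vocabulary with a document can still receive a non-zero score when its words have close matches in the document.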