
# **Machine Learning Lab 4**
**Name:** Deepak Singh Porte

**Scholar Number:** 25215011122


Similarity Measures

You are given two text documents:

Doc1: Artificial intelligence and machine learning are transforming healthcare by enabling early diagnosis and personalized treatment.

Doc2: Machine learning techniques are widely applied in healthcare to support early disease detection, medical imaging, and treatment recommendations.

# **Question 1: Cosine Similarity (Document Similarity)**

Tasks:
1. Preprocess both documents (convert to lowercase, remove punctuation, split into words).
2. Build a vocabulary of all unique words across both documents.
3. Represent each document as a bag-of-words vector (word counts).
4. Write a function cosine_similarity(vec1, vec2) to compute cosine similarity:
   Cosine(A, B) = (A · B) / (||A|| * ||B||)
5. Compute and print the cosine similarity between Doc1 and Doc2.

In [None]:
import string
from collections import Counter
import math

doc1 = "Artificial intelligence and machine learning are transforming healthcare by enabling early diagnosis and personalized treatment."
doc2 = "Machine learning techniques are widely applied in healthcare to support early disease detection, medical imaging, and treatment recommendations."

def preprocess_document(doc):
    doc_lower = doc.lower()
    doc_no_punct = doc_lower.translate(str.maketrans('', '', string.punctuation))
    words = doc_no_punct.split()
    return words

def calculate_cosine_similarity(words1, words2):
    vocab = sorted(set(words1 + words2))
    freq1 = Counter(words1)
    freq2 = Counter(words2)

    vector1 = [freq1[word] for word in vocab]
    vector2 = [freq2[word] for word in vocab]

    dot_product = sum(v1 * v2 for v1, v2 in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(v1 * v1 for v1 in vector1))
    magnitude2 = math.sqrt(sum(v2 * v2 for v2 in vector2))

    if magnitude1 == 0 or magnitude2 == 0:
        return 0, vocab, vector1, vector2

    cosine_similarity = dot_product / (magnitude1 * magnitude2)
    return cosine_similarity, vocab, vector1, vector2

# Preprocessing
words1 = preprocess_document(doc1)
words2 = preprocess_document(doc2)

print("Preprocessed Documents:")
print(f"Doc1 words: {words1}")
print(f"Doc2 words: {words2}")

# Calculate similarity
similarity, vocabulary, vec1, vec2 = calculate_cosine_similarity(words1, words2)

print(f"\nVocabulary: {vocabulary}")
print(f"Doc1 vector: {vec1}")
print(f"Doc2 vector: {vec2}")
print(f"\nCosine Similarity: {similarity:.4f}")

Preprocessed Documents:
Doc1 words: ['artificial', 'intelligence', 'and', 'machine', 'learning', 'are', 'transforming', 'healthcare', 'by', 'enabling', 'early', 'diagnosis', 'and', 'personalized', 'treatment']
Doc2 words: ['machine', 'learning', 'techniques', 'are', 'widely', 'applied', 'in', 'healthcare', 'to', 'support', 'early', 'disease', 'detection', 'medical', 'imaging', 'and', 'treatment', 'recommendations']

Vocabulary: ['and', 'applied', 'are', 'artificial', 'by', 'detection', 'diagnosis', 'disease', 'early', 'enabling', 'healthcare', 'imaging', 'in', 'intelligence', 'learning', 'machine', 'medical', 'personalized', 'recommendations', 'support', 'techniques', 'to', 'transforming', 'treatment', 'widely']
Doc1 vector: [2, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
Doc2 vector: [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

Cosine Similarity: 0.4573
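
The task list asks for a function `cosine_similarity(vec1, vec2)` that takes vectors, while the cell above works from word lists and builds the vectors internally. A minimal sketch of the vector-based form follows, assuming both vectors are built over the same vocabulary; the variable names `doc1_vector`, `doc2_vector`, and `score` are illustrative, and the two vectors are copied from the output above.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine(A, B) = (A · B) / (||A|| * ||B||) for two equal-length count vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors must have the same length")
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # same convention as the cell above for an all-zero vector
    return dot_product / (norm1 * norm2)

# Bag-of-words vectors copied from the printed output above
doc1_vector = [2, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
doc2_vector = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

# A · B = 8, ||A||^2 = 17, ||B||^2 = 18, so the score is 8 / sqrt(17 * 18)
score = cosine_similarity(doc1_vector, doc2_vector)
print(f"Cosine Similarity: {score:.4f}")  # Cosine Similarity: 0.4573
```

Separating vector construction from the similarity computation makes the function reusable for any pair of count vectors, not only the two documents in this lab.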


# **Question 2: Jaccard Similarity (Set-based Similarity)**

You are given the same two documents.

Tasks:
1. Convert each document into a set of unique words.
2. Write a function jaccard_similarity(set1, set2) to compute Jaccard similarity:
   Jaccard(A, B) = |A ∩ B| / |A ∪ B|
3. Compute and print the Jaccard similarity between the two sets.

In [None]:
import string

doc1 = "Artificial intelligence and machine learning are transforming healthcare by enabling early diagnosis and personalized treatment."
doc2 = "Machine learning techniques are widely applied in healthcare to support early disease detection, medical imaging, and treatment recommendations."

def preprocess_to_set(doc):
    doc_lower = doc.lower()
    doc_no_punct = doc_lower.translate(str.maketrans('', '', string.punctuation))
    words = doc_no_punct.split()
    return set(words)

def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union != 0 else 0

set1 = preprocess_to_set(doc1)
set2 = preprocess_to_set(doc2)

print("Doc1 unique words:", set1)
print("Doc2 unique words:", set2)

similarity = jaccard_similarity(set1, set2)
print(f"Jaccard Similarity: {similarity:.4f}")

Doc1 unique words: {'learning', 'treatment', 'transforming', 'enabling', 'artificial', 'by', 'early', 'healthcare', 'and', 'are', 'intelligence', 'diagnosis', 'machine', 'personalized'}
Doc2 unique words: {'techniques', 'learning', 'applied', 'disease', 'imaging', 'treatment', 'support', 'medical', 'in', 'recommendations', 'early', 'detection', 'healthcare', 'are', 'to', 'and', 'machine', 'widely'}
Jaccard Similarity: 0.2800
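
The value 0.2800 can be traced back to the set sizes. The sketch below recomputes the shared words and the union size for the same two documents; the helper name `to_word_set` and the variables `shared` and `union_size` are illustrative, not part of the cell above.

```python
import string

doc1 = "Artificial intelligence and machine learning are transforming healthcare by enabling early diagnosis and personalized treatment."
doc2 = "Machine learning techniques are widely applied in healthcare to support early disease detection, medical imaging, and treatment recommendations."

def to_word_set(doc):
    """Lowercase, delete punctuation, split on whitespace, and keep unique words."""
    table = str.maketrans('', '', string.punctuation)
    return set(doc.lower().translate(table).split())

set1 = to_word_set(doc1)
set2 = to_word_set(doc2)
shared = set1 & set2

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
union_size = len(set1) + len(set2) - len(shared)

print("Shared words:", sorted(shared))
print("Sizes (|A|, |B|, |A ∩ B|, |A ∪ B|):", len(set1), len(set2), len(shared), union_size)
print(f"Jaccard Similarity: {len(shared) / union_size:.4f}")
```

With 14 unique words in Doc1, 18 in Doc2, and 7 shared, the union has 25 words, giving 7 / 25 = 0.28.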


# **Question 3: Is Cosine Similarity (Document Similarity) Symmetric or Asymmetric?**
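
The formula from Question 1 already suggests the expected answer. A short derivation, where the component symbols a_i and b_i (the word counts of A and B for vocabulary entry i) are notation introduced here for illustration:

```latex
\[
\operatorname{Cosine}(A, B)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_i a_i b_i}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_i b_i a_i}{\lVert B \rVert \, \lVert A \rVert}
  = \operatorname{Cosine}(B, A)
\]
```

Both the dot product and the product of the two norms are unchanged when A and B are swapped, so any difference observed numerically should be at the level of floating-point rounding.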


In [None]:
import string
from collections import Counter
import math

doc1 = """The rapid advancement of renewable energy technologies has fundamentally transformed the global energy landscape over the past two decades.Solar panels have become significantly more efficient and cost-effective, while wind turbines now generate substantial portions of electricity in many countries.
Battery storage systems have evolved to store excess renewable energy for use during peak demand periods, addressing one of the primary challenges of intermittent energy sources.Governments worldwide are investing heavily in smart grid infrastructure to better manage distributed energy generation and consumption patterns.
Electric vehicles are gaining widespread adoption, creating new opportunities for vehicle-to-grid integration and energy storage solutions.The transition to renewable energy is not merely an environmental imperative but also an economic opportunity that creates millions of jobs in manufacturing, installation, and maintenance sectors.
Research institutions continue to develop breakthrough technologies such as perovskite solar cells, floating wind farms, and advanced geothermal systems that promise even greater efficiency and accessibility."""

doc2 = """Modern urban planning faces unprecedented challenges as cities continue to grow at an exponential rate, requiring innovative approaches to accommodate increasing populations while maintaining livability and sustainability.
Smart city initiatives integrate Internet of Things sensors, data analytics, and artificial intelligence to optimize traffic flow, reduce energy consumption, and improve public services delivery.Green building standards and sustainable architecture practices are becoming mandatory in many metropolitan areas, emphasizing energy efficiency, water conservation, and reduced carbon footprints.
Public transportation systems are being revolutionized through electric buses, autonomous vehicles, and integrated mobility platforms that seamlessly connect different modes of transport.Urban agriculture and vertical farming projects are emerging as solutions to food security concerns while reducing transportation costs and environmental impact. Mixed-use developments combine residential, commercial, and recreational spaces to create walkable neighborhoods that reduce dependency on private vehicles.
Climate resilience planning addresses rising sea levels, extreme weather events, and heat island effects through innovative infrastructure design and community preparedness programs."""

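def preprocess_document(doc):
    doc_lower = doc.lower()
    # Punctuation is deleted, not replaced by a space, so words separated only by
    # punctuation are fused: "decades.Solar" -> "decadessolar".
    doc_no_punct = doc_lower.translate(str.maketrans('', '', string.punctuation))
    words = doc_no_punct.split()
    return words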

def calculate_cosine_similarity(words1, words2):
    vocab = sorted(set(words1 + words2))
    freq1 = Counter(words1)
    freq2 = Counter(words2)

    vector1 = [freq1[word] for word in vocab]
    vector2 = [freq2[word] for word in vocab]

    dot_product = sum(v1 * v2 for v1, v2 in zip(vector1, vector2))
    magnitude1 = math.sqrt(sum(v1 * v1 for v1 in vector1))
    magnitude2 = math.sqrt(sum(v2 * v2 for v2 in vector2))

    if magnitude1 == 0 or magnitude2 == 0:
        return 0, vocab, vector1, vector2

    cosine_similarity = dot_product / (magnitude1 * magnitude2)
    return cosine_similarity, vocab, vector1, vector2

words1 = preprocess_document(doc1)
words2 = preprocess_document(doc2)

print("Preprocessed Documents:")
print(f"Doc1 words: {words1}")
print(f"Doc2 words: {words2}")

similarity_1_2, vocabulary, vec1, vec2 = calculate_cosine_similarity(words1, words2)
similarity_2_1, _, _, _ = calculate_cosine_similarity(words2, words1)

print(f"\nVocabulary: {vocabulary}")
print(f"Doc1 vector: {vec1}")
print(f"Doc2 vector: {vec2}")

print(f"\nCosine Similarity (Doc1 -> Doc2): {similarity_1_2:.6f}")
print(f"Cosine Similarity (Doc2 -> Doc1): {similarity_2_1:.6f}")

is_symmetric = abs(similarity_1_2 - similarity_2_1) < 1e-10
if is_symmetric:
    print("Result: Cosine similarity is SYMMETRIC")
else:
    print("Result: Cosine similarity is ASYMMETRIC")

Preprocessed Documents:
Doc1 words: ['the', 'rapid', 'advancement', 'of', 'renewable', 'energy', 'technologies', 'has', 'fundamentally', 'transformed', 'the', 'global', 'energy', 'landscape', 'over', 'the', 'past', 'two', 'decadessolar', 'panels', 'have', 'become', 'significantly', 'more', 'efficient', 'and', 'costeffective', 'while', 'wind', 'turbines', 'now', 'generate', 'substantial', 'portions', 'of', 'electricity', 'in', 'many', 'countries', 'battery', 'storage', 'systems', 'have', 'evolved', 'to', 'store', 'excess', 'renewable', 'energy', 'for', 'use', 'during', 'peak', 'demand', 'periods', 'addressing', 'one', 'of', 'the', 'primary', 'challenges', 'of', 'intermittent', 'energy', 'sourcesgovernments', 'worldwide', 'are', 'investing', 'heavily', 'in', 'smart', 'grid', 'infrastructure', 'to', 'better', 'manage', 'distributed', 'energy', 'generation', 'and', 'consumption', 'patterns', 'electric', 'vehicles', 'are', 'gaining', 'widespread', 'adoption', 'creating', 'new', 'opportuniti
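
The token list above contains fused tokens such as 'decadessolar' and 'sourcesgovernments': the source text has no space after some periods, and the punctuation is deleted before splitting. One possible alternative, offered as a sketch and not as the method used in this lab, maps each punctuation character to a space instead. The helper name `preprocess_with_spaces` is hypothetical, and `sample` is a fragment of Doc1.

```python
import string

def preprocess_with_spaces(doc):
    """Lowercase, map every punctuation character to a space, then split on whitespace."""
    to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return doc.lower().translate(to_space).split()

sample = "over the past two decades.Solar panels have become significantly more efficient and cost-effective"
tokens = preprocess_with_spaces(sample)
print(tokens)
# ['over', 'the', 'past', 'two', 'decades', 'solar', 'panels', 'have', 'become',
#  'significantly', 'more', 'efficient', 'and', 'cost', 'effective']
```

The trade-off is that hyphenated words such as "cost-effective" are split into two tokens. The symmetry result does not depend on this choice, since both directions use the same token lists.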

# **Question 4: Is Jaccard Similarity (Set-based Similarity) Symmetric or Asymmetric?**

In [None]:
import string  # needed by preprocess_document; doc1 and doc2 are reused from the Question 3 cell

def preprocess_document(doc):
    doc_lower = doc.lower()
    doc_no_punct = doc_lower.translate(str.maketrans('', '', string.punctuation))
    words = doc_no_punct.split()
    return words

def calculate_jaccard_similarity(words1, words2):
    set1 = set(words1)
    set2 = set(words2)

    intersection = set1 & set2
    union = set1 | set2

    if len(union) == 0:
        return 0, set1, set2, intersection, union

    jaccard_similarity = len(intersection) / len(union)
    return jaccard_similarity, set1, set2, intersection, union

words1 = preprocess_document(doc1)
words2 = preprocess_document(doc2)

print("Preprocessed Documents:")
print(f"Doc1 words: {words1}")
print(f"Doc2 words: {words2}")

similarity_1_2, set1, set2, intersection, union = calculate_jaccard_similarity(words1, words2)
# Only the score is needed for the reversed order; the remaining return values are discarded.
similarity_2_1, *_ = calculate_jaccard_similarity(words2, words1)

print("\nSet Information:")
print(f"Doc1 unique words: {sorted(set1)}")
print(f"Doc2 unique words: {sorted(set2)}")
print(f"Intersection: {sorted(intersection)}")
print(f"Union: {sorted(union)}")

print("\nSet Sizes:")
print(f"Doc1 set size: {len(set1)}")
print(f"Doc2 set size: {len(set2)}")
print(f"Intersection size: {len(intersection)}")
print(f"Union size: {len(union)}")

print(f"\nJaccard Similarity (Doc1 -> Doc2): {similarity_1_2:.6f}")
print(f"Jaccard Similarity (Doc2 -> Doc1): {similarity_2_1:.6f}")

is_symmetric = abs(similarity_1_2 - similarity_2_1) < 1e-10
if is_symmetric:
    print("Result: Jaccard similarity is SYMMETRIC")
else:
    print("Result: Jaccard similarity is ASYMMETRIC")

Preprocessed Documents:
Doc1 words: ['the', 'rapid', 'advancement', 'of', 'renewable', 'energy', 'technologies', 'has', 'fundamentally', 'transformed', 'the', 'global', 'energy', 'landscape', 'over', 'the', 'past', 'two', 'decadessolar', 'panels', 'have', 'become', 'significantly', 'more', 'efficient', 'and', 'costeffective', 'while', 'wind', 'turbines', 'now', 'generate', 'substantial', 'portions', 'of', 'electricity', 'in', 'many', 'countries', 'battery', 'storage', 'systems', 'have', 'evolved', 'to', 'store', 'excess', 'renewable', 'energy', 'for', 'use', 'during', 'peak', 'demand', 'periods', 'addressing', 'one', 'of', 'the', 'primary', 'challenges', 'of', 'intermittent', 'energy', 'sourcesgovernments', 'worldwide', 'are', 'investing', 'heavily', 'in', 'smart', 'grid', 'infrastructure', 'to', 'better', 'manage', 'distributed', 'energy', 'generation', 'and', 'consumption', 'patterns', 'electric', 'vehicles', 'are', 'gaining', 'widespread', 'adoption', 'creating', 'new', 'opportuniti
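
Independently of the numerical check, the symmetry of Jaccard similarity follows from the formula in Question 2, because set intersection and set union are both commutative. A short derivation in the same notation:

```latex
\[
\operatorname{Jaccard}(A, B)
  = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
  = \frac{\lvert B \cap A \rvert}{\lvert B \cup A \rvert}
  = \operatorname{Jaccard}(B, A)
\]
```

Since the numerator and denominator are integer set sizes, the two orders divide the same pair of integers and should produce identical floating-point results.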