Problem Statement:
 
Given a list of subtopics and a list of main topics, group each subtopic under its corresponding main topic.

Subtopics:

Absorption in small intestine
Salivary digestion
Function of mucus in digestive system
Effect of acid on plant tissues
Nervous coordination in digestion
Taste perception
Function of salivary glands
Types and functions of teeth
Relationship between taste and smell
Structure of villus and coordination of digestive and circulatory systems
Digestion in small intestine
Peristaltic movement in the digestive system
Human dental formula
Parts of the digestive system
Stomach reflexes
Valves in the digestive system
Hormones related to hunger
Digestive enzymes and their sources
Pancreatic hormones
Hormones related to satiety


Main topics:

1. Nutrition in Animals
2. Digestion and Absorption
3. Control and Coordination

You have to:


Document every approach you used, with code, and explain why each one gives the results it does.


## Keyword-based Mapping
### This approach classifies subtopics into main topics by matching predefined keywords.
##### A dictionary, keyword_map, maps each main topic to a list of keywords.
##### Each keyword is a term expected to appear in the subtopics of that main topic.
##### For example:
##### Nutrition in Animals ---> teeth, salivary, taste, relationship
##### Digestion and Absorption ---> digestive, enzymes, absorption
##### Control and Coordination ---> hormones, nervous, coordination
##### These keywords capture the vocabulary typical of each main topic, which is what lets a subtopic be matched to it.

In [1]:
subtopics = [
    "Absorption in small intestine",
    "Salivary digestion",
    "Function of mucus in digestive system",
    "Effect of acid on plant tissues",
    "Nervous coordination in digestion",
    "Taste perception",
    "Function of salivary glands",
    "Types and functions of teeth",
    "Relationship between taste and smell",
    "Structure of villus and coordination of digestive and circulatory systems",
    "Digestion in small intestine",
    "Peristaltic movement in the digestive system",
    "Human dental formula",
    "Parts of the digestive system",
    "Stomach reflexes",
    "Valves in the digestive system",
    "Hormones related to hunger",
    "Digestive enzymes and their sources",
    "Pancreatic hormones",
    "Hormones related to satiety"
]

In [2]:
main_topics = [
    "Nutrition in Animals",
    "Digestion and Absorption",
    "Control and Coordination"
]

In [3]:
# The approach:
# keyword_map associates each main topic with a list of relevant keywords.
# For each subtopic, check whether any keyword of a main topic occurs in it.
# On the first match, assign the subtopic to that main topic and stop checking the remaining topics.

In [4]:
def keyword_based_mapping(subtopics):
    keyword_map = {
        "Nutrition in Animals": ["teeth", "salivary", "taste", "relationship"],
        "Digestion and Absorption": ["digestive", "enzymes", "absorption", "small intestine", "stomach", "peristaltic", "villus", "valves"],
        "Control and Coordination": ["hormones", "nervous", "coordination", "reflexes", "perception"]
    }
    # One empty list per main topic; matched subtopics are appended below.
    grouped = {topic: [] for topic in main_topics}

    for subtopic in subtopics:
        for topic, keywords in keyword_map.items():
            # Lowercase the subtopic so matching is case-insensitive.
            # Topics are checked in dictionary order, so the first topic with a matching keyword wins.
            # A subtopic that matches no keyword is left out of the result.
            if any(keyword in subtopic.lower() for keyword in keywords):
                grouped[topic].append(subtopic)
                break

    return grouped

In [5]:
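# Pass a list: a bare string is iterated character by character, so no keyword ever matches.
print(keyword_based_mapping(["Absorption in small intestine"]))

{'Nutrition in Animals': [], 'Digestion and Absorption': ['Absorption in small intestine'], 'Control and Coordination': []}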


In [6]:
print("Keyword-based Mapping Results:")
print(keyword_based_mapping(subtopics))


Keyword-based Mapping Results:
{'Nutrition in Animals': ['Salivary digestion', 'Taste perception', 'Function of salivary glands', 'Types and functions of teeth', 'Relationship between taste and smell'], 'Digestion and Absorption': ['Absorption in small intestine', 'Function of mucus in digestive system', 'Structure of villus and coordination of digestive and circulatory systems', 'Digestion in small intestine', 'Peristaltic movement in the digestive system', 'Parts of the digestive system', 'Stomach reflexes', 'Valves in the digestive system', 'Digestive enzymes and their sources'], 'Control and Coordination': ['Nervous coordination in digestion', 'Hormones related to hunger', 'Pancreatic hormones', 'Hormones related to satiety']}


In [7]:
def results(grouped):
    # Print each main topic followed by its grouped subtopics.
    for topic, subs in grouped.items():
        print(f"\n{topic}:")
        for sub in subs:
            print(f"  - {sub}")

print("Keyword-based Mapping Results:")
results(keyword_based_mapping(subtopics))

Keyword-based Mapping Results:

Nutrition in Animals:
  - Salivary digestion
  - Taste perception
  - Function of salivary glands
  - Types and functions of teeth
  - Relationship between taste and smell

Digestion and Absorption:
  - Absorption in small intestine
  - Function of mucus in digestive system
  - Structure of villus and coordination of digestive and circulatory systems
  - Digestion in small intestine
  - Peristaltic movement in the digestive system
  - Parts of the digestive system
  - Stomach reflexes
  - Valves in the digestive system
  - Digestive enzymes and their sources

Control and Coordination:
  - Nervous coordination in digestion
  - Hormones related to hunger
  - Pancreatic hormones
  - Hormones related to satiety
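
The listing above contains 18 of the 20 subtopics: a subtopic that matches no keyword is silently dropped. A minimal sketch of a variant that also reports the unmatched subtopics is shown here. The function name keyword_mapping_with_unmatched and the shortened sample list are illustrative assumptions, not part of the original notebook.

```python
def keyword_mapping_with_unmatched(subtopics, keyword_map):
    """Group subtopics by the first matching topic; also return those with no match."""
    grouped = {topic: [] for topic in keyword_map}
    unmatched = []
    for subtopic in subtopics:
        text = subtopic.lower()
        for topic, keywords in keyword_map.items():
            if any(keyword in text for keyword in keywords):
                grouped[topic].append(subtopic)
                break
        else:
            # The inner loop ended without a break, so no topic matched.
            unmatched.append(subtopic)
    return grouped, unmatched


# Same keyword lists as in the keyword_based_mapping cell.
keyword_map = {
    "Nutrition in Animals": ["teeth", "salivary", "taste", "relationship"],
    "Digestion and Absorption": ["digestive", "enzymes", "absorption", "small intestine",
                                 "stomach", "peristaltic", "villus", "valves"],
    "Control and Coordination": ["hormones", "nervous", "coordination", "reflexes", "perception"],
}

# Hypothetical short sample taken from the subtopic list.
sample = [
    "Salivary digestion",
    "Effect of acid on plant tissues",
    "Human dental formula",
    "Stomach reflexes",
]

grouped, unmatched = keyword_mapping_with_unmatched(sample, keyword_map)
print("Unmatched:", unmatched)
```

Returning the unmatched items separately makes it easier to see which keywords are missing from keyword_map before trusting the grouping.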


## Similarity-based Mapping
### This second approach uses NLP to measure the similarity between subtopics and main topics with TF-IDF vectorization and cosine similarity.
##### TF-IDF converts each text into a numerical vector whose entries weight a word by how often it occurs in that text and how rare it is across all texts.
##### Cosine similarity compares two vectors, here a subtopic and a main topic, by the cosine of the angle between them.
##### A cosine similarity of 1 means the vectors point in the same direction; 0 means the texts share no words.
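
To make the cosine measure concrete, here is a small hand-rolled sketch on made-up three-dimensional vectors. The helper cosine and the numbers are illustrative only; the notebook itself relies on scikit-learn's cosine_similarity.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention used here: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine([1, 2, 0], [2, 4, 0])
# No dimension is non-zero in both vectors: similarity 0.
no_shared_terms = cosine([1, 0, 0], [0, 3, 5])
# One shared dimension out of two each: similarity close to 0.5.
partial_overlap = cosine([1, 1, 0], [1, 0, 1])

print(same_direction, no_shared_terms, partial_overlap)
```

Because the measure depends only on direction, a long text and a short text using the same words in the same proportions score as highly similar.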

In [8]:
#TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_based_mapping(subtopics, main_topics):
    
    vectorizer = TfidfVectorizer()
    
    # Fit on subtopics and main topics together so both share one vocabulary.
    vectors = vectorizer.fit_transform(subtopics + main_topics)

    # Split the rows back into the two groups.
    subtopic_vectors = vectors[:len(subtopics)]
    main_topic_vectors = vectors[len(subtopics):]

    # Rows are subtopics, columns are main topics.
    similarity_matrix = cosine_similarity(subtopic_vectors, main_topic_vectors)

    grouped = {topic: [] for topic in main_topics}

    for i, subtopic in enumerate(subtopics):
        # argmax picks the main topic with the highest similarity score.
        # If every score in the row is 0, argmax returns index 0, the first main topic.
        # The subtopic is then appended to that main topic's list in grouped.
        best_match_idx = similarity_matrix[i].argmax()
        grouped[main_topics[best_match_idx]].append(subtopic)

    return grouped

In [9]:
print("\nSimilarity-based Mapping Results:")
print(similarity_based_mapping(subtopics, main_topics))


Similarity-based Mapping Results:
{'Nutrition in Animals': ['Function of mucus in digestive system', 'Effect of acid on plant tissues', 'Taste perception', 'Function of salivary glands', 'Peristaltic movement in the digestive system', 'Human dental formula', 'Parts of the digestive system', 'Stomach reflexes', 'Valves in the digestive system', 'Hormones related to hunger', 'Pancreatic hormones', 'Hormones related to satiety'], 'Digestion and Absorption': ['Absorption in small intestine', 'Salivary digestion', 'Types and functions of teeth', 'Relationship between taste and smell', 'Digestion in small intestine', 'Digestive enzymes and their sources'], 'Control and Coordination': ['Nervous coordination in digestion', 'Structure of villus and coordination of digestive and circulatory systems']}


In [10]:
print("\nSimilarity-based Mapping Results:")
results(similarity_based_mapping(subtopics, main_topics))


Similarity-based Mapping Results:

Nutrition in Animals:
  - Function of mucus in digestive system
  - Effect of acid on plant tissues
  - Taste perception
  - Function of salivary glands
  - Peristaltic movement in the digestive system
  - Human dental formula
  - Parts of the digestive system
  - Stomach reflexes
  - Valves in the digestive system
  - Hormones related to hunger
  - Pancreatic hormones
  - Hormones related to satiety

Digestion and Absorption:
  - Absorption in small intestine
  - Salivary digestion
  - Types and functions of teeth
  - Relationship between taste and smell
  - Digestion in small intestine
  - Digestive enzymes and their sources

Control and Coordination:
  - Nervous coordination in digestion
  - Structure of villus and coordination of digestive and circulatory systems
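
Several of the assignments above look arbitrary, and part of the reason can be checked directly. When a subtopic shares no word with any main topic, its similarity row is all zeros and argmax falls back to index 0, which is "Nutrition in Animals". With the default TfidfVectorizer, short words such as "in" and "and" also count as shared vocabulary. The sketch below lists the subtopics with an all-zero row, with and without scikit-learn's built-in English stop-word list. The helper name zero_overlap_subtopics and the three-item sample are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def zero_overlap_subtopics(subtopics, main_topics, stop_words=None):
    """Return the subtopics whose similarity to every main topic is zero."""
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    vectors = vectorizer.fit_transform(subtopics + main_topics)
    sims = cosine_similarity(vectors[:len(subtopics)], vectors[len(subtopics):])
    # A row with no positive entry means no shared vocabulary at all.
    return [s for s, row in zip(subtopics, sims) if row.max() <= 1e-12]


main_topics = ["Nutrition in Animals", "Digestion and Absorption", "Control and Coordination"]

# Hypothetical short sample taken from the subtopic list.
sample = [
    "Taste perception",
    "Function of mucus in digestive system",
    "Absorption in small intestine",
]

default_zero = zero_overlap_subtopics(sample, main_topics)
filtered_zero = zero_overlap_subtopics(sample, main_topics, stop_words="english")

print("No overlap (default tokens):", default_zero)
print("No overlap (stop words removed):", filtered_zero)
```

A subtopic that appears in the second list but not in the first was matched only through a stop word, so its assignment carries no topical information.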


## LSA - Latent Semantic Analysis
#### LSA tries to uncover hidden (latent) relationships between words from the way they co-occur across texts.
#### Each text is first turned into a TF-IDF vector.
#### Truncated Singular Value Decomposition (SVD) then reduces these vectors to a small number of components.
#### In the reduced space, texts that use related words can end up close together even if they share no exact word.
#### cosine_similarity measures how close each subtopic's reduced vector is to each main topic's reduced vector.

In [39]:
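from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa(subtopics, main_topics):
    # TF-IDF vectors for subtopics and main topics over a shared vocabulary.
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(subtopics + main_topics)

    # Reduce to 10 latent components. The default solver is randomized,
    # so the grouping can change between runs unless random_state is set.
    svd = TruncatedSVD(n_components=10)
    reduced_vectors = svd.fit_transform(vectors)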

    subtopic_vectors = reduced_vectors[:len(subtopics)]
    main_topic_vectors = reduced_vectors[len(subtopics):]

    similarity_matrix = cosine_similarity(subtopic_vectors, main_topic_vectors)

    grouped = {topic: [] for topic in main_topics}
    for i, subtopic in enumerate(subtopics):
        best_match_idx = similarity_matrix[i].argmax()
        grouped[main_topics[best_match_idx]].append(subtopic)

    return grouped

In [41]:
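# The function defined above is named lsa; a separate variable name keeps the results() helper usable.
grouped_lsa = lsa(subtopics, main_topics)
print("\nLSA Results:")
for topic, subs in grouped_lsa.items():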
    print(f"\n{topic}:")
    for sub in subs:
        print(f"  - {sub}")



LSA Results:

Nutrition in Animals:
  - Function of mucus in digestive system
  - Effect of acid on plant tissues
  - Taste perception
  - Peristaltic movement in the digestive system
  - Parts of the digestive system
  - Valves in the digestive system
  - Hormones related to satiety

Digestion and Absorption:
  - Absorption in small intestine
  - Salivary digestion
  - Function of salivary glands
  - Relationship between taste and smell
  - Digestion in small intestine
  - Hormones related to hunger
  - Digestive enzymes and their sources

Control and Coordination:
  - Nervous coordination in digestion
  - Types and functions of teeth
  - Structure of villus and coordination of digestive and circulatory systems
  - Human dental formula
  - Stomach reflexes
  - Pancreatic hormones


## Bag of Words (BoW) + Cosine Similarity

#### Each subtopic and each main topic is represented as a bag of words.
#### First, create the bag for each text by counting how many times each word appears.
#### Word order is ignored; only the counts matter.
#### Cosine similarity measures how much two bags of words overlap.
#### Finally, assign each subtopic to the main topic whose bag of words is most similar.
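
As a small illustration of what ignoring word order means, the sketch below builds bags with collections.Counter instead of CountVectorizer. The tokenizer here (lowercase, split on whitespace) is a simplifying assumption and differs from CountVectorizer's default, which also drops one-character tokens.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring order (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


bag_a = bag_of_words("Digestion in small intestine")
bag_b = bag_of_words("small intestine in digestion")
main_bag = bag_of_words("Digestion and Absorption")

# The same words in a different order produce the same bag.
print(bag_a == bag_b)

# Counter & Counter keeps the words present in both bags (minimum of the counts).
shared = bag_a & main_bag
print(dict(shared))
```

The shared words are what cosine similarity ends up rewarding, which is why a subtopic with no word in common with any main topic cannot be placed meaningfully by this approach.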


In [36]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bow_mapping(subtopics, main_topics):
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(subtopics + main_topics)

    subtopic_vectors = vectors[:len(subtopics)]
    main_topic_vectors = vectors[len(subtopics):]

    similarity_matrix = cosine_similarity(subtopic_vectors, main_topic_vectors)

    grouped = {topic: [] for topic in main_topics}

    for i, subtopic in enumerate(subtopics):
        best_match_idx = similarity_matrix[i].argmax()
        grouped[main_topics[best_match_idx]].append(subtopic)

    return grouped


In [38]:
grouped_bow = bow_mapping(subtopics, main_topics)
print("\nBag of Words Results:")
for topic, subs in grouped_bow.items():
    print(f"\n{topic}:")
    for sub in subs:
        print(f"  - {sub}")



Bag of Words Results:

Nutrition in Animals:
  - Absorption in small intestine
  - Function of mucus in digestive system
  - Effect of acid on plant tissues
  - Nervous coordination in digestion
  - Taste perception
  - Function of salivary glands
  - Digestion in small intestine
  - Peristaltic movement in the digestive system
  - Human dental formula
  - Parts of the digestive system
  - Stomach reflexes
  - Valves in the digestive system
  - Hormones related to hunger
  - Pancreatic hormones
  - Hormones related to satiety

Digestion and Absorption:
  - Salivary digestion
  - Types and functions of teeth
  - Relationship between taste and smell
  - Digestive enzymes and their sources

Control and Coordination:
  - Structure of villus and coordination of digestive and circulatory systems
