<a href="https://colab.research.google.com/github/ankesh86/RecommendationSystems/blob/main/MCQArecommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, manhattan_distances, euclidean_distances
from sklearn.feature_extraction.text import TfidfVectorizer

import re
from gensim import models
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style
%matplotlib inline
from gensim.models import FastText as ft
from IPython.display import Image
import os
import json

# **Loading Data**

In [75]:
import pandas as pd
import json

# Load the JSON data from the file
with open('sample_data/test.json', 'r') as file:
    data = [json.loads(line) for line in file.read().splitlines() if line.strip()]

# Create a DataFrame from the JSON data
df = pd.DataFrame(data)

In [76]:
df['MCQAid'] = df.reset_index().index + 1

In [77]:
df.head()

Unnamed: 0,question,opa,opb,opc,opd,subject_name,topic_name,id,choice_type,MCQAid
0,Which of the following is derived from fibrobl...,TGF-13,MMP2,Collagen,Angiopoietin,Pathology,,84f328d3-fca4-422d-8fb2-19d55eb31503,single,1
1,In Alleged history of gun shot injury.there is...,Close shot entry wound,Close shot exit wound,Distant shot entry wound,distant shot exit wound,Forensic Medicine,,bb85e248-b2e9-48e8-a887-67c1aff15b6d,multi,2
2,Which macrolide is active against Mycobaterium...,Azithromycin,Roxithromycin,Clarithromycin,Framycetin,Pharmacology,,f6ce5597-c646-4a2b-8767-764f185be603,single,3
3,Xanthenuric acid is produced in metabolism of?,Tyrosine,Glycine,Methionine,Tryptophan,Unknown,,21fe4c49-f0cd-4c31-8eec-3966bbfb963e,single,4
4,Most common site of direct hernia,Hesselbach's triangle,Femoral gland,No site predilection,,Surgery,,9c82e23e-714e-422b-a4fa-e89aef919819,multi,5


In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6150 entries, 0 to 6149
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   question      6150 non-null   object
 1   opa           6150 non-null   object
 2   opb           6150 non-null   object
 3   opc           6150 non-null   object
 4   opd           6150 non-null   object
 5   subject_name  6150 non-null   object
 6   topic_name    0 non-null      object
 7   id            6150 non-null   object
 8   choice_type   6150 non-null   object
 9   MCQAid        6150 non-null   int64 
dtypes: int64(1), object(9)
memory usage: 480.6+ KB


In [83]:
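# Note: this cell was executed after the pre-processing cell (In [81]), so the
# shape reflects the filtered frame with the added StringCombined column.
df.shape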

(5468, 11)

In [80]:
df.isnull().sum(axis=0)

question           0
opa                0
opb                0
opc                0
opd                0
subject_name       0
topic_name      6150
id                 0
choice_type        0
MCQAid             0
dtype: int64

# **Text pre-processing**

In [81]:
# Combine the question and its four options into a single text field
df['StringCombined'] = df['question'] + ' ' + df['opa'] + ' ' + df['opb'] + ' ' + df['opc'] + ' ' + df['opd']

# Drop questions whose subject is not known
df = df[df['subject_name'] != 'Unknown']

# Drop duplicates, keeping the first record; .copy() avoids SettingWithCopyWarning below
unique_df = df.drop_duplicates(subset=['StringCombined'], keep='first').copy()

# Convert the combined text to lower case
unique_df['mcqa_lowered'] = unique_df['StringCombined'].str.lower()

# Remove punctuation and other special characters (anything that is not a word character or whitespace)
unique_df['mcqa_lowered'] = unique_df['mcqa_lowered'].str.replace(r'[^\w\s]', '', regex=True)

# Reset the index so that row labels match row positions, then keep the texts as a list
unique_df = unique_df.reset_index(drop=True)
desc_list = list(unique_df['mcqa_lowered'])

In [82]:
unique_df.head()

Unnamed: 0,question,opa,opb,opc,opd,subject_name,topic_name,id,choice_type,MCQAid,StringCombined,mcqa_lowered
0,Which of the following is derived from fibrobl...,TGF-13,MMP2,Collagen,Angiopoietin,Pathology,,84f328d3-fca4-422d-8fb2-19d55eb31503,single,1,Which of the following is derived from fibrobl...,which of the following is derived from fibrobl...
1,In Alleged history of gun shot injury.there is...,Close shot entry wound,Close shot exit wound,Distant shot entry wound,distant shot exit wound,Forensic Medicine,,bb85e248-b2e9-48e8-a887-67c1aff15b6d,multi,2,In Alleged history of gun shot injury.there is...,in alleged history of gun shot injurythere is ...
2,Which macrolide is active against Mycobaterium...,Azithromycin,Roxithromycin,Clarithromycin,Framycetin,Pharmacology,,f6ce5597-c646-4a2b-8767-764f185be603,single,3,Which macrolide is active against Mycobaterium...,which macrolide is active against mycobaterium...
3,Most common site of direct hernia,Hesselbach's triangle,Femoral gland,No site predilection,,Surgery,,9c82e23e-714e-422b-a4fa-e89aef919819,multi,5,Most common site of direct hernia Hesselbach's...,most common site of direct hernia hesselbachs ...
4,Resistance to lateral shifting or anteroposter...,Retention.,Stability.,Support.,None.,Dental,,ae87fc33-56eb-43b1-a558-e8e2bb918349,multi,6,Resistance to lateral shifting or anteroposter...,resistance to lateral shifting or anteroposter...


# **Word Embedding**

In [84]:
# Count vectorizer with English stop words removed
cnt_vec = CountVectorizer(stop_words = 'english')

In [85]:
# TF-IDF vectorizer over word 1- to 3-grams, with English stop words removed
tfidf_vec = TfidfVectorizer(stop_words='english', analyzer='word', ngram_range=(1,3))

# **Similarity measures**

In [86]:
# Euclidean distance
def find_euclidean_distances(sim_matrix, index, n=10):
    # Pair each row position with its distance to the query row
    result = list(enumerate(sim_matrix[index]))

    # Sort ascending (smaller distance = more similar); skip the first entry
    # (normally the query itself, at distance 0) and keep the next n
    sorted_result = sorted(result, key=lambda x: x[1], reverse=False)[1:n+1]

    # Map positions back to the question text and subject
    similar_products = [{'value': unique_df.iloc[x[0]]['question'], 'score': round(x[1], 2), 'subject_name': unique_df.iloc[x[0]]['subject_name']} for x in sorted_result]

    return similar_products

In [87]:
# Cosine similarity
def find_similarity(cosine_sim_matrix, index, n=10):

    # Pair each row position with its cosine similarity to the query row
    result = list(enumerate(cosine_sim_matrix[index]))

    # Sort descending (larger score = more similar); skip the first entry
    # (normally the query itself, at similarity 1) and keep the next n
    sorted_result = sorted(result, key=lambda x: x[1], reverse=True)[1:n+1]

    # Map positions back to the question text and subject
    similar_products = [{'value': unique_df.iloc[x[0]]['question'], 'score': round(x[1], 2), 'subject_name': unique_df.iloc[x[0]]['subject_name']} for x in sorted_result]

    return similar_products

In [88]:
# Manhattan distance
def find_manhattan_distance(sim_matrix, index, n=10):
    # Pair each row position with its distance to the query row
    result = list(enumerate(sim_matrix[index]))

    # Sort ascending (smaller distance = more similar); skip the first entry
    # (normally the query itself, at distance 0) and keep the next n
    sorted_result = sorted(result, key=lambda x: x[1], reverse=False)[1:n+1]

    # Map positions back to the question text and subject
    similar_products = [{'value': unique_df.iloc[x[0]]['question'], 'score': round(x[1], 2), 'subject_name': unique_df.iloc[x[0]]['subject_name']} for x in sorted_result]

    return similar_products
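
The three helper functions drop the query by skipping the first sorted entry. That works when the query sorts first, but an exact duplicate or a tie can push it to another position. A minimal sketch of an alternative that removes the query by its position instead is shown below; `top_n_excluding_self` is a hypothetical helper name, not part of the notebook, and it operates on a single row of a similarity or distance matrix.

```python
import numpy as np

def top_n_excluding_self(scores, index, n=10, largest=True):
    """Return (position, score) pairs for the n best matches to row `index`.

    `scores` is one row of a similarity or distance matrix. Use largest=True
    for similarities (cosine) and largest=False for distances.
    """
    scores = np.asarray(scores, dtype=float)
    # Stable sort so that tied scores keep their original order.
    order = np.argsort(-scores if largest else scores, kind="stable")
    # Drop the query by position instead of assuming it sorts first.
    order = order[order != index][:n]
    return [(int(i), float(scores[i])) for i in order]

# Toy distance row: item 0 is an exact duplicate of the query (item 2).
distances = [0.0, 3.0, 0.0, 1.0]
print(top_n_excluding_self(distances, index=2, n=2, largest=False))
```

In this toy row, slicing off the first sorted entry would discard the duplicate at position 0 and return the query itself, whereas filtering by position keeps the duplicate and removes the query.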

# **Recommendation using Count Vectorizer**

In [89]:
# Function to get recommendations using Count Vectorizer
def get_recommendation_cv(MCQAid, df, similarity, n=10):
    row = df.loc[df['MCQAid'] == MCQAid]
    if row.empty:
        print(f"No question found with id {MCQAid}")
        return []

    # Row label of the query; it equals the row position because df was re-indexed
    index = list(row.index)[0]

    # Create a list of descriptions
    desc_list = df['mcqa_lowered'].tolist()

    # Create vectors using a fresh CountVectorizer. Note that this one keeps
    # stop words, unlike the module-level cnt_vec defined earlier.
    cnt_vec = CountVectorizer()
    count_vector = cnt_vec.fit_transform(desc_list)

    if similarity == "cosine":
        sim_matrix = cosine_similarity(count_vector)
        mcqas = find_similarity(sim_matrix, index, n)
    elif similarity == "manhattan":
        sim_matrix = manhattan_distances(count_vector)
        mcqas = find_manhattan_distance(sim_matrix, index, n)
    else:
        # Any other value falls back to Euclidean distance
        sim_matrix = euclidean_distances(count_vector)
        mcqas = find_euclidean_distances(sim_matrix, index, n)

    return mcqas

In [90]:
unique_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5468 entries, 0 to 5467
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question        5468 non-null   object
 1   opa             5468 non-null   object
 2   opb             5468 non-null   object
 3   opc             5468 non-null   object
 4   opd             5468 non-null   object
 5   subject_name    5468 non-null   object
 6   topic_name      0 non-null      object
 7   id              5468 non-null   object
 8   choice_type     5468 non-null   object
 9   MCQAid          5468 non-null   int64 
 10  StringCombined  5468 non-null   object
 11  mcqa_lowered    5468 non-null   object
dtypes: int64(1), object(11)
memory usage: 512.8+ KB


In [93]:
MCQAid = 10
# Cosine Similarity
get_recommendation_cv(MCQAid, unique_df, similarity = "cosine", n=10)

[{'value': 'Which of the following is derived from the neural tube except?',
  'score': 0.4,
  'subject_name': 'Anatomy'},
 {'value': 'All are true about skin except:',
  'score': 0.4,
  'subject_name': 'Skin'},
 {'value': 'All are true regarding mitochondrial DNA, EXCEPT ?',
  'score': 0.35,
  'subject_name': 'Biochemistry'},
 {'value': 'All are androgens except ?',
  'score': 0.32,
  'subject_name': 'Physiology'},
 {'value': 'All are alpha-blocker except?',
  'score': 0.32,
  'subject_name': 'Pharmacology'},
 {'value': 'All are pre-malignant conditions except -',
  'score': 0.3,
  'subject_name': 'Pathology'},
 {'value': 'All are cholinergic actions except?',
  'score': 0.3,
  'subject_name': 'Pharmacology'},
 {'value': 'Ionic receptors are all except ?',
  'score': 0.3,
  'subject_name': 'Physiology'},
 {'value': 'All are occupational cancers except ?',
  'score': 0.3,
  'subject_name': 'Social & Preventive Medicine'},
 {'value': 'Drugs used in ALL in child are all except -',
  'sco

In [94]:
get_recommendation_cv(MCQAid, unique_df, similarity = "manhattan", n=10)

[{'value': 'Post-ganglionic parasympathetic fibers are -',
  'score': 13.0,
  'subject_name': 'Physiology'},
 {'value': 'All are androgens except ?',
  'score': 13.0,
  'subject_name': 'Physiology'},
 {'value': 'B+ Blood group can receive blood from all except',
  'score': 13.0,
  'subject_name': 'Physiology'},
 {'value': 'All are alpha-blocker except?',
  'score': 13.0,
  'subject_name': 'Pharmacology'},
 {'value': 'All are pre-malignant conditions except -',
  'score': 14.0,
  'subject_name': 'Pathology'},
 {'value': 'All are cholinergic actions except?',
  'score': 14.0,
  'subject_name': 'Pharmacology'},
 {'value': 'Ionic receptors are all except ?',
  'score': 14.0,
  'subject_name': 'Physiology'},
 {'value': 'All are occupational cancers except ?',
  'score': 14.0,
  'subject_name': 'Social & Preventive Medicine'},
 {'value': 'Pulses are deficient in ?',
  'score': 15.0,
  'subject_name': 'Social & Preventive Medicine'},
 {'value': 'Gerstmanns syndrome all except',
  'score': 15.

In [95]:
get_recommendation_cv(MCQAid, unique_df, similarity = "euclidean", n=10)

[{'value': 'Post-ganglionic parasympathetic fibers are -',
  'score': 3.61,
  'subject_name': 'Physiology'},
 {'value': 'All are androgens except ?',
  'score': 3.61,
  'subject_name': 'Physiology'},
 {'value': 'All are alpha-blocker except?',
  'score': 3.61,
  'subject_name': 'Pharmacology'},
 {'value': 'All are pre-malignant conditions except -',
  'score': 3.74,
  'subject_name': 'Pathology'},
 {'value': 'All are cholinergic actions except?',
  'score': 3.74,
  'subject_name': 'Pharmacology'},
 {'value': 'Ionic receptors are all except ?',
  'score': 3.74,
  'subject_name': 'Physiology'},
 {'value': 'All are occupational cancers except ?',
  'score': 3.74,
  'subject_name': 'Social & Preventive Medicine'},
 {'value': 'Pulses are deficient in ?',
  'score': 3.87,
  'subject_name': 'Social & Preventive Medicine'},
 {'value': 'Gerstmanns syndrome all except',
  'score': 3.87,
  'subject_name': 'Medicine'},
 {'value': 'Beta-alanine is derived from ?',
  'score': 3.87,
  'subject_name':

# **Build a Model using TF-IDF features**

In [96]:
# Function to get recommendations using TF-IDF features
def get_recommendation_tfidf(MCQAid, df, similarity, n=10):
    row = df.loc[df['MCQAid'] == MCQAid]
    if row.empty:
        print(f"No question found with id {MCQAid}")
        return []

    index = list(row.index)[0]
    description = row['mcqa_lowered'].loc[index]

    # Create a list of descriptions
    desc_list = df['mcqa_lowered'].tolist()

    # Create vector using Tfidf
    tfidf_matrix = tfidf_vec.fit_transform(desc_list)

    if similarity == "cosine":
        sim_matrix = cosine_similarity(tfidf_matrix)
        mcqas = find_similarity(sim_matrix, index, n)
    elif similarity == "manhattan":
        sim_matrix = manhattan_distances(tfidf_matrix)
        mcqas = find_manhattan_distance(sim_matrix, index, n)
    else:
        sim_matrix = euclidean_distances(tfidf_matrix)
        mcqas = find_euclidean_distances(sim_matrix, index, n)

    return mcqas

In [97]:
MCQAid = 10
# Cosine Similarity
get_recommendation_tfidf(MCQAid, unique_df, similarity = "cosine", n=10)

[{'value': 'All are true about skin except:',
  'score': 0.16,
  'subject_name': 'Skin'},
 {'value': 'Which of the following is derived from the neural tube except?',
  'score': 0.11,
  'subject_name': 'Anatomy'},
 {'value': 'True about Keratinocyte is ?',
  'score': 0.05,
  'subject_name': 'Skin'},
 {'value': 'Low astigmatism in dim light is due ?',
  'score': 0.05,
  'subject_name': 'Ophthalmology'},
 {'value': 'False about phacolytic glaucoma ?',
  'score': 0.05,
  'subject_name': 'Ophthalmology'},
 {'value': 'Silk retina is seen in ?',
  'score': 0.05,
  'subject_name': 'Ophthalmology'},
 {'value': 'Maximum oral structures are having their origin from',
  'score': 0.04,
  'subject_name': 'Anatomy'},
 {'value': 'Development of peritoneal cavity is from ?',
  'score': 0.04,
  'subject_name': 'Anatomy'},
 {'value': 'Which of the following is not a pa of uveal',
  'score': 0.04,
  'subject_name': 'Ophthalmology'},
 {'value': 'More than 90% of growth of brain or brain vault has achieved

In [98]:
# Manhattan distance
get_recommendation_tfidf(MCQAid, unique_df, similarity = "manhattan", n=10)

[{'value': 'Lipoproteins are of how many types?',
  'score': 5.93,
  'subject_name': 'Biochemistry'},
 {'value': 'Number of freni in the mandible:',
  'score': 6.63,
  'subject_name': 'Dental'},
 {'value': 'Number of freni in the maxilla:',
  'score': 6.63,
  'subject_name': 'Dental'},
 {'value': 'Post-ganglionic parasympathetic fibers are -',
  'score': 6.64,
  'subject_name': 'Physiology'},
 {'value': 'What is the degree of freedom in a table of 2 x 2',
  'score': 6.64,
  'subject_name': 'Dental'},
 {'value': 'MAC of desflurane is ?',
  'score': 6.65,
  'subject_name': 'Anaesthesia'},
 {'value': 'IL- 1 activated by-', 'score': 6.7, 'subject_name': 'Pathology'},
 {'value': 'In ETC NADH generates -',
  'score': 6.75,
  'subject_name': 'Biochemistry'},
 {'value': 'Spirochetes becomes visible from which of the following zone:',
  'score': 6.98,
  'subject_name': 'Dental'},
 {'value': 'B+ Blood group can receive blood from all except',
  'score': 7.01,
  'subject_name': 'Physiology'}]

In [99]:
# Euclidean distance
get_recommendation_tfidf(MCQAid, unique_df, similarity = "euclidean", n=10)

[{'value': 'All are true about skin except:',
  'score': 1.3,
  'subject_name': 'Skin'},
 {'value': 'Which of the following is derived from the neural tube except?',
  'score': 1.33,
  'subject_name': 'Anatomy'},
 {'value': 'True about Keratinocyte is ?',
  'score': 1.38,
  'subject_name': 'Skin'},
 {'value': 'Low astigmatism in dim light is due ?',
  'score': 1.38,
  'subject_name': 'Ophthalmology'},
 {'value': 'False about phacolytic glaucoma ?',
  'score': 1.38,
  'subject_name': 'Ophthalmology'},
 {'value': 'Silk retina is seen in ?',
  'score': 1.38,
  'subject_name': 'Ophthalmology'},
 {'value': 'Maximum oral structures are having their origin from',
  'score': 1.38,
  'subject_name': 'Anatomy'},
 {'value': 'Development of peritoneal cavity is from ?',
  'score': 1.38,
  'subject_name': 'Anatomy'},
 {'value': 'Which of the following is not a pa of uveal',
  'score': 1.38,
  'subject_name': 'Ophthalmology'},
 {'value': 'More than 90% of growth of brain or brain vault has achieved 

# **Test Question**

In [100]:
df.loc[df['MCQAid'] == 10]

Unnamed: 0,question,opa,opb,opc,opd,subject_name,topic_name,id,choice_type,MCQAid,StringCombined
9,All are derived from ectoderm except ?,Lens,Eustachian tube,Brain,Retina,Anatomy,,0a7cddf8-a8b8-4778-aa58-4b01c3da1c12,multi,10,All are derived from ectoderm except ? Lens Eu...


# **Word2vec**

In [101]:
import gdown

word2vec_url = 'https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM'  # Google Drive link to the pretrained word2vec binary
word2vec_output = 'word2vec.bin.gz'
gdown.download(word2vec_url, word2vec_output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=966dcaf2-47ad-4e78-a423-0f1467a39264
To: /content/word2vec.bin.gz
100%|██████████| 1.65G/1.65G [00:19<00:00, 84.2MB/s]


'word2vec.bin.gz'

In [102]:
# Load the pretrained word2vec vectors
word2vecModel = models.KeyedVectors.load_word2vec_format('word2vec.bin.gz', binary=True)

In [108]:
# Compare similarity to get the top matches using the pretrained word2vec model
def get_recommendation_word2vec(MCQAid, df, similarity, n=10):

    row = df.loc[df['MCQAid'] == MCQAid]
    if row.empty:
        print(f"No question found with id {MCQAid}")
        return []

    input_index = list(row.index)[0]

    # Build the texts from df rather than relying on the global desc_list
    desc_list = df['mcqa_lowered'].tolist()

    # Create one 300-dimensional vector per text by summing its word vectors
    vector_matrix = np.empty((len(desc_list), 300))
    for index, each_sentence in enumerate(desc_list):
        sentence_vector = np.zeros((300,))
        for each_word in each_sentence.split():
            try:
                sentence_vector += word2vecModel[each_word]
            except KeyError:
                # Skip words that are not in the word2vec vocabulary
                continue

        vector_matrix[index] = sentence_vector

    # Pass n through so the requested number of results is honoured
    if similarity == "cosine":
        sim_matrix = cosine_similarity(vector_matrix)
        mcqas = find_similarity(sim_matrix, input_index, n)

    elif similarity == "manhattan":
        sim_matrix = manhattan_distances(vector_matrix)
        mcqas = find_manhattan_distance(sim_matrix, input_index, n)

    else:
        sim_matrix = euclidean_distances(vector_matrix)
        mcqas = find_euclidean_distances(sim_matrix, input_index, n)

    return mcqas

In [109]:
# Cosine Similarity
get_recommendation_word2vec(MCQAid, unique_df, similarity = "cosine", n=10)

[{'value': 'Which of the following is derived from the neural tube except?',
  'score': 0.87,
  'subject_name': 'Anatomy'},
 {'value': 'All are true about skin except:',
  'score': 0.8,
  'subject_name': 'Skin'},
 {'value': 'Pupillary reflex pathway- All of the following are a pa except ?',
  'score': 0.8,
  'subject_name': 'Ophthalmology'},
 {'value': 'Which of the following is not a pa of uveal',
  'score': 0.79,
  'subject_name': 'Ophthalmology'},
 {'value': 'Stereocilia are present in?',
  'score': 0.77,
  'subject_name': 'Physiology'},
 {'value': 'The zonules suspending the lens are attached to the?',
  'score': 0.76,
  'subject_name': 'Ophthalmology'},
 {'value': 'All are true about trigeminal nerve except',
  'score': 0.75,
  'subject_name': 'Anatomy'},
 {'value': 'All are true about trigeminal nerve except?',
  'score': 0.75,
  'subject_name': 'Anatomy'},
 {'value': 'Abetalipoproteinemia affects ?',
  'score': 0.75,
  'subject_name': 'Biochemistry'},
 {'value': "Cell bodies of 

In [110]:
# Manhattan distance
get_recommendation_word2vec(MCQAid, unique_df, similarity = "manhattan", n=10)

[{'value': 'Which of the following is not a pa of uveal',
  'score': 125.32,
  'subject_name': 'Ophthalmology'},
 {'value': 'Carbonic anhydrase activity found in all except?',
  'score': 130.89,
  'subject_name': 'Physiology'},
 {'value': 'Stereocilia are present in?',
  'score': 136.94,
  'subject_name': 'Physiology'},
 {'value': 'Stereocilia are found in?',
  'score': 137.2,
  'subject_name': 'Physiology'},
 {'value': 'Kidney parenchyma is derived from -',
  'score': 137.98,
  'subject_name': 'Anatomy'},
 {'value': 'Most lateral nucleus of cerebellum is ?',
  'score': 140.67,
  'subject_name': 'Anatomy'},
 {'value': 'Lens attached to ciliary body ?',
  'score': 140.67,
  'subject_name': 'Ophthalmology'},
 {'value': 'All of the following pass through the Sinus of morgagni except -',
  'score': 142.7,
  'subject_name': 'Anatomy'},
 {'value': 'Ionic receptors are all except ?',
  'score': 143.06,
  'subject_name': 'Physiology'},
 {'value': 'Pneumatic bone is all except?',
  'score': 144

In [111]:
# Euclidean distance
get_recommendation_word2vec(MCQAid, unique_df, similarity = "euclidean", n=10)

[{'value': 'Which of the following is not a pa of uveal',
  'score': 9.11,
  'subject_name': 'Ophthalmology'},
 {'value': 'Carbonic anhydrase activity found in all except?',
  'score': 9.56,
  'subject_name': 'Physiology'},
 {'value': 'Stereocilia are found in?',
  'score': 9.8,
  'subject_name': 'Physiology'},
 {'value': 'Stereocilia are present in?',
  'score': 9.97,
  'subject_name': 'Physiology'},
 {'value': 'Most lateral nucleus of cerebellum is ?',
  'score': 10.1,
  'subject_name': 'Anatomy'},
 {'value': 'Ionic receptors are all except ?',
  'score': 10.14,
  'subject_name': 'Physiology'},
 {'value': 'Kidney parenchyma is derived from -',
  'score': 10.21,
  'subject_name': 'Anatomy'},
 {'value': 'Lens attached to ciliary body ?',
  'score': 10.25,
  'subject_name': 'Ophthalmology'},
 {'value': 'All of the following pass through the Sinus of morgagni except -',
  'score': 10.37,
  'subject_name': 'Anatomy'},
 {'value': 'Somatic efferent of which arise from medulla?',
  'score': 

# **Build Model using GloVe Features**

In [107]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2024-06-04 18:55:25--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2024-06-04 18:55:25--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2024-06-04 18:58:04 (5.18 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [112]:
# Load the 300-dimensional GloVe vectors into a {word: vector} dictionary
glove_df = pd.read_csv('glove.6B.300d.txt',sep=" ",
                       quoting=3, header=None, index_col=0)
glove_model = {key:value.values for key, value in glove_df.T.items()}

In [113]:
# Compare similarity to get the top matches using the pretrained GloVe model
def get_recommendation_glove(MCQAid, df, similarity, n=10):

    row = df.loc[df['MCQAid'] == MCQAid]
    if row.empty:
        print(f"No question found with id {MCQAid}")
        return []

    input_index = list(row.index)[0]

    # Build the texts from df rather than relying on the global desc_list
    desc_list = df['mcqa_lowered'].tolist()

    # Create one 300-dimensional vector per text by summing its GloVe word vectors
    vector_matrix = np.empty((len(desc_list), 300))
    for index, each_sentence in enumerate(desc_list):
        sentence_vector = np.zeros((300,))
        for each_word in each_sentence.split():
            try:
                sentence_vector += glove_model[each_word]
            except KeyError:
                # Skip words that are not in the GloVe vocabulary
                continue

        vector_matrix[index] = sentence_vector

    # Pass n through so the requested number of results is honoured
    if similarity == "cosine":
        sim_matrix = cosine_similarity(vector_matrix)
        mcqas = find_similarity(sim_matrix, input_index, n)

    elif similarity == "manhattan":
        sim_matrix = manhattan_distances(vector_matrix)
        mcqas = find_manhattan_distance(sim_matrix, input_index, n)

    else:
        sim_matrix = euclidean_distances(vector_matrix)
        mcqas = find_euclidean_distances(sim_matrix, input_index, n)

    return mcqas
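
The word2vec and GloVe functions build each text vector by summing word vectors, so longer texts get larger vectors. Cosine similarity is unaffected by that scaling, but Manhattan and Euclidean distances are. A minimal sketch of the averaging alternative follows; `sentence_vector` and the toy 2-dimensional embeddings are hypothetical, and the lookup assumes a plain dictionary such as `glove_model` (a gensim `KeyedVectors` object would need a `word in model` check instead of `.get`).

```python
import numpy as np

def sentence_vector(sentence, embeddings, dim, average=True):
    """Sum (or average) the vectors of the in-vocabulary words of a sentence."""
    total = np.zeros(dim)
    count = 0
    for word in sentence.split():
        vec = embeddings.get(word)  # None for out-of-vocabulary words
        if vec is not None:
            total += vec
            count += 1
    if average and count > 0:
        total /= count
    return total

# Hypothetical 2-dimensional embeddings, for illustration only.
toy = {"lens": np.array([1.0, 0.0]), "retina": np.array([0.0, 1.0])}
short_text = sentence_vector("lens retina", toy, dim=2)
long_text = sentence_vector("lens retina lens retina unknownword", toy, dim=2)
print(short_text, long_text)
```

With averaging, the short and the long text map to the same vector because they contain the same mix of known words. With `average=False` the long text's vector is twice as large, which is how the notebook's summation behaves.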

In [114]:
# Cosine Similarity
get_recommendation_glove(MCQAid, unique_df, similarity = "cosine", n=10)

[{'value': 'Which of the following is derived from the neural tube except?',
  'score': 0.83,
  'subject_name': 'Anatomy'},
 {'value': 'All are true about skin except:',
  'score': 0.73,
  'subject_name': 'Skin'},
 {'value': "Cell bodies of Muller's Cells are present in which layer of retina?",
  'score': 0.72,
  'subject_name': 'Ophthalmology'},
 {'value': 'The zonules suspending the lens are attached to the?',
  'score': 0.71,
  'subject_name': 'Ophthalmology'},
 {'value': 'Stereocilia are present in?',
  'score': 0.71,
  'subject_name': 'Physiology'},
 {'value': 'The earliest feature of 3rd cranial nerve involvement in diabetes mellitus patient is -',
  'score': 0.71,
  'subject_name': 'Medicine'},
 {'value': 'Which of the following is the function of tensor tympani muscle?',
  'score': 0.7,
  'subject_name': 'ENT'},
 {'value': 'Which of the following does not have sympathetic noradrenergic fibers ?',
  'score': 0.7,
  'subject_name': 'Physiology'},
 {'value': 'Which of the followin

In [115]:
# Euclidean distance
get_recommendation_glove(MCQAid, unique_df, similarity = "euclidean", n=10)

[{'value': 'Oxyntic cells are present in -',
  'score': 25.93,
  'subject_name': 'Anatomy'},
 {'value': 'Stereocilia are present in?',
  'score': 26.39,
  'subject_name': 'Physiology'},
 {'value': 'What type of muscles are medial two lumbricals?',
  'score': 27.29,
  'subject_name': 'Anatomy'},
 {'value': 'Inner cell mass differentiates into ?',
  'score': 27.47,
  'subject_name': 'Gynaecology & Obstetrics'},
 {'value': 'Which of the following is derived from fibroblast cells ?',
  'score': 27.49,
  'subject_name': 'Pathology'},
 {'value': 'Kidney parenchyma is derived from -',
  'score': 27.67,
  'subject_name': 'Anatomy'},
 {'value': 'Transverse lie is caused by all except ?',
  'score': 27.69,
  'subject_name': 'Gynaecology & Obstetrics'},
 {'value': 'Sezary cells show which tlpe of nucleus -',
  'score': 27.87,
  'subject_name': 'Pathology'},
 {'value': 'Stereocilia are found in?',
  'score': 27.96,
  'subject_name': 'Physiology'},
 {'value': 'Feilization usually occurs in which pa

To validate and improve this content-based recommendation model for multiple-choice question answering (MCQA), evaluate each component of the model separately and compare the results. The steps below outline one way to do that:

**Data Preparation:**

Ensure that you have a sufficiently large and diverse dataset of reference questions and potential recommended questions.
Split the dataset into training, validation, and testing sets.
Consider creating a ground truth dataset by having subject matter experts manually curate relevant recommendations for a subset of the reference questions.
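
As a rough sketch of the split and the ground-truth structure described above, question ids could be shuffled reproducibly and cut into three lists. The function name `split_ids`, the 80/10/10 proportions, the seed, and the ids in `ground_truth` are all assumptions made for illustration.

```python
import random

def split_ids(ids, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle ids reproducibly and split them into train/validation/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test

# Stand-in for the MCQAid column: ids 1..100.
train_ids, val_ids, test_ids = split_ids(range(1, 101))

# Hypothetical ground truth: query MCQAid -> set of expert-approved MCQAids.
ground_truth = {10: {57, 203}, 42: {7}}
```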


**Embedding Methods:**

Evaluate the performance of different embedding methods (e.g., count vectorizer, TF-IDF, word2vec, GloVe) on the validation set.
Compare the recommendations generated by each embedding method against the ground truth (if available) or manually evaluate them.
Analyze the strengths and weaknesses of each embedding method in capturing semantic similarity and relevance.
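
When no ground truth is available, one simple way to compare two embedding methods is to measure how much their recommendation lists overlap. The sketch below computes the Jaccard overlap on the `'value'` field that the notebook's helper functions return; `jaccard_overlap` is a hypothetical helper, and the two lists are abbreviated to three entries each, so the resulting number is illustrative rather than a result for the full top-10 lists.

```python
def jaccard_overlap(recs_a, recs_b):
    """Jaccard overlap between the question texts of two recommendation lists."""
    a = {r["value"] for r in recs_a}
    b = {r["value"] for r in recs_b}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Abbreviated lists, using questions that appear in the outputs earlier in the notebook.
cv_cosine = [
    {"value": "All are true about skin except:"},
    {"value": "All are androgens except ?"},
    {"value": "All are alpha-blocker except?"},
]
tfidf_cosine = [
    {"value": "All are true about skin except:"},
    {"value": "True about Keratinocyte is ?"},
    {"value": "Silk retina is seen in ?"},
]
print(jaccard_overlap(cv_cosine, tfidf_cosine))
```

A low overlap does not say which method is better; it only flags query questions where the methods disagree and a manual review is most informative.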


**Similarity Measures:**

For each embedding method, evaluate the performance of different similarity measures (e.g., Euclidean, Manhattan, Cosine) on the validation set.
Compare the recommendations generated by each similarity measure against the ground truth or manual evaluation.
Identify the most effective similarity measure(s) for each embedding method.
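
A toy example can make the difference between the three measures concrete. The vectors below are invented for illustration: one candidate points in the same direction as the query but is three times longer, the other has the same length but a different word mix.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: depends only on the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Euclidean (L2) distance between u and v."""
    return float(np.linalg.norm(u - v))

def manhattan(u, v):
    """Manhattan (L1) distance between u and v."""
    return float(np.abs(u - v).sum())

query = np.array([1.0, 1.0, 0.0])
same_direction = np.array([3.0, 3.0, 0.0])  # same word mix, three times longer
different_mix = np.array([1.0, 0.0, 1.0])   # same length, different words

for name, vec in [("same_direction", same_direction), ("different_mix", different_mix)]:
    print(name, round(cosine(query, vec), 2), round(euclidean(query, vec), 2), manhattan(query, vec))
```

Cosine similarity ranks `same_direction` first because it ignores vector length, while both distance measures rank `different_mix` as closer. This is one reason the cosine and distance-based rankings in the notebook can differ for the same embedding.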


**Hybrid Approaches:**

Explore combining different embedding methods and similarity measures in a hybrid approach.
Evaluate the performance of the hybrid approach on the validation set.
Compare the hybrid approach against the individual methods and the ground truth or manual evaluation.
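
One possible hybrid, sketched below under stated assumptions, rescales a similarity row and a distance row to the same range and blends them with a weight. The helper names `min_max` and `hybrid_scores`, the weight of 0.5, and the toy values are all hypothetical; other combination schemes (for example rank averaging) would be equally valid.

```python
import numpy as np

def min_max(x):
    """Rescale an array to [0, 1]; a constant array maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_scores(similarities, distances, weight=0.5):
    """Blend a similarity row and a distance row into one score (higher is better)."""
    sim = min_max(similarities)
    dist_as_sim = 1.0 - min_max(distances)  # small distance -> high score
    return weight * sim + (1.0 - weight) * dist_as_sim

# Toy rows for three candidates (hypothetical values).
cosine_row = [0.9, 0.5, 0.1]
euclid_row = [4.0, 1.0, 3.0]
scores = hybrid_scores(cosine_row, euclid_row, weight=0.5)
ranking = np.argsort(-scores)  # candidate positions, best first
print(scores, ranking)
```

The weight is itself a hyperparameter and would need to be chosen on the validation set.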


**Hyperparameter Tuning:**

Identify the hyperparameters of your recommendation model (e.g., embedding dimensions, similarity thresholds).
Perform a grid search or random search to find the optimal hyperparameter values that maximize performance on the validation set.
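
A grid search over a handful of settings can be written in a few lines. The sketch below is generic: `grid_search` is a hypothetical helper, the grid lists settings that appear in this notebook, and `placeholder_score` returns arbitrary numbers only so that the example runs. In practice it would be replaced by a real validation metric.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in `param_grid` and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over settings used in this notebook.
param_grid = {
    "ngram_range": [(1, 1), (1, 2), (1, 3)],
    "similarity": ["cosine", "manhattan", "euclidean"],
}

def placeholder_score(ngram_range, similarity):
    # Stand-in only: arbitrary numbers, not a measured result.
    return ngram_range[1] + (1 if similarity == "cosine" else 0)

best_params, best_score = grid_search(param_grid, placeholder_score)
print(best_params, best_score)
```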


**Subject-specific Evaluation:**

Evaluate the performance of your recommendation model separately for different subjects (e.g., Anatomy, Physiology, Pathology).
Identify subjects where the model performs well and subjects where it struggles.
Consider fine-tuning the model or exploring subject-specific approaches if necessary.
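
A cheap per-subject check, usable before any expert labels exist, is the share of recommendations that fall in the same subject as the query. Treating a subject match as a sign of relevance is an assumption and only a rough proxy. The helper `subject_agreement` and the toy input are hypothetical; the `'subject_name'` key matches the dictionaries returned by the notebook's helper functions.

```python
from collections import defaultdict

def subject_agreement(results):
    """Per-subject share of recommendations that match the query's subject.

    `results` is a list of (query_subject, recommendations) pairs, where each
    recommendation is a dict with a 'subject_name' key.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for query_subject, recs in results:
        for rec in recs:
            totals[query_subject] += 1
            hits[query_subject] += rec["subject_name"] == query_subject
    return {subject: hits[subject] / totals[subject] for subject in totals}

# Toy input (hypothetical): two Anatomy queries and one Pathology query.
results = [
    ("Anatomy", [{"subject_name": "Anatomy"}, {"subject_name": "Skin"}]),
    ("Anatomy", [{"subject_name": "Anatomy"}, {"subject_name": "Anatomy"}]),
    ("Pathology", [{"subject_name": "Physiology"}]),
]
print(subject_agreement(results))
```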


**Final Evaluation:**

Use the best-performing combination of embedding method, similarity measure, and hyperparameters to generate recommendations on the held-out testing set.
Evaluate the recommendations using appropriate metrics (e.g., precision, recall, F1-score) against the ground truth or manual evaluation.
Analyze the strengths and limitations of your final model, and identify areas for further improvement.
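
The metrics named above can be computed per query from the recommended ids and the set of ids judged relevant. The sketch below assumes the recommended list contains no duplicate ids; the function name and the example ids are hypothetical.

```python
def precision_recall_f1_at_k(recommended_ids, relevant_ids, k=10):
    """Precision, recall and F1 of the top-k recommendations for one query."""
    top_k = list(recommended_ids)[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1

# Hypothetical ids: 2 of the 4 recommendations are in the curated set of 5.
p, r, f1 = precision_recall_f1_at_k([11, 25, 37, 48], {25, 48, 90, 91, 92}, k=4)
print(p, r, round(f1, 3))
```

Averaging these values over all test queries gives a single figure per configuration, which makes the embedding and similarity combinations directly comparable.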


**Deployment and Monitoring:**

Deploy your validated recommendation model in a production environment.
Implement a monitoring system to track the model's performance over time and collect feedback from users.
Regularly update and retrain the model with new data to ensure it remains relevant and effective.