## Downloading and Importing the dependencies

In [1]:
!pip install fuzzywuzzy
!pip install python-Levenshtein



In [3]:
import numpy as np
from pickle import load as pkl_load
import nltk
import json
import string
from nltk.corpus import stopwords
from textblob import TextBlob
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import process

In [4]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
!kaggle datasets download -d watts2/glove6b50dtxt
!unzip /content/glove6b50dtxt.zip

Dataset URL: https://www.kaggle.com/datasets/watts2/glove6b50dtxt
License(s): CC0-1.0
Downloading glove6b50dtxt.zip to /content
 77% 52.0M/67.7M [00:00<00:00, 112MB/s]
100% 67.7M/67.7M [00:00<00:00, 115MB/s]
Archive:  /content/glove6b50dtxt.zip
  inflating: glove.6B.50d.txt        


## Loading the Data and EDA

In [6]:
# Use a context manager so the file handle is closed after reading
with open('/content/faqs.json', 'r') as f:
    df = json.load(f)
df

{'Admissions': [{'question': 'What is the process for admission into Saras AI Institute?',
   'answer': 'The admission process at Saras AI Institute typically involves submitting the online application form along with necessary details, followed by a quick pre-Enrollment assessment to evaluate your candidature based on your personal traits and basic communication skills in English.'},
  {'question': 'Is there an application fee for applying to Saras AI Institute?',
   'answer': 'There is no application fee for applying to any program at Saras'},
  {'question': 'What is pre-enrollment assessment test? How do I prepare for it?',
   'answer': 'It is a fully online assessment which takes less than 15 minutes. It is designed to evaluate your personal traits and basic English communication skills. You can take it at the time of filling out the application. It idoes not require any specific preparation'},
  {'question': 'Are there any specific requirements or prerequisites for admission into 

In [7]:
questions = []
answers = []

# Flatten the per-domain FAQ entries into parallel question/answer lists
for domain, pairs in df.items():
    for entry in pairs:
        questions.append(entry['question'])
        answers.append(entry['answer'])

In [8]:
questions

['What is the process for admission into Saras AI Institute?',
 'Is there an application fee for applying to Saras AI Institute?',
 'What is pre-enrollment assessment test? How do I prepare for it?',
 'Are there any specific requirements or prerequisites for admission into the programs?',
 'When is the deadline for submitting the application?',
 'What is the curriculum like at Saras AI Institute?',
 'What does the program structure look like, and how is the curriculum delivered?',
 'Can you provide more details about the role-based curriculum feature and how it benefits students?',
 'Do you also conduct LIVE sessions?',
 'Can I transfer credits earned at other universities to Saras AI Institute?',
 'Who are the faculty members at Saras AI Institute?',
 'Can I connect with mentors outside of class?',
 'Is Saras AI Institute accredited?',
 'Are the degree programs recognised by the government? ',
 'Do employers require an accredited degree? ',
 'Does Saras AI Institute offer employment s

In [9]:
len(questions)

22

The dataset contains only 22 questions, far too few to train or fine-tune embeddings on, so we rely on a pre-trained embedding model. We chose the 50-dimensional GloVe model because it has a smaller footprint than Word2Vec and the larger GloVe variants, which makes it cheaper to load and query.
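To make the footprint argument concrete, here is a back-of-the-envelope estimate of the raw vector storage for each dimension in the glove.6B release. It is only a sketch: it assumes float32 storage and the 400,000-word vocabulary reported when the file is loaded, and it ignores the overhead of the Python dictionary that holds the vectors. The helper name `raw_vector_megabytes` is made up for this illustration.

```python
# Rough memory estimate for raw float32 GloVe vectors (dict overhead is ignored).
VOCAB_SIZE = 400_000        # word vectors reported when glove.6B.50d.txt is loaded
BYTES_PER_FLOAT32 = 4

def raw_vector_megabytes(dim, vocab_size=VOCAB_SIZE):
    """Size in MB (10**6 bytes) of vocab_size float32 vectors of length dim."""
    return vocab_size * dim * BYTES_PER_FLOAT32 / 1e6

# Dimensions shipped in the glove.6B release
for dim in (50, 100, 200, 300):
    print(f"{dim:>3}d: ~{raw_vector_megabytes(dim):.0f} MB")
```

Under these assumptions the 50-dimensional vectors need about one sixth of the memory of the 300-dimensional ones.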

In [10]:
stop_words = set(stopwords.words('english'))
punctuations = string.punctuation

In [11]:
# Basic text preprocessing: lowercase, tokenize, and drop punctuation and stopwords.
# Misspellings are not handled here; they are corrected with TextBlob at query time.

def preprocess_text(text):
    if isinstance(text, list):
        text = ' '.join(text)
    tokens = word_tokenize(text.lower())
    processed_tokens = [
        word for word in tokens
        if word.isalnum() and word not in stop_words
    ]
    return processed_tokens

In [12]:
# Loading the GloVe model

def load_glove_model(glove_file):
    embeddings = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    print(f"Loaded {len(embeddings)} word vectors.")
    return embeddings

glove_embeddings = load_glove_model("/content/glove.6B.50d.txt")

Loaded 400000 word vectors.
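The embedding functions in this notebook silently skip any token that is missing from the GloVe dictionary, so it can help to check which words of a query are actually covered. The helper below is a hypothetical addition, not part of the original notebook, and the toy dictionary stands in for `glove_embeddings` with made-up values.

```python
def split_by_vocabulary(tokens, embeddings):
    """Return (known, unknown) token lists, preserving the input order."""
    known = [t for t in tokens if t in embeddings]
    unknown = [t for t in tokens if t not in embeddings]
    return known, unknown

# Toy stand-in for glove_embeddings; the vector values are invented.
toy_embeddings = {"faculty": [0.1, 0.2], "classes": [0.3, 0.4]}

known, unknown = split_by_vocabulary(["faculty", "xyzzy"], toy_embeddings)
print("known:", known)
print("unknown:", unknown)
```

With the real embeddings, the call would look like `split_by_vocabulary(preprocess_text(query), glove_embeddings)`.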


## Traditional Approach using Sentence-vector embedding

In [13]:
embedding_dimension = 50

In [14]:
def get_glove_sentence_vector(sentence, glove_embeddings, embedding_dim=embedding_dimension):
    words = preprocess_text(sentence)
    word_vectors = [glove_embeddings[word] for word in words if word in glove_embeddings]

    if not word_vectors:
        return np.zeros((embedding_dim,))

    return np.mean(word_vectors, axis=0)

In [15]:
get_glove_sentence_vector('how are the classes taken', glove_embeddings, embedding_dim = embedding_dimension)

array([ 0.256795  ,  0.32986   , -0.472073  , -0.770145  ,  0.146318  ,
       -0.084546  , -0.1996985 , -0.03827   , -0.15856001, -0.1157225 ,
        0.23339   , -0.158088  ,  0.02433001, -0.1204015 ,  0.19311701,
       -0.13117   , -0.42638502,  0.21665502, -0.1832225 , -0.42641503,
        0.37863398,  0.377465  ,  0.2471665 ,  0.51135004, -0.038065  ,
       -1.1330299 , -0.138971  , -0.554074  , -0.1686    , -0.13817151,
        3.1283002 ,  0.3363515 , -0.485785  , -0.73368   ,  0.721085  ,
        0.321835  ,  0.40797502,  0.35070002, -0.102921  ,  0.34120935,
       -0.23449999, -0.43982968,  0.260965  ,  0.632208  ,  0.1407385 ,
       -0.33499   ,  0.388705  , -0.158773  , -0.051145  , -0.01500499],
      dtype=float32)

In [16]:
question_vectors = np.array([get_glove_sentence_vector(q, glove_embeddings, embedding_dimension) for q in questions])

def hybrid_search(user_query):
    # Correct spelling first so misspelled words can still be matched
    corrected_query = str(TextBlob(user_query).correct())
    # get_glove_sentence_vector preprocesses its input, so pass the corrected text directly
    query_vector = get_glove_sentence_vector(corrected_query, glove_embeddings, embedding_dimension)
    semantic_scores = cosine_similarity([query_vector], question_vectors).flatten()
    # Fuzzy string score against each question, rescaled from 0-100 to 0-1
    fuzzy_scores = np.array([process.extractOne(corrected_query, [q])[1] for q in questions]) / 100
    # Weighted blend: 75% semantic similarity, 25% fuzzy match
    combined_scores = semantic_scores*0.75 + fuzzy_scores*0.25

    best_match_indices = np.argsort(combined_scores)[-5:][::-1]
    for idx in best_match_indices:
        print(f"Question: {questions[idx]}")
        print(f"Answer: {answers[idx]}")
        print()
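The ranking step in `hybrid_search` is a weighted sum of two scores followed by a top-k selection. The sketch below isolates that step using plain Python lists and made-up scores for three FAQ entries; `combine_scores` and `top_indices` are illustrative names, not functions from the notebook.

```python
def combine_scores(semantic_scores, fuzzy_scores, semantic_weight=0.75):
    """Weighted sum of two equally long score lists, both expected in [0, 1]."""
    fuzzy_weight = 1.0 - semantic_weight
    return [semantic_weight * s + fuzzy_weight * f
            for s, f in zip(semantic_scores, fuzzy_scores)]

def top_indices(scores, k=5):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up scores for three FAQ entries
semantic = [0.80, 0.60, 0.90]   # cosine similarities
fuzzy = [0.40, 0.80, 0.20]      # fuzzy scores already divided by 100

combined = combine_scores(semantic, fuzzy)
print(top_indices(combined, k=2))
```

Because the semantic weight is three times the fuzzy weight, the third entry ranks first here even though its fuzzy score is the lowest of the three.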


In [17]:
hybrid_search('how are the faculty at saras')

Question:  Does the university offer internship placement assistance?
Answer: Yes, we assist students in finding internships by connecting them with potential employers and offering guidance on applications and interviews.

Question: Who are the faculty members at Saras AI Institute?
Answer: The faculty at Saras AI Institute consists of industry professionals who bring the most relevant skills and mentorship for the students to help them prepare for exactly what is needed to succeed in the job roles they are preparing for

Question: Can I transfer credits earned at other universities to Saras AI Institute?
Answer: Yes, we evaluate the course that you have taken, if it overlaps with our curriculum and is relevant in today's time, we offer the flexibility of transfering credits

Question: Does Saras AI Institute offer any scholarships for students? How can I apply for them? 
Answer: Yes, we offer various scholarships to eligible students based on academic merit, financial need, and other

In [18]:
hybrid_search('what is the duration of the program')

Question: Are there any specific requirements or prerequisites for admission into the programs?
Answer: To be a successful professional in AI, you need to possess basic mathematical proficieny - which can be demonstrated by your math scores in high school or beyond. At Saras, you learn with global peers and faculties and should possess basic communication skills in English. These make up for the basic eligibility critera

Question: Can you provide more details about the role-based curriculum feature and how it benefits students?
Answer: Our role-based curriculum is designed to provide targeted training and develop specialized skills in students that are highly relevant to their desired job-roles from day one.

Question:  Does the university offer internship placement assistance?
Answer: Yes, we assist students in finding internships by connecting them with potential employers and offering guidance on applications and interviews.

Question: What does the program structure look like, and

In [19]:
hybrid_search('do they provide financiial help')

Question: Can I avail financial aid? 
Answer: You currently can't get a federal aid for Saras AI Institute's programs. However, we are partnering with lenders who can help facilitate a loan to help pay the tuition.

Question: Can you provide more details about the role-based curriculum feature and how it benefits students?
Answer: Our role-based curriculum is designed to provide targeted training and develop specialized skills in students that are highly relevant to their desired job-roles from day one.

Question: What is pre-enrollment assessment test? How do I prepare for it?
Answer: It is a fully online assessment which takes less than 15 minutes. It is designed to evaluate your personal traits and basic English communication skills. You can take it at the time of filling out the application. It idoes not require any specific preparation

Question: Does Saras AI Institute offer employment support?
Answer: Yes, we provide comprehensive employment support including job placement servi

As the results above show, this method does not work well for most queries, so we move on to our proposed approach.
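One plausible explanation, offered here as an interpretation rather than something the notebook measures, is that averaging word vectors dilutes the few words that actually matter. The toy example below uses invented 2-D vectors: the question contains the exact query word plus two unrelated words, and the mean-pooled similarity drops well below the best word-to-word similarity.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 2-D "embeddings" for illustration only
query_word = np.array([1.0, 0.0])
question_words = np.array([
    [1.0, 0.0],   # same word as the query
    [0.0, 1.0],   # unrelated word
    [0.0, 1.0],   # unrelated word
])

# Sentence-vector approach: compare against the mean of the question's words
mean_pooled_score = cosine(query_word, question_words.mean(axis=0))
# Word-to-word approach: compare against each word and keep the best match
best_word_score = max(cosine(query_word, w) for w in question_words)

print(f"mean-pooled: {mean_pooled_score:.3f}, best word pair: {best_word_score:.3f}")
```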

## Proposed method using Word-to-Word Similarity

In [20]:
def get_glove_words_vector(sentence, glove_embeddings, embedding_dim=embedding_dimension):
    words = preprocess_text(sentence)
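    # One row per in-vocabulary word
    word_vectors = [glove_embeddings[word] for word in words if word in glove_embeddings]

    # No known words: return a single zero row so the result is always 2-D
    if not word_vectors:
        return np.zeros((1, embedding_dim), dtype='float32')

    return np.array(word_vectors)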

In [21]:
preprocess_text('how are the classes taken')

['classes', 'taken']

In [22]:
get_glove_words_vector('how are the classes taken', glove_embeddings, embedding_dim = embedding_dimension)

array([[-3.7814e-01,  8.6281e-01, -8.7393e-01, -9.4600e-01, -5.9824e-02,
        -5.4972e-02,  8.7623e-02, -5.1979e-01, -5.3783e-01, -3.0595e-02,
         2.6409e-01, -2.5853e-01,  3.6406e-01, -4.1203e-02, -2.7636e-02,
        -5.0479e-01, -6.9142e-01,  9.4545e-01,  6.0625e-02, -5.5887e-01,
         6.8805e-01,  2.2802e-01,  8.9393e-02,  1.2071e+00, -3.4111e-02,
        -4.7886e-01,  6.2698e-02, -1.1058e+00, -4.8528e-01, -4.6123e-02,
         3.1131e+00,  7.0984e-01, -5.4366e-01, -7.9435e-01,  1.1274e+00,
         3.7786e-01,  1.9581e-01,  5.1234e-01, -2.8832e-02,  6.9193e-01,
        -3.5761e-01, -8.7164e-01,  2.3947e-01,  1.1765e+00,  9.9807e-02,
        -3.2057e-01,  1.1830e+00, -2.5024e-01, -3.2822e-01,  3.6105e-01],
       [ 8.9173e-01, -2.0309e-01, -7.0216e-02, -5.9429e-01,  3.5246e-01,
        -1.1412e-01, -4.8702e-01,  4.4325e-01,  2.2071e-01, -2.0085e-01,
         2.0269e-01, -5.7646e-02, -3.1540e-01, -1.9960e-01,  4.1387e-01,
         2.4245e-01, -1.6135e-01, -5.1214e-01, -4.

In [23]:
question_vectors = [get_glove_words_vector(q, glove_embeddings, embedding_dimension) for q in questions]

def word_similarites_glove(user_query):
    blob = TextBlob(user_query)
    corrected_query = str(blob.correct())

    # Getting a normalized word by word vector for the query
    query_vector = get_glove_words_vector(corrected_query, glove_embeddings)
    # The small epsilon avoids division by zero for all-zero rows
    query_normalised = query_vector / (np.linalg.norm(query_vector, axis=1, keepdims=True) + 1e-8)

    # Padding the questions to be vectors of same length, in order to convert them into a numpy array
    max_length = max(len(q) for q in question_vectors)
    question_array = np.array([np.pad(question, ((0, max_length - question.shape[0]), (0, 0)), mode='constant', constant_values=0) for question in question_vectors])

    # Normalizing the questions word by word
    norms = np.linalg.norm(question_array, axis=2, keepdims=True)
    questions_normalised = np.where(norms != 0, question_array / (norms + 1e-8), question_array)

    # Calculating the similarity between each word of query and question, and extracting the similarity score for each question using the top scores
    similarity = np.tensordot(questions_normalised, query_normalised, axes=([-1], [-1]))
    similarity = np.nan_to_num(similarity, nan=0)
    # One row of word-pair similarities per question (no hard-coded dataset size)
    similarity = similarity.reshape(len(question_vectors), -1)
    similarity = np.mean(np.sort(similarity, axis=1)[:, -(max(2, len(query_vector))):], axis=1)

    # Returning the top 5 matching FAQ entries
    best_match_idx = np.argsort(similarity)[-5:][::-1]
    for i in best_match_idx:
        print("Question:", questions[i])
        print("Answer:", answers[i])
        print()
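The scoring step can be easier to follow without the padding logic. The sketch below scores a single question against a query by computing every word-to-word cosine similarity and averaging the top `k`, with `k = max(2, number of query words)` as in the function above. It is a simplified illustration on invented 2-D vectors, not a drop-in replacement: the function names are made up, and because it skips padding, it averages fewer than `k` values when a question has fewer than `k` word pairs.

```python
import numpy as np

def normalize_rows(matrix, eps=1e-8):
    """Scale each row to unit length; all-zero rows stay zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)

def top_k_word_similarity(query_vecs, question_vecs, k=None):
    """Mean of the k largest word-to-word cosine similarities."""
    if k is None:
        k = max(2, len(query_vecs))
    # Rows index question words, columns index query words
    sims = normalize_rows(question_vecs) @ normalize_rows(query_vecs).T
    flat = np.sort(sims.ravel())
    return float(flat[-k:].mean())

# Invented 2-D word vectors for illustration only
query = np.array([[1.0, 0.0], [0.0, 1.0]])
question_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # contains both query words
question_b = np.array([[-1.0, 0.0], [1.0, 1.0]])              # only loosely related

score_a = top_k_word_similarity(query, question_a)
score_b = top_k_word_similarity(query, question_b)
print(f"question_a: {score_a:.3f}, question_b: {score_b:.3f}")
```

Unrelated words in a long question do not drag the score down, since only the strongest word pairs contribute.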

In [24]:
word_similarites_glove('how are the faculty at saras')

Question: Who are the faculty members at Saras AI Institute?
Answer: The faculty at Saras AI Institute consists of industry professionals who bring the most relevant skills and mentorship for the students to help them prepare for exactly what is needed to succeed in the job roles they are preparing for

Question: Can I transfer credits earned at other universities to Saras AI Institute?
Answer: Yes, we evaluate the course that you have taken, if it overlaps with our curriculum and is relevant in today's time, we offer the flexibility of transfering credits

Question: Does Saras AI Institute offer any scholarships for students? How can I apply for them? 
Answer: Yes, we offer various scholarships to eligible students based on academic merit, financial need, and other criteria. You can apply for scholarships after you're offered admission. Go ahead with filling out the application to check your eligibility.

Question: Can you provide more details about the role-based curriculum feature a

In [25]:
word_similarites_glove('do they provide financiial help')

Question: Can I avail financial aid? 
Answer: You currently can't get a federal aid for Saras AI Institute's programs. However, we are partnering with lenders who can help facilitate a loan to help pay the tuition.

Question: Can you provide more details about the role-based curriculum feature and how it benefits students?
Answer: Our role-based curriculum is designed to provide targeted training and develop specialized skills in students that are highly relevant to their desired job-roles from day one.

Question:  Does the university offer internship placement assistance?
Answer: Yes, we assist students in finding internships by connecting them with potential employers and offering guidance on applications and interviews.

Question: Does Saras AI Institute offer employment support?
Answer: Yes, we provide comprehensive employment support including job placement services, resume building workshops, and interview preparation.

Question: Are there any specific requirements or prerequisit

In [26]:
word_similarites_glove('what is the duration of the course')

Question: What does the program structure look like, and how is the curriculum delivered?
Answer: Each year is divided into 5 semesters which last for 8 weeks each. Our programs feature a mix of recorded and live sessions, allowing for flexibility in learning. 

Question: Can I connect with mentors outside of class?
Answer: Yes, we encourage mentorship and provide opportunities for students to connect with mentors outside of class through live sessions as well as 24x7 mentor support to help resolve your doubts or queries.

Question:  What are the tuition fees for your courses?
Answer: You can find detailed information and breakdown of the fee on 'Programs' page on the website

Question: What is pre-enrollment assessment test? How do I prepare for it?
Answer: It is a fully online assessment which takes less than 15 minutes. It is designed to evaluate your personal traits and basic English communication skills. You can take it at the time of filling out the application. It idoes not requ

## Saving the model

In [27]:
import pickle
with open('glove_embeddings.pkl', 'wb') as f:
    pickle.dump(glove_embeddings, f)
print("GloVe embeddings saved to glove_embeddings.pkl")

with open('faq_data.pkl', 'wb') as f:
    pickle.dump((questions, answers, question_vectors), f)
print("FAQ data saved to faq_data.pkl")

GloVe embeddings saved to glove_embeddings.pkl
FAQ data saved to faq_data.pkl
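For completeness, here is a minimal round-trip sketch showing how a saved tuple could be loaded back. It uses a temporary directory and toy data rather than the real `faq_data.pkl`, and `save_pickle` and `load_pickle` are illustrative helper names. Pickle files can execute arbitrary code when loaded, so only load files you created or trust.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only use on trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy stand-in for (questions, answers, question_vectors)
toy_faq = (["What is X?"], ["X is an example."], [[0.1, 0.2]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "faq_data.pkl")
    save_pickle(toy_faq, path)
    questions_loaded, answers_loaded, vectors_loaded = load_pickle(path)

print(questions_loaded, answers_loaded)
```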
