# Automated Question Answering from FAQs Using Word-Embeddings


#### The basic idea of this project is to automatically retrieve a suitable response to customer questions from FAQs. Often websites have  comprehensive FAQs, but manually searching and finding the answer to a specific question from these FAQs is not trivial. The purpose of this exercise is to answer user queries by automatically retrieving the closest question and answer from predefined FAQs when appropriate. I will be using a dataset of FAQs which I have made myself by searching the most frequently asked question from the Internet.The basic strategy to this will be finding an FAQ question that is closest in meaning to the user query and then it will display to the user. For this, the efficient way of computing the semantic similarity between two sentences is converting each sentence into vectors and then using cosine similarity between the vectors to come up with a distance measure between sentences that indicates how similar they are in meaning and for this I have used various models which is further explained.


In [25]:
import pandas as pd
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
df = pd.read_csv("FAQs.csv");
df.columns=["Questions","Answers"];
df


Unnamed: 0,Questions,Answers
0,What does the job hunting experience look like ?,Job hunting experience involves networking to...
1,What are the most valuable skills for a data s...,Data Science is now being integrated with indu...
2,What is the average salary of a data scientist?,A report by AIM had found out that the average...
3,What are the top algorithms that every data sc...,It is very crucial for the machine learning en...
4,How to prepare for a data science interview?,Cracking any interview requires preparation an...
5,Any insights you can offer about the DS job ma...,"There are many kinds of roles, data scientist,..."
6,Do employers look for an advanced ML degree?,For more senior roles: People typically look f...
7,How does a typical day of a data scientist loo...,Here are some tasks in the typical day of a da...
8,Do I need to prepare algorithms and data struc...,Coding round and an algorithms round. So prepa...
9,What are the benefits of a Data Science certif...,There are definitely some advantages that come...


# Preprocessing

##### Most NLP tasks involve preprocessing.
- Removing all characters that are not alpha numeric
- Removing stopwords - commonly used words such as 'a', 'to', 'in' and so on.. that do not contribute to the semantic similarity between two sentences.
##### We apply this to both the FAQ questions and the user query sentence.

In [27]:
wordnet=WordNetLemmatizer()
from gensim.parsing.preprocessing import remove_stopwords

def preprocess(ab):
  ab = re.sub('[^a-zA-Z]', ' ', ab)
  ab = ab.lower()              
  return ab

questions=[]
for i in range(0, len(df)):
  ab = preprocess(df['Questions'][i])
  questions.append(ab)
print(questions);
sentence_words = [[word for word in document.split()]
                    for document in questions]

print()
#sentences = nltk.sent_tokenize(paragraph)
#lemmatizer = WordNetLemmatizer()
#for i in range(len(sentences)):
    #words = nltk.word_tokenize(sentences[i])
    #words = [lemmatizer.lemmatize(word) for word in words if word not in set(stopwords.words('english'))]
    #sentences[i] = ' '.join(words)  
    
def preprocess_without_stopwords(ab):
  corpus=[]
  ab = re.sub('[^a-zA-Z]', ' ', ab)
  ab = ab.lower()
  ab = remove_stopwords(ab)
  return ab

cleaned_sentences=[]
for i in range(0, len(df)):
  ab = preprocess_without_stopwords(df['Questions'][i])
  cleaned_sentences.append(ab) 
print(cleaned_sentences);   
 


 

['what does the job hunting experience look like  ', 'what are the most valuable skills for a data scientist ', 'what is the average salary of a data scientist ', 'what are the top algorithms that every data scientist should have in his her toolbox ', 'how to prepare for a data science interview ', 'any insights you can offer about the ds job market   ', 'do employers look for an advanced ml degree  ', 'how does a typical day of a data scientist look like ', 'do i need to prepare algorithms and data structures for a data science interview   ', 'what are the benefits of a data science certification ', 'should i participate in hackathons  will that help me in getting a job ', 'do i need to know statistics in order to land a data science role ', 'what are the career opportunities in data science ', 'what are the most common mistakes data science enthusiasts make in an interview ', 'what s the impact of covid on hiring for ds roles ', 'what skills and qualities do employers look for in a d

## Bag of words Model

#### The first model I am using for semantic similarity is  Bag of Words (BOW). With BOW, each sentence is encoded into a vector whose length is the number of words in the vocabulary. Each element of the vector indicates how many times the particular word occurs in the sentence. An example is shown below by printing the dictionary and the FAQ questions in the BOW sparse format. The vector representation is also called "Embedding".

In [28]:

from gensim import corpora

dictionary = corpora.Dictionary(sentence_words)
for key, value in dictionary.items():
    print(key, ' : ', value)

import pprint
bow_corpus = [dictionary.doc2bow(text) for text in sentence_words]
for sent,embedding in zip(questions,bow_corpus):
    print(sent)
    print(embedding)

0  :  does
1  :  experience
2  :  hunting
3  :  job
4  :  like
5  :  look
6  :  the
7  :  what
8  :  a
9  :  are
10  :  data
11  :  for
12  :  most
13  :  scientist
14  :  skills
15  :  valuable
16  :  average
17  :  is
18  :  of
19  :  salary
20  :  algorithms
21  :  every
22  :  have
23  :  her
24  :  his
25  :  in
26  :  should
27  :  that
28  :  toolbox
29  :  top
30  :  how
31  :  interview
32  :  prepare
33  :  science
34  :  to
35  :  about
36  :  any
37  :  can
38  :  ds
39  :  insights
40  :  market
41  :  offer
42  :  you
43  :  advanced
44  :  an
45  :  degree
46  :  do
47  :  employers
48  :  ml
49  :  day
50  :  typical
51  :  and
52  :  i
53  :  need
54  :  structures
55  :  benefits
56  :  certification
57  :  getting
58  :  hackathons
59  :  help
60  :  me
61  :  participate
62  :  will
63  :  know
64  :  land
65  :  order
66  :  role
67  :  statistics
68  :  career
69  :  opportunities
70  :  common
71  :  enthusiasts
72  :  make
73  :  mistakes
74  :  covid
75  :  hir

In [29]:
question_orig="What is the mathematics required for becoming a data scientist?"
question=preprocess(question_orig)
question_embedding = dictionary.doc2bow(question.split())

print("\n\n",question,"\n",question_embedding)



 what is the mathematics required for becoming a data scientist  
 [(6, 1), (7, 1), (8, 1), (10, 1), (11, 1), (13, 1), (17, 1), (88, 1)]


### Calculating Similarity using "Cosine Similarity"

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(question_embedding, sentence_embeddings, FAQdf, questions):
  max_sim=-1;
  index_sim=-1;
  for index, faq_embedding in enumerate(sentence_embeddings):
    sim = cosine_similarity(faq_embedding, question_embedding)[0][0];
    print(index, sim, questions[index])
    if sim>max_sim:
      max_sim=sim;
      index_sim=index;
  print("\n")
  print("Question: ",question)
  print("\n");
  print("Retrieved: ",FAQdf.iloc[index_sim,0]) 
  print(FAQdf.iloc[index_sim,1])

calculate_similarity(question_embedding, bow_corpus, df, questions)

0 0.1643989873053573 what does the job hunting experience look like  
1 1.0 what are the most valuable skills for a data scientist 
2 1.0 what is the average salary of a data scientist 
3 1.0 what are the top algorithms that every data scientist should have in his her toolbox 
4 0.9991680531005775 how to prepare for a data science interview 
5 0.987762965329069 any insights you can offer about the ds job market   
6 0.9994801143396997 do employers look for an advanced ml degree  
7 0.1643989873053573 how does a typical day of a data scientist look like 
8 0.9991680531005775 do i need to prepare algorithms and data structures for a data science interview   
9 1.0 what are the benefits of a data science certification 
10 0.987762965329069 should i participate in hackathons  will that help me in getting a job 
11 0.9991680531005775 do i need to know statistics in order to land a data science role 
12 1.0 what are the career opportunities in data science 
13 1.0 what are the most common mi

## Word2Vec Embeddings

#### Word2Vec embeddings are popularly trained using the skipgram model. These embeddings are trained to take a word as input and reconstruct its context. As a result, they are able to take into account semantic similarity of words based on context information. The resulting embeddings are such that words with similar meaning tend to be closer in terms of cosine similarity.
#### The most popular word2vec model is the skipgram model. Particularly, the most commonly used pre-trained model is based on the Google News dataset that has 3 billion running words and creates upto 300 dimensional embedding for 3 Million words.

In [31]:
from gensim.models import Word2Vec 
import gensim.downloader as api


glove_model=None;
try:
    glove_model = gensim.models.KeyedVectors.load("./glovemodel.mod")
    print("Loaded glove model")
except:            
    glove_model = api.load('glove-twitter-25')
    glove_model.save("./glovemodel.mod")
    print("Saved glove model")
    
v2w_model=None;
try:
    v2w_model = gensim.models.KeyedVectors.load("./w2vecmodel.mod")
    print("Loaded w2v model")
except:            
    v2w_model = api.load('word2vec-google-news-300')
    v2w_model.save("./w2vecmodel.mod")
    print("Saved glove model")

w2vec_embedding_size=len(v2w_model['computer']);
glove_embedding_size=len(glove_model['computer']);

Saved glove model
Saved glove model


### Finding Phrase Embeddings from Word Embeddings

#### To find phrase embeddings, there are sevaral specialized techniques. The most simple technique to convert word embeddings to phrase embeddings, that is applicable with word2vec and glove embeddings, is to sum up the individual word embeddings in the phrase to get a phrase vector which is implemented below.

In [32]:
import numpy

def getWordVec(word,model):
        samp=model['computer'];
        vec=[0]*len(samp);
        try:
                vec=model[word];
        except:
                vec=[0]*len(samp);
        return (vec)


def getPhraseEmbedding(phrase,embeddingmodel):
                       
        samp=getWordVec('computer', embeddingmodel);
        vec=numpy.array([0]*len(samp));
        den=0;
        for word in phrase.split():
            #print(word)
            den=den+1;
            vec=vec+numpy.array(getWordVec(word,embeddingmodel));
        #vec=vec/den;
        #return (vec.tolist());
        return vec.reshape(1, -1)

In [33]:
#With Word2Vec

sent_embeddings=[];
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent,v2w_model));

question_embedding=getPhraseEmbedding(question,v2w_model);

calculate_similarity(question_embedding,sent_embeddings,df, cleaned_sentences);

0 0.3799822892526198 job hunting experience look like
1 0.6461033784288108 valuable skills data scientist
2 0.6070850710931044 average salary data scientist
3 0.591207774232994 algorithms data scientist toolbox
4 0.6215638438872949 prepare data science interview
5 0.3747165543616412 insights offer ds job market
6 0.3947093467569298 employers look advanced ml degree
7 0.6664771811511914 typical day data scientist look like
8 0.661711805242355 need prepare algorithms data structures data science interview
9 0.6154109843716007 benefits data science certification
10 0.4410042557442965 participate hackathons help getting job
11 0.7216223983369535 need know statistics order land data science role
12 0.5718335806128203 career opportunities data science
13 0.5995483341452104 common mistakes data science enthusiasts interview
14 0.3120821846637402 s impact covid hiring ds roles
15 0.5935067281754758 skills qualities employers look data scientist
16 0.644360895616578 proficient data scientist co

## Glove Embeddings

#### Glove is an alternate approach to build word embeddings using matrix factorization techinques on the word-word co-occurance matrix.
#### While both the techniques are popular, glove performs better on some datasets while word2vec skipgram model performs better on some. Here, I have experimented with both the word2vec and the glove models.

In [35]:
#With Glove

sent_embeddings=[];
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent,glove_model));
    
question_embedding=getPhraseEmbedding(question,glove_model);

calculate_similarity(question_embedding,sent_embeddings,df, cleaned_sentences);

0 0.9414487654572873 job hunting experience look like
1 0.8814296457717923 valuable skills data scientist
2 0.8593565693682742 average salary data scientist
3 0.7090768649192446 algorithms data scientist toolbox
4 0.906074353874184 prepare data science interview
5 0.8549028897114835 insights offer ds job market
6 0.877376826630391 employers look advanced ml degree
7 0.9645478930163514 typical day data scientist look like
8 0.9203193687276834 need prepare algorithms data structures data science interview
9 0.8306463881778596 benefits data science certification
10 0.8950820994508617 participate hackathons help getting job
11 0.976240940361076 need know statistics order land data science role
12 0.8960068949797406 career opportunities data science
13 0.9311682185876409 common mistakes data science enthusiasts interview
14 0.8509392843323993 s impact covid hiring ds roles
15 0.9172160767134758 skills qualities employers look data scientist
16 0.7438779300194449 proficient data scientist co