# Automated Question Answering from FAQs Using Word-Embeddings


#### The basic idea of this project is to automatically retrieve a suitable response to a customer question from a set of FAQs. Websites often have comprehensive FAQs, but manually searching them for the answer to a specific question is not trivial. The purpose of this exercise is to answer user queries by automatically retrieving the closest question and answer from predefined FAQs when appropriate. I use a dataset of FAQs that I compiled myself by searching the Internet for the most frequently asked questions. The basic strategy is to find the FAQ question that is closest in meaning to the user query and display it, along with its answer, to the user. An efficient way to compute the semantic similarity between two sentences is to convert each sentence into a vector and then use the cosine similarity between the vectors as a measure of how close the sentences are in meaning. I have tried several models for this, which are explained below.


In [None]:
import pandas as pd
import nltk
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')

In [None]:
df = pd.read_csv("FAQs.csv");
df.columns=["Questions","Answers"];
df


# Preprocessing

##### Most NLP tasks involve preprocessing. Here it consists of:
- Replacing every non-alphabetic character with a space and lowercasing the text.
- Removing stopwords, i.e. commonly used words such as 'a', 'to' and 'in' that do not contribute to the semantic similarity between two sentences.
##### We apply this to both the FAQ questions and the user query sentence.
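The two steps above can be sketched on a single sentence as follows. This is only an illustration under stated assumptions: `TOY_STOPWORDS` and `clean` are hypothetical names, the stopword list is deliberately tiny, and unlike the notebook's own functions the sketch also collapses runs of spaces.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the notebook itself relies on gensim's built-in list.
TOY_STOPWORDS = {"a", "an", "the", "to", "in", "is", "for", "of"}

def clean(text, drop_stopwords=False):
    """Replace non-letters with spaces, lowercase, optionally drop stopwords."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = text.split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return " ".join(tokens)

query = "Is Coding required for becoming a Data Scientist?"
print(clean(query))                       # is coding required for becoming a data scientist
print(clean(query, drop_stopwords=True))  # coding required becoming data scientist
```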

In [19]:
wordnet=WordNetLemmatizer()
from gensim.parsing.preprocessing import remove_stopwords

def preprocess(text):
  # Replace every non-alphabetic character with a space, then lowercase.
  text = re.sub('[^a-zA-Z]', ' ', text)
  return text.lower()

# Cleaned FAQ questions (stopwords kept), used by the bag-of-words model.
questions = [preprocess(q) for q in df['Questions']]
print(questions)

# Tokenize each cleaned question on whitespace.
sentence_words = [document.split() for document in questions]

print()
#sentences = nltk.sent_tokenize(paragraph)
#lemmatizer = WordNetLemmatizer()
#for i in range(len(sentences)):
    #words = nltk.word_tokenize(sentences[i])
    #words = [lemmatizer.lemmatize(word) for word in words if word not in set(stopwords.words('english'))]
    #sentences[i] = ' '.join(words)  
    
def preprocess_without_stopwords(text):
  # Same cleaning as preprocess(), followed by gensim's stopword removal.
  text = re.sub('[^a-zA-Z]', ' ', text)
  text = text.lower()
  return remove_stopwords(text)

# Cleaned FAQ questions with stopwords removed, used by the word-embedding models.
cleaned_sentences = [preprocess_without_stopwords(q) for q in df['Questions']]
print(cleaned_sentences)
 


 

['what does the job hunting experience look like  ', 'what are the most valuable skills for a data scientist ', 'what is the average salary of a data scientist ', 'what are the top algorithms that every data scientist should have in his her toolbox ', 'how to prepare for a data science interview ', 'any insights you can offer about the ds job market   ', 'do employers look for an advanced ml degree  ', 'how does a typical day of a data scientist look like ', 'do i need to prepare algorithms and data structures for a data science interview   ', 'what are the benefits of a data science certification ', 'should i participate in hackathons  will that help me in getting a job ', 'do i need to know statistics in order to land a data science role ', 'what are the career opportunities in data science ', 'what are the most common mistakes data science enthusiasts make in an interview ', 'what s the impact of covid on hiring for ds roles ', 'what skills and qualities do employers look for in a d

## Bag of words Model

#### The first model I am using for semantic similarity is Bag of Words (BOW). With BOW, each sentence is encoded into a vector whose length is the number of words in the vocabulary. Each element of the vector indicates how many times a particular word occurs in the sentence. An example is shown below by printing the dictionary and the FAQ questions in the BOW sparse format, where each sentence is a list of (word id, count) pairs and words that do not occur are omitted. The vector representation is also called an "embedding".
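To make the encoding concrete, here is a minimal pure-Python sketch of the same idea on two toy sentences. The helper names are hypothetical, and word ids are assigned in order of first appearance, which is an assumption of this sketch; gensim assigns its own ids.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_sentences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def bow_sparse(tokens, vocab):
    """Sparse BOW: sorted (word_id, count) pairs for in-vocabulary words."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

def bow_dense(tokens, vocab):
    """Dense BOW: one count per vocabulary word, zeros included."""
    vec = [0] * len(vocab)
    for word_id, count in bow_sparse(tokens, vocab):
        vec[word_id] = count
    return vec

sentences = [["data", "science", "interview"],
             ["data", "scientist", "salary", "data"]]
vocab = build_vocabulary(sentences)
print(vocab)                            # {'data': 0, 'science': 1, 'interview': 2, 'scientist': 3, 'salary': 4}
print(bow_sparse(sentences[1], vocab))  # [(0, 2), (3, 1), (4, 1)]
print(bow_dense(sentences[1], vocab))   # [2, 0, 0, 1, 1]
```

The sparse and dense forms carry the same information; the sparse one simply leaves out the zero counts.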

In [28]:

from gensim import corpora

dictionary = corpora.Dictionary(sentence_words)
for key, value in dictionary.items():
    print(key, ' : ', value)

# Encode each tokenized question as sparse (word id, count) pairs.
bow_corpus = [dictionary.doc2bow(text) for text in sentence_words]
for sent, embedding in zip(questions, bow_corpus):
    print(sent)
    print(embedding)

0  :  does
1  :  experience
2  :  hunting
3  :  job
4  :  like
5  :  look
6  :  the
7  :  what
8  :  a
9  :  are
10  :  data
11  :  for
12  :  most
13  :  scientist
14  :  skills
15  :  valuable
16  :  average
17  :  is
18  :  of
19  :  salary
20  :  algorithms
21  :  every
22  :  have
23  :  her
24  :  his
25  :  in
26  :  should
27  :  that
28  :  toolbox
29  :  top
30  :  how
31  :  interview
32  :  prepare
33  :  science
34  :  to
35  :  about
36  :  any
37  :  can
38  :  ds
39  :  insights
40  :  market
41  :  offer
42  :  you
43  :  advanced
44  :  an
45  :  degree
46  :  do
47  :  employers
48  :  ml
49  :  day
50  :  typical
51  :  and
52  :  i
53  :  need
54  :  structures
55  :  benefits
56  :  certification
57  :  getting
58  :  hackathons
59  :  help
60  :  me
61  :  participate
62  :  will
63  :  know
64  :  land
65  :  order
66  :  role
67  :  statistics
68  :  career
69  :  opportunities
70  :  common
71  :  enthusiasts
72  :  make
73  :  mistakes
74  :  covid
75  :  hir

In [24]:
question_orig="Is Coding required for becoming data scientist?"
question=preprocess(question_orig)
question_embedding = dictionary.doc2bow(question.split())

print("\n\n",question,"\n",question_embedding)



 is coding required for becoming data scientist  
 [(10, 1), (11, 1), (13, 1), (17, 1), (82, 1), (88, 1)]


### Calculating Similarity using "Cosine Similarity"

In [25]:
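from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(question_embedding, sentence_embeddings, FAQdf, questions):
  # Track the best-scoring FAQ entry seen so far.
  max_sim = -1
  index_sim = -1
  for index, faq_embedding in enumerate(sentence_embeddings):
    # cosine_similarity expects 2-D inputs of shape (n_samples, n_features)
    # and returns a matrix; [0][0] is the score between the first rows.
    sim = cosine_similarity(faq_embedding, question_embedding)[0][0]
    print(index, sim, questions[index])
    if sim > max_sim:
      max_sim = sim
      index_sim = index
  print("\n")
  # `question` is the preprocessed user query defined in an earlier cell.
  print("Question: ", question)
  print("\n")
  # Column 0 holds the FAQ question, column 1 its answer.
  print("Retrieved: ", FAQdf.iloc[index_sim, 0])
  print(FAQdf.iloc[index_sim, 1])

calculate_similarity(question_embedding, bow_corpus, df, questions)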

0 0.09950371902099892 what does the job hunting experience look like  
1 0.9978569490503114 what are the most valuable skills for a data scientist 
2 0.9978569490503114 what is the average salary of a data scientist 
3 0.9978569490503114 what are the top algorithms that every data scientist should have in his her toolbox 
4 0.9996953077321108 how to prepare for a data science interview 
5 0.9754410020677066 any insights you can offer about the ds job market   
6 0.9952285251199801 do employers look for an advanced ml degree  
7 0.09950371902099892 how does a typical day of a data scientist look like 
8 0.9996953077321108 do i need to prepare algorithms and data structures for a data science interview   
9 0.9978569490503114 what are the benefits of a data science certification 
10 0.9754410020677066 should i participate in hackathons  will that help me in getting a job 
11 0.9996953077321108 do i need to know statistics in order to land a data science role 
12 0.9978569490503114 what a
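
`doc2bow` returns a sparse list of `(token_id, count)` pairs. When such a list is passed straight to `cosine_similarity`, each pair is read as a two-feature row, so the score compares `(id, count)` pairs rather than word counts. A minimal sketch of one way around this, assuming the sparse vectors are first expanded to full vocabulary length; `to_dense`, `cosine`, `best_match` and the toy data are hypothetical and not part of the notebook:

```python
import math

def to_dense(sparse_bow, vocab_size):
    """Expand [(token_id, count), ...] into a full-length count vector."""
    vec = [0.0] * vocab_size
    for token_id, count in sparse_bow:
        vec[token_id] = float(count)
    return vec

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_match(query_bow, faq_bows, vocab_size):
    """Return (index, score) of the FAQ vector closest to the query."""
    query = to_dense(query_bow, vocab_size)
    scores = [cosine(to_dense(bow, vocab_size), query) for bow in faq_bows]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy data over a vocabulary of 5 word ids (made up for illustration).
faq_bows = [[(0, 1), (1, 1)], [(2, 1), (3, 1), (4, 1)]]
query_bow = [(2, 1), (3, 1)]
index, score = best_match(query_bow, faq_bows, vocab_size=5)
print(index, round(score, 4))  # 1 0.8165
```

gensim also ships a helper for the expansion step, `gensim.matutils.sparse2full`, which could replace `to_dense` here.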

## Word2Vec Embeddings

#### Word2Vec embeddings are popularly trained using the skip-gram model, which learns to take a word as input and predict the words in its context. As a result, the embeddings capture the semantic similarity of words based on context information: words with similar meaning tend to be close in terms of cosine similarity.
#### The most commonly used pre-trained model is based on part of the Google News dataset, which has about 100 billion running words, and provides 300-dimensional embeddings for 3 million words and phrases.
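
To illustrate what "take a word as input and predict its context" means in practice, the sketch below lists the (center, context) pairs a skip-gram model would be trained on for one sentence. The function name and the window size are illustrative assumptions; the actual training objective, including details such as negative sampling, is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["data", "science", "interview"], window=1))
# [('data', 'science'), ('science', 'data'), ('science', 'interview'), ('interview', 'science')]
```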

In [29]:
from gensim.models import KeyedVectors
import gensim.downloader as api

# Load the GloVe vectors from a local cache, downloading them on first use.
try:
    glove_model = KeyedVectors.load("./glovemodel.mod")
    print("Loaded glove model")
except OSError:
    glove_model = api.load('glove-twitter-25')
    glove_model.save("./glovemodel.mod")
    print("Saved glove model")

# Same pattern for the Google News word2vec vectors.
try:
    v2w_model = KeyedVectors.load("./w2vecmodel.mod")
    print("Loaded w2v model")
except OSError:
    v2w_model = api.load('word2vec-google-news-300')
    v2w_model.save("./w2vecmodel.mod")
    print("Saved w2v model")

# Dimensionality of each embedding space (300 for word2vec, 25 for GloVe).
w2vec_embedding_size = len(v2w_model['computer'])
glove_embedding_size = len(glove_model['computer'])

Saved glove model
Saved w2v model


### Finding Phrase Embeddings from Word Embeddings

#### There are several specialized techniques for finding phrase embeddings. The simplest way to convert word embeddings into a phrase embedding, applicable to both word2vec and GloVe embeddings, is to sum the embeddings of the individual words in the phrase to get a phrase vector, which is implemented below.
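
As a small worked illustration of the summing approach, the sketch below uses made-up 3-dimensional vectors in place of a real embedding model; `TOY_VECTORS`, `phrase_vector` and `cosine` are hypothetical names. It also shows that averaging instead of summing only rescales the vector, which leaves cosine similarity unchanged.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration.
TOY_VECTORS = {
    "data": [1.0, 0.0, 1.0],
    "science": [0.0, 1.0, 1.0],
    "salary": [1.0, 1.0, 0.0],
}
DIM = 3

def phrase_vector(phrase, vectors=TOY_VECTORS, average=False):
    """Sum (or average) the vectors of known words; unknown words add nothing."""
    total = [0.0] * DIM
    known = 0
    for word in phrase.split():
        if word in vectors:
            known += 1
            total = [t + x for t, x in zip(total, vectors[word])]
    if average and known:
        total = [t / known for t in total]
    return total

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

summed = phrase_vector("data science unknownword")
averaged = phrase_vector("data science unknownword", average=True)
print(summed)    # [1.0, 1.0, 2.0]
print(averaged)  # [0.5, 0.5, 1.0]
print(cosine(summed, averaged))  # approximately 1: same direction, different length
```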

In [26]:
import numpy

def getWordVec(word, model):
    # Return the embedding for `word`, or a zero vector if it is out of vocabulary.
    samp = model['computer']
    try:
        vec = model[word]
    except KeyError:
        vec = [0] * len(samp)
    return vec


def getPhraseEmbedding(phrase, embeddingmodel):
    # Sum the word vectors of the phrase into a single phrase vector.
    samp = getWordVec('computer', embeddingmodel)
    vec = numpy.array([0] * len(samp))
    for word in phrase.split():
        vec = vec + numpy.array(getWordVec(word, embeddingmodel))
    # Return shape (1, n_features), the 2-D layout cosine_similarity expects.
    return vec.reshape(1, -1)

In [30]:
#With Word2Vec

sent_embeddings=[];
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent,v2w_model));

question_embedding=getPhraseEmbedding(question,v2w_model);

calculate_similarity(question_embedding,sent_embeddings,df, cleaned_sentences);

0 0.2859340695245002 job hunting experience look like
1 0.6399762007455526 valuable skills data scientist
2 0.5515435515887862 average salary data scientist
3 0.6785673174335114 algorithms data scientist toolbox
4 0.5518619953407247 prepare data science interview
5 0.3603353703747457 insights offer ds job market
6 0.3369262291158942 employers look advanced ml degree
7 0.6239160348568386 typical day data scientist look like
8 0.68063636350156 need prepare algorithms data structures data science interview
9 0.6335812632832891 benefits data science certification
10 0.408421370741606 participate hackathons help getting job
11 0.599818270201776 need know statistics order land data science role
12 0.48983332662133156 career opportunities data science
13 0.5508442795886829 common mistakes data science enthusiasts interview
14 0.31511008096839943 s impact covid hiring ds roles
15 0.5408527224896755 skills qualities employers look data scientist
16 0.7644062871503434 proficient data scientist c

## Glove Embeddings

#### GloVe is an alternative approach that builds word embeddings using matrix factorization techniques on the word-word co-occurrence matrix.
#### Both techniques are popular: GloVe performs better on some datasets, while the word2vec skip-gram model performs better on others. Here, I have experimented with both the word2vec and the GloVe models.
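
For intuition about the word-word co-occurrence matrix, the sketch below counts how often pairs of words appear next to each other in a toy corpus. It only builds the counts and stores them as a dictionary rather than a matrix; the factorization step that turns them into embeddings is not shown. The function name, the window size and the use of unweighted counts are assumptions of this sketch.

```python
from collections import defaultdict

def cooccurrence_counts(tokenized_sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for tokens in tokenized_sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return dict(counts)

corpus = [["data", "science", "interview"], ["data", "science", "role"]]
counts = cooccurrence_counts(corpus)
print(counts[("data", "science")])  # 2
print(counts[("science", "role")])  # 1
```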

In [31]:
#With Glove

sent_embeddings=[];
for sent in cleaned_sentences:
    sent_embeddings.append(getPhraseEmbedding(sent,glove_model));
    
question_embedding=getPhraseEmbedding(question,glove_model);

calculate_similarity(question_embedding,sent_embeddings,df, cleaned_sentences);

0 0.899116905528574 job hunting experience look like
1 0.9336363898230245 valuable skills data scientist
2 0.9081188685497033 average salary data scientist
3 0.8141071793895427 algorithms data scientist toolbox
4 0.9175146061047484 prepare data science interview
5 0.8730817407502276 insights offer ds job market
6 0.9004896812670068 employers look advanced ml degree
7 0.9166257825507318 typical day data scientist look like
8 0.9548748358557457 need prepare algorithms data structures data science interview
9 0.9008804241466186 benefits data science certification
10 0.9109423463126703 participate hackathons help getting job
11 0.9658703837605412 need know statistics order land data science role
12 0.93260344932244 career opportunities data science
13 0.9561140665659048 common mistakes data science enthusiasts interview
14 0.8268798168858132 s impact covid hiring ds roles
15 0.9505724757994463 skills qualities employers look data scientist
16 0.8539567220743777 proficient data scientist co