# Information Extraction using Doc2Vec

## Model Description

We use a Doc2Vec model to find the document or FAQ most similar to a user query. To get started, we need a set of documents on which to train the model. The bank regulation manual (http://www.bsp.gov.ph/downloads/regulations/morb/morb1.pdf) was used as the reference document for creating the training data. General FAQs were taken from ICICI Bank (http://www.icicibank.com/Personal-Banking/faq/index.page). Each FAQ is treated as a single document.


The model was built in the following steps:


* Build the tagged corpus from the available banking text files
* Build the Doc2Vec model from the tagged corpus
* To test the model on the training corpus, infer a vector for each document
* Find the most similar document in the corpus for each inferred vector
* For a test query, infer a vector and find its most similar document in the corpus



## Define a Function to Read and Preprocess Text

Below, we build the corpus by reading each file, pre-processing each document with a simple gensim pre-processing tool (which tokenizes the text into individual words, removes punctuation, converts to lowercase, etc.), and returning a list of words. Each file in the general banking folder is treated as a single document. In an FAQ file, each line constitutes a single document, and the length of each line can vary. To train the model, we also need to associate a tag with each document of the training corpus. In our case, the tag is a zero-based running index over all documents, in the order they are read.

In [3]:
import os
os.chdir('/home/afreen/Desktop/BOT')


import gensim
import glob
import io
import re

corpus = []    # TaggedDocument objects used for training
all_docs = []  # raw text of each document, indexed by tag

# General banking text: each file is one document
list_of_files_1 = sorted(glob.glob(r'./Banking_General/*.txt'))
i = 0
for file_name in list_of_files_1:
    with io.open(file_name, 'r', encoding='latin-1') as FI:
        doc_text = FI.read()
    s = gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(doc_text), [i])
    i = i + 1
    corpus.append(s)
    all_docs.append(doc_text)

# FAQ files: each line is one document
list_of_files_2 = sorted(glob.glob(r'./Banking_FAQ/*.txt'))
for file_name in list_of_files_2:
    with io.open(file_name, 'r', encoding='latin-1') as FI:
        faq_text = FI.read()
    for line in faq_text.split('\n'):
        corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(line), [i]))
        all_docs.append(line)
        i = i + 1

In [4]:
corpus[158]

TaggedDocument(words=[u'will', u'the', u'customer', u'be', u'intimated', u'once', u'the', u'mobile', u'number', u'updation', u'is', u'done', u'yes', u'the', u'customer', u'will', u'be', u'intimated', u'by', u'sms', u'on', u'the', u'old', u'and', u'new', u'mobile', u'number', u'once', u'the', u'same', u'is', u'updated', u'in', u'icore'], tags=[158])
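
The tagged document above contains only lowercase word tokens, with punctuation removed. A minimal standard-library sketch of that kind of preprocessing is given here. The names approx_preprocess and tag_lines are hypothetical, and the regular expression and the 2 to 15 character length limits are assumptions that only approximate gensim.utils.simple_preprocess.

```python
import re

def approx_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (an approximation, not gensim's code)."""
    # Lowercase, then keep runs of letters only (digits, punctuation and underscores are dropped)
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Discard very short and very long tokens
    return [t for t in tokens if min_len <= len(t) <= max_len]

def tag_lines(text, start=0):
    """Pair each line's tokens with a running integer tag, starting at `start`."""
    return [(approx_preprocess(line), [start + n]) for n, line in enumerate(text.split("\n"))]

tagged = tag_lines("How do I find my CVV number ?\nWhat is a PIN?")
for words, tags in tagged:
    print(tags, words)
```

One-letter words such as "I" and "a" fall below the minimum length and are dropped, which is why they are also missing from the preprocessed documents printed in this notebook.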

## Training the Model

### Instantiate a Doc2Vec Object

Now we instantiate a Doc2Vec model with 100-dimensional vectors that iterates over the training corpus 20 times. The minimum word count is set to 1, so every word in the corpus is kept in the vocabulary, and dm=0 selects the distributed bag-of-words (PV-DBOW) training mode. Increasing the number of iterations can improve model accuracy, but it generally increases the training time.


In [5]:
model = gensim.models.doc2vec.Doc2Vec(size=100, alpha=0.025, min_alpha=0.025, min_count=1, dm=0, iter=20)
model.build_vocab(corpus)
model.train(corpus)

1737280
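
With alpha and min_alpha both set to 0.025, the learning rate has no room to decrease during training. The sketch below assumes a linear schedule from alpha down to min_alpha across epochs and compares the settings above with a smaller min_alpha of 0.0001. Both linear_alpha (a hypothetical helper, not a gensim function) and the alternative value are illustrative assumptions.

```python
def linear_alpha(alpha, min_alpha, epochs):
    """Per-epoch learning rates, assuming a linear decay from alpha to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * e for e in range(epochs)]

constant = linear_alpha(0.025, 0.025, 20)   # settings used in the cell above
decaying = linear_alpha(0.025, 0.0001, 20)  # hypothetical alternative

print(constant[0], constant[-1])
print(decaying[0], decaying[-1])
```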

## Inferring a Vector
We can now infer a vector for any piece of text, without re-training the model, by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.
We first test the model on the training corpus itself: for the inferred vector of each document, we check which document the model returns as the most similar.
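
To make the comparison step concrete, here is a small standard-library sketch of cosine similarity and a top-n ranking over a list of vectors. The function names and the toy two-dimensional vectors are hypothetical; the notebook itself relies on gensim for this step.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(query_vec, doc_vecs, topn=3):
    """Return (doc_id, similarity) pairs, best match first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy document vectors, for illustration only
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(most_similar([2.0, 0.1], doc_vecs, topn=2))
```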

In [6]:
%%time

top_ranks = []
for doc_id in range(len(corpus)):
    inferred_vector = model.infer_vector(corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

    top_ranks.append(sims[0])
    

doc_id = 125

# Print one training document and the document the model ranks as most similar to it

print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(corpus[doc_id].words)))
sim_id = top_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(corpus[sim_id[0]].words)))

Train Document (125): «how will get my premium paid receipt the bank will process the receipt»

Similar Document (125, 0.9226488471031189): «how will get my premium paid receipt the bank will process the receipt»

CPU times: user 980 ms, sys: 0 ns, total: 980 ms
Wall time: 1.14 s
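
The cell above prints a single example. To summarize the check over the whole corpus, one option is to count how often a document's top match is the document itself. The sketch below assumes top_ranks has the shape built above (one (tag, similarity) pair per document, in tag order). The helper self_match_rate is hypothetical and the sample values are made up for illustration.

```python
def self_match_rate(top_ranks):
    """Fraction of documents whose most similar document is themselves."""
    if not top_ranks:
        return 0.0
    hits = sum(1 for doc_id, (best_id, _sim) in enumerate(top_ranks) if best_id == doc_id)
    return hits / float(len(top_ranks))

# Made-up example: documents 0, 1 and 3 match themselves, document 2 matches document 0
sample_top_ranks = [(0, 0.93), (1, 0.88), (0, 0.71), (3, 0.90)]
print(self_match_rate(sample_top_ranks))
```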


## Testing the model on a user query

We now test the model on a user query using the steps above: preprocess the query in the same way as the training documents, infer a vector for the preprocessed query, and find the most similar document/FAQ. The document id is returned only if the similarity score is above a threshold (0.2 in the code below); otherwise a fallback message is returned.

In [7]:
def process_query(query):
    query = gensim.utils.simple_preprocess(query)
    inferred_vector= model.infer_vector(query)
    sims = model.docvecs.most_similar([inferred_vector], topn=3)
    top_match = sims[0]
    if top_match[1] > 0.2 :
        return top_match[0]
    else:
        return "Sorry, I don't have an answer for that."
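
Because process_query returns either an integer tag or a message string, callers have to tell the two apart (for example, int(result) raises ValueError on the message). One alternative, sketched here with a hypothetical pick_match helper that takes the list of (tag, similarity) pairs directly, is to return None when no match clears the threshold and to choose the fallback message at the point of display.

```python
def pick_match(sims, threshold=0.2):
    """Return the best tag from (tag, similarity) pairs, or None if the best score is not above threshold."""
    if not sims:
        return None
    best_tag, best_score = max(sims, key=lambda pair: pair[1])
    return best_tag if best_score > threshold else None

FALLBACK = "Sorry, I don't have an answer for that."

# Made-up similarity lists: the first clears the threshold, the second does not
for sims in ([(179, 0.81), (42, 0.40)], [(7, 0.15)]):
    tag = pick_match(sims)
    print(tag if tag is not None else FALLBACK)
```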


In [50]:
query = "how do i find my cvv number"
result = process_query(query)
print("The FAQ id : %d \n" % int(result))
print(all_docs[int(result)])

The FAQ id : 179 

How do I find my CVV number ? The CVV number is a code entered on the back of the card. There are 7 digits entered near the signature panel at the back, out of which the last 3 digits are the CVV number. This value is required as a form of authentication for online and IVR transactions.


## Model Evaluation

We now evaluate the model on a test corpus by counting the queries for which the model returned the correct matching FAQ. We use two test corpora: 1) one containing queries that exactly match an FAQ in the corpus, and 2) one containing queries that are semantically similar to an FAQ. We evaluate the model separately on these two corpora.

### PART 1: Exactly matching queries

In [51]:
testCorpus = []
file_name = r'./TestCorpus/TestFAQs_Direct.txt'
# Each line of the test file is one query
with io.open(file_name, 'r', encoding='UTF-8') as testFile:
    for line in testFile.read().split('\n'):
        testCorpus.append(line)

def process_testCorpus(testCorpus):
    all_results = []
    for each_FAQ in testCorpus:
        all_results.append(process_query(each_FAQ))
    return all_results

all_results = process_testCorpus(testCorpus)       

In [52]:
qa= []
inds = []

for i,query in enumerate(testCorpus):
    try:
        ans_inds = int(all_results[i])
        a = all_docs[ans_inds]
    except ValueError:
        ans_inds = 9999
        a = all_results[i]
    qa.append(query + ' MATCHING QUERY :  ' + a)
    inds.append(ans_inds)

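# Write each query with its matched document; the with block closes (and flushes) the file
with io.open(r'./TestCorpus/Results_Direct.txt', 'w', encoding='UTF-8') as resultFile:
    for line in qa:
        resultFile.write("%s\n" % line)

i = 0
score = 0.
with io.open(r'./TestCorpus/Answers_Direct.txt', 'r', encoding='latin-1') as answerFile:
    for line in answerFile:
        # Parse the whitespace-separated integer ids on this line
        line = (line.strip('\n')).encode('ascii', 'ignore')
        line = line.split()
        line = list(map(int, line))
        # The predicted tag is shifted by 95 before it is compared with the answer ids
        if (inds[i] - 95) in line:
            score = score + 1
        i = i + 1

accuracy = score / len(inds)
print("For %d Exact Questions, accuracy  : %f" % (len(inds), accuracy))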

For 21 Exact Questions, accuracy  : 0.666667
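
The scoring loop above can also be written as a small function. This sketch assumes, as the loop does, that each answer line holds one or more acceptable ids and that predicted tags are shifted by a fixed offset before comparison. The helper score_predictions and the sample data are hypothetical.

```python
def score_predictions(predicted, answer_lines, offset=95):
    """Fraction of predictions whose shifted tag appears on the matching answer line."""
    if not predicted:
        return 0.0
    correct = 0
    for tag, line in zip(predicted, answer_lines):
        accepted = [int(tok) for tok in line.split()]
        if (tag - offset) in accepted:
            correct += 1
    return correct / float(len(predicted))

# Made-up example: only the first prediction (179 - 95 = 84) is among the accepted ids
predicted = [179, 9999, 120]
answer_lines = ["84", "10 11", "26"]
print(score_predictions(predicted, answer_lines))
```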


### PART 2: Semantically similar queries

We now test the model on user queries that are semantically similar to, but do not exactly match, a query in the corpus.

In [53]:
testCorpus = []
file_name = r'./TestCorpus/TestFAQs_Indirect.txt'
# Each line of the test file is one query
with io.open(file_name, 'r', encoding='UTF-8') as testFile:
    for line in testFile.read().split('\n'):
        testCorpus.append(line)

def process_testCorpus(testCorpus):
    all_results = []
    for each_FAQ in testCorpus:
        all_results.append(process_query(each_FAQ))
    return all_results

all_results = process_testCorpus(testCorpus)  

In [54]:
qa= []
inds = []

for i,query in enumerate(testCorpus):
    try:
        ans_inds = int(all_results[i])
        a = all_docs[ans_inds]
    except ValueError:
        ans_inds = 9999
        a = all_results[i]
    qa.append(query + ' MATCHING QUERY :  ' + a)
    inds.append(ans_inds)

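# Write each query with its matched document; the with block closes (and flushes) the file
with io.open(r'./TestCorpus/Results_Indirect.txt', 'w', encoding='UTF-8') as resultFile:
    for line in qa:
        resultFile.write("%s\n" % line)

i = 0
score = 0.
with io.open(r'./TestCorpus/Answers_Indirect.txt', 'r', encoding='latin-1') as answerFile:
    for line in answerFile:
        # Parse the whitespace-separated integer ids on this line
        line = (line.strip('\n')).encode('ascii', 'ignore')
        line = line.split()
        line = list(map(int, line))
        # The predicted tag is shifted by 95 before it is compared with the answer ids
        if (inds[i] - 95) in line:
            score = score + 1
        i = i + 1

accuracy = score / len(inds)
print("For %d Indirect Questions, accuracy  : %f" % (len(inds), accuracy))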

For 21 Indirect Questions, accuracy  : 0.238095
