# Information Retrieval from Domain Corpus


## Prerequisites

* Python 2.7
* scikit-learn
* NLTK (stop word list, tokenizer, and Porter stemmer)


## Objective

The goal is to build an information retrieval system that, given a user query, returns the most relevant question in its database. The current model handles only simple queries; compound queries will be addressed in a later version.


## Solution Approach

We selected banking as the domain. A bank regulation manual (http://www.bsp.gov.ph/downloads/regulations/morb/morb1.pdf) served as the reference document for creating the training data. General FAQs were taken from ICICI Bank (http://www.icicibank.com/Personal-Banking/faq/index.page). To make the FAQs generally applicable rather than bank-specific (in our case, ICICI-specific), all mentions of 'ICICI bank' were removed from the corpus and the answers were modified accordingly.


The steps involved in building the model were as follows:

### Training Data

* Build the corpus from the available banking text files
* Extract the category name (here: Banking) from the file path
* Stem the words in the corpus and remove stop words
* Vectorize the corpus using TF-IDF
* Apply SVD to the TF-IDF matrix for dimensionality reduction
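
As a rough illustration of the TF-IDF vectorization step in the list above, the sketch below computes the weights for a toy, already tokenized corpus in plain Python. It assumes the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default; the function name `tfidf_vectors` and the toy documents are made up for illustration and are not part of the project code.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, one L2-normalized TF-IDF vector per document)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted(set(t for doc in tokenized_docs for t in doc))
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed IDF (assumed here to mirror scikit-learn's default setting).
    idf = dict((t, math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0) for t in vocabulary)
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        # Scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocabulary, vectors

# Toy corpus of stemmed tokens (illustrative only).
docs = [["atm", "cash", "limit"], ["cash", "deposit"], ["loan", "interest", "rate"]]
vocabulary, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print("%s -> %s" % (doc, [round(w, 3) for w in vec]))
```

Terms that occur in fewer documents (such as `atm`) receive a larger weight than terms shared across documents (such as `cash`).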

### Test Data

* Get the user query as a string
* Transform the user query into a TF-IDF vector after stemming and stop word removal
* Project the query's TF-IDF vector into the reduced-dimension space
      
### Information retrieval
 
* Rank the documents in decreasing order of query-document cosine similarity.
* If the best-ranked document belongs to the FAQ corpus (rather than the general corpus) and its cosine similarity is greater than 0.1, return the corresponding FAQ question ID.
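
The retrieval rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `best_faq_match`, the `faq_ids` mapping, and the toy scores are hypothetical, and the similarity scores are taken as given rather than computed.

```python
def best_faq_match(similarities, faq_ids, threshold=0.1):
    """Return the FAQ id of the top-ranked document, or None.

    similarities: one query-document score per corpus document.
    faq_ids: dict mapping corpus index -> FAQ question id; documents from
             the general corpus are simply absent from this dict.
    """
    if not similarities:
        return None
    # Rank document indices by decreasing similarity.
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    top = ranked[0]
    # Accept only an FAQ document whose score clears the threshold.
    if top in faq_ids and similarities[top] > threshold:
        return faq_ids[top]
    return None

# Toy setup: documents 0-1 are general text, documents 2-3 are FAQs.
faq_ids = {2: "1", 3: "2"}
print(best_faq_match([0.05, 0.20, 0.65, 0.30], faq_ids))  # an FAQ ranks first
print(best_faq_match([0.90, 0.20, 0.65, 0.30], faq_ids))  # a general document ranks first
print(best_faq_match([0.01, 0.02, 0.08, 0.03], faq_ids))  # best score is below the threshold
```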
  
We also tested an alternative model that expanded the query keywords with multiple WordNet synonyms, but the results were similar.
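
For readers unfamiliar with synonym-based query expansion, here is a minimal sketch of the idea. It does not reproduce the WordNet lookup used in the experiment: the synonym table `SYNONYMS` is a hand-written, hypothetical stand-in, and `expand_query` is an illustrative name.

```python
def expand_query(query, synonyms):
    """Append known synonyms of each query word to the query string."""
    words = query.lower().split()
    expanded = list(words)
    for word in words:
        for alt in synonyms.get(word, []):
            if alt not in expanded:
                expanded.append(alt)
    return " ".join(expanded)

# Hypothetical synonym table; a real run would look these up in WordNet.
SYNONYMS = {"withdraw": ["take out"], "maximum": ["highest", "largest"]}
print(expand_query("maximum amount I can withdraw", SYNONYMS))
```

The expanded string would then be vectorized in the same way as an ordinary query.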

## Step by Step Overview

### Loading the corpus

The corpus consists of text from a bank manual (the `Banking_General` folder) and ICICI Bank FAQs (the `Banking_FAQ` folder). Change the paths below to match where these files are stored on your machine.
For now, the category name is extracted from the file path.

In [13]:
import os
os.chdir(r'D:\Desktop\BOT')  # raw string so the backslashes are never read as escapes

import glob
import io
import re

# Load the general banking corpus: one document per text file
corpus = []
list_of_files_1 = glob.glob(r'.\Banking_General\*.txt')
for file_name in list_of_files_1:
    with io.open(file_name, 'r', encoding='latin-1') as FI:
        corpus.append(FI.read())

#Getting the category name from a file path 
file_path1 = (list_of_files_1[0])
locate_end = file_path1.index('_')
locate_start = file_path1.rfind('\\',0,locate_end)
category = file_path1[(locate_start+1):locate_end]

# Load the FAQs (one per line) and map each corpus index to a 1-based FAQ id
list_of_files_2 = glob.glob(r'.\Banking_FAQ\*.txt')
queryID = {}
idx = 0
for file_name in list_of_files_2:
    with io.open(file_name, 'r', encoding='latin-1') as FI:
        faq_text = FI.read()
    for line in faq_text.split('\n'):
        corpus.append(line)
        corpusID = len(corpus) - 1
        idx = idx + 1
        queryID[corpusID] = str(idx)
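
The category slicing in the cell above depends on backslash separators and on the position of the first underscore in the path. A slightly more readable variant is sketched here, under the assumption that the category is always the part of the parent folder name before the first underscore. It uses the standard-library `ntpath` module, which applies Windows path rules on any platform. The helper name `category_from_path` and the file name `morb1.txt` are hypothetical.

```python
import ntpath  # Windows-style path handling, available on every platform

def category_from_path(file_path):
    """Return the text before the first underscore in the parent folder name."""
    folder = ntpath.basename(ntpath.dirname(file_path))
    return folder.split('_')[0]

# 'morb1.txt' is a made-up file name used only for illustration.
print(category_from_path(r'.\Banking_General\morb1.txt'))
```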


### Processing the corpus to generate TFIDF Matrix
An instance of `TfidfVectorizer` was created to process the training corpus. This processing stems the words and removes stop words. The stop word list combines NLTK's English stop words with an additional list that we created (Stoplist.txt). Word n-grams of up to 5 tokens were used in this analysis.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Define the stemming tokenizer
class PorterStemming(object):
    def __init__(self):
        self.stemmer = PorterStemmer()

    def __call__(self, doc):
        # Tokenize the document, then stem every token
        return [self.stemmer.stem(t) for t in word_tokenize(doc)]

# Define the stop word list: NLTK's English list plus our own Stoplist.txt
with open(r'.\stopwords\Stoplist.txt', 'r') as stop_file:
    stop_list = [line.rstrip('\n') for line in stop_file]
extra_stopwords = set(stop_list)
stops = set(stopwords.words('english')) | extra_stopwords


# Define the TF-IDF vectorizer (the corpus is passed to fit_transform, not to the constructor)
tfidfvec = TfidfVectorizer(decode_error='ignore', stop_words=list(stops), ngram_range=(1, 5), tokenizer=PorterStemming())
trainset_idf_vectorizer=tfidfvec.fit_transform(corpus).toarray()
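
To make the effect of `ngram_range=(1,5)` more concrete, the following plain-Python sketch lists the word n-grams produced from a short token sequence. It is an illustration of the idea only, not scikit-learn's internal code; `word_ngrams` and the sample tokens are hypothetical.

```python
def word_ngrams(tokens, min_n=1, max_n=5):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    ngrams = []
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

# Stemmed tokens of a made-up query.
tokens = ["atm", "cash", "withdraw", "limit"]
for gram in word_ngrams(tokens, 1, 3):
    print(gram)
```

Every n-gram becomes its own column in the TF-IDF matrix, so the vocabulary grows quickly as the upper limit increases.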

### Dimensionality reduction using SVD

Singular Value Decomposition (SVD) was performed on the TF-IDF matrix generated from the training corpus to reduce it to its 150 most important components.

In [16]:
%%time

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=150, algorithm='randomized',n_iter=15, random_state=42)
train_lsa = svd.fit_transform(trainset_idf_vectorizer)

Wall time: 58.6 s
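
In matrix terms, the reduction step can be summarized as follows. The symbols are introduced here for illustration and do not appear in the code: $X$ is the $n \times m$ TF-IDF matrix ($n$ documents, $m$ terms), $k = 150$ is `n_components`, $V_k$ holds the top $k$ right singular vectors as columns, and $q$ is the $1 \times m$ TF-IDF row vector of a query.

$$
X \approx U_k \Sigma_k V_k^{\top}, \qquad
X_{\mathrm{lsa}} = X V_k = U_k \Sigma_k, \qquad
q_{\mathrm{lsa}} = q V_k
$$

Here `train_lsa` plays the role of $X_{\mathrm{lsa}}$, and `svd.transform` applies the same projection $V_k$ to a query, so documents and queries end up in the same 150-dimensional space. With `algorithm='randomized'` the factors are computed approximately.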


### Transform user query and extract document similarities

We next define a function that processes a user query and returns the question ID of the most relevant document. The query is transformed with the fitted TF-IDF and SVD instances described above. Query-document similarities are then computed and used to rank the documents. If the most similar document belongs to the FAQ corpus and its similarity is above the threshold (0.1), it is returned as the best match.


In [17]:
from sklearn.metrics.pairwise import linear_kernel

def process_query(query):
    query = [query.lower()]
    test = tfidfvec.transform(query).toarray()
    lsa_test = svd.transform(test)
    # linear_kernel returns the dot product between the query and every document
    cosine_similarities = linear_kernel(lsa_test, train_lsa).flatten()
    # Indices of the two most similar documents, best first
    related_docs_indices = cosine_similarities.argsort()[:-3:-1]
    top_result_index = related_docs_indices[0]
    # Only corpus indices that belong to FAQ lines are keys of queryID
    if (top_result_index in queryID) and (cosine_similarities[top_result_index] > 0.1):
        return queryID[top_result_index]
    else:
        return "Sorry, I don't have an answer for that"
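
One caveat about the similarity computation: `linear_kernel` returns plain dot products, which coincide with cosine similarity only when both vectors have unit length. Rows of the TF-IDF matrix are unit length by default, but after the SVD projection they generally are not. The toy example below, in plain Python with made-up vectors, shows how the two measures can rank documents differently.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

query = [1.0, 0.0]
doc_a = [0.5, 0.0]  # same direction as the query, small norm
doc_b = [2.0, 2.0]  # 45 degrees away from the query, large norm

# The dot product favors doc_b because of its larger norm.
print("dot:    doc_a=%.3f doc_b=%.3f" % (dot(query, doc_a), dot(query, doc_b)))
# Cosine similarity favors doc_a because it points the same way as the query.
print("cosine: doc_a=%.3f doc_b=%.3f" % (cosine(query, doc_a), cosine(query, doc_b)))
```

scikit-learn's `sklearn.metrics.pairwise.cosine_similarity` normalizes both inputs and could be swapped in for `linear_kernel`, although doing so may change the scores and the accuracy figures reported in this document.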


### Testing the module 

In [24]:
query="what is the maximum amount I can withdraw in a day at an ATM"
result = int(process_query(query))
print "The corresponding FAQ id: %d \n" % result
print "The user query was - \n %s \n" % query
# Look up the matched FAQ from the returned id instead of a hard-coded index
print "The matched FAQ is - \n %s " % corpus[95+result]


The corresponding FAQ id: 20 

The user query was - 
 what is the maximum amount I can withdraw in a day at an ATM 

The matched FAQ is - 
 What is the maximum cash I can withdraw at an ATM in a single day ? Yes, banks set limit for cash withdrawal by customers. The cash withdrawal limit for use at the ATM of the issuing bank is set by the bank during the issuance of the card. This limit is displayed at the respective ATM locations.For cash withdrawals at other bank ATMs, banks have decided to maintain a limit of Rs 10,000/- per transaction. This information is displayed at the ATM location 


## Model Evaluation


We now evaluate the model on test corpora by counting the FAQs for which the model returned the right matching query. There are two test corpora: 1) queries that exactly match an FAQ and 2) queries that are semantically similar to an FAQ. We evaluate the model separately on these two corpora.

### Part 1: Exactly matching queries

In [49]:
testCorpus = []
file_name = r'.\TestCorpus\TestFAQs_Direct.txt'
testFile = io.open(file_name,'r',encoding='latin-1').read()
for line in testFile.split('\n'):
     testCorpus.append(line)

def process_testCorpus(testCorpus):
    all_results = []
    for each_FAQ in testCorpus:
        all_results.append(process_query(each_FAQ))
    return all_results

all_results = process_testCorpus(testCorpus)       

In [50]:
qa= []
inds = []

for i,query in enumerate(testCorpus):
    try:
        ans_inds = int(all_results[i])
        a = corpus[95+ans_inds]
    except ValueError:
        ans_inds = 9999
        a = all_results[i]
    qa.append(query + ' MATCHING QUERY :  ' + a)
    inds.append(ans_inds)
    
# Write each query with its matched FAQ; the file is closed when the block ends
with io.open(r'.\TestCorpus\Results_Direct.txt','w',encoding='UTF-8') as resultFile:
    for line in qa:
        resultFile.write("%s\n" % line)
    
    
answerFile = io.open(r'.\TestCorpus\Answers_Direct.txt','r',encoding='latin-1')

i=0
score = 0.
for line in answerFile:
     line = (line.strip('\n')).encode('ascii','ignore')
     line = line.split()
     line= map(int,line)
     if inds[i] in line:
        score = score+1
     i=i+1
accuracy = score/len(inds)
print "For %d Exact Questions, accuracy  : %f" % (len(inds), accuracy)


For 21 Exact Questions, accuracy  : 0.904762
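
The scoring logic in the cell above can be condensed into a small function. The sketch below assumes, as the cell suggests, that each line of the answer file lists one or more acceptable FAQ ids for the corresponding test query. The function name `accuracy` and the sample data are hypothetical.

```python
def accuracy(predicted_ids, accepted_ids_per_query):
    """Fraction of queries whose predicted FAQ id is among the accepted ids."""
    if not predicted_ids:
        return 0.0
    hits = sum(1 for pred, accepted in zip(predicted_ids, accepted_ids_per_query)
               if pred in accepted)
    return hits / float(len(predicted_ids))

# Made-up answer-file lines: each lists the acceptable FAQ ids for one query.
answer_lines = ["20", "3 7", "12"]
accepted = [set(int(tok) for tok in line.split()) for line in answer_lines]
predicted = [20, 7, 9999]  # 9999 marks "no answer", as in the cell above
print("accuracy: %f" % accuracy(predicted, accepted))
```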


### Part 2: Semantically similar queries

We now test the model against user queries that are semantically similar to, but do not exactly match, a query in the corpus.

In [51]:
testCorpus = []
file_name = r'.\TestCorpus\TestFAQs_Indirect.txt'
testFile = io.open(file_name,'r',encoding='latin-1').read()
for line in testFile.split('\n'):
     testCorpus.append(line)

# process_testCorpus is reused from Part 1

all_results = process_testCorpus(testCorpus)       

In [52]:
qa= []
inds = []

for i,query in enumerate(testCorpus):
    try:
        ans_inds = int(all_results[i])
        a = corpus[95+ans_inds]
    except ValueError:
        ans_inds = 9999
        a = all_results[i]
    qa.append(query + ' MATCHING QUERY :  ' + a)
    inds.append(ans_inds)
    
# Write each query with its matched FAQ; the file is closed when the block ends
with io.open(r'.\TestCorpus\Results_Indirect.txt','w',encoding='UTF-8') as resultFile:
    for line in qa:
        resultFile.write("%s\n" % line)

    
answerFile = io.open(r'.\TestCorpus\Answers_Indirect.txt','r',encoding='latin-1')

i=0
score = 0.
for line in answerFile:
     line = (line.strip('\n')).encode('ascii','ignore')
     line = line.split()
     line= map(int,line)
     if inds[i] in line:
        score = score+1
     i=i+1
accuracy = score/len(inds)
print "For %d Indirect Questions, accuracy  : %f" % (len(inds), accuracy)

For 21 Indirect Questions, accuracy  : 0.714286
