# Question Answering using Deep Neural Networks

# Why Machine Learning?

- Traditional information retrieval works very well for question answering, especially when you know exactly what you are looking for.

- The drawback is that you need to know exactly what you want, i.e. the best keywords.

- This is not always easy, and it places a higher demand on the user.

- Machine learning methods can overcome this problem to some extent, either through feature engineering (which can take more aspects into account than word matching alone) or through state-of-the-art deep learning models (which can automatically learn the relationship from the dataset).
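As a rough illustration of the feature-engineering route, the sketch below computes a few hand-crafted similarity features for a question/answer pair. The function names, the choice of features, and the example answer text are assumptions made for illustration; they are not the features used in this project.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def similarity_features(question, answer):
    """Return a few hand-crafted question/answer similarity features (illustrative only)."""
    q_tokens, a_tokens = set(tokenize(question)), set(tokenize(answer))
    shared = q_tokens & a_tokens
    union = q_tokens | a_tokens
    return {
        # Share of distinct words that appear in both texts.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Share of the question's words that the answer also contains.
        "question_coverage": len(shared) / len(q_tokens) if q_tokens else 0.0,
        # Relative vocabulary size of question versus answer.
        "length_ratio": len(q_tokens) / len(a_tokens) if a_tokens else 0.0,
    }

# The answer text here is made up for the example.
features = similarity_features(
    "Do you keep a record of all data disposals?",
    "Yes, a record of every data disposal is kept for audit.",
)
print(features)
```

Features like these could be fed to a conventional classifier. Note that "disposals" and "disposal" do not match here, which is the kind of exact-word limitation that learned representations try to reduce.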

# How?

- Traditional machine learning models were tested with manually extracted features, which mainly capture the similarity between the question and the answer.

- A large number of papers on QA systems were reviewed. Most of them still rely on the idea of matching the answer to the question.

- Finally, a deep siamese network model proposed by MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) was applied.
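The core idea of a siamese network can be sketched in a few lines: both texts pass through the same encoder with shared weights, and the distance between the two resulting vectors measures how dissimilar they are. The sketch below is a toy version under stated assumptions: the sizes, the random untrained weights, the character-to-id mapping, and the averaging encoder are all hypothetical, and a real model would use a trained encoder (for example a recurrent network) instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and randomly initialised (untrained) weights, for illustration only.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, MAX_LEN = 128, 16, 8, 30
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
W = rng.normal(size=(EMBED_DIM, HIDDEN_DIM))

def to_ids(text):
    """Map characters to integer ids, truncated or zero-padded to MAX_LEN."""
    ids = [ord(c) % VOCAB_SIZE for c in text.lower()[:MAX_LEN]]
    return np.array(ids + [0] * (MAX_LEN - len(ids)))

def encode(ids):
    """Shared encoder: both inputs go through the SAME embedding and weights."""
    return np.tanh(embedding[ids].mean(axis=0) @ W)

def distance(text_a, text_b):
    """Euclidean distance divided by the sum of the two norms, so it lies in [0, 1]."""
    a, b = encode(to_ids(text_a)), encode(to_ids(text_b))
    return float(np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12))

d_same = distance("Do you keep a record?", "Do you keep a record?")
d_diff = distance("Do you keep a record?", "Do you need an ID to enter the building?")
print(d_same, d_diff)
```

Because the weights are shared, identical inputs always map to a distance of 0. Training would adjust the encoder so that matching question/answer pairs end up close together and non-matching pairs far apart.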

# The result


- Uses similarity between concepts instead of searching for specific terms
- Around 70% accuracy in distinguishing right answers from the answer set
- Can take a full sentence as the input question and returns the information as a ranked list
- Overcomes the shortcomings of Elasticsearch, though it is not as precise
- Can therefore be used together with it as a supplement
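The ranked list mentioned above can be sketched as follows: given one distance per candidate answer, sort ascending and report `1 - distance` as a confidence, which mirrors the `score_result` column computed later in this notebook. The function name `rank_answers`, the placeholder answers, and the distances are made up for illustration.

```python
def rank_answers(answers, distances, top_k=3):
    """Return (answer, confidence) pairs for the top_k closest answers, most similar first."""
    scored = sorted(zip(distances, answers), key=lambda pair: pair[0])
    return [(answer, 1.0 - dist) for dist, answer in scored[:top_k]]

# Made-up distances for illustration only.
result = rank_answers(
    ["answer A", "answer B", "answer C", "answer D"],
    [0.42, 0.17, 0.80, 0.25],
    top_k=3,
)
for answer, confidence in result:
    print(answer, round(confidence, 2))
```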

In [1]:
import tensorflow as tf
import numpy as np
import os
import time
import datetime
from tensorflow.contrib import learn
from input_helpers import InputHelper
import pandas as pd
from preprocess import MyVocabularyProcessor
inpH = InputHelper()
x1_test,x2_test,y_test = inpH.getTestDataSet("validation.txt0", "runs/100_epochs_merged_demo/checkpoints/vocab", 30)
checkpoint_file = "runs/100_epochs_merged_demo/checkpoints/model-4000"
graph = tf.Graph()

Loading testing/labelled data from validation.txt0
Instructions for updating:
Please use tensorflow/transform or tf.data.
76


# Read the original QA file

In [2]:
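import gc

# Load the tab-separated QA file: column 0 = question, 1 = answer, 2 = label.
test = pd.read_csv("qa_merged.tsv", delimiter="\t", header=None)
# Keep only the pairs labelled as correct (label == 1).
test = test[test[2] == 1]

def dumpTest(x1_text, x2_text, y):
    """Write one tab-separated line per pair: label, first text, second text."""
    with open('test.txt', 'w') as f:
        for text1, text2, label in zip(x1_text, x2_text, y):
            f.write(str(label) + "\t" + str(text1) + "\t" + str(text2) + "\n")

# Pass Series (test[0]) rather than one-column DataFrames (test[[0]]):
# iterating over a DataFrame yields its column names, not its rows.
dumpTest(test[0], test[1], test[2])

# For reference: the InputHelper method called in the first cell.
def getTestDataSet(self, data_path, vocab_path, max_document_length):
    x1_temp, x2_temp, y = self.getTsvTestData(data_path)
    # Restore the vocabulary saved at training time so ids match the model.
    vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0)
    vocab_processor = vocab_processor.restore(vocab_path)
    x1 = np.asarray(list(vocab_processor.transform(x1_temp)))
    x2 = np.asarray(list(vocab_processor.transform(x2_temp)))
    del vocab_processor
    gc.collect()
    return x1, x2, y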

# A new question

In [3]:
#question = "Do you keep a record of all data disposals?"
question = "Do you have a process to notify RBS of changes to data extraction methodology in advance?"
#question = "Vendor must have and maintain a documented change control process for all changes to the IT production environment where Deutsche Bank information is processed or stored.  Changes to the configuration of IT assets in production environments must be actively managed and documented throughout their lifecycle from initiation to verification and closure."
#question = "Does your company have a defined process to ensure separation of duties between personnel assigned to the development/test environments and those assigned to the production environment?"
#question = "Do you need an ID to enter the building?"

test[3] = question.lower()
vocab_processor = MyVocabularyProcessor(30,min_frequency=0)
vocab_processor = vocab_processor.restore("runs/100_epochs_merged_demo/checkpoints/vocab")
x1 = np.asarray(list(vocab_processor.transform(np.asarray(test[3])))  )
x2 = np.asarray(list(vocab_processor.transform(np.asarray([x.lower() for x in test[1]])))  )
y = test[2]

# Predict using the model

In [4]:
with graph.as_default():
    session_conf = tf.ConfigProto(allow_soft_placement=True,log_device_placement=False)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        sess.run(tf.initialize_all_variables())
        saver.restore(sess, checkpoint_file)
        input_x1 = graph.get_operation_by_name("input_x1").outputs[0]
        input_x2 = graph.get_operation_by_name("input_x2").outputs[0]
        input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        predictions = graph.get_operation_by_name("output/distance").outputs[0]
        accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0]
        sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0]
        batches = inpH.batch_iter(list(zip(x1,x2,y)), 2*64, 1, shuffle=False)
        all_predictions = []
        all_d=[]
        for db in batches:
            x1_dev_b,x2_dev_b,y_dev_b = zip(*db)
            batch_predictions, batch_acc, batch_sim = sess.run([predictions,accuracy,sim], {input_x1: x1_dev_b, input_x2: x2_dev_b, input_y:y_dev_b, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
            all_d = np.concatenate([all_d, batch_sim])
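        # Compare against the labels of the rows that were scored (y), not y_test,
        # which comes from a different file and has a different length.
        correct_predictions = float(np.mean(all_d == np.asarray(y)))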
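# Attach the predicted distance ("score") and the model's similarity output ("d") to each row.
test["score"] = all_predictions
test["d"] = np.asarray(all_d).astype(int)
# A smaller distance means a closer match, so ascending order puts the best matches first.
test.sort_values("score").to_csv("preres.csv")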

Instructions for updating:
Use `tf.global_variables_initializer` instead.
INFO:tensorflow:Restoring parameters from runs/100_epochs_merged_demo/checkpoints/model-4000
[[array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([ 1,  2,  3,  4,  5,  6,  4,  1,  7,  4,  6,  8,  9, 10, 11]) 1.0]
 [array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([ 1,  2,  3,  4,  5,  6,  4,  1,  2,  9, 16, 26, 10,  7,  4]) 1.0]
 [array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([ 1,  5,  9,  4,  1, 13, 14, 10,  6,  1, 16, 17, 17, 10,  8]) 1.0]
 ...
 [array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([17, 16,  9,  1, 22,  6,  4,  9,  1,  5, 22, 13, 14,  4, 19]) 1.0]
 [array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([ 7, 16,  4,  6,  1, 20, 16, 22,  9,  1,  8, 16, 23,  2,  5]) 1.0]
 [array([ 7, 16,  1, 20, 16, 22,  1, 14,  5, 26,  4,  1,  5,  1,  2])
  array([ 7, 16,  4,  6,  1, 14, 



In [5]:
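pd.options.display.max_colwidth = 1000
# Take the 200 closest matches; copy() so the new column is not written to a view of `test`.
df_result = test.sort_values("score").head(200).copy()
# Turn the distance into a confidence: a smaller distance gives a higher confidence.
df_result['score_result'] = 1 - df_result['score']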


In [6]:
question_ls =  list(df_result[0])
answer_ls =  list(df_result[1])
score_ls = list(df_result['score_result'])
print()
print('QUERY:', '"' ,question, '"' )
print()
print('SEARCH RESULT:')
print()
    
for i in range (0,50):
    print('')
    print("\033[1;34m\"{}\"\x1b[0m".format(question_ls[i]))
    print('Answer:', answer_ls[i])
    print('')
    print("\x1b[31m\"{}{}\"\x1b[0m".format('The confidence:',score_ls[i]))
    print('---------------------------------------------------------------------------------------')
                                                                                                                                


QUERY: " Do you have a process to notify RBS of changes to data extraction methodology in advance? "

SEARCH RESULT:


"Do you have a process for granting and documenting access, including access for subcontractors and remote access? List the person(s)/group(s) responsible for granting access. Please describe the process, including any tools utilized."
Answer: Do you have a process for granting and documenting access, including access for subcontractors and remote access? List the person(s)/group(s) responsible for granting access. Please describe the process, including any tools utilized. Yes Access rights are granted based on standard access templates in line with business roles, and are augmented by specific additional rights on an as needed basis with data owner approval. The IT Service Access Management (SAM) team processes include specific tasks and checks to be carried out for a role and access rights where appropriate, these include Joiners, Movers and Leavers as we

In [7]:
df_result

Unnamed: 0,0,1,2,3,score,d,score_result
730,"Do you have a process for granting and documenting access, including access for subcontractors and remote access? List the person(s)/group(s) responsible for granting access. Please describe the process, including any tools utilized.","Do you have a process for granting and documenting access, including access for subcontractors and remote access? List the person(s)/group(s) responsible for granting access. Please describe the process, including any tools utilized. Yes Access rights are granted based on standard access templates in line with business roles, and are augmented by specific additional rights on an as needed basis with data owner approval. The IT Service Access Management (SAM) team processes include specific tasks and checks to be carried out for a role and access rights where appropriate, these include Joiners, Movers and Leavers as well as remote access. The SAM team use a variety of tools including but not limited to Active Driectory and Remote Access Token Services.",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
1611,Do you have a policy in place to enforce encryption on external storage devices?,Do you have a policy in place to enforce encryption on external storage devices? Fully Implemented Encryption policy in place covering key management. Clifford Chance uses robust encryption protocols for storing and transferring data both physically and logically. Encryption keys are limited to identified personnel and commensurate with their role.,1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
638,"Do you have a process to restrict production data from being used in development and testing environments? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process to restrict production data from being used in development and testing environments? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
639,"Do you have a process to perform peer / independent code reviews for each release? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process to perform peer / independent code reviews for each release? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable We do peer assessment on change control.",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
641,"Do you have a process to perform vulnerability scans and penetration testing for all software releases? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process to perform vulnerability scans and penetration testing for all software releases? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
647,"Do you have a process for the maintenance of segregation of duties between the development, testing and production environments? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process for the maintenance of segregation of duties between the development, testing and production environments? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
648,"Do you have a process to log and monitor access to the program source code on a periodic basis? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process to log and monitor access to the program source code on a periodic basis? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
659,"Do you have a process to remove information prior to decommissioning equipment that housed, stored, processed, controlled or accessed client information? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment.","Do you have a process to remove information prior to decommissioning equipment that housed, stored, processed, controlled or accessed client information? If yes, please upload supporting documentation and/or provide overview. If no, please provide comment. Not Applicable",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
671,"Do you have a process in place to track and record errors/losses, on a timely basis, as they are discovered? Do you analyze these events to understand their root causes and then address and escalate to management? Provide evidence of regulatory and/or operational errors resulting in financial loss, if applicable.","Do you have a process in place to track and record errors/losses, on a timely basis, as they are discovered? Do you analyze these events to understand their root causes and then address and escalate to management? Provide evidence of regulatory and/or operational errors resulting in financial loss, if applicable. N/A N/A",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
672,"Do you have a process to remediate/take appropriate actions and follow up and report on significant deficiencies that were identified through self-, internal-, external- audits and SOX reviews? Provide your most recent internal and external audit remediation items.","Do you have a process to remediate/take appropriate actions and follow up and report on significant deficiencies that were identified through self-, internal-, external- audits and SOX reviews? Provide your most recent internal and external audit remediation items. N/A N/A",1.0,do you have a process to notify rbs of changes to data extraction methodology in advance?,0.170304,1,0.829696
