# Question Answering using Deep Neural Networks

# Why Machine Learning?

- The traditional information-retrieval approach to question answering works very well, especially when you know exactly what you want.

- The catch is that you have to know exactly what you want, i.e. the best keywords.

- This is not always easy, and it places a heavier burden on the user.

- Machine learning methods can overcome this problem to some extent, either through feature engineering (which can take more aspects into account than word matching alone) or through state-of-the-art deep learning models (which can automatically learn the relationship from the dataset).

# How?

- Traditional machine learning models were tested with manually extracted features, which mainly capture the similarity between the question and the answer.

- A large number of papers on QA systems were reviewed. Most of them still rely on the idea of matching the answer to the question.

- Finally, a deep siamese network model proposed by the CS&AI Lab of MIT was applied.
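
The hand-crafted similarity features mentioned above can be illustrated with a minimal sketch. The two features shown here (token-set Jaccard overlap and bag-of-words cosine similarity) and all helper names are assumptions chosen for illustration; the notebook does not list which features were actually used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_overlap(question, answer):
    """Share of distinct tokens that the two texts have in common."""
    q, a = set(tokenize(question)), set(tokenize(answer))
    if not q and not a:
        return 0.0
    return len(q & a) / len(q | a)


def cosine_similarity(question, answer):
    """Cosine of the angle between the two bag-of-words count vectors."""
    q, a = Counter(tokenize(question)), Counter(tokenize(answer))
    dot = sum(q[t] * a[t] for t in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    return dot / (norm_q * norm_a) if norm_q and norm_a else 0.0


def similarity_features(question, answer):
    """Feature vector that a classical model could consume."""
    return [jaccard_overlap(question, answer), cosine_similarity(question, answer)]
```

Features like these only reward shared surface tokens, which is the limitation the siamese model is meant to address.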

# The result


- Uses similarity between concepts instead of searching for specific terms
- Around 70% accuracy in distinguishing right answers from the answer set
- Can take a full sentence as the input question and returns the information as a ranked list
- Overcomes the shortcomings of Elasticsearch, though it is not as precise
- Can therefore be used alongside it as a supplement
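
Returning a ranked list, as described above, amounts to sorting the candidate answers by the distance the model assigns to each (question, candidate) pair. The sketch below assumes the distances are already computed and lie in [0, 1]; `rank_candidates` and its arguments are hypothetical names, not part of the project code.

```python
def rank_candidates(candidates, distances, top_k=5):
    """Return the top_k candidates, closest (smallest distance) first.

    Each result is a (candidate, confidence) pair, where confidence is
    taken to be 1 - distance (an assumption that needs distances in [0, 1]).
    """
    if len(candidates) != len(distances):
        raise ValueError("candidates and distances must have the same length")
    order = sorted(range(len(candidates)), key=lambda i: distances[i])
    return [(candidates[i], 1.0 - distances[i]) for i in order[:top_k]]


# Made-up candidates and distances, for illustration only
ranked = rank_candidates(
    ["policy A", "policy B", "policy C"],
    [0.41, 0.39, 0.75],
    top_k=2,
)
```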

In [1]:
import tensorflow as tf
import numpy as np
import os
import time
import datetime
from tensorflow.contrib import learn
from input_helpers import InputHelper
import pandas as pd
from preprocess import MyVocabularyProcessor
inpH = InputHelper()
x1_test,x2_test,y_test = inpH.getTestDataSet("validation.txt0", "runs/100_epochs_demo/checkpoints/vocab", 30)
checkpoint_file = "runs/100_epochs_demo/checkpoints/model-3000"
graph = tf.Graph()

Loading testing/labelled data from validation.txt0
Instructions for updating:
Please use tensorflow/transform or tf.data.
76


# Read the original QA file

In [2]:
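import gc

# Load the QA pairs and keep only the positive (label == 1) rows
test = pd.read_csv("qapair.tsv", delimiter="\t", header=None)
test = test[test[2] == 1]

def dumpTest(x1_text, x2_text, y):
    # Write one "label<TAB>text1<TAB>text2" line per pair
    with open('test.txt', 'w') as f:
        for text1, text2, label in zip(x1_text, x2_text, y):
            f.write(str(label) + "\t" + str(text1) + "\t" + str(text2) + "\n")

# Pass Series (test[0]) rather than one-column DataFrames (test[[0]]):
# iterating over a DataFrame yields its column labels, not its rows
dumpTest(test[0], test[1], test[2])

# Shown for reference: this appears to mirror the InputHelper.getTestDataSet
# method called in the first cell (note the `self` parameter)
def getTestDataSet(self, data_path, vocab_path, max_document_length):
    x1_temp, x2_temp, y = self.getTsvTestData(data_path)
    # Restore the vocabulary saved at training time and encode both sides
    vocab_processor = MyVocabularyProcessor(max_document_length, min_frequency=0)
    vocab_processor = vocab_processor.restore(vocab_path)
    x1 = np.asarray(list(vocab_processor.transform(x1_temp)))
    x2 = np.asarray(list(vocab_processor.transform(x2_temp)))
    del vocab_processor
    gc.collect()
    return x1, x2, y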

# A new question

In [3]:
question = "Do you keep a record of all data disposals?"
#question = "Do you have a process to notify RBS of changes to data extraction methodology in advance?"
#question = "Vendor must have and maintain a documented change control process for all changes to the IT production environment where Deutsche Bank information is processed or stored.  Changes to the configuration of IT assets in production environments must be actively managed and documented throughout their lifecycle from initiation to verification and closure."
#question = "Does your company have a defined process to ensure separation of duties between personnel assigned to the development/test environments and those assigned to the production environment?"
#question = "Do you need an ID to enter the building?"

test[3] = question.lower()
vocab_processor = MyVocabularyProcessor(30,min_frequency=0)
vocab_processor = vocab_processor.restore("runs/100_epochs_demo/checkpoints/vocab")
x1 = np.asarray(list(vocab_processor.transform(np.asarray(test[3])))  )
x2 = np.asarray(list(vocab_processor.transform(np.asarray([x.lower() for x in test[1]])))  )
y = test[2]
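
The `transform` calls above turn each text into a fixed-length sequence of integer ids that the network can consume. The internals of `MyVocabularyProcessor` are not shown in this notebook, so the following is only a sketch of the general idea. It assumes a hypothetical word-level vocabulary in which id 0 is reserved for padding and unknown tokens; `build_vocab` and `encode` are illustrative names.

```python
def build_vocab(texts):
    """Assign ids starting at 1 to tokens in order of first appearance; 0 is reserved."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode(text, vocab, max_length):
    """Map tokens to ids, truncate to max_length, and right-pad with zeros."""
    ids = [vocab.get(token, 0) for token in text.lower().split()][:max_length]
    return ids + [0] * (max_length - len(ids))


vocab = build_vocab(["do you keep a record", "we keep a log"])
# "diary" is not in the vocabulary, so it maps to the reserved id 0
encoded = encode("Do you keep a diary", vocab, max_length=8)
```

Restoring the vocabulary saved at training time, as the cell above does, matters because the ids must match the ones the model was trained on.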

# Predict using the model

In [4]:
with graph.as_default():
    session_conf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
    sess = tf.Session(config=session_conf)
    with sess.as_default():
        # Rebuild the graph from the checkpoint's meta file and restore the trained weights
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        # Deprecated call; tf.global_variables_initializer() is the replacement
        sess.run(tf.initialize_all_variables())
        saver.restore(sess, checkpoint_file)
        # Look up the placeholders and output tensors by name
        input_x1 = graph.get_operation_by_name("input_x1").outputs[0]
        input_x2 = graph.get_operation_by_name("input_x2").outputs[0]
        input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        predictions = graph.get_operation_by_name("output/distance").outputs[0]
        accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0]
        sim = graph.get_operation_by_name("accuracy/temp_sim").outputs[0]
        # One pass over all (question, candidate, label) triples in batches of 128
        batches = inpH.batch_iter(list(zip(x1, x2, y)), 2*64, 1, shuffle=False)
        all_predictions = []
        all_d = []
        for db in batches:
            x1_dev_b, x2_dev_b, y_dev_b = zip(*db)
            batch_predictions, batch_acc, batch_sim = sess.run(
                [predictions, accuracy, sim],
                {input_x1: x1_dev_b, input_x2: x2_dev_b, input_y: y_dev_b, dropout_keep_prob: 1.0})
            all_predictions = np.concatenate([all_predictions, batch_predictions])
            all_d = np.concatenate([all_d, batch_sim])
        # Compare with the labels of the pairs actually scored (y), not y_test from the first cell
        correct_predictions = float(np.mean(all_d == np.asarray(y)))
# Attach the distance ("score") and the binary decision ("d") to each QA pair
test["score"] = all_predictions
test["d"] = np.asarray(all_d).astype(int)
test.sort_values("score").to_csv("preres.csv")

Instructions for updating:
Use `tf.global_variables_initializer` instead.
INFO:tensorflow:Restoring parameters from runs/100_epochs_demo/checkpoints/model-3000
[[array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([ 8,  2, 10, 17, 17, 16,  9,  7,  6,  8, 14,  4, 19,  8,  3]) 1.0]
 [array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([ 8,  2, 10, 17, 17, 16,  9,  7,  6,  8, 14,  4, 19,  8,  3]) 1.0]
 [array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([ 8, 10, 13, 10,  6,  7,  4, 13,  4,  6, 10,  5,  6, 16, 19]) 1.0]
 ...
 [array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([19, 16, 13,  6,  4,  1,  1,  2, 10,  8,  4, 11,  2,  3,  0]) 1.0]
 [array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([19, 16, 13,  6,  4,  1,  1,  2, 10,  8,  4, 11,  2,  3,  0]) 1.0]
 [array([ 7, 16,  6, 20, 16, 22,  6, 25,  3,  3,  1,  6,  4,  6,  9])
  array([19, 16, 13,  6,  4,  1,  1,  2,
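
The `correct_predictions` computation in the prediction cell compares a binary match/no-match decision per pair with the gold labels. How `accuracy/temp_sim` is derived inside the graph is not shown in this notebook; the sketch below assumes it comes from thresholding the distance at 0.5, which is one common choice, and uses hypothetical function names.

```python
def decide_similar(distances, threshold=0.5):
    """Label a pair 1 (match) when its distance is below the threshold, else 0."""
    return [1 if d < threshold else 0 for d in distances]


def accuracy(decisions, labels):
    """Fraction of pairs whose decision equals the gold label."""
    if len(decisions) != len(labels):
        raise ValueError("decisions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(int(d == y) for d, y in zip(decisions, labels)) / len(labels)


# Made-up distances and labels, for illustration only
decisions = decide_similar([0.39, 0.41, 0.75, 0.62])
acc = accuracy(decisions, [1, 1, 0, 1])
```

Note that when every scored pair carries label 1, as in the filtered `test` frame above, this accuracy reduces to the share of pairs the model accepts as matches.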



In [5]:
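pd.options.display.max_colwidth = 1000
# Keep the 200 closest pairs; .copy() avoids modifying a slice of `test`
df_result = test.sort_values("score").head(200).copy()
# Convert the distance into a confidence-style score (higher is better)
df_result['score_result'] = 1 - df_result['score']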
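
Turning the model's distance into a confidence with `1 - score` presumes that the distance is bounded in [0, 1]. The exact distance used by the network is not shown in this notebook. One formulation with this property, assumed here purely for illustration, divides the Euclidean distance between the two encoded vectors by the sum of their norms.

```python
import math


def normalized_distance(u, v):
    """Euclidean distance between u and v divided by the sum of their norms.

    By the triangle inequality the result always lies in [0, 1]:
    0 for identical vectors, 1 for vectors pointing in opposite directions.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    denom = math.sqrt(sum(a * a for a in u)) + math.sqrt(sum(b * b for b in v))
    return diff / denom if denom else 0.0


def confidence(u, v):
    """Confidence in the sense of the cell above: one minus the distance."""
    return 1.0 - normalized_distance(u, v)
```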


In [6]:
question_ls =  list(df_result[0])
answer_ls =  list(df_result[1])
score_ls = list(df_result['score_result'])
print()
print('QUERY:', '"' ,question, '"' )
print()
print('SEARCH RESULT:')
print()
    
# Show at most the 50 best matches (fewer if the result set is smaller)
for i in range(min(50, len(question_ls))):
    print('')
    print("\033[1;34m\"{}\"\x1b[0m".format(question_ls[i]))  # matched question in bold blue
    print('Answer:', answer_ls[i])
    print('')
    print("\x1b[31m\"{}{}\"\x1b[0m".format('The confidence:', score_ls[i]))  # confidence in red
    print('---------------------------------------------------------------------------------------')
                                                                                                                                


QUERY: " Do you keep a record of all data disposals? "

SEARCH RESULT:


" Is Internet Usage Policy or equivalent in place which restricts user access to external e-mail accounts, webmail, data sharing (peer to peer, collaborative, sharing tools including cloud services, etc.) chat, and messaging, from your network?  Provide a copy of the Policy."
Answer: Please see IT Security and Acceptable Use Policy - Internet Use page 6.  

"The confidence:0.6068438291549683"
---------------------------------------------------------------------------------------

" Do you require all workers (employees, contractors, temps, subcontractors, etc.) with access to Citi information sign a non-disclosure or confidentiality agreement (NDA) as a condition of their employment? "
Answer: All employees and relevant third parties sign confidentiality or non-disclosure agreements as part of their work tenure.

"The confidence:0.6006975173950195"
-------------------------

In [7]:
df_result

| | 0 | 1 | 2 | 3 | score | d | score_result |
|---|---|---|---|---|---|---|---|
| 45 | Is Internet Usage Policy or equivalent in place which restricts user access to external e-mail accounts, webmail, data sharing (peer to peer, collaborative, sharing tools including cloud services, etc.) chat, and messaging, from your network? Provide a copy of the Policy. | Please see IT Security and Acceptable Use Policy - Internet Use page 6. | 1.0 | do you keep a record of all data disposals? | 0.393156 | 1 | 0.606844 |
| 161 | Do you require all workers (employees, contractors, temps, subcontractors, etc.) with access to Citi information sign a non-disclosure or confidentiality agreement (NDA) as a condition of their employment? | All employees and relevant third parties sign confidentiality or non-disclosure agreements as part of their work tenure. | 1.0 | do you keep a record of all data disposals? | 0.399302 | 1 | 0.600698 |
| 2041 | Does your company store media containing clients' information (e.g. printouts, CD's, backup tapes) in a secured environment (i.e. locked room or cabinet with access limited to those who have a business need)? | All tapes are encrypted, and where appropriate tapes are collected weekly and stored by a 3 rd Party at a secure offsite facility. | 1.0 | do you keep a record of all data disposals? | 0.405208 | 1 | 0.594792 |
| 423 | Which statement most accurately represents the retention of Personally Identifiable Information (PII) and Business Confidential Information? (Select all that apply) | Data is only retained the minimal amount of time required to perform the expected services | 1.0 | do you keep a record of all data disposals? | 0.409020 | 1 | 0.590980 |
| 90 | Provide the information classification scheme category of Citi data in your systems. | Data is labelled according to the matter it is associated with and acquires labelling and handling controls through the matter security model in SharePoint. | 1.0 | do you keep a record of all data disposals? | 0.409602 | 1 | 0.590398 |
| 89 | What are your information classification scheme categories (e.g. Restricted, Confidential, Internal and Public)? | Data is labelled according to the matter it is associated with and acquires labelling and handling controls through the matter security model in SharePoint. | 1.0 | do you keep a record of all data disposals? | 0.409602 | 1 | 0.590398 |
| 1051 | Does the information security management policy address Information Classification: Classification of information, and requirements ensuring information is protected based upon it’s classification. | Data is labelled according to the matter it is associated with and acquires labelling and handling controls through the matter security model in SharePoint. | 1.0 | do you keep a record of all data disposals? | 0.409602 | 1 | 0.590398 |
| 88 | Do you have a Policy in place which requires the classification of information? Provide the title of the Policy. | Data is labelled according to the matter it is associated with and acquires labelling and handling controls through the matter security model in SharePoint. | 1.0 | do you keep a record of all data disposals? | 0.409602 | 1 | 0.590398 |
| 34 | Are user held accountable for all activity associated with their login IDs? Do you have Acceptable Use Policy or comparable document in place, to ensure accountability for all activities associated with Login IDs? Provide the title of the Policy. Do employees must acknowledge their acceptance of their accountability for all activities associated with their Login IDs? | All users are held accountable for activity associated with their ID's. This is specified within the Acceptable Use and Confidentiality policies, which are signed on appointment and which apply to all members of a Matter Team. | 1.0 | do you keep a record of all data disposals? | 0.411402 | 1 | 0.588598 |
| 103 | Provide a copy of your Policies, Standards, and Procedures (e.g. Sanitation of Media Devices) related to the destruction and secure disposal of information in the form of paper and/or electronic media. | Please see A8 Asset Ref 4 S Secure Disposal policy (Redacted).pdf | 1.0 | do you keep a record of all data disposals? | 0.412720 | 1 | 0.587280 |
