# General
-----------------------------------------------------------------------

In this notebook, we briefly demonstrate how to boost the performance of a deep-learning-based QA system with our approach to adaptive information retrieval. We use the publicly available [DrQA](https://github.com/facebookresearch/DrQA) implementation as a showcase. Nonetheless, our results generalize to any deep QA system.

For the motivation and theoretical background, please refer to our paper. 

If you find this helpful, please consider citing us:

```
@inproceedings{kratzwald2018adaptive,
  title={Adaptive Document Retrieval for Deep Question Answering},
  author={Kratzwald, Bernhard and Feuerriegel, Stefan},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}
```

# 1. Generate training data

Before training the logistic regression model, we need to collect training data. For every query in a given dataset (we used SQuAD-v1.1-train in our paper), we write the position of the first document that contains the ground-truth answer, followed by the (possibly normalized) confidence scores of each of the top-n documents, to a CSV file as follows:
`pos, score-top-1, score-top-2, ... , score-top-n`

It is better to choose a large n here, since the model learns the cut-off between 0 and n. As a rule of thumb, we choose n = 25 for document-based information retrieval and n = 250 for paragraph-based information retrieval. If no answer is found within the first n documents, the modified `get_score` method in step II writes -1 for pos, and we set pos to n later.
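To make the construction of a single training example concrete, here is a minimal sketch that normalizes the scores of one query and determines `pos`. The helper `make_row` and the boolean list `contains_answer` are hypothetical and not part of DrQA; they merely stand in for the retriever output and the answer matching.

```python
import numpy as np

def make_row(scores, contains_answer, normalize=True):
    """Return (scores, pos) for one query (hypothetical helper).

    scores: retrieval scores of the top-n documents, best first.
    contains_answer: one boolean per document (assumed input).
    pos is the 1-based rank of the first document that contains the
    answer, or n if none of the top-n documents contains it.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    if normalize and scores.sum() > 0:
        scores = scores / scores.sum()
    pos = n  # fallback when no document contains the answer
    for rank, hit in enumerate(contains_answer, start=1):
        if hit:
            pos = rank
            break
    return scores, pos

row_scores, row_pos = make_row([4.0, 3.0, 2.0, 1.0], [False, True, False, True])
print(row_pos, row_scores)
```

Here the second document is the first one that contains the answer, so `pos` is 2 and the normalized scores sum to 1.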

To generate training data for the DrQA system, follow these four steps:

I) (Optional) Normalize the scores of the IR module.
At the end of the method `closest_docs(self, query, k=1)` in `drqa/retriever/tf-idf-ranker.py`, add the following line of code:

In [None]:
doc_scores = doc_scores/np.sum(doc_scores)

II) Replace the method `get_score` in the file `script/retriever/eval.py` with the method below:

In [None]:
def get_score(answer_doc, match):
    """Return the rank of the first top doc that has the answer, plus all scores."""
    answer, (doc_ids, doc_scores) = answer_doc
    # Ranks are 1-based; -1 signals that no top document contains the answer.
    for pos, doc_id in enumerate(doc_ids, start=1):
        if has_answer(answer, doc_id, match):
            return (pos, doc_scores)
    return (-1, doc_scores)

III) In the `__main__` block of the same script, replace the lines following `scores = processes.map(get_score_partial, answers_docs)` with:

In [None]:
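# Requires `import numpy as np` at the top of the script (if not already present).
# `training_file` is the path of the CSV file to create; define it beforehand.
with open(training_file, "w") as f:
    for pos, score in scores:
        # array2string yields "[s1,s2,...]"; [1:-1] strips the brackets.
        f.write('{},{}\n'.format(pos,
                                  np.array2string(score, separator=',',
                                                  max_line_width=999999)[1:-1]))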

IV) To generate training data for the SQuAD dataset and the top 25 documents, simply call:

`python script/retriever/eval.py path_to_squad_dataset --top-n 25` 
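Once the file has been written, it needs to be read back for training. The following is a minimal sketch, assuming the column order produced by the snippet in step III (position first, then the scores) and assuming that every row contains exactly n scores. The function name `load_training_data` is hypothetical; the in-memory example stands in for the real file.

```python
import io
import numpy as np

def load_training_data(source, n):
    """Load rows of the form `pos,score_1,...,score_n` (hypothetical helper).

    Returns the feature matrix X (scores) and the target vector y (positions).
    Rows with pos == -1 (no answer among the top-n documents) are mapped to n.
    """
    data = np.loadtxt(source, delimiter=",", ndmin=2)
    y = data[:, 0].astype(int)
    X = data[:, 1:]
    y[y == -1] = n
    return X, y

# Two illustrative rows with n = 4; pass a file path instead for real data.
example = io.StringIO("2,0.4,0.3,0.2,0.1\n-1,0.25,0.25,0.25,0.25\n")
X, y = load_training_data(example, n=4)
print(X.shape, y)
```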

# 2. Train the model 

Before training the model, we recommend splitting the data into training and test sets to get a feeling for the strength of your classifier.
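One possible way to do such a split is sketched below with plain NumPy. The helper `train_test_split_rows`, the 80/20 ratio, and the toy arrays are assumptions for illustration; any other splitting utility works just as well.

```python
import numpy as np

def train_test_split_rows(X, y, test_fraction=0.2, seed=0):
    """Randomly split rows into a training and a test part (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 queries with 2 scores each.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_rows(X_demo, y_demo)
print(X_train.shape, X_test.shape)
```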

You can train the model using PyTorch or TensorFlow. Alternatively, you can use the [mord](https://pythonhosted.org/mord/) package.

In [None]:
import mord
reg = mord.OrdinalRidge(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto')
reg.fit(X_train, y_train)

In our paper, we did not vary alpha from its default of 1. You will see that lower alphas lead to a more spread-out distribution of the cut-off point, while larger alphas narrow the distribution. To visualize this, you can predict and count the cut-off points for your test data:

In [None]:
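import numpy as np

y_pred = reg.predict(X_dev)  # X_dev: features of your held-out (test) split
b = 1  # constant offset added to every predicted cut-off
# Count how often each cut-off point is predicted
print(np.bincount(y_pred.astype(np.int32)+b))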

Now we can save the model to a file, so we can integrate it into the QA system:

In [None]:
import pickle
with open(filename, 'wb') as f:  # the file is closed after writing
    pickle.dump(reg, f)
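Reading the model back works the same way in reverse. The sketch below shows the round trip with a plain dictionary standing in for the fitted `reg` object, so that it runs without mord installed; the file name `cutoff_model.pkl` is an arbitrary choice. Note that unpickling a real mord model requires mord to be importable in the environment that loads it.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (assumption: the real object is `reg`).
stand_in_model = {"alpha": 1.0, "coef": [0.5, 0.3, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "cutoff_model.pkl")
    # Save the model ...
    with open(filename, "wb") as f:
        pickle.dump(stand_in_model, f)
    # ... and load it again, e.g. when the ranker is initialized.
    with open(filename, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```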

# 3. Implement the trained model

To use the trained model in the DrQA pipeline, first load the model in the `__init__` method of `tf-idf-ranker.py` and then alter the `closest_docs` function as follows:

In [None]:
    def closest_docs(self, query, k=1):
        """Closest docs by dot product between query and documents
        in tfidf weighted word vector space.
        """
        spvec = self.text2spvec(query)
        res = spvec * self.doc_mat

        if len(res.data) <= k:
            o_sort = np.argsort(-res.data)
        else:
            o = np.argpartition(-res.data, k)[0:k]
            o_sort = o[np.argsort(-res.data[o])]

        doc_scores = res.data[o_sort]
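        doc_scores = doc_scores/np.sum(doc_scores) # THIS LINE IS ONLY NECESSARY IF YOUR TRAINING DATA USES NORMALIZED SCORES

        # Pad the scores with zeros to the n=25 features the model was trained on
        x = np.zeros([1,25])
        x[0,0:len(doc_scores)]=doc_scores

        # Same constant offset b as used when counting the cut-off points in section 2
        b = 1

        y = self.model.predict(x)[0].astype(np.int32) + b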

        doc_ids = [self.get_doc_id(i) for i in res.indices[o_sort]]
        return doc_ids[0:y], doc_scores[0:y]

If you now run the system with `--top-n 25`, it will automatically predict the cut-off point among the first 25 documents.
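The cut-off logic can also be tried in isolation, without a DrQA installation. The following sketch mirrors the padding, prediction, and slicing steps from `closest_docs`. The class `ConstantCutoffModel` is a hypothetical stand-in for the trained regressor, and the clamping of the cut-off to the range from 1 to the number of retrieved documents is an extra safeguard that is not part of the code above.

```python
import numpy as np

class ConstantCutoffModel:
    """Hypothetical stand-in for the trained model: always predicts `value`."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), float(self.value))

def apply_cutoff(model, doc_ids, doc_scores, n=25, b=1):
    """Keep only the top documents up to the predicted cut-off point."""
    # Pad the scores with zeros to the n features the model expects.
    x = np.zeros((1, n))
    x[0, :len(doc_scores)] = doc_scores
    y = int(model.predict(x)[0]) + b
    # Assumption (not in the original code): always keep at least one
    # document and never more than were retrieved.
    y = max(1, min(y, len(doc_ids)))
    return doc_ids[:y], doc_scores[:y]

ids = ["doc%d" % i for i in range(5)]
scores = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
kept_ids, kept_scores = apply_cutoff(ConstantCutoffModel(2), ids, scores)
print(kept_ids)
```

With a predicted value of 2 and the offset b = 1, the first three documents are kept.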