# Question Answering Workflow

## Overview

The objective of this notebook is to build a workflow that suggests one paper in answer to a specific question from a user. To this end, we employ an extractive question answering approach.

The workflow is separated into four steps:

* Validation of the query
* Search for the closest articles
* Extraction of the answer from the articles
* Recovery of the source article
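
The four steps above can be read as a chain of small functions. The sketch below is illustrative only: every function name and the word-overlap scoring are assumptions made for this example, standing in for the BERT model and the `thm` backend used later in the notebook.

```python
import re


def _words(text):
    """Lower-cased set of alphanumeric words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def validate_query(query):
    """Step 1: normalise whitespace and reject empty queries."""
    cleaned = " ".join(query.split())
    if not cleaned:
        raise ValueError("query must not be empty")
    return cleaned


def search_closest(query, articles, k=2):
    """Step 2: rank article ids by word overlap (toy stand-in for a real search)."""
    q = _words(query)
    ranked = sorted(articles, key=lambda pid: len(q & _words(articles[pid])), reverse=True)
    return ranked[:k]


def extract_answer(query, text):
    """Step 3: return the best-matching sentence and its score (toy stand-in for QA)."""
    q = _words(query)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    best = max(sentences, key=lambda s: len(q & _words(s)))
    return best, len(q & _words(best))


def answer_question(query, articles, k=2):
    """Run all four steps and return (source article id, answer)."""
    query = validate_query(query)
    candidates = search_closest(query, articles, k)
    scored = [(pid, *extract_answer(query, articles[pid])) for pid in candidates]
    # Step 4: the best-scoring answer tells us which article it came from.
    best_id, best_sentence, _ = max(scored, key=lambda item: item[2])
    return best_id, best_sentence


toy_articles = {
    "A": "Graphs can be clustered with convex optimization. Experiments are reported.",
    "B": "Gaussian processes are used for regression. Kernels matter.",
}
paper_id, answer = answer_question("How to cluster graphs?", toy_articles)
```

Keeping each step behind its own function makes it possible to swap the toy scoring for a real model without touching the rest of the chain.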

# Imports & globals

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import transformers
import pandas as pd

import string
import json
import os
import re

import sys
sys.path.append("../backend")

from thm.qa import models, paper_priority
from arxiv_dataset import data_load

In [3]:
# Read the dataset path from the environment, falling back to the default location.
DATA_LOCATION = os.environ.get("DATA_LOCATION", '/home/jovyan/arxiv/arxiv-metadata-oai-snapshot.json')
YEAR_CUTOFF = 2012
ML_CATEGORY = "cs.LG"

In [4]:
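def papers():
    """Yield parsed papers from YEAR_CUTOFF onwards that belong to ML_CATEGORY."""
    with open(DATA_LOCATION, 'r') as f:
        for line in f:
            paper = data_load.parse_paper(line)
            # Skip papers without a year, older papers and other categories.
            if paper['year'] and paper['year'] >= YEAR_CUTOFF and ML_CATEGORY in paper['categories']:
                yield paper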

In [5]:
papers_df = pd.DataFrame(papers())

In [6]:
papers_df.head()

| | id | title | year | authors | categories | abstract |
|---|---|---|---|---|---|---|
| 0 | 705.4485 | Mixed membership stochastic blockmodels | 2014 | Edoardo M Airoldi, David M Blei, Stephen E Fie... | stat.ME,cs.LG,math.ST,physics.soc-ph,stat.ML,s... | Observations consisting of measurements on r... |
| 1 | 808.3231 | Multi-Instance Multi-Label Learning | 2012 | Zhi-Hua Zhou, Min-Ling Zhang, Sheng-Jun Huang,... | cs.LG,cs.AI | In this paper, we propose the MIML (Multi-In... |
| 2 | 811.4413 | A Spectral Algorithm for Learning Hidden Marko... | 2012 | Daniel Hsu, Sham M. Kakade, Tong Zhang | cs.LG,cs.AI | Hidden Markov Models (HMMs) are one of the m... |
| 3 | 903.4817 | An Exponential Lower Bound on the Complexity o... | 2012 | Bernd G\"artner, Martin Jaggi and Cl\'ement Maria | cs.LG,cs.CG,cs.CV,math.OC,stat.ML | For a variety of regularized optimization pr... |
| 4 | 909.5175 | Bounding the Sensitivity of Polynomial Thresho... | 2013 | Prahladh Harsha, Adam Klivans, Raghu Meka | cs.CC,cs.LG | We give the first non-trivial upper bounds o... |


In [7]:
print(*papers_df.iloc[30], sep="\n")

1104.4803
Clustering Partially Observed Graphs via Convex Optimization
2014
Yudong Chen, Ali Jalali, Sujay Sanghavi and Huan Xu
cs.LG,stat.ML
  This paper considers the problem of clustering a partially observed
unweighted graph---i.e., one where for some node pairs we know there is an edge
between them, for some others we know there is no edge, and for the remaining
we do not know whether or not there is an edge. We want to organize the nodes
into disjoint clusters so that there is relatively dense (observed)
connectivity within clusters, and sparse across clusters.
  We take a novel yet natural approach to this problem, by focusing on finding
the clustering that minimizes the number of "disagreements"---i.e., the sum of
the number of (observed) missing edges within clusters, and (observed) present
edges across clusters. Our algorithm uses convex optimization; its basis is a
reduction of disagreement minimization to the problem of recovering an
(unknown) low-rank matrix and an (unknow

In [8]:
contexts = papers_df.apply(
    lambda r: data_load.clean_description(r['title'] + ' ' + r['abstract']), axis=1
).tolist()
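
Judging from the cleaned contexts printed further down (lower-cased, no punctuation, single spaces), `data_load.clean_description` appears to normalise the text before it is fed to the model. Its source is not shown here, so the function below is only a guess at an equivalent, under the assumption that it lower-cases, replaces punctuation with spaces and collapses whitespace. Unlike the real output, this sketch also strips leading and trailing spaces.

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_description_sketch(text):
    """Hypothetical stand-in for data_load.clean_description (not the real code)."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", no_punct).strip()


example = clean_description_sketch("Function-valued traits: (i) they are\n  functions.")
```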

In [9]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

In [10]:
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

In [11]:
question = "How to cluster a partially observed unweighted graph?"

In [79]:
priority_papers = paper_priority.PriorityPapersManager(contexts[:100], papers_df["id"].iloc[:100])

In [14]:
context = priority_papers.merged_context

In [35]:
encoded_inputs = tokenizer(
    question,
    context,
    padding=True,
    truncation="only_second",
    max_length=512,
    stride=100,
    return_token_type_ids=True,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    return_special_tokens_mask=True,
    return_tensors="pt"
)
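
The merged context is far longer than the 512 tokens the model accepts, so `return_overflowing_tokens=True` together with `stride=100` makes the tokenizer split it into overlapping spans, where `stride` is the number of tokens shared by two consecutive spans. The pure-Python sketch below illustrates that windowing on a list of integers. The function name is hypothetical, and it ignores the question and special tokens that also count towards `max_length` in the real call.

```python
def sliding_windows(tokens, window, overlap):
    """Split tokens into windows of at most `window` items sharing `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window reached the end of the sequence
        start += step
    return windows


spans = sliding_windows(list(range(10)), window=4, overlap=2)
```

The overlap keeps an answer that straddles a window boundary fully visible in at least one span, as long as the answer is shorter than the overlap.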

In [36]:
num_spans = len(encoded_inputs["input_ids"])

p_mask = [
    [tok != 1 for tok in encoded_inputs.sequence_ids(span_id)]
    for span_id in range(num_spans)
]
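
`sequence_ids` returns `None` for special and padding tokens, `0` for question tokens and `1` for context tokens, so `p_mask` is `True` wherever a token cannot be part of the answer. The cells shown in this notebook build the mask but never apply it. One plausible use, sketched here on toy values with hypothetical helper names, is to push the logits of masked tokens to minus infinity before taking the argmax.

```python
NEG_INF = float("-inf")


def context_mask(sequence_ids):
    """True where a token must not be selected as part of the answer."""
    return [seq_id != 1 for seq_id in sequence_ids]


def mask_logits(logits, p_mask):
    """Replace the score of every masked token with -inf."""
    return [NEG_INF if masked else score for score, masked in zip(logits, p_mask)]


# Toy span: [CLS] question question [SEP] context context context [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_logits = [9.0, 1.0, 2.0, 0.5, 3.0, 4.0, 1.5, 0.2]

toy_mask = context_mask(toy_sequence_ids)
masked_logits = mask_logits(toy_logits, toy_mask)
best_token = max(range(len(masked_logits)), key=masked_logits.__getitem__)
```

Without the mask the highest score would sit on the first (special) token. With it, the best position falls inside the context.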

In [23]:
encoded_inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [24]:
special_tokens_mask = encoded_inputs["special_tokens_mask"]
offset_mapping = encoded_inputs["offset_mapping"]
overflow_to_sample_mapping = encoded_inputs["overflow_to_sample_mapping"]

del encoded_inputs["special_tokens_mask"]
del encoded_inputs["offset_mapping"]
del encoded_inputs["overflow_to_sample_mapping"]

In [16]:
answer_scores = model(**encoded_inputs)
answer_start_scores, answer_end_scores = answer_scores.start_logits, answer_scores.end_logits

In [33]:
from torch.nn.functional import softmax

confidence_start = torch.max(softmax(answer_start_scores.flatten(), dim=0))
confidence_end = torch.max(softmax(answer_end_scores.flatten(), dim=0))
confidence_score = confidence_start * confidence_end

In [36]:
float(confidence_score)

0.00028420527814887464
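
The confidence above is the product of the largest start probability and the largest end probability, each obtained from a softmax over the flattened scores of all spans. A small worked example in plain Python (toy numbers, not the notebook's logits) suggests why the value tends to be small when the softmax runs over many positions: the probability mass is shared among every token of every span.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_confidence(start_logits, end_logits):
    """Product of the highest start probability and the highest end probability."""
    return max(softmax(start_logits)) * max(softmax(end_logits))


# A sharply peaked distribution over three tokens gives a confidence close to 1.
peaked = span_confidence([0.0, 8.0, 0.0], [0.0, 0.0, 8.0])

# Uniform scores over 1000 tokens give 1/1000 for each side, so about 1e-6 overall.
flat = span_confidence([1.0] * 1000, [1.0] * 1000)
```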

In [17]:
# Most likely start and (exclusive) end of the answer: argmax over the flattened scores
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

answer_ids = encoded_inputs["input_ids"].flatten()[answer_start:answer_end]
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(answer_ids))

print(f"Question: {question}")
print(f"Answer: {answer}\n")

Question: How to cluster a partially observed unweighted graph?
Answer: convex optimization
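
Because the argmax is taken over scores flattened across all overflow spans, the resulting index mixes the span number and the token position. Since every span is padded to the same length, the two can be separated with `divmod`, and the offset mapping then gives character positions in the original context. The helper below is a sketch on toy data with a hypothetical name, assuming `offsets[span][token]` holds `(char_start, char_end)` pairs as in `offset_mapping`.

```python
def flat_to_char_span(flat_start, flat_end, seq_len, offsets):
    """Map flat token indices (end exclusive) to (span_id, char_start, char_end).

    Returns None when start and end fall in different spans or are out of order.
    """
    start_span, start_pos = divmod(flat_start, seq_len)
    end_span, end_pos = divmod(flat_end - 1, seq_len)
    if start_span != end_span or end_pos < start_pos:
        return None
    char_start = offsets[start_span][start_pos][0]
    char_end = offsets[end_span][end_pos][1]
    return start_span, char_start, char_end


toy_context = "alpha bravo delta"
# Two spans of four tokens each: special, word, word, special.
toy_offsets = [
    [(0, 0), (0, 5), (6, 11), (0, 0)],
    [(0, 0), (6, 11), (12, 17), (0, 0)],
]
located = flat_to_char_span(5, 7, seq_len=4, offsets=toy_offsets)
```

Here `located` points at span 1 and at the characters of `"bravo delta"` in `toy_context`. A start in one span paired with an end in another yields `None`, which is a case independent argmaxes do not rule out.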



In [35]:
type(tokenizer)

transformers.models.bert.tokenization_bert_fast.BertTokenizerFast

In [25]:
confidence_start = torch.max(answer_start_scores)
confidence_end = torch.max(answer_end_scores)

print((confidence_start + confidence_end) / 2)
print((confidence_start * confidence_end) ** 0.5)

tensor(3.5788, grad_fn=<DivBackward0>)
tensor(3.5331, grad_fn=<PowBackward0>)


In [None]:
merged_context.find_document(int(answer_start), int(answer_end))

Next, we try to retrieve several candidate answers instead of a single one.

In [None]:
answer_start_scores.shape

In [None]:
top_k_answers_start = torch.topk(answer_start_scores.flatten(), 3).indices
top_k_answers_end = torch.topk(answer_end_scores.flatten(), 3).indices + 1

In [None]:
answers = []
document_idxs = []
for start, end in zip(top_k_answers_start, top_k_answers_end):
    answers.append(tokenizer.convert_ids_to_tokens(encoded_inputs["input_ids"].flatten()[start:end]))
    document_idxs.append(merged_context.find_document(int(start), int(end)))
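
Pairing the k-th best start with the k-th best end, as the `zip` above does, gives no guarantee that the end comes after the start. A common alternative is to score every valid (start, end) pair by the sum of its two logits and keep the best k. The sketch below does this in plain Python on toy logits. The function name and the `max_answer_len` default are assumptions, and the quadratic loop is only meant for illustration.

```python
import heapq


def top_k_spans(start_logits, end_logits, k=3, max_answer_len=15):
    """Return the k best (score, start, end_exclusive) spans with start <= end."""
    candidates = []
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            candidates.append((start_score + end_logits[end], start, end + 1))
    return heapq.nlargest(k, candidates)


toy_start = [0.1, 2.0, 0.3, 0.2]
toy_end = [0.0, 0.5, 3.0, 1.0]
best_spans = [(start, end) for _, start, end in top_k_spans(toy_start, toy_end)]
```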

In [None]:
test = tokenization.qa_pre_tokenize_text([question, question])

In [37]:
from transformers import pipeline

In [38]:
model = pipeline(
    task="question-answering"
)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [112]:
answers = model(
    {
        "question": "How to use gaussian processes?",
        "context": priority_papers.merged_context
    },
    top_k=3
)

In [117]:
for answer in answers:
    start = answer["start"]
    end = answer["end"]
    text = answer["answer"]
    print(answer)
    paper_id = priority_papers.find_paper(start, end)
    print(papers_df.loc[papers_df["id"]==paper_id, "title"])

{'score': 0.8984388113021851, 'start': 63094, 'end': 63119, 'answer': 'kernel density estimation'}
57    Sparse Nonparametric Graphical Models
Name: title, dtype: object
{'score': 0.7347434759140015, 'start': 95415, 'end': 95484, 'answer': 'alternates between clustering and training discriminative classifiers'}
87    Unsupervised Discovery of Mid-Level Discriminative Patches
Name: title, dtype: object
{'score': 0.6878352165222168, 'start': 95415, 'end': 95484, 'answer': 'alternates between clustering and training discriminative classifiers'}
87    Unsupervised Discovery of Mid-Level Discriminative Patches
Name: title, dtype: object


In [115]:
priority_papers.find_paper(start, end, return_paper_idx=True)

('1205.3137', 87)
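
`find_paper` maps character offsets in the merged context back to the paper they came from. The implementation of `PriorityPapersManager` is not shown in this notebook, so the class below is only a guess at how such a lookup could work: it records the start offset of every document while concatenating them and uses a binary search to find the document containing a span. The class name, the single-space separator and the `None` return for spans crossing a boundary are all assumptions.

```python
from bisect import bisect_right


class MergedContextSketch:
    """Hypothetical stand-in for PriorityPapersManager (not the real code)."""

    def __init__(self, contexts, paper_ids, separator=" "):
        self.paper_ids = list(paper_ids)
        self.starts = []  # start offset of each document in the merged text
        self.ends = []    # exclusive end offset of each document
        position = 0
        for text in contexts:
            self.starts.append(position)
            self.ends.append(position + len(text))
            position += len(text) + len(separator)
        self.merged_context = separator.join(contexts)

    def find_paper(self, start, end):
        """Return (paper_id, index) for the span, or None if it crosses documents."""
        idx = bisect_right(self.starts, start) - 1
        if idx < 0 or end > self.ends[idx]:
            return None
        return self.paper_ids[idx], idx


manager = MergedContextSketch(["aaa bbb", "ccc ddd eee"], ["1", "2"])
span_start = manager.merged_context.find("ddd")
found = manager.find_paper(span_start, span_start + 3)
```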

In [111]:
contexts[6]

' evolutionary inference for function valued traits gaussian process regression on phylogenies biological data objects often have both of the following features i they are functions rather than single numbers or vectors and ii they are correlated due to phylogenetic relationships in this paper we give a flexible statistical model for such data by combining assumptions from phylogenetics with gaussian processes we describe its use as a nonparametric bayesian prior distribution both for prediction placing posterior distributions on ancestral functions and model selection comparing rates of evolution across a phylogeny or identifying the most likely phylogenies consistent with the observed data our work is integrative extending the popular phylogenetic brownian motion and ornstein uhlenbeck models to functional data and bayesian inference and extending gaussian process regression to phylogenies we provide a brief illustration of the application of our method '

In [118]:
contexts[57]

' sparse nonparametric graphical models we present some nonparametric methods for graphical modeling in the discrete case where the data are binary or drawn from a finite alphabet markov random fields are already essentially nonparametric since the cliques can take only a finite number of values continuous data are different the gaussian graphical model is the standard parametric model for continuous data but it makes distributional assumptions that are often unrealistic we discuss two approaches to building more flexible graphical models one allows arbitrary graphs and a nonparametric extension of the gaussian the other uses kernel density estimation and restricts the graphs to trees and forests examples of both methods are presented we also discuss possible future research directions for nonparametric graphical modeling '

In [5]:
import sys 
sys.path.append("../backend")

from thm.search_index import SearchIndex
from thm.config.settings import get_settings
import redis.asyncio as redis

config = get_settings()
# search_index = SearchIndex()
# redis_client = redis.from_url(config.get_redis_url)

# # Create query
# query = search_index.vector_query(
#     similarity_request.categories,
#     similarity_request.years,
#     similarity_request.search_type,
#     similarity_request.number_of_results,
# )
# count_query = search_index.count_query(
#     years=similarity_request.years, categories=similarity_request.categories
# )

# # obtain results of the queries
# total, results = await asyncio.gather(
#     redis_client.ft(config.index_name).search(count_query),
#     redis_client.ft(config.index_name).search(
#         query,
#         query_params={
#             "vec_param": embeddings.make(similarity_request.user_text).tobytes()
#         },
#     ),
# )

ValidationError: 5 validation errors for Settings
redis_host
  field required (type=value_error.missing)
redis_port
  field required (type=value_error.missing)
redis_db
  field required (type=value_error.missing)
redis_password
  field required (type=value_error.missing)
data_location
  field required (type=value_error.missing)

In [6]:
get_settings()

ValidationError: 5 validation errors for Settings
redis_host
  field required (type=value_error.missing)
redis_port
  field required (type=value_error.missing)
redis_db
  field required (type=value_error.missing)
redis_password
  field required (type=value_error.missing)
data_location
  field required (type=value_error.missing)
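
The error lists five settings without a value: `redis_host`, `redis_port`, `redis_db`, `redis_password` and `data_location`. The message format resembles a pydantic settings class, which would normally read such fields from environment variables, but `thm.config.settings` is not shown here, so that is an assumption. Under the further assumption that the variables are the upper-cased field names, a small check like the following could report what is missing before `get_settings()` is called.

```python
import os

# Field names taken from the ValidationError above.
REQUIRED_FIELDS = ("redis_host", "redis_port", "redis_db", "redis_password", "data_location")


def missing_settings(environ=None):
    """Return the required fields that have no non-empty environment variable.

    Assumes each field is read from an upper-cased variable, e.g. REDIS_HOST.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_FIELDS if not environ.get(name.upper())]


# Example with an explicit mapping instead of the real environment.
partial_env = {"REDIS_HOST": "localhost", "REDIS_PORT": "6379"}
still_missing = missing_settings(partial_env)
```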