In this notebook, we implement a question-answering model based on computing the similarity between a question and candidate answers.

# Data collection

In [None]:
import json


The baseline model needs no training as such, since the algorithm works on raw texts directly, so we use only evaluation data (here, the QASPER dev split).

In [None]:
# loading val data
with open('/content/drive/MyDrive/qasper-dev-v0.3.json', 'r', encoding='utf-8') as file:
    qasper_dict = json.load(file)

In [None]:
qasper_dict['1503.00841'] # an example of the data structure

{'title': 'Robustly Leveraging Prior Knowledge in Text Classification',
 'abstract': 'Prior knowledge has been shown very useful to address many natural language processing tasks. Many approaches have been proposed to formalise a variety of knowledge, however, whether the proposed approach is robust or sensitive to the knowledge supplied to the model has rarely been discussed. In this paper, we propose three regularization terms on top of generalized expectation criteria, and conduct extensive experiments to justify the robustness of the proposed methods. Experimental results demonstrate that our proposed methods obtain remarkable improvements and are much more robust than baselines.',
 'full_text': [{'section_name': 'Introduction',
   'paragraphs': ['We posses a wealth of prior knowledge about many natural language processing tasks. For example, in text categorization, we know that words such as NBA, player, and basketball are strong indicators of the sports category BIBREF0 , and wor
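The record above nests paragraphs inside sections, and the `qas` field (used later in this notebook) holds the questions. The following is a minimal sketch of how such a record can be flattened. The `toy_paper` record and the helper names `collect_paragraphs` and `collect_questions` are hypothetical and only mirror the field names visible in this notebook.

```python
# Toy record with made-up values; only the field names follow the dataset.
toy_paper = {
    "title": "Toy paper",
    "abstract": "A short abstract.",
    "full_text": [
        {"section_name": "Introduction",
         "paragraphs": ["First paragraph.", "Second paragraph."]},
        {"section_name": "Method",
         "paragraphs": ["Third paragraph."]},
    ],
    "qas": [
        {"question": "What is proposed?"},
    ],
}


def collect_paragraphs(paper):
    """Flatten the paragraphs of all sections into one list, in document order."""
    return [paragraph
            for section in paper["full_text"]
            for paragraph in section["paragraphs"]]


def collect_questions(paper):
    """Return the question strings attached to one paper."""
    return [qa["question"] for qa in paper["qas"]]


print(collect_paragraphs(toy_paper))
print(collect_questions(toy_paper))
```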

# Text preprocessing

In [None]:
import nltk
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words_en = stopwords.words('english')
stop_words_en

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
# text preprocessing
def preprocess_text(text):
    """Tokenize, lowercase, lemmatize, then drop stop words and tokens of 1-2 characters."""

    tokenizer = RegexpTokenizer(r'[\w\']+') # tokens are runs of word characters and apostrophes,
                                              # so a hyphenated word is split into separate tokens
    lemmatizer = WordNetLemmatizer() # WordNet-based lemmatizer for English

    tokenized_text = tokenizer.tokenize(text) # tokenizing the text

    lemmatized_text = [lemmatizer.lemmatize(word.lower())
                      for word in tokenized_text] # lemmatizing one word at a time (as nouns by default)

    text_without_stopwords = [word for word in lemmatized_text
                              if word not in stop_words_en
                              and len(word) > 2] # filtering out stop words and tokens shorter than 3 characters

    return text_without_stopwords
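To see what the token pattern `[\w']+` does to hyphens and apostrophes, here is a small standard-library sketch. It is not the notebook's pipeline: it skips lemmatization, uses `re.findall` in place of `RegexpTokenizer`, and `regex_tokenize`, `filter_tokens` and the tiny stop-word set are hypothetical.

```python
import re

# Same pattern as in preprocess_text: runs of word characters and apostrophes.
TOKEN_PATTERN = re.compile(r"[\w']+")


def regex_tokenize(text):
    """Split text into tokens; hyphens and punctuation act as separators."""
    return TOKEN_PATTERN.findall(text)


def filter_tokens(tokens, stop_words, min_len=3):
    """Lowercase tokens, then drop stop words and tokens shorter than min_len."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered
            if token not in stop_words and len(token) >= min_len]


tokens = regex_tokenize("State-of-the-art models don't overfit.")
print(tokens)
print(filter_tokens(tokens, stop_words={"of", "the", "don't"}))
```

The hyphenated phrase becomes four tokens, while the contraction stays whole because the apostrophe is part of the pattern.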

# Model building

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

In [None]:
import gensim.downloader as api

In [None]:
api.info()['models'].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

The model works in two stages:

1.   From each passage we extract keywords and compute the mean similarity between these keywords and the words of the question. This selects the part of the text most relevant to the question.
2.   Then we slide windows of 1 to n words over this passage, comparing the vector of the phrase in each window with the question vector, and pick the phrase most similar to the question.

The drawback of this approach is clear: we assume that the answer is semantically similar to the question or contains words from it, which is far from always true.
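Stage 1 depends on per-paragraph keywords. In this notebook `TfidfVectorizer` is fitted on one paragraph at a time, so every term gets the same IDF weight and the ranking reduces to a ranking by term frequency. Below is a minimal standard-library sketch of that idea. `top_keywords` is a hypothetical helper, and its tokenization and tie-breaking may differ from scikit-learn's.

```python
from collections import Counter


def top_keywords(tokens, n_keywords):
    """Return up to n_keywords tokens, most frequent first."""
    counts = Counter(tokens)
    # most_common sorts by count (descending); ties keep first-seen order.
    return [word for word, _ in counts.most_common(n_keywords)]


tokens = ["entity", "linking", "entity", "mention", "linking", "entity"]
print(top_keywords(tokens, 2))
print(top_keywords(tokens, 10))  # asking for more than exist returns all of them
```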



In [None]:
def cosine_similarity(vector1, vector2):
    return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))
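The function above divides by the product of the two norms, so a zero vector (for example, a text whose words are all mapped to zeros) produces `nan` and a NumPy `RuntimeWarning`. One possible guard is sketched below. `safe_cosine_similarity` is a hypothetical name, and returning `0.0` for zero vectors is an assumption, not part of the original notebook.

```python
import numpy as np


def safe_cosine_similarity(vector1, vector2):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    v1 = np.asarray(vector1, dtype=float)
    v2 = np.asarray(vector2, dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0.0:
        return 0.0  # assumption: treat "no information" as "no similarity"
    return float(np.dot(v1, v2) / denom)


print(safe_cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(safe_cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(safe_cosine_similarity([0.0, 0.0], [1.0, 0.0]))
```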

In [152]:
class QASimilarityModel:

    def __init__(self, window_size, n_keywords, model):
        self.window_size = window_size
        self.n_keywords = n_keywords
        self.model = model

    def _extract_key_words(self, text):
        # the idea is to compute each word's importance (TF-IDF value) and take the top n;
        # the vectorizer is fitted on a single paragraph, so all IDF weights are equal
        # and the ranking reduces to term frequency

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_word_features = tfidf_vectorizer.fit_transform([text]).toarray() # computing TF-IDF values

        keywords = sorted(tfidf_vectorizer.vocabulary_,
                          key=lambda x:
                          tfidf_word_features[0, tfidf_vectorizer.vocabulary_[x]],
                          reverse=True) # getting keywords (words with the highest TF-IDF first)

        thresh = min(self.n_keywords, len(keywords)) # handles the case when n exceeds the number of words
        return keywords[:thresh]

    def _count_mean_similarity(self, keywords, question_words):
        simils = []
        for q_word in question_words:

            try:
                q_wordv = self.model[q_word] # getting a vector for a word from the question
            except KeyError:
                continue # skipping question words missing from the embedding vocabulary

            simil = []
            for k_word in keywords: # computing similarities between the keywords and a word from the question

                try:
                    k_wordv = self.model[k_word]
                    simil.append(cosine_similarity(q_wordv, k_wordv))
                except KeyError:
                    simil.append(0.)

            if simil:
                simils.append(sum(simil) / len(simil))

        # 0. if no question word (or no keyword) could be compared
        return sum(simils) / len(simils) if simils else 0.

    def _get_relevant_paragraph(self, paragraphs, question_words):
        relevance_dict = {}
        for parag in paragraphs:

            parag_cleaned = ' '.join(preprocess_text(parag)) # getting words from the paragraph
            parag_keywords = self._extract_key_words(parag_cleaned) # extracting keywords from the paragraph

            # counting the relevance of each paragraph to the question
            relevance_dict[parag] = self._count_mean_similarity(parag_keywords, question_words)

        # getting the most relevant paragraph
        return sorted(relevance_dict, key=lambda x: relevance_dict[x], reverse=True)[0]

    def _get_all_spans(self, paragraph):
        words = paragraph.split()
        spans = []

        # all contiguous word spans of length 1..window_size, shortest first
        for n in range(1, self.window_size + 1):
            spans.extend(words[i:i + n]
                         for i in range(len(words) - n + 1))

        return spans

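    def _get_text_vector(self, text_words):
        words_vecs = []
        for word in text_words:

            try:
                words_vecs.append(self.model[word])
            except KeyError:
                words_vecs.append(np.zeros(300)) # out-of-vocabulary words get a zero vector

        # NOTE: np.mean without `axis` averages over all elements and returns a single scalar;
        # np.mean(..., axis=0) would give the element-wise 300-dimensional mean vector
        return np.mean(np.array(words_vecs))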

    def get_answer(self, question, text):

        question_words = preprocess_text(question) # getting words from the question

        paragraphs = []
        for parts in text:
            paragraphs.extend(parts['paragraphs'])

        relevant_paragraph = self._get_relevant_paragraph(paragraphs, question_words)
        paragraph_spans = self._get_all_spans(relevant_paragraph)
        question_vector = self._get_text_vector(question_words)

        spans_relevancy_dict = {}
        for span in paragraph_spans:
            span_vector = self._get_text_vector(span)
            span_similarity = cosine_similarity(question_vector, span_vector)

            spans_relevancy_dict[' '.join(span)] = span_similarity

        return sorted(spans_relevancy_dict,
                      key=lambda x: spans_relevancy_dict[x],
                      reverse=True)[0], relevant_paragraph
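In `_get_text_vector` above, `np.mean` is called without an `axis` argument, so it averages over every element of the stacked matrix and returns one scalar. The cosine similarity of two scalars can only be 1, -1, or undefined, so it cannot rank spans by meaning. The sketch below shows the difference on toy 3-dimensional vectors. `mean_vector` and the `toy_embeddings` dictionary are hypothetical stand-ins for the method and the loaded embedding model.

```python
import numpy as np

# Toy embeddings with made-up values, standing in for the pretrained model.
toy_embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
}


def mean_vector(words, embeddings, dim):
    """Average word vectors element-wise; unknown words count as zero vectors."""
    vecs = [embeddings.get(word, np.zeros(dim)) for word in words]
    if not vecs:
        return np.zeros(dim)  # assumption: empty text maps to the zero vector
    return np.mean(np.array(vecs), axis=0)


stacked = np.array([toy_embeddings["cat"], toy_embeddings["dog"]])
print(np.mean(stacked).shape)          # () : a scalar, direction is lost
print(np.mean(stacked, axis=0).shape)  # (3,) : one value per dimension
print(mean_vector(["cat", "dog"], toy_embeddings, dim=3))
```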

# Model inference

In [142]:
model = api.load('fasttext-wiki-news-subwords-300')



In [164]:
# for evaluation, we sample random papers and their questions from the dataset
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('Improving Question Generation With to the Point Context',
 'How big are significant improvements?')
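The sampling cell above is unseeded, so each run picks a different paper and question. If reproducible examples are wanted, the draw can be wrapped in a helper that takes a seeded generator. This is only a sketch: `sample_question` is a hypothetical helper, it uses the standard-library `random` module instead of `np.random`, and `toy_dataset` holds made-up records.

```python
import random


def sample_question(dataset, rng):
    """Pick a random paper and one of its questions using the given generator."""
    paper_id = rng.choice(sorted(dataset))  # sorted keys make the draw order-independent
    paper = dataset[paper_id]
    question_idx = rng.randrange(len(paper["qas"]))
    return paper_id, question_idx, paper["qas"][question_idx]["question"]


# Toy dataset with made-up records that follow the `qas` layout used above.
toy_dataset = {
    "paper-a": {"qas": [{"question": "Q1"}, {"question": "Q2"}]},
    "paper-b": {"qas": [{"question": "Q3"}]},
}

first = sample_question(toy_dataset, random.Random(0))
second = sample_question(toy_dataset, random.Random(0))
print(first == second)  # same seed, same sample
```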

In [171]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('the',
 'In the evaluations on the SQuAD dataset, our system achieves significant and consistent improvement as compared to all baseline methods. In particular, we demonstrate that the improvement is more significant with a larger relative distance between the answer and other non-stop sentence words that also appear in the ground truth question. Furthermore, our model is capable of generating diverse questions for a single sentence-answer pair where the sentence conveys multiple relations of its answer fragment.')

In [166]:
# let's compare with the gold answer (short answer + context)
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': False,
 'extractive_spans': [],
 'yes_no': None,
 'free_form_answer': 'Metrics show better results on all metrics compared to baseline except Bleu1  on Zhou split (worse by 0.11 compared to baseline). Bleu1 score on DuSplit is 45.66 compared to best baseline 43.47, other metrics on average by 1',
 'evidence': ['Table TABREF30 shows automatic evaluation results for our model and baselines (copied from their papers). Our proposed model which combines structured answer-relevant relations and unstructured sentences achieves significant improvements over proximity-based answer-aware models BIBREF9, BIBREF15 on both dataset splits. Presumably, our structured answer-relevant relation is a generalization of the context explored by the proximity-based methods because they can only capture short dependencies around answer fragments while our extractions can capture both short and long dependencies given the answer fragments. Moreover, our proposed framework is a general one to j
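A gold answer in this dataset can be an extractive span, a free-form string, a yes/no flag, or marked unanswerable, as the answer dictionaries in this notebook show. When comparing it with the model output, it can help to reduce it to one string. The helper below is a sketch: `gold_answer_text` is a hypothetical name, and the priority order among the fields is an assumption.

```python
def gold_answer_text(answer):
    """Reduce one gold answer dictionary to a single comparable string."""
    if answer["unanswerable"]:
        return "unanswerable"
    if answer["extractive_spans"]:
        return " | ".join(span.strip() for span in answer["extractive_spans"])
    if answer["free_form_answer"]:
        return answer["free_form_answer"]
    if answer["yes_no"] is not None:
        return "yes" if answer["yes_no"] else "no"
    return ""


example = {
    "unanswerable": False,
    "extractive_spans": [],
    "yes_no": None,
    "free_form_answer": "NCEL considers only adjacent mentions.",
}
print(gold_answer_text(example))
```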

## Let's generate a few more examples

### Example 1

In [172]:
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('Neural Collective Entity Linking',
 'Do they only use adjacent entity mentions or use more than that in some cases (next to adjacent)?')

In [173]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('Local',
 'Local features focus on how compatible the entity is mentioned in a piece of text (i.e., the mention and the context words). Except for the prior probability (Section SECREF9 ), we define two types of local features for each candidate entity INLINEFORM0 :')

In [174]:
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': False,
 'extractive_spans': [],
 'yes_no': None,
 'free_form_answer': 'NCEL considers only adjacent mentions.',
 'evidence': ['Complexity Analysis Compared with local methods, the main disadvantage of collective methods is high complexity and expensive costs. Suppose there are INLINEFORM0 mentions in documents on average, among these global models, NCEL not surprisingly has the lowest time complexity INLINEFORM1 since it only considers adjacent mentions, where INLINEFORM2 is the number of sub-GCN layers indicating the iterations until convergence. AIDA has the highest time complexity INLINEFORM3 in worst case due to exhaustive iteratively finding and sorting the graph. The LBP and PageRank/random walk based methods achieve similar high time complexity of INLINEFORM4 mainly because of the inference on the entire graph.'],
 'highlighted_evidence': ['Suppose there are INLINEFORM0 mentions in documents on average, among these global models, NCEL not surprisingly has the lo

### Example 2

In [175]:
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('A Framework for Evaluation of Machine Reading Comprehension Gold Standards',
 'Have they made any attempt to correct MRC gold standards according to their findings? ')

In [176]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('by',
 'One way by which developers of modern crowd-sourced gold standards ensure quality is by having the same entry annotated by multiple workers BIBREF18 and keeping only those with high agreement. We investigate whether this method is enough to establish a sound ground truth answer that is unambiguously correct. Concretely we annotate an answer as Debatable when the passage features multiple plausible answers, when multiple expected answers contradict each other, or an answer is not specific enough with respect to the question and a more specific answer is present. We annotate an answer as Wrong when it is factually wrong and a correct answer is present in the context.')

In [177]:
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': False,
 'extractive_spans': [],
 'yes_no': True,
 'free_form_answer': '',
 'evidence': ['In this paper, we introduce a novel framework to characterise machine reading comprehension gold standards. This framework has potential applications when comparing different gold standards, considering the design choices for a new gold standard and performing qualitative error analyses for a proposed approach.'],
 'highlighted_evidence': ['This framework has potential applications when comparing different gold standards, considering the design choices for a new gold standard and performing qualitative error analyses for a proposed approach.']}

### Example 3

In [178]:
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('Embedding Geographic Locations for Modelling the Natural Environment using Flickr Tags and Structured Data',
 'what dataset is used in this paper?')

In [179]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('is',
 'Our work is different from these studies, as our focus is on representing locations based on a given text description of that location (in the form of Flickr tags), along with numerical and categorical features from scientific datasets.')

In [180]:
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': False,
 'extractive_spans': [' the same datasets as BIBREF7'],
 'yes_no': None,
 'free_form_answer': '',
 'evidence': ['There is a wide variety of structured data that can be used to describe locations. In this work, we have restricted ourselves to the same datasets as BIBREF7 . These include nine (real-valued) numerical features, which are latitude, longitude, elevation, population, and five climate related features (avg. temperature, avg. precipitation, avg. solar radiation, avg. wind speed, and avg. water vapor pressure). In addition, 180 categorical features were used, which are CORINE land cover classes at level 1 (5 classes), level 2 (15 classes) and level 3 (44 classes) and 116 soil types (SoilGrids). Note that each location should belong to exactly 4 categories: one CORINE class at each of the three levels and a soil type.'],
 'highlighted_evidence': [' In this work, we have restricted ourselves to the same datasets as BIBREF7 . These include nine (real-valued)

### Example 4

In [181]:
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('Impact of Batch Size on Stopping Active Learning for Text Classification',
 'What downstream tasks are evaluated?')

In [182]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('An',
 'An important aspect of the active learning process is when to stop the active learning process. Stopping methods enable the potential benefits of active learning to be achieved in practice. Without stopping methods, the active learning process would continue until all annotations have been labeled, defeating the purpose of using active learning. Accordingly, there has been a lot of interest in the development of active learning stopping methods BIBREF2 , BIBREF3 , BIBREF4 , BIBREF5 , BIBREF6 .')

In [183]:
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': True,
 'extractive_spans': [],
 'yes_no': None,
 'free_form_answer': '',
 'evidence': [],
 'highlighted_evidence': []}

### Example 5

In [186]:
text = qasper_dict[
    list(qasper_dict.keys())[np.random.randint(len(qasper_dict))]
    ]
question_idx = np.random.randint(len(text['qas']))
question = text['qas'][question_idx]['question']
text['title'], question

('Predicting Annotation Difficulty to Improve Task Routing and Model Performance for Biomedical Information Extraction',
 'How much higher quality is the resulting annotated data?')

In [187]:
qa_model = QASimilarityModel(window_size=10, n_keywords=10, model=model)
qa_model.get_answer(question, text['full_text'])

  return np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))


('We',
 'We now examine the possibility that the higher quality and more consistent annotations of domain experts on the difficult instances will benefit the extraction model. This simulates an annotation strategy in which we route difficult instances to domain experts and easier ones to crowd annotators. We also contrast the value of difficult data to that of an i.i.d. random sample of the same size, both annotated by experts.')

In [188]:
text['qas'][question_idx]['answers'][0]['answer']

{'unanswerable': False,
 'extractive_spans': ['improvement when the difficult subset with expert annotations is mixed with the remaining crowd annotation is 3.5 F1 score, much larger than when a random set of expert annotations are added'],
 'yes_no': None,
 'free_form_answer': '',
 'evidence': ['The results show adding more training data with crowd annotation still improves at least 1 point F1 score in all three extraction tasks. The improvement when the difficult subset with expert annotations is mixed with the remaining crowd annotation is 3.5 F1 score, much larger than when a random set of expert annotations are added. The model trained with re-annotating the difficult subset (D+Other) also outperforms the model with re-annotating the random subset (R+Other) by 2 points in F1. The model trained with re-annotating both of difficult and random subsets (D+R+Other), however, achieves only marginally higher F1 than the model trained with the re-annotated difficult subset (D+Other). In s

From these examples we conclude that our model is able to find a context that fits the question, one that may contain the answer, but the answer itself is detected rather poorly. This may be related to the main drawback of the model noted earlier; it is also possible that mathematical methods other than cosine similarity are needed.

As for the retrieved contexts, they do not match the gold ones and sometimes do not contain the expected answer (that is, the answer given in our dataset). However, these contexts often contain an answer that is correct and logical to some degree, even if it is sometimes too vague and does not address the question directly (the model resembles a dishonest person trying to dodge an uncomfortable question). For an approach that is not model-based, our algorithm seems to work reasonably well overall.