Dependencies

In [96]:
import torch 
from transformers import (
    BertForQuestionAnswering,
    BertTokenizerFast
)

from scipy.special import softmax
import plotly.express as px
import pandas as pd
import numpy as np 

In [97]:
context = "The giraffe is a large African hoofed mammal belonging to the genus Giraffa. It is the tallest living terrestrial animal and the largest ruminant on Earth. Traditionally, giraffes were thought to be one species, Giraffa camelopardalis, with nine subspecies. Most recently, researchers proposed dividing them into up to eight extant species due to new research into their mitochondrial and nuclear DNA, as well as morphological measurements. Seven other extinct species of Giraffa are known from the fossil record."
question = "How many giraffes species are there?"

In [98]:
model_name = "deepset/bert-base-cased-squad2"
#model_name = "deepset/roberta-base-squad2"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [99]:
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)


In [100]:
with torch.no_grad():
    outputs = model(**inputs)

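The tokenizer encodes the question and the context together as one sequence. For every token in that sequence, BERT returns two logits; after a softmax, they give the probability that the token is the start of the answer and the probability that it is the end of the answer.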
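To make the start/end scoring concrete, here is a minimal sketch on made-up logits for a five-token input. The numbers and the names `softmax_1d`, `toy_start_logits` and `toy_end_logits` are illustrative assumptions, not model output. It turns each logit vector into a probability distribution and picks the most probable start and end positions.

```python
import numpy as np

def softmax_1d(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtract the max to avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical logits for five token positions (not real model output).
toy_start_logits = np.array([-1.0, -6.0, 2.0, -4.0, -5.0])
toy_end_logits = np.array([-2.0, -6.0, 0.5, 3.0, -5.0])

toy_start_probs = softmax_1d(toy_start_logits)
toy_end_probs = softmax_1d(toy_end_logits)

# The most probable start and end positions delimit the predicted answer span.
toy_start = int(np.argmax(toy_start_probs))
toy_end = int(np.argmax(toy_end_probs))
print(toy_start, toy_end)
```

In this toy case the predicted span would cover tokens 2 through 3 inclusive.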

In [101]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-0.7849, -7.7939, -8.3531, -7.6489, -8.5965, -8.6282, -9.4338, -8.9586,
         -8.4428, -8.2568, -8.0975, -8.8458, -8.2185, -4.5230, -8.3194, -6.7624,
         -8.6803, -7.0846, -5.9340, -7.7338, -6.8487, -8.5294, -9.1060, -6.9105,
         -7.6318, -8.2203, -8.7276, -8.3912, -8.0501, -6.9143, -8.6218, -7.8172,
         -9.0100, -8.2569, -6.4151, -8.2700, -7.5583, -5.8637, -8.2565, -7.4781,
         -8.7355, -8.7740, -8.0287, -7.1294, -7.5655, -8.6153, -9.4420, -8.9166,
         -7.9060, -8.5978, -7.0571, -7.5721, -2.7808, -7.8713, -7.4432, -8.0318,
         -8.0333, -6.7128, -7.8459, -8.0049, -1.5219, -5.9873, -9.0419, -4.5403,
         -7.7411, -6.8027, -8.3700, -5.6448, -8.4717, -8.2887, -9.5469, -8.7747,
         -7.0176, -1.2607, -5.1707, -5.9624, -5.9328, -7.4170, -8.5625, -7.2535,
         -7.6185, -7.1874, -6.1013, -6.6234, -0.8019, -6.7759,  0.6271, -5.7152,
         -4.9117, -7.4438, -8.8701, -8.0098, -7.9540, -8

In [102]:
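# Convert the logits to NumPy and normalize over the token axis (batch of one).
start_scores = softmax(outputs.start_logits.numpy(), axis=-1)[0]
end_scores = softmax(outputs.end_logits.numpy(), axis=-1)[0]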

In [103]:
scores_df = pd.DataFrame({
    "Token Position": list(range(len(start_scores))) * 2,
    "Score": list(start_scores) + list(end_scores),
    "Score Type": ["Start"] * len(start_scores) + ["End"] * len(end_scores),
})
px.bar(scores_df, x="Token Position", y="Score", color="Score Type", barmode="group", title="Start and End Scores for Tokens")

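The answer runs from the token with the highest start score to the token with the highest end score. With a SQuAD 2.0 model, token 0 ([CLS]) getting the highest scores means the model found no answer in the context.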

In [104]:
start_index = np.argmax(start_scores) 
end_index = np.argmax(end_scores)

In [105]:
answer_ids = inputs.input_ids[0][start_index : end_index + 1]
answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
answer = tokenizer.convert_tokens_to_string(answer_tokens)

In [106]:
answer

'eight'
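Taking the two argmaxes independently can place the end before the start, in which case the slice above is empty. A common alternative, which is not what this notebook does, is to search jointly for the best span with start <= end. The sketch below assumes hypothetical names (`best_span`, `max_answer_len`) and made-up probabilities.

```python
import numpy as np

def best_span(start_probs, end_probs, max_answer_len=30):
    """Return (start, end, score) maximizing start_probs[s] * end_probs[e] with s <= e."""
    best = (0, 0, 0.0)
    for s, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_probs), s + max_answer_len)
        for e in range(s, last):
            score = float(p_start * end_probs[e])
            if score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical probabilities where the independent argmaxes are inconsistent:
# the best start (index 3) comes after the best end (index 1).
toy_start_probs = np.array([0.05, 0.10, 0.25, 0.60])
toy_end_probs = np.array([0.05, 0.50, 0.15, 0.30])

span = best_span(toy_start_probs, toy_end_probs)
print(span)
```

Here the independent argmaxes would give start 3 and end 1, while the joint search settles on the single-token span at index 3.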

In [107]:
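def predict_answer(context, question):
    """Return (answer, confidence); answer is None when the model predicts no answer."""
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Turn the logits into probabilities over token positions (batch of one).
    start_scores = softmax(outputs.start_logits.numpy(), axis=-1)[0]
    end_scores = softmax(outputs.end_logits.numpy(), axis=-1)[0]
    start_idx = np.argmax(start_scores)
    end_idx = np.argmax(end_scores)
    # Confidence is the mean of the best start and best end probabilities.
    confidence_score = (start_scores[start_idx] + end_scores[end_idx]) / 2
    answer_ids = inputs.input_ids[0][start_idx : end_idx + 1]
    answer_tokens = tokenizer.convert_ids_to_tokens(answer_ids)
    answer = tokenizer.convert_tokens_to_string(answer_tokens)
    # An empty span (end before start) or a bare [CLS] token means "no answer".
    if answer and answer != tokenizer.cls_token:
        return answer, confidence_score
    return None, confidence_score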

In [120]:
context = """Coffee is a beverage prepared from roasted coffee beans. Darkly colored, bitter, and slightly acidic, coffee has a stimulating effect on humans, primarily due to its caffeine content. It has the highest sales in the world market for hot drinks.[2]

The seeds of the Coffea plant's fruits are separated to produce unroasted green coffee beans. The beans are roasted and then ground into fine particles that are typically steeped in hot water before being filtered out, producing a cup of coffee. It is usually served hot, although chilled or iced coffee is common. Coffee can be prepared and presented in a variety of ways (e.g., espresso, French press, caffè latte, or already-brewed canned coffee). Sugar, sugar substitutes, milk, and cream are often added to mask the bitter taste or enhance the flavor.

Though coffee is now a global commodity, it has a long history tied closely to food traditions around the Red Sea. The earliest credible evidence of coffee-drinking as the modern beverage appears in modern-day Yemen in southern Arabia in the middle of the 15th century in Sufi shrines, where coffee seeds were first roasted and brewed in a manner similar to how it is now prepared for drinking.[3] The coffee beans were procured by the Yemenis from the Ethiopian Highlands via coastal Somali intermediaries, and cultivated in Yemen. By the 16th century, the drink had reached the rest of the Middle East and North Africa, later spreading to Europe.

The two most commonly grown coffee bean types are C. arabica and C. robusta.[4] Coffee plants are cultivated in over 70 countries, primarily in the equatorial regions of the Americas, Southeast Asia, the Indian subcontinent, and Africa. As of 2018, Brazil was the leading grower of coffee beans, producing 35% of the world's total. Green, unroasted coffee is traded as an agricultural commodity. Despite sales of coffee reaching billions of dollars worldwide, farmers producing coffee beans disproportionately live in poverty. Critics of the coffee industry have also pointed to its negative impact on the environment and the clearing of land for coffee-growing and water use
Coffee has become a vital cash crop for many developing countries. Over one hundred million people in developing countries have become dependent on coffee as their primary source of income. It has become the primary export and backbone for African countries like Uganda, Burundi, Rwanda, and Ethiopia,[39] as well as many Central American countries."""



In [109]:
predict_answer(context, "What is coffee?")

('a beverage prepared from roasted coffee beans', 0.757049560546875)

In [110]:
predict_answer(context, "What are the common coffee beans?")

('C. arabica and C. robusta', 0.8553870916366577)

In [111]:
predict_answer(context, "How are you?")

(None, 0.9917401671409607)

In [115]:
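def group(sentences, group_size, stride):
    """Split sentences into groups of group_size; consecutive groups share `stride` items."""
    if stride >= group_size:
        raise ValueError("stride must be smaller than group_size")
    groups = []
    num_sentences = len(sentences)
    # Each group starts (group_size - stride) items after the previous one.
    for i in range(0, num_sentences, group_size - stride):
        chunk = sentences[i: i + group_size]
        groups.append(chunk)
    return groups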

In [118]:
sentences = [
    "Sentence 1.",
    "Sentence 2.",
    "Sentence 3.",
    "Sentence 4.",
    "Sentence 5.",
    "Sentence 6.",
    "Sentence 7.",
    "Sentence 8.",
    "Sentence 9.",
    "Sentence 10."
]

grouped_sentences = group(sentences, group_size=3, stride=1)
grouped_sentences

[['Sentence 1.', 'Sentence 2.', 'Sentence 3.'],
 ['Sentence 3.', 'Sentence 4.', 'Sentence 5.'],
 ['Sentence 5.', 'Sentence 6.', 'Sentence 7.'],
 ['Sentence 7.', 'Sentence 8.', 'Sentence 9.'],
 ['Sentence 9.', 'Sentence 10.']]
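The grouping above amounts to a sliding window whose step is the group size minus the number of shared items. The following sketch computes only the window boundaries for the same setting (10 items, groups of 3, 1 shared item); `window_starts` and its `overlap` parameter are hypothetical names used for illustration.

```python
def window_starts(num_items, group_size, overlap):
    """Start index of each window when consecutive windows share `overlap` items."""
    step = group_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than group_size")
    return list(range(0, num_items, step))

starts = window_starts(num_items=10, group_size=3, overlap=1)
# Half-open [start, end) index ranges, clipped at the end of the list.
windows = [(s, min(s + 3, 10)) for s in starts]
print(starts)
print(windows)
```

The last window is shorter than the others because it is clipped at the end of the list, which matches the two-sentence final group shown above.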

In [124]:
sentences = context.split("\n")
grouped_sentences = group(sentences, group_size=3, stride=1)
questions = ["What is coffee?", "What are the most common coffee beans?", "How can I make ice coffee?", "How many people are dependent on coffee for their income?"]
answers = {}
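Note that splitting on "\n" yields paragraphs rather than sentences, and blank lines between paragraphs become empty strings that still occupy a slot in each group. A minimal sketch of dropping those empty entries, on a made-up text (`toy_text` is an assumption, not the notebook's context):

```python
# Hypothetical text with one blank line between the first two paragraphs.
toy_text = "First paragraph.\n\nSecond paragraph.\nThird paragraph."

raw_parts = toy_text.split("\n")
# Keep only the parts that contain something other than whitespace.
paragraphs = [part.strip() for part in raw_parts if part.strip()]
print(raw_parts)
print(paragraphs)
```

Whether to filter is a design choice: keeping the empty strings makes each group of three cover fewer real paragraphs.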

In [130]:
def get_answers(grouped_sentences, context, questions):
    """Return the highest-scoring answer per question across all groups.

    `context` is unused; each group is joined into its own context.
    """
    answers = {}
    for chunk in grouped_sentences:
        chunk_context = "\n".join(chunk)
        for question in questions:
            answer, score = predict_answer(chunk_context, question)
            # Keep the answer only if it beats the best score seen so far.
            if answer and (question not in answers or score > answers[question][1]):
                answers[question] = (answer, score)
    return answers

In [131]:
get_answers(grouped_sentences, context, questions)

{'What is coffee?': ('a beverage prepared from roasted coffee beans',
  0.8990074396133423),
 'What are the most common coffee beans?': ('C. arabica and C. robusta',
  0.9542303681373596),
 'How many people are dependent on coffee for their income?': ('Over one hundred million',
  0.8877464532852173)}
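"How can I make ice coffee?" is missing from the result because no group produced an answer for it. The selection logic can be sketched without the model, on made-up (question, answer, score) triples; `best_answers`, `min_score` and the scores below are hypothetical, and the threshold is an optional extra that the notebook does not use.

```python
def best_answers(candidates, min_score=0.0):
    """Keep the highest-scoring non-empty answer per question."""
    best = {}
    for question, answer, score in candidates:
        # Skip "no answer" predictions and answers below the threshold.
        if not answer or score < min_score:
            continue
        if question not in best or score > best[question][1]:
            best[question] = (answer, score)
    return best

# Hypothetical predictions collected from several chunks (not model output).
toy_candidates = [
    ("What is coffee?", "a beverage", 0.76),
    ("What is coffee?", None, 0.99),
    ("What is coffee?", "a beverage prepared from roasted coffee beans", 0.90),
    ("How can I make ice coffee?", None, 0.97),
]
result = best_answers(toy_candidates)
print(result)
```

A question whose candidates are all None never enters the dictionary, so callers should treat a missing key as "no answer found".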