In [1]:
import datasets

dataset = datasets.load_dataset("squad_v2")

Found cached dataset squad_v2 (/home/ubuntu/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)


  0%|          | 0/2 [00:00<?, ?it/s]

In [3]:
dataset['train'][0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}

In [2]:
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [4]:
import nltk
nltk.download('punkt')

def find_sentence_in_context(context, sentence):
    return context.find(sentence)

def sentence_starts(context, sentences):
    starts = []
    for sentence in sentences:
        starts.append(context.find(sentence))
    return starts

def sentence_end(context, sentences, sentence_starts):
    return [start + len(sentence) for start, sentence in zip(sentence_starts, sentences)]

def find_sentence_from_answer(context, sentence_starts, sentence_ends, answer):
    answer_start = context.find(answer)
    answer_end = answer_start + len(answer)
    start = max([start for start in sentence_starts if start <= answer_start])
    end = min([end for end in sentence_ends if end >= answer_end])
    return start, end

def find_position(offset_mapping, text_position):
    for i, offset in enumerate(offset_mapping):
        if offset[0] <= text_position <= offset[1]:
            return i
    return 0

def convert_to_features(batch):
    questions = [q.strip() for q in batch["question"]]
    inputs = tokenizer(questions, batch["context"], truncation='only_second', padding="max_length", max_length=512,return_offsets_mapping=True)
    offset_mapping = inputs.pop("offset_mapping")
    answers = [answer["text"][0] for answer in batch["answers"]]
    sentences = [nltk.sent_tokenize(context) for context in batch["context"]]
    sentence_starts_ = [sentence_starts(context, sentence) for context, sentence in zip(batch["context"], sentences)]
    sentence_ends_ = [sentence_end(context, sentence, sentence_starts) for context, sentence, sentence_starts in zip(batch["context"], sentences, sentence_starts_)]
    positions_text = [find_sentence_from_answer(context, sentence_starts, sentence_ends, answer) for context, sentence_starts, sentence_ends, answer in zip(batch["context"], sentence_starts_, sentence_ends_, answers)]
    start_positions = [find_position(offset_mapping[i], label[0]) for i, label in enumerate(positions_text)]
    end_positions = [find_position(offset_mapping[i], label[1]) for i, label in enumerate(positions_text)]

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions
    inputs['labels'] = list(zip(start_positions, end_positions))
    return inputs


    

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
dataset = dataset.filter(lambda x: len(x['answers']['text']) > 0).map(convert_to_features, batched=True, batch_size=8)

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-e230092325bb4c8b.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-9e7035250d7beaf6.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-4c68b83fdea2a377.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d/cache-e47a902b3fa9a3cc.arrow


In [6]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [7]:
from transformers.trainer import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./models/distilbert-squad-full-sentence',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    do_train=True,
    do_eval=True,
    warmup_steps=500,   
    weight_decay=0.01,
    logging_dir='./logs',
    learning_rate=2e-5,
    logging_steps=10,

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
)

In [8]:
trainer.train()

Step,Training Loss
10,5.9931
20,5.9634
30,5.9246
40,5.7821
50,5.6038
60,5.4041
70,5.187
80,4.939
90,4.5968
100,4.2214


In [9]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForQuestionAnswering.forward` and have been ignored: title, context, labels, answers, question, id. If title, context, labels, answers, question, id are not expected by `BertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5928
  Batch size = 64


{'eval_loss': 0.45976531505584717}

In [10]:
tokenizer.save_pretrained('./models/distilbert-squad-full-sentence')
model.save_pretrained('./models/distilbert-squad-full-sentence')

tokenizer config file saved in ./models/distilbert-squad-full-sentence/tokenizer_config.json
Special tokens file saved in ./models/distilbert-squad-full-sentence/special_tokens_map.json
Configuration saved in ./models/distilbert-squad-full-sentence/config.json
Model weights saved in ./models/distilbert-squad-full-sentence/pytorch_model.bin


In [11]:
from transformers import pipeline

In [20]:
p = pipeline('question-answering', model=model.cpu(), tokenizer=tokenizer)

In [24]:
item = dataset['validation'][12]
print(item['question'])
print(item['context'])
print(p(question=item['question'], context=item['context']))

print(item['answers']['text'][0])
print(item['start_positions'])
print(item['end_positions'])

What river originally bounded the Duchy
In the course of the 10th century, the initially destructive incursions of Norse war bands into the rivers of France evolved into more permanent encampments that included local women and personal property. The Duchy of Normandy, which began in 911 as a fiefdom, was established by the treaty of Saint-Clair-sur-Epte between King Charles III of West Francia and the famed Viking ruler Rollo, and was situated in the former Frankish kingdom of Neustria. The treaty offered Rollo and his men the French lands between the river Epte and the Atlantic coast in exchange for their protection against further Viking incursions. The area corresponded to the northern part of present-day Upper Normandy down to the river Seine, but the Duchy would eventually extend west beyond the Seine. The territory was roughly equivalent to the old province of Rouen, and reproduced the Roman administrative structure of Gallia Lugdunensis II (part of the former Gallia Lugdunensis)

In [None]:
item

{'id': '56dddf4066d3e219004dad5f',
 'title': 'Normans',
 'context': 'The Norman dynasty had a major political, cultural and military impact on medieval Europe and even the Near East. The Normans were famed for their martial spirit and eventually for their Christian piety, becoming exponents of the Catholic orthodoxy into which they assimilated. They adopted the Gallo-Romance language of the Frankish land they settled, their dialect becoming known as Norman, Normaund or Norman French, an important literary language. The Duchy of Normandy, which they formed by treaty with the French crown, was a great fief of medieval France, and under Richard I of Normandy was forged into a cohesive and formidable principality in feudal tenure. The Normans are noted both for their culture, such as their unique Romanesque architecture and musical traditions, and for their significant military accomplishments and innovations. Norman adventurers founded the Kingdom of Sicily under Roger II after conquering