# Hugging Face for question answering

## Test with pretrained models

Like Flair, the Hugging Face library (`transformers`) can load pretrained models; it also provides classes that configure a model directly for question answering.

In [68]:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')



The following lines declare the question and the text snippet, then encode them into token IDs and segment IDs (`token_type_ids`), which are the inputs the model expects.

In [69]:
question, text = "Who is Mary?", "Mary is a brillant student in Rome"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(all_tokens)
print(input_ids)
print(token_type_ids)

['[CLS]', 'who', 'is', 'mary', '?', '[SEP]', 'mary', 'is', 'a', 'br', '##illa', '##nt', 'student', 'in', 'rome', '[SEP]']
[101, 2040, 2003, 2984, 1029, 102, 2984, 2003, 1037, 7987, 9386, 3372, 3076, 1999, 4199, 102]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
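The segment IDs printed above follow a simple rule: tokens up to and including the first `[SEP]` belong to the question (0), and everything after belongs to the text (1). The sketch below rebuilds them from the token list in plain Python. The helper name `segment_ids` is hypothetical and is not part of `transformers`; it only illustrates what the tokenizer already returns.

```python
# Token list copied from the output above.
all_tokens = ['[CLS]', 'who', 'is', 'mary', '?', '[SEP]', 'mary', 'is', 'a',
              'br', '##illa', '##nt', 'student', 'in', 'rome', '[SEP]']

def segment_ids(tokens, sep_token='[SEP]'):
    """Return 0 for tokens up to and including the first separator, 1 afterwards."""
    ids = []
    segment = 0
    for token in tokens:
        ids.append(segment)
        if token == sep_token:
            segment = 1  # everything after the first [SEP] belongs to the text
    return ids

rebuilt_type_ids = segment_ids(all_tokens)
print(rebuilt_type_ids)
```

The result should match the `token_type_ids` list returned by `encode_plus`.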


In this first example, the encoded question and text are fed to the model, which computes two scores for every input token: one for being the start of the answer and one for being the end. The answer span is then taken between the token with the highest start score and the token with the highest end score.

In [70]:
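# Index the output instead of unpacking it, so this works whether the model
# returns a plain tuple (older releases) or an output object (newer releases).
outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
start_scores, end_scores = outputs[0], outputs[1]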
print(start_scores)
print(end_scores)
print(torch.argmax(start_scores))
print(torch.argmax(end_scores))

tensor([[-5.0109, -5.9293, -6.8523, -2.9008, -8.2639, -5.0107,  0.6795, -3.0905,
          4.8076,  3.5589, -3.9001, -4.0644, -1.2053, -2.7351,  0.3483, -5.0100]],
       grad_fn=<SqueezeBackward1>)
tensor([[ 0.1915, -6.0547, -6.0942, -2.1645, -5.8761,  0.1918, -0.7231, -4.6880,
         -3.2773, -2.9547, -3.3409,  0.5081,  4.2771, -2.4325,  4.1675,  0.1900]],
       grad_fn=<SqueezeBackward1>)
tensor(8)
tensor(12)
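Taking the two maxima independently works here, but nothing guarantees that the best end index comes after the best start index. A common alternative is to search for the pair that maximizes the sum of the start and end scores, subject to `start <= end` and a length limit. The sketch below does this in plain Python on the scores printed above. The function name `best_span` and the `max_len` value are assumptions made for illustration, not something the notebook or the library defines.

```python
# Scores copied from the printed tensors above (batch dimension dropped).
start_scores = [-5.0109, -5.9293, -6.8523, -2.9008, -8.2639, -5.0107, 0.6795, -3.0905,
                4.8076, 3.5589, -3.9001, -4.0644, -1.2053, -2.7351, 0.3483, -5.0100]
end_scores = [0.1915, -6.0547, -6.0942, -2.1645, -5.8761, 0.1918, -0.7231, -4.6880,
              -3.2773, -2.9547, -3.3409, 0.5081, 4.2771, -2.4325, 4.1675, 0.1900]

def best_span(start, end, max_len=15):
    """Return (i, j) maximizing start[i] + end[j] with i <= j < i + max_len."""
    best, best_score = (0, 0), float('-inf')
    for i, s in enumerate(start):
        for j in range(i, min(i + max_len, len(end))):
            score = s + end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

span = best_span(start_scores, end_scores)
print(span)
```

For this input the constrained search selects the same indices as the two separate `argmax` calls; the two approaches differ only when the independent maxima would give an end position before the start.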


In [71]:

answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
print(answer)


a br ##illa ##nt student


In [72]:
# Start with the first token.
answer_start=torch.argmax(start_scores)
answer_end=torch.argmax(end_scores)
answer = all_tokens[answer_start]

# Select the remaining answer tokens and join them with whitespace.
for i in range(answer_start + 1, answer_end + 1):
    
    # If it's a subword token, then recombine it with the previous token.
    if all_tokens[i][0:2] == '##':
        answer += all_tokens[i][2:]
    
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + all_tokens[i]

print('Answer: "' + answer + '"')

Answer: "a brillant student"
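Since the same reconstruction is needed for any WordPiece tokenizer, the loop above can be wrapped in a small reusable function. This is a minimal sketch; the name `wordpiece_to_text` is hypothetical, and the token list is copied from the answer span printed earlier.

```python
def wordpiece_to_text(tokens):
    """Join WordPiece tokens, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]  # continuation piece: append without a space
        else:
            words.append(token)     # new word
    return ' '.join(words)

answer_tokens = ['a', 'br', '##illa', '##nt', 'student']
rebuilt_answer = wordpiece_to_text(answer_tokens)
print('Answer: "' + rebuilt_answer + '"')
```

The tokenizer's own `convert_tokens_to_string` method should give a comparable result and may be preferable when the tokenizer object is at hand.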


## Test with an imported model

In [5]:
from transformers import BertTokenizer, BertModel, BertForQuestionAnswering
import torch
tokenizer = BertTokenizer.from_pretrained("./embedding_models/biobert_factoid")
model = BertForQuestionAnswering.from_pretrained("./embedding_models/biobert_factoid")

In [6]:
question, text = "Who is Mary?", "Mary is a brillant student in Rome"
encoding = tokenizer.encode_plus(question, text)
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
all_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(all_tokens)
print(input_ids)
print(token_type_ids)

['[CLS]', 'who', 'is', 'ma', '##ry', '?', '[SEP]', 'ma', '##ry', 'is', 'a', 'br', '##illa', '##nt', 'student', 'in', 'r', '##ome', '[SEP]']
[101, 1150, 1110, 12477, 1616, 136, 102, 12477, 1616, 1110, 170, 9304, 5878, 2227, 2377, 1107, 187, 6758, 102]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [7]:
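# Index the output instead of unpacking it, so this works whether the model
# returns a plain tuple (older releases) or an output object (newer releases).
outputs = model(torch.tensor([input_ids]), token_type_ids=torch.tensor([token_type_ids]))
start_scores, end_scores = outputs[0], outputs[1]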
print(start_scores)
print(end_scores)
print(torch.argmax(start_scores))
print(torch.argmax(end_scores))

tensor([[-3.2683, -4.1020, -7.0528, -3.5073, -7.0091, -6.8528, -1.2135,  1.1049,
         -3.0018,  3.1927,  7.3026,  5.9603, -6.2884, -2.8307,  4.2421, -1.0523,
          0.0679, -4.2151, -1.2135]], grad_fn=<SqueezeBackward1>)
tensor([[-0.3841, -6.0075, -5.4050, -7.4030, -0.5038, -4.6372,  4.4007, -4.6523,
          0.9098, -3.6444, -1.0526, -1.8325, -4.6182,  2.4971,  5.3820,  0.4778,
         -0.3536,  1.9850,  4.4007]], grad_fn=<SqueezeBackward1>)
tensor(10)
tensor(14)
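The printed scores are unnormalized logits, so their absolute values are hard to interpret or to compare between the two models. Applying a softmax turns them into probabilities over the token positions. The sketch below does this in plain Python for the start scores printed above; the `softmax` helper is written out here for illustration (in the notebook itself, `torch.softmax` could be applied to the tensor instead).

```python
import math

# Start scores copied from the printed tensor above (batch dimension dropped).
start_scores = [-3.2683, -4.1020, -7.0528, -3.5073, -7.0091, -6.8528, -1.2135, 1.1049,
                -3.0018, 3.1927, 7.3026, 5.9603, -6.2884, -2.8307, 4.2421, -1.0523,
                0.0679, -4.2151, -1.2135]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

start_probs = softmax(start_scores)
best_index = max(range(len(start_probs)), key=start_probs.__getitem__)
print(best_index, round(start_probs[best_index], 3))
```

The position with the highest probability is the same one `torch.argmax` returns, since softmax preserves the ordering of the scores.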


In [8]:

answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
print(answer)

a br ##illa ##nt student


In [9]:
# Start with the first token.
answer_start=torch.argmax(start_scores)
answer_end=torch.argmax(end_scores)
answer = all_tokens[answer_start]

# Select the remaining answer tokens and join them with whitespace.
for i in range(answer_start + 1, answer_end + 1):
    
    # If it's a subword token, then recombine it with the previous token.
    if all_tokens[i][0:2] == '##':
        answer += all_tokens[i][2:]
    
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + all_tokens[i]

print('Answer: "' + answer + '"')

Answer: "a brillant student"
