# Pre-trained BERT model for question answering

We rely on the HuggingFace library, which provides pre-trained Transformer models:

https://huggingface.co/

https://github.com/huggingface/transformers/


In particular, this example is based on the short example shown in the documentation for TFBertForQuestionAnswering, used here with a BERT-Large checkpoint fine-tuned for question answering:

https://huggingface.co/transformers/model_doc/bert.html?highlight=bertforquestionanswering#tfbertforquestionanswering 

https://huggingface.co/transformers/pretrained_models.html


We will use the following checkpoint, which was fine-tuned on the SQuAD dataset:

https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad

https://rajpurkar.github.io/SQuAD-explorer/


In [None]:
!pip install transformers

In [2]:
import tensorflow as tf
import numpy as np

Let's download the model. `from_pretrained` fetches the weights from the HuggingFace servers and caches them on the machine that runs the notebook; on a hosted runtime that is the remote machine's disk rather than your own.

Note that the model is 1.34 GB.

In [None]:
from transformers import TFBertForQuestionAnswering
model = TFBertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


We load the tokenizer with its vocabulary, using the uncased variant:

In [None]:
from transformers import BertTokenizer 

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

In [None]:
tokenizer.vocab_size

## Using the pre-trained model to answer questions about a text

For this example we will use a few paragraphs from the Wikipedia article on Stephen Hawking:

https://en.wikipedia.org/wiki/Stephen_Hawking


Hawking was born in Oxford into a family of doctors. He began his university education at University College, Oxford, in October 1959 at the age of 17, where he received a first-class BA (Hons.) degree in physics. He began his graduate work at Trinity Hall, Cambridge, in October 1962, where he obtained his PhD degree in applied mathematics and theoretical physics, specialising in general relativity and cosmology in March 1966. 

In 1963, Hawking was diagnosed with an early-onset slow-progressing form of motor neurone disease that gradually paralysed him over the decades.[20][21] After the loss of his speech, he communicated through a speech-generating device initially through use of a handheld switch, and eventually by using a single cheek muscle. Hawking achieved commercial success with several works of popular science in which he discussed his theories and cosmology in general. His book A Brief History of Time appeared on the Sunday Times bestseller list for a record-breaking 237 weeks. Hawking was a Fellow of the Royal Society, a lifetime member of the Pontifical Academy of Sciences, and a recipient of the Presidential Medal of Freedom, the highest civilian award in the United States. In 2002, Hawking was ranked number 25 in the BBC's poll of the 100 Greatest Britons. He died on 14 March 2018 at the age of 76, after living with motor neurone disease for more than 50 years.

In [None]:
import textwrap
wrapper = textwrap.TextWrapper(width=70)  # maximum line width for the displayed text
#texto = "Hawking was born in Oxford into a family of doctors. He began his university education at University College, Oxford, in October 1959 at the age of 17, where he received a first-class BA (Hons.) degree in physics. He began his graduate work at Trinity Hall, Cambridge, in October 1962, where he obtained his PhD degree in applied mathematics and theoretical physics, specialising in general relativity and cosmology in March 1966. In 1963, Hawking was diagnosed with an early-onset slow-progressing form of motor neurone disease that gradually paralysed him over the decades.[20][21] After the loss of his speech, he communicated through a speech-generating device initially through use of a handheld switch, and eventually by using a single cheek muscle. Hawking achieved commercial success with several works of popular science in which he discussed his theories and cosmology in general. His book A Brief History of Time appeared on the Sunday Times bestseller list for a record-breaking 237 weeks. Hawking was a Fellow of the Royal Society, a lifetime member of the Pontifical Academy of Sciences, and a recipient of the Presidential Medal of Freedom, the highest civilian award in the United States. In 2002, Hawking was ranked number 25 in the BBC's poll of the 100 Greatest Britons. He died on 14 March 2018 at the age of 76, after living with motor neurone disease for more than 50 years."
texto = "In 1963, Hawking was diagnosed with an early-onset slow-progressing form of motor neurone disease that gradually paralysed him over the decades.[20][21] After the loss of his speech, he communicated through a speech-generating device initially through use of a handheld switch, and eventually by using a single cheek muscle. Hawking achieved commercial success with several works of popular science in which he discussed his theories and cosmology in general. His book A Brief History of Time appeared on the Sunday Times bestseller list for a record-breaking 237 weeks. Hawking was a Fellow of the Royal Society, a lifetime member of the Pontifical Academy of Sciences, and a recipient of the Presidential Medal of Freedom, the highest civilian award in the United States."
print(wrapper.fill(texto))

In [7]:
question = "what disease did Hawking have?"

#question = "How did Hawking talk?"
#question = "What is the title of his best-seller?"

In [8]:
# Apply the tokenizer to the input text, treating them as a text-pair.
input_ids = tokenizer.encode(question, texto)
print('The input has a total of {:} tokens.'.format(len(input_ids)))

The input has a total of 165 tokens.
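BERT accepts at most 512 tokens per input, so 165 tokens fit comfortably. For longer passages, one common approach is to split the context into overlapping windows and query each window separately. The sketch below illustrates the idea with plain Python lists; `split_into_windows`, `max_len`, and `stride` are hypothetical names chosen for illustration and are not part of the transformers API.

```python
def split_into_windows(question_ids, context_ids, max_len=512, stride=128):
    """Split context_ids into overlapping windows so each full input fits in max_len.

    Each input is laid out as [CLS] question [SEP] context_window [SEP],
    so three positions are reserved for the special tokens.
    """
    room = max_len - len(question_ids) - 3
    if room <= 0:
        raise ValueError("question is too long for the chosen max_len")
    if not 0 <= stride < room:
        raise ValueError("stride must be smaller than the window size")
    windows = []
    start = 0
    while True:
        windows.append(context_ids[start:start + room])
        if start + room >= len(context_ids):
            break
        start += room - stride  # consecutive windows share `stride` tokens
    return windows


# Toy example with integers standing in for token IDs.
windows = split_into_windows(list(range(3)), list(range(20)), max_len=14, stride=2)
print([(w[0], w[-1]) for w in windows])
```

The overlap matters because an answer that straddles a window boundary would otherwise be cut in two; with a shared stride, at least one window is likely to contain it whole.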


In [None]:
# BERT only needs the token IDs, but for the purpose of inspecting the 
# tokenizer's behavior, let's also get the token strings and display them.
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# For each token and its id...
for token, token_id in zip(tokens, input_ids):
    
    # If this is the [SEP] token, add some space around it to make it stand out.
    if token_id == tokenizer.sep_token_id:
        print('')
    
    # Print the token string and its ID in two columns.
    print('{:<12} {:>6,}'.format(token, token_id))

    if token_id == tokenizer.sep_token_id:
        print('')
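
Tokens that begin with `##` in the listing are WordPiece continuation pieces: a word that is missing from the vocabulary is split greedily into the longest pieces that are present. The sketch below shows that greedy longest-match-first idea on a tiny made-up vocabulary; `toy_vocab` and `wordpiece_split` are hypothetical and far simpler than the real `BertTokenizer`, so the splits shown are not necessarily the ones the real tokenizer produces.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Made-up vocabulary, for illustration only.
toy_vocab = {"neuro", "##ne", "para", "##lys", "##ed", "motor"}
for word in ["motor", "neurone", "paralysed", "hawking"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```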
    

In [10]:
# Search the input_ids for the first instance of the `[SEP]` token.
sep_index = input_ids.index(tokenizer.sep_token_id)

# The number of segment A tokens includes the [SEP] token itself.
num_seg_a = sep_index + 1

# The remainder are segment B.
num_seg_b = len(input_ids) - num_seg_a

# Construct the list of 0s and 1s.
segment_ids = [0]*num_seg_a + [1]*num_seg_b

# There should be a segment_id for every input token.
assert len(segment_ids) == len(input_ids)
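
The same segment-ID construction can be packaged as a small function and checked on a toy input. This is only a sketch: `build_segment_ids` is a hypothetical helper, the IDs 101 and 102 are assumed to be `[CLS]` and `[SEP]` as in the standard BERT uncased vocabulary, and the remaining IDs are arbitrary placeholders.

```python
def build_segment_ids(input_ids, sep_token_id):
    """Return 0 for every token up to and including the first [SEP], 1 after it."""
    sep_index = input_ids.index(sep_token_id)  # raises ValueError if [SEP] is missing
    num_seg_a = sep_index + 1
    return [0] * num_seg_a + [1] * (len(input_ids) - num_seg_a)


# Toy input laid out as: [CLS] q1 q2 [SEP] c1 c2 c3 [SEP]
toy_ids = [101, 2054, 4295, 102, 7000, 7001, 7002, 102]
toy_segments = build_segment_ids(toy_ids, sep_token_id=102)
print(toy_segments)
```

Calling the tokenizer directly, as in `tokenizer(question, texto)`, should also return a `token_type_ids` entry that can be compared against a list built this way.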

In [None]:
tf.convert_to_tensor([input_ids], np.int32) 

In [12]:
outputs = model(tf.convert_to_tensor([input_ids], dtype=np.int32),
                token_type_ids = tf.convert_to_tensor([segment_ids], dtype=np.int32) , 
                return_dict=True) 

start_scores = outputs.start_logits
end_scores = outputs.end_logits


In [13]:
#start_scores
#end_scores

In [None]:
start_scores   # EagerTensor

In [None]:
start_scores.shape

In [None]:
tokens

In [None]:
#tf.argmax(start_scores, axis=1)[0]

In [None]:
# Find the tokens with the highest `start` and `end` scores.

answer_start = tf.argmax(start_scores, axis=1)
answer_end = tf.argmax(end_scores, axis=1)

answer = ' '.join(tokens[answer_start[0]:answer_end[0]+1])

print('Answer: "' + answer + '"')
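
Taking the two argmax positions independently can, in principle, produce an end position that comes before the start. A common safeguard is to score every span with `start <= end` and a bounded length, then keep the best one. The sketch below does this with NumPy on made-up logits; `best_span` and `max_answer_len` are hypothetical names, not part of the library.

```python
import numpy as np


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    start_logits = np.asarray(start_logits, dtype=float)
    end_logits = np.asarray(end_logits, dtype=float)
    # scores[i, j] is the score of the span from token i to token j.
    scores = start_logits[:, None] + end_logits[None, :]
    # Keep only spans with 0 <= j - i < max_answer_len.
    valid = np.triu(np.ones_like(scores, dtype=bool))
    valid &= ~np.triu(np.ones_like(scores, dtype=bool), k=max_answer_len)
    scores = np.where(valid, scores, -np.inf)
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end)


# Made-up logits where the independent argmaxes disagree (start=3, end=1).
toy_start = [0.1, 2.0, 0.3, 2.5, 0.0]
toy_end = [0.2, 3.0, 0.1, 0.4, 2.0]
print(best_span(toy_start, toy_end))
```

In this notebook the inputs would be the first (and only) row of each score tensor, for example `best_span(start_scores[0].numpy(), end_scores[0].numpy())`.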

In [None]:
# Start with the first token.
answer = tokens[answer_start[0]]

# Select the remaining answer tokens and join them with whitespace.
for i in range(answer_start[0] + 1, answer_end[0] + 1):
    
    # If it's a subword token, then recombine it with the previous token.
    if tokens[i][0:2] == '##':
        answer += tokens[i][2:]
    
    # Otherwise, add a space then the token.
    else:
        answer += ' ' + tokens[i]

print('Answer: "' + answer + '"')
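
The same recombination can be wrapped in a reusable function and checked on a hand-written token list. This is a minimal sketch; `merge_wordpieces` is a hypothetical helper and the toy tokens are made up, not actual tokenizer output.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens, gluing '##' continuation pieces to the previous token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: append without a space
        else:
            words.append(token)  # new word (or a leading '##' piece kept as-is)
    return " ".join(words)


toy_tokens = ["motor", "ne", "##uron", "##e", "disease"]
print(merge_wordpieces(toy_tokens))
```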

Let's look at the score (logit) the model assigns to each token:

In [20]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='darkgrid')

#sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (32,8)

In [21]:
s_scores = start_scores.numpy()
e_scores = end_scores.numpy()

# Append each token's index to its label so that x-axis labels are unique.
token_labels = []
for (i, token) in enumerate(tokens):
    token_labels.append('{:} - {:>2}'.format(token, i))


In [None]:
ax = sns.barplot(x=token_labels, y=np.squeeze(s_scores), ci=None)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
ax.grid(True)
plt.title('Start_Word scores')

plt.show()

In [None]:
ax = sns.barplot(x=token_labels, y=np.squeeze(e_scores), ci=None)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
ax.grid(True)
plt.title('End_Word scores')

plt.show()
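
The bars above are raw logits, which can be negative and do not sum to one. To read them as probabilities over token positions, a softmax can be applied. The sketch below uses NumPy on made-up values; in this notebook the input would be `np.squeeze(s_scores)` or `np.squeeze(e_scores)`.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()


toy_logits = np.array([-5.0, 1.0, 7.5, 0.5])  # made-up scores for four tokens
probs = softmax(toy_logits)
print(probs.round(4), probs.sum())
```

Softmax preserves the ranking of the logits, so the most likely start or end token is the same as the one picked by `argmax`; only the scale changes.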