<a href="https://colab.research.google.com/github/aecoaker/FTA-Summary/blob/master/Exploring_BART_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring BART Models to Find the Best Pre-Trained Option

## Example of word prediction

In [50]:
import random
from transformers import BartForConditionalGeneration, BartTokenizer, BartConfig
# load the pre-trained 'facebook/bart-base' model and its tokeniser
tokeniser = BartTokenizer.from_pretrained('facebook/bart-base')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

In [6]:
text = "There is nothing quite like a sunny day to remind someone of their own mortality."

In [11]:
# use BART to summarise the sentence as a check that everything works
inputs = tokeniser([text], return_tensors='pt')
summary_ids = model.generate(inputs['input_ids'], early_stopping=True)
bart_summaries = tokeniser.decode(summary_ids[0], skip_special_tokens=True)
print(bart_summaries)



There is nothing quite like a sunny day to remind someone of their own mortality. There is also nothing like a sun-soaked beach to remind you that you are not immortal. There are no guarantees in life, but there are some things that can be learned from the sun.


In [32]:
text = "There is nothing quite like a sunny <mask> to remind someone of their own mortality."

In [33]:
input_ids = tokeniser([text], return_tensors="pt")["input_ids"]
logits = model(input_ids).logits

In [47]:
input_ids[0]

tensor([    0,   970,    16,  1085,  1341,   101,    10,  5419, 50264,     7,
         8736,   951,     9,    49,   308, 15812,     4,     2])

In [73]:
text = "<mask>"
input_ids = tokeniser([text], return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
input_ids[0]

tensor([    0, 50264,     2])

In [75]:
# overwrite the <mask> token at position 1 with id 2 (the </s> token)
input_ids[0][1] = 2

In [76]:
input_ids

tensor([[0, 2, 2]])

In [39]:
# find the position of the single <mask> token, then take the five most likely tokens for it
masked_index = (input_ids[0] == tokeniser.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)

In [40]:
tokeniser.decode(predictions).split()

['day,', 'morning', 'moment', 'afternoon']
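
The list above has four entries even though `topk(5)` returns five token ids. A likely cause, assuming one of the five predictions is a punctuation token with no leading space, is that `decode` joins all five tokens into one string before `split()` breaks it on whitespace, so `day` and `,` merge into `day,`. The sketch below reproduces the effect with plain strings; the token strings are assumptions for illustration, not values read from the tokeniser.

```python
# Hypothetical decoded tokens for the five predicted ids. BART's byte-level BPE
# marks the start of a word with a leading space, which is mimicked here.
assumed_tokens = [" day", ",", " morning", " moment", " afternoon"]

# Decoding all ids in one call behaves roughly like concatenating the token strings,
# so splitting on whitespace afterwards loses the boundary between "day" and ",".
joined = "".join(assumed_tokens)
merged = joined.split()

# Handling each token on its own keeps one entry per prediction.
separate = [token.strip() for token in assumed_tokens]

print(merged)    # ['day,', 'morning', 'moment', 'afternoon']
print(separate)  # ['day', ',', 'morning', 'moment', 'afternoon']
```

In the notebook this would correspond to decoding each id on its own, for example `[tokeniser.decode([p]).strip() for p in predictions.tolist()]`, so that every entry lines up with its probability in `values`.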

In [41]:
values

tensor([0.2780, 0.1116, 0.0756, 0.0548, 0.0384], grad_fn=<TopkBackward0>)
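
These probabilities come from applying a softmax to the logits at the masked position and keeping the five largest entries. A minimal pure-Python sketch of those two steps follows, using made-up logits over a tiny vocabulary; the helper names `softmax` and `top_k` are hypothetical and are not part of PyTorch or transformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (probabilities, indices) of the k largest entries, highest first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] for i in order], order

# Made-up logits over a six-token vocabulary (illustrative only).
toy_logits = [2.0, 0.5, 1.0, -1.0, 3.0, 0.0]
toy_probs = softmax(toy_logits)
top_probs, top_indices = top_k(toy_probs, 3)

print(top_indices)  # [4, 0, 2]
```

The indices play the role of `predictions` (token ids) and the probabilities play the role of `values`.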

## Writing this into a function that can be used for assessment

In [42]:
def is_pred_good(text, model_name = 'facebook/bart-base'):
  #read in chosen model and its tokeniser
  tokeniser = BartTokenizer.from_pretrained(model_name)
  model = BartForConditionalGeneration.from_pretrained(model_name)
  #tokenise text; the plan is to sample about 10% of the positions and turn them into masks
  input_ids = tokeniser([text], return_tensors="pt")["input_ids"]
  n = len(input_ids[0])
  n_masks = int(n/10)
  #sample only from positions between <s> and </s> so special tokens are never chosen
  masks_sample = random.sample(range(1, n - 1), n_masks)
  #TODO: replace the sampled positions with tokeniser.mask_token_id;
  #until then the text must already contain exactly one <mask>

  #predict the top five tokens for the masked position
  logits = model(input_ids).logits
  masked_index = (input_ids[0] == tokeniser.mask_token_id).nonzero().item()
  probs = logits[0, masked_index].softmax(dim=0)
  values, predictions = probs.topk(5)
  return tokeniser.decode(predictions).split()

In [43]:
is_pred_good(text)

['day,', 'morning', 'moment', 'afternoon']
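
The sampling step described in the function's comments (pick about one token in ten and turn it into a mask) can be sketched on a plain list of ids, without loading a model. This is one possible approach under stated assumptions, not the notebook's final method: the helper name `mask_random_positions` and its `one_in` and `seed` parameters are hypothetical, and the mask id 50264 is taken from the tokeniser output shown earlier.

```python
import random

MASK_ID = 50264  # <mask> id as it appears in the tokenised output earlier in the notebook

def mask_random_positions(token_ids, one_in=10, mask_id=MASK_ID, seed=None):
    """Return a copy of token_ids with len(token_ids) // one_in inner tokens masked,
    plus the sorted list of masked positions. The first and last ids
    (<s> and </s>) are never masked."""
    rng = random.Random(seed)
    inner = range(1, len(token_ids) - 1)
    n_masks = min(len(inner), len(token_ids) // one_in)
    positions = sorted(rng.sample(inner, n_masks))
    masked = list(token_ids)  # copy so the caller's list is left untouched
    for pos in positions:
        masked[pos] = mask_id
    return masked, positions

# Token ids of the example sentence from earlier (position 8 is already a <mask>).
sentence_ids = [0, 970, 16, 1085, 1341, 101, 10, 5419, 50264, 7,
                8736, 951, 9, 49, 308, 15812, 4, 2]

masked_ids, positions = mask_random_positions(sentence_ids, seed=0)
print(positions)
print(masked_ids)
```

With PyTorch tensors the same idea would be `input_ids[0, positions] = tokeniser.mask_token_id`. Each masked position then needs its own `topk` call, because `.nonzero().item()` only works when exactly one mask is present.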

In [49]:
int(5.9)

5