<a href="https://colab.research.google.com/github/cagBRT/promptEngineering/blob/main/Encoder_Decoder_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook gives examples of implementing encoder-decoder models with the Hugging Face `transformers` library.

In [None]:
# Clone the entire repo.
!git clone https://github.com/cagBRT/promptEngineering.git cloned-repo
%cd cloned-repo

In [None]:
! pip install transformers

In [None]:
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

**Configure the encoder and decoder**

In [None]:
config_encoder = BertConfig()
config_decoder = BertConfig()

config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
model = EncoderDecoderModel(config=config)

In [None]:
from transformers import EncoderDecoderModel, BertTokenizer

**Select a tokenizer**<br><br>
**BertTokenizer**:
BERT uses a WordPiece tokenizer. A word that is in the vocabulary becomes a single token; any other word is broken into several word pieces, with each continuation piece marked by a `##` prefix. This is useful when a word appears in multiple forms, because the forms can share pieces instead of each needing its own vocabulary entry.<br><br>
BERT accepts at most 512 tokens per input, because its position embeddings were only trained for sequences up to that length. Longer documents have to be truncated or split before they are passed to the model.
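
The word-piece splitting described above can be sketched in a few lines. This is a minimal illustration of greedy longest-match-first splitting, assuming a tiny made-up vocabulary (`toy_vocab` is hypothetical and is not BERT's real vocabulary); the real `BertTokenizer` also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into word pieces by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the vocabulary shipped with bert-base-uncased.
toy_vocab = {"play", "##ing", "##ed", "##er", "walk", "tower"}

for w in ["play", "playing", "played", "walker", "zebra"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, "play", "playing" and "played" all share the `play` piece, which is the benefit of word pieces for words that occur in several forms.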

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

In [None]:
from transformers import AutoTokenizer, EncoderDecoderModel

**patrickvonplaten/bert2bert_cnn_daily_mail**<br><br>
This model is a warm-started BERT2BERT model fine-tuned on the CNN/Dailymail summarization dataset.

In [None]:
# load a fine-tuned seq2seq model and corresponding tokenizer
model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail")
tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/bert2bert_cnn_daily_mail")

In [None]:
# let's perform inference on a long piece of text
ARTICLE_TO_SUMMARIZE = (
    "PG&E stated it scheduled the blackouts in response to forecasts for high winds "
    "amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were "
    "scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."
)
input_ids = tokenizer(ARTICLE_TO_SUMMARIZE, return_tensors="pt").input_ids

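**Autoregressive models** predict the next value in a sequence from the values that came before it. <br><br>
For example, they are widely used in technical analysis to forecast future security prices. In text generation the decoder works the same way: it produces the summary one token at a time, and each new token is predicted from the tokens generated so far together with the encoder's output.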
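
To make the idea concrete, here is a minimal sketch of greedy autoregressive decoding. It assumes a hypothetical lookup table (`NEXT_TOKEN_SCORES`) in place of a real model, and it conditions only on the last token, whereas a real decoder conditions on all previous tokens and the encoder output.

```python
# Hypothetical next-token scores: for each previous token, a score per candidate.
# A real model computes these scores with a neural network at every step.
NEXT_TOKEN_SCORES = {
    "[CLS]": {"the": 0.9, "tower": 0.1},
    "the": {"tower": 0.7, "the": 0.1, "[SEP]": 0.2},
    "tower": {"is": 0.6, "[SEP]": 0.4},
    "is": {"tall": 0.8, "[SEP]": 0.2},
    "tall": {"[SEP]": 1.0},
}


def greedy_generate(start_token="[CLS]", end_token="[SEP]", max_new_tokens=10):
    """Repeatedly append the highest-scoring next token until the end token appears."""
    sequence = [start_token]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[sequence[-1]]
        next_token = max(scores, key=scores.get)  # greedy choice: take the top score
        sequence.append(next_token)
        if next_token == end_token:
            break
    return sequence


print(greedy_generate())
```

Each step feeds the model's own previous output back in as input, which is what "autoregressive" means here.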

In [None]:
# autoregressively generate summary (uses greedy decoding by default)
generated_ids = model.generate(input_ids)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

The model has now summarized the text.<br>
Look at the summary the model generated. Would you consider this a good summary?




---



In [None]:
print(generated_text)

In [None]:
# let's perform inference on a long piece of text
# Adjacent string literals are concatenated, so each piece must end with a space.
ARTICLE_TO_SUMMARIZE = (
    "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, "
    "and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on "
    "each side. During its construction, the Eiffel Tower surpassed the Washington Monument to "
    "become the tallest man-made structure in the world, a title it held for 41 years until the "
    "Chrysler Building in New York City was finished in 1930. It was the first structure to reach "
    "a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower "
    "in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). "
    "Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France "
    "after the Millau Viaduct."
)
input_ids = tokenizer(ARTICLE_TO_SUMMARIZE, return_tensors="pt").input_ids

In [None]:
# autoregressively generate summary (uses greedy decoding by default)
generated_ids = model.generate(input_ids)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

**What do you think of this summary?**<br>
Does this example illustrate how these transformer models can make mistakes?

In [None]:
print(generated_text)

**Assignment:** <br>
Find a block of text and see how the model summarizes it. <br>
Is there any pattern to the text segments the model selects for the summary?
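
One possible way to explore this question is to measure how much of each source sentence reappears in the summary. The sketch below is only a rough heuristic, and the helper names (`words`, `sentence_coverage`) and the toy inputs are hypothetical; it reports, for each article sentence, the fraction of its words that also occur in the summary.

```python
import re


def words(text):
    """Lowercase the text and return its alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_coverage(article, summary):
    """Return (sentence index, fraction of that sentence's words found in the summary)."""
    summary_words = set(words(summary))
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    report = []
    for index, sentence in enumerate(sentences):
        sentence_words = words(sentence)
        if not sentence_words:
            continue
        shared = sum(1 for w in sentence_words if w in summary_words)
        report.append((index, round(shared / len(sentence_words), 2)))
    return report


# Hypothetical toy inputs; replace them with your own text and `generated_text`.
toy_article = "The cat sat on the mat. It rained all day. The dog barked loudly."
toy_summary = "the cat sat on the mat."
print(sentence_coverage(toy_article, toy_summary))
```

High coverage concentrated in the first few sentence indices would suggest that the summary draws mostly on the start of the text.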



---



In [None]:
# a workaround to load from pytorch checkpoint
from transformers import EncoderDecoderModel, TFEncoderDecoderModel

_model = EncoderDecoderModel.from_pretrained("patrickvonplaten/bert2bert-cnn_dailymail-fp16")

_model.encoder.save_pretrained("./encoder")
_model.decoder.save_pretrained("./decoder")

model = TFEncoderDecoderModel.from_encoder_decoder_pretrained(
    "./encoder", "./decoder", encoder_from_pt=True, decoder_from_pt=True
)
# This is only for copying some specific attributes of this particular model.
model.config = _model.config

In [None]:
from transformers import BertTokenizer, EncoderDecoderModel

**Select the tokenizer**<br>
**And configure the model**

In [None]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
input_ids = tokenizer(
    "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building,"
    " and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. "
    "During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest "
    "man-made structure in the world, a title it held for 41 years until the Chrysler Building in "
    "New York City was finished in 1930. It was the first structure to reach a height of 300 metres. "
    "Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than "
    "the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second"
    " tallest free-standing structure in France after the Millau Viaduct.",
    return_tensors="pt",
).input_ids

Once the model is created, it can be fine-tuned in the same way as BART, T5, or any other encoder-decoder model. <br><br>
**Only 2 inputs are required for the model in order to compute a loss**:
- input_ids (which are the input_ids of the encoded input sequence)
- labels (which are the input_ids of the encoded target sequence).
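
Only these two inputs are needed because the decoder's own inputs can be derived from the labels by shifting them one position to the right. The sketch below illustrates that idea on plain Python lists; it is a simplified stand-in rather than the library's code, and the token ids are illustrative values (101, 102 and 0 are assumed here to stand for `[CLS]`, `[SEP]` and `[PAD]`).

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    """Build decoder inputs from labels: prepend the start token and drop the last label."""
    shifted = [decoder_start_token_id] + labels[:-1]
    # Label positions set to -100 are ignored by the loss; as inputs they become padding.
    return [pad_token_id if token == -100 else token for token in shifted]


# Illustrative ids only: 101 = [CLS], 102 = [SEP], 0 = [PAD], -100 = ignored position.
toy_labels = [101, 7, 8, 102, -100, -100]
toy_decoder_input_ids = shift_tokens_right(
    toy_labels, pad_token_id=0, decoder_start_token_id=101
)
print(toy_decoder_input_ids)
```

At position `i` the decoder sees the tokens before `toy_labels[i]` and is trained to predict `toy_labels[i]`, which is why `decoder_start_token_id` and `pad_token_id` have to be set on the model config before computing a loss.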

In [None]:
labels = tokenizer(
    "the eiffel tower surpassed the washington monument to become the tallest structure in the world. it was the first structure to reach a height of 300 metres in paris in 1930. it is now taller than the chrysler building by 5. 2 metres ( 17 ft ) and is the second tallest free - standing structure in paris.",
    return_tensors="pt",
).input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss

A loss function guides the training algorithm to update the model's parameters in the right direction. Put simply, a loss function takes the ground truth (y) and a prediction (ŷ) as input and returns a single real number. That number indicates how far the prediction is from the truth: the lower the loss, the closer the prediction.
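
As a worked example, sequence-to-sequence models are typically trained with a cross-entropy loss: the average negative log-probability that the model assigns to each correct target token. The sketch below computes it by hand, assuming made-up probabilities over a three-token vocabulary (all values are hypothetical).

```python
import math


def cross_entropy(predicted_probs, target_ids):
    """Mean negative log-probability assigned to the correct token at each position."""
    losses = [
        -math.log(probs[target]) for probs, target in zip(predicted_probs, target_ids)
    ]
    return sum(losses) / len(losses)


# Hypothetical predictions: one probability distribution per target position.
toy_predicted_probs = [
    [0.7, 0.2, 0.1],  # position 0: the correct token (id 0) gets probability 0.7
    [0.1, 0.1, 0.8],  # position 1: the correct token (id 2) gets probability 0.8
]
toy_target_ids = [0, 2]

toy_loss = cross_entropy(toy_predicted_probs, toy_target_ids)
print(round(toy_loss, 4))
```

The loss shrinks toward 0 as the probabilities of the correct tokens approach 1, and grows as the model puts probability on the wrong tokens.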

In [None]:
loss

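Rule-of-thumb ranges sometimes quoted for log loss values:<br>

- 0.00: Perfect probabilities<br>
- < 0.02: Great probabilities<br>
- < 0.05: In a good way<br>
- < 0.20: Great<br>
- &gt; 0.30: Not great<br>
- 1.00: Hell<br>
- &gt; 2.00: Something is not working<br>

These ranges are a rough guide for classifiers with a few classes. The loss printed above is averaged over a vocabulary of about 30,000 tokens, so values well above 2.00 are expected before fine-tuning: a model that guessed uniformly over BERT's 30,522-token vocabulary would score about ln(30522) ≈ 10.3.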
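
If these values are read as a log loss (negative log-likelihood), each threshold maps to the probability the model assigns to the correct answer through p = exp(-loss). The short sketch below does that conversion; the helper name `loss_to_probability` is introduced here only for illustration.

```python
import math


def loss_to_probability(loss):
    """Probability of the correct answer implied by a negative log-likelihood loss."""
    return math.exp(-loss)


for threshold in [0.0, 0.02, 0.05, 0.20, 0.30, 1.00, 2.00]:
    print(f"loss {threshold:.2f} -> probability {loss_to_probability(threshold):.3f}")
```

For instance, a loss of 1.00 corresponds to giving the correct answer a probability of roughly 0.37.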