## **Loading a pretrained modal from modal(facebook/bart-large-cnn) hub and predicting the results**

In [1]:
!pip install transformers # installing the tranformers library from hugging face



In [2]:
from transformers import pipeline # import the pipeline to create a modal instance from the pretrained modal
model_name="facebook/bart-large-cnn" # name of the selected modal
summarizer = pipeline("summarization", model=model_name) # creating an instance of the pretrained modal, later can be used for summarization tasks

In [3]:
# taking an example article from 'cnn_dailymail' dataset available at https://huggingface.co/datasets/cnn_dailymail
article="""
QUEBEC, Canada -- Third seed Julia Vakulenko will face comeback queen Lindsay Davenport in her first WTA Tour final at the Bell Challenge on Sunday. Julia Vakulenko will seek her first victory on the WTA Tour at the Bell Challenge in Quebec. The Ukrainian battled through with a 6-1 4-6 7-5 victory over American qualifier Julie Ditty in the semifinals. The 24-year-old, who reached the fourth round of the U.S. Open, had previously twice lost at the last-four stage this year in Las Vegas and Berlin. She reached a career high of 33rd in the world rankings back in May, but is now 36th. "Sometimes you play your best and win easy, but sometimes you don't play your best and really have to fight hard," said Vakulenko, who squandered points for 5-3 leads in both the second and third sets. "I'm just going to try my best -- I've never played her and I'm looking forward to it." Former world No. 1 Davenport is seeking her second win in three tournaments since returning from a one-year hiatus to have a baby. The 31-year-old, who is unseeded after accepting a wild-card to enter the Canadian tournament for the first time, also had to battle to beat Russian second seed Vera Zvonareva 6-2 6-7 (3-7) 6-3 in the semifinals. The three-time Grand Slam winner has surged back up the rankings from 234th to 126th after winning her comeback tournament in Bali and then reaching the last four in Beijing. The American has now beaten Zvonareva in all six encounters between the two players. "I played well in the first set and had some chances early in the second set, but I didn't quite capitalize on them. I was able to come back but at 4-4 and 5-5 I just didn't return well enough," Davenport said. "I was happy I was able to regroup in the third set. Physically I feel good. There are lots of positives I can take from it, especially beating a really good player and now being in the final. "I want to be the one on the offensive and not the defensive, and that's what I'm going to try to do. "I was trying to watch the first semifinal and see if that helped, but I play so much differently than Julie Ditty that it was hard to get anything from it." E-mail to a friend .
"""
# Now passing it to the summarizer to get the summary of the given article
summarizer(article)

[{'summary_text': 'Julia Vakulenko will face Lindsay Davenport in the final of the Bell Challenge. The Ukrainian third seed beat American qualifier Julie Ditty 6-1 4-6 7-5. Former world No. 1 Daven Port beat Russian second seed Vera Zvonareva in straight sets.'}]

## **Now instead of using the pipeline, we will manually load the modal then clean the data(tokenize) and pass it to the modal to see the results**


In [4]:
model_name="facebook/bart-large-cnn"
# importing AutoTokenizer for loading the tokenizer of the modal and TFAutoModelForSeq2SeqLM(tensorflow) to load the pretrained modal
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM
# tf_model contains the downloaded pretrained model
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_name) # model is loaded into the tf_model variable
# tokenizer is the tokenizer used for the modal
tokenizer = AutoTokenizer.from_pretrained(model_name) # tokenizer is loaded into the tokenizer variable

All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-large-cnn.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [5]:
article="""
QUEBEC, Canada -- Third seed Julia Vakulenko will face comeback queen Lindsay Davenport in her first WTA Tour final at the Bell Challenge on Sunday. Julia Vakulenko will seek her first victory on the WTA Tour at the Bell Challenge in Quebec. The Ukrainian battled through with a 6-1 4-6 7-5 victory over American qualifier Julie Ditty in the semifinals. The 24-year-old, who reached the fourth round of the U.S. Open, had previously twice lost at the last-four stage this year in Las Vegas and Berlin. She reached a career high of 33rd in the world rankings back in May, but is now 36th. "Sometimes you play your best and win easy, but sometimes you don't play your best and really have to fight hard," said Vakulenko, who squandered points for 5-3 leads in both the second and third sets. "I'm just going to try my best -- I've never played her and I'm looking forward to it." Former world No. 1 Davenport is seeking her second win in three tournaments since returning from a one-year hiatus to have a baby. The 31-year-old, who is unseeded after accepting a wild-card to enter the Canadian tournament for the first time, also had to battle to beat Russian second seed Vera Zvonareva 6-2 6-7 (3-7) 6-3 in the semifinals. The three-time Grand Slam winner has surged back up the rankings from 234th to 126th after winning her comeback tournament in Bali and then reaching the last four in Beijing. The American has now beaten Zvonareva in all six encounters between the two players. "I played well in the first set and had some chances early in the second set, but I didn't quite capitalize on them. I was able to come back but at 4-4 and 5-5 I just didn't return well enough," Davenport said. "I was happy I was able to regroup in the third set. Physically I feel good. There are lots of positives I can take from it, especially beating a really good player and now being in the final. "I want to be the one on the offensive and not the defensive, and that's what I'm going to try to do. "I was trying to watch the first semifinal and see if that helped, but I play so much differently than Julie Ditty that it was hard to get anything from it." E-mail to a friend .
"""

In [7]:
input_data = tokenizer(
    article,
    padding=True,
    truncation=True,
    return_tensors="tf"
) # here we are tokenizing the article

In [8]:
input_data

{'input_ids': <tf.Tensor: shape=(1, 532), dtype=int32, numpy=
array([[    0, 50118,  1864,  9162,   387,  3586,     6,   896,   480,
         7470,  5018, 11450,   468,   677,   922, 19536,    40,   652,
         7115, 12133, 13853,   211, 10570,  3427,    11,    69,    78,
          305,  3847,  3637,   507,    23,     5,  3043, 10045,    15,
          395,     4, 11450,   468,   677,   922, 19536,    40,  2639,
           69,    78,  1124,    15,     5,   305,  3847,  3637,    23,
            5,  3043, 10045,    11,  7534,     4,    20,  9302, 12248,
          149,    19,    10,   231,    12,   134,   204,    12,   401,
          262,    12,   245,  1124,    81,   470, 18008,  9786,   211,
        18308,    11,     5, 12477,     4,    20,   706,    12,   180,
           12,   279,     6,    54,  1348,     5,   887,  1062,     9,
            5,   121,     4,   104,     4,  2117,     6,    56,  1433,
         2330,   685,    23,     5,    94,    12, 10231,  1289,    42,
           76, 

In [9]:
for key, value in input_data.items():
    print(f"{key}: {value.numpy().tolist()}") # this contains the input_ids where the article is encoded into a vector and attention masks

input_ids: [[0, 50118, 1864, 9162, 387, 3586, 6, 896, 480, 7470, 5018, 11450, 468, 677, 922, 19536, 40, 652, 7115, 12133, 13853, 211, 10570, 3427, 11, 69, 78, 305, 3847, 3637, 507, 23, 5, 3043, 10045, 15, 395, 4, 11450, 468, 677, 922, 19536, 40, 2639, 69, 78, 1124, 15, 5, 305, 3847, 3637, 23, 5, 3043, 10045, 11, 7534, 4, 20, 9302, 12248, 149, 19, 10, 231, 12, 134, 204, 12, 401, 262, 12, 245, 1124, 81, 470, 18008, 9786, 211, 18308, 11, 5, 12477, 4, 20, 706, 12, 180, 12, 279, 6, 54, 1348, 5, 887, 1062, 9, 5, 121, 4, 104, 4, 2117, 6, 56, 1433, 2330, 685, 23, 5, 94, 12, 10231, 1289, 42, 76, 11, 2588, 2461, 8, 5459, 4, 264, 1348, 10, 756, 239, 9, 2357, 2586, 11, 5, 232, 8359, 124, 11, 392, 6, 53, 16, 122, 2491, 212, 4, 22, 13624, 47, 310, 110, 275, 8, 339, 1365, 6, 53, 2128, 47, 218, 75, 310, 110, 275, 8, 269, 33, 7, 1032, 543, 60, 26, 468, 677, 922, 19536, 6, 54, 9316, 463, 3215, 332, 13, 195, 12, 246, 3315, 11, 258, 5, 200, 8, 371, 3880, 4, 22, 100, 437, 95, 164, 7, 860, 127, 275, 480, 38

In [10]:
output_summary = model(input_data) # now we use the modal to guess the summary for the article

In [11]:
output_summary  # We can decode this with the tokenizer to get the final output in string format

TFSeq2SeqLMOutput([('logits',
                    <tf.Tensor: shape=(1, 532, 50264), dtype=float32, numpy=
                    array([[[11.800372  ,  1.0455639 ,  2.4607432 , ...,  1.2767552 ,
                              0.9321272 ,  1.1559694 ],
                            [11.800371  ,  1.0455627 ,  2.4607446 , ...,  1.276756  ,
                              0.9321285 ,  1.1559696 ],
                            [-3.5929978 ,  0.44992352,  3.792599  , ...,  0.31393057,
                              0.3892161 ,  0.6892278 ],
                            ...,
                            [ 4.0420446 ,  0.42996106,  9.707426  , ...,  0.25233856,
                              0.65427035,  0.41754112],
                            [-0.80224735,  0.472819  , 13.378819  , ...,  1.1586044 ,
                              1.3399727 ,  1.2714485 ],
                            [-0.54361945,  0.01990198, 10.844031  , ..., -0.02331384,
                              0.279344  ,  0.5004392 ]]], dtype=

## **This time we will load a dataset and train the modal on the dataset to get a custom model. We can tune parameters to get the best accuracy**

In [16]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
[K     |████████████████████████████████| 4.0 MB 5.3 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 41.9 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 41.1 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 45.9 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed pyyaml-6.0 sacremoses-0.0.49 tokenizers-0.12.1 

In [1]:
!pip install datasets # library to load hugging face datasets

Collecting datasets
  Downloading datasets-2.1.0-py3-none-any.whl (325 kB)
[K     |████████████████████████████████| 325 kB 5.5 MB/s 
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
[K     |████████████████████████████████| 212 kB 38.5 MB/s 
[?25hCollecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
[K     |████████████████████████████████| 77 kB 6.2 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.3.0-py3-none-any.whl (136 kB)
[K     |████████████████████████████████| 136 kB 47.0 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 32.8 MB/s 
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25

In [6]:
from datasets import load_dataset
cnn_dailymail_dataset = load_dataset("cnn_dailymail",'3.0.0') # sownload cnn_dailymail dataset @ https://huggingface.co/datasets/cnn_dailymail

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/3cb851bf7cf5826e45d49db2863f627cba583cbc32342df7349dfe6c38060234)


  0%|          | 0/3 [00:00<?, ?it/s]

In [3]:
# cnn_dailymail_dataset["train"] # training dataset
# cnn_dailymail_dataset["validation"] # validation dataset
# cnn_dailymail_dataset["test"] #testing dataset

In [7]:
# check if they are imported withour errors
print(cnn_dailymail_dataset["train"][0])
print(cnn_dailymail_dataset["validation"][0])
print(cnn_dailymail_dataset["test"][0])

{'article': 'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It\'s a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but bec

In [14]:
model_name="facebook/bart-large-cnn" # choosen madel

In [17]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) # download and load the pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name) #load the tokenizer related to the model


Downloading:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.51G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

In [18]:
#since this model is capable of multiple NLP tasks require prompting for specific tasks.
prefix = "summarize: " # we define a prefix and attach it to every article in training data
def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["highlights"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [19]:
tokenized_cnn_dailymail = cnn_dailymail_dataset["train"].map(preprocess_function, batched=True) # we applly the function we defined to the training dataset

  0%|          | 0/288 [00:00<?, ?ba/s]

In [20]:
tokenized_cnn_dailymail_validation = cnn_dailymail_dataset["validation"].map(preprocess_function, batched=True) #for validation data

  0%|          | 0/14 [00:00<?, ?ba/s]

In [21]:
tokenized_cnn_dailymail_test = cnn_dailymail_dataset["test"].map(preprocess_function, batched=True) # for test data

  0%|          | 0/12 [00:00<?, ?ba/s]

In [22]:
# a helper class to dynamically pad the data dynamically
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_name)


In [23]:
type(tokenized_cnn_dailymail)

datasets.arrow_dataset.Dataset

In [27]:
# import torch
# torch.cuda.empty_cache()
# import gc
# del variables
# gc.collect()
# torch.cuda.memory_summary(device=None, abbreviated=False)



### **Not able to train on the full dataset as the dataset is large and also cuda memory is not sufficient for higher batch sizes, for now only supporting batchsize = 1**

In [29]:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_cnn_dailymail,
    eval_dataset=tokenized_cnn_dailymail_validation,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BartForConditionalGeneration.forward` and have been ignored: id, article, highlights. If id, article, highlights are not expected by `BartForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 287113
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 287113


Epoch,Training Loss,Validation Loss


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json


KeyboardInterrupt: ignored

### **While training we can see the model performance by analyzing the validation loss and validation accuracies. So that we can tune the hyper-parameters of the modal to get better accuracies.**

### **After training we have to test the modal on unseen dataset (test dataset) to evaluate the performance.**

In [None]:
trainer.evaluate(cnn_dailymail_dataset["test"])  # we can evaluate the results on the testing dataset

In [None]:
new_data = "some article data after tokenization "

In [None]:
out = trainer.predict(new_data)  # this gives the summary vector

## References:

https://huggingface.co/docs/datasets/v1.12.1/load_hub.html --> datasets

https://huggingface.co/docs/transformers/tasks/summarization -->summarization @ transormers

https://huggingface.co/datasets/cnn_dailymail --> chosen dataset

https://huggingface.co/facebook/bart-large-cnn -->chosen model