To save in google drive

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


**Packages**

In [21]:
! pip install datasets --quiet
! pip install evaluate --quiet
! pip install rouge_score --quiet
! pip install sacrebleu
! pip install transformers --quiet
! pip install -q sentencepiece --quiet
! pip install summarizer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.9/118.9 KB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting portalocker
  Downloading portalocker-2.7.0-py2.py3-none-any.whl (15 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-2.7.0 sacrebleu-2.3.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from datasets import load_dataset
import evaluate

from pprint import pprint

**Necessary Functions**

In [24]:
rouge = evaluate.load('rouge')

In [25]:
chrf = evaluate.load("chrf")

**Data**

In [5]:
dataset = load_dataset("csebuetnlp/xlsum", "english")

Downloading builder script:   0%|          | 0.00/4.55k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/14.6k [00:00<?, ?B/s]

Downloading and preparing dataset xlsum/english to /root/.cache/huggingface/datasets/csebuetnlp___xlsum/english/2.0.0/518ab0af76048660bcc2240ca6e8692a977c80e384ffb18fdddebaca6daebdce...


Downloading data:   0%|          | 0.00/282M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset xlsum downloaded and prepared to /root/.cache/huggingface/datasets/csebuetnlp___xlsum/english/2.0.0/518ab0af76048660bcc2240ca6e8692a977c80e384ffb18fdddebaca6daebdce. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
print(len(dataset['train']))
print(len(dataset['validation']))

306522
11535


In [7]:
# EDA
dataset['train'][1]

{'id': 'uk-scotland-highlands-islands-11069985',
 'url': 'https://www.bbc.com/news/uk-scotland-highlands-islands-11069985',
 'title': 'Huge tidal turbine installed at Orkney test site',
 'summary': 'The massive tidal turbine AK1000 has been installed in 35m (114.8ft) of water at a test site in Orkney.',
 'text': 'Atlantis Resources unveiled the marine energy device at Invergordon ahead of it being shipped to Kirkwall. Trials on the device will now be run at the European Marine Energy Centre test site off Eday. The device stands 22.5m (73ft) tall, weighs 1,300 tonnes and has two sets of blades on a single unit. It could generate enough power for 1,000 homes.'}

In [8]:
def get_df(ds):

  title = []
  article = []
  summary = []

  prefix = 'summarize: '
  for data in ds:
      title.append(data['title']) 
      article.append(prefix+data['text'])
      summary.append(data['summary'])
  d = {'title': title, 'article': article, 'summary': summary}
  df = pd.DataFrame(data=d)
  return df

In [39]:
train_df = get_df(dataset['train'])
val_df = get_df(dataset['validation'])
test_df = get_df(dataset['test'])

train_df.head(5)

Unnamed: 0,title,article,summary
0,Weather alert issued for gale force winds in W...,summarize: The Met Office has issued a yellow ...,Winds could reach gale force in Wales with sto...
1,Huge tidal turbine installed at Orkney test site,summarize: Atlantis Resources unveiled the mar...,The massive tidal turbine AK1000 has been inst...
2,Leeds stabbing: Man attacked outside betting shop,summarize: Police were called to the scene out...,A man has been stabbed in broad daylight outsi...
3,Could killing of Iranian general help Trump ge...,summarize: Anthony ZurcherNorth America report...,It was inevitable that the fallout from the US...
4,Coronavirus: 'I've moved out to protect my fam...,summarize: By Debbie JacksonBBC Scotland But w...,Week four of social distancing is starting to ...


**T5 Model**

##### 1.) Load and set up model

In [11]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

t5model = TFT5ForConditionalGeneration.from_pretrained("t5-base")
t5tokenizer = T5Tokenizer.from_pretrained("t5-base")

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

Downloading tf_model.h5:   0%|          | 0.00/892M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


Downloading (…)neration_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [17]:
## Set up model params

max_length = 32
batch_size = 16

In [12]:
def preprocess_data(text_pairs, tokenizer, model, max_length=128):
    orig_text = text_pairs[0]
    orig_encoded = tokenizer.batch_encode_plus(
        orig_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='tf'
    )

    orig_input_ids = np.array(orig_encoded["input_ids"], dtype="int32")
    orig_attention_masks = np.array(orig_encoded["attention_mask"], dtype="int32")
    
    target_text = text_pairs[0]
    target_encoded = tokenizer.batch_encode_plus(
        target_text,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_tensors='tf'
    )

    label_ids = np.array(target_encoded['input_ids'])
    decoder_input_ids = model._shift_right(label_ids)
    
    return [orig_input_ids, orig_attention_masks, decoder_input_ids], label_ids

In [13]:
import tensorflow as tf

class SummarizationDataGenerator(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 model,
                 n_examples,
                 dataframe,
                 max_length=128,
                 batch_size=16,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.model = model
        self.n_examples = n_examples
        self.dataframe = dataframe
        self.max_length = max_length
        self.batch_size = batch_size
        self.shuffle = shuffle
        
        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()
    
    def __len__(self):
        # Return the number of batches in the full dataset
        return self.n_examples // self.batch_size
    
    def __getitem__(self, idx):
        batch_start = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size

        # Indices to skip are the ones in the shuffled row_order before and
        # after the chunk we'll use for this batch
        batch_idx_skip = self.row_order[:batch_start] + self.row_order[batch_end:]
        
        text_pairs = self.dataframe[['article', 'summary']].values.astype(str).tolist()
        
        batch_data = preprocess_data(
            text_pairs,
            self.tokenizer,
            self.model,
            self.max_length
        )

        return batch_data
    
    def __call__(self):
        for i in range(self.__len__()):
            yield self.__getitem__(i)
            
            if i == self.__len__()-1:
                self.on_epoch_end()
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

In [15]:
from tensorflow.keras import layers

def build_t5_training_wrapper_model(t5_model, max_length):
    input_ids = layers.Input(shape=(max_length), dtype=tf.int32, name='input_ids')
    attention_mask = layers.Input(shape=(max_length), dtype=tf.int32, name='attention_mask')
    decoder_input_ids = layers.Input(shape=(max_length), dtype=tf.int32, name='labels')
    
    t5_logits = t5_model(input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids)[0]

    model = tf.keras.models.Model(inputs=[input_ids, attention_mask, decoder_input_ids],
                                  outputs=[t5_logits])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    return model

##### 2.) Train model

In [16]:
model_wrapper = build_t5_training_wrapper_model(t5model, max_length)

In [None]:
train_df = train_df.sample(n=1000)

train_data_generator = SummarizationDataGenerator(
    tokenizer=t5tokenizer,
    model=t5model,
    n_examples=train_df.shape[0],
    dataframe=train_df,
    max_length=max_length,
    batch_size=batch_size
)

valid_data_generator = SummarizationDataGenerator(
    tokenizer=t5tokenizer,
    model=t5model,
    n_examples=val_df.shape[0],
    dataframe=val_df,
    max_length=max_length,
    batch_size=batch_size
)

In [35]:
checkpoint_dir = '/content/drive/MyDrive/W266FinalProject/model_checkpoints/'
checkpoint_filepath = checkpoint_dir + 't5_weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True)

In [36]:
model_wrapper.fit(train_data_generator,
                  validation_data=valid_data_generator,
                  epochs=10,
                  callbacks=[model_checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f72b7cc8100>

##### 3.) Test model

In [18]:
model_wrapper = build_t5_training_wrapper_model(t5model, max_length)

In [19]:
checkpoint_filepath = '/content/drive/MyDrive/W266FinalProject/model_checkpoints/t5_weights.03-0.92.hdf5'

model_wrapper.load_weights(checkpoint_filepath)

In [40]:
test_df.shape

(11535, 3)

In [27]:
test_df = test_df.sample(n=100)

In [None]:
model_wrapper.evaluate

In [36]:
r1 = []
r2 = []
rL = []
rLs = []
chrfs = []

for i in test_df.index:

    T5ARTICLE_TO_SUMMARIZE = test_df['article'][i]

    inputs = t5tokenizer(T5ARTICLE_TO_SUMMARIZE, 
                         max_length=max_length, 
                         truncation=True, 
                         return_tensors="tf")

    summary_ids = t5model.generate(inputs["input_ids"])
    
    candidate = t5tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    #pprint(candidate[0], compact=True)
    
    ref = [test_df['summary'][i]]
    
    rouge_results = rouge.compute(predictions=candidate,
                            references=ref)
    
    r1.append(rouge_results['rouge1'])
    r2.append(rouge_results['rouge2'])
    rL.append(rouge_results['rougeL'])
    rLs.append(rouge_results['rougeLsum'])
    
    chrf_results = chrf.compute(predictions=candidate,
                            references=ref)
    chrfs.append(chrf_results['score'])

    if i in np.arange(0, 50, 100):
    #     data = {'rouge1': r1, 'rouge2': r2, 'rogueL': rL, 'rogueLs': rLs, 'chrf': chrfs}
    #     scores = pd.DataFrame(data)
    #     scores.to_csv(f'/content/drive/MyDrive/W266FinalProject/model_results/T5_scores_{i}.csv', index=False)
        print(i)



In [37]:
print('rouge1 average :', np.mean(r1))
print('rouge2 average :', np.mean(r2))
print('rougeL average :', np.mean(rL))
print('rougeLs average :', np.mean(rLs))
print('chrf average :', np.mean(chrfs))

rouge1 average : 0.13596804706619461
rouge2 average : 0.015478396192686261
rougeL average : 0.11032705317061785
rougeLs average : 0.11032705317061785
chrf average : 16.556515545698197


In [38]:
data = {'rouge1': r1, 'rouge2': r2, 'rogueL': rL, 'rogueLs': rLs, 'chrf': chrfs}

scores = pd.DataFrame(data)

scores.to_csv(r'/content/drive/MyDrive/W266FinalProject/model_results/T5_scores_all.csv', index=False)