This notebook is part two of three notebooks containing an example of training a FLAN-T5 model using AWS SageMaker.

## Package Imports

In [2]:
from datasets import Dataset
import pickle
import pandas as pd
import numpy as np

## Load Pickle File

In [18]:
# Open the pickle file in binary mode; the with block closes it automatically
with open('cnn_stories_clean.pkl', 'rb') as file:
    df = pickle.load(file)

Let's take a look at our DataFrame to see what else needs to be done before it is ready for use.

In [19]:
df.highlight[0]

["Fernando Lugo says he's the father of a 2-year-old conceived when he was a bishop",
 "Announcement comes in the week after child's mother sued, seeking paternity test",
 'Some Cabinet members say paternity disclosure reflects government transparency',
 'But opposition party member calls on Vatican to excommunicate Lugo']

Each article has multiple highlights attached to it. Let's expand our DataFrame so that each row contains one article and one highlight, which will serve as its summary.

In [25]:
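# Repeat each story once per highlight, then flatten the lists of highlights
new_df = pd.DataFrame({
    'story': df['story'].repeat(df['highlight'].str.len()),
    'highlight': [h for lst in df['highlight'] for h in lst]
})
# Replace the repeated index with a fresh 0..n-1 index
df = new_df.reset_index(drop=True)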
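The same one-highlight-per-row reshaping can also be written with pandas' `DataFrame.explode`. The sketch below uses a tiny made-up DataFrame (`toy_df` and its contents are illustrative, not taken from the CNN data) to show the effect.

```python
import pandas as pd

# Hypothetical toy data standing in for the real stories DataFrame.
toy_df = pd.DataFrame({
    'story': ['story A', 'story B'],
    'highlight': [['a1', 'a2', 'a3'], ['b1']],
})

# explode() emits one row per list element and repeats the other columns.
# ignore_index=True gives the result a fresh 0..n-1 index.
expanded = toy_df.explode('highlight', ignore_index=True)
print(expanded)
```

Each story is repeated once per highlight, which is the same shape the cell above produces.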

We can use the Dataset.from_pandas method to have the Hugging Face datasets library turn our DataFrame into a dataset.

In [28]:
dataset = Dataset.from_pandas(df)

Next, let's divide our dataset into train and test splits, holding out 20% of the rows for testing.

In [29]:
dataset = dataset.train_test_split(test_size=0.2)
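By default, `train_test_split` shuffles the rows before splitting, so each run produces a different split unless a `seed` is passed. As a rough illustration of what a seeded 80/20 split does, here is a plain-Python sketch. `split_rows` is a hypothetical helper, not part of the `datasets` library, and the library's own shuffling differs in detail.

```python
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return {'train': shuffled[n_test:], 'test': shuffled[:n_test]}

# Made-up rows standing in for story/highlight pairs.
rows = [{'story': f'story {i}', 'highlight': f'highlight {i}'} for i in range(10)]

first = split_rows(rows)
second = split_rows(rows)  # same seed, so the same split
print(len(first['train']), len(first['test']))
```

With the real dataset, the analogous step would be passing a `seed` argument to `dataset.train_test_split` so that the split can be reproduced later.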

Let's take a look at some of our examples to make sure that the split produced training examples that will be useful for us. We are going to train our T5 model using source-summary pairs.

In [31]:
for num in range(5):
    print(dataset['train'][num])
    print('\n\n')

{'story': 'Federal drug agents discovered a 240-yard-long tunnel underneath the U.S.-Mexico border, and they suspect it was used to smuggle drugs into Arizona for sale in the United States, officials said Thursday.  The "sophisticated drug smuggling tunnel," which runs 55 feet below ground, begins in an ice plant in San Luis Rio Colorado, Sonora, Mexico, and ends inside a one-story, nondescript building in San Luis, Arizona, according to the U.S. Drug Enforcement Administration.  Report: Focus on cops, not military  Investigators started watching the building in January "after observing possible suspicious activity that indicated the site was being used as a potential stash location," the DEA said.  Arizona police found 39 pounds of methamphetamine inside a pickup truck stopped on Interstate 95 on July 6, which led them back to the San Luis, Arizona, building, the DEA said. They got a search warrant with that information.  Police in Arizona arrest 20, dismantle drug trafficking cell of

These examples look good enough to use for training. Let's further prepare our dataset for use with the model.


First, we need to load our tokenizer. For this example, we will be training the FLAN-T5-small model for text summarization, so we need to instantiate the matching tokenizer using the AutoTokenizer.from_pretrained method.

In [38]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')

We now need to write a pre-processing function which takes in our dataset and prepares it to go to the model. Following the T5 convention of task prefixes, we prepend "summarize: " to the beginning of every story. We are also going to truncate inputs that are longer than 512 tokens and pad shorter examples so that our training examples are all the same length.

In [48]:
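MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 64
prefix = 'summarize: '

def pre_process(examples):
    # Prepend the task prefix to every story
    inputs = [prefix + text for text in examples['story']]
    # Truncate long inputs and pad short ones to exactly MAX_INPUT_LENGTH tokens
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True, padding='max_length')

    # Tokenize the highlights as targets, truncated and padded to MAX_TARGET_LENGTH
    labels = tokenizer(examples['highlight'], max_length=MAX_TARGET_LENGTH, truncation=True, padding='max_length')

    # The target token ids become the labels the model is trained to produce
    model_inputs['labels'] = labels['input_ids']
    return model_inputs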
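To make the truncation and padding step concrete, here is a small plain-Python sketch on made-up token ids. `pad_or_truncate` and `mask_padding` are hypothetical helpers, not tokenizer methods, and the pad id of 0 is an assumption (it matches T5's pad token id, but check `tokenizer.pad_token_id`). The real tokenizer also handles special tokens such as the end-of-sequence marker, which this sketch ignores. The second helper shows a common optional refinement: replacing padding in the labels with -100, the value PyTorch's cross-entropy loss ignores by default, so that padded positions do not contribute to the loss.

```python
PAD_ID = 0  # assumed pad token id; check tokenizer.pad_token_id

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut ids to max_length, then right-pad with pad_id up to max_length."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=-100):
    """Replace padding positions in the labels with ignore_index."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]

short = pad_or_truncate([5, 6, 7], max_length=5)
long = pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_length=5)
labels = mask_padding(short)
print(short, long, labels)
```

Both sequences come out with exactly five ids, which is what allows them to be stacked into a single tensor later.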

To apply this function to our entire dataset, use the .map method of the Dataset object we created earlier. With batched=True, the function receives batches of examples rather than one example at a time.

In [58]:
tokenized_dataset = dataset.map(pre_process, batched=True)

Now that our dataset is tokenized, let's do one final bit of postprocessing to prepare it for training.

Because the model does not accept raw text as input, we need to remove the "story" and "highlight" columns from the dataset.

In [59]:
tokenized_dataset = tokenized_dataset.remove_columns(['story', 'highlight'])

We also need to set the dataset to return PyTorch tensors instead of lists. Let's first inspect the dataset, then change its format.

In [60]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 263249
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 65813
    })
})

In [61]:
tokenized_dataset.set_format('torch')

In [62]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 263249
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 65813
    })
})

Now let's split our dataset into separate train and test datasets to send to the model.

In [63]:
train_dataset = tokenized_dataset['train']
test_dataset = tokenized_dataset['test']

Create pickle files and save them in your current directory. Our final notebook will use these datasets to train our T5 model for summarization.

In [66]:
# Write each split to its own pickle file; the with blocks close the files
with open('train_dataset.pkl', 'wb') as train_file:
    pickle.dump(train_dataset, train_file)

with open('test_dataset.pkl', 'wb') as test_file:
    pickle.dump(test_dataset, test_file)
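As a quick sanity check, a pickled object can be loaded back and compared with the original. The sketch below uses a small stand-in dictionary and a temporary directory rather than the real datasets; the file name `example.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a dataset; the real objects are the Dataset splits above.
example = {'input_ids': [[1, 2, 3]], 'labels': [[4, 5]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    # Write the object, then read it straight back
    with open(path, 'wb') as f:
        pickle.dump(example, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == example)
```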