This notebook is part two of a three-notebook series containing an example of training a FLAN-T5 model using AWS SageMaker.

## Package Imports

In [1]:
from datasets import Dataset
import pickle
import pandas as pd
import numpy as np

In [2]:
pwd

'/home/eanthony/workspace/github-work/aidiv-sagemaker-examples/notebooks'

In [3]:
cd ..

/home/eanthony/workspace/github-work/aidiv-sagemaker-examples


## Load Pickle File

In [4]:
# Use a context manager so the file handle is closed after loading
with open('cnn_stories_clean.pkl', 'rb') as file:
    df = pickle.load(file)

Let's take a look at our DataFrame to see what else needs to be done before it is ready for use

In [5]:
df.highlight[0]

["Fernando Lugo says he's the father of a 2-year-old conceived when he was a bishop",
 "Announcement comes in the week after child's mother sued, seeking paternity test",
 'Some Cabinet members say paternity disclosure reflects government transparency',
 'But opposition party member calls on Vatican to excommunicate Lugo']

In [6]:
df

Unnamed: 0,story,highlight
0,"ASUNCION, Paraguay Paraguayan President Ferna...",[Fernando Lugo says he's the father of a 2-yea...
1,"KUNA YALA, Panama Hunched over a campfire in ...",[New trends could open door to reversal in def...
2,"Seoul, South Korea The new commander of U.S. f...",[Gen. James D. Thurman is the new commander of...
3,Most Americans don't want the United States to...,[David Rothkopf: Polls say Americans averse to...
4,U.S. Secretary of State John Kerry said Monday...,[John Kerry says unilateral action by North Ko...
...,...,...
92460,"Despite the retail madness of Black Friday, sm...","[Program intended to boost local businesses, S..."
92461,The U.S. is not returning combat troops to Ira...,[House approves Obama's request to train and a...
92462,Facebook wants to cut clutter. The social med...,[Facebook has redesigned the news feed to fill...
92463,Former Olympic champion Angel Matos of Cuba f...,[Cuba's Angel Matos kicks referee in the face ...


Each article has multiple highlights attached to it. Let's expand our DataFrame so that each row contains one article and one highlight.

In [7]:
new_df = pd.DataFrame({
    'story': df['story'].repeat(df['highlight'].str.len()),
    'highlight': [h for lst in df['highlight'] for h in lst]
})
df = new_df.reset_index(drop=True)
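
The same one-row-per-highlight expansion can also be sketched with pandas' `DataFrame.explode`. The toy DataFrame below is made up for illustration and is not the CNN data.

```python
import pandas as pd

# Toy stand-in for the real data: each story carries a list of highlights.
toy = pd.DataFrame({
    'story': ['story A', 'story B'],
    'highlight': [['a1', 'a2'], ['b1']],
})

# explode() emits one row per list element and repeats the other columns.
expanded = toy.explode('highlight').reset_index(drop=True)
print(expanded)
```

Either approach should give the same shape of result: the story text is repeated once for each of its highlights.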

We can use the .from_pandas function from Dataset to load in our dataframe and have the HuggingFace library turn it into our dataset

In [9]:
dataset = Dataset.from_pandas(df)

Finally, let's divide our dataset into train and test

In [11]:
dataset = dataset.train_test_split(test_size=0.2)

Let's look at a few examples to check that the split produced useful training pairs. We are going to train our FLAN-T5 model on source-summary pairs, with each story as the source and one highlight as the summary.

In [12]:
for num in range(0,5):
    print(dataset['train'][num])
    print('\n\n')

{'story': 'CHENGDU, China Rainy weather and poor logistics thwarted efforts by relief troops who walked for hours over rock, debris and mud on Tuesday in hopes of reaching the worst-hit area of an earthquake that killed nearly 10,000 in central China, state-run media reported.  Setting out from Maerkang in Sichuan Province at 8 p.m. Monday, the 100 or so troops had to travel 200 kilometers (124 miles) to go before reaching Wenchuan, the epicenter of the quake, also in the province, Xinhua reported. After seven hours, they still had 70 kilometers (43 miles) to go.  "I have seen many collapsed civilian houses, and the rocks dropped from mountains on the roadside are everywhere," the head of the unit, Li Zaiyuan, told Xinhua.  Added CNN Correspondent John Vause: "The roads here are terrible in the best of times ... right now they\'re down right atrocious. They\'ve resorted to going in one man at a time on foot."  Nearly all the confirmed deaths were in Sichuan Province, but rescuers were 

These examples look good enough for us to use as training examples. Let's further prepare our dataset for use with the model


First, we need to load our tokenizer. For this example, we will be training the FLAN-T5-small model for text summarization, so we need to instantiate the tokenizer using the AutoTokenizer.from_pretrained method

In [13]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')

We now need to write a pre-processing function that takes in our dataset and prepares it for the model. Following the T5 convention of task prefixes, we prepend "summarize: " to the beginning of every story. We also truncate any input longer than 512 tokens and pad any input shorter than 512 tokens, so that our training examples are all the same length.

In [29]:
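MAX_INPUT_LENGTH = 512
MAX_TARGET_LENGTH = 64
prefix = 'summarize: '

def pre_process(examples):
    # Prepend the task prefix to every story
    inputs = [prefix + text for text in examples['story']]
    # padding='max_length' pads every example to the same fixed length
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH, truncation=True, padding='max_length')

    # Tokenize the highlights as targets, capped at MAX_TARGET_LENGTH rather than the input length
    labels = tokenizer(examples['highlight'], max_length=MAX_TARGET_LENGTH, truncation=True, padding='max_length')

    model_inputs['labels'] = labels['input_ids']
    return model_inputs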
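
A common refinement, which is not part of the original notebook, is to replace padding token ids in the labels with -100 so that the loss skips padded positions (PyTorch's cross-entropy loss ignores -100 by default). The sketch below is plain Python. `mask_label_padding` is a hypothetical helper, and the pad id of 0 is an assumption for illustration; in practice it would be read from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default

def mask_label_padding(label_ids, pad_token_id):
    """Return a copy of label_ids with every padding id replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Made-up token ids; 0 plays the role of the padding id here (assumption).
padded_labels = [[37, 1712, 1, 0, 0], [8, 1, 0, 0, 0]]
masked = mask_label_padding(padded_labels, pad_token_id=0)
print(masked)
```

Inside `pre_process`, such a helper could be applied to `labels['input_ids']` before assigning them to `model_inputs['labels']`.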

To apply this function to the whole dataset, use the .map method of the dataset object we created earlier. Passing batched=True hands the function a batch of examples at a time.

In [30]:
tokenized_dataset = dataset.map(pre_process, batched=True)


Now that our dataset is tokenized, let's do one final bit of postprocessing to prepare it for training.

Because the model does not accept raw text as input, we need to remove the "story" and "highlight" columns from the dataset.

In [31]:
tokenized_dataset = tokenized_dataset.remove_columns(['story', 'highlight'])

We also need to set the dataset to return PyTorch tensors instead of lists. First, let's confirm which columns remain.

In [32]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 263249
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 65813
    })
})

In [33]:
tokenized_dataset.set_format('torch')

Finally, let's split our dataset into separate train and test datasets to send to the model

In [34]:
train_dataset = tokenized_dataset['train']
test_dataset = tokenized_dataset['test']

Create pickle files and save them in your current directory. Our final notebook will use these datasets to train our T5 model for summarization.

In [36]:
# Context managers close each file even if pickling raises an error
with open('train_dataset.pkl', 'wb') as train_file:
    pickle.dump(train_dataset, train_file)

with open('test_dataset.pkl', 'wb') as test_file:
    pickle.dump(test_dataset, test_file)
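
The final notebook is not shown here, but reading the files back would presumably mirror the write step. A minimal round-trip sketch, using a small stand-in object and a temporary directory rather than the real datasets:

```python
import os
import pickle
import tempfile

# Small stand-in for a tokenized dataset (made up for illustration).
stand_in = {'input_ids': [[1, 2, 3]], 'labels': [[4, 5]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train_dataset.pkl')

    # Write the object to disk
    with open(path, 'wb') as f:
        pickle.dump(stand_in, f)

    # Read it back, as a later notebook might
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == stand_in)
```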