### Training NLP models in Colab without running out of RAM

This notebook covers some techniques you can use to avoid running out of memory when training large NLP models on a lot of data.

The task we'll work on is textual entailment, using the [Stanford Natural Language Inference (SNLI) dataset](https://nlp.stanford.edu/projects/snli/). We'll build a fairly simple classification model, using a pre-trained BERT model. (Some of the code is inspired by [this Keras example for SNLI classification](https://keras.io/examples/nlp/semantic_similarity_with_bert/).)

The main focus of this notebook is not on the task or model architecture, but on how to load part of your data at a time while you train, and save model checkpoints as you go. You should be able to run the notebook on the free tier of Google Colab. (There is a point where it will run out of RAM, for demonstration, but that is noted in the comments.)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2022-fall-main/blob/master/materials/walkthrough_notebooks/keras_with_limited_ram/keras_training_with_limited_ram.ipynb)


In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.22.2-py3-none-any.whl (4.9 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.9.0
  Downloading huggingface_hub-0.10.0-py3-none-any.whl (163 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.0 tokenizers-0.12.1 transformers-4.22.2


In [2]:
import os, re
import time
import numpy as np
import pandas as pd

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# These auto classes load the right type of tokenizer and model based on a model name
from transformers import AutoTokenizer, TFAutoModel

We'll start by downloading the data, using curl in bash to save it to the local disk space for the Colab notebook. You might have your data in Google Drive instead; later we'll mount a Drive folder to this notebook so that we can save our model someplace more permanent, but you can move that step up if you need to load data from Drive.

In [3]:
!curl -LO https://raw.githubusercontent.com/MohamadMerchant/SNLI/master/data.tar.gz
!tar -xvzf data.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.1M  100 11.1M    0     0  7315k      0  0:00:01  0:00:01 --:--:-- 7311k
SNLI_Corpus/
SNLI_Corpus/snli_1.0_dev.csv
SNLI_Corpus/snli_1.0_train.csv
SNLI_Corpus/snli_1.0_test.csv


In [4]:
!ls SNLI_Corpus

snli_1.0_dev.csv  snli_1.0_test.csv  snli_1.0_train.csv


First let's read in the entire train and dev datasets. It looks like we have about 550k training examples, and 10k dev examples (which we'll use for validation). Just loading those short sentence pairs doesn't take a lot of RAM, but it will be too much to process with a BERT model. (You can see how much RAM and Disk space you're using by looking in the upper right corner of the notebook.)

In [5]:
train_filename = 'SNLI_Corpus/snli_1.0_train.csv'
dev_filename = 'SNLI_Corpus/snli_1.0_dev.csv'

df_train = pd.read_csv(train_filename)
df_dev = pd.read_csv(dev_filename)

df_train.shape, df_dev.shape

((550152, 3), (10000, 3))

Let's define some functions that we'll need to preprocess the data and build our classification model. First, we'll tokenize the sentence pairs using the pretrained BERT tokenizer. Second, we need to convert the three label classes from strings to numeric values.

In [6]:
label_dict = {'neutral': 0, 'entailment': 1, 'contradiction': 2}

In [7]:
def preprocess_data(sentence_pairs, label_strs, tokenizer, max_length=128):
    # With BERT tokenizer's batch_encode_plus, sentence pairs are
    # encoded together and separated by [SEP] token.
    encoded = tokenizer.batch_encode_plus(
        sentence_pairs,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_token_type_ids=True,
        return_tensors="tf"
    )

    # Extract encoded features and labels, add to corresponding lists
    input_ids = np.array(encoded["input_ids"], dtype="int32")
    attention_masks = np.array(encoded["attention_mask"], dtype="int32")
    token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

    # Convert string labels into numbered categories
    labels = np.array([label_dict[label] if label in label_dict else 0
                       for label in label_strs])
    
    return [input_ids, attention_masks, token_type_ids], labels
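
One thing to be aware of in `preprocess_data`: any label string that is not a key of `label_dict` is silently mapped to 0 (neutral). SNLI marks examples without annotator consensus with `-`, so, assuming such rows are present in these CSV files, they would be trained as neutral. A minimal sketch of one alternative is to filter them out before tokenizing. `filter_known_labels` is a hypothetical helper, not part of the notebook, and the sentence pairs below are made up:

```python
label_dict = {'neutral': 0, 'entailment': 1, 'contradiction': 2}

def filter_known_labels(sentence_pairs, label_strs, label_dict):
    """Keep only the examples whose label string appears in label_dict."""
    kept_pairs, kept_labels = [], []
    for pair, label in zip(sentence_pairs, label_strs):
        if label in label_dict:
            kept_pairs.append(pair)
            kept_labels.append(label)
    return kept_pairs, kept_labels

# Made-up examples; the last one has the "no consensus" label
example_pairs = [['A man sleeps.', 'A man is awake.'],
                 ['A dog runs.', 'An animal moves.'],
                 ['Two kids play.', 'Kids are outside.']]
example_labels = ['contradiction', 'entailment', '-']

kept_pairs, kept_labels = filter_known_labels(example_pairs, example_labels, label_dict)
print(len(kept_pairs), kept_labels)
```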

For the model, we'll construct a fairly simple classification model on top of the pretrained BERT model. Since we're freezing the full BERT model, it doesn't work very well for this classification problem to just use the pre-trained CLS token output as our vector representing the full input that we want to classify. (It would probably work better if we unfroze some BERT layers to fine-tune that CLS token.) Instead, we'll add one more attention layer on top of the full sequence of contextualized token vectors that we get out of BERT, so that we can train that attention layer to pay attention to the tokens that are most useful for this entailment task.

In [8]:
def build_snli_model(bert_model, max_length=128, hidden_dim=256):
    # shape must be a tuple, so note the trailing comma in (max_length,)
    input_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='input_ids')
    attention_masks = layers.Input(shape=(max_length,), dtype=tf.int32, name='attention_masks')
    token_type_ids = layers.Input(shape=(max_length,), dtype=tf.int32, name='token_type_ids')
    
    bert_output = bert_model(input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids)
    sequence_output = bert_output.last_hidden_state

    attn_output = layers.MultiHeadAttention(num_heads=4, key_dim=100)(sequence_output, sequence_output)
    max_pool = layers.GlobalMaxPooling1D()(attn_output)
    dropout_output = layers.Dropout(0.3)(max_pool)
    final_output = layers.Dense(3, activation="softmax")(dropout_output)
    
    model = tf.keras.models.Model(inputs=[input_ids, attention_masks, token_type_ids],
                                  outputs=[final_output])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

Ok, let's load the pretrained tokenizer and model, build the classification model, and preprocess our data to get ready to train. We'll freeze the BERT model layers for the live demo, so we keep the pre-trained weights rather than fine-tuning. (We will still be training the new layers we add on top of BERT for classification.) If you set the last line in the cell below to True, and train further on your task, you'll be fine-tuning the BERT model. It will take longer to train.

In [9]:
bert_model_name='bert-base-uncased'
bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
bert_model = TFAutoModel.from_pretrained(bert_model_name)
bert_model.trainable = False

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [10]:
model = build_snli_model(bert_model, max_length=128, hidden_dim=256)
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 128)]        0           []                               
                                                                                                  
 attention_masks (InputLayer)   [(None, 128)]        0           []                               
                                                                                                  
 token_type_ids (InputLayer)    [(None, 128)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_masks[0][0]',    

In [11]:
# Tokenize 1k of the dev examples to use for validation data first
# (You can use more, but we'll just use 1k for the live demo)

dev_sentence_pairs = df_dev[['sentence1', 'sentence2']].values[:1000].astype(str).tolist()
dev_labels = df_dev['similarity'].values[:1000]

dev_data = preprocess_data(
    dev_sentence_pairs, dev_labels, tokenizer=bert_tokenizer, max_length=128
)

In [None]:
# Now tokenize the 550k training examples ...
# ONLY RUN THIS CELL THE FIRST TIME FOR DEMONSTRATION, IT MIGHT RUN OUT OF RAM

sentence_pairs = df_train[['sentence1', 'sentence2']].values.astype(str).tolist()
labels = df_train['similarity'].values

train_data = preprocess_data(
    sentence_pairs, labels, tokenizer=bert_tokenizer, max_length=128
)

At some point while running the last cell, if you're using the free Colab tier, your notebook probably ran out of RAM and crashed. (If it didn't, you're on a Colab machine with more RAM; available resources vary. Even so, you may not be able to actually train the model with all of that data in memory.)
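
To get a feel for the numbers, here is a rough back-of-the-envelope estimate of the size of the three final `int32` arrays that `preprocess_data` returns. `encoded_size_gib` is a hypothetical helper written for this illustration. It only counts the finished NumPy arrays; the tokenizer also builds intermediate Python and TensorFlow objects before they are copied into NumPy, so peak usage is likely considerably higher (we have not measured it here).

```python
def encoded_size_gib(n_examples, max_length=128, n_arrays=3, bytes_per_value=4):
    """Size in GiB of n_arrays int32 arrays of shape (n_examples, max_length)."""
    return n_examples * max_length * n_arrays * bytes_per_value / 2**30

full_gib = encoded_size_gib(550152)   # the whole training set at once
batch_gib = encoded_size_gib(32)      # a single batch of 32 examples

print(f"full training set: {full_gib:.2f} GiB")
print(f"one batch of 32:   {batch_gib * 1024:.3f} MiB")
```

The gap between those two numbers is the motivation for tokenizing one batch at a time.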

Let's try again, but this time, we won't load all of our data at once. Connect the notebook again (it may have restarted on its own), and run most of the code above, but stop after tokenizing the dev data (which we'll keep for validation below). Don't tokenize the full 550k dataset.

In [12]:
# In case you loaded the full dataset above
df_train = None

We can define a [custom class called a data generator](https://medium.com/analytics-vidhya/write-your-own-custom-data-generator-for-tensorflow-keras-1252b64e41c3) that we will pass to `model.fit` instead of our full dataset. This data generator needs to implement the methods `__len__`, `__getitem__`, and `on_epoch_end`. In `__getitem__`, we'll write the code to get the next batch of data to train the model. We can write that function so it only loads and tokenizes the data needed for the next batch. In `on_epoch_end`, we'll shuffle the order in which we plan to load data for the next epoch.

We'll look at how to do this two ways. First, in the data we downloaded from SNLI, all of the training data is in one large CSV file. We can use the pandas `pd.read_csv` method, which includes options to skip certain rows of data and only load a certain number. We won't just want to load a consecutive chunk each time, because we'll want to shuffle the rows. The `read_csv` method doesn't quite have an option to specify individual row indices to load, but we can specify a list of row indices to skip, so that's what we'll do here.
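
Here is a small, self-contained sketch of that `skiprows` idea on a made-up 10-row CSV held in memory (the file contents, the seed, and the batch size of 4 are all just for illustration). File row 0 is the header, so the data rows are numbered from 1, and we skip every data row that is not in the current batch:

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in for the training CSV: a header row plus 10 data rows
csv_text = "sentence1,sentence2,similarity\n" + "".join(
    f"premise {i},hypothesis {i},neutral\n" for i in range(1, 11)
)

n_examples, batch_size = 10, 4
rng = np.random.default_rng(0)
# File row numbers of the data rows; row 0 is the header, so we never skip it
row_order = rng.permutation(np.arange(1, n_examples + 1))

batch_rows = row_order[:batch_size]               # rows we want for the first batch
rows_to_skip = row_order[batch_size:].tolist()    # every other data row

df_batch = pd.read_csv(io.StringIO(csv_text), skiprows=rows_to_skip)
print(sorted(batch_rows.tolist()))
print(df_batch['sentence1'].tolist())
```

Two things to note: the selected rows come back in file order rather than in shuffled order (which is fine within a batch), and each call has to scan through the file again, so this gets slower per batch as the file grows.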

In [13]:
class SNLIDataGeneratorFromFile(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 n_examples,
                 data_filename,
                 max_length=128,
                 batch_size=32,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.data_filename = data_filename
        self.max_length = max_length
        self.batch_size = batch_size
        self.shuffle = shuffle
        
        # Initialize row order, call on_epoch_end to shuffle row indices
        self.row_order = np.arange(1, self.n_examples+1)
        self.on_epoch_end()
    
    def __len__(self):
        # NOTE: USING REDUCED BATCHES PER EPOCH TO SPEED UP THE LIVE DEMO
        # For normal use, this line should be:
        # return self.n_examples // self.batch_size
        return 100
    
    def __getitem__(self, idx):
        batch_start = idx * self.batch_size
        batch_end = (idx + 1) * self.batch_size

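        # Indices to skip are the ones in the shuffled row_order before and
        # after the chunk we'll use for this batch. Convert each slice to a
        # list so that + concatenates, even when row_order is still a NumPy
        # array (which it is when shuffle=False).
        batch_idx_skip = list(self.row_order[:batch_start]) + list(self.row_order[batch_end:])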
        df = pd.read_csv(self.data_filename, skiprows=batch_idx_skip)
        
        sentence_pairs = df[['sentence1', 'sentence2']].values.astype(str).tolist()
        labels = df['similarity'].values
        
        batch_data = preprocess_data(
            sentence_pairs,
            labels,
            self.tokenizer,
            self.max_length
        )

        return batch_data
    
    def on_epoch_end(self):
        if self.shuffle:
            self.row_order = list(np.random.permutation(self.row_order))

In [14]:
train_data_generator = SNLIDataGeneratorFromFile(
    tokenizer=bert_tokenizer,
    n_examples=550152,
    data_filename='SNLI_Corpus/snli_1.0_train.csv'
)

One more thing. It's going to take a while to train our model (even with a GPU, which we'll need to use). Colab resources are free but shared, so there are usage limits, and our notebook might time out, especially when using a GPU for a while. So we should periodically save a copy of our trained model as we go. Later, we can load the model that we saved and keep training it further.

At this point, we probably do want to mount a Google Drive folder, because we won't want to save our checkpoints just to temporary Colab disk space. If our notebook disconnects, we'll lose those files. The next cell mounts your Drive folder, then for demonstration I'm showing my (the instructor's) UC Berkeley Drive path to where I'm storing files for this semester's class. You'll want to edit that for your Drive.

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
# CHANGE THIS TO THE PATH IN YOUR OWN DRIVE WHERE YOU WANT TO SAVE CHECKPOINTS

!ls drive/MyDrive/ISchool/MIDS/W266/2022_Fall/

Keras provides a handy [ModelCheckpoint](https://keras.io/api/callbacks/model_checkpoint/) class that we can pass into .fit as a callback. By default, it'll save a checkpoint of the model at the end of each epoch of training.

We can choose to save the whole model or just the weights (i.e. the model parameters that we've trained so far). And we'll specify the destination filepath (we can include formatting options to have different filenames for each epoch and loss).
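
To see what those formatting options produce, we can apply Python's `str.format` to the same kind of template ourselves. Keras fills the placeholders from the epoch number and the logged metrics; the directory name and the accuracy values below are made up for illustration:

```python
filepath_template = 'model_checkpoints/weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'

# Made-up (epoch, validation accuracy) pairs, just to show the resulting names
example_results = [(1, 0.6134), (5, 0.7512)]

checkpoint_names = [
    filepath_template.format(epoch=epoch, val_accuracy=val_accuracy)
    for epoch, val_accuracy in example_results
]
for name in checkpoint_names:
    print(name)
```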

(Other options: you can also choose to only save the best performing model each time, based on a performance metric you choose. But we'll save after every epoch here so that we can see the resulting files in the live demo.)

In [17]:
# CHANGE checkpoint_dir TO THE PATH IN YOUR OWN DRIVE WHERE YOU WANT TO SAVE CHECKPOINTS

checkpoint_dir = 'drive/MyDrive/ISchool/MIDS/W266/2022_Fall/model_checkpoints/'
checkpoint_filepath = checkpoint_dir + 'weights.{epoch:02d}-{val_accuracy:.2f}.hdf5'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True)

Now we're ready to train our model. We'll call `model.fit`, but instead of passing in an array of data, we'll pass in our data generator. And we'll include the model checkpoint callback, to save the weights after each epoch.

The next cell may take a couple hours to run per epoch on the full dataset. The ETA in the running output can be very useful to estimate how long it will take your model to train (and whether you need to interrupt it and make adjustments to be able to make progress).

Note: For demonstration purposes during the live demo, we're only using a few batches of data per epoch, so that we can see it train and save checkpoints. In the SNLIDataGeneratorFromFile code above, see the comments on the `__len__` method. Change that code in the `__len__` method back to the correct batches per epoch to see how long it takes to train on the full dataset.
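
For reference, this is the arithmetic behind the "correct" number of batches per epoch for this dataset, using the floor division from the commented-out line in `__len__` and, for comparison, the ceiling:

```python
import math

n_examples = 550152
batch_size = 32

full_batches = n_examples // batch_size            # drops the final partial batch
all_batches = math.ceil(n_examples / batch_size)   # keeps it as a smaller batch
leftover = n_examples - full_batches * batch_size  # examples left out by the floor

print(full_batches, all_batches, leftover)
```

With the floor version, the last few examples in the shuffled order are simply not used in that epoch, which is harmless here since the order changes every epoch.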

In [19]:
model.fit(train_data_generator, validation_data=dev_data, epochs=5,
          callbacks=[model_checkpoint_callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f23780e7c10>

If we need to pick up where we left off, we can load the weights that we saved into our model and then call `model.fit` again.

In [20]:
# CHANGE checkpoint_filepath TO EXACT NAME OF A SAVED CHECKPOINT YOU WANT TO LOAD

checkpoint_filepath = checkpoint_dir + 'weights.05-0.75.hdf5'
model.load_weights(checkpoint_filepath)
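
If you'd rather not type the checkpoint name by hand, a minimal sketch like the following could pick the most recent one from a list of filenames. `latest_checkpoint` is a hypothetical helper (not a Keras function), and it assumes the `weights.{epoch}-{val_accuracy}.hdf5` naming used above; the filenames in the example are made up:

```python
import re

# Matches names like weights.05-0.75.hdf5 and captures the epoch number
CHECKPOINT_PATTERN = re.compile(r'weights\.(\d+)-(\d+\.\d+)\.hdf5$')

def latest_checkpoint(filenames):
    """Return (filename, epoch) for the highest epoch number, or (None, 0)."""
    best_name, best_epoch = None, 0
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match and int(match.group(1)) > best_epoch:
            best_name, best_epoch = name, int(match.group(1))
    return best_name, best_epoch

# In the notebook this list would come from os.listdir(checkpoint_dir)
example_files = ['weights.01-0.61.hdf5', 'weights.05-0.75.hdf5', 'notes.txt']
latest_name, last_epoch = latest_checkpoint(example_files)
print(latest_name, last_epoch)
```

When you call `model.fit` again, you can pass `initial_epoch=last_epoch` so that the epoch numbers in the new checkpoint filenames continue from where you stopped. In that case `epochs` is the index of the final epoch, not the number of additional epochs. Since we saved weights only, the optimizer state is not restored.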

What if our data is stored in many small files? We might want to only load one file of data at a time, and randomly shuffle the order in which we load files from the data folder for each training epoch.

Just to demonstrate that option, we'll simulate having our data in multiple files. We'll use the same dataset, but read the full dataset once and write it to a bunch of csv files of 256 rows each. (You won't typically do this if your data starts out all in one file, in which case you can just read select rows like we did above.)

In [21]:
!mkdir SNLI_Corpus/train_files/

In [22]:
df_train = pd.read_csv(train_filename)
for i in range(0, 550152, 256):
    df_train[i:i+256].to_csv('SNLI_Corpus/train_files/train_data_%d.csv' % i, index=False)

df_train = None
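
The cell above reads the whole CSV into memory before splitting it, which is fine for this dataset. If the source file were too large for that, one option is the `chunksize` argument of `pd.read_csv`, which returns an iterator of DataFrames instead of one big DataFrame. A small sketch on a made-up 10-row CSV held in memory (the chunk size of 4 is just for illustration):

```python
import io
import pandas as pd

# Tiny stand-in for the training CSV: a header row plus 10 data rows
csv_text = "sentence1,sentence2,similarity\n" + "".join(
    f"premise {i},hypothesis {i},neutral\n" for i in range(10)
)

rows_per_file = 4
chunk_sizes = []
# With chunksize set, read_csv yields one DataFrame per chunk of rows
for i, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=rows_per_file)):
    # In the notebook, each chunk could be written out with something like
    # chunk.to_csv('SNLI_Corpus/train_files/train_data_%d.csv' % i, index=False)
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
```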

In [23]:
data_dir = 'SNLI_Corpus/train_files/'
data_filenames = os.listdir(data_dir)
len(data_filenames)

2150

Now we have 2150 separate csv files of training data, and we'll only want to load one or a few at a time to train our model. Our data files have 256 rows in each, and we'll only use 32 examples per batch, so we'll usually only load one file at a time and tokenize part of it for the next batch of data.

The code below will work whether your files are larger or smaller than one batch, though. We'll keep track of which rows we've already used from the current file and take the next rows for a new batch, so we might run past the current file and load another file to fill up the rest of the batch.

In your own project, you might have data files that are smaller or larger, so we've made the code somewhat flexible so that you can see how to load just enough files to get the next batch of data that you need.
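
Before looking at the generator class, here is the row-tracking idea on its own, without any file reading or tokenizing. `plan_batches` is a hypothetical helper for illustration only: given the number of rows in each file, it lists which slice of which file each batch would read. The file sizes in the example are made up:

```python
def plan_batches(file_sizes, batch_size):
    """Return, for each batch, the (file_index, start_row, end_row) slices it reads."""
    batches = []
    file_i, row_i = 0, 0
    while file_i < len(file_sizes):
        slices, needed = [], batch_size
        # Keep taking rows until the batch is full or the files run out
        while needed > 0 and file_i < len(file_sizes):
            end = min(row_i + needed, file_sizes[file_i])
            slices.append((file_i, row_i, end))
            needed -= end - row_i
            if end < file_sizes[file_i]:
                row_i = end                      # rows remain in this file
            else:
                file_i, row_i = file_i + 1, 0    # file exhausted, move on
        batches.append(slices)
    return batches

batch_plan = plan_batches(file_sizes=[5, 3, 4], batch_size=4)
for batch in batch_plan:
    print(batch)
```

The second batch in this example takes the last row of the first file and then all three rows of the second file, which is the "run past the current file" case described above.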

In [24]:
class SNLIDataGeneratorFromDir(tf.keras.utils.Sequence):
    
    def __init__(self,
                 tokenizer,
                 n_examples,
                 data_dir,
                 examples_per_file,
                 max_length=128,
                 batch_size=32,
                 shuffle=True):
        
        self.tokenizer = tokenizer
        self.n_examples = n_examples
        self.data_dir = data_dir
        self.examples_per_file = examples_per_file
        self.max_length = max_length
        self.batch_size = batch_size
        self.shuffle = shuffle
        
        self.filename_order = os.listdir(self.data_dir)
        self.next_file_i = 0
        self.next_row_i = 0
        
        # Call on_epoch_end to shuffle data at start
        self.on_epoch_end()
    
    def __len__(self):
        # NOTE: USING REDUCED BATCHES PER EPOCH TO SPEED UP THE LIVE DEMO
        # For normal use, this line should be:
        # return self.n_examples // self.batch_size
        return 100
    
    def __getitem__(self, idx):
        # Note: idx is not used. Batches are served sequentially from the
        # position tracked in next_file_i / next_row_i.
        sentence_pairs = []
        labels = []
        
        # Keep loading files until the batch is full or we run out of files
        while (len(sentence_pairs) < self.batch_size
               and self.next_file_i < len(self.filename_order)):
            filepath = os.path.join(self.data_dir,
                                    self.filename_order[self.next_file_i])
            df = pd.read_csv(filepath)
            n_remaining = self.batch_size - len(sentence_pairs)
            
            start = self.next_row_i
            end = self.next_row_i + n_remaining
            # astype(str) turns missing (NaN) sentences into strings, as in
            # the file-based generator above
            curr_sent_pairs = df[['sentence1', 'sentence2']].values[start:end].astype(str)
            sentence_pairs.extend(curr_sent_pairs.tolist())
            
            curr_labels = df['similarity'].values[start:end]
            labels.extend(curr_labels.tolist())
            
            if end < len(df):
                # Rows remain in this file for the next batch
                self.next_row_i = end
            else:
                # File exhausted, so move on to the next one
                self.next_file_i += 1
                self.next_row_i = 0
            
        batch_data = preprocess_data(
            sentence_pairs,
            labels,
            self.tokenizer,
            self.max_length
        )
        return batch_data
    
    def on_epoch_end(self):
        self.next_file_i = 0
        self.next_row_i = 0
        
        if self.shuffle:
            self.filename_order = np.random.permutation(self.filename_order)

In [25]:
train_data_generator = SNLIDataGeneratorFromDir(
    tokenizer=bert_tokenizer,
    n_examples=550152,
    data_dir=data_dir,
    examples_per_file=256
)

In [26]:
model.fit(train_data_generator, validation_data=dev_data, epochs=5,
          callbacks=[model_checkpoint_callback])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f2362a87350>