Here we fine-tune Google's T5 model on our own data.

The script also reads the data from a Google Cloud Storage (GCS) bucket directly into the TPU nodes, which speeds up the process.

## First, install the required (special) packages

In [None]:
%pip install sentencepiece
%pip install -q -U tf-models-official
%pip install transformers==4.17.0


## Create TF records

This step is only required if you don't have the TFRecords yet. Here I convert my CSV dataset to TFRecord files, which are then uploaded to a GCS bucket.

In [None]:
# Tokenize and encode the data in the TFRecord format.
# TFRecord-encoded data can be read directly by the TPU processing nodes.

from typing import List

import tensorflow as tf
import pandas as pd

from tqdm import tqdm

from transformers import T5Tokenizer

CLASS_TOKENS = ['anger', 'not_anger']


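def dataset_mapper(tokenizer, sep_token, example):
    """Encode one example, truncating or padding the input to 512 tokens.

    Inputs longer than 512 tokens keep their first 255 and last 256 tokens,
    joined by `sep_token`.
    """
    input_text = 'cola sentence: ' + example['text']
    target_text = CLASS_TOKENS[example['class']]

    input_encodings = tokenizer.encode_plus(input_text)
    # Explicit truncation keeps the target at 2 tokens and avoids the tokenizer warning.
    target_encodings = tokenizer.encode_plus(target_text, max_length=2, truncation=True)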

    vec = input_encodings['input_ids']

    if len(vec) > 512:
        input_encodings['input_ids'] = vec[:255] + [sep_token] + vec[-256:]
        input_encodings['attention_mask'] = input_encodings['attention_mask'][:512]
    elif len(vec) < 512:
        pad = [0] * (512 - len(vec))
        input_encodings['input_ids'] = vec + pad
        input_encodings['attention_mask'] = input_encodings['attention_mask'] + pad

    encodings = {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids'],
        'decoder_attention_mask': target_encodings['attention_mask']
    }

    return encodings


def _int64_list_feature(value: List[int]):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))


def make_tfrecord_dataset(tokenizer, df, filename):
    sep_token = tokenizer.get_vocab()['<extra_id_1>']

    with tf.io.TFRecordWriter(filename) as wr:
        for _, row in tqdm(df.iterrows()):
            encodings = dataset_mapper(tokenizer, sep_token, row)
            features = {k: _int64_list_feature(v)
                        for k, v in encodings.items()}
            features['class'] = _int64_list_feature([row['class']])
            example = tf.train.Example(
                features=tf.train.Features(feature=features))
            wr.write(example.SerializeToString())
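
The fixed-length logic in `dataset_mapper` can be illustrated without a tokenizer. The following is a minimal, framework-free sketch, assuming plain Python lists of token ids; the helper name `fit_to_length` and the separator id `99` are hypothetical and only serve the illustration.

```python
from typing import List, Tuple

MAX_LEN = 512  # fixed input length, as in dataset_mapper


def fit_to_length(ids: List[int], sep_id: int, pad_id: int = 0,
                  max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long."""
    if len(ids) > max_len:
        head = max_len // 2 - 1      # 255 tokens from the start when max_len=512
        tail = max_len - head - 1    # 256 tokens from the end when max_len=512
        ids = ids[:head] + [sep_id] + ids[-tail:]
    mask = [1] * len(ids)            # real tokens are attended to
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# A 20-token input squeezed into 8 positions: head, separator, tail.
ids, mask = fit_to_length(list(range(1, 21)), sep_id=99, max_len=8)
print(ids)   # [1, 2, 3, 99, 17, 18, 19, 20]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Keeping both ends of a long text is a design choice: the beginning and the end of a tweet thread often carry more signal than the middle.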

In [None]:
class_map = {k: i for i, k in enumerate(['anger', 'not_anger'])}

df_train = pd.read_csv('tweets_train.csv')
df_train['class'] = df_train.label.map(class_map)

df_valid = pd.read_csv('tweets_val.csv')
df_valid['class'] = df_valid.label.map(class_map)

In [None]:
df_test = pd.read_csv('tweets_test.csv')
df_test['class'] = df_test.label.map(class_map)

The dataset contains tweets in which members of the UK electorate react to tweets by candidates in the 2019 UK election. Each tweet is manually annotated as `anger` or `not_anger`.

In [None]:
df_valid

| Unnamed: 0.1 | Unnamed: 0 | text | label | class |
|---|---|---|---|---|
| 0 | 3773 | @JonAshworth @JamesFrith Same as always taking... | anger | 0 |
| 1 | 4546 | @DavidLammy @CromwellStuff @Conservatives @Mat... | not_anger | 1 |
| 2 | 2689 | @jeremycorbyn Yup says Douche | not_anger | 1 |
| 3 | 4878 | @DrPhillipLee @DavidpHearn Any Labour backers ... | not_anger | 1 |
| 4 | 5117 | @BorisJohnson Stop telling people you want to ... | anger | 0 |
| ... | ... | ... | ... | ... |
| 985 | 5140 | @jessphillips I’d call it a waste of time. A... | anger | 0 |
| 986 | 6704 | @BorisJohnson I’m sorry to have to break it ... | anger | 0 |
| 987 | 5968 | @JennyWLibDem @djnicholl Yep. But I'm also roo... | not_anger | 1 |
| 988 | 7175 | @CarrieAHarper @bootlegger1974 @Plaid_Cymru Ne... | not_anger | 1 |


In [None]:
tokenizer = T5Tokenizer.from_pretrained('t5-base')
make_tfrecord_dataset(tokenizer, df_train, 'dataset_t5_train.tfrecord')
make_tfrecord_dataset(tokenizer, df_valid, 'dataset_t5_valid.tfrecord')
make_tfrecord_dataset(tokenizer, df_test, 'dataset_t5_test.tfrecord')

6000it [00:10, 547.47it/s]
990it [00:01, 901.40it/s]


## Environment setup

- authentication for GCS access
- install packages
- TF distribution strategy

To run the script yourself, you will have to change the Google Cloud (GC) variables accordingly.

In [None]:
from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'emotions-twitter'
!gcloud config set project {project_id}

Updated property [core/project].





In [None]:
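# Google Cloud variables.
GCLOUD_PROJECT = 'emotions-twitter'
GCS_BUCKET = 'tfrecord_bucket_1'
# No leading or trailing slashes: the paths built below add their own
# separators, and extra slashes would produce URLs such as gs://bucket//dir//file.
GCS_MODEL_DIR = 't5-model2'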
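
The bucket and directory variables are later combined into `gs://` URLs. Below is a small sketch of one way to join them defensively; the helper name `gcs_path` is hypothetical and not part of any Google library. Stripping slashes from each component guards against doubled separators like those visible in the `No URLs matched` error further down.

```python
def gcs_path(bucket: str, *parts: str) -> str:
    """Join a bucket name and path components into a gs:// URL."""
    cleaned = [p.strip('/') for p in (bucket, *parts)]
    # Skip components that were empty or consisted only of slashes.
    return 'gs://' + '/'.join(p for p in cleaned if p)


print(gcs_path('tfrecord_bucket_1', '/t5-model2/', 'checkpoint'))
# gs://tfrecord_bucket_1/t5-model2/checkpoint
```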

In [None]:
import os
import numpy as np
import tensorflow as tf
from official.nlp import optimization

if os.environ.get('COLAB_TPU_ADDR'):
  cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
  tf.config.experimental_connect_to_cluster(cluster_resolver)
  tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
  strategy = tf.distribute.TPUStrategy(cluster_resolver)
  print('Using TPU')
elif tf.config.list_physical_devices('GPU'):
  strategy = tf.distribute.MirroredStrategy()
  print('Using GPU') 
else:
  strategy = tf.distribute.get_strategy()
  print('Running on CPU is not recommended.')

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.3.146.122:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)


Using TPU


## Load the dataset

With the TFRecord format, the dataset can be read directly on the TPU nodes.

In [None]:
def tf_record_decoder(encoded):
    features = {
        "input_ids": tf.io.FixedLenFeature([512], tf.int64),
        "attention_mask": tf.io.FixedLenFeature([512], tf.int64),
        "labels": tf.io.FixedLenFeature([2], tf.int64),
        "decoder_attention_mask": tf.io.FixedLenFeature([2], tf.int64),
        "class": tf.io.FixedLenFeature([1], tf.int64),
    }
    tf_record = tf.io.parse_single_example(encoded, features)
    return tf_record

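def dataset_prepare(dataset, batch_size=32, training=False):
    dataset = dataset.map(tf_record_decoder)
    # Cache the decoded records so later epochs skip the parsing step.
    dataset = dataset.cache()

    if training:
        dataset = dataset.shuffle(1 * 1024)

    dataset = dataset.batch(batch_size)
    # Prefetch last so that whole batches are prepared ahead of the consumer.
    return dataset.prefetch(buffer_size=tf.data.AUTOTUNE)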

In [None]:
ds_tr_train = tf.data.TFRecordDataset(f'gs://{GCS_BUCKET}/t5-model2/dataset_t5_train.tfrecord')
ds_tr_valid = tf.data.TFRecordDataset(f'gs://{GCS_BUCKET}/t5-model2/dataset_t5_valid.tfrecord')

In [None]:
with strategy.scope():
    ds_train = dataset_prepare(ds_tr_train, training=True)
    ds_valid = dataset_prepare(ds_tr_valid)

## Import the pre-trained model and helper functions


In [None]:
from transformers import T5Tokenizer, TFT5ForConditionalGeneration

# with strategy.scope():
#     tokenizer = T5Tokenizer.from_pretrained('t5-large')
#     model = TFT5ForConditionalGeneration.from_pretrained('t5-large')

with strategy.scope():
    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    model = TFT5ForConditionalGeneration.from_pretrained('t5-base')


All model checkpoint layers were used when initializing TFT5ForConditionalGeneration.

All the layers of TFT5ForConditionalGeneration were initialized from the model checkpoint at t5-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.


### Workaround for a bug in {train,test}_step

There was an issue with using metrics together with the `Model` training API (`fit`), which was fixed in https://github.com/huggingface/transformers/pull/14009. The cells below patch `train_step` and `test_step` as a workaround.


In [None]:
from tensorflow.python.keras.engine import data_adapter

def train_step(self, data):
    """
    A modification of Keras's default train_step that cleans up the printed metrics when we use a dummy loss.
    """
    # These are the only transformations `Model.fit` applies to user-input
    # data when a `tf.data.Dataset` is provided.
    data = data_adapter.expand_1d(data)
    x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)
    # These next two lines differ from the base method - they avoid issues when the labels are in
    # the input dict (and loss is computed internally)
    if y is None and "labels" in x:
        y = x["labels"]  # Stops confusion with metric computations
    # Run forward pass.
    with tf.GradientTape() as tape:
        y_pred = self(x, training=True)
        loss = self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses)
    # Run backwards pass.
    self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    self.compiled_metrics.update_state(y, y_pred['logits'], sample_weight)
    # Collect metrics to return
    return_metrics = {}
    for metric in self.metrics:
        result = metric.result()
        if isinstance(result, dict):
            return_metrics.update(result)
        else:
            return_metrics[metric.name] = result
    # These next two lines are also not in the base method - they correct the displayed metrics
    # when we're using a dummy loss, to avoid a bogus "loss_loss" value being shown.
    if "loss" in return_metrics and "loss_loss" in return_metrics:
        del return_metrics["loss_loss"]
    return return_metrics

def test_step(self, data):
    """
    A modification of Keras's default test_step that cleans up the printed metrics when we use a dummy loss.
    """
    data = data_adapter.expand_1d(data)
    x, y, sample_weight = data_adapter.unpack_x_y_sample_weight(data)
    # These next two lines differ from the base method - they avoid issues when the labels are in
    # the input dict (and loss is computed internally)
    if y is None and "labels" in x:
        y = x["labels"]  # Stops confusion with metric computations
    y_pred = self(x, training=False)
    if not self.loss:
        self.loss_tracker.update_state(y_pred.loss)
        return_metrics = {"loss": self.loss_tracker.result()}
    else:
        # Run anyway to update state
        return_metrics = {}
    # Updates stateful loss metrics.
    self.compiled_loss(y, y_pred, sample_weight, regularization_losses=self.losses)
    self.compiled_metrics.update_state(y, y_pred['logits'], sample_weight)
    # Collect metrics to return
    for metric in self.metrics:
        result = metric.result()
        if isinstance(result, dict):
            return_metrics.update(result)
        else:
            return_metrics[metric.name] = result
    # These next two lines are also not in the base method - they correct the displayed metrics
    # when we're using a dummy loss, to avoid a bogus "loss_loss" value being shown.
    if "loss" in return_metrics and "loss_loss" in return_metrics:
        del return_metrics["loss_loss"]
    return return_metrics


In [None]:
import functools
model.train_step = functools.partial(train_step, model)
model.test_step = functools.partial(test_step, model)
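
The two assignments above replace methods on a single instance rather than on the class. A minimal sketch of that mechanism follows, using a hypothetical `TinyModel` class as a stand-in for the Keras model: `functools.partial` pre-fills the `self` argument, because a plain function stored on an instance is not bound automatically.

```python
import functools


class TinyModel:
    """Hypothetical stand-in for a Keras model."""

    def __init__(self, name):
        self.name = name

    def train_step(self, data):
        return f'default step on {data}'


def custom_train_step(self, data):
    # `self` is supplied by functools.partial below, not by Python's
    # usual method binding.
    return f'custom step by {self.name} on {data}'


model_a = TinyModel('a')
model_b = TinyModel('b')
# Only this instance is patched; the class and other instances are untouched.
model_a.train_step = functools.partial(custom_train_step, model_a)

print(model_a.train_step('batch'))  # custom step by a on batch
print(model_b.train_step('batch'))  # default step on batch
```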

## Define a class accuracy metric

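The metric below compares the first predicted token with the first label token. It is monitored during training so that only the weights achieving the highest validation accuracy are saved, which acts as a form of regularization.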

In [None]:
def _accuracy(y_true, y_pred):
    return tf.keras.metrics.sparse_categorical_accuracy(y_true[:, 0], y_pred[:, 0])

class ClassificationAccuracy(tf.keras.metrics.MeanMetricWrapper):
  def __init__(self, name='accuracy', **kwargs):
    super().__init__(_accuracy, name=name, **kwargs)
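
To make the metric concrete, here is a framework-free sketch of what `_accuracy` measures, assuming nested Python lists in place of tensors. The function name `first_token_accuracy` and the toy 4-token vocabulary are hypothetical; only the first decoder position is scored, since the class is expressed by the first target token.

```python
def first_token_accuracy(labels, logits):
    """labels: per-example target token ids.
    logits: per-example, per-position vocabulary scores (nested lists)."""
    hits = 0
    for target_ids, position_scores in zip(labels, logits):
        first_scores = position_scores[0]
        # argmax over the vocabulary at decoder position 0
        predicted = max(range(len(first_scores)), key=first_scores.__getitem__)
        hits += int(predicted == target_ids[0])
    return hits / len(labels)


# Toy data: 3 examples, 2 decoder positions, vocabulary of 4 tokens.
labels = [[2, 1], [3, 1], [2, 1]]
logits = [
    [[0.1, 0.0, 2.0, 0.3], [0.0, 1.0, 0.0, 0.0]],  # predicts 2: hit
    [[0.1, 0.0, 2.0, 0.3], [0.0, 1.0, 0.0, 0.0]],  # predicts 2: miss (target 3)
    [[0.0, 0.2, 0.9, 0.1], [0.0, 1.0, 0.0, 0.0]],  # predicts 2: hit
]
print(first_token_accuracy(labels, logits))  # two of three correct
```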


## Compile and train the model 

- Set up the training hyperparameters (learning rate with warmup).
- Unwrap the `model.loss` dictionary so that `save_weights` works correctly; otherwise `save_weights` throws an exception stating that a trackable has been modified.

In [None]:
epochs = 30
batch_size = 32
init_lr = 1e-5

steps_per_epoch = 1406  # number of full 32-example batches per epoch; adjust to your dataset size
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = num_train_steps // 10

with strategy.scope():

    optimizer = optimization.create_optimizer(
        init_lr=init_lr,
        num_train_steps=num_train_steps,
        num_warmup_steps=num_warmup_steps,
        optimizer_type='adamw')

    model.compile(optimizer=optimizer, metrics=[ClassificationAccuracy()])
    model.loss = dict(model.loss)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.
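
The hyperparameters above imply a learning rate that ramps up over the first 10% of the steps and then decays. The sketch below reproduces that shape in plain Python. It assumes linear warmup followed by linear decay to zero, which is my understanding of the defaults of `optimization.create_optimizer` and should be checked against the installed version; the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Assumed schedule: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)


# Values taken from the cell above.
epochs, steps_per_epoch, init_lr = 30, 1406, 1e-5
num_train_steps = steps_per_epoch * epochs   # 42180
num_warmup_steps = num_train_steps // 10     # 4218

for step in (0, 2109, 4218, 42180):
    print(step, learning_rate(step, init_lr, num_train_steps, num_warmup_steps))
```

Because the schedule depends on `num_train_steps`, a `steps_per_epoch` that does not match the real number of batches stretches or compresses both the warmup and the decay.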


## Train

- On TPU (but not on CPU), the last batch, which has 8 examples rather than 32, generates a NaN loss, so only the full batches are used.
- Only a subset of the validation set is used, in order to save on computation costs.
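
How many full batches a dataset yields, and how large the leftover batch is, follows from integer division. A minimal sketch, with a hypothetical helper name `batch_plan`: the 6,000 figure is the example count reported in the TFRecord writing log above, while 45,000 is only an illustrative size that would produce the 1406 full batches and the 8-example remainder mentioned in the list.

```python
def batch_plan(num_examples, batch_size):
    """Return (number of full batches, size of the leftover batch)."""
    full_batches, remainder = divmod(num_examples, batch_size)
    return full_batches, remainder


print(batch_plan(6000, 32))   # (187, 16)
print(batch_plan(45000, 32))  # (1406, 8)
```

Taking only the first `full_batches` batches skips the partial one; `tf.data.Dataset.batch` also accepts `drop_remainder=True`, which achieves the same without hard-coding a count.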

In [None]:
checkpoint_filepath = f'gs://{GCS_BUCKET}/{GCS_MODEL_DIR}/checkpoint'

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',  
    save_best_only=True)


with strategy.scope():
    history = model.fit(
        x=ds_train.take(steps_per_epoch),  # full batches only
        validation_data=ds_valid.take(200),
        callbacks=[model_checkpoint_callback],
        epochs=epochs)

model.save_pretrained('t5-model')

Epoch 1/30
...
Epoch 30/30


In [None]:
#model.save_pretrained('t5-model')

In [None]:
!gsutil rsync t5-model gs://$GCS_BUCKET/$GCS_MODEL_DIR/

Building synchronization state...
Starting synchronization...
Copying file://t5-model/config.json [Content-Type=application/json]...
Copying file://t5-model/tf_model.h5 [Content-Type=application/octet-stream]...
Operation completed over 2 objects/850.8 MiB.                                    


## Evaluate the Results

In [None]:
import pandas as pd
test_df = pd.read_csv('tweets_test.csv')

In [None]:
def eval_mapper(batch):
    """ Map dataset entries into the format expected by model.predict,
        i.e. the decoder inputs are set to the decoder start token
        (for T5 this is the pad token, id = 0).
        Since the expected classes are expressed as a single token
        we can retrieve the output with a single call to predict rather
        than using the more expensive text generation strategy that T5
        uses to predict sentences.
    """
    batch_size = tf.shape(batch['input_ids'])[0]
    inputs = {
        'input_ids': batch['input_ids'],
        'attention_mask': batch['attention_mask'],
        'decoder_input_ids': tf.zeros((batch_size, 1), dtype=tf.int32),
        'decoder_attention_mask': tf.ones((batch_size, 1)),
    }
    return inputs

In [None]:
ds_tr_test = tf.data.TFRecordDataset(f'gs://{GCS_BUCKET}/t5-model2/dataset_t5_test.tfrecord')

In [None]:
with strategy.scope():
    ds_test = dataset_prepare(ds_tr_test)
    ds_eval = ds_test.map(eval_mapper)

In [None]:
class PredictionModel(tf.keras.Model):
    """ The model call function is executed on the TPU.

        This wrapper exists so that the argmax computation on logits is performed
        on the TPU and only the token indices are transferred between the TPU
        and the Colab CPU. Colab will run out of memory otherwise, or one is
        forced to execute the predict calls one batch at a time, which incurs
        graph setup/teardown costs.
    """
    def __init__(self, model):
        super().__init__()
        self._model = model
    
    def call(self, inputs):
        outputs = self._model(inputs)
        return tf.argmax(outputs['logits'], axis=-1)

In [None]:
import numpy as np

class ClassDecoder(object):
    """ Translate the class tokens into class ids.
    """
    TOKENS = ['anger', 'not_anger']
    def __init__(self, tokenizer):
        self.tok_ids = [tokenizer.encode(tok)[0] for tok in self.TOKENS]

    def _index(self, x):
        try:
            return self.tok_ids.index(x)
        except ValueError:
            return -1
    
    def __call__(self, y_pred_ids):
        result = [self._index(x) for x in y_pred_ids]
        return result
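
The decoding step can be tried without loading a tokenizer. In this sketch the token ids in `FIRST_TOKEN_IDS` are hypothetical placeholders; in the notebook they come from `tokenizer.encode(tok)[0]`. Any predicted id that matches neither class maps to `-1`, mirroring `ClassDecoder._index`.

```python
TOKENS = ['anger', 'not_anger']
# Hypothetical first-token ids, for illustration only.
FIRST_TOKEN_IDS = [11213, 59]


def decode_classes(predicted_ids):
    """Translate predicted token ids into class ids (-1 if unknown)."""
    lookup = {tok_id: idx for idx, tok_id in enumerate(FIRST_TOKEN_IDS)}
    return [lookup.get(int(tok_id), -1) for tok_id in predicted_ids]


print(decode_classes([11213, 59, 7]))  # [0, 1, -1]
```

A dictionary lookup is used here instead of `list.index`; with only two classes the difference is cosmetic, but it avoids the try/except.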

In [None]:
# import os
# os.makedirs('t5-tuned', exist_ok=True)

# !gsutil cp gs://$GCS_BUCKET/$GCS_MODEL_DIR/config.json t5-tuned/
# !gsutil cp gs://$GCS_BUCKET/$GCS_MODEL_DIR/tf_model.h5 t5-tuned/

CommandException: No URLs matched: gs://tfrecord_bucket_1//t5-model//config.json
CommandException: No URLs matched: gs://tfrecord_bucket_1//t5-model//tf_model.h5


In [None]:
#from transformers import T5Tokenizer, TFT5ForConditionalGeneration

with strategy.scope():
    tokenizer = T5Tokenizer.from_pretrained('t5-base')
    #model = TFT5ForConditionalGeneration.from_pretrained('./')
    xmodel = PredictionModel(model)

    decoder = ClassDecoder(tokenizer)

In [None]:
with strategy.scope():
    y_pred_ids = xmodel.predict(ds_eval, verbose=1)



In [None]:
decoder = ClassDecoder(tokenizer)
y_pred = np.array(decoder(y_pred_ids))

In [None]:
test_df['eval'] = y_pred
test_df.loc[test_df['eval'] == 1, 'eval'] = 'not_anger'
test_df.loc[test_df['eval'] == 0, 'eval'] = 'anger'

A quick look at the predicted vs. true labels:

In [None]:
test_df

| Unnamed: 0.1 | Unnamed: 0 | text | label | eval |
|---|---|---|---|---|
| 0 | 5527 | @Azhar4Pendle @NinaWarhurst Well done you have... | not_anger | not_anger |
| 1 | 6839 | @BorisJohnson @JamesCleverly The only thing yo... | anger | anger |
| 2 | 6550 | @DavidGauke Don't think your convincing anyone. | not_anger | not_anger |
| 3 | 5036 | @BorisJohnson Say what you want, Al.. you are ... | anger | anger |
| 4 | 4098 | @jeremycorbyn They won't be voting for you dip... | anger | anger |
| ... | ... | ... | ... | ... |
| 505 | 4438 | @DavidLammy He’d better get back to his vaul... | anger | not_anger |
| 506 | 6257 | @BorisJohnson You utter fraud. shuffling dirt ... | anger | anger |
| 507 | 2354 | @BorisJohnson Keep it up Boris , i love seeing... | anger | not_anger |
| 508 | 2478 | @jeremycorbyn Have a word then. https://t.co/v... | not_anger | not_anger |


A formal evaluation shows satisfactory results given the nature of the task: 83% accuracy and a macro-averaged F1 of 0.82 on the 510 test tweets.

In [None]:
from sklearn.metrics import classification_report, accuracy_score

In [None]:
print(classification_report(test_df['label'], test_df['eval']))

              precision    recall  f1-score   support

       anger       0.74      0.83      0.79       187
   not_anger       0.90      0.83      0.86       323

    accuracy                           0.83       510
   macro avg       0.82      0.83      0.82       510
weighted avg       0.84      0.83      0.84       510
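
For readers who want to see how the columns of such a report are derived, here is a minimal sketch of per-class precision, recall and F1 in plain Python. The five labels in `y_true` and `y_pred` are made-up toy data, not taken from the test set, and the function name `per_class_scores` is hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
y_true = ['anger', 'anger', 'not_anger', 'not_anger', 'anger']
y_pred = ['anger', 'not_anger', 'not_anger', 'anger', 'anger']

for label in ('anger', 'not_anger'):
    print(label, per_class_scores(y_true, y_pred, label))
```

The macro average in the report is the unweighted mean of the per-class values, while the weighted average weights each class by its support.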

