Before we can use the rest of the notebook, we need to install its dependencies. This example uses transformers and datasets. The install cell further down also installs torch_xla, which is needed to use TPUs on Colab, and its last line installs accelerate from source.

# Import of Libraries

In [None]:
!pip --version

pip 19.3.1 from /usr/local/lib/python3.7/dist-packages/pip (python 3.7)


Make sure that the runtime is connected to a TPU.

In [None]:
import os
# Use .get() so that a missing variable triggers the assertion message instead of a KeyError
assert os.environ.get('COLAB_TPU_ADDR'), 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
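
Looking the variable up with .get() returns None when it is missing instead of raising KeyError, so the check can report a helpful message. A minimal sketch of such a check, assuming Colab exposes the address through COLAB_TPU_ADDR as the cell above expects; the helper name tpu_address is hypothetical:

```python
import os

def tpu_address(env=None):
    """Return the TPU address from the environment, or None if no TPU is attached."""
    env = os.environ if env is None else env
    # .get() avoids a KeyError when the variable is absent
    return env.get("COLAB_TPU_ADDR") or None

if __name__ == "__main__":
    addr = tpu_address()
    print("TPU at", addr if addr else "not found; select TPU in the notebook settings")
```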

We need to install 🤗 Accelerate in a virtual environment. 

In [None]:
! pip install virtualenv

Collecting virtualenv
  Downloading https://files.pythonhosted.org/packages/76/38/6a1ed00d05ec4f6e7c587e3e5ec7de827e1164c030d9a2a891463aff95d5/virtualenv-20.5.0-py2.py3-none-any.whl (5.3MB)
     |████████████████████████████████| 5.3MB 5.0MB/s
Collecting backports.entry-points-selectable>=1.0.4
  Downloading https://files.pythonhosted.org/packages/0c/cd/1e156227cad9f599524eb10af62a2362f872910a49402dbd2bea2dedc91c/backports.entry_points_selectable-1.1.0-py2.py3-none-any.whl
Collecting distlib<1,>=0.3.1
  Downloading https://files.pythonhosted.org/packages/87/26/f6a23dd3e578132cf924e0dd5d4e055af0cd4ab43e2a9f10b7568bfb39d9/distlib-0.3.2-py2.py3-none-any.whl (338kB)
     |████████████████████████████████| 348kB 36.2MB/s
Collecting platformdirs<3,>=2
  Downloading https://files.pythonhosted.org/packages/88/c4/71ec865898efd2473c7d17d91b95d90e7de5cef9353d42b82c7c5a477d83/platformdirs-2.0.0-py2.py3-none-any.whl
Installing collected packages: backports.entry-points-sel

Check virtual environment version.

In [None]:
!virtualenv --version

virtualenv 20.5.0 from /usr/local/lib/python3.7/dist-packages/virtualenv/__init__.py


In [None]:
!virtualenv my_project

created virtual environment CPython3.7.10.final.0-64 in 790ms
  creator CPython3Posix(dest=/content/my_project, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==21.1.3, setuptools==57.1.0, wheel==0.36.2
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator


In [None]:
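# Note: each `!` command runs in its own subshell, so this activation does not persist to later cells.
!source my_project/bin/activate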
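
Each `!` command in a notebook runs in its own subshell, so an activation performed in one command does not carry over to the next. One way around this, sketched below under the assumption that the environment lives in my_project and uses the POSIX layout, is to call the environment's own executables by path. The helper name venv_command is hypothetical.

```python
from pathlib import PurePosixPath

def venv_command(venv_dir, tool, *args):
    """Build an argument list that runs `tool` from inside a virtualenv.

    Assumes the POSIX layout created by virtualenv (executables under bin/).
    """
    return [str(PurePosixPath(venv_dir) / "bin" / tool), *args]

# Example: a command list that could be passed to subprocess.run
cmd = venv_command("my_project", "pip", "install", "datasets", "transformers")
print(" ".join(cmd))
```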

Install datasets and transformers as well as Accelerate.

In [None]:
! pip install datasets transformers
! pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
# ! pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8.1-cp37-cp37m-linux_x86_64.whl
! pip install git+https://github.com/huggingface/accelerate
# ! pip install accelerate

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/86/27/9c91ddee87b06d2de12f134c5171a49890427e398389f07f6463485723c3/datasets-1.9.0-py3-none-any.whl (262kB)
     |████████████████████████████████| 266kB 5.2MB/s
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
     |████████████████████████████████| 2.5MB 10.5MB/s
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/7d/4f/0a862cad26aa2ed7a7cd87178cbbfa824fc1383e472d63596a0d018374e7/xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243kB)
     |████████████████████████████████| 245kB 36.6MB/s
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/35/03/071adc023c0a7e540cf4652fa9cad13ab32e6ae469bf0cc0262045244812/huggingface_hub-0.0.13-py3-none-any.whl
Collecting fsspec>=2021.05.0
[?25l  Dow

Here are all the imports we will need for this notebook.

In [None]:
import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler

from accelerate import Accelerator, DistributedType
from datasets import load_dataset, load_metric
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    set_seed,
)

from tqdm.auto import tqdm





# Import of Data

Download data from Kaggle 

In [None]:
import os

# Download Kaggle Dataset via the Kaggle API

# Fill in your own Kaggle credentials; never commit a real API key to a shared notebook
kaggle_api = {"username": "<your-kaggle-username>", "key": "<your-kaggle-api-key>"}

import json
with open('/content/kaggle.json', 'w') as file:
    json.dump(kaggle_api, file)

!chmod 600 /content/kaggle.json
os.environ['KAGGLE_CONFIG_DIR'] = "/content"

!kaggle competitions download -c jigsaw-unintended-bias-in-toxicity-classification

!unzip \*.zip  && rm *.zip
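
Hard-coding an API key in a notebook exposes it to anyone who sees the file. A minimal sketch of an alternative, assuming the credentials are supplied through the KAGGLE_USERNAME and KAGGLE_KEY environment variables; the helper name write_kaggle_config is hypothetical:

```python
import json
import os
import stat
from pathlib import Path

def write_kaggle_config(config_dir, env=None):
    """Write kaggle.json from environment variables with owner-only permissions."""
    env = os.environ if env is None else env
    creds = {"username": env["KAGGLE_USERNAME"], "key": env["KAGGLE_KEY"]}
    path = Path(config_dir) / "kaggle.json"
    path.write_text(json.dumps(creds))
    # Equivalent of `chmod 600`: readable and writable by the owner only
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return path
```

Calling write_kaggle_config("/content") would then stand in for the json.dump and chmod steps above.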

Downloading all_data.csv.zip to /content
 97% 316M/326M [00:03<00:00, 96.3MB/s]
100% 326M/326M [00:03<00:00, 97.5MB/s]
Downloading train.csv.zip to /content
 97% 267M/276M [00:02<00:00, 99.6MB/s]
100% 276M/276M [00:02<00:00, 104MB/s] 
Downloading toxicity_individual_annotations.csv.zip to /content
 76% 49.0M/64.7M [00:00<00:00, 111MB/s] 
100% 64.7M/64.7M [00:00<00:00, 129MB/s]
Downloading sample_submission.csv.zip to /content
  0% 0.00/224k [00:00<?, ?B/s]
100% 224k/224k [00:00<00:00, 73.8MB/s]
Downloading identity_individual_annotations.csv.zip to /content
 73% 9.00M/12.3M [00:00<00:00, 49.3MB/s]
100% 12.3M/12.3M [00:00<00:00, 58.9MB/s]
Downloading test_public_expanded.csv.zip to /content
 57% 9.00M/15.9M [00:00<00:00, 39.1MB/s]
100% 15.9M/15.9M [00:00<00:00, 52.7MB/s]
Downloading test_private_expanded.csv.zip to /content
 32% 5.00M/15.8M [00:00<00:00, 30.3MB/s]
100% 15.8M/15.8M [00:00<00:00, 62.5MB/s]
Downloading test.csv.zip to /content
 41% 5.00M/12.1M [00:00<00:00, 47.4MB/s]
100% 

Load the training dataset. A 10% validation split is carved out of it further down.

In [None]:
from datasets import load_dataset

dataset_dict_train = load_dataset('csv', data_files='/content/train.csv')
dataset_train = dataset_dict_train['train']




Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-5ec55c99d77ea78c/0.0.0/e138af468cb14e747fb46a19c787ffcfa5170c821476d20d5304287ce12bbc23...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-5ec55c99d77ea78c/0.0.0/e138af468cb14e747fb46a19c787ffcfa5170c821476d20d5304287ce12bbc23. Subsequent calls will reuse this data.


Select only the first 600,000 rows and hold out 10% of them for validation

In [None]:
# Keep only the first 600,000 rows of the data
my_list = range(600000)
dataset_train = dataset_train.select(my_list)

# Hold out 10% of the selected rows for validation
dataset_train = dataset_train.train_test_split(test_size=0.1)
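
The split sizes follow directly from the row count: holding out 10% of 600,000 rows leaves 540,000 for training and 60,000 for validation. A plain-Python sketch of a shuffled index split, not the datasets implementation; the function name and the seed are assumptions for illustration:

```python
import random

def split_indices(num_rows, test_size=0.1, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]

train_idx, test_idx = split_indices(600_000)
print(len(train_idx), len(test_idx))  # 540000 60000
```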

Create a 'labels' column as a copy of the 'target' column, then binarize it: scores of 0.5 or higher become 1 (toxic), everything else 0 (non toxic)

In [None]:
dataset_train = dataset_train.map(lambda batch: {'labels': batch['target']}, batched = True)


In [None]:
dataset_train = dataset_train.map(lambda example: {'labels': 1 if example['labels'] >= 0.5 else 0})
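
The mapping above turns the continuous target score into a binary label with a cutoff of 0.5. A minimal sketch of the same rule on plain Python values; the function name binarize is hypothetical:

```python
def binarize(score, threshold=0.5):
    """Return 1 (toxic) when the score reaches the threshold, else 0 (non toxic)."""
    return 1 if score >= threshold else 0

scores = [0.0, 0.49, 0.5, 0.83]
print([binarize(s) for s in scores])  # [0, 0, 1, 1]
```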


Change the type of the 'labels' column from float64 to int64

In [None]:
from datasets import ClassLabel, Value

new_features = dataset_train['train'].features.copy()

new_features["labels"] = Value('int64') 

dataset_train['train'] = dataset_train['train'].cast(new_features)


Change the type of the 'labels' column from int64 to ClassLabel

In [None]:
from datasets import ClassLabel, Value

new_features = dataset_train['train'].features.copy()

new_features["labels"] = ClassLabel(names=['non toxic', 'toxic'])  

dataset_train['train'] = dataset_train['train'].cast(new_features)


Function to display 2 random examples from the data

In [None]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=2):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
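
The loop above keeps drawing until it has collected distinct indices. The standard library offers the same behaviour in one call. A small sketch, assuming only that the dataset length is known; the function name pick_indices and the seed argument are hypothetical:

```python
import random

def pick_indices(dataset_len, num_examples=2, seed=None):
    """Pick `num_examples` distinct row indices without replacement."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    return random.Random(seed).sample(range(dataset_len), num_examples)

print(pick_indices(540_000, num_examples=2, seed=0))
```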


In [None]:
show_random_elements(dataset_train["train"])

Unnamed: 0,article_id,asian,atheist,bisexual,black,buddhist,christian,comment_text,created_date,disagree,female,funny,heterosexual,hindu,homosexual_gay_or_lesbian,id,identity_annotator_count,identity_attack,insult,intellectual_or_learning_disability,jewish,labels,latino,likes,male,muslim,obscene,other_disability,other_gender,other_race_or_ethnicity,other_religion,other_sexual_orientation,parent_id,physical_disability,psychiatric_or_mental_illness,publication_id,rating,sad,severe_toxicity,sexual_explicit,target,threat,toxicity_annotator_count,transgender,white,wow
0,165709,,,,,,,"Agreed. There is an old saying that ""it is better to be silent and be thought a fool, than to speak and remove all doubt."" Rex would do well to reflect on that wisdom.",2017-02-03 17:06:31.089929+00,0,,0,,,,944732,0,0.0,0.0,,,non toxic,,0,,,0.0,,,,,,938462.0,,,54,approved,0,0.0,0.0,0.0,0.0,6,,,0
1,148532,,,,,,,"I don't feel embarrassed, Lets sit back and see the truth come out one has been proven to be a lie . What difference at this point does it make?",2016-10-14 19:59:32.932202+00,0,,0,,,,528936,0,0.0,0.0,,,non toxic,,3,,,0.0,,,,,,528782.0,,,21,approved,0,0.0,0.0,0.0,0.0,4,,,0


Choose the tokenizer 

In [None]:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')


Function to clean text: remove stopwords and punctuation, lemmatize, etc.

In [None]:
from bs4 import BeautifulSoup # Text Cleaning
import re, string # Regular Expressions, String
import nltk
from nltk.corpus import stopwords # stopwords
from nltk.stem.porter import PorterStemmer # for word stemming
from nltk.stem import WordNetLemmatizer # for word lemmatization
import unicodedata
import html

nltk.download('stopwords')
nltk.download('wordnet')

# set of stopwords to be removed from text
stop = set(stopwords.words('english'))

# update stopwords to have punctuation too
stop.update(list(string.punctuation))

def clean_text(text):
    
    # Remove unwanted html characters
    re1 = re.compile(r'  +')
    x1 = text.lower().replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
    'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
    '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
    ' @-@ ', '-').replace('\\', ' \\ ')
    text = re1.sub(' ', html.unescape(x1))
    
    # remove non-ascii characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # remove between square brackets
    text = re.sub(r'\[[^]]*\]', '', text)
    
    # remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # remove twitter tags
    text = text.replace("@", "")
    
    # remove hashtags
    text = text.replace("#", "")
    
    # remove all non-alphabetic characters
    text = re.sub(r'[^a-zA-Z ]', '', text)
    
    # remove stopwords from text
    final_text = []
    for word in text.split():
        if word.strip().lower() not in stop:
            final_text.append(word.strip().lower())
    
    text = " ".join(final_text)
    
    # lemmatize words
    lemmatizer = WordNetLemmatizer()    
    text = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
    text = " ".join([lemmatizer.lemmatize(word, pos = 'v') for word in text.split()])
    
    # replace all digits with "num" (no effect here: digits were already removed above)
    text = re.sub(r"\d", "num", text)
    
    return text.lower()
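
To see the effect of the regular-expression steps in isolation, here is a reduced sketch that uses only the standard library. It skips the NLTK stopword list and the lemmatizer, and the tiny stopword set below is an assumption for illustration, so its output will differ from clean_text on real data.

```python
import re

# Illustrative stopword set; the notebook uses NLTK's full English list instead
STOP = {"the", "a", "is", "at", "this"}

def clean_text_lite(text):
    """Lowercase, then drop bracketed spans, URLs, non-letters and stopwords."""
    text = text.lower()
    text = re.sub(r"\[[^]]*\]", "", text)   # remove [bracketed] spans
    text = re.sub(r"http\S+", "", text)     # remove URLs
    text = re.sub(r"[^a-z ]", "", text)     # keep letters and spaces only
    return " ".join(w for w in text.split() if w not in STOP)

print(clean_text_lite("Look at THIS [link] http://example.com #wow, 2 times!"))
# look wow times
```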

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Apply the clean_text function to the 'comment_text' column of the dataset

In [None]:
dataset_train = dataset_train.map(lambda example: {'comment_text': clean_text(example['comment_text'])})


Create tokenized dataset

In [None]:
tokenized_dataset = dataset_train.map(lambda batch: tokenizer(batch["comment_text"], truncation=True, padding="max_length", max_length=128), batched=True, \
                  remove_columns=['id', 'comment_text', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat', 'asian', 'atheist', 'bisexual', 'black', \
                                  'buddhist', 'christian', 'female', 'heterosexual', 'hindu', 'homosexual_gay_or_lesbian', 'intellectual_or_learning_disability', \
                                  'jewish', 'latino', 'male', 'muslim', 'other_disability', 'other_gender', 'other_race_or_ethnicity', 'other_religion', 'other_sexual_orientation',\
                                  'physical_disability', 'psychiatric_or_mental_illness', 'target', 'transgender', 'white', 'created_date', \
                                  'publication_id', 'parent_id', 'article_id', 'rating', 'funny', 'wow', 'sad', 'likes', 'disagree', 'sexual_explicit',\
                                  'identity_annotator_count', 'toxicity_annotator_count'])
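
With truncation=True, padding="max_length" and max_length=128, every example comes out with exactly 128 token ids plus a mask marking which positions hold real tokens. A plain-Python sketch of that shape logic, not the tokenizer's implementation: it ignores special tokens, the input ids are arbitrary, and pad id 0 is an assumption.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask

ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```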


In [None]:
tokenized_dataset["train"].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'labels': ClassLabel(num_classes=2, names=['non toxic', 'toxic'], names_file=None, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

Set the torch format for the dataset attributes

In [None]:
tokenized_dataset.set_format("torch")

Choose the transformer model

In [None]:
model_checkpoint = "bert-base-uncased"

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Downloading the testing data: Davidson et al 2017 data

In [None]:
import pandas as pd

# Loading t-davidson raw dataset directly from the github

url = 'https://raw.githubusercontent.com/t-davidson/hate-speech-and-offensive-language/master/data/labeled_data.csv'
test_davidson = pd.read_csv(url)

from datasets import load_dataset
from datasets import Dataset

test_davidson = Dataset.from_pandas(test_davidson)

test_davidson = test_davidson.filter(lambda example: example['class'] != 1)

test_davidson = test_davidson.map(lambda example: {'labels': 1 if example['class'] == 0 else 0})

from datasets import ClassLabel, Value

new_features = test_davidson.features.copy()

new_features["labels"] = ClassLabel(names=['non toxic', 'toxic'])  

test_davidson = test_davidson.cast(new_features)

test_davidson = test_davidson.map(lambda example: {'tweet': clean_text(example['tweet'])})
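
The filter and map above reduce the three classes in the Davidson data to two. In that dataset, class 0 is hate speech, 1 is offensive language and 2 is neither; the code drops class 1 and treats class 0 as toxic. A compact sketch of that rule, with a hypothetical function name:

```python
def davidson_label(cls):
    """Map a Davidson class id to a binary label, or None if the row is dropped."""
    if cls == 1:        # offensive language: filtered out
        return None
    return 1 if cls == 0 else 0  # hate speech -> toxic, neither -> non toxic

rows = [0, 1, 2, 2, 1, 0]
labels = [davidson_label(c) for c in rows if davidson_label(c) is not None]
print(labels)  # [1, 0, 0, 1]
```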


In [None]:
test_davidson = test_davidson.train_test_split(test_size=0.1) 

In [None]:
test1_tokenized_dataset = test_davidson.map(lambda batch: tokenizer(batch["tweet"], truncation=True, padding="max_length", max_length=128), batched=True, \
                  remove_columns=['count', 'hate_speech', 'neither', 'offensive_language', 'Unnamed: 0', 'class', 'tweet'])


In [None]:
test1_tokenized_dataset.set_format("torch")

Uploading the testing data: HASOC 2019

In [None]:
from google.colab import files
uploaded = files.upload()

Saving english_dataset.tsv to english_dataset.tsv


In [None]:
import io

test_hasoc = pd.read_csv(io.BytesIO(uploaded['english_dataset.tsv']), sep='\t')

In [None]:
test_hasoc = Dataset.from_pandas(test_hasoc)

test_hasoc = test_hasoc.filter(lambda example: example['task_2'] != 'PRFN' and example['task_2'] != 'OFFN')

test_hasoc = test_hasoc.map(lambda example: {'labels': 1 if example['task_2'] == 'HATE' else 0})

new_features = test_hasoc.features.copy()

new_features["labels"] = ClassLabel(names=['non toxic', 'toxic'])  

test_hasoc = test_hasoc.cast(new_features)

test_hasoc = test_hasoc.map(lambda example: {'text': clean_text(example['text'])})


In [None]:
test_hasoc = test_hasoc.train_test_split(test_size=0.1) 

In [None]:
test2_tokenized_dataset = test_hasoc.map(lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128), batched=True, \
                  remove_columns=['task_1', 'task_2', 'task_3', 'text_id', 'text'])


In [None]:
test2_tokenized_dataset.set_format("torch")

In [None]:
def create_dataloaders(train_batch_size=8, eval_batch_size=8, test1_batch_size=8, test2_batch_size=8):
    train_dataloader = DataLoader(
        tokenized_dataset["train"], shuffle=True, batch_size=train_batch_size
    )
    eval_dataloader = DataLoader(
        tokenized_dataset["test"], shuffle=False, batch_size=eval_batch_size
    )
    test1_dataloader_train = DataLoader(
        test1_tokenized_dataset["train"], shuffle=True, batch_size=test1_batch_size
    )
    test1_dataloader_eval = DataLoader(
        test1_tokenized_dataset["test"], shuffle=False, batch_size=test1_batch_size
    )
    test2_dataloader_train = DataLoader(
        test2_tokenized_dataset["train"], shuffle=True, batch_size=test2_batch_size
    )
    test2_dataloader_eval = DataLoader(
        test2_tokenized_dataset["test"], shuffle=False, batch_size=test2_batch_size
    )
    return train_dataloader, eval_dataloader, test1_dataloader_train, test1_dataloader_eval, test2_dataloader_train, test2_dataloader_eval

In [None]:
train_dataloader, eval_dataloader, test1_dataloader_train, test1_dataloader_eval, test2_dataloader_train, test2_dataloader_eval = create_dataloaders()

In [None]:
for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})
    outputs = model(**batch)
    break

{'attention_mask': torch.Size([8, 128]), 'input_ids': torch.Size([8, 128]), 'labels': torch.Size([8]), 'token_type_ids': torch.Size([8, 128])}


In [None]:
outputs

SequenceClassifierOutput([('loss', tensor(1.2487, grad_fn=<NllLossBackward>)),
                          ('logits', tensor([[-0.7981,  0.2546],
                                   [-0.7392,  0.2166],
                                   [-0.6511,  0.3472],
                                   [-0.3189,  0.3405],
                                   [-0.2938,  0.4627],
                                   [-0.6182,  0.4291],
                                   [-0.5023,  0.3663],
                                   [-0.7150,  0.2114]], grad_fn=<AddmmBackward>))])

In [None]:
%%file NewMetric.py
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.

import datasets
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, roc_auc_score

# TODO: Add BibTeX citation
_CITATION = """  """

# TODO: Add description of the metric here
_DESCRIPTION = """  """


# TODO: Add description of the arguments of the metric here
_KWARGS_DESCRIPTION = """  """

@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class NewMetric(datasets.Metric):
    """TODO: Short description of my metric."""

    def _info(self):
        # TODO: Specifies the datasets.MetricInfo object
        return datasets.MetricInfo(
            # This is the description that will appear on the metrics page.
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
            features=datasets.Features({
                'predictions': datasets.Value('int64'),
                'references': datasets.Value('int64'),
            }),
        )

    def _compute(self, predictions, references):

        precision, recall, f1, _ = precision_recall_fscore_support(references, predictions, average='binary')
        acc = accuracy_score(references, predictions)

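        try:
            auroc = roc_auc_score(references, predictions)
        except ValueError:
            # roc_auc_score raises ValueError when only one class is present in references;
            # fall back to chance level in that case
            auroc = 0.5

        return {
            'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall,
            'auroc': auroc,
        }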

Writing NewMetric.py
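
For reference, most of the quantities returned by _compute can be written out by hand for the binary case. The sketch below is a plain-Python illustration, not the scikit-learn implementation, and it leaves out AUROC. It returns 0.0 where a ratio is undefined, which mirrors the value scikit-learn reports (with a warning) in that situation.

```python
def binary_metrics(predictions, references):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == 1 and r == 1)
    fp = sum(1 for p, r in pairs if p == 1 and r == 0)
    fn = sum(1 for p, r in pairs if p == 0 and r == 1)
    correct = sum(1 for p, r in pairs if p == r)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1,
            "precision": precision, "recall": recall}

# Every prediction is 1 while every reference is 0
print(binary_metrics([1] * 8, [0] * 8))
```

A batch in which an untrained model predicts class 1 for every example while all references are 0 scores zero on every one of these quantities.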


In [None]:
metric = load_metric('/content/NewMetric.py')

In [None]:
predictions = outputs.logits.detach().argmax(dim=-1)
metric.compute(predictions=predictions, references=batch["labels"])

  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.0, 'auroc': 0.5, 'f1': 0.0, 'precision': 0.0, 'recall': 0.0}

In [None]:
hyperparameters = {
    "learning_rate": 1e-5,
    "num_epochs": 3,
    "train_batch_size": 8, # Actual batch size will be this x 8
    "eval_batch_size": 8, # Actual batch size will be this x 8
    "test1_batch_size": 8, # Actual batch size will be this x 8
    "test2_batch_size": 8, # Actual batch size will be this x 8
    "seed": 42,
}
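
On 8 TPU cores each process draws its own batch, so the effective batch size is the per-core size times 8. A short sketch of the arithmetic, using the 540,000 training rows from the split earlier; the function names are hypothetical:

```python
import math

def effective_batch_size(per_core_batch, num_cores=8):
    """Number of examples consumed per optimizer step across all cores."""
    return per_core_batch * num_cores

def steps_per_epoch(num_examples, per_core_batch, num_cores=8):
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / effective_batch_size(per_core_batch, num_cores))

print(effective_batch_size(8))        # 64
print(steps_per_epoch(540_000, 8))    # 8438
```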

In [None]:
import transformers

def training_function():
    # Initialize accelerator
    accelerator = Accelerator()

    # To have only one message (and not 8) per logs of Transformers or Datasets, we set the logging verbosity
    # to INFO for the main process only.
    if accelerator.is_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    train_dataloader, eval_dataloader, test1_dataloader_train, test1_dataloader_eval, test2_dataloader_train, test2_dataloader_eval = create_dataloaders(
        train_batch_size=hyperparameters["train_batch_size"], eval_batch_size=hyperparameters["eval_batch_size"],
        test1_batch_size=hyperparameters["test1_batch_size"], test2_batch_size=hyperparameters["test2_batch_size"]
    )
    # The seed needs to be set before we instantiate the model, as it will determine the random head.
    set_seed(hyperparameters["seed"])

    # Instantiate the model, let Accelerate handle the device placement.
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=500,
        num_training_steps=len(train_dataloader) * num_epochs,
    )

    # Instantiate a progress bar to keep track of training. Note that we only enable it on the main
    # process to avoid having 8 progress bars.
    progress_bar = tqdm(range(num_epochs * len(train_dataloader)), disable=not accelerator.is_main_process)

    # Now we train the model on Jigsaw data
    for epoch in range(num_epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 8 TPUs to have them all.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels.
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(tokenized_dataset["test"])]
        all_labels = torch.cat(all_labels)[:len(tokenized_dataset["test"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)

    ###  transfer learning/training

    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])
    
    model, optimizer, test1_dataloader_train, test1_dataloader_eval = accelerator.prepare(
        model, optimizer, test1_dataloader_train, test1_dataloader_eval
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=500,
        num_training_steps=len(test1_dataloader_train) * num_epochs,
    )

    progress_bar = tqdm(range(num_epochs * len(test1_dataloader_train)), disable=not accelerator.is_main_process)

    # Now we perform transfer learning/training on Davidson et al 2017 data

    for epoch in range(num_epochs):

        model.train()

        # Freeze the embeddings and the first 6 encoder layers
        for param in model.bert.embeddings.parameters():
            param.requires_grad = False

        for layer in model.bert.encoder.layer[:6]:
            for param in layer.parameters():
                param.requires_grad = False

        for step, batch in enumerate(test1_dataloader_train):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(test1_dataloader_eval):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # We gather predictions and labels from the 8 TPUs to have them all.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels.
        # The last thing we need to do is to truncate the predictions and labels we concatenated
        # together as the prepared evaluation dataloader has a little bit more elements to make
        # batches of the same size on each process.
        all_predictions = torch.cat(all_predictions)[:len(test1_tokenized_dataset["test"])]
        all_labels = torch.cat(all_labels)[:len(test1_tokenized_dataset["test"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"Davidson et al 2017 data -- epoch {epoch}:", eval_metric)

    ######################################################################################################

    # Now we perform transfer learning/training on HASOC data

    optimizer = AdamW(params=model.parameters(), lr=hyperparameters["learning_rate"])

    model, optimizer, test2_dataloader_train, test2_dataloader_eval = accelerator.prepare(
        model, optimizer, test2_dataloader_train, test2_dataloader_eval
    )

    num_epochs = hyperparameters["num_epochs"]
    # Instantiate learning rate scheduler after preparing the training dataloader as the prepare method
    # may change its length.
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=500,
        num_training_steps=len(test2_dataloader_train) * num_epochs,
    )

    progress_bar = tqdm(range(num_epochs * len(test2_dataloader_train)), disable=not accelerator.is_main_process)

    for epoch in range(num_epochs):

        model.train()

        # Freeze the embeddings and the first 6 encoder layers
        for param in model.bert.embeddings.parameters():
            param.requires_grad = False

        for layer in model.bert.encoder.layer[:6]:
            for param in layer.parameters():
                param.requires_grad = False

        for step, batch in enumerate(test2_dataloader_train):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        model.eval()
        all_predictions = []
        all_labels = []

        for step, batch in enumerate(test2_dataloader_eval):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)

            # Gather predictions and labels from all 8 TPU cores so that each process has the full set.
            all_predictions.append(accelerator.gather(predictions))
            all_labels.append(accelerator.gather(batch["labels"]))

        # Concatenate all predictions and labels, then truncate them to the dataset
        # length: the prepared evaluation dataloader adds a few extra samples so that
        # every process receives batches of the same size.
        all_predictions = torch.cat(all_predictions)[:len(test2_tokenized_dataset["test"])]
        all_labels = torch.cat(all_labels)[:len(test2_tokenized_dataset["test"])]

        eval_metric = metric.compute(predictions=all_predictions, references=all_labels)

        # Use accelerator.print to print only on the main process.
        accelerator.print(f"HASOC data -- epoch {epoch}:", eval_metric)
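
The truncation at the end of each evaluation loop can be pictured with a small plain-Python model. This is only a sketch under stated assumptions, not Accelerate's actual sampler: it assumes the extra samples wrap around to the start of the dataset and that gathered results come back in dataset order. The helper name `gathered_indices` and the sizes used are hypothetical.

```python
def gathered_indices(num_samples, batch_size, num_processes):
    """Sample indices seen after gathering, assuming the dataloader wraps
    around to the start so that every process gets full, equal-sized batches."""
    global_batch = batch_size * num_processes
    # Round the total up to the next multiple of the global batch size.
    total = -(-num_samples // global_batch) * global_batch
    return [i % num_samples for i in range(total)]


# Illustrative sizes only: 50 samples, batches of 4 on each of 8 processes.
gathered = gathered_indices(num_samples=50, batch_size=4, num_processes=8)
truncated = gathered[:50]  # same idea as [:len(dataset)] in the training function

print(len(gathered), len(truncated))
```

Under these assumptions the gathered list is longer than the dataset, and slicing to the dataset length drops exactly the repeated samples.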

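The schedule returned by `get_linear_schedule_with_warmup` scales the learning rate linearly from 0 to its peak over `num_warmup_steps`, then linearly back to 0 at `num_training_steps`. The sketch below re-creates that multiplier in plain Python for illustration; the helper name `linear_warmup_multiplier` is hypothetical and this is an approximation of the rule, not the library code. It is evaluated with the step totals reported by the progress bars in the run output (237 and 201). Under these assumptions, with 500 warmup steps both fine-tuning stages finish while still in warmup, so the learning rate would not reach the configured peak.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Step totals taken from the progress bars in the run output.
stage_steps = {"Davidson et al 2017": 237, "HASOC": 201}

# Multiplier applied at the last optimizer step of each stage.
final_multipliers = {
    name: linear_warmup_multiplier(total - 1, 500, total)
    for name, total in stage_steps.items()
}
for name, multiplier in final_multipliers.items():
    print(f"{name}: multiplier at last step = {multiplier:.3f}")
```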

In [None]:
%%time

from accelerate import notebook_launcher
 
notebook_launcher(training_function)

Launching a training on 8 TPU cores.


loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/tra

HBox(children=(FloatProgress(value=0.0, max=25314.0), HTML(value='')))

epoch 0: {'accuracy': 0.9485, 'f1': 0.6608122941822173, 'precision': 0.7131011608623549, 'recall': 0.6156678257312334, 'auroc': 0.7968469955532833}
epoch 1: {'accuracy': 0.9496166666666667, 'f1': 0.657838143746463, 'precision': 0.7364419665484034, 'recall': 0.5943955819185928, 'auroc': 0.7877622880651374}
epoch 2: {'accuracy': 0.9483166666666667, 'f1': 0.670071284179168, 'precision': 0.6982261640798226, 'recall': 0.6440989977500511, 'auroc': 0.8097016917221885}


HBox(children=(FloatProgress(value=0.0, max=237.0), HTML(value='')))

Davidson et al 2017 data -- epoch 0: {'accuracy': 0.8803571428571428, 'f1': 0.7615658362989324, 'precision': 0.722972972972973, 'recall': 0.8045112781954887, 'auroc': 0.8542462714162455}
Davidson et al 2017 data -- epoch 1: {'accuracy': 0.925, 'f1': 0.8421052631578947, 'precision': 0.8421052631578947, 'recall': 0.8421052631578947, 'auroc': 0.8964624676445211}
Davidson et al 2017 data -- epoch 2: {'accuracy': 0.9267857142857143, 'f1': 0.8452830188679245, 'precision': 0.8484848484848485, 'recall': 0.8421052631578947, 'auroc': 0.8976334278318747}


HBox(children=(FloatProgress(value=0.0, max=201.0), HTML(value='')))

HASOC data -- epoch 0: {'accuracy': 0.6751054852320675, 'f1': 0.23000000000000004, 'precision': 0.26436781609195403, 'recall': 0.20353982300884957, 'auroc': 0.5131272522246464}
HASOC data -- epoch 1: {'accuracy': 0.759493670886076, 'f1': 0.13636363636363638, 'precision': 0.47368421052631576, 'recall': 0.07964601769911504, 'auroc': 0.5259725933370921}
HASOC data -- epoch 2: {'accuracy': 0.770042194092827, 'f1': 0.08403361344537814, 'precision': 0.8333333333333334, 'recall': 0.04424778761061947, 'auroc': 0.5207388522540632}
CPU times: user 2min 24s, sys: 23.4 s, total: 2min 48s
Wall time: 4h 31min 51s
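
As a sanity check on the logged metrics, each `f1` value should equal the harmonic mean of the corresponding `precision` and `recall`. A minimal sketch, assuming the standard F1 definition; the helper `f1_from` is hypothetical and the inputs are copied from the HASOC lines above (epochs 0 and 2).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, logged f1) copied from the HASOC output above.
logged = [
    (0.26436781609195403, 0.20353982300884957, 0.23000000000000004),
    (0.8333333333333334, 0.04424778761061947, 0.08403361344537814),
]

recomputed = [f1_from(precision, recall) for precision, recall, _ in logged]
for (_, _, f1), value in zip(logged, recomputed):
    print(f"logged f1 = {f1:.6f}, recomputed f1 = {value:.6f}")
```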
