# Fine-tuning models

Note: before running code in this notebook, make sure a GPU is being used. Go to Edit > Notebook settings and select GPU under "Hardware accelerator".

Model used in this notebook:

In [41]:
model_checkpoint = 'bert-base-uncased'

Repo to write the final model to; change name based on the base model specified above

In [42]:
hub_model_id = f'repro-rights-amicus-briefs/{model_checkpoint}-finetuned-RRamicus'

# Set up environment

1. Load required packages
2. Log into HuggingFace w/access token
3. Load dataset from HuggingFace website

In [None]:
! pip install transformers
! pip install torch
! pip install datasets

In [31]:
import numpy as np
import pandas as pd

from html import unescape
from random import randint
import math

from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
#import datasets
from datasets import load_dataset, load_metric, Dataset
from transformers import DataCollatorWithPadding
from transformers import AdamW
from transformers import AutoTokenizer, AutoModelForMaskedLM, TrainingArguments, Trainer
from tokenizers import normalizers
from tokenizers.normalizers import BertNormalizer
from transformers import DataCollatorForLanguageModeling

from huggingface_hub import notebook_login

import torch
#from torch.nn import functional as F

Log in to Hugging Face to access the amicus files as a `datasets` dataset object

In [5]:
# run this once at the start of the session so it saves the token you enter
# in the login in the next code chunk
!git config --global credential.helper store

In [6]:
# get access token on Huggingface website > settings > access token (make sure it's a write token)
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


Load data from HuggingFace Hub

In [None]:
ds_path = 'repro-rights-amicus-briefs/repro-rights-amicus'
# use_auth_token must be true bc this is a private dataset
ds = load_dataset(ds_path, use_auth_token=True)

Check: the dataset should have train/test/valid splits, each with at least `id` and `text` columns

In [8]:
# check
ds

DatasetDict({
    valid: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 178
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 149
    })
    train: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 414
    })
})
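
The same check can be made programmatic. This is a minimal sketch, not part of the original notebook: the helper name `missing_columns` is hypothetical, and the mapping below is copied by hand from the output above so the snippet runs without access to the private dataset.

```python
def missing_columns(column_names, required=("id", "text")):
    """Return {split: [missing required columns]} for splits lacking any."""
    problems = {}
    for split, cols in column_names.items():
        missing = [c for c in required if c not in cols]
        if missing:
            problems[split] = missing
    return problems


# Split -> columns, copied from the DatasetDict printed above
columns = {
    "valid": ["case", "brief", "id", "text", "brief_party"],
    "test": ["case", "brief", "id", "text", "brief_party"],
    "train": ["case", "brief", "id", "text", "brief_party"],
}

problems = missing_columns(columns)
print(problems or "all splits have the required columns")
```

With the real dataset, passing `ds.column_names` (a split-to-columns mapping) instead of the hand-written dictionary should work the same way.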

# Pre-process inputs

Before tokenizing, we can do some very minimal text pre-processing. Our text is already lowercase, but a few other simple steps are possible.

1. Remove HTML character references that sometimes show up in PDF/legal documents. For example, change '&amp;' to '&', which transformer models can understand.
2. Lowercasing. Consider using the BERT normalizer later (this ensures the text gets the same preprocessing steps used for text in BERT models).
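
To illustrate step 1, here is a small sketch of what `html.unescape` from the standard library does to a few strings. The sample strings are made up for illustration and are not taken from the briefs.

```python
from html import unescape

# Made-up strings containing HTML character references
samples = [
    "health &amp; safety",
    "&quot;undue burden&quot;",
    "fewer than 5 &lt; 10 weeks",
    "no entities here",
]

cleaned = [unescape(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Strings without any character references pass through unchanged, so the step is safe to apply to every document.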

In [9]:
# remove html characters if they exist! 
ds = ds.map(
    lambda x: {"text": [unescape(o) for o in x["text"]]}, batched=True
)

# lowercase (we've already done this)
#def lowercase_condition(example):
#    return {"condition": example["condition"].lower()}
#ds = ds.map(lowercase_condition)

# normalize text for Bert
#normalizer = normalizers.BertNormalizer()


Next, we instantiate the tokenizer. Here we use `AutoTokenizer` and specify a model checkpoint; `AutoTokenizer` grabs the correct tokenizer for that model without us having to name it. We could also use the model's tokenizer class directly.

In [None]:
#instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
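
To give a feel for what a subword tokenizer does, here is a toy sketch of greedy longest-match splitting in the style of WordPiece, where `##` marks a piece that continues a word. The vocabulary and the function name `wordpiece` are made up for illustration; the real tokenizer loaded above has a far larger vocabulary and also normalizes the text first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary
toy_vocab = {"via", "##bility", "##ity", "brief", "##s", "amicus"}

for w in ["viability", "briefs", "amicus", "xyz"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

Because a single word can become several pieces, the token count of a document is usually at least as large as its word count.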

Next, we tokenize. Note that even with a subword tokenizer, many of our documents still exceed the maximum input length of BERT-style transformer models (512 tokens).

Since we don't want to cut off our text after the first 512 tokens, we instead use `tokenize_and_split`, which keeps the "overflow tokens". We also don't want a hard cut after 512 tokens: we want to retain some overlap in case an idea is expressed across the boundary between two splits.

See [transformers tutorial](https://huggingface.co/course/chapter5/3?fw=pt) for more info
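
The idea of overlapping splits can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: it ignores special tokens and padding, and the function name `split_with_overlap` is hypothetical. Each chunk holds at most `max_length` tokens, and consecutive chunks share `stride` tokens.

```python
def split_with_overlap(token_ids, max_length=512, stride=128):
    """Split a token list into chunks that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this chunk reached the end of the document
        start += step
    return chunks


# A stand-in "document" of 1000 token ids
tokens = list(range(1000))
chunks = split_with_overlap(tokens)
print([len(c) for c in chunks])
# The last 128 tokens of one chunk are the first 128 of the next
print(chunks[0][-128:] == chunks[1][:128])
```

A larger `stride` preserves more context across boundaries at the cost of more, partly redundant, rows.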

In [12]:
# see the tutorial linked above for how to break up long texts
# instead of returning 1 row per tokenized text, we may return multiple rows;
#   with this version, we also keep our metadata by replicating it across
#   all of the newly created rows
def tokenize_and_split(examples):
    result = tokenizer(
        examples["text"],
        truncation = True,
        max_length = 512,
        stride = 128,
        return_overflowing_tokens = True,
        padding = 'max_length',
        return_special_tokens_mask=True
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
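
The metadata replication at the end of `tokenize_and_split` can be illustrated with plain lists. In this sketch the `sample_map` values and metadata are made up: three documents that were split into three, one, and two chunks.

```python
# Metadata for a batch of three (hypothetical) documents
examples = {"id": [10, 11, 12], "brief_party": [0, 1, 0]}

# One entry per chunk, giving the index of the document it came from
sample_map = [0, 0, 0, 1, 2, 2]

# Repeat each document's metadata once per chunk, as in tokenize_and_split
replicated = {
    key: [values[i] for i in sample_map]
    for key, values in examples.items()
}

print(replicated["id"])
print(replicated["brief_party"])
```

Every output column then has one entry per chunk, which is what `map` requires when the number of rows changes.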

Tokenize

In [None]:
# do the tokenizing using the map function; the original columns are removed
#   because the number of rows changes, and tokenize_and_split adds the
#   metadata back, replicated for each chunk
tokenized_ds = ds.map(tokenize_and_split,
                      batched = True,
                      batch_size = 100,
                      remove_columns=['case', 'brief', 'id', 'text', 'brief_party'])

As a result, we have tokenized our train, val, and test sets

In [40]:
tokenized_ds

DatasetDict({
    valid: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3821
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 3111
    })
    train: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 8901
    })
})

In [37]:
tokenized_ds['train'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'brief': Value(dtype='string', id=None),
 'brief_party': Value(dtype='int64', id=None),
 'case': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'special_tokens_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'text': Value(dtype='string', id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

# Train Model

Code from here is adapted from a [huggingface example notebook on language modeling](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)

Download the model from huggingface and cache

In [16]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, we add a label column to the dataset. Since the task is to mask a word in a sentence and predict it, the label is the full, unmasked sentence, i.e., a copy of `input_ids`. (This column may not actually be needed: `DataCollatorForLanguageModeling` builds its own `labels` from `input_ids` when it masks tokens.)

In [17]:
def gen_label(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples


tokenized_ds = tokenized_ds.map(gen_label,
                                batched=True,
                                batch_size=500)


In [18]:
# check that the above worked: labels should equal input_ids
print(tokenized_ds['train']['input_ids'][0])
print(tokenized_ds['train']['labels'][0])

[101, 1996, 11324, 2645, 2003, 1037, 4736, 2090, 1996, 4372, 17897, 9250, 2157, 2000, 2166, 1998, 1996, 16655, 19172, 16848, 2157, 2000, 11324, 1012, 3081, 8553, 14623, 2000, 2035, 5022, 1010, 2025, 2074, 16725, 1012, 3081, 8553, 5761, 1996, 3754, 2000, 7438, 1996, 5776, 2030, 10527, 1010, 2025, 1996, 16112, 6459, 2030, 8438, 1997, 1996, 5776, 2030, 10527, 1012, 3081, 8553, 3141, 2000, 21371, 16725, 2038, 2061, 5301, 2144, 3381, 2008, 2116, 5190, 1997, 16725, 2031, 3728, 2042, 11113, 15613, 2012, 5535, 2073, 2073, 2027, 2089, 2031, 2042, 2641, 2512, 14874, 1999, 3381, 1010, 2021, 2651, 2027, 2052, 2031, 1037, 2488, 2084, 3938, 1003, 3382, 2000, 5788, 2065, 5459, 2512, 14196, 2013, 2045, 2388, 1012, 3522, 9849, 1999, 2966, 2671, 5769, 2008, 2023, 2095, 1010, 2901, 1010, 4022, 3081, 8553, 2071, 2022, 2004, 2220, 2004, 1996, 2203, 1997, 1996, 2034, 12241, 20367, 1012, 1008, 8890, 2023, 4766, 15841, 9849, 1999, 3949, 1997, 5186, 21371, 3653, 10280, 4286, 2405, 2205, 2397, 2000, 2022, 14964

The task here is masked language modeling. How do we mask random tokens? The Transformers library has a data collator that randomly masks tokens for us!

In [19]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)
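
As a rough picture of what the collator does, here is a pure-Python sketch of BERT-style masking: each non-special token is selected with probability `mlm_probability`; a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. Unselected positions get the label -100 so the loss ignores them. The function `mask_tokens` and the hard-coded ids (assumed to match `bert-base-uncased`) are illustrative assumptions, not the library's code.

```python
import random

# Assumed ids for bert-base-uncased: [PAD]=0, [CLS]=101, [SEP]=102, [MASK]=103
MASK_ID = 103
SPECIAL_IDS = {0, 101, 102}
VOCAB_SIZE = 30522


def mask_tokens(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(input_ids):
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # position not selected for prediction
        labels[i] = tok  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)
        # otherwise keep the original token in place
    return masked, labels


input_ids = [101, 1996, 11324, 2645, 2003, 1037, 4736, 2090, 102, 0, 0]
masked, labels = mask_tokens(input_ids, rng=random.Random(0))
print(masked)
print(labels)
```

Because masking happens when batches are assembled, each epoch sees a different random mask over the same text.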

Note that `Trainer` defaults to the AdamW optimizer with a linear learning-rate scheduler, so we don't set these up ourselves.

Define the training parameters

In [20]:
# run this to be able to push model to the hub
!apt-get install git-lfs



In [28]:
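#set training arguments
batch_size = 8
# steps per epoch; count the tokenized chunks, not the raw documents
# (only takes effect if logging_strategy is switched to 'steps')
logging_steps = len(tokenized_ds["train"]) // batch_size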

training_args = TrainingArguments('test-trainer',
                                  logging_strategy='epoch',
                                  evaluation_strategy='epoch',
                                  save_strategy='epoch',
                                  report_to='all',
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  load_best_model_at_end = True,
                                  push_to_hub=True,
                                  hub_model_id=hub_model_id,
                                  #output_dir=f'{model_checkpoint}-finetuned-RRamicus',
                                  overwrite_output_dir=True,
                                  logging_steps=logging_steps,
                                  # parameters to tweak
                                  learning_rate=2e-5,
                                  num_train_epochs=10,
                                  weight_decay=0.01)

#setup training loop with arguments
trainer = Trainer(model = model, 
                  args = training_args,
                  data_collator = data_collator,
                  tokenizer = tokenizer,
                  train_dataset = tokenized_ds['train'],
                  eval_dataset = tokenized_ds['valid'])

PyTorch: setting up devices
/content/test-trainer is already a clone of https://huggingface.co/repro-rights-amicus-briefs/bert-base-uncased-finetuned-RRamicus. Make sure you pull the latest changes with `repo.git_pull()`.
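
For a sense of scale, here is a back-of-the-envelope sketch of how many optimization steps these settings imply and how the learning rate would evolve. It assumes a single GPU, no gradient accumulation, and a linear decay to zero without warmup; the helper `linear_lr` is hypothetical and only mimics that schedule.

```python
import math

num_train_rows = 8901  # rows in tokenized_ds['train'], from the output above
batch_size = 8
num_epochs = 10
base_lr = 2e-5

# The last, smaller batch of each epoch still counts as a step
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_lr(step, total=total_steps, lr=base_lr):
    """Learning rate after `step` updates under linear decay to zero."""
    return lr * max(0.0, (total - step) / total)


print(steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate is halved midway through training and reaches zero at the final step.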


Train!

Before running this, go to Edit -> Notebook Settings and make sure to select GPU under "Hardware accelerator"

In [None]:
#train
trainer.train()

In [33]:
# if run into cuda out of memory error, try
# this first
#torch.cuda.empty_cache()
# and if no change, try this
#import gc
#del trainer
#gc.collect()


# Evaluate trained model

The main metrics used to evaluate language models are:

* [perplexity](https://huggingface.co/docs/transformers/perplexity): built from P(word | context), where the context is the k-1 preceding tokens: P($X_k$ | $X_{<k}$)
   - lower values are better
   - is just the exponentiated average cross entropy, e^(average(CE)); equivalently, e^(-average log-likelihood)
   - a Hugging Face tutorial described a value of 21 as "somewhat larger", which is the only frame of reference I have at this point
* cross entropy loss
  - lower values are better
  - gives no additional information once perplexity is computed
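
The relationship between the two metrics can be checked on toy numbers. In this sketch the token probabilities are made up; perplexity is computed as the exponential of the average cross entropy, which matches the `math.exp(eval_loss)` calls in the evaluation cells.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = [-math.log(p) for p in token_probs]  # per-token cross entropy
    mean_ce = sum(nll) / len(nll)
    return math.exp(mean_ce)


# Made-up probabilities a model assigned to the correct tokens
probs = [0.5, 0.25, 0.125, 0.125]
print(f"Perplexity: {perplexity(probs):.2f}")

# A model that guesses uniformly among 50 options has perplexity 50
print(f"Uniform over 50: {perplexity([1 / 50] * 7):.2f}")
```

The uniform case gives a useful reading of the number: a perplexity of N means the model is, on average, as uncertain as if it were choosing uniformly among N tokens.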


Note that although the data is split into "train", "test", and "valid", this distinction does not matter here: we trained a language model using these splits, but we are using the language model on a different task.

## Model 1

In [None]:
eval_train = trainer.evaluate(tokenized_ds['train'])
print(f"Train perplexity: {math.exp(eval_train['eval_loss']):.2f}")

In [None]:
eval_valid = trainer.evaluate(tokenized_ds['valid'])
print(f"Validation perplexity: {math.exp(eval_valid['eval_loss']):.2f}")

In [None]:
eval_test = trainer.evaluate(tokenized_ds['test'])
print(f"Test perplexity: {math.exp(eval_test['eval_loss']):.2f}")