<a href="https://colab.research.google.com/github/amandakonet/amicus-iv/blob/main/nlp/mlm_bert_base_uncased.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning models

Model used in this notebook:

In [1]:
model_checkpoint = 'bert-base-uncased'

Repo to write the final model to; change name based on the base model specified above

In [2]:
hub_model_id = 'amandakonet/reprorights-amicus-bert'

## Set up environment

1. Load required packages
2. Log into HuggingFace w/access token
3. Load dataset from HuggingFace website

In [None]:
! pip install transformers
! pip install torch
! pip install datasets

In [4]:
import numpy as np
import pandas as pd

from html import unescape
from random import randint
import math

from transformers import pipeline                                                   
#from transformers.pipelines.base import KeyDataset # doesn't work?? +no info online
#import datasets
from datasets import load_dataset, load_metric, Dataset
from transformers import DataCollatorWithPadding
from transformers import AutoTokenizer, AutoModelForMaskedLM, TrainingArguments, Trainer
from tokenizers import normalizers
from tokenizers.normalizers import BertNormalizer
from transformers import DataCollatorForLanguageModeling

from huggingface_hub import notebook_login

import torch as pt
#from torch.nn import functional as F

Log into huggingface to access the amicus files as a transformers dataset object

In [5]:
# run this once at the start of the session so it saves the token you enter
# in the login in the next code chunk
!git config --global credential.helper store

In [6]:
# get access token on Huggingface website > settings > access token (make sure it's a write token)
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


Load data from HuggingFace Hub

In [7]:
ds_path = 'repro-rights-amicus-briefs/repro-rights-amicus'
# use_auth_token must be true bc this is a private dataset
ds = load_dataset(ds_path, use_auth_token=True)

Downloading:   0%|          | 0.00/1.08k [00:00<?, ?B/s]

Using custom data configuration repro-rights-amicus-briefs--repro-rights-amicus-08ba1d56c9dcd8df


Downloading and preparing dataset None/None (download: 13.97 MiB, generated: 26.25 MiB, post-processed: Unknown size, total: 40.22 MiB) to /root/.cache/huggingface/datasets/parquet/repro-rights-amicus-briefs--repro-rights-amicus-08ba1d56c9dcd8df/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901...



Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/repro-rights-amicus-briefs--repro-rights-amicus-08ba1d56c9dcd8df/0.0.0/0b6d5799bb726b24ad7fc7be720c170d8e497f575d02d47537de9a5bac074901. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

Check - should have train/test/val with at least `id` and `text` columns

In [8]:
# check
ds

DatasetDict({
    train: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 414
    })
    valid: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 178
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party'],
        num_rows: 149
    })
})

## Pre-process inputs

Recall that transformers use sub-word tokenizers.

Before tokenizing, we can do some minimal text pre-processing. Our text is already lowercase, but a few other simple steps are still useful:

1. Remove HTML character entities that sometimes show up in PDF/legal documents. For example, this changes `&amp;` to `&`, which the tokenizer handles like any other character.
2. Lowercasing. See about using the BERT normalizer later (this ensures that the text gets the same preprocessing steps used for text in BERT models).
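
As a small illustration of step 1, the standard-library `html.unescape` converts HTML entities back into plain characters. The sample strings below are made up for demonstration and are not taken from the briefs.

```python
from html import unescape

# Hypothetical snippets with HTML entities, as they might appear after PDF extraction
samples = ["planned parenthood &amp; others", "burden &gt; benefit", "the court&#39;s holding"]

# unescape each string, exactly as the map over the "text" column does
cleaned = [unescape(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```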

In [9]:
# remove html characters if they exist! 
ds = ds.map(
    lambda x: {"text": [unescape(o) for o in x["text"]]}, batched=True
)

# lowercase (we've already done this)
#def lowercase_condition(example):
#    return {"condition": example["condition"].lower()}
#ds = ds.map(lowercase_condition)

# normalize text for Bert
#normalizer = normalizers.BertNormalizer()


Next, we instantiate the tokenizer. Here we use `AutoTokenizer` and specify the model checkpoint, so the matching tokenizer is selected for us without having to name its class. We could also instantiate the model's own tokenizer class directly.

In [10]:
#instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


Next, we tokenize. Note that a subword tokenizer typically produces more tokens than there are words, so many of our documents exceed the maximum input length of BERT models (512 tokens).

Since we don't want to cut off our text after the first 512 tokens, we instead use `tokenize_and_split`, which keeps the "overflow tokens". We also don't want a hard cut between chunks: we want to retain some overlap in case an idea is expressed across the boundary between two splits.

See [transformers tutorial](https://huggingface.co/course/chapter5/3?fw=pt) for more info
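
To make the idea of overlapping chunks concrete, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: it ignores special tokens and padding, and the helper name `sliding_windows` is hypothetical. It only shows how a window of fixed size moves over a list of token ids while sharing `overlap` tokens with the previous chunk.

```python
def sliding_windows(token_ids, window, overlap):
    """Split token_ids into chunks of at most `window` tokens,
    where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        # stop once the current window reaches the end of the sequence
        if start + window >= len(token_ids):
            break
        start += step
    return chunks

# toy "token ids" 0..9, window of 4 with 2 tokens of overlap
for chunk in sliding_windows(list(range(10)), window=4, overlap=2):
    print(chunk)
```

In the notebook the same roles are played by `max_length` (window) and `stride` (overlap).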

In [11]:
# tokenize and split long text (see the tutorial linked above)
# instead of returning 1 row per tokenized text, we may return multiple rows
#   with this version, we also keep our metadata by replicating it across
#   all of the newly created rows
def tokenize_and_split(examples):
    result = tokenizer(
        examples["text"],
        truncation = True,
        max_length = 510,#512,
        stride = 128,
        return_overflowing_tokens = True,
        padding = 'max_length'
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
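
The metadata replication at the end of `tokenize_and_split` can be illustrated on a toy batch. The values below are hypothetical; only the list comprehension mirrors the function above.

```python
# Hypothetical batch: two documents, the first split into 3 chunks, the second into 2
examples = {"id": [10, 11], "brief_party": [1, 0]}
sample_map = [0, 0, 0, 1, 1]  # chunk index -> index of the source document

# repeat each document's metadata once per chunk it produced
replicated = {key: [values[i] for i in sample_map] for key, values in examples.items()}
print(replicated)
```

Every chunk therefore carries the `id` and `brief_party` of the brief it came from, which is what allows filtering by party later on.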

Tokenize

In [12]:
#do the tokenizing using map function
tokenized_ds = ds.map(tokenize_and_split,
                      batched = True,
                      batch_size = 100)


As a result, we have tokenized our train, val, and test sets AND retained all of our metadata

In [13]:
tokenized_ds

DatasetDict({
    train: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 8943
    })
    valid: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3838
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'text', 'brief_party', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3130
    })
})

In [14]:
tokenized_ds['train'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'brief': Value(dtype='string', id=None),
 'brief_party': Value(dtype='int64', id=None),
 'case': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'text': Value(dtype='string', id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

## Train Model

Code from here is adapted from a [huggingface example notebook on language modeling](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb)

Download the model from huggingface and cache

In [None]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, we need to add a label column to the dataset. If the task is to mask a token in a sequence and predict it, the label is the complete, unmasked sequence. This is the `input_ids`!

In [None]:
def gen_label(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples


tokenized_ds = tokenized_ds.map(gen_label,
                                batched=True,
                                batch_size=500)


In [None]:
# check that above worked -- we should have input_id = labels
print(tokenized_ds['train']['input_ids'][0])
print(tokenized_ds['train']['labels'][0])

[101, 1996, 2110, 1997, 5284, 2038, 11955, 1037, 2375, 4013, 11020, 3089, 10472, 1996, 2224, 1997, 2270, 5029, 2000, 1010, 2426, 2060, 2477, 1010, 9517, 1037, 2450, 2000, 2031, 2019, 11324, 2025, 4072, 2000, 3828, 2014, 2166, 1012, 2004, 1996, 2457, 1997, 9023, 2218, 1010, 2008, 9347, 2003, 20454, 2135, 13727, 2138, 1996, 2240, 2090, 7936, 1008, 1017, 6594, 1998, 10890, 17041, 2003, 2025, 3154, 1012, 2582, 1010, 1996, 5284, 2375, 6464, 3653, 20464, 22087, 2035, 2270, 2740, 2729, 11670, 1010, 2164, 5068, 11500, 1010, 2013, 21570, 2037, 2658, 14422, 1998, 28428, 2037, 6543, 2916, 2000, 3713, 10350, 1998, 10132, 2000, 7846, 2055, 2037, 2740, 2729, 1012, 1996, 2168, 9347, 23640, 2015, 1996, 7846, 1005, 2916, 2104, 1996, 15276, 7450, 2000, 6855, 2440, 1998, 3143, 2592, 2055, 2035, 2800, 2740, 2729, 7047, 1012, 2035, 5381, 1010, 2164, 6875, 2308, 1010, 2031, 2019, 7344, 3085, 2157, 2000, 2191, 1037, 3929, 1011, 6727, 3601, 2055, 2037, 2740, 2729, 1999, 9595, 2007, 18777, 1010, 4895, 7959, 14

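The task here is masked language modeling. How do we mask random tokens? The Transformers library provides a data collator that randomly masks tokens for us (15% of them with `mlm_probability=0.15`). The collator also builds the `labels` for each batch, keeping the original token ids at the selected positions and setting all other positions to -100 so that they are ignored by the loss.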

In [None]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)
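
For intuition, the following is a simplified pure-Python sketch of the masking scheme described in the BERT paper (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). It is not the collator's actual code; the function name `mask_tokens` and the `seed` argument are assumptions for this illustration. The ids 101, 102, and 103 are `[CLS]`, `[SEP]`, and `[MASK]` in the `bert-base-uncased` vocabulary.

```python
import random

MASK_ID = 103       # [MASK] in bert-base-uncased
IGNORE = -100       # label value that the cross-entropy loss skips
VOCAB_SIZE = 30522  # bert-base-uncased vocabulary size

def mask_tokens(input_ids, special_ids, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for pos, tok in enumerate(input_ids):
        # special tokens are never selected; other tokens are selected with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(IGNORE)
            continue
        labels.append(tok)  # the model must recover the original id here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

ids = [101, 1996, 2110, 1997, 5284, 2038, 11955, 1037, 2375, 102]
# a high probability is used here only so that the short example shows some masking
masked, labels = mask_tokens(ids, special_ids={0, 101, 102}, mlm_probability=0.5)
print(masked)
print(labels)
```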

Define the training parameters

In [None]:
# run this to be able to push model to the hub
!apt-get install git-lfs

In [None]:
#set training arguments
training_args = TrainingArguments("test-trainer",
                                  logging_strategy = "epoch",
                                  evaluation_strategy="epoch",
                                  save_strategy='epoch',
                                  report_to='all',
                                  per_device_train_batch_size = 8,
                                  per_device_eval_batch_size = 8,
                                  num_train_epochs = 5, 
                                  load_best_model_at_end = True,
                                  push_to_hub=True,
                                  hub_model_id=hub_model_id,
                                  overwrite_output_dir=True,
                                  #greater_is_better = True,
                                  #metric_for_best_model = 'accuracy',
                                  learning_rate = 2e-5)

#setup training loop with arguments
trainer = Trainer(model = model, 
                  args = training_args,
                  data_collator = data_collator,
                  tokenizer = tokenizer,
                  train_dataset = tokenized_ds['train'],
                  eval_dataset=tokenized_ds['valid'])

PyTorch: setting up devices
/content/test-trainer is already a clone of https://huggingface.co/amandakonet/reprorights-amicus-bert. Make sure you pull the latest changes with `repo.git_pull()`.


Train!

Before running this, go to Edit -> Notebook Settings and make sure to select GPU under "Hardware accelerator"

In [None]:
#train
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: case, brief, id, text.
***** Running training *****
  Num examples = 11828
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 7395


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.7763 | 1.678926 |
| 2 | 1.76 | 1.619896 |
| 3 | 1.6881 | 1.568348 |
| 4 | 1.6424 | 1.543162 |
| 5 | 1.6131 | 1.526921 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: case, brief, id, text.
***** Running Evaluation *****
  Num examples = 5168
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-1479
Configuration saved in test-trainer/checkpoint-1479/config.json
Model weights saved in test-trainer/checkpoint-1479/pytorch_model.bin
tokenizer config file saved in test-trainer/checkpoint-1479/tokenizer_config.json
Special tokens file saved in test-trainer/checkpoint-1479/special_tokens_map.json
tokenizer config file saved in test-trainer/tokenizer_config.json
Special tokens file saved in test-trainer/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: case, brief, id, text.
***** Running Evaluation *****
  Num examples = 5168
  Batch size = 8
Saving model checkpoint to test-trainer/checkpoint-2958
Config

TrainOutput(global_step=7395, training_loss=1.6959930894502198, metrics={'train_runtime': 4701.4174, 'train_samples_per_second': 12.579, 'train_steps_per_second': 1.573, 'total_flos': 1.5565932628992e+16, 'train_loss': 1.6959930894502198, 'epoch': 5.0})

## Evaluate trained model

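The main metrics used to evaluate language models are:

* [perplexity](https://huggingface.co/docs/transformers/perplexity): the exponential of the average negative log-likelihood of the predicted tokens. For a causal model, each token is predicted from the preceding tokens, P($X_k$ | $X_{<k}$); for a masked model like BERT, the masked tokens are predicted from the surrounding context, so the value below is the exponential of the masked-token loss rather than a true perplexity.
   - lower values are better
* cross entropy loss
  - lower values are better
  - this is the loss reported during training and returned as `eval_loss` by `trainer.evaluate`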
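
The relationship between the two metrics can be sketched in a few lines. The helper name `perplexity` and the toy values are assumptions for illustration; the notebook itself simply applies `math.exp` to `eval_loss`.

```python
import math

def perplexity(token_nlls):
    """Exponential of the mean negative log-likelihood (natural log) over scored tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# If the model assigns probability 1/4 to every scored token,
# it is as uncertain as a uniform choice among 4 tokens.
uniform = [-math.log(0.25)] * 8
print(round(perplexity(uniform), 6))
```

A perplexity of k can thus be read as the model being, on average, as uncertain as if it were choosing uniformly among k tokens.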

In [None]:
eval_train = trainer.evaluate(tokenized_ds['train'])
print(f"Perplexity: {math.exp(eval_train['eval_loss']):.2f}")

The following columns in the evaluation set  don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: case, brief, id, text.
***** Running Evaluation *****
  Num examples = 11828
  Batch size = 8


Perplexity: 4.41


In [None]:
eval_valid = trainer.evaluate(tokenized_ds['valid'])
print(f"Perplexity: {math.exp(eval_valid['eval_loss']):.2f}")

The following columns in the evaluation set  don't have a corresponding argument in `BertForMaskedLM.forward` and have been ignored: case, brief, id, text.
***** Running Evaluation *****
  Num examples = 5168
  Batch size = 8


Perplexity: 4.68


## Push to hub

In [None]:
trainer.push_to_hub()

Saving model checkpoint to test-trainer
Configuration saved in test-trainer/config.json
Model weights saved in test-trainer/pytorch_model.bin
tokenizer config file saved in test-trainer/tokenizer_config.json
Special tokens file saved in test-trainer/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


Upload file pytorch_model.bin:   0%|          | 3.38k/418M [00:00<?, ?B/s]

Upload file runs/Jan10_17-49-10_17a4504e6eb1/events.out.tfevents.1641837349.17a4504e6eb1.79.2:  60%|#####9    …

Upload file runs/Jan10_17-49-10_17a4504e6eb1/events.out.tfevents.1641842316.17a4504e6eb1.79.4: 100%|##########…

To https://huggingface.co/amandakonet/reprorights-amicus-bert
   6c6af4b..009a5ec  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}}
To https://huggingface.co/amandakonet/reprorights-amicus-bert
   009a5ec..b561fd7  main -> main



'https://huggingface.co/amandakonet/reprorights-amicus-bert/commit/009a5ecf3dc271a8353b714e0c9dedce8b8750ad'

## Cross-model comparison task

Check this model's performance on our small set of labeled data.

# Zero-shot classification with pipeline


Now that we have a model fine-tuned on our data, we can perform zero-shot classification. The goal here is for the model to use what it knows about amicus text to 1) understand the candidate labels we provide and 2) classify texts using these labels.

## Create pipeline 
We can create a new pipeline using the model just pushed to the hub.

First, we need to load our tokenizer. We can use the tokenizer from the model we just trained by specifying the `hub_model_id`. Note that if we do not re-specify our custom arguments, the tokenizer falls back to the default BERT tokenizer arguments. Run the cell below to investigate this.

In [None]:
#class_tokenizer = AutoTokenizer.from_pretrained(hub_model_id,
#                                                use_auth_token = True)
#class_tokenizer.init_kwargs

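Instead, re-specify our sliding window so that text given to the model is split up as we expect. Note that this records the arguments in `init_kwargs` (checked in the cell after this one); options such as `truncation`, `stride`, and `padding` are normally passed when the tokenizer is called, so it is worth verifying that the pipeline actually applies them: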

In [15]:
class_tokenizer = AutoTokenizer.from_pretrained(hub_model_id,
                                                use_auth_token = True,
                                                truncation = True,
                                                max_length = 512,
                                                stride = 128,
                                                return_overflowing_tokens = True,
                                                padding = 'max_length')


In [16]:
# check that our new args are added to the init arguments
class_tokenizer.init_kwargs

{'cls_token': '[CLS]',
 'do_lower_case': True,
 'mask_token': '[MASK]',
 'max_length': 512,
 'model_max_length': 512,
 'name_or_path': 'amandakonet/reprorights-amicus-bert',
 'pad_token': '[PAD]',
 'padding': 'max_length',
 'return_overflowing_tokens': True,
 'sep_token': '[SEP]',
 'special_tokens_map_file': None,
 'stride': 128,
 'strip_accents': None,
 'tokenize_chinese_chars': True,
 'truncation': True,
 'unk_token': '[UNK]'}

By default, the classification pipeline follows a premise-hypothesis setup called Natural Language Inference (NLI). The transformer is given two sequences for which it must determine whether they contradict each other, entail each other, or neither. In the classification pipeline, the transformer is given the text (the premise) and the label (the hypothesis). The hypothesis is structured as "This example is {label}." For example, if we want to know whether a text discusses "undue burden," the model is given the text and the hypothesis "This example is undue burden."

This works for most cases, but we can adapt the template if necessary, as we do here:



In [17]:
hyp_temp = 'This example is about {}.'
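
To see what the model is actually given, the template can be filled in by hand. The premise and the two labels below are hypothetical examples; the pipeline performs the equivalent formatting internally for every candidate label.

```python
hyp_temp = 'This example is about {}.'

# Hypothetical premise and a subset of candidate labels, for illustration only
premise = 'the statute places an undue burden on women seeking care.'
candidate_labels = ['undue burden standard', 'psychological harm']

# One (premise, hypothesis) pair per candidate label
pairs = [(premise, hyp_temp.format(label)) for label in candidate_labels]
for _, hypothesis in pairs:
    print(hypothesis)
```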

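Now we can use our fine-tuned model in the pipeline by specifying the task, the model id, and the tokenizer. Keep in mind that the zero-shot pipeline expects a model with a sequence classification head trained on NLI; our checkpoint was fine-tuned for masked language modeling only, so that head is newly initialized when the model is loaded (see the warning in the output of the next cell) and the scores should be interpreted with caution.

Setting `device = 0` runs the pipeline on the first GPU.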

In [18]:
classifier = pipeline(task = 'zero-shot-classification', 
                      model = hub_model_id,
                      tokenizer = class_tokenizer,
                      hypothesis_template = hyp_temp,
                      batch_size = 8,
                      use_auth_token = True,
                      device = 0)

Downloading:   0%|          | 0.00/664 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/418M [00:00<?, ?B/s]

Some weights of the model checkpoint at amandakonet/reprorights-amicus-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at amandakon

Next, we specify our candidate labels:



In [19]:
#labels_women = ['women’s rights', 'undue burden', 'compulsory motherhood', 'women’s citizenship']
labels_women = ['abortion is women\'s right', 'undue burden standard', 'forces motherhood']
labels_opp = ['lack of morality', 'abortion negative', 'psychological harm', 'fetus']
labels_dia = ['evidence', 'health']
labels = labels_women + labels_opp + labels_dia
labels

["abortion is women's right",
 'undue burden standard',
 'forces motherhood',
 'lack of morality',
 'abortion negative',
 'psychological harm',
 'fetus',
 'evidence',
 'health']

## Set up data

Next, we set up the data for the pipeline. According to the pipeline API, we can't pass a Dataset object to the pipeline. The pipeline takes a list of strings, where each string is a text we want to classify.

Our tokenizer function splits the text into chunks of at most 510 tokens for us (`max_length` counts the special tokens [CLS] and [SEP], so every chunk fits within the 512-token limit). Splitting the text manually can cause issues, as 510 words != 510 tokens: texts with the same number of words can produce different numbers of tokens, depending on punctuation and on how many words are broken into several subwords.

One way to potentially remedy this is to "decode" our tokenized text back into the "original" input text. NOTE THAT THIS IS NOT LOSSLESS, meaning we do not retain 100% of the original text. However, I think this may produce better results than splitting the text manually and worrying about the resulting split being greater than 510 tokens. We cannot send anything to the pipeline except lists of strings (i.e., we can't send it pre-tokenized input), so this solves our long text issue. Alternatively, we can use PyTorch (see the later section).

In [20]:
def decode_chunks(example):
  result = tokenizer.batch_decode(
      example['input_ids'],
      skip_special_tokens=True,
      clean_up_tokenization_spaces=True
  )
  example['text_chunk'] = result
  return example

tokenized_ds = tokenized_ds.map(decode_chunks, batched=True, batch_size=100)


We can create this list from our Dataset object (`tokenized_ds`). Below, I show how we can index the dataset to get a list of strings.

In [21]:
# the list of strings
print(type(tokenized_ds['train']['text_chunk']))
# showing that the first element of the list is a string
print(type(tokenized_ds['train']['text_chunk'][0]))

<class 'list'>
<class 'str'>


Create the list of strings using all of our data

In [22]:
tokenized_ds['train'].features

{'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'brief': Value(dtype='string', id=None),
 'brief_party': Value(dtype='int64', id=None),
 'case': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'text': Value(dtype='string', id=None),
 'text_chunk': Value(dtype='string', id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}

In [23]:
# old way using all text
#sequences = ds['train']['text'] + ds['valid']['text'] + ds['test']['text']
#brief_ids = ds['train']['id'] + ds['valid']['id'] + ds['test']['id']
#brief_names = ds['train']['brief'] + ds['valid']['brief'] + ds['test']['brief']

# new way using decoded tokenized text
sequences = tokenized_ds['train']['text_chunk'] + tokenized_ds['valid']['text_chunk'] + tokenized_ds['test']['text_chunk']
brief_ids = tokenized_ds['train']['id'] + tokenized_ds['valid']['id'] + tokenized_ds['test']['id']
brief_names = tokenized_ds['train']['brief'] + tokenized_ds['valid']['brief'] + tokenized_ds['test']['brief']
brief_party = tokenized_ds['train']['brief_party'] + tokenized_ds['valid']['brief_party'] + tokenized_ds['test']['brief_party']

# check we have the results we expect
print(type(sequences))
print(type(sequences[0]))
print(len(sequences))

<class 'list'>
<class 'str'>
15911


Convert lists to df and then Dataset

In [30]:
df = pd.DataFrame(list(zip(brief_ids, brief_party, sequences)),
                  columns = ['id', 'brief_party', 'text'])
df.reset_index(inplace = True)

# convert to ds
ds_split = Dataset.from_pandas(df)

Alternatively, split the text into chunks manually on whitespace. Note that this results in sequences that are often longer than the 512 tokens allotted for models (median of ~130 tokens too many).

Do not run this section if doing above method.

In [None]:
#@title
def split_text(text, max_len, stride_len):
  # split text on whitespace
  words = text.split()
  # step between chunk starts, so that consecutive chunks overlap by "stride_len" words
  step = max_len - stride_len
  # separate list into strings of at most "max_len" words
  return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]

# example of how it works
#test = 'The Supreme Court of the United States (SCOTUS) is the highest court in the federal judiciary of the United States of America. It has ultimate and largely discretionary appellate jurisdiction over all federal and state court cases that involve a point of federal law, and original jurisdiction over a narrow range of cases, specifically "all Cases affecting Ambassadors, other public Ministers and Consuls, and those in which a State shall be Party."'
#res = split_text(test, max_len = 20, stride_len = 5)
#print(test)
#for r in res: print(r)

# set max len & stride
max_len = 312#512
stride_len = 78#128

# take sequences and ids and turn into df
df = pd.DataFrame(list(zip(brief_ids, brief_party, sequences)),
                  columns = ['id', 'party', 'text_full'])

# split each text into len 'max_len' with 'stride_len' overlap
df['text'] = df.apply(lambda row: split_text(row['text_full'],
                                             max_len=max_len,
                                             stride_len=stride_len),axis=1)
df = df.explode('text')
df.reset_index(inplace = True)
df.drop(['text_full', 'index'], axis=1, inplace=True)

# convert to ds
ds_split = Dataset.from_pandas(df)
ds_split

## Define pipeline function

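Next, filter the dataset into one set with pro-women briefs and one with opposition briefs and repeat this process for each. The function below picks the candidate labels that match the brief's party.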

In [25]:
def classifier_pipeline(example):
    if example['brief_party'] == 1:
      curr_labels = labels_women
    else:
      curr_labels = labels_opp
    # the device is already set when the pipeline is created, so it is not passed again here
    output = classifier(example['text'], curr_labels, multi_label=True)
    example['pred_labels'] = output['labels']
    example['pred_scores'] = output['scores']
    return example

In [None]:
# KeyDataset not working??
# open issue on HF github: https://github.com/huggingface/transformers/issues/15524
# apparently working fine in 4.15.0
#dataset = datasets.load_dataset("imdb", name="plain_text", split="unsupervised")
#pipe = pipeline("text-classification", device=0)
#for out in pipe(KeyDataset(dataset, "text"), batch_size=8, truncation="only_first"):
#    print(out)

## Define function to select only top predictions


Now that we have a list of predicted labels and associated scores, only save those that meet a certain threshold or the top k predictions, whichever is specified. 

In [26]:
def get_preds(example, threshold=None, topk=None):
    preds = []
    if threshold is not None:
        # keep every label whose score reaches the threshold
        for label, score in zip(example["pred_labels"], example["pred_scores"]):
            if score >= threshold:
                preds.append(label)
    elif topk is not None:
        # the pipeline returns labels sorted by descending score
        preds = list(example["pred_labels"][:topk])
    else:
        raise ValueError("Set either `threshold` or `topk`.")
    # return a plain list; np.squeeze would turn a single prediction into a 0-d array
    return {"pred_label_ids": preds}
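
The two selection modes can be tried on a made-up pipeline output. The scores below are invented, and `select_labels` is a hypothetical stand-alone helper written for this illustration; it assumes, as the zero-shot pipeline does, that labels arrive sorted by descending score.

```python
# Hypothetical pipeline output: labels sorted by descending score
example = {
    "pred_labels": ["undue burden standard", "forces motherhood", "abortion is women's right"],
    "pred_scores": [0.91, 0.64, 0.22],
}

def select_labels(example, threshold=None, topk=None):
    pairs = list(zip(example["pred_labels"], example["pred_scores"]))
    if threshold is not None:
        # keep every label whose score reaches the threshold
        return [label for label, score in pairs if score >= threshold]
    if topk is not None:
        # keep the k best-scoring labels
        return [label for label, _ in pairs[:topk]]
    raise ValueError("Set either `threshold` or `topk`.")

print(select_labels(example, threshold=0.60))
print(select_labels(example, topk=1))
```

A threshold can return any number of labels (including none), while `topk` always returns a fixed number, which matters when comparing briefs of very different lengths.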

## Run pipeline on text

Separate the Dataset into one containing feminist briefs and one containing opposition briefs.

Classifying the feminist briefs takes about 8 min 30 s.

In [31]:
# split datasets into fem and opp
ds_split_fem = ds_split.filter(lambda x: x['brief_party'] == 1)
ds_split_opp = ds_split.filter(lambda x: x['brief_party'] == 0)

# zero shot
ds_0shot_fem = ds_split_fem.map(classifier_pipeline)#, batched=True, batch_size=4)




Example

In [35]:
print(ds_0shot_fem['text'][0])
print(ds_0shot_fem['pred_scores'][0])
print(ds_0shot_fem['pred_labels'][0])
print(get_preds(ds_0shot_fem[0], threshold=0.60))

this brief focuses on pregnant teenage girls, on the individual and societal importance of preserving their right to make a personal individual choice between abortion and childbirth without the compelled involvement of their parents, and on the devastating harm that removing or restricting that right would cause to teenage girls in particular. at minimum, preservation of their right to choose requires that the state not compel parental notice or consent unless it provides for a meaningful judicial or other alternative. whatever we might wish, sexual activity and unintended pregnancies have become a concrete reality of teenage girls'lives. because most teenage pregnancies are unintended, and because unwanted childbirth and motherhood are so exceptionally burdensome for a teenage girl, a pregnant teenager's right to determine her future must include the right to choose safe, legal abortion instead of childbirth. legislatively - compelled parental notification can be the functional equiv

# Zero-shot classification with pytorch

In [None]:
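# pose sequence as a NLI premise and label as a hypothesis
from transformers import AutoModelForSequenceClassification, AutoTokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained('joeddav/xlm-roberta-large-xnli')
tokenizer = AutoTokenizer.from_pretrained('joeddav/xlm-roberta-large-xnli')

# use the GPU when one is available
device = 'cuda' if pt.cuda.is_available() else 'cpu'
nli_model = nli_model.to(device)

# example inputs: one decoded text chunk and one candidate label
sequence = sequences[0]
label = labels[0]

premise = sequence
hypothesis = f'This example is {label}.'

# run through model pre-trained on MNLI
# (`truncation` replaces the deprecated `truncation_strategy` argument)
x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                     truncation='only_first')
with pt.no_grad():
    logits = nli_model(x.to(device))[0]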

# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (2) as the probability of the label being true 
entail_contradiction_logits = logits[:,[0,2]]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,1]
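
The last three lines of the cell above can be reproduced without a model to show what they compute. The logits below are invented, and `entailment_probability` is a hypothetical helper; the label order (contradiction, neutral, entailment) follows the comment in the cell above.

```python
import math

def entailment_probability(logits):
    """logits = [contradiction, neutral, entailment]; the neutral logit is dropped."""
    contradiction, _, entailment = logits
    # numerically stable two-way softmax over contradiction and entailment
    m = max(contradiction, entailment)
    e_c = math.exp(contradiction - m)
    e_e = math.exp(entailment - m)
    return e_e / (e_c + e_e)

# a large neutral logit has no effect once it is discarded
print(entailment_probability([0.0, 5.0, 0.0]))
print(entailment_probability([-2.0, 0.0, 2.0]))
```

Dropping the neutral class turns the three-way NLI output into a yes/no score per label, which is why scores for different labels do not need to sum to one when `multi_label=True`.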

In [None]:
# init model
from transformers import BertForSequenceClassification, BertTokenizer
tokenizer = BertTokenizer.from_pretrained(hub_model_id, use_auth_token=True)
model = BertForSequenceClassification.from_pretrained(hub_model_id, use_auth_token=True)

# ex
txt = ds['train']['text'][0]

# get tokens for one document -- no truncation, no special tokens, return pytorch tensors
# (mapping encode_plus over batches of texts raised a TypeError, so we tokenize `txt` directly)
tokens = tokenizer.encode_plus(
    txt,
    add_special_tokens=False,
    return_tensors='pt')

chunk_size = 512

# split into size 510 (leave room for CLS & SEP tokens)
input_id_chunks = list(tokens['input_ids'][0].split(chunk_size - 2))
mask_chunks = list(tokens['attention_mask'][0].split(chunk_size - 2))

for i in range(len(input_id_chunks)):
    # add CLS and SEP tokens to input IDs
    input_id_chunks[i] = pt.cat([
        pt.tensor([101]), input_id_chunks[i], pt.tensor([102])
    ])
    # extend the attention mask to cover CLS and SEP
    mask_chunks[i] = pt.cat([
        pt.tensor([1]), mask_chunks[i], pt.tensor([1])
    ])
    # get required padding length
    pad_len = chunk_size - input_id_chunks[i].shape[0]
    # check if tensor length satisfies required chunk size
    if pad_len > 0:
        # if padding length is more than 0, pad with integer zeros
        input_id_chunks[i] = pt.cat([
            input_id_chunks[i], pt.zeros(pad_len, dtype=pt.long)
        ])
        mask_chunks[i] = pt.cat([
            mask_chunks[i], pt.zeros(pad_len, dtype=pt.long)
        ])

# reshape for BERT
input_ids = pt.stack(input_id_chunks)
attention_mask = pt.stack(mask_chunks)

input_dict = {
    'input_ids': input_ids.long(),
    'attention_mask': attention_mask.int()
}
input_dict

# check
#outputs = model(**input_dict)
#outputs

Some weights of the model checkpoint at amandakonet/reprorights-amicus-bert were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at amandakon