## Data Workflow for Data Extraction - CUADv1 - Fine Tune Transformer

In [2]:
!pip install wandb --quiet
!pip install transformers --quiet
!pip install datasets --quiet
!pip install seqeval --quiet
!pip install sentencepiece --quiet
!pip install --upgrade accelerate --quiet


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Sources of information, code and discussions


1. The foundation workflow is Hugging Face's token classification example, hosted on Colab [here][1].
2. The models are pretrained base models, each fine-tuned on a downstream token classification task; an example model card is [here][2].

[1]: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb
[2]: https://huggingface.co/roberta-base

# Initialize Environment

In [4]:
import os
os.chdir("/content/drive/MyDrive/CUAD_NER_TRANSFORMERS/")
cwd = os.getcwd()

import os, re, math, random, json, string
# Logging date for w&b
from datetime import date
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import wandb

import transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import TrainerCallback, AdamW, get_cosine_schedule_with_warmup
from transformers import DataCollatorForTokenClassification, PreTrainedModel, RobertaTokenizerFast

from datasets import load_dataset, ClassLabel, Sequence, load_metric

from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [None]:
# Need to log in to weights and biases in the command line using: wandb login
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mengr2243[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

### Step 1: File and dataset handling
Data cleaning, annotation and formatting have already been done: the text was tokenized into separate words, tagged using the IOB format, and serialized to a JSON file with the Pandas df.to_json() function using the orient="table" parameter.
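
To make the file layout concrete, here is a minimal sketch of what an orient="table" export looks like. The dictionary is hand-written to mimic that layout: the schema is abbreviated and the two records are invented for illustration. Only the column names `split_tokens` and `ner_tags` come from the tokenization code in this notebook.

```python
import json

# Hand-written stand-in for a df.to_json(orient="table") export.
# Assumption: the schema is abbreviated and the records are invented.
table_export = {
    "schema": {"fields": [{"name": "index"},
                          {"name": "split_tokens"},
                          {"name": "ner_tags"}]},
    "data": [
        {"index": 0,
         "split_tokens": ["This", "Agreement", "is", "made"],
         "ner_tags": [0, 0, 0, 0]},
        {"index": 1,
         "split_tokens": ["by", "Acme", "Corp"],
         "ner_tags": [0, 1, 2]},
    ],
}

serialized = json.dumps(table_export)

# The rows live under the top-level "data" key, next to "schema".
records = json.loads(serialized)["data"]

# Every word must carry exactly one integer tag.
for row in records:
    assert len(row["split_tokens"]) == len(row["ner_tags"])

print(len(records), records[1]["split_tokens"])
```

Because the rows sit under `data`, a loader has to be pointed at that field; reading the top level would return the schema alongside the rows.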

Here we load in the dataset with this JSON format.

In [3]:
DATA_FILE = cwd+ '/cuad-v1-annotated.json'

NameError: ignored

In [None]:
data_files = DATA_FILE
datasets = load_dataset('json', data_files=data_files, field='data')



  0%|          | 0/1 [00:00<?, ?it/s]


### **Step 3: Building, Training and Validating model**
Since all our tasks are about token classification, we use the AutoModelForTokenClassification class. Like with the tokenizer, the from_pretrained method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (which we can get from the features, as seen before).

The warning tells us that some weights are discarded (for example the vocab_transform and vocab_layer_norm layers) and that some others are randomly initialized (for example the pre_classifier and classifier layers); the exact layer names depend on the checkpoint. This is absolutely normal here: we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we have no pretrained weights. The library therefore warns that the model should be fine-tuned before it is used for inference, which is exactly what we are going to do.

#### Training Schedule
This is a common training schedule for transfer learning. The learning rate starts at zero, to initially preserve the pre-trained weights, increases to a maximum during warmup, then decays along a cosine curve to help the optimizer settle into a good minimum.

Changing the schedule and/or learning rates is a popular way to experiment to find good model performance. Note that the maximum learning rate here is scaled up with the batch size. This is a good practice to follow.
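
The shape of this schedule can be sketched in a few lines of plain Python. The function below follows the warmup-then-cosine multiplier as I understand `get_cosine_schedule_with_warmup` to compute it; treat it as an illustration rather than the library's code. The name `lr_multiplier` is hypothetical, and the numbers (base rate 7.5e-6, batch size 1, 2820 total steps, 20% warmup) are taken from the runs in this notebook.

```python
import math

def lr_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Factor in [0, 1] that scales the maximum learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Cosine decay; num_cycles=0.5 is half a wave, ending at zero
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))

base_lr = 7.5e-6
batch_size = 1
lr_max = base_lr * batch_size          # the maximum grows with the batch size
total_steps = 2820
warmup_steps = int(0.2 * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_max * lr_multiplier(step, warmup_steps, total_steps))
```

One consequence of this formula worth checking in your own runs: with `num_cycles=0.5` the rate ends at zero, whereas a value such as 0.8 reaches zero at 62.5% of the decay phase and then climbs back to roughly 65% of the maximum by the last step.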

Weight decay is a regularization term, closely related to L2 regularization, that the optimizer applies to shrink the weights and offset any tendency for the model to overfit.


To instantiate a Trainer, we will need to define three more things. The most important is the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional.

Then we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each batch will be padded to the length of its longest example). There is a data collator for this task in the Transformers library that pads not only the inputs but also the labels.
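
As a rough picture of what such a collator does, here is a minimal pure-Python sketch. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the padding id of 0 is an assumption (the real value comes from the tokenizer). The label padding value of -100 matches the ignore index used elsewhere in this notebook.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

example_batch = pad_batch([
    {"input_ids": [5, 6, 7, 8], "labels": [-100, 1, 2, -100]},
    {"input_ids": [5, 9], "labels": [-100, -100]},
])
print(example_batch["labels"])
```

Padding per batch rather than to a global maximum keeps short batches small, which matters when a few contracts are much longer than the rest.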

The last thing to define for our Trainer is how to compute the metrics from the predictions. Here we will load the seqeval metrics (which are commonly used to evaluate results on the CoNLL benchmark dataset): https://github.com/chakki-works/seqeval
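
seqeval's precision, recall and F1 are computed over whole entity spans rather than individual tokens, which is why they can sit well below token accuracy. The sketch below is a simplified illustration of that idea, not seqeval's implementation: it ignores any I- tag that does not continue a span opened by a B- tag, and the label names `PARTY` and `DATE` are invented for the example.

```python
def iob_spans(tags):
    """Return the set of (entity_type, start, end) spans in an IOB tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" guarantees the last open span is closed
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside:
            if start is not None:
                spans.add((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans

def entity_prf(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 with exact span matching."""
    gold, pred = iob_spans(gold_tags), iob_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["B-PARTY", "I-PARTY", "O", "B-DATE"]
pred = ["B-PARTY", "I-PARTY", "O", "O"]
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, round(f1, 4))
```

In this toy case three of four tokens are correct, yet only one of the two gold entities is recovered, so recall drops to one half.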

Note: either BILOU or IOB tags can be used. While BILOU encodes more boundary information (it also marks the last token of an entity and single-token entities), research suggests that using the simpler IOB scheme for token classification shouldn't impact accuracy.
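
For readers unfamiliar with the two schemes, the following sketch converts an IOB sequence to BILOU. `iob_to_bilou` is a hypothetical helper written for this illustration, it assumes well-formed IOB input, and the label names are invented.

```python
def iob_to_bilou(tags):
    """Convert well-formed IOB tags to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, entity = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- of the same type
        continues = next_tag == f"I-{entity}"
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + entity)
        else:
            bilou.append(("I-" if continues else "L-") + entity)
    return bilou

iob = ["O", "B-PARTY", "I-PARTY", "I-PARTY", "O", "B-DATE"]
print(iob_to_bilou(iob))
```

The conversion is deterministic, which shows that BILOU adds no information beyond what IOB already carries; it only makes entity ends and single-token entities explicit in the label set.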

So we will need to do a bit of post-processing on our predictions:
 - select the predicted index (with the maximum logit) for each token
 - convert it to its string label
 - ignore everywhere we set a label of -100
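
The three steps above can be sketched without any library, using nested lists in place of arrays. `decode_predictions` is a hypothetical name, and the logits and label list are made up for the example.

```python
def decode_predictions(logits, labels, label_list):
    """Turn per-token logits into label strings, skipping positions labelled -100."""
    true_predictions, true_labels = [], []
    for seq_logits, seq_labels in zip(logits, labels):
        preds, golds = [], []
        for token_logits, label_id in zip(seq_logits, seq_labels):
            if label_id == -100:
                continue  # special token or non-first sub-word piece
            # Index of the largest logit
            pred_id = max(range(len(token_logits)), key=token_logits.__getitem__)
            preds.append(label_list[pred_id])
            golds.append(label_list[label_id])
        true_predictions.append(preds)
        true_labels.append(golds)
    return true_predictions, true_labels

label_names = ["O", "B-PARTY", "I-PARTY"]
toy_logits = [[[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [9.0, 0.0, 0.0]]]
toy_labels = [[-100, 0, 2, -100]]
decoded_preds, decoded_golds = decode_predictions(toy_logits, toy_labels, label_names)
print(decoded_preds, decoded_golds)
```

The first and last positions are dropped even though the model produced logits for them, so they never influence the metric.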

The `compute_metrics` method in the class below does all this post-processing on the evaluation output (a named tuple containing predictions and labels) before applying the metric:

In [None]:
# SPECIFY THE WEIGHTS AND BIASES PROJECT NAME
%env WANDB_PROJECT = 'P2D-NER-2021' 
# DETERMINE WHETHER TO SAVE THE MODEL IN THE 100GB OF FREE W&B STORAGE
%env WANDB_LOG_MODEL = false 

class BUILD_MODEL:
  def __init__(self, DATASETS, MODEL_NAME,
               TRAIN_SPLIT, RANDOM_SEED,
               BATCH_SIZES, EPOCHS,
               FEATURE_CLASS_LABELS,
               TEMP_MODEL_OUTPUT_DIR,
               TRAIN, LR=0.0000075):
    
    #List of labels saved in the data preparation stage
    self.label_list = self.load_labels_list(FEATURE_CLASS_LABELS)

    #Models to test: Update as required
    self.models = dict(
        ROBERTA = "roberta-base",
        DISTILBERT_U = "distilbert-base-uncased",
        DISTILBERT_C = "distilbert-base-cased",
        DEBERTA_V2_XL = "microsoft/deberta-v2-xlarge",
        DEBERTA_V2_XXL = "microsoft/deberta-v2-xxlarge")
    
    #Input Dataset
    self.DATASETS=DATASETS

    #Name of model: Select based on models dictionary keys
    self.MODEL_NAME = MODEL_NAME
    
    # LOAD OR TRAIN MODEL: 1 to TRAIN WEIGHTS or 0 to LOAD WEIGHTS
    self.TRAIN = TRAIN 
    
    # TRAIN/VALIDATION SPLIT
    self.TRAIN_SPLIT = TRAIN_SPLIT

    # RANDOM SEED FOR REPRODUCIBILITY
    self.RANDOM_SEED = RANDOM_SEED

    # BATCH SIZE
    # TRY 4, 8, 16, 32, 64, 128, 256. REDUCE IF OOM ERROR, HIGHER FOR TPUS
    self.BATCH_SIZES = BATCH_SIZES

    # EPOCHS - TRANSFORMERS ARE TYPICALLY FINE-TUNED BETWEEN 1 AND 3 EPOCHS 
    self.EPOCHS = EPOCHS
    
    #Path to Feature class labels saved
    self.FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS
    
    # Model out path
    self.TEMP_MODEL_OUTPUT_DIR = TEMP_MODEL_OUTPUT_DIR

    #Learning Rate
    self.LR = LR

  #Loads the list of labels
  def load_labels_list(self, FEATURE_CLASS_LABELS):
      # Open the label list created in pre-processing corresponding to the ner_tag indices
      with open(FEATURE_CLASS_LABELS, 'r') as f:
          label_list = json.load(f)
      return label_list

#==================DATA PREPROCESSING AND TOKENIZATION=========================#
  def word_id_func(self, input_ids, tokenizer, print_labs=False):
      # Rebuild word ids from SentencePiece tokens (used for DeBERTa v2):
      # a token starting with '▁' opens a new word, any other token continues the current one.
      tokens = tokenizer.convert_ids_to_tokens(input_ids)

      word_ids = []
      i = 0
      spec_toks = ['[CLS]', '[SEP]', '[PAD]']
      for t in tokens:
          if t in spec_toks:
              # Special tokens get -100 so they are ignored downstream
              word_ids.append(-100)
          else:
              if t.startswith('▁'):
                  i += 1
              word_ids.append(i)
          if print_labs:
              print(t, i)
      # Report the word count once, after the loop
      if print_labs:
          print("Total:", i)
      return word_ids

  def tokenize_and_align_labels(self, examples, tokenizer,  label_all_tokens=False):
      tokenized_inputs = tokenizer(examples["split_tokens"],
                                  truncation=True,
                                  is_split_into_words=True)
      labels = []
      for i, label in enumerate(examples["ner_tags"]):
          word_ids = tokenized_inputs.word_ids(batch_index=i)
          previous_word_idx = None
          label_ids = []
          for word_idx in word_ids:
              # Special tokens have a word id that is None. We set the label to -100 so they are automatically
              # ignored in the loss function.
              if word_idx is None:
                  label_ids.append(-100)
              # We set the label for the first token of each word.
              elif word_idx != previous_word_idx:
                  label_ids.append(label[word_idx])
              # For the other tokens in a word, we set the label to either the current label or -100, depending on
              # the label_all_tokens flag.
              else:
                  label_ids.append(label[word_idx] if label_all_tokens else -100)
              previous_word_idx = word_idx
          labels.append(label_ids)

      tokenized_inputs["labels"] = labels
      return tokenized_inputs

  def tokenize_and_align_labels_deberta(self, examples, tokenizer, label_all_tokens=False):
      tokenized_inputs = tokenizer(examples["split_tokens"],
                                  truncation=True,
                                  is_split_into_words=True)
      labels = []
      word_ids_list = []
      for input_ids in tokenized_inputs["input_ids"]:
          wids = self.word_id_func(input_ids, tokenizer,  print_labs=False)
          word_ids_list.append(wids)
      
      for i, label in enumerate(examples["ner_tags"]):
          word_ids = word_ids_list[i]
          previous_word_idx = None
          label_ids = []
          for word_idx in word_ids:
              # Special tokens were given a word id of -100 by word_id_func. We set the label to -100 so
              # they are automatically ignored in the loss function.
              if word_idx == -100:
                  label_ids.append(-100)
              # We set the label for the first token of each word
              # (word ids from word_id_func start at 1, hence the -1).
              elif word_idx != previous_word_idx:
                  label_ids.append(label[word_idx-1])
              # For the other tokens in a word, we set the label to either the current label or -100, depending on
              # the label_all_tokens flag.
              else:
                  label_ids.append(label[word_idx-1] if label_all_tokens else -100)
              previous_word_idx = word_idx
          labels.append(label_ids)

      tokenized_inputs["labels"] = labels
      return tokenized_inputs



  def SET_PARAMETERS(self, DATASETS, MODEL, MODEL_CHECKPOINT):
    EPOCHS = self.EPOCHS
    BATCH_SIZES = self.BATCH_SIZES
    RANDOM_SEED = self.RANDOM_SEED

    today = date.today()
    log_date = today.strftime("%d-%m-%Y")

    #Optimizer
    learning_rate = self.LR
    lr_max = learning_rate * self.BATCH_SIZES
    weight_decay = 0.05

    optimizer = AdamW(
        MODEL.parameters(),
        lr=lr_max,
        weight_decay=weight_decay)

    print("The maximum learning rate is: ",lr_max)

    # Learning Rate Schedule
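    num_train_samples = len(DATASETS["train"])
    warmup_ratio = 0.2 # Fraction of total steps spent going from zero to the max learning rate
    num_cycles=0.8 # Number of cosine waves in the decay phase (0.5 would end exactly at zero)

    # Step counts must be whole numbers: batches per epoch (rounded up) times epochs
    num_training_steps = math.ceil(num_train_samples/BATCH_SIZES)*EPOCHS
    num_warmup_steps = int(num_training_steps*warmup_ratio)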

    lr_sched = get_cosine_schedule_with_warmup(optimizer=optimizer,
                                              num_warmup_steps=num_warmup_steps,
                                              num_training_steps = num_training_steps,
                                              num_cycles=num_cycles)
    
    args = TrainingArguments(output_dir = self.TEMP_MODEL_OUTPUT_DIR,
                        evaluation_strategy = "epoch",
                        learning_rate=lr_max,
                        per_device_train_batch_size=BATCH_SIZES,
                        per_device_eval_batch_size=BATCH_SIZES,
                        num_train_epochs=EPOCHS,
                        weight_decay=weight_decay,
                        lr_scheduler_type = 'cosine',
                        warmup_ratio=warmup_ratio,
                        logging_strategy="epoch",
                        save_strategy="epoch",
                        seed=RANDOM_SEED,
                        report_to = 'wandb', # enable logging to W&B
                        run_name = MODEL_CHECKPOINT+"-"+log_date, # name of the W&B run (optional)
                        metric_for_best_model="f1",
                        load_best_model_at_end = True) # reload the checkpoint with the best eval F1
      
    return args, lr_sched, optimizer

#==================TRAINING AND EVALUATION======================================#
  def compute_metrics(self, p):
      label_list = self.label_list
      predictions, labels = p
      predictions = np.argmax(predictions, axis=2)

      # Remove ignored index (special tokens)
      true_predictions = [
          [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
          for prediction, label in zip(predictions, labels)]
      true_labels = [
          [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
          for prediction, label in zip(predictions, labels)]
      
      # Define the metric parameters
      overall_precision = precision_score(true_labels, true_predictions, zero_division=1)
      overall_recall = recall_score(true_labels, true_predictions, zero_division=1)
      overall_f1 = f1_score(true_labels, true_predictions, zero_division=1)
      overall_accuracy = accuracy_score(true_labels, true_predictions)
      
      # Return a dictionary with the calculated metrics
      return {
          "precision": overall_precision,
          "recall": overall_recall,
          "f1": overall_f1,
          "accuracy": overall_accuracy,}

  def TRAINING(self):
      # WHICH PRE-TRAINED TRANSFORMER TO FINE-TUNE?
      models = self.models
      MODEL_CHECKPOINT = models[self.MODEL_NAME]
      datasets = self.DATASETS

      label_list = self.label_list

      # Create train and validation datasets (use the values stored on the instance, not globals)
      datasets = datasets['train'].train_test_split(test_size=1-self.TRAIN_SPLIT, seed=self.RANDOM_SEED)

      # Instantiate the tokenizer
      #For RoBERTa-base, need to use RobertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.
      # SentencePiece will need to be installed for DeBERTa v2: pip install sentencepiece
      if MODEL_CHECKPOINT == models['ROBERTA']:
          tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_CHECKPOINT, add_prefix_space=True)
      else:
          tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

      # To apply this function on all the words and labels in our dataset,
      # we just use the map method of our dataset object we created earlier.
      # This will apply the function on all the elements of all the splits in dataset, so our training, 
      # validation and testing data will be preprocessed in one single command.

      # 🤗 Datasets warns you when it uses cached files, you can pass load_from_cache_file=False in the
      # call to map to not use the cached files and force the preprocessing to be applied again.
      tokenize_and_align_labels = self.tokenize_and_align_labels
      if MODEL_CHECKPOINT == models['DEBERTA_V2_XL'] or MODEL_CHECKPOINT == models['DEBERTA_V2_XXL']:
          tokenize_and_align_labels = self.tokenize_and_align_labels_deberta


      tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True,
                                        load_from_cache_file=True,
                                        fn_kwargs={"tokenizer": tokenizer})
      
      data_collator = DataCollatorForTokenClassification(tokenizer)

      model = AutoModelForTokenClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=len(label_list))

      args, lr_sched, optimizer = self.SET_PARAMETERS(DATASETS=datasets, MODEL=model, MODEL_CHECKPOINT = MODEL_CHECKPOINT)

      # Define and instantiate the Trainer...
      trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tokenized_datasets["train"],
                eval_dataset=tokenized_datasets["test"],
                data_collator=data_collator,
                tokenizer=tokenizer,
                compute_metrics=self.compute_metrics,
                optimizers=(optimizer, lr_sched)
                )
      
      print('STARTING TRAINING----')
      trainer.train()
      print('TRAINING COMPLETED')

      print('STARTING EVALUATION ON CHOSEN EPOCH---')
      # Evaluate based on the chosen epoch (usually best or last)
      trainer.evaluate()
      print('EVALUATION FINISHED---')

      # Finish Weights & Biases logging for this run
      wandb.finish()

      # Save the model, good practice given the work required to train a model and  
      # also can be used just for inference on new data
      print("Saving model...")
      SAVED_MODEL = cwd + f"/models/p2d-NER-Fine-Tune-Transformer-{MODEL_CHECKPOINT}" # Change for notebook version
      trainer.save_model(SAVED_MODEL)
      print("Saved_model model...")
      return 


env: WANDB_PROJECT='P2D-NER-2021'
env: WANDB_LOG_MODEL=false


## TEST-1

In [None]:
DATASETS = datasets
MODEL_NAME = "ROBERTA" # CHOOSE FROM: ROBERTA, DISTILBERT_U, DISTILBERT_C, DEBERTA_V2_XL, DEBERTA_V2_XXL
TRAIN_SPLIT = 0.90
RANDOM_SEED = 42
BATCH_SIZES=1
EPOCHS = 10
FEATURE_CLASS_LABELS = cwd+"/feature_class_labels.json"
TEMP_MODEL_OUTPUT_DIR = cwd+'/temp_model_output_dir' # leading slash keeps the folder inside cwd
TRAIN = 1
LR=0.0000075


BUILD_MODEL(DATASETS=DATASETS,
            MODEL_NAME=MODEL_NAME,
            TRAIN_SPLIT=TRAIN_SPLIT,
            RANDOM_SEED=RANDOM_SEED,
            BATCH_SIZES=BATCH_SIZES,
            EPOCHS=EPOCHS,
            FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS,
            TEMP_MODEL_OUTPUT_DIR=TEMP_MODEL_OUTPUT_DIR,
            TRAIN=TRAIN,
            LR=LR).TRAINING()



Map:   0%|          | 0/282 [00:00<?, ? examples/s]

Map:   0%|          | 0/32 [00:00<?, ? examples/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able

The maximum learning rate is:  7.5e-06
STARTING TRAINING----


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.651 | 0.130176 | 0.641791 | 0.697297 | 0.668394 | 0.959348 |
| 2 | 0.0925 | 0.05823 | 0.806452 | 0.945946 | 0.870647 | 0.984514 |
| 3 | 0.0512 | 0.043918 | 0.863636 | 0.924324 | 0.89295 | 0.987256 |
| 4 | 0.037 | 0.050029 | 0.872549 | 0.962162 | 0.915167 | 0.986772 |
| 5 | 0.0278 | 0.049311 | 0.87 | 0.940541 | 0.903896 | 0.987579 |
| 6 | 0.021 | 0.043977 | 0.893401 | 0.951351 | 0.921466 | 0.989998 |
| 7 | 0.0193 | 0.048092 | 0.898477 | 0.956757 | 0.926702 | 0.989192 |
| 8 | 0.0186 | 0.050502 | 0.885 | 0.956757 | 0.919481 | 0.988385 |
| 9 | 0.0191 | 0.052557 | 0.871921 | 0.956757 | 0.912371 | 0.988547 |
| 10 | 0.0185 | 0.06497 | 0.841584 | 0.918919 | 0.878553 | 0.986933 |


TRAINING COMPLETED
STARTING EVALUATION ON CHOSEN EPOCH---


EVALUATION FINISHED---


Run history:

| Metric | History |
|---|---|
| eval/accuracy | ▁▇▇▇▇████▇█ |
| eval/f1 | ▁▆▇█▇████▇█ |
| eval/loss | █▂▁▁▁▁▁▂▂▃▁ |
| eval/precision | ▁▅▇▇▇███▇▆█ |
| eval/recall | ▁█▇█▇████▇█ |
| eval/runtime | ▄▂▂▁▃▁▄▁▃▁█ |
| eval/samples_per_second | ▅▇▇█▆█▅█▆█▁ |
| eval/steps_per_second | ▅▇▇█▆█▅█▆█▁ |
| train/epoch | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |
| train/global_step | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 0.98919 |
| eval/f1 | 0.9267 |
| eval/loss | 0.04809 |
| eval/precision | 0.89848 |
| eval/recall | 0.95676 |
| eval/runtime | 0.7158 |
| eval/samples_per_second | 44.707 |
| eval/steps_per_second | 44.707 |
| train/epoch | 10.0 |
| train/global_step | 2820.0 |


Saving model...
Saved_model model...


## TEST-2

In [None]:
DATASETS = datasets
MODEL_NAME = "DISTILBERT_U" # CHOOSE FROM: ROBERTA, DISTILBERT_U, DISTILBERT_C, DEBERTA_V2_XL, DEBERTA_V2_XXL
TRAIN_SPLIT = 0.90
RANDOM_SEED = 42
BATCH_SIZES=1
EPOCHS = 10
FEATURE_CLASS_LABELS = cwd+"/feature_class_labels.json"
TEMP_MODEL_OUTPUT_DIR = cwd+'/temp_model_output_dir' # leading slash keeps the folder inside cwd
TRAIN = 1
LR=0.0000075


BUILD_MODEL(DATASETS=DATASETS,
            MODEL_NAME=MODEL_NAME,
            TRAIN_SPLIT=TRAIN_SPLIT,
            RANDOM_SEED=RANDOM_SEED,
            BATCH_SIZES=BATCH_SIZES,
            EPOCHS=EPOCHS,
            FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS,
            TEMP_MODEL_OUTPUT_DIR=TEMP_MODEL_OUTPUT_DIR,
            TRAIN=TRAIN,
            LR=LR).TRAINING()



Map:   0%|          | 0/282 [00:00<?, ? examples/s]

Map:   0%|          | 0/32 [00:00<?, ? examples/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN t

The maximum learning rate is:  7.5e-06
STARTING TRAINING----




You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 1.0701 | 0.278909 | 0.0 | 0.0 | 0.0 | 0.902565 |
| 2 | 0.1443 | 0.078494 | 0.810526 | 0.832432 | 0.821333 | 0.981449 |
| 3 | 0.0669 | 0.052848 | 0.840206 | 0.881081 | 0.860158 | 0.985965 |
| 4 | 0.0464 | 0.05392 | 0.865285 | 0.902703 | 0.883598 | 0.985643 |
| 5 | 0.0341 | 0.049228 | 0.870466 | 0.908108 | 0.888889 | 0.986127 |
| 6 | 0.0288 | 0.050217 | 0.878307 | 0.897297 | 0.887701 | 0.986449 |
| 7 | 0.0253 | 0.050736 | 0.883598 | 0.902703 | 0.893048 | 0.986933 |
| 8 | 0.0255 | 0.051075 | 0.879581 | 0.908108 | 0.893617 | 0.986449 |
| 9 | 0.0257 | 0.050423 | 0.865979 | 0.908108 | 0.886544 | 0.986772 |
| 10 | 0.0237 | 0.055131 | 0.859296 | 0.924324 | 0.890625 | 0.986933 |


TRAINING COMPLETED
STARTING EVALUATION ON CHOSEN EPOCH---


EVALUATION FINISHED---


Run history:

| Metric | History |
|---|---|
| eval/accuracy | ▁██████████ |
| eval/f1 | ▁▇█████████ |
| eval/loss | █▂▁▁▁▁▁▁▁▁▁ |
| eval/precision | ▁▇█████████ |
| eval/recall | ▁▇█████████ |
| eval/runtime | ▂▄▁▂▅▂█▁▁▁▇ |
| eval/samples_per_second | ▆▅█▇▄▇▁█▇█▂ |
| eval/steps_per_second | ▆▅█▇▄▇▁█▇█▂ |
| train/epoch | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |
| train/global_step | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 0.98645 |
| eval/f1 | 0.89362 |
| eval/loss | 0.05108 |
| eval/precision | 0.87958 |
| eval/recall | 0.90811 |
| eval/runtime | 0.4221 |
| eval/samples_per_second | 75.812 |
| eval/steps_per_second | 75.812 |
| train/epoch | 10.0 |
| train/global_step | 2820.0 |


Saving model...
Saved_model model...


## TEST-3

In [None]:
DATASETS = datasets
MODEL_NAME = "DISTILBERT_C" # CHOOSE FROM: ROBERTA, DISTILBERT_U, DISTILBERT_C, DEBERTA_V2_XL, DEBERTA_V2_XXL
TRAIN_SPLIT = 0.90
RANDOM_SEED = 42
BATCH_SIZES=1
EPOCHS = 10
FEATURE_CLASS_LABELS = cwd+"/feature_class_labels.json"
TEMP_MODEL_OUTPUT_DIR = cwd+'/temp_model_output_dir' # leading slash keeps the folder inside cwd
TRAIN = 1
LR=0.0000075


BUILD_MODEL(DATASETS=DATASETS,
            MODEL_NAME=MODEL_NAME,
            TRAIN_SPLIT=TRAIN_SPLIT,
            RANDOM_SEED=RANDOM_SEED,
            BATCH_SIZES=BATCH_SIZES,
            EPOCHS=EPOCHS,
            FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS,
            TEMP_MODEL_OUTPUT_DIR=TEMP_MODEL_OUTPUT_DIR,
            TRAIN=TRAIN,
            LR=LR).TRAINING()



Map:   0%|          | 0/282 [00:00<?, ? examples/s]

Map:   0%|          | 0/32 [00:00<?, ? examples/s]

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this 

The maximum learning rate is:  7.5e-06
STARTING TRAINING----




You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 1.1375 | 0.288827 | 0.069519 | 0.07027 | 0.069892 | 0.905307 |
| 2 | 0.1899 | 0.135344 | 0.563452 | 0.6 | 0.581152 | 0.957412 |
| 3 | 0.1023 | 0.098185 | 0.610577 | 0.686486 | 0.64631 | 0.969189 |
| 4 | 0.07 | 0.08591 | 0.729592 | 0.772973 | 0.750656 | 0.977254 |
| 5 | 0.0487 | 0.081796 | 0.787565 | 0.821622 | 0.804233 | 0.97919 |
| 6 | 0.0382 | 0.085711 | 0.798942 | 0.816216 | 0.807487 | 0.979513 |
| 7 | 0.0346 | 0.084656 | 0.796875 | 0.827027 | 0.811671 | 0.979029 |
| 8 | 0.0341 | 0.086139 | 0.770408 | 0.816216 | 0.792651 | 0.978222 |
| 9 | 0.0347 | 0.086132 | 0.761421 | 0.810811 | 0.78534 | 0.979029 |
| 10 | 0.0329 | 0.088825 | 0.761421 | 0.810811 | 0.78534 | 0.978222 |


TRAINING COMPLETED
STARTING EVALUATION ON CHOSEN EPOCH---


EVALUATION FINISHED---


Run history:

| Metric | History |
|---|---|
| eval/accuracy | ▁▆▇████████ |
| eval/f1 | ▁▆▆▇███████ |
| eval/loss | █▃▂▁▁▁▁▁▁▁▁ |
| eval/precision | ▁▆▆▇███████ |
| eval/recall | ▁▆▇████████ |
| eval/runtime | ▃▁█▆▂▂▂▁▁▂█ |
| eval/samples_per_second | ▆█▁▃▆▇▇██▇▁ |
| eval/steps_per_second | ▆█▁▃▆▇▇██▇▁ |
| train/epoch | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |
| train/global_step | ▁▁▂▂▃▃▃▃▄▄▅▅▆▆▆▆▇▇████ |

Run summary:

| Metric | Value |
|---|---|
| eval/accuracy | 0.97903 |
| eval/f1 | 0.81167 |
| eval/loss | 0.08466 |
| eval/precision | 0.79688 |
| eval/recall | 0.82703 |
| eval/runtime | 0.4958 |
| eval/samples_per_second | 64.542 |
| eval/steps_per_second | 64.542 |
| train/epoch | 10.0 |
| train/global_step | 2820.0 |


Saving model...
Saved_model model...


## TEST-4

In [None]:
DATASETS = datasets
MODEL_NAME = "DEBERTA_V2_XL" # CHOOSE FROM: ROBERTA, DISTILBERT_U, DISTILBERT_C, DEBERTA_V2_XL, DEBERTA_V2_XXL
TRAIN_SPLIT = 0.90
RANDOM_SEED = 42
BATCH_SIZES=1
EPOCHS = 10
FEATURE_CLASS_LABELS = cwd+"/feature_class_labels.json"
TEMP_MODEL_OUTPUT_DIR = cwd+'/temp_model_output_dir' # leading slash keeps the folder inside cwd
TRAIN = 1
LR=0.0000075


BUILD_MODEL(DATASETS=DATASETS,
            MODEL_NAME=MODEL_NAME,
            TRAIN_SPLIT=TRAIN_SPLIT,
            RANDOM_SEED=RANDOM_SEED,
            BATCH_SIZES=BATCH_SIZES,
            EPOCHS=EPOCHS,
            FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS,
            TEMP_MODEL_OUTPUT_DIR=TEMP_MODEL_OUTPUT_DIR,
            TRAIN=TRAIN,
            LR=LR).TRAINING()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/282 [00:00<?, ? examples/s]

Map:   0%|          | 0/32 [00:00<?, ? examples/s]

Some weights of the model checkpoint at microsoft/deberta-v2-xlarge were not used when initializing DebertaV2ForTokenClassification: ['lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'deberta.embeddings.position_embeddings.weight']
- This IS expected if you are initializing DebertaV2ForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaV2ForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v2-xlarge and

The maximum learning rate is:  7.5e-06
STARTING TRAINING----


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss


# Document(s) Inference


In [None]:
import os
os.chdir("/content/drive/MyDrive/CUAD_NER_TRANSFORMERS/")
cwd = os.getcwd()

from spacy.lang.en import English
from datasets import Dataset, DatasetDict
from collections import defaultdict

nlp = English()

class INFERENCE:
  def __init__(self, LST, MODEL_NAME, FEATURE_CLASS_LABELS, TEMP_MODEL_OUTPUT_DIR, BATCH_SIZES, RANDOM_SEED):
    self.LST = LST

    self.MODEL_NAME = MODEL_NAME
    self.FEATURE_CLASS_LABELS = FEATURE_CLASS_LABELS
    self.BATCH_SIZES = BATCH_SIZES
    self.RANDOM_SEED = RANDOM_SEED

    #List of labels saved in the data preparation stage
    self.label_list = self.load_labels_list(FEATURE_CLASS_LABELS)

    #Models to test: Update as required
    self.models = dict(
        ROBERTA = "roberta-base",
        DISTILBERT_U = "distilbert-base-uncased",
        DISTILBERT_C = "distilbert-base-cased",
        DEBERTA_V2_XL = "microsoft/deberta-v2-xlarge",
        DEBERTA_V2_XXL = "microsoft/deberta-v2-xxlarge")
    
    self.TEMP_MODEL_OUTPUT_DIR = TEMP_MODEL_OUTPUT_DIR

  #Loads the list of labels
  def load_labels_list(self, FEATURE_CLASS_LABELS):
      # Open the label list created in pre-processing corresponding to the ner_tag indices
      with open(FEATURE_CLASS_LABELS, 'r') as f:
          label_list = json.load(f)
      return label_list

  # Text cleaning function for standard PDF parsing workflow
  def clean(self, text):
      text = text.replace("\n", " ")  # Replace newlines with spaces
      text = text.replace("\xa0", " ")  # Replace non-breaking spaces
      text = text.replace("\x0c", " ")  # Replace form feeds (page breaks)

      # Raw strings keep the regex escapes intact
      text = re.sub(r" \. ", ".", text)  # Collapse " . " into "."
      text = re.sub(r"_", " ", text)  # Replace underscores with spaces
      text = re.sub(r"--+", " ", text)  # Replace runs of two or more dashes with a space
      text = re.sub(r"\*+", "*", text)  # Collapse runs of stars into a single star
      text = re.sub(r" +", " ", text)  # Collapse runs of spaces into a single space
      
      text = text.strip()  #Strip leading and trailing whitespace
      return text
  
  def tokenize(self, text_df):
    # We tokenize each agreement prior to bringing into the transformer model
    # Create tokens using spaCy
    text_df['tokens'] = text_df['Short_Text'].apply(lambda x: nlp(x))

    # Split tokens into a list ready for CSV
    text_df['split_tokens'] = text_df['tokens'].apply(lambda x: [tok.text for tok in x])

    # Create dummy NER tags for alignment purposes (a bit lazy, but convenient)
    text_df['dummy_ner_tags'] = text_df['tokens'].apply(lambda x: [0 for tok in x])

    # Build a Hugging Face Dataset from the token and dummy-tag columns
    export_columns = ['split_tokens', 'dummy_ner_tags']
    export_df = text_df[export_columns]
    ds = Dataset.from_pandas(export_df)

    datasets = DatasetDict()
    datasets['inference'] = ds

    # export_df.to_json(TEST_DATA_FILE, orient="table", index=False)
    text_df = text_df.drop(['dummy_ner_tags'], axis=1)

    # Return the dataset in the format needed for the transformer, along with the DataFrame
    return datasets, text_df


#==================DATA PREPROCESSING AND TOKENIZATION=========================#
  def word_id_func(self, input_ids, tokenizer, print_labs=False):
      tokens = tokenizer.convert_ids_to_tokens(input_ids)
      
      word_ids = []
      i = 0
      spec_toks = ['[CLS]', '[SEP]', '[PAD]']
      for t in tokens:
          if t in spec_toks:
              # Special tokens are marked with -100 so they are ignored downstream
              word_ids.append(-100)
          else:
              # A leading '▁' marks the start of a new word in the SentencePiece output
              if t.startswith('▁'):
                  i += 1
              word_ids.append(i)
          if print_labs:
              print(t, i)
              print("Total:", i)
      return word_ids

  def tokenize_and_align_labels(self, examples, tokenizer,  label_all_tokens=False):
      tokenized_inputs = tokenizer(examples["split_tokens"],
                                  truncation=True,
                                  is_split_into_words=True)
      labels = []
      for i, label in enumerate(examples["dummy_ner_tags"]):
          word_ids = tokenized_inputs.word_ids(batch_index=i)
          previous_word_idx = None
          label_ids = []
          for word_idx in word_ids:
              # Special tokens have a word id that is None. We set the label to -100 so they are automatically
              # ignored in the loss function.
              if word_idx is None:
                  label_ids.append(-100)
              # We set the label for the first token of each word.
              elif word_idx != previous_word_idx:
                  label_ids.append(label[word_idx])
              # For the other tokens in a word, we set the label to either the current label or -100, depending on
              # the label_all_tokens flag.
              else:
                  label_ids.append(label[word_idx] if label_all_tokens else -100)
              previous_word_idx = word_idx
          labels.append(label_ids)

      tokenized_inputs["labels"] = labels
      return tokenized_inputs

  def tokenize_and_align_labels_deberta(self, examples, tokenizer, label_all_tokens=False):
      tokenized_inputs = tokenizer(examples["split_tokens"],
                                  truncation=True,
                                  is_split_into_words=True)
      labels = []
      word_ids_list = []
      for input_ids in tokenized_inputs["input_ids"]:
          wids = self.word_id_func(input_ids, tokenizer,  print_labs=False)
          word_ids_list.append(wids)
      
      for i, label in enumerate(examples["dummy_ner_tags"]):
          word_ids = word_ids_list[i]
          previous_word_idx = None
          label_ids = []
          for word_idx in word_ids:
              # Special tokens have a word id of -100 (set in word_id_func). We keep the label at -100 so they are
              # automatically ignored in the loss function.
              if word_idx == -100:
                  label_ids.append(-100)
              #We set the label for the first token of each word.
              elif word_idx != previous_word_idx:
                  label_ids.append(label[word_idx-1])
              # For the other tokens in a word, we set the label to either the current label or -100, depending on
              # the label_all_tokens flag.
              else:
                  label_ids.append(label[word_idx-1] if label_all_tokens else -100)
              previous_word_idx = word_idx
          labels.append(label_ids)

      tokenized_inputs["labels"] = labels
      return tokenized_inputs
#-------------------------------------------------------------------------------#
  def LOAD_MODEL(self, SAVED_MODEL, TOKENIZER, BATCH_SIZES, RANDOM_SEED):
    # Load the model and instantiate
    loaded_model = AutoModelForTokenClassification.from_pretrained(SAVED_MODEL)

    args = TrainingArguments(output_dir = self.TEMP_MODEL_OUTPUT_DIR,
                            per_device_train_batch_size=BATCH_SIZES,
                            per_device_eval_batch_size=BATCH_SIZES,
                            seed=RANDOM_SEED
                            )

    data_collator = DataCollatorForTokenClassification(TOKENIZER)

    # Note instantiation currently takes a bit of time: https://github.com/huggingface/transformers/issues/9205
    # Instantiate the predictor
    pred_trainer = Trainer(
        loaded_model,
        args,
        data_collator=data_collator,
        tokenizer=TOKENIZER)
    return pred_trainer
  
  def GET_PREDICTIONS(self, text_df, pred_trainer, tokenized_datasets, label_list):
    # Extract the predictions
    predictions, labels, _ = pred_trainer.predict(tokenized_datasets["inference"])
    predictions = np.argmax(predictions, axis=2)
    text_df['predictions'] = list(predictions)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    text_df['true_predictions'] = true_predictions
    return text_df

  # Consolidate all the information into the DataFrame
  def EXTRACT_DATA(self, tuple_list):
      de_list = []
      for tup in tuple_list:
          if tup[1] != 'O':
              de_list.append(tup)
      return de_list

  #------------------------RETRIEVING INFORMATION-----------------------------#
  def extract_agreement_date(self, tuple_list):
      temp_date = ""  # Default when no AGMT_DATE tag was predicted
      for d in tuple_list:
          if d[1] == "B-AGMT_DATE":
              temp_date = d[0]
          elif d[1] == "I-AGMT_DATE":
              temp_date = (temp_date + " " + d[0]).strip()
      return temp_date


  def extract_agreement_name(self, tuple_list):
      temp_name = ""  # Default when no DOC_NAME tag was predicted
      for n in tuple_list:
          if n[1] == "B-DOC_NAME":
              temp_name = n[0]
          elif n[1] == "I-DOC_NAME":
              temp_name = (temp_name + " " + n[0]).strip()
      return temp_name
  

  def extract_agreement_parties(self, tuple_list):
      parties = []
      temp_party = ""  # Guards against an I-PARTY tag appearing before any B-PARTY
      for i, p in enumerate(tuple_list):
          if p[1] == "B-PARTY":
              temp_party = p[0]
          elif p[1] == "I-PARTY":
              temp_party = (temp_party + " " + p[0]).strip()
          else:
              continue
          # Close the party when this is the last tuple or the next tag does not continue it
          is_last = i == (len(tuple_list) - 1)
          if is_last or tuple_list[i+1][1] != "I-PARTY":
              parties.append(temp_party)

      # Remove duplicates while preserving order
      return list(dict.fromkeys(parties))
  #------------------------RETRIEVING INFORMATION-----------------------------#


  def RUN_INFERENCE(self):
    columns = ['File_Name','Full_Text']
    lst_docs = self.LST
    df = pd.DataFrame(lst_docs)
    df.columns = columns
    df['Short_Text'] = df.apply(lambda x: self.clean(x['Full_Text']), axis=1)
    datasets, df = self.tokenize(df)
    label_list = self.label_list

    # WHICH PRE-TRAINED TRANSFORMER TO FINE-TUNE?
    models = self.models
    MODEL_CHECKPOINT = models[self.MODEL_NAME]
    SAVED_MODEL = cwd + f"/models/p2d-NER-Fine-Tune-Transformer-{MODEL_CHECKPOINT}" # Change for notebook version

    # Instantiate the tokenizer
    #For RoBERTa-base, need to use RobertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.

    if MODEL_CHECKPOINT == models['ROBERTA']:
        tokenizer = RobertaTokenizerFast.from_pretrained(MODEL_CHECKPOINT, add_prefix_space=True)
    else:
        tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

    # To apply this function on all the words and labels in our dataset,
    # we just use the map method of our dataset object we created earlier.

    # 🤗 Datasets warns you when it uses cached files, you can pass load_from_cache_file=False in the
    # call to map to not use the cached files and force the preprocessing to be applied again.
    if MODEL_CHECKPOINT == models['DEBERTA_V2_XL']:
        tokenize_and_align_labels = self.tokenize_and_align_labels_deberta
    else:
        tokenize_and_align_labels = self.tokenize_and_align_labels

    tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True,
                                        load_from_cache_file=True,
                                        fn_kwargs={"tokenizer": tokenizer})
    
    pred_trainer = self.LOAD_MODEL(SAVED_MODEL=SAVED_MODEL, TOKENIZER=tokenizer,
                                   BATCH_SIZES = self.BATCH_SIZES,
                                   RANDOM_SEED = self.RANDOM_SEED)
    
    df = self.GET_PREDICTIONS(text_df = df, pred_trainer=pred_trainer,
                              tokenized_datasets = tokenized_datasets,
                              label_list = label_list)
    
    df['check_pred'] = list(list(zip(a,b)) for a,b in zip(df['split_tokens'], df['true_predictions']))
    df['data_tuples'] = df['check_pred'].apply(self.EXTRACT_DATA)

    # Format the tagged information into name, date and parties columns
    df['agmt_name'] = df['data_tuples'].apply(self.extract_agreement_name)
    df['agmt_date'] = df['data_tuples'].apply(self.extract_agreement_date)
    df['agmt_parties'] = df['data_tuples'].apply(self.extract_agreement_parties)

    # Create a DataFrame with just the information we want to keep, sorted by file name
    df_ex = df[['File_Name', 'agmt_name', 'agmt_date', 'agmt_parties', 'Full_Text']].copy()
    df_ex = df_ex.sort_values('File_Name', axis=0)
    return df_ex
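
The three `extract_*` methods above all walk the `(token, tag)` tuples and stitch `B-`/`I-` runs back into strings. A minimal standalone sketch of the same idea, generalised to any entity label, might look like the following. The function name `group_bio_spans` and the sample tuples are illustrative assumptions and are not part of the notebook; unlike `EXTRACT_DATA`, this version expects the full tag sequence, including `O` tags.

```python
from collections import defaultdict

def group_bio_spans(tagged_tokens):
    """Group (token, tag) pairs in BIO format into {label: [entity strings]}."""
    spans = defaultdict(list)
    label, words = None, []
    # A trailing ("", "O") sentinel flushes whatever span is still open at the end
    for token, tag in list(tagged_tokens) + [("", "O")]:
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            if label is not None:
                spans[label].append(" ".join(words))
            # Only a B- tag opens a new span; stray I- tags and O tags are dropped
            label, words = (name, []) if prefix == "B" else (None, [])
        if label is not None:
            words.append(token)
    return dict(spans)

# Hypothetical sample, shaped like one row of df['check_pred']
sample = [
    ("Manufacturing", "B-DOC_NAME"), ("Agreement", "I-DOC_NAME"),
    ("dated", "O"),
    ("December", "B-AGMT_DATE"), ("21", "I-AGMT_DATE"),
    (",", "I-AGMT_DATE"), ("2017", "I-AGMT_DATE"),
    ("ADMA", "B-PARTY"), ("LLC", "I-PARTY"), ("Sanofi", "B-PARTY"),
]
entities = group_bio_spans(sample)
```

Keeping every span in a list per label, rather than a single string, avoids silently overwriting an earlier match when a label is predicted more than once in a document.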

In [None]:
path = "/content/drive/MyDrive/CUAD_NER_TRANSFORMERS/CUAD_v1/full_contract_txt/ADMA BioManufacturing, LLC -  Amendment #3 to Manufacturing Agreement .txt"
with open(path, 'r') as f:
    txt = f.read()
docs = [
    ["1", txt],
    ['2', txt]
    ]
MODEL_NAME = 'DISTILBERT_U'  # Choose from ROBERTA, DISTILBERT_U, DISTILBERT_C, DEBERTA_V2_XL, DEBERTA_V2_XXL
FEATURE_CLASS_LABELS = "/content/drive/MyDrive/CUAD_NER_TRANSFORMERS/feature_class_labels.json"
TEMP_MODEL_OUTPUT_DIR = 'temp_model_output_dir'
BATCH_SIZES = 4
RANDOM_SEED = 42

results=  INFERENCE(LST=docs, MODEL_NAME = MODEL_NAME,
                    FEATURE_CLASS_LABELS=FEATURE_CLASS_LABELS,
                    TEMP_MODEL_OUTPUT_DIR=TEMP_MODEL_OUTPUT_DIR,
                    BATCH_SIZES = BATCH_SIZES,
                    RANDOM_SEED = RANDOM_SEED).RUN_INFERENCE()

results

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Unnamed: 0,File_Name,agmt_name,agmt_date,agmt_parties,Full_Text
0,1,Manufacturing Agreement,"December 21 , 2017","[ADMA BioManufacturing , LLC, Sanofi Pasteur S...",Confidential treatment has been requested with...
1,2,Manufacturing Agreement,"December 21 , 2017","[ADMA BioManufacturing , LLC, Sanofi Pasteur S...",Confidential treatment has been requested with...
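
The `agmt_date` and `agmt_parties` values above keep the spaces that come from joining spaCy tokens (for example `December 21 , 2017`). If cleaner strings are wanted, a small post-processing step could be applied to the extracted columns. The helper name `tidy_spacing` is hypothetical, and the rule is only a rough heuristic for common punctuation, not a full detokenizer.

```python
import re

def tidy_spacing(text):
    """Remove the spaces that token-joining leaves around punctuation."""
    # Drop whitespace before closing punctuation: "21 , 2017" -> "21, 2017"
    text = re.sub(r"\s+([,.;:!?)\]])", r"\1", text)
    # Drop whitespace after opening brackets: "( the" -> "(the"
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text.strip()

tidy_date = tidy_spacing("December 21 , 2017")
tidy_party = tidy_spacing("ADMA BioManufacturing , LLC")
```

With the `results` DataFrame from the cell above, this could be applied as `results['agmt_date'].apply(tidy_spacing)`.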
