<a href="https://colab.research.google.com/github/eddieguo-1128/LS190/blob/main/labs/lab6_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 6
## Introduction
In this lab, we introduce Transformers, a deep learning architecture that has reshaped NLP. Previously, our machine learning models were based on distributed word embeddings that encode information about the distribution of contexts a word appears in. Word embedding models such as Word2Vec are based on the distributional hypothesis: **words that appear in similar contexts have similar representations**. However, a static distributed representation of words has several shortcomings.

First, pre-trained word embeddings like Word2Vec work well for words that appear frequently in the corpus but not for rare or unknown words. These models represent common words such as “the” and “to” well, but represent words like “Zsombor” and “Sandberger” poorly.

Another shortcoming of static distributed representations is polysemy (i.e., a single word has multiple meanings). Consider the following example:
> Let's go to the **park**

> I'll **park** the car

The word “park” appears in both sentences, and word2vec gives us the same vector representation for “park” in both cases. However, the two occurrences clearly differ in both part of speech and meaning. This is known as the **type-token distinction problem**: word2vec captures the type (the general class or concept of the word “park”) but not its individual instances, or tokens.

The big idea of contextualized word embeddings is to transform the representation of a token in a sentence (e.g., starting from a static word embedding) so that it is sensitive to its local context in the sentence and can be trained to optimize a specific NLP task.
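To make the polysemy problem concrete, here is a minimal sketch of a static embedding lookup. The vocabulary and the random 4-dimensional vectors are made up for illustration (they are not real Word2Vec vectors); the only point is that a table keyed by word type returns the identical vector for “park” in both sentences.

```python
import numpy as np

sentences = [
    "let's go to the park".split(),
    "i'll park the car".split(),
]

# Hypothetical toy embedding table: one fixed random vector per word type.
rng = np.random.default_rng(0)
vocab = sorted({word for sentence in sentences for word in sentence})
static_embeddings = {word: rng.normal(size=4) for word in vocab}

def embed(sentence):
    """Look up each token's vector; the surrounding context is ignored."""
    return [static_embeddings[word] for word in sentence]

noun_park = embed(sentences[0])[sentences[0].index("park")]
verb_park = embed(sentences[1])[sentences[1].index("park")]

# The noun and the verb receive exactly the same representation.
print(np.array_equal(noun_park, verb_park))
```

A contextualized model would instead compute each token's vector from the whole sentence, so the two occurrences could end up with different representations.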

## Lab Setup
### Extract Data from Case.law API

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/MyDrive/LS190

Mounted at /content/gdrive
/content/gdrive/MyDrive/LS190


In [None]:
import os
import sys
sys.path.append('..')

import lzma
import json

from config import settings
import utils

In [None]:
compressed_file = utils.get_cases_from_bulk(jurisdiction="California", data_format="json")

downloading California-20200302-text.zip into ../data dir


553580it [00:09, 57375.24it/s]


extracting California-20200302-text.zip into ../data dir
Done.


In [None]:
compressed_file

'/content/gdrive/MyDrive/LS190/config/../data/California-20200302-text/data/data.jsonl.xz'

In [None]:
cases = []
print("File path:", compressed_file)
with lzma.open(compressed_file) as infile:
    for line in infile:
        record = json.loads(str(line, 'utf-8'))
        cases.append(record)

print("Case count: %s" % len(cases))

File path: /content/gdrive/MyDrive/LS190/config/../data/California-20200302-text/data/data.jsonl.xz
Case count: 141535


In [None]:
df = pd.DataFrame(cases)
df.head()

Unnamed: 0,id,url,name,name_abbreviation,decision_date,docket_number,first_page,last_page,citations,volume,reporter,court,jurisdiction,frontend_url,preview,casebody
0,505141,https://api.capapi.org/v1/cases/505141/,"JIMMY DEAN ZIEGLER, as Trustee, etc., et al., ...",Ziegler v. Nickel,1998-06-04,No. B100335,545,549,"[{'type': 'official', 'cite': '64 Cal. App. 4t...",{'url': 'https://api.capapi.org/v1/volumes/320...,{'url': 'https://api.capapi.org/v1/reporters/3...,{'url': 'https://api.capapi.org/v1/courts/cal-...,"{'name': 'Cal.', 'name_long': 'California', 'w...",https://cite.capapi.org/cal-app-4th/64/545/,[],"{'data': {'judges': [], 'attorneys': ['Counsel..."
1,505122,https://api.capapi.org/v1/cases/505122/,"THE PEOPLE, Plaintiff and Respondent, v. ALAN ...",People v. Shaw,1998-06-02,No. F026821,492,501,"[{'type': 'official', 'cite': '64 Cal. App. 4t...",{'url': 'https://api.capapi.org/v1/volumes/320...,{'url': 'https://api.capapi.org/v1/reporters/3...,{'url': 'https://api.capapi.org/v1/courts/cal-...,"{'name': 'Cal.', 'name_long': 'California', 'w...",https://cite.capapi.org/cal-app-4th/64/492/,[],"{'data': {'judges': [], 'attorneys': ['Counsel..."
2,505083,https://api.capapi.org/v1/cases/505083/,"THE PEOPLE, Plaintiff and Respondent, v. DAVID...",People v. Lopez,1998-06-15,No. B115397,1122,1129,"[{'type': 'official', 'cite': '64 Cal. App. 4t...",{'url': 'https://api.capapi.org/v1/volumes/320...,{'url': 'https://api.capapi.org/v1/reporters/3...,{'url': 'https://api.capapi.org/v1/courts/cal-...,"{'name': 'Cal.', 'name_long': 'California', 'w...",https://cite.capapi.org/cal-app-4th/64/1122/,[],"{'data': {'judges': [], 'attorneys': ['Counsel..."
3,505120,https://api.capapi.org/v1/cases/505120/,"20TH CENTURY INSURANCE COMPANY et al., Plainti...",20th Century Insurance v. Quackenbush,1998-05-22,No. A079667,135,142,"[{'type': 'official', 'cite': '64 Cal. App. 4t...",{'url': 'https://api.capapi.org/v1/volumes/320...,{'url': 'https://api.capapi.org/v1/reporters/3...,{'url': 'https://api.capapi.org/v1/courts/cal-...,"{'name': 'Cal.', 'name_long': 'California', 'w...",https://cite.capapi.org/cal-app-4th/64/135/,[],"{'data': {'judges': [], 'attorneys': ['Counsel..."
4,505113,https://api.capapi.org/v1/cases/505113/,"SAN DIEGO GAS & ELECTRIC CO., Plaintiff and Re...",San Diego Gas & Electric Co. v. City of Carlsbad,1998-06-09,No. D027407,785,806,"[{'type': 'official', 'cite': '64 Cal. App. 4t...",{'url': 'https://api.capapi.org/v1/volumes/320...,{'url': 'https://api.capapi.org/v1/reporters/3...,{'url': 'https://api.capapi.org/v1/courts/cal-...,"{'name': 'Cal.', 'name_long': 'California', 'w...",https://cite.capapi.org/cal-app-4th/64/785/,[],"{'data': {'judges': [], 'attorneys': ['Counsel..."


### Text Preprocessing

In [None]:
import nltk
from nltk.corpus import treebank 
from nltk.tree import Tree
import string
import re
import os
import argparse
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Extract text body from a nested dictionary 
def get_text(x):
    if len(x['data']['opinions'])>0:
        return x['data']['opinions'][0]['text']
    return 0

def remove_punct(text):
    text  = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

def tokenization(text):
    text = text.strip()
    text = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    return text

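def remove_stopwords(text):
    # A set gives constant-time membership checks.
    # The comparison is case-sensitive and the text is only lowercased later,
    # so capitalized stopwords such as "The" are kept.
    stopword = set(nltk.corpus.stopwords.words('english'))
    text = [word for word in text if word not in stopword]
    return text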

def lemmatizer(text):
    wn = nltk.WordNetLemmatizer()
    text = [wn.lemmatize(word) for word in text]
    return text

def clean_text(text):
    text = remove_punct(text)
    text = tokenization(text)
    text = remove_stopwords(text)
    text = lemmatizer(text)
    return ' '.join(text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
df['court_name'] = df["court"].apply(lambda x:x['name'])
courts = ['Court of Appeal of the State of California','Supreme Court of California']
data = df[df['court_name'].isin(courts)].copy()  # copy so the assignments below do not write to a view of df
data['label'] = data['court_name'].map({courts[0]: 0, courts[1]: 1})  # 0 = Court of Appeal, 1 = Supreme Court
data['case_text'] = data['casebody'].apply(lambda x:get_text(x))
data = data[data['case_text']!=0]
data = data.groupby('label',group_keys=False).apply(lambda x: x.sample(1000))
data['clean_text'] = data['case_text'].apply(lambda x : clean_text(x))
data['clean_text'] = data['clean_text'].str.lower()
data = data[['case_text','clean_text','label']]
data.head()

Unnamed: 0,case_text,clean_text,label
108929,"Opinion\nGABBERT, J.\nIn this action to recove...",opinion gabbert j in action recover personal i...,0
104685,"Opinion\nASHBY, J.\nWhile he was in the course...",opinion ashby j while course burglarizing resi...,0
110520,"Opinion\nSTRANKMAN, P. J.\nBy petition for wri...",opinion strankman p j by petition writ habeas ...,0
81310,"Opinion\nSTANIFORTH, J.\nHusband George Thomas...",opinion staniforth j husband george thomas cul...,0
134260,"Opinion\nNICHOLSON, J.\nThe trial court denied...",opinion nicholson j the trial court denied def...,0


## Part 1 Hugging Face Introduction
Hugging Face is a platform where a broad community of data scientists, researchers, and machine learning engineers share ideas, get support, and contribute to open source projects. It provides tools for building, training, and deploying ML models based on open source code and technologies. The transformer models we use in this lab come from Hugging Face. Shown below is a series of tasks that transformers can perform.

In [None]:
!git clone https://github.com/nlp-with-transformers/notebooks.git
%cd notebooks
from install import *
install_requirements()

fatal: destination path 'notebooks' already exists and is not an empty directory.
/content/gdrive/MyDrive/LS190/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!


In [None]:
from transformers import pipeline

In [None]:
# Example Text
text = data.sample().case_text.values[0][:512]
text

'Opinion\nCROSKEY, J.\nLebas Fashion Imports of USA, Inc. (Lebas), appeals from a summary judgment granted in favor of ITT Hartford Insurance Group (Hartford) on Lebas’s first amended complaint for breach of an insurance contract and breach of the implied covenant of good faith. After Lebas had been sued in federal court for trademark infringement, Hartford, which had issued a commercial general liability (CGL) policy to Lebas, denied coverage and refused to provide Lebas with a defense on the ground that the '

### Named Entity Recognition
Named entity recognition is an NLP technique that automatically scans a document, pulls out the fundamental entities in the text, and classifies them into predefined categories (name, location, company).

In [None]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)


Unnamed: 0,entity_group,score,word,start,end
0,ORG,0.953774,##OS,10,12
1,ORG,0.998531,J,17,18
2,ORG,0.992996,"Lebas Fashion Imports of USA, Inc",20,53
3,ORG,0.999178,Lebas,56,61
4,ORG,0.998107,ITT Hartford Insurance Group,116,144
5,ORG,0.995485,Hartford,146,154
6,ORG,0.996717,Lebas,159,164
7,ORG,0.997728,Lebas,283,288
8,ORG,0.998597,Hartford,348,356
9,ORG,0.994444,Lebas,422,427


### Text Summarization
Another important NLP task is to have a machine read through a long text and generate a summary for us.

In [None]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


 Lebas Fashion Imports of USA, Inc. (Lebas) appeals from a summary judgment granted in favor of ITT Hartford Insurance Group (Hartford) on Lebas’s first amended complaint


### Translation
We can also translate our text from English to other supported languages. In the demo below, we translate our text from English to German.

In [None]:
translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Stellungnahme CROSKEY, J. Lebas Fashion Imports of USA, Inc. (Lebas), Berufungen aus einem summarischen Urteil zugunsten der ITT Hartford Insurance Group (Hartford) auf Lebas' erste Änderung Beschwerde für die Verletzung eines Versicherungsvertrages und die Verletzung des impliziten Bund des guten Glaubens. Nachdem Lebas war vor Bundesgericht wegen Markenverletzung verklagt worden, Hartford, die eine kommerzielle allgemeine Haftung (CGL) Politik an Lebas erlassen hatte, verweigerte Deckung und weigerte sich, Lebas mit einer Verteidigung auf dem Boden, dass die


## Part 2 Fine Tuning for Text Classification
In the previous lab, we experimented with several machine learning models to automatically classify case texts by the court that issued them. This time, we continue this binary classification task, but with the help of transformer models. By the end of this lab, you can compare your results with those from the previous lab to see whether classification accuracy improves.

The model we are going to use is **BERT**, which stands for Bidirectional Encoder Representations from Transformers. Transformers were originally proposed in [Attention Is All You Need](https://arxiv.org/abs/1706.03762?context=cs). BERT is a transformer-based model pre-trained on two objectives: predicting masked words from bidirectional context, and next sentence prediction.

Once we have the model, we should not put it to use right away. Language models are usually pre-trained on large generic corpora such as Wikipedia and BookCorpus, so they are not yet adapted to our specific task. A necessary step when deploying a transformer-based model is therefore **fine-tuning**: continuing to train the pre-trained model on our own labeled data.

In [None]:
!pip install transformers
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.2.2-py3-none-any.whl (69 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting datasets>=2.0.0
  Downloading datasets-2.4.0-py3-none-any.whl (365 kB)
Installing collected packages: responses, datasets, evaluate
  Attempting uninstall: datasets
    Found existing installation: datasets 1.16.1
    Uninstalling datasets-1.16.1:
      Successfully uninstalled datasets-1.16.1
Successfully installed datasets-2.4.0 evaluate-0.2.2 responses-0.18.0


In [None]:
# Check GPU Available
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
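# Guard the call so the cell does not raise an error on a CPU-only runtime
torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU available; running on CPU"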

'Tesla P100-PCIE-16GB'

In [None]:
# Split our data into train, validation, and test set
from sklearn.model_selection import train_test_split

train_text,test_text,train_labels,test_labels = train_test_split(data.clean_text.tolist(),data.label.tolist(),test_size = 0.2,shuffle=True)
train_text,validation_text,train_labels,validation_labels = train_test_split(train_text,train_labels,test_size = 0.2,shuffle=True) 

In [None]:
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [None]:
# Tokenize each split: truncate sequences longer than max_length and pad shorter ones with 0s up to the longest sequence in the split
train_encodings = tokenizer(train_text, truncation=True, padding=True,max_length=512)
val_encodings = tokenizer(validation_text, truncation=True, padding=True,max_length=512)
test_encodings = tokenizer(test_text, truncation=True, padding=True,max_length=512)
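As a rough illustration of what truncation and padding do to a single sequence, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and the token ids are hypothetical, and the sketch ignores details of the real tokenizer, such as the special tokens it adds and the fact that `padding=True` pads to the longest sequence rather than to a fixed length.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id."""
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return padded, attention_mask

long_ids = list(range(1, 9))  # 8 hypothetical token ids
short_ids = [5, 6, 7]         # 3 hypothetical token ids

print(pad_or_truncate(long_ids, max_length=6))
print(pad_or_truncate(short_ids, max_length=6))
```

One consequence worth keeping in mind: court opinions are often far longer than 512 tokens, so with `max_length=512` the model only sees the beginning of each opinion.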

In [None]:
# Wraps our tokenized text data into a torch Dataset
class CaseLawDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# convert our tokenized data into a torch Dataset
train_dataset = CaseLawDataset(train_encodings, train_labels)
val_dataset = CaseLawDataset(val_encodings, validation_labels)
test_dataset = CaseLawDataset(test_encodings, test_labels)

Now that we have our data prepared, let's download and load our model and its pre-trained weights. We use DistilBERT, a smaller distilled version of BERT.

In [None]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased",num_labels=2)

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

***** Running training *****
  Num examples = 1280
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 240


Step,Training Loss
10,0.7035
20,0.6932
30,0.6897
40,0.6864
50,0.6771
60,0.6706
70,0.6253
80,0.6083
90,0.5227
100,0.4751




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=240, training_loss=0.4695767253637314, metrics={'train_runtime': 105.8549, 'train_samples_per_second': 36.276, 'train_steps_per_second': 2.267, 'total_flos': 508674810839040.0, 'train_loss': 0.4695767253637314, 'epoch': 3.0})
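The step count in the log can be checked with a little arithmetic. The sketch below reproduces the 240 optimization steps from the logged number of examples, the batch size, and the number of epochs. It also estimates how far the learning rate warmup gets, assuming the scheduler scales the learning rate linearly by `step / warmup_steps` during warmup; that assumption should be checked against the scheduler actually in use.

```python
import math

num_examples = 1280   # "Num examples" in the training log
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
warmup_steps = 500    # warmup_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Assumption: linear warmup multiplies the peak learning rate by
# step / warmup_steps until warmup_steps is reached.
final_lr_fraction = min(1.0, total_steps / warmup_steps)

print(steps_per_epoch, total_steps, final_lr_fraction)
```

Under that assumption, training ends after 240 steps while warmup would last 500, so the learning rate never reaches its configured peak. A smaller `warmup_steps` value may be worth trying when the training set is this small.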

In [None]:
predictions = trainer.predict(test_dataset)
predictions

***** Running Prediction *****
  Num examples = 320
  Batch size = 64


PredictionOutput(predictions=array([[ 1.3734094 , -1.651303  ],
       [-1.4373379 ,  1.4029785 ],
       [ 1.2980497 , -1.5520647 ],
       [ 1.2307769 , -1.4378827 ],
       [-1.4577458 ,  1.456336  ],
       [ 1.0682508 , -1.211369  ],
       [ 1.2908113 , -1.5119854 ],
       [-1.2316029 ,  1.1949487 ],
       [-1.5827463 ,  1.5750128 ],
       [-0.18768261,  0.12807383],
       [-1.4219595 ,  1.395848  ],
       [-1.3358335 ,  1.3325804 ],
       [ 1.2711787 , -1.4967383 ],
       [ 0.21260878, -0.25526133],
       [ 1.3112333 , -1.5778143 ],
       [ 1.253402  , -1.41781   ],
       [ 1.0423691 , -1.23419   ],
       [ 1.2851619 , -1.531584  ],
       [ 1.2998314 , -1.5858424 ],
       [-1.5740647 ,  1.5357155 ],
       [-1.5052648 ,  1.4722414 ],
       [-1.3763804 ,  1.3617523 ],
       [-0.8281723 ,  0.7590267 ],
       [-0.10482173, -0.11213654],
       [-1.4055827 ,  1.3274827 ],
       [ 1.3024869 , -1.5691125 ],
       [-1.4851418 ,  1.4797204 ],
       [-1.4351127 ,  1.42

In [None]:
preds = np.argmax(predictions.predictions, axis=-1)
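The values in `predictions.predictions` are raw logits, one column per class. The sketch below takes the first three rows from the output above, converts them to probabilities with a softmax, and picks the most likely class. The `softmax` helper is written here for illustration; since softmax preserves ordering, taking `argmax` of the logits directly gives the same labels.

```python
import numpy as np

# First three rows of logits copied from the prediction output above.
logits = np.array([
    [ 1.3734094, -1.651303 ],
    [-1.4373379,  1.4029785],
    [ 1.2980497, -1.5520647],
])

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

probs = softmax(logits)
pred_labels = probs.argmax(axis=-1)  # 0 = Court of Appeal, 1 = Supreme Court

print(probs.round(3))
print(pred_labels)
```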

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

{'accuracy': 0.8875, 'f1': 0.8767123287671235}
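The GLUE MRPC metric reports accuracy and F1 for a binary task. To show what those two numbers measure, here is a minimal sketch that computes them by hand with label 1 treated as the positive class. The function `accuracy_and_f1` and the toy labels are hypothetical; they are not the lab's actual predictions.

```python
import numpy as np

def accuracy_and_f1(preds, labels):
    """Accuracy and binary F1, with label 1 as the positive class."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    accuracy = float(np.mean(preds == labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Hypothetical toy data for illustration only.
toy_preds = [1, 0, 1, 1, 0, 0]
toy_labels = [1, 0, 0, 1, 1, 0]
print(accuracy_and_f1(toy_preds, toy_labels))
```

Because the two classes were sampled in equal numbers, accuracy is a reasonable headline metric here; F1 becomes more informative when the classes are imbalanced.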