# Classification of Unstructured Documents 
## *Transfer Learning with BERT*
### GRAD-E1394 Deep Learning

---

Authors:
*   Ma. Adelle Gia Arbo, m.arbo@students.hertie-school.org
*   Janine De Vera, j.devera@students.hertie-school.org
*   Lorenzo Gini, l.gini@students.hertie-school.org
*   Lukas Warode, l.warode@students.hertie-school.org | lukas.warode@gmx.de

---

This tutorial demonstrates the pipeline for classifying unstructured documents, particularly those in PDF format. 


# Table of Contents


*   [Memo](#memo)
*   [Overview](#overview)
*   [Background & Prerequisites](#background-and-prereqs)
*   [Software Requirements](#software-requirements)
*   [Data Description](#data-description)
*   [Methodology](#methodology)
*   [Results & Discussion](#results-and-discussion)
*   [References](#references)


<a name="memo"></a>
# Memo

Write a memo for the leadership explaining in layman's terms why this topic is relevant for public policy. Discuss relevant research works, real-world examples of successful applications, and organizations and governments that apply such approaches for policy making. 


### *Sketch Lukas (not done)*

#### Main Point / Take Away (Executive Summary???)

#### Background 

Companies as well as public and political institutions are able, and often required, to access large amounts of information nowadays. In doing so, they often face the problem of unstructured information, especially when dealing with text data. Text is the most common type of unstructured data. Unfortunately, it is also the most fundamental type of document for policymakers and public institutions: legal documents, bills, and policy papers are just some examples of common text sources that are part of daily business in the political world. Unstructured text data entails a variety of problems when it comes to actually gaining quantitative analytical insights from it. Computers commonly have difficulties understanding textual data. Analytical and technical competences are also scarce: only 18% of companies are able to use unstructured data, while most organizations make their (data-driven) decisions on the basis of only 10 to 20% of their available data sources. The situation for public institutions is worsened by the fact that most modern text analysis frameworks and models are dominated by industry players and are designed and trained according to industry-specific needs, which do not transfer to the political text sources mentioned above. The EU Commission (or directly mention DMA?) …

This motivates the necessity of implementing …

#### Evidence (Sketch our technical solution?)
- Text Classification Pipeline + BERT Model(ling)

- Results??


#### Conclusion and Implementation

- Results??

- Being specific about implementation and actual usage (maybe even specify DMA context a bit more and say who and what stakeholders/institutions are taking what position in context)

*Open points*

- Include practicality and gained efficiency
- Normative perspective: "Technical" objectivity (but one needs to be careful for potential biases)

\
- Directly mention DMA at beginning?
- How structured should the memo be?

\
- Sources need to be added for numbers of slides (e.g. 18% of companies blabla)


*Structure*
![](https://mitcommlab.mit.edu/broad/wp-content/uploads/sites/5/2017/12/policy-memo-struct-700x843.png)

![](https://www.bu.edu/sph/files/2014/06/Policy-Memo-Organization.jpg)

<a name="overview"></a>
# Overview

Over 80% of all data is *unstructured*. Most of the information we consume comes in a format that is not organized in a pre-defined manner (e.g. tables) or with a specific data model in mind (e.g. matrices). 

<u>Text</u> is the most common type of unstructured data and it comes in a variety of forms like blogs, news articles, social media content, as well as official documents. This lack of structure that can be readily understood by machines is what makes it difficult to maximize text as a data source. Algorithms that efficiently and accurately process text would have a variety of applications in organisations, especially public institutions that have access to different types of documents. 

In this tutorial we demonstrate one such application in the context of the European Commission (EC). Whenever new legislation is proposed, the EC opens **public consultations** where various stakeholders (e.g. businesses, academia, law firms, associations, private individuals) submit documents that detail their views on the proposal. The EC receives anywhere between 10,000 and 4 million of these public consultation documents annually. Using machine learning and deep learning methods to process these documents would streamline the Commission's review of stakeholder comments, which would in turn allow it to integrate more information into its policymaking process. 

The main goal of this tutorial is to walk you through the steps of building a <u>document classifier</u> using a deep learning model called **Bidirectional Encoder Representations from Transformers (BERT)**. By the end of this tutorial you will understand how to:  

> 1. Extract, clean, and pre-process information from PDF documents
> 2. Use the pre-processed text as input to machine learning/deep learning models
> 3. Build a text/document classifier with BERT 
> 4. Compare BERT with text classifiers built using other models

We will then apply these learnings to accomplish a research objective: 

 > Classify public consultation documents of the recently enacted **Digital Markets Act** according to whether a stakeholder **agrees or disagrees** with the DMA proposal. 

The Digital Markets Act is a regulation of the European Union which came into force in 2022. It aims to promote fair competition within the digital market by defining rules for “gatekeepers”, or large online platforms. The majority of the public consultation documents submitted for the DMA are from companies and business associations that will likely be affected by the law.

<a name="background-and-prereqs"></a>
# Background & Prerequisites

For this tutorial, you need to be familiar with object-oriented programming, common Python libraries such as *numpy* and *pandas*, and libraries used for model building, such as *scikit-learn* and *pytorch*. Working knowledge of common language processing concepts (e.g. stemming, lemmatization, TF-IDF, embeddings) would also be useful. 

Important concepts that need to be explained briefly:  

## Reading materials

For detailed explanations of the topics covered in this tutorial, below is a list of helpful references.

**Basics of natural language processing models:**
* Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media.


**BERT:**
* Horev, R. (2018). BERT Explained: State of the art language model for NLP. Towards Data Science, 10.


<a name="software-requirements"></a>
# Software Requirements
To install software requirements and dependencies, please create a new environment using the *environment.yml* file which accompanies this notebook. 

In [None]:
!conda env create -f environment.yml

In [1]:
# Data visualization
import matplotlib.pyplot as plt 

# Data manipulation
import pandas as pd
import numpy as np

In [2]:
# Parsing and pre-processing
from glob import glob
import os 
import re

from pdfminer.high_level import extract_text
from langdetect import detect, DetectorFactory

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation 

In [None]:
# Logistic regression and XGBoost
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, precision_score, recall_score
from xgboost import XGBClassifier

In [4]:
# LSTM 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tqdm import tqdm
import gc


In [None]:
# BERT models
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import yaml
import transformers
from transformers import AutoModel, BertTokenizerFast
from tqdm.notebook import tqdm

In [6]:
# specify GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

<a name="data-description"></a>
# Data Description

As mentioned in the [Overview](#overview), the methods discussed in this tutorial will be applied to public consultation documents of the **Digital Markets Act (DMA)**. The open public consultation for the DMA received **188 responses**. In addition to document submissions, each respondent answered an accompanying survey with a series of questions about their views on the law. 

To build our document classifier we need the raw **text** from the public consultation submissions, and a corresponding **label** for each document which indicates whether the author(s) agree or disagree with the DMA proposal. 

**Text:**
The submissions come in the form of **pdf files**. These documents need to be parsed in order to extract raw text. 

**Labels:**
The labels are extracted from the **survey**, specifically from the question: 

*"Do you consider that there is a need for the Commission to be able to intervene in gatekeeper scenarios to prevent/address structural competition problems?"*
 

## Data Download

The pdf documents and the survey used to generate labels can be downloaded from the <a href="https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12416-New-competition-tool/public-consultation_en">DMA consultation page</a>. 

## Data Preprocessing
Additionally, you can include any data preprocessing steps and exploratory data analyses (e.g. visualize data distributions, impute missing values, etc.) in this section to allow the users to better understand the dataset. 

In this section, you might also want to describe the different input and output variables, the train/val/test splits, and any data transformations.

(Describe processing of labels here)

Insert cleaning of DMA excel file

In [None]:
# Insert data pre-processing and exploratory data analysis
# code here. Feel free to break this up into several code
# cells, interleaved with explanatory text. 

<a name="methodology"></a>
# Methodology

This section of the tutorial is a step-by-step walkthrough of the text classification pipeline, using the DMA public consultation documents described above. Below is an outline of this section: 

<ol type="A">
  <li>Data Preparation </li>
  <li>Text Representation</li>
  <li>Model Training (with hyperparameter tuning)</li>
  <ol>
    <li> Training baseline models (logistic, XGBoost)
    <li> Training a DL classifier (LSTM)
    <li> Transfer learning with BERT and other variants
  </ol>
  <li> Model Evaluation
</ol>

## A. Data Preparation 
In this section of the tutorial, we will (1) parse PDF submissions from the Digital Markets Act public consultation and (2) pre-process raw text to keep only the most relevant information. 

### A.1 Parsing PDFs

Create a list of file paths of all pdf submissions: `pdf_list`.

In [None]:
pdf_dir = "../data/reports/"
pdf_list = glob(os.path.join(pdf_dir, "*.pdf"))
pdf_list

Each of the downloaded public consultation documents has a unique alpha-numeric ID. For example: the document ID for **"F549293-Statement_on_the_New_Competition_Tool"** is *F549293*.

We use a regular expression to extract this from the file paths and save it in a list called `pdf_id`. 

In [None]:
pdf_id = [re.search('[F][0-9]{6}', i)[0] for i in pdf_list]
pdf_id = list(set(pdf_id))
len(pdf_id) # no. of unique pdf submission
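
The ID extraction can be tried on a few made-up file paths. The sketch below is illustrative only: the second and third paths are hypothetical, and it skips names without an ID instead of raising an error, which `re.search(...)[0]` does when nothing matches.

```python
import re

# Hypothetical file paths; only the first file name appears in this tutorial.
example_paths = [
    "../data/reports/F549293-Statement_on_the_New_Competition_Tool.pdf",
    "../data/reports/F549293-Annex.pdf",
    "../data/reports/readme.pdf",
]

id_pattern = re.compile(r"F[0-9]{6}")

# Search every path, then keep the unique IDs and drop paths without one.
matches = [id_pattern.search(path) for path in example_paths]
example_ids = sorted({m[0] for m in matches if m is not None})

print(example_ids)
```

Sorting the set gives a stable order, which `list(set(...))` alone does not guarantee.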

PDF (Portable Document Format) documents are hard to work with because this format was not designed as a data input. Instead, the PDF contains a set of instructions that describe how characters or objects are positioned on a page. 

Python has libraries that extract the text of a PDF so that it can be processed further. Several such libraries exist, but they sometimes yield different results. Here, we demonstrate one of them, `pdfminer`. We have also tried other libraries like `pdfplumber`, so feel free to experiment to see which PDF parser works best for your corpus. 

We create a dataframe `df_text` which contains information for each document - the reference number (unique ID), file name, complete text of the document, and the language the document was written in. The full text of the document is parsed using the PDF library and the language of the text is determined by a language detection library.

In [None]:
df_text = pd.DataFrame(columns = ['Reference', 'file_name', 'text', 'lang'])

for pdf_file in pdf_list:
    Reference = re.search('[F][0-9]{6}', pdf_file)[0]
    file_name = re.search('[F][0-9]{6}(.*)[\\>.]', pdf_file)[0]
    text = extract_text(pdf_file)
    lang = detect(text)
    row = pd.DataFrame({'Reference': Reference,'file_name': file_name,
                        'text': text, 'lang': lang}, index=[0])
    df_text = pd.concat([row,df_text.loc[:]]).reset_index(drop=True)

The output of the parsing step looks like this.

In [None]:
df_text.head()

Insert merging step

### A.2 Cleaning and Pre-Processing

After parsing PDFs, we need to further process the raw text to ensure that most of the information we feed into our model/s is relevant to the task at hand. For instance, stopwords like "the", "this", "and" will not give us any indication of whether a stakeholder agrees with a law, so we can remove these words. 

The pre-processing steps that we apply for this tutorial are the following: 

1. Removal of stopwords, punctuation, and numeric characters
2. Stemming 
3. Lemmatization
4. Language detection
5. Coreference resolution

This is what the raw text looks like before processing.

In [None]:
df_eng.text[1]

#### Removal of stopwords, punctuation, and numeric characters

In [None]:
def preprocess_corpus(texts):
    eng_stopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        # lower-case before the stopword check so that capitalised stopwords ("The") are removed too
        token_list = [token.lower() for token in tokens if token.lower() not in eng_stopwords and token not in punctuation and not token.isdigit()]
        processed_text = ' '.join(token_list)
        return processed_text
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

df_eng['text_clean'] = preprocess_corpus(df_eng['text'])
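
To see what this cleaning step does to a single sentence, here is a small library-free sketch. It is not the function above: the stopword set is a tiny hand-written stand-in for the NLTK list, whitespace splitting stands in for `word_tokenize`, and the example sentence is made up.

```python
from string import punctuation

# Tiny stand-in for NLTK's English stopword list (assumption).
toy_stopwords = {"the", "this", "and", "of", "is"}

def clean(text):
    # Whitespace split stands in for word_tokenize (assumption);
    # surrounding punctuation is stripped and tokens are lower-cased.
    tokens = [token.strip(punctuation).lower() for token in text.split()]
    kept = [t for t in tokens if t and t not in toy_stopwords and not t.isdigit()]
    return " ".join(kept)

cleaned = clean("The Commission supports this proposal and 3 of the amendments.")
print(cleaned)
```

Only the content-bearing words survive, which is the form of text passed on to the next steps.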

#### Stemming and lemmatization

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

In [None]:
def stem_lemmatize(text):
    stemmed = [stemmer.stem(token) for token in word_tokenize(text)]
    lemmatized = [lemmatizer.lemmatize(token) for token in stemmed]
    processed_text = ' '.join(lemmatized)
    return processed_text

df_eng['text_clean'] = [stem_lemmatize(text) for text in df_eng['text_clean']]

#### Coreference resolution

In [None]:
# installing neuralcoref from source
!git clone https://github.com/huggingface/neuralcoref.git
# use the %cd magic: a directory change made with !cd does not persist across commands
%cd neuralcoref
!pip install -r requirements.txt
!pip install -e .

In [None]:
import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm') 
neuralcoref.add_to_pipe(nlp)

In [None]:
def coref_res(texts):
    doc = nlp(texts)
    clean = doc._.coref_resolved
    return clean

df_eng['text_clean'] = [coref_res(text) for text in df_eng['text_clean']]

## B. Text Representation
Since this step is built inside BERT and other DL models, we can use what Lorenzo has done for the baseline models (TF-IDF & embeddings) to illustrate how text representation works. 

In the BERT section explain how the architecture processes text to create embeddings. 
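
As a library-free illustration of the TF-IDF idea mentioned above, the sketch below weights each term by its count in a document times the log of the inverse document frequency. The three documents are made up, and the formula is the textbook variant. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row by default, so its numbers would differ.

```python
import math
from collections import Counter

# Made-up, already-cleaned toy documents (not from the DMA corpus).
docs = [
    "gatekeeper rules support fair competition",
    "gatekeeper rules harm innovation",
    "fair competition support",
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: number of documents that contain each term.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # Textbook weighting: raw term count times log(N / df).
    return [counts[term] * math.log(len(docs) / doc_freq[term]) for term in vocabulary]

matrix = [tfidf_vector(tokens) for tokens in tokenized]
for term, weight in zip(vocabulary, matrix[1]):
    print(f"{term:12s} {weight:.3f}")
```

Terms that appear in only one document ("harm", "innovation") receive the largest weights, while terms shared across documents are down-weighted.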

## C. Model Training (with hyperparameter tuning)

Load and split data

In [7]:
df = pd.read_json(r"./data/df_final_document.json")

In [8]:
le = LabelEncoder()
df['label.132'] = le.fit_transform(df['label_132'])
df['label.134'] = le.fit_transform(df['label_134'])

### C.1 Baseline Models

In [43]:
vectorizer = TfidfVectorizer()
dfm = vectorizer.fit_transform(df['text_clean'])

##### Logistic Regression

In [53]:
clf = LogisticRegression(random_state=0).fit(train_text, train_labels)

y_pred = clf.predict(test_text)

accuracy = accuracy_score(test_labels, y_pred) *100.0
precision = precision_score(test_labels, y_pred, average='binary')
recall = recall_score(test_labels, y_pred, average='binary')
f_score = 2 * (precision * recall) / (precision + recall)

print(f' Accuracy: {accuracy:.2f} \n Precision: {precision:.3f} \n Recall: {recall:.3f} \n F1: {f_score:.3f}')

 Accuracy: 62.79 
 Precision: 0.625 
 Recall: 0.962 
 F1: 0.758
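
The four scores are all functions of the confusion-matrix counts. The sketch below uses one set of counts that is consistent with the logistic regression output above. These counts are inferred from the printed scores and are an assumption: the notebook does not report them.

```python
# Confusion-matrix counts inferred from the printed scores (assumption):
# they reproduce the numbers above but were not reported in the notebook.
tp, fp, fn, tn = 25, 15, 1, 2

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy * 100:.2f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
```

If these counts are right, the model predicts the positive class for 40 of 43 test documents, which would explain the combination of very high recall and modest precision.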


##### Gradient Boosting

In [54]:
bst = XGBClassifier(n_estimators=1000, max_depth=1000, learning_rate=0.1, objective='binary:logistic')

bst.fit(train_text, train_labels)

print(bst)

y_pred = bst.predict(test_text)

accuracy = accuracy_score(test_labels, y_pred) * 100.0
precision = precision_score(test_labels, y_pred, average='binary')
recall = recall_score(test_labels, y_pred, average='binary')
f_score = 2 * (precision * recall) / (precision + recall)

print(f' Accuracy: {accuracy:.2f} \n Precision: {precision:.3f} \n Recall: {recall:.3f} \n F1: {f_score:.3f}')

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.1, max_bin=256,
              max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
              max_depth=1000, max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=1000, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0, ...)
 Accuracy: 67.44 
 Precision: 0.688 
 Recall: 0.846 
 F1: 0.759


### C.2 Long Short-Term Memory Network

After splitting the data into training and test sets, build the corpus vocabulary by tokenizing all texts and assigning each word to a unique index.

In [10]:
tokenizer = get_tokenizer("basic_english")

def tokenize(datasets):
    for dataset in datasets:
        for text in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(tokenize([train_text, test_text]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

In [11]:
# example
tokens = tokenizer("This is an example.")
index = vocab(tokens)
index

[7332, 1493, 719, 6370, 1]
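
The mapping from tokens to indices can be pictured with a toy vocabulary in plain Python. This is a simplified sketch, not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, the two texts are made up, and tokenization is a plain whitespace split.

```python
from collections import Counter

def build_toy_vocab(texts, min_freq=1, unknown="<UNK>"):
    # Count whitespace-separated, lower-cased tokens (a simplification
    # of the "basic_english" tokenizer, which also splits punctuation).
    counts = Counter(token for text in texts for token in text.lower().split())
    # Reserve index 0 for unknown words, then order by frequency.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    index = {unknown: 0}
    for token in ordered:
        if counts[token] >= min_freq:
            index[token] = len(index)
    return index

toy_vocab = build_toy_vocab(["the act is fair", "the act is unfair"])

def lookup(tokens):
    # Unseen tokens fall back to the <UNK> index.
    return [toy_vocab.get(token, toy_vocab["<UNK>"]) for token in tokens]

print(lookup(["the", "act", "is", "new"]))
```

The word "new" never occurs in the toy texts, so it is mapped to the `<UNK>` index, which is what `vocab.set_default_index` arranges in the cell above.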

Once the vocabulary is built, we create batches of text sequences and map the tokens to indices. We also pad or truncate each sequence so that all have the same length. This returns a tensor of shape (batch size, sequence length). 

In [12]:
target_classes = ["0", "1"]
max_words = 100

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] # map tokens to indices using vocab
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] # pad sequences

    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y)
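
The padding and truncation logic in `vectorize_batch` is easier to follow on short lists. The helper below is a hypothetical stand-alone version of that one line, applied to made-up token indices.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    # Shorter sequences are right-padded, longer ones are cut off.
    if len(token_ids) < max_len:
        return token_ids + [pad_id] * (max_len - len(token_ids))
    return token_ids[:max_len]

short = pad_or_truncate([7, 3, 9], max_len=5)
long = pad_or_truncate([7, 3, 9, 4, 8, 2, 6], max_len=5)
print(short, long)
```

Note that in the tutorial code the padding value 0 appears to coincide with the index of `<UNK>`, since `<UNK>` is the only special token in the vocabulary. A dedicated padding token would keep the two apart.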

In [13]:
train_loader = DataLoader(train_dataset, batch_size=100, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=100, collate_fn=vectorize_batch)

In [14]:
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break

torch.Size([100, 100]) torch.Size([100])


Create LSTM classifier

In [15]:
# define hyperparameters
embed_len = 100
hidden_dim = 60
n_layers = 2

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.embed_len = embed_len
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        # initial hidden and cell states are drawn at random on every forward pass
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim), torch.randn(n_layers, len(X_batch), hidden_dim)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:,-1])

    def init_hidden(self):
      return (
               torch.zeros(n_layers, 1, self.hidden_dim, device=device),
               torch.zeros(n_layers, 1, self.hidden_dim, device=device)
            )

In [16]:
lstm_classifier = LSTMClassifier()
lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(14499, 100)
  (lstm): LSTM(100, 60, num_layers=2, batch_first=True)
  (linear): Linear(in_features=60, out_features=2, bias=True)
)

In [17]:
for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(14499, 100)
Parameters : 
torch.Size([14499, 100])

Layer : LSTM(100, 60, num_layers=2, batch_first=True)
Parameters : 
torch.Size([240, 100])
torch.Size([240, 60])
torch.Size([240])
torch.Size([240])
torch.Size([240, 60])
torch.Size([240, 60])
torch.Size([240])
torch.Size([240])

Layer : Linear(in_features=60, out_features=2, bias=True)
Parameters : 
torch.Size([2, 60])
torch.Size([2])
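
The shapes printed above can be checked by hand. Each LSTM layer has four gates, which is why the first dimension is 240 = 4 × 60. The sketch below recomputes the parameter counts from the sizes used in this tutorial. The helper name `lstm_layer_params` is ours.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

vocab_size, embed_len, hidden_dim, n_classes = 14499, 100, 60, 2

embedding = vocab_size * embed_len
# Layer 1 reads the embeddings, layer 2 reads the output of layer 1.
lstm = lstm_layer_params(embed_len, hidden_dim) + lstm_layer_params(hidden_dim, hidden_dim)
linear = hidden_dim * n_classes + n_classes
total = embedding + lstm + linear

print(embedding, lstm, linear, total)
```

Almost all parameters sit in the embedding layer, which is worth keeping in mind given the small number of training documents.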



In [18]:
out = lstm_classifier(torch.randint(0, len(vocab), (1024, max_words)))
out.shape

torch.Size([1024, 2])

In [19]:
def evaluate(model, loss_fn, val_loader):
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for X, Y in val_loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_shuffled.append(Y)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.detach().numpy(), Y_preds.detach().numpy())))

def train(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y) 
            losses.append(loss.item())

            ## back propagation
            optimizer.zero_grad() 
            loss.backward() 
            optimizer.step() 

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        evaluate(model, loss_fn, val_loader)

In [20]:
from torch.optim import Adam

epochs = 20
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

train(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 1/1 [00:00<00:00,  1.81it/s]


Train Loss : 0.697
Valid Loss : 0.693
Valid Acc  : 0.512


100%|██████████| 1/1 [00:00<00:00,  3.76it/s]


Train Loss : 0.690
Valid Loss : 0.690
Valid Acc  : 0.535


100%|██████████| 1/1 [00:00<00:00,  3.89it/s]


Train Loss : 0.682
Valid Loss : 0.686
Valid Acc  : 0.581


100%|██████████| 1/1 [00:00<00:00,  3.90it/s]


Train Loss : 0.675
Valid Loss : 0.683
Valid Acc  : 0.605


100%|██████████| 1/1 [00:00<00:00,  3.93it/s]


Train Loss : 0.668
Valid Loss : 0.679
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


Train Loss : 0.660
Valid Loss : 0.676
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.91it/s]


Train Loss : 0.651
Valid Loss : 0.673
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.91it/s]


Train Loss : 0.642
Valid Loss : 0.669
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.97it/s]


Train Loss : 0.632
Valid Loss : 0.666
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


Train Loss : 0.621
Valid Loss : 0.662
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  4.00it/s]


Train Loss : 0.609
Valid Loss : 0.659
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.93it/s]


Train Loss : 0.595
Valid Loss : 0.656
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.90it/s]


Train Loss : 0.581
Valid Loss : 0.653
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.99it/s]


Train Loss : 0.564
Valid Loss : 0.651
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


Train Loss : 0.547
Valid Loss : 0.651
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.97it/s]


Train Loss : 0.528
Valid Loss : 0.652
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.94it/s]


Train Loss : 0.508
Valid Loss : 0.655
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.99it/s]


Train Loss : 0.487
Valid Loss : 0.660
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  3.98it/s]


Train Loss : 0.463
Valid Loss : 0.665
Valid Acc  : 0.628


100%|██████████| 1/1 [00:00<00:00,  4.00it/s]

Train Loss : 0.437
Valid Loss : 0.671
Valid Acc  : 0.628
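
In the training log above, the training loss keeps falling while the validation loss stops improving around epoch 14 and then rises, which suggests overfitting. One common response is early stopping. The sketch below applies a simple patience rule to the validation losses copied from the log. The function and the patience value of 3 are our assumptions, not part of the tutorial's training loop.

```python
# Validation losses per epoch, copied from the training log above.
val_losses = [0.693, 0.690, 0.686, 0.683, 0.679, 0.676, 0.673, 0.669, 0.666, 0.662,
              0.659, 0.656, 0.653, 0.651, 0.651, 0.652, 0.655, 0.660, 0.665, 0.671]

def early_stopping_epoch(losses, patience=3):
    # Return (best_epoch, stop_epoch), both counted from 1.
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(losses)

best_epoch, stop_epoch = early_stopping_epoch(val_losses)
print(best_epoch, stop_epoch)
```

With this rule, training would stop a few epochs after the best validation loss, and the weights saved at `best_epoch` would be the ones to keep.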





### C.3 Transfer Learning with BERT and Other Variants

**Bidirectional Encoder Representations from Transformers**, or **BERT**, follows the Transformer architecture but uses only the encoder part. Each Transformer encoder layer consists of two sub-layers: self-attention and feed-forward. The key feature of BERT is its bidirectionality: it learns from both the left and the right context of a token. Because self-attention looks at all tokens of the input at once rather than one after another, the sequence can be processed in parallel, while position embeddings store where each token sits in the input. 


BERT is pre-trained on BooksCorpus (800M words) and English Wikipedia (2.5B words). In a nutshell, BERT reads the input sequence and generates meaningful text representations, which it feeds into the encoder. The model can then be augmented with additional neural network layers to fit a classification task. BERT has three embedding layers: a token embedding layer, a segment embedding layer, and a position embedding layer. The element-wise sum of these three embeddings gives the final input representation.
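
The element-wise sum of the three embeddings can be sketched with plain Python lists. Everything here is a toy: the tables hold random numbers instead of learned weights, the hidden size is 4 instead of the 768 used by bert-base, and the token ids are hypothetical.

```python
import random

random.seed(0)
hidden_size = 4  # toy size; bert-base uses 768

def random_table(n_rows):
    # Stand-in for a learned embedding matrix (random values, assumption).
    return [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(n_rows)]

token_table = random_table(10)     # one row per vocabulary entry
segment_table = random_table(2)    # sentence A / sentence B
position_table = random_table(6)   # one row per position

token_ids = [1, 5, 7, 2]           # hypothetical ids for a 4-token input
segment_ids = [0, 0, 0, 0]         # single-sentence input

# For each position, look up the three vectors and add them element-wise.
input_repr = [
    [t + s + p for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])]
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
]
print(len(input_repr), len(input_repr[0]))
```

The result has one vector per input token, and the same token id at a different position would receive a different vector because the position embedding changes.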

<p align="center">
    <img src="./img/bert_embeddings.png" width="250" height="320">
</p>

For NLP-related tasks, a common practice is to start from a pre-trained model and then fine-tune it. This is called **transfer learning**, a technique where a deep learning model trained on a very large dataset is used “off-the-shelf” to perform similar tasks on another dataset. There are three different fine-tuning techniques:
> 1. Train the entire architecture, which updates all pre-trained weights based on the new dataset
> 2. Train the pre-trained model partially: freeze the weights of the initial layers and retrain only the higher layers
> 3. Freeze the entire architecture and attach a few NN layers, which are the only ones trained on the new dataset.

In this tutorial, we implement transfer learning using a pre-trained BERT model. After collecting, preparing, and pre-processing the unstructured documents, we split the dataset into training, validation, and test sets. We then apply transfer learning using the bert-base-uncased model, which has 12 Transformer encoder layers and was trained on lower-cased English text. We use the WordPiece embeddings from BERT as inputs to our text classification task.

<p align="center">
    <img src="./img/bert_pipeline.png" width="700" height="320">
</p>

In [56]:
# split data into train, test, validation sets
train_text, temp_text, train_labels, temp_labels = train_test_split(df['text_clean'], df['label.132'], 
                                                                    random_state=2018, 
                                                                    test_size=0.3, 
                                                                    stratify=df['label.132'])

val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, 
                                                                random_state=2018, 
                                                                test_size=0.5, 
                                                                stratify=temp_labels)

##### Import BERT Model and BERT Tokenizer

In [15]:
# bert-base-uncased
bert = AutoModel.from_pretrained('bert-base-uncased', return_dict=False)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', return_dict=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# legal-bert-base-uncased
bert = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased", return_dict=False)
tokenizer = BertTokenizerFast.from_pretrained("nlpaueb/legal-bert-base-uncased", return_dict=False)

Some weights of the model checkpoint at nlpaueb/legal-bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


##### Tokenize the Sentences

In [18]:
# tokenize and encode sequences in the training set
# (padding='max_length' replaces the deprecated pad_to_max_length=True)
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length = 25,
    padding='max_length',
    truncation=True
)

# tokenize and encode sequences in the validation set
tokens_val = tokenizer.batch_encode_plus(
    val_text.tolist(),
    max_length = 25,
    padding='max_length',
    truncation=True
)

# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = 25,
    padding='max_length',
    truncation=True
)
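
What the tokenizer returns for each text can be sketched without the library. The function below mimics, in simplified form, how a sequence is wrapped in `[CLS]` and `[SEP]`, truncated, padded, and paired with an attention mask. The special-token ids are those of bert-base-uncased. The wordpiece ids are hypothetical and `encode` is our own helper, not a `transformers` function.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def encode(wordpiece_ids, max_length):
    # Reserve two slots for [CLS] and [SEP], truncate, then right-pad.
    body = wordpiece_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding

# Hypothetical wordpiece ids standing in for a tokenized sentence.
ids, mask = encode([2023, 2003, 1037, 7099], max_length=8)
print(ids)
print(mask)
```

The attention mask tells the model which positions hold real tokens. With `max_length = 25` as in the cell above, only roughly the first 23 wordpieces of each document are kept, so a larger value may be worth trying for long submissions.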



In [21]:
# convert lists to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

In [22]:
batch_size = 2
num_workers = 2

# dataLoader for train set
train_data = TensorDataset(train_seq, train_mask, train_y)
train_dataloader = DataLoader(train_data, num_workers=num_workers, shuffle=True, batch_size=batch_size)

# dataLoader for validation set (no shuffling needed for evaluation)
val_data = TensorDataset(val_seq, val_mask, val_y)
val_dataloader = DataLoader(val_data, num_workers=num_workers, shuffle=False, batch_size=batch_size)

##### BERT Model Architecture

In [23]:
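class BERT_Arch(nn.Module):
    """Pre-trained BERT encoder followed by a small two-layer classification head."""

    def __init__(self, bert):
        super().__init__()

        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)       # 768 = hidden size of the BERT-base encoder
        self.fc2 = nn.Linear(512, 2)         # 2 output classes
        self.softmax = nn.LogSoftmax(dim=1)  # log-probabilities, paired with nn.NLLLoss

    def forward(self, sent_id, mask):
        # pass the inputs to BERT; the tuple unpacking requires that the
        # model was loaded with return_dict=False
        _, cls_hs = self.bert(sent_id, attention_mask=mask)
        # cls_hs is the pooled [CLS] representation used for classification
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.softmax(x)

        return x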

In [24]:
# helper that freezes all parameters of the model when freeze is True
def set_parameter_requires_grad(model, freeze):
    if freeze:
        for param in model.parameters():
            param.requires_grad = False

In [25]:
# freeze all parameters
set_parameter_requires_grad(model=bert, freeze=True)

# pass the pre-trained BERT to our defined architecture
model = BERT_Arch(bert)

# push the model to GPU
model = model.to(device)
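
With the encoder frozen, only the two linear layers of the classification head are updated during training. As a rough worked example, the number of trainable parameters follows directly from the layer shapes in `BERT_Arch`. The helper `linear_params` is a hypothetical name introduced for this sketch.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

fc1 = linear_params(768, 512)  # first layer of the head
fc2 = linear_params(512, 2)    # output layer, one unit per class
trainable = fc1 + fc2

print(fc1, fc2, trainable)
```

In PyTorch, the same figure can be checked on the actual model with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.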

In [26]:
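# AdamW optimizer from PyTorch (the AdamW class in transformers is deprecated)
from torch.optim import AdamW
# weight_decay=0.0 matches the default of the former transformers.AdamW
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)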

# loss function
criterion  = nn.NLLLoss() 

# no. of training epochs
epochs = 20
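
The model ends in `nn.LogSoftmax` and is trained with `nn.NLLLoss`; together these amount to cross-entropy on the raw scores. The pure-Python sketch below illustrates the computation for a single example. The function names `log_softmax` and `nll` and the toy logits are assumptions made for illustration. It also shows a useful reference point: a two-class model that assigns equal probability to both classes has a loss of ln 2 ≈ 0.693.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll(log_probs, label):
    """Negative log-likelihood of the true class for one example."""
    return -log_probs[label]

# equal scores for both classes: the model is guessing
uninformative = nll(log_softmax([0.0, 0.0]), label=1)
# a clear preference for the correct class gives a much smaller loss
confident = nll(log_softmax([-2.0, 2.0]), label=1)

print(round(uninformative, 3), round(confident, 3))
```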



##### Fine-tuning BERT

In [27]:
# function to train the model for one epoch
def train(model, dataloader, criterion, optimizer):

  model.train()
  total_loss = 0.0
  total_preds = []

  for inputs in dataloader:

    # push the batch to the device (GPU if available)
    inputs = [r.to(device) for r in inputs]
    sent_id, mask, labels = inputs

    # zero the parameter gradients
    model.zero_grad()

    # forward + backward + optimize
    preds = model(sent_id, mask)
    loss = criterion(preds, labels)
    total_loss = total_loss + loss.item()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients to prevent them from exploding
    optimizer.step()
    preds = preds.detach().cpu().numpy()

    # append the model predictions
    total_preds.append(preds)

  # average loss over all batches in the epoch
  epoch_loss = total_loss / len(dataloader)
  # reshape from (no. of batches, size of batch, no. of classes) to (number of samples, no. of classes)
  total_preds = np.concatenate(total_preds, axis=0)
  print(f"Train Loss: {epoch_loss:.2f}")

  return epoch_loss, total_preds
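
The call to `clip_grad_norm_` in the training function rescales all gradients together whenever their combined L2 norm exceeds the given maximum (here 1.0). The following pure-Python sketch mirrors that idea on a flat list of numbers. The function `clip_by_global_norm` and the toy gradient values are assumptions for illustration, not the PyTorch implementation itself.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # the scale factor is capped at 1.0, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# norm 5.0 exceeds the limit, so the gradient is shrunk to norm ~1.0
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
# norm 0.5 is below the limit, so nothing changes
unchanged, _ = clip_by_global_norm([0.3, 0.4])

print(norm_before, clipped, unchanged)
```

The direction of the gradient is preserved; only its length is limited.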


In [28]:
# function for evaluating the model on a held-out set
def evaluate(model, dataloader, criterion):

  model.eval()
  total_loss = 0.0
  total_preds = []

  for inputs in dataloader:

    # push the batch to the device (GPU if available)
    inputs = [t.to(device) for t in inputs]
    sent_id, mask, labels = inputs

    # no gradients are needed for evaluation
    with torch.no_grad():
      preds = model(sent_id, mask)
      loss = criterion(preds, labels)
      total_loss = total_loss + loss.item()
      preds = preds.detach().cpu().numpy()
      total_preds.append(preds)

  # average loss over all batches
  epoch_loss = total_loss / len(dataloader)
  total_preds = np.concatenate(total_preds, axis=0)
  print(f"Validation Loss: {epoch_loss:.2f}")

  return epoch_loss, total_preds

In [29]:
# set initial best validation loss to infinity
best_valid_loss = float('inf')

train_losses=[]
valid_losses=[]

for epoch in range(epochs):
     
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    train_loss, _ = train(model, train_dataloader, criterion, optimizer)
    valid_loss, _ = evaluate(model, val_dataloader, criterion)
    
    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    


 Epoch 1 / 20
Train Loss: 0.67
Validation Loss: 0.70

 Epoch 2 / 20
Train Loss: 0.67
Validation Loss: 0.69

 Epoch 3 / 20
Train Loss: 0.68
Validation Loss: 0.69

 Epoch 4 / 20
Train Loss: 0.67
Validation Loss: 0.69

 Epoch 5 / 20
Train Loss: 0.67
Validation Loss: 0.67

 Epoch 6 / 20
Train Loss: 0.69
Validation Loss: 0.66

 Epoch 7 / 20
Train Loss: 0.67
Validation Loss: 0.69

 Epoch 8 / 20
Train Loss: 0.66
Validation Loss: 0.64

 Epoch 9 / 20
Train Loss: 0.66
Validation Loss: 0.67

 Epoch 10 / 20
Train Loss: 0.66
Validation Loss: 0.66

 Epoch 11 / 20
Train Loss: 0.67
Validation Loss: 0.65

 Epoch 12 / 20
Train Loss: 0.67
Validation Loss: 0.65

 Epoch 13 / 20
Train Loss: 0.68
Validation Loss: 0.70

 Epoch 14 / 20
Train Loss: 0.67
Validation Loss: 0.72

 Epoch 15 / 20
Train Loss: 0.67
Validation Loss: 0.69

 Epoch 16 / 20
Train Loss: 0.66
Validation Loss: 0.67

 Epoch 17 / 20
Train Loss: 0.68
Validation Loss: 0.67

 Epoch 18 / 20
Train Loss: 0.66
Validation Loss: 0.71

 Epoch 19 / 20
Tra
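
The logged losses stay close to ln 2 ≈ 0.693, the loss of a two-class model that guesses with equal probability, which suggests that the frozen-encoder setup learns very little on this data. A short sketch, using only the validation losses visible in the log above (the output is cut off after epoch 18, so later epochs are not covered), finds the epoch at which the checkpoint logic would have saved the weights so far:

```python
import math

# validation losses for epochs 1-18, copied from the log above
valid_losses = [0.70, 0.69, 0.69, 0.69, 0.67, 0.66, 0.69, 0.64, 0.67,
                0.66, 0.65, 0.65, 0.70, 0.72, 0.69, 0.67, 0.67, 0.71]

best_loss = min(valid_losses)
best_epoch = valid_losses.index(best_loss) + 1  # epochs are 1-indexed in the log
chance_loss = math.log(2)                       # loss of a 50/50 guess on two classes

print(f"best epoch so far: {best_epoch}, validation loss {best_loss:.2f}")
print(f"improvement over chance: {chance_loss - best_loss:.3f}")
```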

In [59]:
#load weights of best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))

<All keys matched successfully>

In [60]:
# get predictions for test data
model.eval()  # make sure dropout is disabled at inference time
with torch.no_grad():
  preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()

In [61]:
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         9
           1       0.59      1.00      0.74        13

    accuracy                           0.59        22
   macro avg       0.30      0.50      0.37        22
weighted avg       0.35      0.59      0.44        22



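
The report above implies that the model assigned every test document to class 1: recall is 1.00 for class 1 and 0.00 for class 0. The sketch below recomputes the per-class figures from the confusion counts that follow from this reading (supports of 9 and 13). The helper `precision_recall_f1` is a hypothetical name, and reporting undefined ratios as 0.0 mirrors scikit-learn's default behavior.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# counts implied by the report: all 22 documents predicted as class 1
class0 = precision_recall_f1(tp=0, fp=0, fn=9)    # class 0 is never predicted
class1 = precision_recall_f1(tp=13, fp=9, fn=0)   # every class-1 document is found
accuracy = 13 / 22

print([round(v, 2) for v in class0])
print([round(v, 2) for v in class1])
print(round(accuracy, 2))
```

An accuracy of 0.59 therefore equals the share of the majority class in the test set, so the classifier does no better than always predicting class 1.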


## D. Model Evaluation
Including comparison of models

<a name="results-and-discussion"></a>
# Results & Discussion

In this section, describe and contextualize the results shown in the tutorial. Briefly describe the performance metrics and cross validation techniques used. 

Finally, include a discussion on the limitations and important takeaways from the exercise.

## Limitations
*   The tutorial is focused on education and learning. Explain all the simplifications you have made compared to applying a similar approach in the real world (for instance, if you have reduced your training data and performance).
*   ML algorithms and datasets can reinforce or reflect unfair biases. Reflect on the potential biases in the dataset and/or analysis presented in your tutorial, including its potential societal impact, and discuss how readers might go about addressing this challenge. 

## Next Steps
*   What do you recommend would be the next steps for your readers after finishing your tutorial?
*   Discuss other potential policy- and government-related applications for the method or tool discussed in the tutorial.
*   List anything else that you would want the reader to take away as they move on from the tutorial.

<a name="references"></a>
# References

Include all references used. 

For example, in this template:

*   EarthCube Notebook Template: https://github.com/earthcube/NotebookTemplates
*   Earth Engine Community Tutorials Style Guide: https://developers.google.com/earth-engine/tutorials/community/styleguide#colab
*   Google Cloud Community Tutorial Style Guide: https://cloud.google.com/community/tutorials/styleguide
*   Rule A, Birmingham A, Zuniga C, Altintas I, Huang S-C, Knight R, et al. (2019) Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks. PLoS Comput Biol 15(7): e1007007. https://doi.org/10.1371/journal.pcbi.1007007




## Acknowledgement

These guidelines are heavily based on the Climate Change AI template for the tutorials track at the [NeurIPS 2021 Workshop on Tackling Climate Change with Machine Learning](https://www.climatechange.ai/events/neurips2021). 