# Classification of Unstructured Documents 
## *Transfer Learning with BERT*
### GRAD-E1394 Deep Learning

---

Authors:
*   Ma. Adelle Gia Arbo, m.arbo@students.hertie-school.org
*   Janine De Vera, j.devera@students.hertie-school.org
*   Lorenzo Gini, l.gini@students.hertie-school.org
*   Lukas Warode, l.warode@students.hertie-school.org | lukas.warode@gmx.de

---


# Table of Contents


*   [Memo](#memo)
*   [Overview](#overview)
*   [Background & Prerequisites](#background-and-prereqs)
*   [Software Requirements](#software-requirements)
*   [Data Description](#data-description)
*   [Methodology](#methodology)
*   [Results & Discussion](#results-and-discussion)
*   [References](#references)


<a name="memo"></a>
# Memo


## Classifying Unstructured DMA Respondents' Reports  <img src="https://www.iccitalia.org/wp-content/uploads/2022/07/digital-markets-act.png" width="80" align="right"/> <img src="https://ec.europa.eu/info/law/better-regulation/assets/images/ecl/logo/logo--en.svg" width="200" align="right"/> 


### *Executive Summary*

This tutorial shows how state-of-the-art **Deep Learning (DL)** frameworks can be used to **classify unstructured papers** and **reports** submitted by stakeholders who participated in the **Digital Markets Act (DMA)** public consultation survey[<sup>1</sup>](#fn1). The tutorial responds to a **strong demand** for an appropriate technical solution, while existing similar contributions show that **DL-based solutions** are **feasible**.

### *Background & Relevance*

Companies as well as public and political institutions often need to process large volumes of information that may be **structured** or **unstructured**. Data in textual form is the **most common type of unstructured data**. Text is also the most fundamental type of document for policymakers and public institutions: legal documents, bills, policy papers and reports are just some examples of common text sources that are part of daily operations in the political world. Unstructured text data poses a variety of problems for the extraction of quantitative analytical insights, since computers commonly have difficulties understanding textual data. Analytical and technical competencies are also scarce: only **18% of companies** are **able to use unstructured data**, while most organizations make their (data-driven) **decisions on the basis of only 10 to 20%** of their available data sources[<sup>2</sup>](#fn2). The situation for public institutions is worsened by the fact that most modern text data analysis frameworks and models are **industry-specific**. They are designed and trained according to the needs of certain industries, and often cannot be generalized to the nuances of political text sources. 

### *Solution*

There are several potential public policy applications of DL-based text classifiers. For instance, the European Commission regularly conducts **public consultations**, such as the survey around the **DMA**, where stakeholders **submitted unstructured papers** and **reports**. **Whether stakeholders agree** or not can be predicted with **deep learning models**[<sup>3</sup>](#fn3) that are trained on the document corpus. 

Training a state-of-the-art **DL model** allows us to **predict stakeholder agreement** on the DMA proposal. This algorithm can support human-based analyses by producing **objective and efficient** estimates. This text classification framework can be directly implemented to obtain immediate results for existing and new documents, while also **being transferable** to different (EC) contexts.

<br>
<br>

---

<sup id="fn1">1</sup> https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12416-New-competition-tool/public-consultation_en

<sup id="fn2">2</sup> https://mitsloan.mit.edu/ideas-made-to-matter/tapping-power-unstructured-data

<sup id="fn3">3</sup> Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., & Gao, J. (2021). Deep learning--based text classification: a comprehensive review. ACM Computing Surveys (CSUR), 54(3), 1-40.


<a name="overview"></a>
# Overview

Over 80% of all data is *unstructured*. Most of the information we consume comes in a format that is not organized in a pre-defined manner (e.g. tables) or with a specific data model in mind (e.g. matrices). 

<u>Text</u> is the most common type of unstructured data and it comes in a variety of forms like blogs, news articles, social media content, as well as official documents. This lack of structure that can be readily understood by machines is what makes it difficult to maximize text as a data source. Algorithms that efficiently and accurately process text would have a variety of applications in organisations, especially public institutions that have access to different types of documents. 

In this tutorial we demonstrate one such application in the context of the European Commission (EC). Whenever new legislation is proposed, the EC opens **public consultations** where various stakeholders (e.g. businesses, academia, law firms, associations, private individuals) submit documents that detail their views on the proposal. The EC receives between 10,000 and 4 million of these public consultation documents annually. Using machine learning and deep learning methods to process these documents would streamline the Commission's review of stakeholder comments, which would in turn allow it to integrate more information into its policymaking process. 

The main goal of this tutorial is to walk you through the steps of building a <u>document classifier</u> using a deep learning model called **Bidirectional Encoder Representations from Transformers (BERT)**. By the end of this tutorial you will understand how to:  

> 1. Extract, clean, and pre-process information from PDF documents
> 2. Use the pre-processed text as input to machine learning/deep learning models
> 3. Build a text/document classifier with BERT 
> 4. Compare BERT with text classifiers built using other models

We will then apply these learnings to accomplish a research objective: 

 > Classify public consultation documents of the recently enacted **Digital Markets Act** according to whether a stakeholder **agrees or disagrees** with the DMA proposal. 

The Digital Markets Act is a European Union regulation that came into force in 2022. It aims to promote fair competition within the digital market by defining rules for “gatekeepers”, i.e. large online platforms. The majority of the public consultation documents submitted for the DMA are from companies and business associations that will likely be affected by the law.

<br>

*Note: This tutorial was created primarily to demonstrate the document classification pipeline. For ease of replication, we purposely kept the dataset small. Results should therefore be taken with a grain of salt.*

<a name="background-and-prereqs"></a>
# Background & Prerequisites

For this tutorial, you need to be familiar with object-oriented programming, common Python libraries such as *numpy* and *pandas*, and libraries used for model building, such as *scikit-learn* and *PyTorch*. Working knowledge of common language processing concepts (e.g. stemming, lemmatization, TF-IDF, embeddings) and the basics of transformer models is also required.

## Reading materials

For detailed explanations of the topics covered in this tutorial, you may refer to the following reading materials:

* Vajjala, S., Majumder, B., Gupta, A., & Surana, H. (2020). Practical natural language processing: a comprehensive guide to building real-world NLP systems. O'Reilly Media.
* Géron, A. (2018). Hands-on Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media Inc., USA.


<a name="software-requirements"></a>
# Software Requirements
This tutorial requires Python 3.6 or higher. To install the software requirements and dependencies, please create a new environment using the *environment.yml* file that accompanies this notebook. 

In [None]:
!conda env create -f environment.yml

In [1]:
# Data visualization
import matplotlib.pyplot as plt 

# Data manipulation
import pandas as pd
import numpy as np
import csv
from zipfile import ZipFile

In [2]:
# Parsing and pre-processing
from glob import glob
import os 
import re

from pdfminer.high_level import extract_text
from langdetect import detect, DetectorFactory

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from string import punctuation
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import spacy

In [3]:
# Vector representations and embeddings
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim

In [4]:
# Logistic and XGboost
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, precision_score, recall_score
from xgboost import XGBClassifier
import pickle

In [5]:
# LSTM 
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from tqdm import tqdm
import gc



In [6]:
# BERT models
from torch.utils.data import TensorDataset, RandomSampler, SequentialSampler
import transformers
from transformers import AutoModel, BertTokenizerFast

In [7]:
# specify GPU
device = "cuda" if torch.cuda.is_available() else "cpu"

<a name="data-description"></a>
# Data Description

As mentioned in the [Overview](#overview), the methods discussed in this tutorial will be applied to public consultation documents of the **Digital Markets Act (DMA)**. Respondents answered a survey where they were asked a series of questions about the proposed law. The public consultation received **188 survey responses**. Some respondents also provided **accompanying position papers and reports** where they discuss their views in detail. 

To build our document classifier we need the raw **text** from the public consultation submissions, and a corresponding **label** for each document which indicates whether the author(s) agree or disagree with the DMA proposal. 

**Text:**
The submissions come in the form of **PDF files**. These documents need to be parsed in order to extract raw text. 

**Labels:**
The labels are extracted from the **survey**, specifically from the question: 

> *"Do you consider that there is a need for the Commission to be able to intervene in gatekeeper scenarios to prevent/address structural competition problems?"*
 

## Data Download

The PDF documents and the survey used to generate labels can be downloaded from the <a href="https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12416-New-competition-tool/public-consultation_en">DMA consultation page</a>. 

## Data Cleaning

### A. Text

First, create a list of file paths of all PDF submissions: `pdf_list`. 

Each of the downloaded public consultation documents has a unique alphanumeric ID. For example, the document ID for **"F549293-Statement_on_the_New_Competition_Tool"** is *F549293*. A regular expression is used to extract this ID from each file path. The unique IDs are then saved in a list called `pdf_id`. 

In [8]:
pdf_dir = "data/reports/"
pdf_list = glob(os.path.join(pdf_dir, "*.pdf"))
len(pdf_list)

97

In [9]:
pdf_id = [re.search('[F][0-9]{6}', i)[0] for i in pdf_list]
pdf_id = list(set(pdf_id))
len(pdf_id) # no. of unique pdf submission

89
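The gap between 97 files and 89 unique IDs suggests that some IDs are attached to more than one file. The following is a minimal sketch of the same extraction on made-up paths; the file names are hypothetical and only the "F plus six digits" ID format comes from the tutorial. It also guards against paths with no ID, since `re.search` returns `None` when nothing matches.

```python
import re

# Hypothetical file paths, for illustration only
paths = [
    "data/reports/F549293-Statement_on_the_New_Competition_Tool.pdf",
    "data/reports/F549293-Annex.pdf",          # second file for the same ID
    "data/reports/F550827-Position_paper.pdf",
    "data/reports/notes.pdf",                  # no ID in the name
]

id_pattern = re.compile(r"F[0-9]{6}")

ids = []
for path in paths:
    match = id_pattern.search(path)
    if match is None:
        # Indexing a failed match (None) would raise a TypeError, so skip it
        continue
    ids.append(match.group(0))

unique_ids = sorted(set(ids))
print(len(ids), len(unique_ids))
```

Sorting the set gives a stable order, which makes later comparisons and debugging easier than working with an unordered `set`.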

### B. Labels

The Excel file containing survey answers and details about the respondents is imported as a pandas dataframe. The survey dataset also has a reference column which corresponds to the reference numbers in `pdf_id`.

In [11]:
dma = pd.read_excel("data/DMA Contributions_Clean.xlsx", sheet_name='Results')
dma.head()

Unnamed: 0,Reference,Feedback date,Language,User type,First name,Surname,Scope,Organisation name,Transparency register number,Organisation size,...,Q197,Q198,Q199,Q200,Q201,Q202,Q203,Q204,Q205,Q206
0,F550828,8.09.2020 23:57,English,,,,,,,,...,See our response above.,Somewhat effective,Very effective,Very effective,Most effective,Very effective,See our response above.,,No.,Yes
1,F550827,8.09.2020 23:54,English,NGO (Non-governmental organisation),Juliane,von Reppert-Bismarck,,Lie Detectors,094738529674-10,Small (< 50 employees),...,,Not effective,Most effective,Somewhat effective,Most effective,Most effective,An additional regulatory framework imposing ob...,Lie_Detectors_-_Digital_Services_Act_-_New_Com...,Please see attached submission,Yes
2,F550826,8.09.2020 23:50,German,Business Association,Antje,Woltermann,,Zentralverband Deutsches Kfz-Gewerbe e.V.,71649103246-10,Medium (< 250 employees),...,,Somewhat effective,Most effective,Somewhat effective,Sufficiently effective,Very effective,-,Positionspapier_Gleichberechtigter_Zugang_zum_...,-,Yes
3,F550825,8.09.2020 23:45,English,Academic/Research Institution,DIANA,MONTENEGRO,,Master student candidate Master in Competition...,,Large (250 or more),...,The new procedural must be agile.,Not effective,Very effective,Very effective,Very effective,Very effective,all are adequated,PREPARATIVE_DOCUMENT_FOR_THE_EUROEPAN_COMMISIO...,The document I prepared was more than 1MB.,Yes
4,F550824,8.09.2020 23:33,English,Business Association,Kamila,Sotomska,,Zwi?zek Przedsi?biorców i Pracodawców,868073924175-77,Small (< 50 employees),...,,Not applicable /No relevant experience or know...,Not applicable /No relevant experience or know...,Not applicable /No relevant experience or know...,Not applicable /No relevant experience or know...,Not applicable /No relevant experience or know...,,,,Yes


As we saw above, there are only 89 unique PDF submissions. We now check how many survey respondents submitted supplementary papers and/or reports. An indicator variable is added to the survey dataframe to specify whether a response has a matching PDF submission. Note that we can only use observations that have both a document submission and a survey response. 

In [12]:
# count unique Reference ids with pdf submission/s
len(dma.query('Reference in @pdf_id'))

85

In [13]:
# tag identifier with pdf submission = 1
dma['submit'] = [1 if v in pdf_id else 0 for v in dma['Reference'].values]
dma['submit'].value_counts()

0    95
1    85
Name: submit, dtype: int64

The question that will be used as a label is Q132. 
*"Do you consider that there is a need for the Commission to be able to intervene in gatekeeper scenarios to prevent/address structural competition problems?"*

We check how many Yes and No responses there are. Respondents who answered not applicable can be considered as **not in agreement** with the proposal. Their answers are lumped together with the No responses.

In [14]:
# Q132 as labels
dma.Q132.value_counts()

Yes                                                    115
Not applicable /no relevant experience or knowledge     46
No                                                      19
Name: Q132, dtype: int64

In [15]:
dma['label_132'] = np.where(dma['Q132'] == 'Not applicable /no relevant experience or knowledge', 'No', dma['Q132'])
dma.label_132.value_counts()

Yes    115
No      65
Name: label_132, dtype: int64
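The recoding can also be expressed with `Series.replace`, and it is worth looking at the class shares before any model is trained. The snippet below is a small sketch on toy answers (the `answers` series is hypothetical); the counts 115 and 65 are the ones reported in the output above.

```python
import pandas as pd

NA_ANSWER = "Not applicable /no relevant experience or knowledge"

# Toy answers, for illustration only
answers = pd.Series(["Yes", "Yes", NA_ANSWER, "No", "Yes"])

# Fold the "not applicable" answer into "No"; all other values are left untouched
labels = answers.replace({NA_ANSWER: "No"})
shares = labels.value_counts(normalize=True)

# Share of "Yes" among the survey-level labels reported above (115 Yes, 65 No)
majority_share = 115 / (115 + 65)
print(shares.to_dict(), round(majority_share, 2))
```

A classifier that always predicts "Yes" would already be right on roughly 64% of the survey-level labels, so accuracy figures should be read against that baseline. The share in the final corpus may differ, because only respondents with English PDF submissions are kept.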

<a name="methodology"></a>
# Methodology

This section of the tutorial is a step-by-step walkthrough of the text classification pipeline, using DMA public consultation documents as described above. Below is an outline of this section: 

<ol type="A">
  <li>Data Preparation</li>
  <li>Text Representation</li>
  <li>Model Training (with hyperparameter tuning)
    <ol>
      <li>Training baseline models (logistic, XGBoost)</li>
      <li>Training a DL classifier (LSTM)</li>
      <li>Transfer learning with BERT and other variants</li>
    </ol>
  </li>
  <li>Model Evaluation</li>
</ol>

## A. Data Preparation 
In this section of the tutorial, we will (1) parse PDF submissions from the Digital Markets Act public consultation and (2) pre-process raw text to keep only the most relevant information. 

### A.1 Parsing PDFs

PDF (Portable Document Format) documents are hard to work with because the format was not designed as a data input format. Instead, a PDF contains a set of instructions that describe how characters or objects are positioned on a page. 

Python has several libraries that extract text from PDFs, but they sometimes yield different results. Here, we demonstrate one such library, `pdfminer`. We have also tried other libraries like `pdfplumber`, so feel free to experiment with which PDF parser works best for your corpus. 

We create a dataframe `df_text` which contains information for each document: the reference number (unique ID), file name, and complete text of the document. The full text of the document is parsed using the `extract_text` function of pdfminer.

In [16]:
df_text = pd.DataFrame(columns = ['Reference', 'file_name', 'text'])

for pdf_file in pdf_list:
    # document ID, e.g. "F549293"
    Reference = re.search('[F][0-9]{6}', pdf_file)[0]
    # greedy match from the ID through the last '.' or '>' in the path
    file_name = re.search('[F][0-9]{6}(.*)[\\>.]', pdf_file)[0]
    # full text of the PDF as a single string
    text = extract_text(pdf_file)
    row = pd.DataFrame({'Reference': Reference,'file_name': file_name, 'text': text}, index=[0])
    # prepend the new row to the rows collected so far
    df_text = pd.concat([row,df_text.loc[:]]).reset_index(drop=True)

df_text.head()

Unnamed: 0,Reference,file_name,text
0,F550241,F550241-190326-Dobson_report-FINAL_VERSION.,Ref. Ares(2020)4669723 - 08/09/2020\n\nLEVELLI...
1,F550737,F550737-Mediaset_-_NCT_-_Position_paper_-_fina...,New Competition Tool \n\nPosition Paper \n\nMe...
2,F550604,F550604-LUISS_Study-Executive-summary.,Ref. Ares(2020)4713545 - 09/09/2020\n\nTHE EUR...
3,F541861,F541861-ACT_-_Perspectives_on_the_NCT_-_FINAL.,ACT PERSPECTIVES ON THE NEW COMPETITION TOOL \...
4,F549332,F549332-MCA_DSA_and_NCT_public_consultation_po...,The Malta Communications Authority’s ratio...
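Growing a dataframe with `pd.concat` inside a loop copies all previous rows on every iteration. A common alternative is to collect one record per file in a plain list and build the dataframe once at the end. The sketch below illustrates that pattern with the standard library only; `describe_pdf`, the `parse` argument and the paths are hypothetical names introduced for illustration, and `parse` stands in for a real extractor such as pdfminer's `extract_text`.

```python
import re
from pathlib import Path

ID_PATTERN = re.compile(r"F[0-9]{6}")

def describe_pdf(path, parse=lambda p: ""):
    """Return one record for a PDF path; `parse` is a stand-in for a text extractor."""
    path = Path(path)
    match = ID_PATTERN.search(path.name)
    return {
        "Reference": match.group(0) if match else None,
        "file_name": path.stem,   # file name without directory or extension
        "text": parse(path),
    }

# Hypothetical paths, for illustration only
paths = [
    "data/reports/F000001-Position_paper.pdf",
    "data/reports/F000002-Annex_v1.2.pdf",
]

records = [describe_pdf(p) for p in paths]
print(records[1]["Reference"], records[1]["file_name"])
```

Passing `records` to `pd.DataFrame` would then give the three columns in one step. Unlike the loop above, this variant keeps the files in their original order and stores the file name without a trailing dot.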


`df_text` is merged with the `dma` dataframe created in the [Data Description](#data-description) section using the unique Reference code of each document and respondent. Out of the 188 survey responses, only 85 have PDF submissions, and some of these are duplicated. 

In [17]:
df_merged = dma.merge(df_text, how='left', on='Reference')
df_merged.head()

Unnamed: 0,Reference,Feedback date,Language,User type,First name,Surname,Scope,Organisation name,Transparency register number,Organisation size,...,Q201,Q202,Q203,Q204,Q205,Q206,submit,label_132,file_name,text
0,F550828,8.09.2020 23:57,English,,,,,,,,...,Most effective,Very effective,See our response above.,,No.,Yes,0,Yes,,
1,F550827,8.09.2020 23:54,English,NGO (Non-governmental organisation),Juliane,von Reppert-Bismarck,,Lie Detectors,094738529674-10,Small (< 50 employees),...,Most effective,Most effective,An additional regulatory framework imposing ob...,Lie_Detectors_-_Digital_Services_Act_-_New_Com...,Please see attached submission,Yes,1,Yes,F550827-Lie_Detectors_-_Digital_Services_Act_-...,"8 Sept, 2020 \n\nDigital Services Act and New ..."
2,F550826,8.09.2020 23:50,German,Business Association,Antje,Woltermann,,Zentralverband Deutsches Kfz-Gewerbe e.V.,71649103246-10,Medium (< 250 employees),...,Sufficiently effective,Very effective,-,Positionspapier_Gleichberechtigter_Zugang_zum_...,-,Yes,1,Yes,F550826-Positionspapier_Gleichberechtigter_Zug...,Ref. Ares(2020)4723306 - 10/09/2020\n\nGemeins...
3,F550825,8.09.2020 23:45,English,Academic/Research Institution,DIANA,MONTENEGRO,,Master student candidate Master in Competition...,,Large (250 or more),...,Very effective,Very effective,all are adequated,PREPARATIVE_DOCUMENT_FOR_THE_EUROEPAN_COMMISIO...,The document I prepared was more than 1MB.,Yes,1,Yes,F550825-PREPARATIVE_DOCUMENT_FOR_THE_EUROEPAN_...,Ref. Ares(2020)4723305 - 10/09/2020\n\nTHE LEG...
4,F550824,8.09.2020 23:33,English,Business Association,Kamila,Sotomska,,Zwi?zek Przedsi?biorców i Pracodawców,868073924175-77,Small (< 50 employees),...,Not applicable /No relevant experience or know...,Not applicable /No relevant experience or know...,,,,Yes,0,Yes,,


For the next step, we remove duplicate observations and use only responses with both a PDF submission and a label.

In [24]:
df_merged = df_merged[df_merged['submit']==1]
df_merged = df_merged.drop_duplicates(subset='Reference', keep='first')
len(df_merged)

85

### A.2 Cleaning and Pre-Processing

After parsing PDFs, we need to further process the raw text to ensure that most of the information we feed into our model/s is relevant to the task at hand. For instance, stopwords like "the", "this", "and" will not give us any indication of whether a stakeholder agrees with a law, so we can remove these words. 

The pre-processing steps that we apply for this tutorial are the following: 

1. Removal of stopwords, punctuations, and numeric characters
2. Stemming and lemmatization
3. Coreference resolution
4. Language detection

This is what the raw text looks like before processing. Whitespace characters (*"\n"*) are interspersed with the text, and there are mentions of dates and numbers. 

In [71]:
df_merged.text[1]

'8 Sept, 2020 \n\nDigital Services Act and New Competition Tool  \n\nLie Detectors response to public consultations \n\nLie  Detectors,  an  award-winning  journalist-led  media  literacy  campaign  in  Europe,  welcomes  the \nability to respond to the Commission consultation on the Digital Services Act and the New Competition \nTool.  \n\nIn relation to the Digital Services Act consultation, Lie Detectors supports the introduction of Ex Ante \nRegulation for large online platforms with significant network effects acting as gatekeepers, and of a \nNew Competition Tool.  \n\nThe  instinct  of  regulators  and  policymakers  to  hold  large  platforms  like  Facebook  and  Google  to \naccount for their role in the proliferation of online disinformation and in the undermining of quality \njournalism is right. With the necessary political will, solutions exist that will help rein in the epidemic \nof  disinformation  that  is  sweeping  away  trust  in  established  facts,  in  scientifi
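When inspecting parsed text by eye, it can help to collapse the line breaks and repeated spaces first. This is not a step in the pipeline of this tutorial (the tokenizer used later already splits on whitespace); it is only a small sketch using the opening of the document shown above.

```python
import re

raw = ("8 Sept, 2020 \n\nDigital Services Act and New Competition Tool  \n\n"
       "Lie  Detectors,  an  award-winning")

# Replace every run of whitespace (spaces, tabs, newlines) with a single space
flat = re.sub(r"\s+", " ", raw).strip()
print(flat)
```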

#### Removal of stopwords, punctuations, numeric characters
As the first pre-processing step, we create a function that lowercases all tokens and retains only those that are not in the dictionary of English stop words, not punctuation marks, and not purely numeric.

In [36]:
def preprocess_corpus(texts):
    eng_stopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        # keep tokens that are not stop words, punctuation, or digits, then lowercase them
        # (the stop-word test runs on the original casing, so e.g. "The" is kept)
        token_list = [token.lower() for token in tokens
                      if token not in eng_stopwords
                      and token not in punctuation
                      and not token.isdigit()]
        processed_text = ' '.join(token_list)
        return processed_text
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

In [37]:
df_merged['text_clean'] = preprocess_corpus(df_merged['text'])
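Because the stop-word list is lowercase, a filter that compares tokens in their original casing lets sentence-initial words such as "The" or "With" through. If that is not intended, one option is to lowercase first and filter afterwards. The sketch below shows the idea with the standard library only; the tiny `STOPWORDS` set and the regex tokenizer `simple_tokenize` are stand-ins (assumptions) for nltk's stop-word list and `word_tokenize`.

```python
import re
from string import punctuation

# Tiny hypothetical stop-word list; the tutorial uses nltk's English list
STOPWORDS = {"the", "of", "and", "in", "with", "to"}

def simple_tokenize(text):
    # Stand-in tokenizer: words (optionally hyphenated) or single punctuation marks
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

def clean(text):
    kept = []
    for token in simple_tokenize(text):
        lowered = token.lower()   # lowercase BEFORE testing against the stop-word list
        if lowered in STOPWORDS or lowered in punctuation or lowered.isdigit():
            continue
        kept.append(lowered)
    return " ".join(kept)

print(clean("The instinct of regulators in 2020: With the necessary political will"))
```

Switching to this variant would change the vocabulary, so any counts reported for the original pipeline would no longer match exactly.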

#### Stemming and lemmatization

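We then create a function which stems and then lemmatizes tokens. This reduces inflected and derived word forms to a common root, so that variants of the same word are counted together. Note that stems are not always dictionary words (e.g. "servic" for "services").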

In [41]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

In [42]:
def stem_lemmatize(text):
    stemmed = [stemmer.stem(token) for token in word_tokenize(text)]
    lemmatized = [lemmatizer.lemmatize(token) for token in stemmed]
    processed_text = ' '.join(lemmatized)
    return processed_text

df_merged['text_clean'] = [stem_lemmatize(text) for text in df_merged['text_clean']]

#### Coreference resolution

As an optional pre-processing step, we apply coreference resolution to identify and link multiple mentions of an entity (e.g. Google = the company, Commission = EU). 

In [None]:
# installing neuralcoref from source
!git clone https://github.com/huggingface/neuralcoref.git
# use %cd so the directory change persists (each "!" command runs in its own subshell)
%cd neuralcoref
!pip install -r requirements.txt
!pip install -e .

In [47]:
import neuralcoref

nlp = spacy.load('en_core_web_lg') 
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x285839df0>

In [48]:
def coref_res(texts):
    doc = nlp(texts)
    clean = doc._.coref_resolved
    return clean

df_merged['text_clean'] = [coref_res(text) for text in df_merged['text_clean']]

#### Language detection

After cleaning the text, we use language detection to subset documents that are written in English. The final corpus contains 75 public consultation documents.

In [50]:
for index, row in df_merged.iterrows():
    df_merged.at[index, 'lang'] = detect(df_merged.at[index, 'text_clean'])

df_merged.lang.value_counts()

en    75
de     7
fr     2
ro     1
Name: lang, dtype: int64

In [51]:
df_merged = df_merged[df_merged['lang']=="en"]

#### Final dataframe

In [52]:
df_final = df_merged[['Reference', 'Feedback date', 'User type', 'Scope', 
                    'Organisation name', 'Transparency register number', 'Organisation size',
                    'label_132', 'submit', 'file_name', 'lang', 'text', 'text_clean']]

In [54]:
df_final.reset_index(drop=True).to_json(r"./data/df_final_document.json")

The processed text should look like this.

In [55]:
df_final.text_clean[1]

'sept digit servic act new competit tool lie detector respon public consult lie detector award-win journalist-l medium literaci campaign europ welcom abil respond commiss consult digit servic act new competit tool in relat digit servic act consult lie detector support introduct ex ant regul larg onlin platform signif network effect act gatekeep new competit tool the instinct regul policymak hold larg platform like facebook googl account role prolif onlin disinform undermin qualiti journal right with necessari polit solut exist help rein epidem disinform sweep away trust establish fact scientif method democrat institut design protect u the basi solut lie take disinform sourc take busi model larg platform facebook googl stoke outrag revenu engag “ moneti lie ” european commiss call follow money appli exist new antitrust principl fundament avenu secur european democraci other approach proven incap dent outrag economi corro effect democraci fact-check initi long darl conflict-shi regul pol

In [11]:
df = pd.read_json(r"./data/df_final_document.json")

## B. Text Representation
In this section of the tutorial, we demonstrate how to **convert raw text into numerical form** that can be readily fed into different machine learning and deep learning algorithms. There are two main ways to do this:

1. Basic vectorization
2. Distributed representations. 

### B.1 Basic Vectorization
Basic vectorization techniques map each word of the corpus vocabulary (of size V) to a numeric value and represent each document as a V-dimensional vector. These methods are simple and straightforward and can be used to construct ML-based text and document classifiers with interpretable features. The most common methods for vectorization are one-hot encoding, bag of words, and the term frequency–inverse document frequency (TF-IDF) matrix. 

#### TF-IDF
Most basic vectorization approaches treat all words in a text as equally important. TF-IDF introduces a weighting scheme which quantifies the importance of a given word relative to other words in the document and in the corpus. Below we show how to convert our corpus into a TF-IDF representation.

Our TF-IDF matrix has 75 documents with 10,527 features or words.

In [12]:
vectorizer = TfidfVectorizer()
dfm = vectorizer.fit_transform(df['text_clean'])
dfm.shape

(75, 10527)
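To make the weighting concrete, the sketch below computes TF-IDF by hand on a three-document toy corpus. It assumes the default settings of scikit-learn's `TfidfVectorizer` (raw term counts, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). The corpus and the whitespace tokenization are simplifications introduced for illustration.

```python
import math
from collections import Counter

# Toy corpus, for illustration only
docs = ["platform regulation good", "platform regulation bad", "platform gatekeeper"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Number of documents in which each term appears
doc_freq = Counter(t for doc in tokenized for t in set(doc))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    counts = Counter(doc)
    weights = [counts[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]   # each row is scaled to unit length

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
print([round(w, 3) for w in matrix[2]])
```

"platform" occurs in every document and therefore gets the lowest possible IDF of 1, while "gatekeeper" occurs in only one document and is weighted more heavily in the third row.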

### B.2 Distributed Representations
Distributed representations or **embeddings** are dense, low-dimensional vectors which capture context and distributional similarities between words. These methods address some fundamental limitations of basic vectorization techniques. First, distributed representations are more computationally efficient since their dimension is fixed and much smaller than the size of the corpus vocabulary. Second, context and similarities between words are accounted for, unlike basic vectorizations where words are treated as atomic units. Finally, embeddings pre-trained on a large corpus help with the *out of vocabulary* problem, where a model is unable to represent a word that was not used in the training data: they cover many words that never appear in your own documents, which helps models generalize to unseen text.

Word or document embeddings may be pre-trained on a large corpus (e.g. Word2Vec or GloVe) or trained on your own set of documents (using CBOW or SkipGram).
Below is an illustration of how to construct TF-IDF weighted word embeddings for our set of DMA consultation documents using pre-trained GloVe embeddings.

First, load the GloVe embeddings, which can be downloaded from the Stanford NLP <a href = "https://nlp.stanford.edu/projects/glove/">website</a>. We will use the 100-dimensional GloVe embeddings and save them into a matrix.

In [13]:
dims = 100

z = ZipFile("data/glove.6B.zip") # glove zip file saved in data folder
f = z.open(f'glove.6B.{dims}d.txt')

embed_matrix = pd.read_table(
    f, sep = " ", index_col = 0, 
    header = None, quoting = csv.QUOTE_NONE
)
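Each line of a GloVe text file holds a token followed by its space-separated vector components. If memory is a concern, the file can also be read line by line and restricted to the words that are actually needed. The following is a minimal sketch on an in-memory sample; `read_vectors`, its `keep` argument and the three-dimensional toy vectors are hypothetical and only illustrate the file layout.

```python
import io

# Toy three-dimensional vectors in the GloVe text layout
sample = io.StringIO("the -0.038 -0.245 0.728\nof -0.153 -0.243 0.898\n")

def read_vectors(handle, keep=None):
    """Parse GloVe-style lines into {word: [floats]}, optionally restricted to `keep`."""
    vectors = {}
    for line in handle:
        word, *values = line.rstrip("\n").split(" ")
        if keep is not None and word not in keep:
            continue
        vectors[word] = [float(v) for v in values]
    return vectors

vectors = read_vectors(sample, keep={"the"})
print(vectors)
```

A handle returned by `ZipFile.open` yields bytes, so it would need to be wrapped first, for example with `io.TextIOWrapper(f, encoding="utf-8")`, before being passed to a function like this.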

The GloVe embeddings have a 400,000-word vocabulary and each word is represented by a vector of size 100. 

In [40]:
embed_matrix.shape

(400000, 100)

In [41]:
embed_matrix

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,91,92,93,94,95,96,97,98,99,100
the,-0.038194,-0.244870,0.728120,-0.399610,0.083172,0.043953,-0.391410,0.334400,-0.57545,0.087459,...,0.016215,-0.017099,-0.389840,0.87424,-0.725690,-0.510580,-0.520280,-0.145900,0.82780,0.270620
",",-0.107670,0.110530,0.598120,-0.543610,0.673960,0.106630,0.038867,0.354810,0.06351,-0.094189,...,0.349510,-0.722600,0.375490,0.44410,-0.990590,0.612140,-0.351110,-0.831550,0.45293,0.082577
.,-0.339790,0.209410,0.463480,-0.647920,-0.383770,0.038034,0.171270,0.159780,0.46619,-0.019169,...,-0.063351,-0.674120,-0.068895,0.53604,-0.877730,0.318020,-0.392420,-0.233940,0.47298,-0.028803
of,-0.152900,-0.242790,0.898370,0.169960,0.535160,0.487840,-0.588260,-0.179820,-1.35810,0.425410,...,0.187120,-0.018488,-0.267570,0.72700,-0.593630,-0.348390,-0.560940,-0.591000,1.00390,0.206640
to,-0.189700,0.050024,0.190840,-0.049184,-0.089737,0.210060,-0.549520,0.098377,-0.20135,0.342410,...,-0.131340,0.058617,-0.318690,-0.61419,-0.623930,-0.415480,-0.038175,-0.398040,0.47647,-0.159830
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
chanty,-0.155770,-0.049188,-0.064377,0.223600,-0.201460,-0.038963,0.129710,-0.294510,0.00359,-0.098377,...,0.093324,0.094486,-0.023469,-0.48099,0.623320,0.024318,-0.275870,0.075044,-0.56380,0.145010
kronik,-0.094426,0.147250,-0.157390,0.071966,-0.298450,0.039432,0.021870,0.008041,-0.18682,-0.311010,...,-0.305450,-0.011082,0.118550,-0.11312,0.339510,-0.224490,0.257430,0.631430,-0.20090,-0.105420
rolonda,0.360880,-0.169190,-0.327040,0.098332,-0.429700,-0.188740,0.455560,0.285290,0.30340,-0.366830,...,-0.044082,0.140030,0.300070,-0.12731,-0.143040,-0.069396,0.281600,0.271390,-0.29188,0.161090
zsombor,-0.104610,-0.504700,-0.493310,0.135160,-0.363710,-0.447500,0.184290,-0.056510,0.40474,-0.725830,...,0.151530,-0.108420,0.340640,-0.40916,-0.081263,0.095315,0.150180,0.425270,-0.51250,-0.170540


We find the words in our corpus that are also present in the GloVe embedding matrix. Results below show that there are 6,487 words in common with GloVe. We take the indices of these common words.

In [42]:
common_features = set(embed_matrix.index) & set(vectorizer.get_feature_names_out())
len(common_features)

6487

In [43]:
vocab_ids = [vectorizer.vocabulary_[x] for x in common_features]
vocab_ids[1:10]

[7897, 6866, 10396, 1494, 7119, 9995, 9804, 10335, 6570]

Using common features only, we multiply our 75 x 6487 TF-IDF matrix by the 6487 x 100 embedding matrix to get the document embedding matrix for all consultation documents. 

In [44]:
doc_matrix = dfm[:,vocab_ids].dot(embed_matrix.loc[list(common_features)])  # pandas expects a list, not a set, as indexer
doc_matrix.shape


(75, 100)
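To make the matrix product above concrete, here is a minimal pure-Python sketch on toy data. The values, the `matmul` helper, and the variable names are illustrative assumptions and are not taken from the consultation corpus; the point is only that each document vector is a TF-IDF-weighted sum of the embeddings of its words.

```python
# Toy TF-IDF matrix: 2 documents x 3 shared vocabulary terms (illustrative values)
tfidf = [
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]

# Toy embedding matrix: 3 terms x 2 embedding dimensions (illustrative values)
embeddings = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# One row per document, one column per embedding dimension
doc_vectors = matmul(tfidf, embeddings)
print(doc_vectors)
```

In the tutorial the same operation maps a 75 x 6487 matrix and a 6487 x 100 matrix to a 75 x 100 document embedding matrix.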

## C. Model Training
In this section of the tutorial, we compare three types of text classification models:

1. Traditional machine learning classifiers - logistic regression and gradient boosting
2. Deep learning model using sequence networks - LSTM
3. Deep learning transformer-based model - BERT

Traditional machine learning classifiers like logistic regression and gradient boosting work with "hand-crafted" text features, which makes them comparatively easy to interpret. For instance, tree-based methods can split on features such as the number of characters in a document or the number of times a word is mentioned. Deep learning methods based on neural network architectures, on the other hand, learn features directly from the raw text. The learned features are tailored to the task at hand, which often improves model performance when enough training data is available.

The discussion below mainly highlights the use of **Bidirectional Encoder Representations from Transformers**, or **BERT**, for text classification tasks. Results of the other models are shown, but their methodology is not discussed in detail.

In [15]:
le = LabelEncoder()
df['label.132'] = le.fit_transform(df['label_132'])

### C.1 Baseline Models
For our baseline models, we will use traditional ML classifiers - logistic regression and gradient boosting - with the document embedding matrix above as features.

##### Logistic Regression

First, we split our data (the document embedding matrix and corresponding labels) into train and test sets. For the purpose of this demonstration, we only split our data into training and test sets, without a separate validation set, since the number of documents is limited.

In [158]:
# split data into train and test sets
l_train_text, l_test_text, l_train_labels, l_test_labels = train_test_split(doc_matrix, df['label.132'], 
                                                                            random_state=2018, 
                                                                            test_size=0.3, 
                                                                            stratify=df['label.132'])

We fit a logistic regression model using the embedding vectors as features. The accuracy, precision, recall, and F1 scores are shown below. The trained model is saved to a pickle (`.pkl`) file so that we can later access it and compare its results with the other models.

In [159]:
clf = LogisticRegression(random_state=0).fit(l_train_text, l_train_labels)
with open('./models/logistic.pkl', 'wb') as f: pickle.dump(clf, f)

y_pred = clf.predict(l_test_text)

accuracy = accuracy_score(l_test_labels, y_pred) *100.0
precision = precision_score(l_test_labels, y_pred, average='binary')
recall = recall_score(l_test_labels, y_pred, average='binary')
f_score = 2 * (precision * recall) / (precision + recall)

print(f' Accuracy: {accuracy:.2f} \n Precision: {precision:.3f} \n Recall: {recall:.3f} \n F1: {f_score:.3f}')

 Accuracy: 56.52 
 Precision: 0.588 
 Recall: 0.769 
 F1: 0.667


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
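The four scores reported for the logistic regression all derive from the counts of a binary confusion matrix. The sketch below shows the arithmetic. The counts are an assumption: they were chosen to be consistent with the scores printed above for a 23-document test set and are not read from the notebook itself. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Hypothetical counts, consistent with the reported scores (assumption)
acc, prec, rec, f1 = binary_metrics(tp=10, fp=7, fn=3, tn=3)
print(f"Accuracy: {acc * 100:.2f}  Precision: {prec:.3f}  Recall: {rec:.3f}  F1: {f1:.3f}")
```

With so few test documents, moving a single prediction from wrong to right changes accuracy by more than four percentage points, so differences between models should be read with caution.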


##### Gradient Boosting

We do the same for an XGBoost model. On this test set, logistic regression performs better: 56.52% accuracy and an F1 score of 0.667, compared with 43.48% and 0.552 for gradient boosting.

In [160]:
bst = XGBClassifier(n_estimators=1000, max_depth=1000, learning_rate=0.1, objective='binary:logistic')

bst.fit(l_train_text, l_train_labels)

print(bst)

y_pred = bst.predict(l_test_text)

accuracy = accuracy_score(l_test_labels, y_pred) * 100.0
precision = precision_score(l_test_labels, y_pred, average='binary')
recall = recall_score(l_test_labels, y_pred, average='binary')
f_score = 2 * (precision * recall) / (precision + recall)

print(f' Accuracy: {accuracy:.2f} \n Precision: {precision:.3f} \n Recall: {recall:.3f} \n F1: {f_score:.3f}')

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.1, max_bin=256,
              max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
              max_depth=1000, max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=1000, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0, ...)
 Accuracy: 43.48 
 Precision: 0.500 
 Recall: 0.615 
 F1: 0.552


### C.2 Long Short-Term Memory Network
LSTM is a variant of the Recurrent Neural Network (RNN), an architecture designed for handling sequential data such as text. Before transformer models became popular, RNNs were the go-to models for language processing. LSTM models handle long-term dependencies in text through an architecture that uses three types of gates: input, output, and forget gates. The gates operate together to decide which information to retain in the LSTM cell.
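The interplay of the three gates can be illustrated with a single scalar LSTM cell. This is a minimal sketch under simplifying assumptions: inputs and states are plain floats rather than vectors, the weights are arbitrary illustrative values, and `lstm_step` is a hypothetical helper rather than part of PyTorch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the new candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value for the cell state
    c = f * c_prev + i * g            # updated cell state (long-term memory)
    h = o * math.tanh(c)              # updated hidden state (output)
    return h, c


# Arbitrary illustrative weights, identical for every gate
weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The cell state `c` is only ever scaled by the forget gate and added to, which is what allows information to persist over many time steps.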

As with our baseline models, we start by splitting the data into train and test sets. This time, we use raw text instead of the document embedding matrix. The LSTM architecture includes an embedding layer which learns feature representations from the data. Note that we can also use pre-trained embeddings in this layer. 

In [208]:
# split data into train and test sets
train_text, test_text, train_labels, test_labels = train_test_split(df['text_clean'], df['label.132'], 
                                                                    random_state=2018, 
                                                                    test_size=0.3, 
                                                                    stratify=df['label.132'])

In [209]:
# create train and test dataset 
train_dataset = list(zip(train_labels, train_text))
test_dataset = list(zip(test_labels, test_text))

# convert pd series to list
train_text = train_text.tolist()
test_text = test_text.tolist()

After splitting the data into training and test sets, we build the corpus vocabulary by tokenizing all texts and assigning each word a unique index.

In [210]:
tokenizer = get_tokenizer("basic_english")

def tokenize(datasets):
    for dataset in datasets:
        for text in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(tokenize([train_text, test_text]), min_freq=1, specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])

In [211]:
# example
tokens = tokenizer("This is an example.")
index = vocab(tokens)
index

[0, 1286, 659, 5401, 1]
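The idea behind the vocabulary lookup can be sketched without torchtext. In the example below, whitespace splitting stands in for the `basic_english` tokenizer, and `build_vocab` and `encode` are hypothetical helpers written for illustration; only the behaviour (a reserved index for unknown words, one integer per known word) mirrors the cells above.

```python
from collections import Counter


def build_vocab(texts, min_freq=1, unk_token="<UNK>"):
    """Map each token to an integer index; index 0 is reserved for unknown tokens."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # Order by descending frequency, then alphabetically, for a reproducible mapping
    tokens = sorted((t for t, c in counts.items() if c >= min_freq),
                    key=lambda t: (-counts[t], t))
    vocab = {unk_token: 0}
    vocab.update({tok: idx + 1 for idx, tok in enumerate(tokens)})
    return vocab


def encode(text, vocab, unk_token="<UNK>"):
    """Replace each token by its index, falling back to the unknown-token index."""
    return [vocab.get(tok, vocab[unk_token]) for tok in text.lower().split()]


vocab_toy = build_vocab(["the act is new", "the act applies"])
print(vocab_toy)
print(encode("the act is unclear", vocab_toy))  # "unclear" is not in the vocabulary
```

Words that never occurred in the corpus are all mapped to the same index, so the model cannot distinguish between them.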

Once the vocabulary is built, we create batches of text sequences and map the tokens to indices. We also pad or truncate each sequence to `max_words` tokens so that all sequences have the same length. Each batch is returned as a tensor of shape (batch size, sequence length).

In [212]:
target_classes = ["0", "1"]
max_words = 100

def vectorize_batch(batch):
    Y, X = list(zip(*batch))
    X = [vocab(tokenizer(text)) for text in X] # map tokens to index using vocab
    X = [tokens+([0]* (max_words-len(tokens))) if len(tokens)<max_words else tokens[:max_words] for tokens in X] # pad sequences

    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y)

In [213]:
train_loader = DataLoader(train_dataset, batch_size=100, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=100, collate_fn=vectorize_batch)

In [214]:
for X, Y in test_loader:
    print(X.shape, Y.shape)
    break

torch.Size([23, 100]) torch.Size([23])
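The padding and truncation logic inside `vectorize_batch` can be isolated in a few lines. This is a minimal sketch with made-up token indices; `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return a list of exactly max_len indices."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]                              # cut long sequences
    return token_ids + [pad_id] * (max_len - len(token_ids))    # fill short ones


batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]   # made-up token indices
padded = [pad_or_truncate(seq, max_len=4) for seq in batch]
print(padded)
```

Note that in the tutorial's vocabulary index 0 is also the index of `<UNK>`, so padding positions and unknown words share one embedding. A dedicated padding token would be one way to separate the two.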


Next, we build a class for our LSTM classifier. For this tutorial we use an architecture with a 200-dimensional embedding layer followed by 3 stacked LSTM layers with 80 hidden units each. The sizes of the embedding and hidden layers, as well as the number of LSTM layers, are all hyperparameters. We set these values arbitrarily, but they can be optimized with hyperparameter tuning techniques.

In [215]:
# define hyperparameters
embed_len = 200
hidden_dim = 80
n_layers = 3

class LSTMClassifier(nn.Module):
    def __init__(self):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.embed_len = embed_len
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        # let nn.LSTM use its default zero initial states; random initial states would make predictions non-deterministic
        output, (hidden, carry) = self.lstm(embeddings)
        return self.linear(output[:,-1])

    def init_hidden(self):
      return (
               torch.zeros(n_layers, 1, self.hidden_dim, device=device),
               torch.zeros(n_layers, 1, self.hidden_dim, device=device)
            )

In [216]:
lstm_classifier = LSTMClassifier()
lstm_classifier

LSTMClassifier(
  (embedding_layer): Embedding(12320, 200)
  (lstm): LSTM(200, 80, num_layers=3, batch_first=True)
  (linear): Linear(in_features=80, out_features=2, bias=True)
)

In [217]:
for layer in lstm_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(12320, 200)
Parameters : 
torch.Size([12320, 200])

Layer : LSTM(200, 80, num_layers=3, batch_first=True)
Parameters : 
torch.Size([320, 200])
torch.Size([320, 80])
torch.Size([320])
torch.Size([320])
torch.Size([320, 80])
torch.Size([320, 80])
torch.Size([320])
torch.Size([320])
torch.Size([320, 80])
torch.Size([320, 80])
torch.Size([320])
torch.Size([320])

Layer : Linear(in_features=80, out_features=2, bias=True)
Parameters : 
torch.Size([2, 80])
torch.Size([2])
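The parameter shapes printed above can be checked by hand. The sketch below recomputes the number of parameters per layer from the hyperparameters; `lstm_layer_params` is a hypothetical helper, and the formula assumes the PyTorch convention of four gates with two bias vectors each, which matches the printed shapes.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates x (input weights + recurrent weights + 2 biases)."""
    return 4 * hidden_size * (input_size + hidden_size + 2)


vocab_size, embed_len, hidden_dim, n_layers, n_classes = 12320, 200, 80, 3, 2

embedding = vocab_size * embed_len
# First LSTM layer reads the embeddings; later layers read the previous layer's hidden states
lstm = lstm_layer_params(embed_len, hidden_dim) \
    + (n_layers - 1) * lstm_layer_params(hidden_dim, hidden_dim)
linear = hidden_dim * n_classes + n_classes
total = embedding + lstm + linear

print(f"embedding: {embedding:,}")
print(f"lstm:      {lstm:,}")
print(f"linear:    {linear:,}")
print(f"total:     {total:,}")
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind given that only 75 documents are available for training.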



In [218]:
out = lstm_classifier(torch.randint(0, len(vocab), (1024, max_words)))
out.shape

torch.Size([1024, 2])

We then create functions that train the model and save the weights of the model with the lowest validation loss.

In [219]:
def evaluate(model, loss_fn, val_loader):
    model.eval()  # switch off training-only behaviour such as dropout
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for X_test, Y_test in val_loader:
            preds = model(X_test)
            loss = loss_fn(preds, Y_test)
            losses.append(loss.item())

            Y_shuffled.append(Y_test)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)
        val_loss = torch.tensor(losses).mean().item()

        print("Valid Loss : {:.3f}".format(val_loss))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.detach().numpy(), Y_preds.detach().numpy())))

        return val_loss  # mean loss over all validation batches

def train(model, loss_fn, optimizer, train_loader, val_loader, epochs=10):
    best_val_loss = float('inf')
    for i in range(1, epochs+1):
        model.train()  # evaluate() leaves the model in eval mode
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X)

            loss = loss_fn(Y_preds, Y) 
            losses.append(loss.item())

            optimizer.zero_grad() 
            loss.backward() 
            optimizer.step() 

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        val_loss = evaluate(model, loss_fn, val_loader)

        if val_loss < best_val_loss: # save the weights with the lowest validation loss
            best_val_loss = val_loss
            torch.save(model.state_dict(), './models/lstm_saved_weights.pt')
            

In [220]:
from torch.optim import Adam

epochs = 20
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifier()
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

train(lstm_classifier, loss_fn, optimizer, train_loader, test_loader, epochs)

100%|██████████| 1/1 [00:00<00:00,  2.32it/s]


Train Loss : 0.679
Valid Loss : 0.684
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.53it/s]


Train Loss : 0.675
Valid Loss : 0.682
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.73it/s]


Train Loss : 0.670
Valid Loss : 0.681
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.79it/s]


Train Loss : 0.665
Valid Loss : 0.679
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.84it/s]


Train Loss : 0.659
Valid Loss : 0.676
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.84it/s]


Train Loss : 0.651
Valid Loss : 0.672
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.83it/s]


Train Loss : 0.639
Valid Loss : 0.667
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.72it/s]


Train Loss : 0.625
Valid Loss : 0.661
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.92it/s]


Train Loss : 0.605
Valid Loss : 0.653
Valid Acc  : 0.652


100%|██████████| 1/1 [00:00<00:00,  3.97it/s]


Train Loss : 0.579
Valid Loss : 0.645
Valid Acc  : 0.652


100%|██████████| 1/1 [00:00<00:00,  3.98it/s]


Train Loss : 0.547
Valid Loss : 0.635
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  4.01it/s]


Train Loss : 0.505
Valid Loss : 0.626
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.97it/s]


Train Loss : 0.455
Valid Loss : 0.618
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.98it/s]


Train Loss : 0.397
Valid Loss : 0.615
Valid Acc  : 0.652


100%|██████████| 1/1 [00:00<00:00,  3.92it/s]


Train Loss : 0.333
Valid Loss : 0.620
Valid Acc  : 0.609


100%|██████████| 1/1 [00:00<00:00,  3.95it/s]


Train Loss : 0.269
Valid Loss : 0.643
Valid Acc  : 0.565


100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


Train Loss : 0.209
Valid Loss : 0.682
Valid Acc  : 0.652


100%|██████████| 1/1 [00:00<00:00,  3.97it/s]


Train Loss : 0.159
Valid Loss : 0.732
Valid Acc  : 0.652


100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


Train Loss : 0.121
Valid Loss : 0.786
Valid Acc  : 0.696


100%|██████████| 1/1 [00:00<00:00,  3.95it/s]

Train Loss : 0.095
Valid Loss : 0.853
Valid Acc  : 0.652
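In the log above, the training loss falls in every epoch while the validation loss reaches a minimum and then rises again, a typical sign of overfitting on a small dataset. The sketch below uses the validation losses copied from the log to locate the epoch with the lowest validation loss; `best_epoch` is a hypothetical helper.

```python
# Validation losses for epochs 1 to 20, copied from the training log above
valid_losses = [0.684, 0.682, 0.681, 0.679, 0.676, 0.672, 0.667, 0.661, 0.653, 0.645,
                0.635, 0.626, 0.618, 0.615, 0.620, 0.643, 0.682, 0.732, 0.786, 0.853]


def best_epoch(losses):
    """Return the 1-based epoch number and value of the lowest validation loss."""
    idx = min(range(len(losses)), key=lambda i: losses[i])
    return idx + 1, losses[idx]


epoch, loss = best_epoch(valid_losses)
print(f"Lowest validation loss {loss:.3f} at epoch {epoch}")
```

Validation accuracy peaks later than validation loss in this run, which is possible because accuracy only looks at the predicted class while the loss also penalises over-confident wrong predictions.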





### C.3 Transfer Learning with BERT and Other Variants

In the last four years, there have been great improvements in using neural network architectures for creating text representations. One NLP architecture that has gained considerable traction is the **Transformer**. Transformers are networks that handle long-distance dependencies in sequence data using self-attention.

With the `transformers` library, we can import a wide range of transformer-based pre-trained models. For the last part of the Modeling section, we will use a pre-trained **Bidirectional Encoder Representations from Transformers** or **BERT** model and fine-tune it for a text classification task. This technique is called **Transfer Learning**, wherein a deep learning model trained on a very large dataset is used “off-the-shelf” to perform similar tasks on another dataset. There are 3 different fine-tuning techniques:

* Train the entire architecture, which updates all pre-trained weights based on the new dataset.
* Train the pre-trained model partially: freeze the weights of the initial layers and retrain only the higher layers.
* Freeze the entire architecture and attach a few neural network layers, training only these new layers.

For this tutorial, we will implement the last approach by freezing all the layers of the BERT model and attaching a few layers to train this new model. Note that the weights of only the attached layers will be updated during model training.

An example of the ideal and full pipeline using transfer learning with BERT is shown below.

<p align="center">
    <img src="./img/bert_pipeline.png" width="700" height="320">
</p>

The dataset is first split into train and test sets, using the same split as for the LSTM.

In [16]:
# split data into train and test sets
train_text, test_text, train_labels, test_labels = train_test_split(df['text_clean'], df['label.132'], 
                                                                    random_state=2018, 
                                                                    test_size=0.3, 
                                                                    stratify=df['label.132'])

##### Import BERT Model and BERT Tokenizer

BERT is pre-trained on BookCorpus (800M words) and English Wikipedia (2.5B words). There are many variants of the BERT model, including `bert-base-uncased`, which has 12 layers and 110M parameters and is trained on lower-cased English text. Meanwhile, a larger variant called `bert-large-uncased` has 24 layers and about 340M parameters.


In [19]:
# bert-base-uncased
bert = AutoModel.from_pretrained('bert-base-uncased', return_dict=False)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', return_dict=False)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


`legal-bert-base-uncased` is another model variant with 12 layers and 110M parameters. It is pre-trained using 12 GB of diverse English legal text from several fields such as EU legislation, UK legislation, European Court of Justice, USA court cases, and US contracts from EDGAR, the database of US Securities and Exchange Commission.

For the purpose of this tutorial, we will use LEGAL-BERT, given that our documents are DMA public consultation submissions written in the context of legal texts.

In [20]:
# legal-bert-base-uncased
bert = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased", return_dict=False)
tokenizer = BertTokenizerFast.from_pretrained("nlpaueb/legal-bert-base-uncased", return_dict=False)

Some weights of the model checkpoint at nlpaueb/legal-bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


##### Tokenize the Sentences

In a nutshell, BERT reads the input sequence, converts it into input embeddings, and passes these through its encoder layers to generate contextual text representations. These representations can then be fed into additional neural network layers to fit a classification task. BERT's input representation is built from three embedding layers: a token embedding layer, a segment embedding layer, and a position embedding layer. The element-wise sum of these three embeddings gives the final input representation.

<p align="center">
    <img src="./img/bert_embeddings.png" width="250" height="320">
</p>
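The element-wise sum of the three embeddings can be sketched with small lookup tables. Everything numeric here is an illustrative assumption: the tables hold 2-dimensional toy vectors instead of BERT's 768-dimensional ones, and `embed` is a hypothetical helper. The sketch also leaves out the layer normalization and dropout that BERT applies after the sum.

```python
# Toy lookup tables with 2-dimensional vectors (illustrative values)
token_table = {101: [1.0, 2.0], 532: [3.0, 4.0], 102: [5.0, 6.0]}
segment_table = {0: [0.5, 0.5], 1: [-0.5, -0.5]}
position_table = [[0.0, 0.25], [0.25, 0.5], [0.5, 0.75]]


def embed(input_ids, token_type_ids):
    """Sum token, segment and position embeddings for every position in the sequence."""
    rows = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        vectors = (token_table[tok], segment_table[seg], position_table[pos])
        rows.append([sum(values) for values in zip(*vectors)])
    return rows


input_repr = embed(input_ids=[101, 532, 102], token_type_ids=[0, 0, 0])
print(input_repr)
```

Because the position embedding differs at every index, the same token receives a different input vector depending on where it appears in the sequence.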

Let's check how the BERT tokenizer works using a sample text.

In [21]:
# example
text = ["we will fine-tune a bert model", "we will implement transfer learning"]
sent_id = tokenizer.batch_encode_plus(text, padding=True)
print(sent_id)

{'input_ids': [[101, 532, 261, 2178, 116, 21558, 145, 219, 188, 190, 1955, 102], [101, 532, 261, 2397, 439, 5793, 102, 0, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]}


The output is a dictionary of three items.

* input_ids - token indices, the numerical representations of the tokens that make up the sequences used as input by the model
* token_type_ids - a binary mask that distinguishes the first sequence from the second when the model receives a pair of sequences (e.g. for classifying sentence pairs); it is all zeros here because each input is a single sequence
* attention_mask - a binary mask that marks real tokens with 1 and padded positions with 0, so that the model does not attend to the padding

More details [here](https://huggingface.co/docs/transformers/glossary).
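The relationship between `input_ids` and `attention_mask` can be reproduced by hand. The sketch below pads the two token sequences from the tokenizer output above to a common length and builds the matching mask; `pad_batch` is a hypothetical helper, not part of the `transformers` API.

```python
def pad_batch(sequences, pad_id=0):
    """Pad all sequences to the longest one and mark real tokens with 1, padding with 0."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token ids, copied from the tokenizer output above
sequences = [
    [101, 532, 261, 2178, 116, 21558, 145, 219, 188, 190, 1955, 102],
    [101, 532, 261, 2397, 439, 5793, 102],
]

encoded = pad_batch(sequences)
print(encoded["input_ids"][1])
print(encoded["attention_mask"][1])
```

The second sentence has 7 tokens, so its last 5 positions are padding and carry a mask value of 0.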

In [22]:
# tokenize and encode sequences in the training set
tokens_train = tokenizer.batch_encode_plus(
    train_text.tolist(),
    max_length = 25,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True
)

# tokenize and encode sequences in the test set
tokens_test = tokenizer.batch_encode_plus(
    test_text.tolist(),
    max_length = 25,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True
)



Next, we will convert the integer sequences to tensors.

In [23]:
# convert lists to tensors
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())

In [24]:
batch_size = 2
num_workers = 2

# dataLoader for train set
train_data = TensorDataset(train_seq, train_mask, train_y)
train_dataloader = DataLoader(train_data, num_workers=num_workers, shuffle=True, batch_size=batch_size)

# dataLoader for test set
test_data = TensorDataset(test_seq, test_mask, test_y)
test_dataloader = DataLoader(test_data, num_workers=num_workers, shuffle=True, batch_size=batch_size)

##### BERT Model Architecture

We will define a class for our BERT model architecture. As mentioned earlier, we will freeze all layers of the pre-trained BERT model, `legal-bert-base-uncased`, and attach a few neural network layers, ending with a log-softmax layer that converts the output to log-probabilities for our binary classification task.

In [25]:
class BERT_Arch(nn.Module):

    def __init__(self, bert):
      
      super(BERT_Arch, self).__init__()

      self.bert = bert 
      self.dropout = nn.Dropout(0.1)
      self.relu =  nn.ReLU()
      self.fc1 = nn.Linear(768,512)
      self.fc2 = nn.Linear(512,2)
      self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):

      #pass the inputs to the model  
      _, cls_hs = self.bert(sent_id, attention_mask=mask)
      x = self.fc1(cls_hs)
      x = self.relu(x)
      x = self.dropout(x)
      x = self.fc2(x)
      x = self.softmax(x)

      return x

We define the method to freeze all layers and keep the weights of the pre-trained BERT model.

In [26]:
# freeze all the parameters of the model if freeze is True
def set_parameter_requires_grad(model, freeze):
    if freeze:
        for param in model.parameters():
            param.requires_grad = False

In [27]:
set_parameter_requires_grad(model=bert, freeze=True)
bert_classifier = BERT_Arch(bert)
bert_classifier = bert_classifier.to(device)
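Freezing BERT leaves only the two attached linear layers trainable. The following back-of-the-envelope sketch counts their parameters and compares them with the size of the frozen model. The figure of 110M frozen parameters is the approximate number quoted earlier in this tutorial, and `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Layers attached on top of BERT: fc1 (768 -> 512) and fc2 (512 -> 2)
head = linear_params(768, 512) + linear_params(512, 2)

frozen_approx = 110_000_000   # approximate size of the frozen BERT model (assumption)
share = head / (head + frozen_approx)

print(f"trainable parameters: {head:,}")
print(f"share of all parameters: {share:.2%}")
```

Training well under one percent of the weights keeps each epoch fast, but it also limits how far the model can adapt to the consultation documents.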

##### Fine-tuning BERT

Finally, we will define our methods to train (fine-tune) and evaluate the model. 

In [29]:
def train(model, dataloader, criterion, optimizer):
  model.train()
  total_loss = 0
  total_preds=[]
  
  for inputs in tqdm(dataloader):
    
    # push to gpu
    inputs = [r.to(device) for r in inputs]
    sent_id, mask, labels = inputs

    # zero the parameter gradients
    model.zero_grad()        

    # forward + backward + optimize 
    preds = model(sent_id, mask)
    loss = criterion(preds, labels)
    total_loss += loss.item()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) #prevent exploding gradient problem
    optimizer.step()
    preds=preds.detach().cpu().numpy()   

    # append the model predictions
    total_preds.append(preds)

  # epoch loss and model predictions
  epoch_loss = total_loss / len(dataloader)
  total_preds  = np.concatenate(total_preds, axis=0)

  return epoch_loss, total_preds
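The call to `clip_grad_norm_` in the training loop rescales the gradients whenever their overall norm exceeds the given maximum. The idea can be sketched in pure Python as follows. This is a simplified illustration: `clip_by_global_norm` is a hypothetical helper that works on a flat list of floats, whereas the PyTorch function operates on parameter tensors in place.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale the gradient vector down so that its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # direction is kept, only the length shrinks
    return grads, total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(f"norm before clipping: {norm:.1f}")
print(f"clipped gradients: {[round(g, 3) for g in clipped]}")
```

Gradients whose norm is already below the threshold are left unchanged, so clipping only intervenes on unusually large updates.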


In [30]:
def evaluate(model, dataloader, criterion):
  model.eval()
  total_loss = 0
  total_preds = []

  for inputs in tqdm(dataloader):
    
    # push to gpu
    inputs = [t.to(device) for t in inputs]
    sent_id, mask, labels = inputs

    with torch.no_grad():
      preds = model(sent_id, mask)
      loss = criterion(preds,labels)
      total_loss += loss.item()
      preds = preds.detach().cpu().numpy()
      total_preds.append(preds)

  # epoch loss and model predictions
  epoch_loss = total_loss / len(dataloader)
  total_preds  = np.concatenate(total_preds, axis=0)

  return epoch_loss, total_preds

In [31]:
def fit(model, criterion, train_loader, val_loader, epochs):
    best_valid_loss = float('inf')

    train_losses=[]
    valid_losses=[]

    for epoch in range(epochs):
        
        print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
        train_loss, _ = train(model, train_loader, criterion, optimizer)
        valid_loss, _ = evaluate(model, val_loader, criterion)
        
        # save best model
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), './models/bert_saved_weights.pt')
        
        # append training and validation loss
        train_losses.append(train_loss)
        valid_losses.append(valid_loss)
        
        print(f"Train Loss: {train_loss:.2f}")
        print(f"Validation Loss: {valid_loss:.2f}")

We will use AdamW as our optimizer which is an [improved version](https://arxiv.org/abs/1711.05101) of the Adam optimizer.

In [32]:
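from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favour of this one

epochs = 20
learning_rate = 1e-5

# weight_decay=0.0 keeps the default of the transformers implementation used previously
optimizer = AdamW(bert_classifier.parameters(), lr = learning_rate, weight_decay=0.0)
criterion  = nn.NLLLoss() 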
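`nn.NLLLoss` expects log-probabilities, which is why the model ends in a log-softmax layer. Together the two amount to the usual cross-entropy loss. The sketch below verifies this numerically in pure Python; the function names mirror the PyTorch modules but are hypothetical re-implementations for a single example.

```python
import math


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]


def cross_entropy(logits, target):
    """Direct formula: -log(softmax(logits)[target])."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))


logits = [2.0, 0.5]   # illustrative scores for the two classes
print(nll_loss(log_softmax(logits), target=1))
print(cross_entropy(logits, target=1))

# A model that cannot separate the classes predicts 50/50 and scores ln(2)
chance_loss = nll_loss(log_softmax([0.0, 0.0]), target=0)
print(f"chance-level loss: {chance_loss:.3f}")
```

The chance-level value of about 0.693 is a useful reference point when reading the training log: a binary classifier whose loss stays near 0.69 has learned little beyond guessing.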



In [33]:
fit(bert_classifier, criterion, train_dataloader, test_dataloader, epochs)


 Epoch 1 / 20
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/26 [00:00<?, ?it/s]



100%|██████████| 26/26 [00:03<00:00,  6.70it/s]
  0%|          | 0/12 [00:00<?, ?it/s]



100%|██████████| 12/12 [00:02<00:00,  5.08it/s]


Train Loss: 0.70
Validation Loss: 0.69

 Epoch 2 / 20


  0%|          | 0/26 [00:00<?, ?it/s]



100%|██████████| 26/26 [00:03<00:00,  6.73it/s]
  0%|          | 0/12 [00:00<?, ?it/s]



100%|██████████| 12/12 [00:02<00:00,  4.43it/s]


Train Loss: 0.69
Validation Loss: 0.68

 Epoch 3 / 20


  0%|          | 0/26 [00:00<?, ?it/s]



100%|██████████| 26/26 [00:04<00:00,  5.21it/s]
  0%|          | 0/12 [00:00<?, ?it/s]



100%|██████████| 12/12 [00:03<00:00,  3.91it/s]


Train Loss: 0.68
Validation Loss: 0.69

 Epoch 4 / 20


  0%|          | 0/26 [00:00<?, ?it/s]



100%|██████████| 26/26 [00:03<00:00,  7.36it/s]
  0%|          | 0/12 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 12/12 [00:02<00:00,  5.68it/s]


Train Loss: 0.67
Validation Loss: 0.69

 Epoch 5 / 20


100%|██████████| 26/26 [00:03<00:00,  6.50it/s]
100%|██████████| 12/12 [00:02<00:00,  5.46it/s]


Train Loss: 0.67
Validation Loss: 0.69

 Epoch 6 / 20


100%|██████████| 26/26 [00:03<00:00,  6.99it/s]
100%|██████████| 12/12 [00:02<00:00,  5.89it/s]


Train Loss: 0.66
Validation Loss: 0.70

 Epoch 7 / 20


100%|██████████| 26/26 [00:03<00:00,  8.56it/s]
100%|██████████| 12/12 [00:02<00:00,  5.40it/s]


Train Loss: 0.67
Validation Loss: 0.71

 Epoch 8 / 20


100%|██████████| 26/26 [00:04<00:00,  5.24it/s]
100%|██████████| 12/12 [00:04<00:00,  2.87it/s]


Train Loss: 0.66
Validation Loss: 0.69

 Epoch 9 / 20


100%|██████████| 26/26 [00:03<00:00,  6.81it/s]
100%|██████████| 12/12 [00:02<00:00,  4.41it/s]


Train Loss: 0.66
Validation Loss: 0.72

 Epoch 10 / 20


100%|██████████| 26/26 [00:03<00:00,  8.00it/s]
100%|██████████| 12/12 [00:02<00:00,  5.37it/s]


Train Loss: 0.68
Validation Loss: 0.69

 Epoch 11 / 20


100%|██████████| 26/26 [00:03<00:00,  7.83it/s]
100%|██████████| 12/12 [00:02<00:00,  5.75it/s]


Train Loss: 0.65
Validation Loss: 0.72

 Epoch 12 / 20


100%|██████████| 26/26 [00:02<00:00,  8.76it/s]
100%|██████████| 12/12 [00:02<00:00,  4.65it/s]


Train Loss: 0.68
Validation Loss: 0.72

 Epoch 13 / 20


100%|██████████| 26/26 [00:03<00:00,  6.77it/s]
100%|██████████| 12/12 [00:02<00:00,  5.10it/s]


Train Loss: 0.67
Validation Loss: 0.70

 Epoch 14 / 20


100%|██████████| 26/26 [00:03<00:00,  6.57it/s]
100%|██████████| 12/12 [00:02<00:00,  4.82it/s]


Train Loss: 0.69
Validation Loss: 0.69

 Epoch 15 / 20


100%|██████████| 26/26 [00:04<00:00,  5.60it/s]
100%|██████████| 12/12 [00:02<00:00,  4.06it/s]


Train Loss: 0.65
Validation Loss: 0.69

 Epoch 16 / 20


100%|██████████| 26/26 [00:03<00:00,  6.88it/s]
100%|██████████| 12/12 [00:02<00:00,  5.70it/s]


Train Loss: 0.67
Validation Loss: 0.69

 Epoch 17 / 20


100%|██████████| 26/26 [00:03<00:00,  8.00it/s]
100%|██████████| 12/12 [00:01<00:00,  6.68it/s]


Train Loss: 0.68
Validation Loss: 0.71

 Epoch 18 / 20


100%|██████████| 26/26 [00:03<00:00,  7.77it/s]
100%|██████████| 12/12 [00:01<00:00,  6.37it/s]


Train Loss: 0.64
Validation Loss: 0.71

 Epoch 19 / 20


100%|██████████| 26/26 [00:03<00:00,  7.50it/s]
100%|██████████| 12/12 [00:01<00:00,  6.35it/s]


Train Loss: 0.68
Validation Loss: 0.70

 Epoch 20 / 20


100%|██████████| 26/26 [00:03<00:00,  8.32it/s]
100%|██████████| 12/12 [00:01<00:00,  7.11it/s]

Train Loss: 0.63
Validation Loss: 0.70
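
A note on reading the log above: for a two-class problem, a model that always outputs equal probabilities for both classes has a mean cross-entropy of ln 2 ≈ 0.693. The sketch below compares the validation losses printed for epochs 3 to 20 against that reference. It assumes the printed values are mean cross-entropy (or negative log-likelihood) losses over two classes, which the log itself does not state; `chance_loss`, `val_losses`, and `best_epoch` are illustrative names.

```python
import math

# Reference loss of a two-class model that always predicts 50/50.
# Assumption: the logged losses are mean cross-entropy / NLL values.
chance_loss = math.log(2)

# Validation losses as printed in the log for epochs 3 to 20.
val_losses = [0.69, 0.69, 0.69, 0.70, 0.71, 0.69, 0.72, 0.69, 0.72,
              0.72, 0.70, 0.69, 0.69, 0.69, 0.71, 0.71, 0.70, 0.70]

# First epoch (counting from 3) with the lowest validation loss.
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
best_epoch = best_index + 3

print(f"chance-level loss: {chance_loss:.3f}")
print(f"lowest validation loss: {min(val_losses):.2f} (epoch {best_epoch})")
```

Under that assumption, a validation loss that stays at or above roughly 0.69 suggests the model is not yet doing better than chance on the validation set, even while the training loss drifts downward.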





## D. Model Evaluation
In this section of the tutorial, we compare the results of three types of text classification models.

##### Logistic Regression
The baseline logistic regression model, which uses TF-IDF-weighted GloVe embeddings, serves as a reference point for judging the final BERT model. Its accuracy on the 23 test documents is 57%; for class 1 ("agree"), precision is 59% and recall is 77%.


In [161]:
# Load the model from the file
with open('./models/logistic.pkl', 'rb') as f:
  clf = pickle.load(f)

y_pred = clf.predict(l_test_text)

# Generate the classification report
report = classification_report(l_test_labels, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.50      0.30      0.37        10
           1       0.59      0.77      0.67        13

    accuracy                           0.57        23
   macro avg       0.54      0.53      0.52        23
weighted avg       0.55      0.57      0.54        23
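
The figures in the report can be traced back to four counts. The counts below are inferred from the report's recall and support columns (3 of 10 class-0 documents and 10 of 13 class-1 documents classified correctly); the notebook does not print them, so treat them as a reconstruction. The helper names are illustrative.

```python
# Confusion counts inferred from the report (not printed by the notebook).
tn, fp = 3, 7    # true class 0: predicted as 0 / predicted as 1
fn, tp = 3, 10   # true class 1: predicted as 0 / predicted as 1

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0

precision_1 = safe_ratio(tp, tp + fp)   # of all predicted "agree", share correct
recall_1 = safe_ratio(tp, tp + fn)      # of all true "agree", share found
precision_0 = safe_ratio(tn, tn + fn)
recall_0 = safe_ratio(tn, tn + fp)
accuracy = safe_ratio(tp + tn, tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Seen this way, the 59% precision for class 1 means that 7 of the 17 documents predicted as "agree" actually belong to class 0.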



##### LSTM
The LSTM model uses 3 hidden layers, with 80 input features from a 200-dimensional embedding layer. Even without hyperparameter tuning, the accuracy of this RNN-based model (65%) is higher than that of the logistic regression (57%). Unlike the logistic regression, the LSTM has higher precision (73%) than recall (62%) for class 1. The best results were reached at epoch 14 of 20; after that, the model started overfitting to the training data.

In [222]:
# load weights of best model
path = './models/lstm_saved_weights.pt'
lstm_classifier.load_state_dict(torch.load(path))

# get predictions for test data
with torch.no_grad():
  preds = lstm_classifier.forward(X.to(device))
  preds = preds.detach().cpu().numpy()

preds = np.argmax(preds, axis = 1)
print(classification_report(Y, preds))

              precision    recall  f1-score   support

           0       0.58      0.70      0.64        10
           1       0.73      0.62      0.67        13

    accuracy                           0.65        23
   macro avg       0.66      0.66      0.65        23
weighted avg       0.66      0.65      0.65        23



##### BERT
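The BERT model matches the logistic regression in accuracy (57%) and is slightly below the LSTM (65%). It has the highest recall for class 1 ("agree") of the three models, at 100%, but the report below shows why: precision and recall for class 0 are both 0, which means the model labeled all 23 test documents as "agree". Its accuracy therefore equals the share of "agree" documents in the test set (13 of 23), and the model does not yet separate the two classes.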

In [35]:
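# load weights of best model
path = './models/bert_saved_weights.pt'
bert_classifier.load_state_dict(torch.load(path))

# get predictions for test data (no gradients are needed at inference time)
with torch.no_grad():
  preds = bert_classifier(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()

# convert class scores to predicted labels and print the report
preds = np.argmax(preds, axis = 1)
print(classification_report(test_y, preds))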

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        10
           1       0.57      1.00      0.72        13

    accuracy                           0.57        23
   macro avg       0.28      0.50      0.36        23
weighted avg       0.32      0.57      0.41        23
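
The report above is what a classifier that labels every test document as class 1 would produce. The sketch below checks this with a hypothetical constant classifier on a test set with the same class supports (10 and 13); the label ordering and the names `true_labels`, `constant_preds`, and `per_class_metrics` are assumptions made for illustration.

```python
from collections import Counter

# Class supports taken from the report: 10 documents in class 0, 13 in class 1.
# The ordering of the labels is arbitrary and only for illustration.
true_labels = [0] * 10 + [1] * 13
# Hypothetical classifier that labels every document as class 1 ("agree").
constant_preds = [1] * len(true_labels)

def per_class_metrics(y_true, y_pred, cls):
    """Precision and recall for one class; undefined ratios are reported as 0.0."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    actual = sum(t == cls for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

accuracy = sum(t == p for t, p in zip(true_labels, constant_preds)) / len(true_labels)

print(Counter(constant_preds))  # how many documents fall into each predicted class
for cls in (0, 1):
    precision, recall = per_class_metrics(true_labels, constant_preds, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

In the notebook, running `Counter(preds.tolist())` or `np.bincount(preds)` on the BERT predictions is a quick way to spot this situation.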





<a name="results-and-discussion"></a>
# Results & Discussion

In the context of the European Commission's public consultations, a document classifier can be used as a pre-screening tool that makes the review of stakeholder input more efficient. Instead of staff members going through each submission individually, the classifier can categorize the documents according to whether a stakeholder agrees or disagrees with a proposed law.

Documents classified as "**agree**" can be archived and taken as is, while documents classified as "**disagree**" need to be checked manually by staff in order to understand the stakeholder's negative sentiment. These comments and insights will be valuable for improving legislative proposals.

With this in mind, it is important that our document classifier not only makes accurate predictions but also minimizes false positives, that is, maximizes precision. Out of all documents classified by the algorithm as agreeing with the legislation, the share that *actually* agree should be as high as possible. This way, fewer dissenting submissions are archived unread, and more valuable information can be integrated into the EC's policy and decision-making process.
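
One common way to favor precision is to label a document as "agree" only when the predicted probability for that class exceeds a chosen threshold. The sketch below picks the lowest threshold that reaches a target precision. It is not part of the tutorial's pipeline: the function names are hypothetical, and the toy probabilities and labels are made up for illustration and are not outputs of the models above.

```python
def precision_at_threshold(probs, labels, threshold):
    """Precision for class 1 when predicting 1 whenever prob >= threshold."""
    predicted_pos = [y for p, y in zip(probs, labels) if p >= threshold]
    if not predicted_pos:
        return None  # precision is undefined when nothing is predicted as 1
    return sum(predicted_pos) / len(predicted_pos)

def lowest_threshold_for_precision(probs, labels, target):
    """Smallest observed probability usable as a threshold that meets the target."""
    for threshold in sorted(set(probs)):
        precision = precision_at_threshold(probs, labels, threshold)
        if precision is not None and precision >= target:
            return threshold
    return None  # no threshold reaches the target

# Toy validation data (illustrative only): probability of "agree" and true label.
toy_probs = [0.95, 0.90, 0.80, 0.65, 0.60, 0.55, 0.40, 0.30]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 1]

threshold = lowest_threshold_for_precision(toy_probs, toy_labels, target=0.8)
print("threshold for precision >= 0.8:", threshold)
```

A higher threshold sends more documents to manual review, so the threshold would need to be chosen on validation data and weighed against the staff time available.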

Based on the results of the three models, the RNN-based LSTM classifier currently performs best in terms of both accuracy (65%) and precision for the "agree" class (73%). We note, however, that the results may change when the models are refined further (as discussed in the **Next Steps** section).



## Limitations

This tutorial is focused on education and learning. Its main purpose is to present different models and approaches that can be implemented to classify unstructured documents. The following limitations should be kept in mind: 

*   The models assign each DMA public consultation document to a position (agree or disagree), but this alone is not enough to understand how different stakeholders in digital markets respond to the proposal.
*   The dataset is small (the test set contains only 23 documents), so the results are not optimal. Thus, the findings of our models cannot be directly used to infer policy implications.
*   The models are trained on DMA public consultation documents, which likely contain terminology specific to digital markets. As such, there is no guarantee that the models can generalize to public consultations in other industries.

Nonetheless, this tutorial provides a structured project pipeline for classifying unstructured documents using machine learning and NLP-based models, which future researchers can adapt for similar projects.


## Next Steps

For future research and studies that involve classifying unstructured documents and/or analyzing EU public consultation documents, we recommend the following:
1. Data collection
*   First, collect more textual data from different public consultations in the EU or from similar sources. We highlight the importance of using a large enough dataset to train and fine-tune a pre-trained model in order to reduce overfitting.
*   Another limitation of our current data concerns the labels: they initially comprised three classes, but we merged the answers "No" and "Not applicable" and treated both as not agreeing with the DMA proposal. We advise revisiting this approach and assessing whether these two responses should be treated differently.
*   Not all datasets, especially collections of unstructured documents, come with labels. We highly recommend allotting time to annotate labels as early as possible, guided by the specific research questions you have in mind.
2. Pre-processing techniques
*   Our parsing and pre-processing approach used a simple technique that reads all the text in a document together. Other NLP methods, such as document layout analysis, can help identify the different sections of a document, including headings, body, tables, and footnotes. This makes it possible to tag or keep only the relevant parts of the text for the model.
*   Techniques such as Named Entity Recognition can also be used as an advanced pre-processing step. Names of businesses and companies can be removed from the text that is fed into the algorithms, given that proper nouns will likely not provide any information about stakeholder sentiment.
*   Image-to-text methods (OCR) for scanned PDFs, as well as machine translation for non-English documents, can be used to avoid shrinking the training data.
3. Modeling
*   The entire architecture of the pre-trained NLP model can be trained so that all pre-trained weights are updated based on the new (large) dataset.
*   Other BERT variants can also be used, but LegalBERT seems to be the most appropriate so far for training data involving legal texts.
*   Another best practice is to define a validation dataset and use it for hyperparameter tuning (for example, grid search) and cross-validation.
*   Future research can also explore the use of unsupervised machine learning and deep learning models to identify clusters and textual features in the documents. Topic models such as Latent Dirichlet Allocation, Structural Topic Modeling, and BERTopic can be used to understand how the type of stakeholder (e.g., large companies, MSMEs, public institutions), the proposed bill, and other variables relate to the position and/or emotions of the stakeholder toward a given law.
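
As an illustration of the cross-validation recommendation above, the sketch below splits a small labeled dataset into stratified folds so that each fold keeps roughly the same class balance. It is a minimal pure-Python version with hypothetical names (`stratified_folds`, `train_val_splits`); the example labels only mirror the 10/13 class split of the test set reported earlier and are not the tutorial's actual data.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # round-robin within each class
    return folds

def train_val_splits(folds):
    """Yield (train_indices, val_indices), using each fold once for validation."""
    for i, val_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, list(val_fold)

# Illustrative labels with a 10 / 13 class split.
example_labels = [0] * 10 + [1] * 13
folds = stratified_folds(example_labels, k=5)
for train_idx, val_idx in train_val_splits(folds):
    n_pos = sum(example_labels[i] for i in val_idx)
    print(f"train={len(train_idx)} val={len(val_idx)} class-1 in val={n_pos}")
```

In practice, scikit-learn's `StratifiedKFold` and `GridSearchCV` provide the same idea with more options; the hand-written version here is only meant to make the mechanics visible.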

<a name="references"></a>
# References

* Hugging Face LEGAL-BERT https://huggingface.co/nlpaueb/legal-bert-base-uncased
* Hugging Face Transformers https://huggingface.co/docs/transformers/glossary
* Transfer Learning for NLP: Fine-tuning BERT for Text Classification https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/
* LSTM for Text Classification https://www.kaggle.com/code/mehmetlaudatekman/lstm-text-classification-pytorch/notebook


## Acknowledgement

The transfer learning with BERT tutorial was inspired by the Analytics Vidhya tutorial on [Transfer Learning for NLP: Fine-tuning BERT for Text Classification](https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/).