
In [2]:
%matplotlib inline

In [3]:
import os
from random import randint, seed
from time import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import pickle

!pip install nose

from nose.tools import *


In [6]:
#!pip install -q -U tf-models-official
!pip install bert-for-tf2
!pip install -q -U tensorflow-text

In [7]:
import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_text as text  # A dependency of the preprocessing model
from bert import bert_tokenization

#from official.nlp import bert # needed for tokenizer

# NER and Entity Linking in unstructured legal documents in Bulgarian language

##### Final exam report

*Viktor Belchev - student*

*Deep Learning - Software University*

*February 2021*

## Abstract


Legal documents routinely cite laws and other juridical terms, often hidden behind abbreviations. While this is meant to make the text shorter and clearer, it mostly makes the text harder to understand and to translate into plain language, and not only for lay readers: sometimes even lawyers and people with a legal background feel lost. Understanding such a text therefore often requires additional research in the legal literature. At that stage another problem may occur: correctly decoding the abbreviated terms can become an obstacle on top of understanding the information as a whole.

In this paper, I try to use modern Deep Learning to create a helpful tool that overcomes these problems. Using state-of-the-art models from the field of Natural Language Processing, such as BERT, together with a list of abbreviations available at this stage, I aim to achieve acceptable accuracy for Bulgarian on a task that combines two sub-tasks: Named Entity Recognition and Entity Linking.

## Introduction

Generally, in Natural Language Processing (hereafter NLP) the process of disambiguating terms is known as Entity Linking (hereafter EL), which goes hand in hand with another operation called Named Entity Recognition (hereafter NER). As explained by Iva Marinova et al. in **"Reconstructing NER Corpora: a Case Study on Bulgarian"**, while in the field of Deep Learning these two related tasks are considered to be well covered for Germanic, Romance and other language groups, they are still under-resourced for the Slavic languages, especially from a multilingual perspective.

Usually, the two tasks are applied in sequence, starting with NER.

The purpose of NER is to tag the words in a sentence with predefined tags in order to extract important information from it, for instance names, geographical locations, dates or currency amounts.
In NER, each token in the sentence is tagged with a label that indicates the specific meaning of the token. In this way we can analyze the sentence in more detail and extract the important information.

There are two popular approaches for NER:
- Multi-class classification: NER is treated as a multi-class classification problem, and a text classification method is used to label each token.
- Conditional Random Field (CRF): a probabilistic graphical model that labels tokens taking context into account. It scores whole sequences of labels for the sequence of tokens in a sentence and selects the most probable one.

The identification of named entity mentions in text is often implemented with a sequence tagger, where each token is labeled with a BIO tag indicating whether the token begins a named entity (B-), is inside a named entity (I-), or is outside any named entity (O). This type of annotation was used in the CoNLL-2003 dataset created for NER (Tjong Kim Sang and De Meulder, 2003). There are other tagging schemes. For instance, each token can be tagged as B- (begin), I- (inside), E- (end) or S- (singleton) of a named entity together with its type, or O (outside) of named entities. For simplicity, I will stick to the BIO format.
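To make the BIO scheme concrete, here is a minimal sketch that groups BIO-tagged tokens back into entity spans. The function name `bio_to_spans` and the shortened example sentence are illustrative assumptions, not part of the dataset tooling.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative example (hypothetical shortened sentence).
tokens = ["Газопроводът", "Северен", "поток", "2", "от", "Русия", "към", "ЕС"]
tags = ["O", "B-PRO", "I-PRO", "I-PRO", "O", "B-LOC", "O", "B-ORG"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

In this sketch a stray I- tag that does not follow a matching B- tag is simply ignored; a stricter implementation might treat it as an annotation error instead.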

Entity linking can be applied right after the NER task is performed, although some papers on the topic propose performing the two tasks jointly for each token, so that each subtask benefits from the partial output of the other, alleviating the error propagation that is unavoidable in a pipeline setting.
Generally, EL is the task of mapping words from the text (e.g. names of persons, locations and organizations) to entities in a target knowledge base. For this purpose I use a document containing most of the existing abbreviations used in legal documents.
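In its simplest form, linking an abbreviation to a knowledge base can be sketched as a dictionary lookup over the tokens that NER marked as legal mentions. The snippet below is only an illustration: the three dictionary entries, the `LAW` tag name and the function `link_entities` are assumptions made for the example, and multi-token mentions are not handled.

```python
# Hypothetical mini knowledge base: abbreviation -> full title (illustrative entries).
ABBREVIATIONS = {
    "АПК": "Административнопроцесуален кодекс",
    "ГПК": "Граждански процесуален кодекс",
    "ЗУТ": "Закон за устройство на територията",
}


def link_entities(tokens, tags, knowledge_base):
    """Return (mention, full_title) pairs for tokens tagged as LAW entities.

    Mentions missing from the knowledge base are linked to None.
    """
    links = []
    for token, tag in zip(tokens, tags):
        if tag.endswith("-LAW"):  # assumed tag name for legal mentions
            links.append((token, knowledge_base.get(token)))
    return links


tokens = ["чл.", "145", "от", "АПК"]
tags = ["O", "O", "O", "B-LAW"]
links = link_entities(tokens, tags, ABBREVIATIONS)
print(links)
```

A plain lookup like this cannot resolve an abbreviation that has several possible expansions; that is where the context-based disambiguation step of EL becomes necessary.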

## NER

Usually, whatever the specific task, building a Deep Learning model requires large amounts of data for training, validation and testing. For Bulgarian, such data could in principle be obtained by scraping web pages, which is a huge amount of work. But this is only one of the hidden obstacles: making the model effective for the task is a real challenge on its own.

Luckily, since the publication of the famous paper "Attention Is All You Need" by Vaswani et al. and the appearance of the *Transformer*, model building has advanced hugely compared to the earlier use of recurrent networks (RNN), bidirectional Long Short-Term Memory units (BiLSTM), convolutions (CNN) and CRF. The Transformer uses stacked self-attention and point-wise, fully connected layers as the basic building blocks of its encoder and decoder. Experiments on various tasks show Transformers to be superior in quality while requiring significantly less time to train.
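The core operation behind self-attention can be sketched in a few lines of NumPy. This is a simplified illustration of scaled dot-product attention only: it leaves out the learned query/key/value projections, multiple heads and masking used in the real Transformer, and the random input is just a placeholder.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for 2-D arrays of shape (seq_len, d)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional vectors (placeholder data)
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)
print(weights.sum(axis=-1))
```

Each row of `weights` sums to one, so every output vector is a weighted average of all token vectors in the sequence, which is how each token can take the whole context into account.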

Pre-trained models, mostly based on the Transformer, already exist and provide results close to human performance on some general tasks. Some of the most widely used are ELMo (Embeddings from Language Models), OpenAI GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

It is important to underline that these state-of-the-art models use a specific representation of text, called embeddings: usually a so-called *hybrid representation* of text as low-dimensional, real-valued dense vectors, where each dimension represents a latent feature. It is called *hybrid* because it combines *word-level* and *character-level* representations with some additional features. In this way it captures the semantic and syntactic properties of words as well as the context of each word.

In recent years, advances in NLP in general and NER in particular have been greatly influenced by deep transfer learning methods capable of creating contextual representations of words, to the extent that many state-of-the-art NER systems differ from one another mainly in how these contextual representations are created. With such models, sequence tagging tasks are often approached one sentence at a time, essentially discarding any information available in the broader surrounding context, and there has been little recent study of the use of cross-sentence context (the sentences around the sentence of interest) to improve sequence tagging performance.

Precisely because it can use this cross-sentence context, and because it has the advantage of being pre-trained on Bulgarian texts, in this notebook I focus on the recent BERT deep transfer learning models, based on self-attention and the Transformer architecture. BERT uses a fixed-size window that limits the amount of text that can be input to the model at one time. The maximum window size, or maximum sequence length, is fixed during pre-training, with 512 wordpieces a common choice. This window fits dozens of typical sentences at a time, allowing the inclusion of extensive cross-sentence context.
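One possible way to exploit the fixed window is to surround the sentence of interest with as many neighbouring sentences as fit. The sketch below is an assumption about how this could be done, not the method of the cited paper: `build_context_window` is a hypothetical helper, and it counts whole tokens rather than wordpieces and ignores the special `[CLS]`/`[SEP]` tokens.

```python
def build_context_window(sentences, index, max_tokens=512):
    """Surround sentences[index] with neighbouring sentences within a token budget.

    `sentences` is a list of token lists. Returns (start, end) such that
    sentences[start:end] contains the target sentence and holds at most
    max_tokens tokens (unless the target sentence alone is longer).
    """
    start, end = index, index + 1
    used = len(sentences[index])
    grew = True
    while grew:
        grew = False
        # Try to extend to the right, then to the left, one sentence at a time.
        if end < len(sentences) and used + len(sentences[end]) <= max_tokens:
            used += len(sentences[end])
            end += 1
            grew = True
        if start > 0 and used + len(sentences[start - 1]) <= max_tokens:
            used += len(sentences[start - 1])
            start -= 1
            grew = True
    return start, end


# Toy document: four sentences of 5, 4, 6 and 3 tokens.
toy_sentences = [["a"] * 5, ["b"] * 4, ["c"] * 6, ["d"] * 3]
print(build_context_window(toy_sentences, index=1, max_tokens=12))
print(build_context_window(toy_sentences, index=1, max_tokens=512))
```

With a budget of 12 tokens only the following sentence fits next to the target one, while the default budget of 512 covers the whole toy document.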

Several advantages pushed me towards BERT. To enumerate some, it offers:
1. quicker development;
2. a way around the lack of training data, which is generally the case for Bulgarian;
3. state-of-the-art results: BERT builds on a number of clever ideas that have been considered top in the NLP community in recent years, including but not limited to Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al.).

BERT is also one of the preferred models, giving the best results, used by Ilias Chalkidis et al. for Large-Scale Multi-Label Text Classification (LMTC) in the legal domain (EU legislation).

On the other hand, there are some disadvantages:
1. It is very large. The LARGE version of BERT would provide better results, but it would require more computational power and time.
2. Even in the BASE version, it remains slow to fine-tune.
3. The multilingual version that I need to use cannot be distilled: the vocabulary used for fine-tuning BERT must remain the original one.
4. It uses its own, somewhat complicated, vocabulary conventions, meaning that text for BERT must be tokenized with the BERT tokenizer.

The last two disadvantages in fact represent the specificity, and maybe the strength, of BERT. Its vocabulary is indeed fixed, but it can break an unknown word down into subwords and make a token out of each subword that exists in the vocabulary. If a subword does not exist in the vocabulary, it can keep splitting it into smaller subwords, down to the character level. To mark a subword as a continuation, it prefixes it with the "##" flag; the first subword of a word gets no prefix.
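The splitting behaviour described above can be illustrated with a toy greedy, longest-match-first tokenizer for a single word. This is a simplified sketch of the idea and not the BERT tokenizer itself; `toy_vocab` is a hypothetical six-entry vocabulary chosen for the example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece-style tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the flag
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"Hello", "Ten", "##sor", "##F", "##low", "!"}  # hypothetical vocabulary
print(wordpiece_tokenize("TensorFlow", toy_vocab))
print(wordpiece_tokenize("Keras", toy_vocab))
```

The first call yields four pieces with the `##` flag on all but the first, and the second call falls back to the unknown token because the toy vocabulary contains no matching pieces.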

In turn, the subword split creates a problem with labeling. In the train and test data each whole word has one tag. If an unknown word is split into subwords, the tag of the original whole word is given to the first subword, and a specific additional tag is assigned to the subwords after the first one.
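A minimal sketch of this label alignment is shown below. The extra tag name `"X"`, the helper `align_labels` and the lookup-based `toy_tokenize` function are assumptions made for illustration; in practice the real BERT tokenizer would be passed in instead.

```python
def align_labels(words, labels, tokenize, subword_label="X"):
    """Expand word-level labels to subword tokens.

    The first subword keeps the label of the word; every following
    subword receives `subword_label` (an assumed placeholder tag).
    """
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # skip words the tokenizer drops entirely
        tokens.extend(pieces)
        aligned.extend([label] + [subword_label] * (len(pieces) - 1))
    return tokens, aligned


# Hypothetical stand-in for a real tokenizer: a fixed lookup table.
TOY_SPLITS = {"TensorFlow": ["Ten", "##sor", "##F", "##low"]}


def toy_tokenize(word):
    return TOY_SPLITS.get(word, [word])


tokens, aligned = align_labels(
    ["Hello", "TensorFlow", "!"], ["O", "B-PRO", "O"], toy_tokenize)
print(tokens)
print(aligned)
```

At prediction time the positions carrying the placeholder tag can be ignored, so that each original word ends up with exactly one predicted label.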

### Dataset creation

Before getting to the problem of tags given to subwords, we need a dataset big enough to fine-tune our BERT model. This dataset should meet the following requirements:
   - it must be created for a NER task;
   - it must be in Bulgarian;
   - it must contain special annotations (tags) for recognition of legal phrases;
   - it must be big enough to train a deep network model;
   - ideally it should have train, validation and test splits.

The first four requirements are mandatory. After all, if the dataset is big enough, there are methods to make a consistent split into train, validation and test datasets.

But it is hard to satisfy all four requirements. In fact, I was not able to find such a dataset on the Internet. Luckily, there is one recently created dataset for the NER task in Bulgarian: the dataset by Iva Marinova et al., presented in May 2020. It is available at https://github.com/usmiva/bg-ner. With it, I could cover half of the requirements for my task. Unfortunately, as it was not created for legal texts, it contains no specific tag for legal phrases. Still, it was the best one I could find. Therefore, I decided to use it as a starting point and add to it the required legal information, gathered by me.

But before adding information, let me describe what the dataset contains and how it is structured, in order to decide to what extent it suits me and how to add the missing information.

The original Bulgarian corpus consists of 916 text files extracted from various news websites. The training dataset contains information on two topics – Brexit and the trial of the Pakistani Christian Asia Bibi, accused of blasphemy, while the main subjects for the test data are the Nord Stream 2 project and the recent developments in RyanAir’s business history.

The annotation follows the format of the CoNLL-2003 dataset but uses only the first and the last column (omitting the part-of-speech tag and the syntactic chunk tag). That is, the input files are segmented into sentences with one token per line (first column), and each token is paired with its corresponding Named Entity tag (second column). The NE tags are of type person (PER), organization (ORG), location (LOC), product (PRO) and event (EVT), each prefixed using the BIO format. As in most NER tasks, NEs are considered to be non-recursive and non-overlapping, and whenever one NE is embedded in another NE, only the top-most entity is annotated.

Two text files (.txt) were available for download: one train file with 220 700 lines and one test file with ~65 000 lines.

Armed with this information, it was obvious that the missing part was the legal phrases, and thus a tag for them. I decided that I could simply add an NE tag LAW. A quick review showed that only a few words in the existing dataset could match this new tag.

Therefore, I added to the training file 117 documents taken from the "Decision Register" of the Administrative Court of Sofia City (ASCS) - http://www.admincourtsofia.bg/Default.aspx?alias=www.admincourtsofia.bg/en representing the first 5 working days of 2021 (from 4.01 to 8.01.2021). Each of these documents was converted to text; the sentences containing legal mentions were extracted and written to a file following the format of the original dataset using a simple Python script. The tags were then manually reviewed and annotated as correctly as possible, without forgetting the initially available tags for person, organization etc., with the BIO prefixes.
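The conversion script itself is not part of this notebook. As a rough idea of what such a step could look like, the hypothetical sketch below splits a sentence into tokens with a simple regular expression (an assumption, not the tokenization actually used) and writes one `token<TAB>tag` line per token, with every tag initialised to `O` for later manual annotation.

```python
import re


def sentence_to_conll(sentence, default_tag="O"):
    """Split a sentence into tokens and emit one 'token<TAB>tag' line per token."""
    # Words (including digits) and single punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return "\n".join(f"{token}\t{default_tag}" for token in tokens)


conll_block = sentence_to_conll("На основание чл. 172 от АПК.")
print(conll_block)
```

Note that this naive pattern separates the full stop from abbreviations such as "чл."; sentences would additionally be separated by a blank line in the final file.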

In this way the training file grew to 347 642 lines.

The original test document was split in two: one part for the validation and one for the test dataset. 40 documents from the same source (court decisions published from 11.01.2021) were annotated and split in the same way as for the train part. After this operation the validation file consisted of 56 880 lines and the test file of 56 908 lines.

In this way the train vs. validation ratio is at a reasonable level of roughly 86% / 14% (measured in lines).
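The proportions can be checked directly from the line counts reported above. The short calculation below uses only those three numbers; note that they count lines (tokens plus sentence separators), not sentences.

```python
train_lines, val_lines, test_lines = 347_642, 56_880, 56_908

# Share of train vs. validation, ignoring the test file.
train_vs_val = train_lines / (train_lines + val_lines)
print(f"train vs. validation: {train_vs_val:.1%} / {1 - train_vs_val:.1%}")

# Share of each file in the whole corpus.
total = train_lines + val_lines + test_lines
for name, count in [("train", train_lines),
                    ("validation", val_lines),
                    ("test", test_lines)]:
    print(f"{name}: {count / total:.1%}")
```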

### Reading the Dataset

Now it is time to read the files in order to prepare the train, validation and test datasets. For that I need a function that I will use afterwards to split each of the datasets into attributes and labels.

In [21]:
def read_data(filename):
  """Read a two-column (token<TAB>tag) file.

  Returns a list of (tokens, labels) pairs, one pair per sentence.
  Sentences are separated by blank lines.
  """
  data, sentence, label = [], [], []
  max_sentence_length = 0

  with open(filename, 'r', encoding='utf-8') as f:
    for line in f:
      line = line.rstrip('\n')
      if not line.strip():
        # A blank line closes the current sentence.
        if sentence:
          data.append((sentence, label))
          max_sentence_length = max(max_sentence_length, len(sentence))
          sentence, label = [], []
        continue
      word, lab = line.split('\t')
      sentence.append(word)
      label.append(lab)

  # Flush the last sentence if the file does not end with a blank line.
  if sentence:
    data.append((sentence, label))
    max_sentence_length = max(max_sentence_length, len(sentence))

  print("Number of sentences:", len(data))
  print("Maximum token lenght of a sentence:", max_sentence_length)
  return data

In [30]:
class InputExample:
    """A single training/test example for token-level sequence labeling."""

    def __init__(self, id, text, label=None):
        """Constructs an InputExample.

        Args:
            id: Unique id for the example.
            text: string. The untokenized text of the sentence
                (tokens joined by single spaces).
            label: (Optional) list of strings. One NE tag per token. This
                should be specified for train and dev examples, but not
                for test examples.
        """
        self.id = id
        self.text = text
        self.label = label

In [31]:
def create_examples(lines, set_type):
    """Wrap (tokens, labels) pairs into InputExample objects with ids like 'val-0'."""
    examples = []
    for i, (sentence, label) in enumerate(lines):
        examples.append(InputExample(
            id=f"{set_type}-{i}", text=' '.join(sentence), label=label))
    return examples

In [32]:
path = '/content/drive/MyDrive/Colab Notebooks/data/'

file = os.path.join(path, 'val_NER_BG.txt')

readed_data = read_data(file)

print(readed_data[:10])
#print(val_y[:10])

Number of sentences: 1695
Maximum token lenght of a sentence: 240
[(['Газопроводът', 'Северен', 'поток', '2', ',', 'който', 'по', 'план', 'ще', 'пренася', 'ежегодно', '55', 'милиарда', 'кубични', 'метра', 'природен', 'газ', 'от', 'Русия', 'към', 'ЕС', 'през', 'Балтийско', 'море', ',', 'вече', 'бе', 'одобрен', 'от', 'Германия', 'и', 'Финландия', '.'], ['O', 'B-PRO', 'I-PRO', 'I-PRO', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC', 'O']), (['САЩ', ',', 'в', 'отговор', 'заявиха', ',', 'че', 'тръбопроводът', 'ще', 'повиши', 'зависимостта', 'на', 'Европа', 'от', 'руския', 'газ', '.'], ['B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O']), (['Списание', '"', 'Foreign', 'policy', '"', 'цитира', 'три', 'източника', 'близки', 'до', 'въпроса', ',', 'които', 'твърдят', 'че', 'администрацията', 'на', 'САЩ', 'е', 'близо', 'до', 'налагането', 'н

In [33]:
validation = create_examples(readed_data, "val")

In [35]:
print("id =", validation[0].id)
print("text =", validation[0].text)
print("label = ", validation[0].label)

id = val-0
text = Газопроводът Северен поток 2 , който по план ще пренася ежегодно 55 милиарда кубични метра природен газ от Русия към ЕС през Балтийско море , вече бе одобрен от Германия и Финландия .
label =  ['O', 'B-PRO', 'I-PRO', 'I-PRO', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-ORG', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC', 'O']


In [None]:
#####################################################################

In [36]:
EPOCHS = 3  # Total number of training epochs to perform
BATCH_SIZE = 32  # Total batch size for training.
WARM_UP_PROPORTION = 0.1  # Proportion of training to perform linear learning rate warmup for. e.g., 0.1 = 10% of training.
LEARNING_RATE = 5e-5  # The initial learning rate for Adam.
WEIGHT_DECAY = 0.01  # Weight decay if we apply some
ADAM_EPSILON = 0.01  # Epsilon for Adam optimizer
MAX_SEQ_LENGTH = 256  # the length of the biggest sentence

In [None]:
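# NOTE: this cell assumes that `train_examples` (the training set built with
# create_examples) and the `WarmUp` and `AdamWeightDecay` classes are already
# defined; they are not created in the cells shown above.
num_train_optimization_steps = int(len(train_examples) / BATCH_SIZE) * EPOCHS

warmup_steps = int(WARM_UP_PROPORTION * num_train_optimization_steps)

# Decay from LEARNING_RATE down to 0 over the whole training (linear by default).
learning_rate_fn = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=LEARNING_RATE,
    decay_steps=num_train_optimization_steps,
    end_learning_rate=0.0)

if warmup_steps:
  # from optimization import AdamWeightDecay, WarmUp
  learning_rate_fn = WarmUp(initial_learning_rate=LEARNING_RATE,
                            decay_schedule_fn=learning_rate_fn,
                            warmup_steps=warmup_steps)

# Layer-norm and bias parameters are excluded from weight decay.
optimizer = AdamWeightDecay(
    learning_rate=learning_rate_fn,
    weight_decay_rate=WEIGHT_DECAY,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=ADAM_EPSILON,
    exclude_from_weight_decay=['layer_norm', 'bias'])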
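To see what a schedule of this kind does to the learning rate, here is a plain-Python sketch of linear warmup followed by linear decay to zero. It is an approximation written for illustration: `learning_rate_at` is a hypothetical helper, and it assumes that the decay is evaluated on the global step (not shifted by the warmup), which may differ from the `WarmUp` implementation actually imported.

```python
def learning_rate_at(step, total_steps, warmup_steps, initial_lr=5e-5):
    """Linear warmup followed by linear decay to zero at total_steps."""
    if warmup_steps and step < warmup_steps:
        # Warmup phase: ramp linearly from 0 towards initial_lr.
        return initial_lr * step / warmup_steps
    # Decay phase: linear decay over the whole run, clamped at total_steps.
    progress = min(step, total_steps) / total_steps
    return initial_lr * (1.0 - progress)


# Illustrative run: 1000 optimization steps with 10% warmup.
for step in (0, 50, 100, 500, 1000):
    print(step, learning_rate_at(step, total_steps=1000, warmup_steps=100))
```

Under this assumption the learning rate never quite reaches its nominal initial value: once the warmup ends, the decay curve has already dropped by the warmup proportion.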

In [None]:
#####################################################################

Finally, I need to load the BERT model. The chosen model is:
- the BERT Multilingual version (in order to have a pre-trained model that has already seen Bulgarian, with Bulgarian words being part of its vocabulary);
- the Cased version, meaning that it matters to the model whether a word contains capital letters or not;
- the BASE version, as using the LARGE model would take too many resources for training without a sufficiently significant improvement in the outcome.

When using BERT one must be aware that the tokenization with BERT should be done with BERT Tokenizer.

There are two ways to load the model.

The first one is to load the model from TensorFlow Hub:

In [19]:
bert_model_name = 'bert_multi_cased_L-12_H-768_A-12' 
map_name_to_handle = {'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3'}

map_model_to_preprocess = {'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/2',}

tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocessing model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3
Preprocessing model auto-selected: https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/2


In [20]:
bert_preprocess = hub.load(tfhub_handle_preprocess)
tok = bert_preprocess.tokenize(tf.constant(['Hello TensorFlow!']))
print(tok)

<tf.RaggedTensor [[[31178], [16411, 28919, 11565, 27863], [106]]]>


The second possibility is to download the required version to a directory from which it can be loaded directly. I did that in my Google Drive:

In [5]:
gs_folder_bert = '/content/drive/MyDrive/Colab Notebooks/bert_model/'

tf.io.gfile.listdir(gs_folder_bert)

['.ipynb_checkpoints',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_config.json',
 'bert_model.ckpt.index',
 'bert_model.ckpt.meta',
 'vocab.txt']

In [9]:
# Set up tokenizer to generate Tensorflow dataset
tokenizer = bert_tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
     do_lower_case=False)

print("Vocab size:", len(tokenizer.vocab))

Vocab size: 119547


By tokenizing a sentence, one can see that the upper case matters:

In [18]:
tokens = tokenizer.tokenize("Hello TensorFlow!")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

['Hello', 'Ten', '##sor', '##F', '##low', '!']
[31178, 16411, 28919, 11565, 27863, 106]


## Entity Linking

The Entity Linking (EL) process transforms an ambiguous textual mention into a unique identifier by looking at the context in which the mention occurs. It can thus be seen as a two-step process after NER:
1. Candidate generation: the entity linker builds a list of candidate entities for each mention.
2. Disambiguation: the list is reduced to the final ID that represents the correct entity.

This is generally the method used in the `spacy` module.

Another option is used in the `deeppavlov` module (http://docs.deeppavlov.ai/en/master/features/models/entity_linking.html), where:
1. The entity mention found by NER is fed to a tf-idf vectorizer and the resulting sparse vector is converted to a dense vector.
2. The Faiss library (https://github.com/facebookresearch/faiss) is used to find the k nearest neighbours of the tf-idf vector in a matrix where each row is the tf-idf vector of the words in an entity title.
3. The candidate entities are ranked by their number of relations in Wikidata (the number of outgoing edges of the node in the knowledge graph).
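The retrieval idea behind the first two steps can be sketched without any external library. The example below is a simplified stand-in and not the `deeppavlov` implementation: it uses character trigram tf-idf vectors and a brute-force cosine similarity search instead of Faiss, the three entity titles are illustrative, and the final ranking by Wikidata relations is left out.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def l2_normalize(vector):
    norm = math.sqrt(sum(w * w for w in vector.values())) or 1.0
    return {gram: w / norm for gram, w in vector.items()}


def build_index(titles, n=3):
    """Return idf weights and one normalized tf-idf vector per entity title."""
    counts = [Counter(char_ngrams(title, n)) for title in titles]
    doc_freq = Counter(gram for c in counts for gram in c)
    idf = {gram: math.log(len(titles) / df) + 1.0 for gram, df in doc_freq.items()}
    vectors = [l2_normalize({g: tf * idf[g] for g, tf in c.items()}) for c in counts]
    return idf, vectors


def top_k_candidates(mention, titles, idf, vectors, k=2, n=3):
    """Brute-force k-nearest-neighbour search by cosine similarity."""
    counts = Counter(char_ngrams(mention, n))
    query = l2_normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})
    scores = [sum(w * vec.get(g, 0.0) for g, w in query.items()) for vec in vectors]
    ranked = sorted(range(len(titles)), key=lambda i: scores[i], reverse=True)
    return [(titles[i], scores[i]) for i in ranked[:k]]


# Illustrative entity titles (hypothetical mini knowledge base).
titles = [
    "Административнопроцесуален кодекс",
    "Граждански процесуален кодекс",
    "Закон за устройство на територията",
]
idf, vectors = build_index(titles)
candidates = top_k_candidates("устройство на територията", titles, idf, vectors)
print(candidates[0][0])
```

For a knowledge base of realistic size the brute-force loop would be too slow, which is the reason an approximate nearest-neighbour library such as Faiss is used in practice.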

## Conclusion and Future Work

Conclusion

## Resources

**Reconstructing NER Corpora: a Case Study on Bulgarian** - Iva Marinova, Laska Laskova, Petya Osenova, Kiril Simov, Alexander Popov - Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4647–4652, Marseille, 11–16 May 2020

**Tuning Multilingual Transformers for Named Entity Recognition on Slavic Languages** - Mikhail Arkhipov, Maria Trofimova, Yuri Kuratov, Alexey Sorokin - Neural Networks and Deep Learning Laboratory, Moscow Institute of Physics and Technology, Faculty of Mathematics and Mechanics, Moscow State University - Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 89–93, Florence, Italy, 2 August 2019 - https://www.aclweb.org/anthology/W19-3712.pdf


**BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding** - Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova - Google AI Language, 24 May 2019 - https://arxiv.org/pdf/1810.04805.pdf


**Attention Is All You Need** - Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin, 6 Dec 2017 - https://arxiv.org/pdf/1706.03762.pdf

**A Survey on Deep Learning for Named Entity Recognition** - Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li - IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 18 Mar 2020 - https://arxiv.org/pdf/1812.09449v3.pdf

**Zero-Resource Cross-Domain Named Entity Recognition** - Zihan Liu, Genta Indra Winata, Pascale Fung - Center for Artificial Intelligence Research (CAiRE), Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, 19 May 2020 - https://arxiv.org/pdf/2002.05923.pdf

**Exploring Cross-sentence Contexts for Named Entity Recognition with BERT** - Jouni Luoma, Sampo Pyysalo - Turku NLP group, University of Turku, Finland, 2 Jun 2020 - https://arxiv.org/pdf/2006.01563v1.pdf

**NER with BERT in Action** - Bill Huang - July 30, 2019 - https://medium.com/@yingbiao/ner-with-bert-in-action-936ff275bc73

**The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)** - Jay Alammar blog - http://jalammar.github.io/illustrated-bert/

**Deep contextualized word representations** - Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer - Allen Institute for Artificial Intelligence and Paul G. Allen School of Computer Science & Engineering, University of Washington, 22 Mar 2018 - https://arxiv.org/pdf/1802.05365.pdf


**Improving Language Understanding by Generative Pre-Training** - Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever - Open AI - https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

**Introduction to the conll-2003 shared task: Language independent named entity recognition.** - Tjong Kim Sang, E. F. and De Meulder, F. (2003) - https://arxiv.org/pdf/cs/0306050.pdf

**Large-Scale Multi-Label Text Classification on EU Legislation** - Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Ion Androutsopoulos - Department of Informatics, Athens University of Economics and Business, Greece (June 2019) - https://arxiv.org/pdf/1906.02192v1.pdf