# Transformers Walkthrough
## Author: Brady Lamson

This notebook showcases a basic workflow for fine-tuning a transformers LLM for `named entity recognition (NER)`. There are many resources online for this, but the information I needed for this specific workflow was scattered across many sources. As such, I thought I'd consolidate it into something more digestible.

This notebook is a companion piece to a more proper writeup that will be hosted on my main site. I won't be including all of my code there so hopefully this notebook will serve as a good reference as you move through the walkthrough.

# The Dataset

We'll be using the [CoNLL-2003 dataset](https://huggingface.co/datasets/conll2003). Its dataset card describes it as follows:

> The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
> The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2 tagging scheme, whereas the original dataset uses IOB1.

This dataset conveniently contains all of the features I need for this tutorial. In the real world you'll need to put a sizable amount of effort into getting labeled data and putting it into this format, but that's outside the scope of this tutorial. I at least hope that showing what I'm working with, and explaining what the various features represent, makes that task less overwhelming.
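
To give a rough sense of what "putting it into this format" can look like, here's a minimal sketch that parses text in the four-column layout described in the quote above. The `parse_conll` helper is a hypothetical name of mine, not part of any library, and the sample lines are hand-written for illustration (the POS and chunk columns in particular are placeholders). It also assumes the named entity tags are already strings like `B-ORG`.

```python
def parse_conll(text):
    """Parse four-column CoNLL-style text into one record per sentence.

    Hypothetical helper for illustration. Each non-empty line is expected
    to hold: word, POS tag, chunk tag, NER tag. A blank line ends a sentence.
    """
    sentences = []
    tokens, ner_tags = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if tokens:
                sentences.append({"tokens": tokens, "ner_tags": ner_tags})
                tokens, ner_tags = [], []
            continue
        word, _pos, _chunk, ner = line.split()
        tokens.append(word)
        ner_tags.append(ner)
    # Flush the last sentence if the text did not end with a blank line.
    if tokens:
        sentences.append({"tokens": tokens, "ner_tags": ner_tags})
    return sentences


# Hand-written sample in the four-column layout (illustrative values only).
sample = """\
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

parsed = parse_conll(sample)
for sentence in parsed:
    print(sentence)
```

Each record ends up with a `tokens` list and a parallel `ner_tags` list, which is the same shape as the rows we'll see once the dataset is loaded.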

## Loading the Data

We'll be using the Hugging Face `datasets` library a lot here. It's built to work with `transformers` and also comes with many datasets we can load in for demonstrations such as this.

In [13]:
from datasets import load_dataset

dataset = load_dataset("conll2003")
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

What we can see here is a `DatasetDict` object. This outer object is fairly straightforward: it's a fancy dictionary that can contain any number of `Dataset` objects. In this demonstration, each `Dataset` is a different split (train, validation, or test) of the overall dataset.

Normally you'd need to split your data and load it into a dataset dictionary yourself, so it's worth getting familiar with how this object is organized. That way you know what you'll be building towards.

First, let's check out an individual row from the training split.

In [10]:
dataset["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [25]:
dataset["train"].features.keys()

dict_keys(['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'])

Okay, so what does each of these features represent?

- id: A simple index for the row, stored as a string.
- tokens: A list of the tokens (words and punctuation) that make up the sentence.
- pos_tags: Part-of-speech tags, encoded as integers. Not relevant to our problem.
- chunk_tags: Syntactic chunk tags, encoded as integers. Also not relevant to our problem.
- ner_tags: A list of integers, each standing for a named entity tag. This is the one we care about. Each position in this list lines up with the same position in `tokens`, so the first `ner_tag` is the tag for the first `token`.

What do the tags mean?

In [26]:
dataset["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
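
To make that concrete, here's a small sketch in plain Python that decodes the integer tags from the first training row using the label names printed above. The tokens, tags, and names are copied from the outputs in this notebook. The `group_entities` helper is a hypothetical name of mine, and it assumes the IOB2 scheme the dataset card mentions, where `B-` starts an entity and `I-` continues it.

```python
# Label names as printed by features["ner_tags"] above; a tag's integer
# value is its index in this list.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# First training row, copied from the output earlier in the notebook.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

id2label = dict(enumerate(label_names))
label2id = {label: i for i, label in id2label.items()}

labels = [id2label[tag] for tag in ner_tags]
print(list(zip(tokens, labels)))
# EU -> B-ORG, German -> B-MISC, British -> B-MISC, everything else -> O


def group_entities(tokens, labels):
    """Collect (entity_type, text) spans from IOB2 labels (hypothetical helper)."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here; store the previous one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Same type as the open entity, so extend it.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


print(group_entities(tokens, labels))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```

With the dataset loaded, the same list of names should be available through `dataset["train"].features["ner_tags"].feature.names`, so there's no need to type it out by hand. Mappings shaped like `id2label` and `label2id` also tend to come in handy later when configuring a model for token classification.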