##### Environment check

Before we begin, let's just confirm our current environment:

In [1]:
!transformers-cli env


Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.49.0
- Platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
- Python version: 3.9.21
- Huggingface_hub version: 0.28.1
- Safetensors version: 0.5.2
- Accelerate version: 1.3.0
- Accelerate config: 	not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.5.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: Tesla T4



In [2]:
from torch import cuda

device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


----

# Custom Named Entity Recognition with BERT

From [here](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Custom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb#scrollTo=MyETdB-dkBsX)

In [3]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizerFast, BertConfig, BertForTokenClassification

In [4]:
model_checkpoint = 'bert-base-uncased'

### Downloading and preprocessing the data

In [5]:
#data = pd.read_csv("ner_datasetreference.csv", encoding='unicode_escape')
data = pd.read_csv("ner-dataset.zip", encoding='unicode_escape')

print(data.head())
print(data.shape)

    Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O
(1048575, 4)


In [6]:
data.count()

Sentence #      47959
Word          1048565
POS           1048575
Tag           1048575
dtype: int64

In [7]:
print("Number of tags: {}".format(len(data.Tag.unique())))
frequencies = data.Tag.value_counts()
frequencies

Number of tags: 17


Tag
O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: count, dtype: int64

In [8]:
tags = {}
for tag, count in zip(frequencies.index, frequencies):
    if tag != "O":
        if tag[2:5] not in tags.keys():
            tags[tag[2:5]] = count
        else:
            tags[tag[2:5]] += count
    continue

print(sorted(tags.items(), key=lambda x: x[1], reverse=True))

[('geo', 45058), ('org', 36927), ('per', 34241), ('tim', 26861), ('gpe', 16068), ('art', 699), ('eve', 561), ('nat', 252)]


In [9]:
entities_to_remove = ["B-art", "I-art", "B-eve", "I-eve", "B-nat", "I-nat"]
data = data[~data.Tag.isin(entities_to_remove)]

print(data.head())
print(data.shape)

    Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O
(1047063, 4)


> We create 2 dictionaries: one that maps individual tags to indices,
> and one that maps indices to their individual tags. This is necessary in
> order to create the labels (as computers work with numbers = indices,
> rather than words = tags) - see further in this notebook.

In [10]:
labels_to_ids = {k: v for v, k in enumerate(data.Tag.unique())}

labels_to_ids

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'I-per': 8,
 'I-gpe': 9,
 'I-tim': 10}

In [11]:
ids_to_labels = {v: k for v, k in enumerate(data.Tag.unique())}

ids_to_labels

{0: 'O',
 1: 'B-geo',
 2: 'B-gpe',
 3: 'B-per',
 4: 'I-geo',
 5: 'B-org',
 6: 'I-org',
 7: 'B-tim',
 8: 'I-per',
 9: 'I-gpe',
 10: 'I-tim'}

N.B.


> <span style="background-color:#9fff33;">FutureWarning: `DataFrame.fillna` with 'method' is deprecated and will raise in a future version. Use `obj.ffill()` or `obj.bfill()` instead.</span>

Please see [`pandas.DataFrame.ffill`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html#pandas-dataframe-ffill).

In [12]:
print(f"Before: {data.head()}")
print()

# pandas has a very handy "forward fill" function to fill missing values based on the last upper non-nan value
#data = data.fillna(method='ffill')
data = data.ffill()

print(f"After: {data.head()}")

Before:     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1          NaN             of   IN   O
2          NaN  demonstrators  NNS   O
3          NaN           have  VBP   O
4          NaN        marched  VBN   O

After:     Sentence #           Word  POS Tag
0  Sentence: 1      Thousands  NNS   O
1  Sentence: 1             of   IN   O
2  Sentence: 1  demonstrators  NNS   O
3  Sentence: 1           have  VBP   O
4  Sentence: 1        marched  VBN   O


In [13]:
data.columns

Index(['Sentence #', 'Word', 'POS', 'Tag'], dtype='object')

> Now, we have to ask ourself the question: what is a training example in the case of NER,
> which is provided in a single forward pass? A training example is typically a sentence,
> with corresponding IOB tags. Let's group the words and corresponding tags by `sentence`...

Using some Pandas magic, we are going to first concatenate all of the `Word` column values grouped by `Sentence #`, delimited by a whsp char.

We then do nearly the same for `Tag` columns values again grouped by `Sentence #`, but delimited by a comma char.

In [14]:
%%time

# let's create a new column called "sentence" which groups the words by sentence 
data['sentence'] = (
    data[['Sentence #','Word','Tag']]
        .groupby(['Sentence #'])['Word']
            .transform(lambda x: ' '.join(x))
)

# let's also create a new column called "word_labels" which groups the tags by sentence 
data['word_labels'] = (
    data[['Sentence #','Word','Tag']]
        .groupby(['Sentence #'])['Tag']
            .transform(lambda x: ','.join(x))
)

print(data.head())

    Sentence #           Word  POS Tag  \
0  Sentence: 1      Thousands  NNS   O   
1  Sentence: 1             of   IN   O   
2  Sentence: 1  demonstrators  NNS   O   
3  Sentence: 1           have  VBP   O   
4  Sentence: 1        marched  VBN   O   

                                            sentence  \
0  Thousands of demonstrators have marched throug...   
1  Thousands of demonstrators have marched throug...   
2  Thousands of demonstrators have marched throug...   
3  Thousands of demonstrators have marched throug...   
4  Thousands of demonstrators have marched throug...   

                                         word_labels  
0  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
1  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
2  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
3  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
4  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
CPU times: user 8.02 s, sys: 144 ms, total: 8.17 s
Wall time: 8.17 s


> Let's only keep the "sentence" and "word_labels" columns, and drop duplicates:

In [15]:
data = (
    data[["sentence", "word_labels"]]
        .drop_duplicates()
        .reset_index(drop=True)
)

print(data.head())
print(data.shape)

                                            sentence  \
0  Thousands of demonstrators have marched throug...   
1  Families of soldiers killed in the conflict jo...   
2  They marched from the Houses of Parliament to ...   
3  Police put the number of marchers at 10,000 wh...   
4  The protest comes on the eve of the annual con...   

                                         word_labels  
0  O,O,O,O,O,O,B-geo,O,O,O,O,O,B-geo,O,O,O,O,O,B-...  
1  O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,B-per,O,O,...  
2                O,O,O,O,O,O,O,O,O,O,O,B-geo,I-geo,O  
3                      O,O,O,O,O,O,O,O,O,O,O,O,O,O,O  
4  O,O,O,O,O,O,O,O,O,O,O,B-geo,O,O,B-org,I-org,O,...  
(47571, 2)


In [16]:
for w,l in zip(data.iloc[41].sentence.split(), data.iloc[41].word_labels.split(',')):
    print(w,l)

Bedfordshire B-gpe
police O
said O
Tuesday B-tim
that O
Omar B-per
Khayam I-per
was O
arrested O
in O
Bedford B-geo
for O
breaching O
the O
conditions O
of O
his O
parole O
. O


### Preparing the dataset and dataloader

In [17]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
MAX_GRAD_NORM = 10

tokenizer = BertTokenizerFast.from_pretrained(model_checkpoint)

#### Implement a PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#dataset-types)

> The most important argument of `DataLoader` constructor is `dataset`,
> which indicates a dataset object from which to load data. PyTorch supports two different types of dataset:<br/>
> * [Map-style](https://pytorch.org/docs/stable/data.html#map-style-datasets) datasets that implement the `__getitem__` and `__len__` protocols
> * [Iterable-style](https://pytorch.org/docs/stable/data.html#iterable-style-datasets) datasets that implement the `__iter__` protocol

Here, we implement a map-style PyTorch `torch.utils.data.Dataset` class that will transform examples of the given dataframe to PyTorch tensors.

1. each sentence gets tokenized
2. special tokens that BERT expects are added
3. the token sequences are padded or truncated based on the max length of the model
4. the attention mask is created
5. labels are created based on the dictionary which we defined above.

Word pieces that should be ignored have a label of `-100`, the default `ignore_index` of PyTorch's [`CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html#torch-nn-functional-cross-entropy).

In [18]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        # step 1: get the sentence and word labels 
        sentence = self.data.sentence[index].strip().split()  
        word_labels = self.data.word_labels[index].split(",") 

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        encoding = self.tokenizer(
            sentence,
            #is_pretokenized=True,
            is_split_into_words=True,
            return_offsets_mapping=True, 
            padding='max_length', 
            truncation=True, 
            max_length=self.max_len
        )
        
        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [labels_to_ids[label] for label in word_labels] 
        # code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
        # create an empty array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100
        
        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)

        return item

    def __len__(self):
        return self.len

> Now, based on the class we defined above, we can create 2 datasets, one for training and one for testing. Let's use a 80/20 split

In [19]:
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (47571, 2)
TRAIN Dataset: (38057, 2)
TEST Dataset: (9514, 2)


In [20]:
training_set[0]

{'input_ids': tensor([  101, 23564, 21030,  2099,  4967,  2001,  9388,  1011,  6109,  2005,
          2634,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

In [21]:
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["input_ids"]), training_set[0]["labels"]):
    print('{0:10}  {1}'.format(token, label))

[CLS]       -100
za          3
##hee       -100
##r         -100
khan        8
was         0
mar         0
-           -100
93          -100
for         0
india       1
.           0
[SEP]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[PAD]       -100
[

> Now, let's define the corresponding PyTorch dataloaders:

In [22]:
train_params = {
    'batch_size': TRAIN_BATCH_SIZE,
    'shuffle': True,
    'num_workers': 0
}

test_params = {
    'batch_size': VALID_BATCH_SIZE,
    'shuffle': True,
    'num_workers': 0
}

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

#### Defining the model

> ... we define the model, [`BertForTokenClassification`](https://huggingface.co/docs/transformers/v4.49.0/en/model_doc/bert#transformers.BertForTokenClassification),
> and load it with the pretrained weights of `bert-base-uncased`. The only thing we need to additionally specify
> is the number of labels (as this will determine the architecture of the classification head)...
>
> ... only the base layers are initialized with the pretrained weights. The token classification head of top has
> just randomly initialized weights, which we will train, together with the pretrained weights, using our labelled
> dataset. This is also printed as a warning when you run the code cell below...
>
> (don't forget that) we move the model to the GPU...

In [23]:
model = BertForTokenClassification.from_pretrained(
    model_checkpoint, 
    num_labels=len(labels_to_ids)
)

model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

#### Training the model

In [24]:
inputs = training_set[2]
input_ids = inputs["input_ids"].unsqueeze(0)
attention_mask = inputs["attention_mask"].unsqueeze(0)
labels = inputs["labels"].unsqueeze(0)

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)
labels = labels.to(device)

outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
initial_loss = outputs[0]
initial_loss

tensor(2.4655, device='cuda:0', grad_fn=<NllLossBackward0>)

> This looks good. Let's also verify that the logits of the neural network have a shape of (`batch_size`, `sequence_length`, `num_labels`):

In [25]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 128, 11])

> Next, we define the optimizer. Here, we are just going to use Adam with a default learning rate.
> One can also decide to use more advanced ones such as AdamW (Adam with weight decay fix), which is
> included in the Transformers repository, and a learning rate scheduler, but we are not going to do that here.
>
> _... and why not?_

In [26]:
optimizer = torch.optim.Adam(
    params=model.parameters(), 
    lr=LEARNING_RATE
)

> Now let's define a regular PyTorch training function. It is partly based on [a really good repository about multilingual NER](https://github.com/chambliss/Multilingual_NER/blob/master/python/utils/main_utils.py#L344).

In [27]:
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['input_ids'].to(device, dtype = torch.long)
        mask = batch['attention_mask'].to(device, dtype = torch.long)
        labels = batch['labels'].to(device, dtype = torch.long)

        #loss, tr_logits = model(input_ids=ids, attention_mask=mask, labels=labels)
        outputs = model(input_ids=ids, attention_mask=mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += labels.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = labels.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        
        # only compute accuracy at active labels
        active_accuracy = labels.view(-1) != -100 # shape (batch_size, seq_len)
        #active_labels = torch.where(active_accuracy, labels.view(-1), torch.tensor(-100).type_as(labels))
        
        labels = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_labels.extend(labels)
        tr_preds.extend(predictions)

        tmp_tr_accuracy = accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )
        
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

... and we're off!

In [28]:
%%time

for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 2.3968489170074463


RuntimeError: The size of tensor a (512) must match the size of tensor b (128) at non-singleton dimension 0