In [1]:
import pandas as pd
import ast
import numpy as np

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertConfig, BertForTokenClassification

In [3]:
data = pd.read_csv(r"C:\Users\Abhinav\OneDrive\Documents\Git Projects\Custom-NER\NER_Dataset.csv")

## Types of NER training

### 1. Dictionary Based: 
This is the simplest NER approach. It relies on a dictionary (a vocabulary of known entity names) and uses basic string-matching algorithms to check whether any item in the vocabulary occurs in the given text. Its main limitation is that the dictionary has to be updated and maintained continuously.
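
As a rough illustration of the dictionary-based idea, the sketch below does case-insensitive substring matching against a tiny gazetteer. The `GAZETTEER` entries and the `dictionary_ner` helper are hypothetical names invented for this example, not part of the notebook.

```python
# Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
GAZETTEER = {
    "new york": "geo",
    "barack obama": "per",
    "world health organization": "org",
}

def dictionary_ner(text, gazetteer):
    """Return (start, end, surface, label) for every gazetteer entry found in text."""
    lowered = text.lower()
    matches = []
    for entry, label in gazetteer.items():
        start = lowered.find(entry)
        while start != -1:
            end = start + len(entry)
            matches.append((start, end, text[start:end], label))
            start = lowered.find(entry, end)
    return sorted(matches)

found = dictionary_ner("Barack Obama visited New York.", GAZETTEER)
print(found)

# A small spelling variation is missed, which is the weakness described above.
missed = dictionary_ner("Barak Obama visited.", GAZETTEER)
print(missed)
```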
### 2. Rule-based SystemsHere, the model uses a pre-defined set of rules for information extraction. Mainly two types of rules are used, Pattern-based rules, which depend upon the morphological pattern of the words used, and context-based rules, which depend upon the context of the word used in the given text document. A simple example for a context-based rule is “If a person’s title is followed by a proper noun, then that proper noun is the name of a person”.
###### Tokenization, Dependency Parser Tree, Pos-Tags, Stemming etc
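
The context-based rule quoted above (a title followed by a proper noun) could be approximated with a regular expression. This is only a sketch: the list of titles and the "capitalized word" test for proper nouns are simplifying assumptions, and `find_persons` is a hypothetical helper.

```python
import re

# Assumed rule: a title, an optional dot, then one or more capitalized words.
TITLE_RULE = re.compile(
    r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.?\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)"
)

def find_persons(text):
    """Return the capitalized word sequences that directly follow a title."""
    return [match.group(1) for match in TITLE_RULE.finditer(text)]

persons = find_persons("Yesterday Dr. Jane Smith met Mr Brown in Paris.")
print(persons)
```

"Paris" is not returned because no title precedes it, which shows both the precision and the narrow coverage of such hand-written rules.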
### 3. ML Based System
ML-based systems use statistical models to detect entity names. These models build a feature-based representation of the observed data. This overcomes many limitations of the dictionary and rule-based approaches, because an existing entity name can be recognized even with small spelling variations.
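
To make "feature-based representation" concrete, here is a minimal sketch of the kind of hand-crafted token features a classical sequence model (for example a CRF) might consume. The feature names and the `token_features` helper are illustrative assumptions, not taken from this notebook.

```python
def token_features(tokens, i):
    """Hand-crafted features for tokens[i]; the feature names are illustrative."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix3": word[:3].lower(),   # prefixes/suffixes tolerate small spelling changes
        "suffix3": word[-3:].lower(),
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "Barack Obama was born in Hawaii".split()
features = token_features(tokens, 1)
print(features)
```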
### 4. Deep Learning based
In recent years, deep learning-based models are being used for building state-of-the-art systems for NER. There are many advantages of using DL techniques over the previously discussed approaches. Using the DL approach, the input data is mapped to a non-linear representation. This approach helps to learn complex relations that are present in the input data. Another advantage is that we can avoid a lot of time and resources spent on feature engineering, which is required for the other traditional approaches.






In [4]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cuda


In [5]:
import torch
torch.cuda.is_available()

True

In [6]:
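# "Word" holds the string form of a Python list; literal_eval parses it without executing arbitrary code
data['Sentence'] = data["Word"].apply(lambda l: " ".join(ast.literal_eval(l)))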

### Types of tag
Let's have a look at an example. If you have a sentence like \
"Barack Obama was born in Hawaï", \
the corresponding tags are [B-PERS, I-PERS, O, O, O, B-GEO]. \
B-PERS means that the word "Barack" is the **beginning** of a person, \
I-PERS means that the word "Obama" is **inside** a person, \
"O" means that the word "was" is **outside a named entity**, and so on. So one typically has as many tags as there are words in a sentence. \
So if you want to train a deep learning model for NER, your data has to be in this **IOB format (Inside-Outside-Beginning)** or a similar format such as **BILOU**. Many annotation tools let you create these kinds of annotations (such as spaCy's Prodigy, Tagtog or Doccano). You can also use spaCy's biluo_tags_from_offsets function (renamed offsets_to_biluo_tags in spaCy v3) to convert character-level annotations to token-level BILUO tags.
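
The tagging scheme can be sketched in a few lines of plain Python. The `spans_to_iob` helper below is hypothetical and assumes entity spans are already given as word indices (start inclusive, end exclusive).

```python
def spans_to_iob(words, spans):
    """Convert (start_word, end_word_exclusive, entity_type) spans to IOB tags."""
    tags = ["O"] * len(words)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"          # remaining words of the entity
    return tags

words = "Barack Obama was born in Hawaii".split()
iob_tags = spans_to_iob(words, [(0, 2, "PERS"), (5, 6, "GEO")])
print(list(zip(words, iob_tags)))
```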

Here, we will use a NER dataset from Kaggle that is already in IOB format. One has to go to this web page, download the dataset, unzip it, and load the csv file in this notebook. Let's look at how often each tag occurs:



In [7]:
data["Tag"] = data["Tag"].apply(ast.literal_eval)
frequencies= data["Tag"].explode().value_counts()
frequencies

Tag
O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: count, dtype: int64

In [8]:
# Count occurrences per entity type (B- and I- tags combined), ignoring "O".
# The art, eve and nat types are too rare to train on reliably, so they are removed below.
tags = {}
for tag, count in zip(frequencies.index, frequencies):
    if tag != "O":
        entity = tag[2:5]
        tags[entity] = tags.get(entity, 0) + count

print(sorted(tags.items(), key=lambda x: x[1], reverse=True))

[('geo', 45058), ('org', 36927), ('per', 34241), ('tim', 26861), ('gpe', 16068), ('art', 699), ('eve', 561), ('nat', 252)]


In [9]:
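#replace the rare tags with NaN; the NaNs are filled from neighbouring tags in the next cell
to_remove = ["B-art", "I-art", "B-eve", "I-eve", "B-nat", "I-nat"]
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
data['Tag'] = data['Tag'].apply(lambda x: [i if i not in to_remove else np.nan for i in x])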


In [10]:
# forward-fill each removed tag from the preceding token, then back-fill any that remain at the start
# (Series.ffill()/bfill() replace the deprecated fillna(method=...))
data['Tag'] = data['Tag'].apply(lambda x: list(pd.Series(x).ffill()))
data['Tag'] = data['Tag'].apply(lambda x: list(pd.Series(x).bfill()))


In [11]:
pd.Series(data['Tag'].iloc[22]).bfill()


0         O
1         O
2     B-per
3     I-per
4         O
5         O
6         O
7         O
8         O
9         O
10        O
11        O
12        O
13        O
14        O
15        O
16        O
17        O
18    B-tim
19        O
20        O
21        O
22        O
23        O
24        O
25    B-geo
26        O
27        O
28    B-geo
29        O
dtype: object

In [12]:
label2id  = {str(k): v for v, k in enumerate(data['Tag'].explode().unique())}
id2label = {v: str(k) for v, k in enumerate(data['Tag'].explode().unique())}

### Preparing the dataset and dataloaders

Iterations: the number of batches needed to complete one epoch \
Batch size: the number of training samples used in one iteration \
Epoch: one full cycle through the training dataset. A cycle is composed of many iterations \
Number of steps per epoch = (total number of training samples) / (batch size), rounded up when the last batch is incomplete
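
As a quick worked example of the formula above, the sketch below computes the steps per epoch for the 38367 training rows reported later in this notebook. The `steps_per_epoch` helper and its `drop_last` flag are illustrative assumptions.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches in one epoch; the last partial batch counts unless dropped."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

print(steps_per_epoch(38367, 4))                  # 9592
print(steps_per_epoch(38367, 2))                  # 19184
print(steps_per_epoch(38367, 4, drop_last=True))  # 9591
```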

In [13]:
MAX_LEN = 128
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
#Used in gradient clipping, to keep the total gradient norm at or below MAX_GRAD_NORM
MAX_GRAD_NORM = 10
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')



In [14]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    """
    Tokenize each word into WordPieces and repeat the word's label for every resulting piece.
    """

    tokenized_sentence = []
    labels = []

    sentence = sentence.strip()

    for word, label in zip(sentence.split(), text_labels):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels

### Preparing PyTorch Dataset and DataLoader

We convert the data to PyTorch tensors that can be fed into the model.
`torch.utils.data.Dataset` is an abstract class that every custom dataset class should inherit from. A subclass implements:
1. `__init__` - runs once when the Dataset object is instantiated. Here it stores the dataframe, the tokenizer and the maximum sequence length.
2. `__getitem__` - returns the sample (with its labels) at the given index idx, so the object can be indexed directly with [], e.g. dataset[idx].
3. `__len__` - so that len(dataset) returns the size of the dataset.
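
The three methods can be tried out without PyTorch. The class below is a plain-Python stand-in (a hypothetical `ToyDataset`, not the notebook's dataset class) that only demonstrates the indexing and length protocol.

```python
class ToyDataset:
    """Plain-Python stand-in for the protocol a torch Dataset subclass implements."""

    def __init__(self, sentences, tags):
        if len(sentences) != len(tags):
            raise ValueError("sentences and tags must have the same length")
        self.sentences = sentences
        self.tags = tags

    def __getitem__(self, index):
        # Called for toy[index]
        return {"sentence": self.sentences[index], "tags": self.tags[index]}

    def __len__(self):
        # Called for len(toy)
        return len(self.sentences)

toy = ToyDataset(["Obama spoke", "Rain fell"], [["B-per", "O"], ["O", "O"]])
print(len(toy))
print(toy[0])
```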


Steps to do:
1. **wordpiece tokenization** : A tricky part of NER with BERT is that BERT relies on wordpiece tokenization, rather than **word tokenization**. This means that we should also define the labels at the wordpiece-level, rather than the word-level!

For example, if you have a word like "Washington" which is labeled as "B-gpe", but it gets tokenized to "Wash", "##ing", "##ton", then we will have to propagate the word’s original label to all of its wordpieces: "B-gpe", "B-gpe", "B-gpe". The model should be able to produce the correct labels for each individual wordpiece. The `tokenize_and_preserve_labels` function defined above (taken from here) implements this.


2. **Add '[CLS]' and '[SEP]' tokens at the front and end of every example** \
[CLS] stands for classification and is placed at the start of every sequence \
[SEP] separates sentences (it is used for the next sentence prediction task) and is placed at the end of every sequence; both tokens get the label 'O'


3. **Truncating and Padding** each example to MAX_LEN so that every row has the same length. If an example is shorter than MAX_LEN, pad it with the [PAD] token and the label 'O'. If an example is longer than MAX_LEN, truncate it.

4. **Add Attention mask**: for each token add 1 (attend) or 0 (ignore). Training still runs without it, but the model may perform worse because the padding tokens then take part in self-attention. To avoid this, set the mask to 1 for real tokens and 0 for [PAD] tokens so that the model ignores the padding.

5. **Tokens to Ids**: every token in the tokenizer vocabulary has an id; map each token of an example to its vocabulary id.
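
The five steps can be traced end to end on a toy example. The sketch below does not use the real BERT tokenizer: `TOY_SPLITS`, `TOY_VOCAB`, `toy_tokenize` and `encode` are hypothetical stand-ins with made-up ids, chosen only so the example runs without downloading a model.

```python
# Hypothetical wordpiece splits and vocabulary ids (not BERT's real ones).
TOY_SPLITS = {"washington": ["wash", "##ing", "##ton"]}
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
             "wash": 3, "##ing": 4, "##ton": 5, "is": 6, "rainy": 7}

def toy_tokenize(word):
    word = word.lower()
    return TOY_SPLITS.get(word, [word])

def encode(words, word_labels, max_len):
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        pieces = toy_tokenize(word)
        tokens.extend(pieces)
        labels.extend([label] * len(pieces))           # step 1: propagate label to pieces
    tokens = ["[CLS]"] + tokens + ["[SEP]"]            # step 2: special tokens
    labels = ["O"] + labels + ["O"]
    tokens, labels = tokens[:max_len], labels[:max_len]  # step 3: truncate ...
    pad = max_len - len(tokens)
    tokens += ["[PAD]"] * pad                          # ... or pad
    labels += ["O"] * pad
    mask = [0 if tok == "[PAD]" else 1 for tok in tokens]  # step 4: attention mask
    ids = [TOY_VOCAB[tok] for tok in tokens]           # step 5: tokens to ids
    return tokens, labels, mask, ids

tokens, labels, mask, ids = encode(
    ["Washington", "is", "rainy"], ["B-gpe", "O", "O"], max_len=8
)
for row in zip(tokens, labels, mask, ids):
    print(row)
```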

In [15]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.len = len(dataframe)
        
    def __getitem__(self, index):
        # step 1: tokenize (and adapt corresponding labels)
        sentence = self.data['Sentence'].iloc[index]  
        word_labels = self.data['Tag'].iloc[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)

        # step 2: add special tokens (and corresponding labels)
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"] # add special tokens
        labels.insert(0, "O") # add outside label for [CLS] token
        labels.append("O") # add outside label for [SEP] token (insert(-1, ...) would put it before the last label)

        # step 3: truncating/padding
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # truncate
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # pad
          tokenized_sentence = tokenized_sentence + ['[PAD]'for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # step 4: obtain the attention mask
        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        # step 5: convert tokens to input ids
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        label_ids = [label2id[label] for label in labels]
        # the following line is deprecated
        #label_ids = [label if label != 0 else -100 for label in label_ids]
        
        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              #'token_type_ids': torch.tensor(token_ids, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

### Splitting Training and test set

In [29]:
train_size = 0.8
train_dataset = data.sample(frac=train_size,random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(data.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = dataset(train_dataset, tokenizer, MAX_LEN)
testing_set = dataset(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (47959, 5)
TRAIN Dataset: (38367, 5)
TEST Dataset: (9592, 5)


In [17]:
training_set[1]

{'ids': tensor([  101,  1996,  2088,  2740,  3029,  2758,  8168,  1997,  1996,  9252,
          1044,  2629,  2078,  2487, 10178,  2579,  2013,  2048,  4743, 19857,
          5694,  1999,  4977,  2265,  1037,  7263,  7403,  2689,  1999,  1996,
          7865,  1011,  1037,  3696,  2009,  2089,  2022, 14163, 29336,  2075,
          2046,  1037,  2529, 19857,  7865,  2008,  2071,  3102,  8817,  1012,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,  

### Check labels and data

In [30]:
# print the first 30 tokens and corresponding labels
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[1]["ids"][:30]), training_set[1]["targets"][:30]):
  print('{0:10}  {1}'.format(token, id2label[label.item()]))

[CLS]       O
the         O
world       B-org
health      I-org
organization  I-org
says        O
samples     O
of          O
the         O
deadly      O
h           O
##5         O
##n         O
##1         O
strain      O
taken       O
from        O
two         O
bird        O
flu         O
victims     O
in          O
turkey      B-geo
show        O
a           O
slight      O
genetic     O
change      O
in          O
the         O


### Setting Dataloader

A DataLoader is an iterable over the dataset. Each iteration returns a batch of B samples (the inputs together with their targets/labels), where B is the batch size; with shuffle=True the samples are drawn in random order.
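
Conceptually, the batching works roughly as in the pure-Python sketch below. This is not PyTorch's implementation (the real DataLoader also collates samples into tensors and can use worker processes); `iterate_batches` and its `seed` parameter are hypothetical.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=True, seed=0):
    """Yield lists of samples in (optionally shuffled) order, batch_size at a time."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

samples = [f"sample-{i}" for i in range(5)]
batches = list(iterate_batches(samples, batch_size=2))
for batch in batches:
    print(batch)
```

With 5 samples and a batch size of 2 the last batch holds a single sample, which is why the number of steps per epoch is rounded up.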


In [31]:
train_params = {'batch_size': 2,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': 2,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

In [32]:
training_loader

<torch.utils.data.dataloader.DataLoader at 0x18a4176a4b0>

In [21]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased', 
                                                   num_labels=len(id2label),
                                                   id2label=id2label,
                                                   label2id=label2id)
model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

### Sanity Check on Model

The initial loss of your model should be close to -ln(1/number of classes) = -ln(1/11) ≈ 2.40\
But WHY?
In the beginning, the classification head is randomly initialized, so the predicted probability distribution over the classes for a given token is roughly uniform, meaning that the probability of the correct class is near 1/11. The loss for a given token is thus about -ln(1/11).
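
The expected value can be checked with a few lines of standard-library Python, assuming the 11 labels that remain after the rare tags are removed. The two helper functions are illustrative and compute the cross-entropy by hand.

```python
import math

def uniform_cross_entropy(num_classes):
    """Loss when every class gets probability 1/num_classes."""
    return -math.log(1.0 / num_classes)

def cross_entropy(logits, target):
    """Cross-entropy of one token from raw logits (log-sum-exp minus target logit)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target]

expected = uniform_cross_entropy(11)
same = cross_entropy([0.0] * 11, target=3)   # identical logits give a uniform distribution
print(round(expected, 4), round(same, 4))
```

A measured initial loss somewhat below or above this value is plausible, since a freshly initialized head is only approximately uniform.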


**unsqueeze**: inserts a new dimension of size 1 at the specified position, so the output tensor always has one more dimension than the input. Here the new dimension is added at dim=0, which turns a single example into a batch of size 1.

**squeeze**: removes a dimension of size 1 at the specified position (if that dimension does not have size 1, the tensor is returned unchanged). With dim=0, the leading batch dimension would be removed.

`Tensor.to(device)` moves a tensor to the given device (CUDA or CPU); the model and its inputs must be on the same device.


In [33]:
ids = training_set[0]["ids"].unsqueeze(0)
mask = training_set[0]["mask"].unsqueeze(0)
targets = training_set[0]["targets"].unsqueeze(0)
ids = ids.to(device)
mask = mask.to(device)
targets = targets.to(device)
outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
initial_loss = outputs[0]
initial_loss

tensor(1.9531, device='cuda:0', grad_fn=<NllLossBackward0>)

### Set Optimizer

Adjust the parameters by the gradients collected in the backward pass
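
To show what "adjusting the parameters by the gradients" means for Adam, here is a simplified sketch of one update for a single scalar parameter. It follows the standard Adam update rule with the usual default betas and epsilon; it is not PyTorch's implementation, and `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

new_param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param)
```

On the first step the update is close to `lr * sign(grad)`, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.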

In [34]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

### How to do training, line by line explanation

1. **Put the model in training mode**: this enables training-time behaviour in layers such as Dropout and BatchNorm (dropout is active, batch statistics are updated)
2. **Loop through training_loader**, the PyTorch DataLoader object, which yields the data in batches
3. **Objects on the same device**: send the batch tensors to the device (GPU or CPU) that holds the model
4. **Get predictions** for the current batch and store them in outputs
5. **Get the loss value** and add it to the running total tr_loss; nb_tr_steps counts the batches processed so far
6. **Print loss**: every 100 batches, print the average loss so far
7. **Calculate training accuracy** over the non-padded tokens
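
The accuracy in step 7 only counts positions where the attention mask is 1. A plain-Python sketch of that idea, with made-up label ids and a hypothetical `masked_accuracy` helper:

```python
def masked_accuracy(predictions, targets, mask):
    """Fraction of correct predictions among positions where mask == 1."""
    kept = [(p, t) for p, t, m in zip(predictions, targets, mask) if m == 1]
    if not kept:
        return 0.0
    return sum(p == t for p, t in kept) / len(kept)

preds   = [0, 2, 3, 0, 0, 0]
targets = [0, 2, 1, 0, 5, 5]
mask    = [1, 1, 1, 1, 0, 0]   # the last two positions are padding

accuracy = masked_accuracy(preds, targets, mask)
print(accuracy)   # 3 of the 4 unmasked positions match
```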

In [36]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['ids'].to(device, dtype = torch.long)
        mask = batch['mask'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs.loss, outputs.logits
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
        active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_preds.extend(predictions)
        tr_labels.extend(targets)
        
        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # backward pass
        optimizer.zero_grad()
        loss.backward()

        # gradient clipping (must run after backward() so that it acts on the current gradients)
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )

        # parameter update
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")

In [37]:
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 1.9192566871643066
Training loss per 100 training steps: 0.3785911900830446
Training loss per 100 training steps: 0.26043745762302506
Training loss per 100 training steps: 0.20875351297734088
Training loss per 100 training steps: 0.18239561040442148
Training loss per 100 training steps: 0.16085621369482128
Training loss per 100 training steps: 0.14644514254221502
Training loss per 100 training steps: 0.13562645023547845
Training loss per 100 training steps: 0.12526230064375365
Training loss per 100 training steps: 0.11714277927816477
Training loss per 100 training steps: 0.11137009966843106
Training loss per 100 training steps: 0.10535901875692343
Training loss per 100 training steps: 0.10044827782031347
Training loss per 100 training steps: 0.09668423340369164
Training loss per 100 training steps: 0.09316675889936955
Training loss per 100 training steps: 0.09001713857508582
Training loss per 100 training steps: 0.0867760267793936

In [39]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs.loss, outputs.logits
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
            active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(targets)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy
    
    #print(eval_labels)
    #print(eval_preds)

    labels = [id2label[id.item()] for id in eval_labels]
    predictions = [id2label[id.item()] for id in eval_preds]

    #print(labels)
    #print(predictions)
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions

In [40]:
labels, predictions = valid(model, testing_loader)


Validation loss per 100 evaluation steps: 0.0006900187581777573
Validation loss per 100 evaluation steps: 0.026479551454301174
Validation loss per 100 evaluation steps: 0.025055006116784156
Validation loss per 100 evaluation steps: 0.025769941882309435
Validation loss per 100 evaluation steps: 0.02606673402014781
Validation loss per 100 evaluation steps: 0.025266974010295536
Validation loss per 100 evaluation steps: 0.02546787863304444
Validation loss per 100 evaluation steps: 0.025639767960001985
Validation loss per 100 evaluation steps: 0.025875212580350228
Validation loss per 100 evaluation steps: 0.025828052442327692
Validation loss per 100 evaluation steps: 0.025343479142897272
Validation loss per 100 evaluation steps: 0.02491398005335309
Validation loss per 100 evaluation steps: 0.02464526916829511
Validation loss per 100 evaluation steps: 0.02482229084847723
Validation loss per 100 evaluation steps: 0.024937600665665523
Validation loss per 100 evaluation steps: 0.025000778285623

In [41]:
from seqeval.metrics import classification_report

print(classification_report([labels], [predictions]))

              precision    recall  f1-score   support

         geo       0.81      0.90      0.85     11331
         gpe       0.95      0.93      0.94      3380
         org       0.73      0.59      0.66      6642
         per       0.77      0.80      0.79      5366
         tim       0.88      0.81      0.84      4439

   micro avg       0.81      0.81      0.81     31158
   macro avg       0.83      0.81      0.81     31158
weighted avg       0.81      0.81      0.81     31158



### Comparing the Untrained and Fine-Tuned Models

In [42]:
Untrained_model = BertForTokenClassification.from_pretrained('bert-base-uncased', 
                                                   num_labels=len(id2label),
                                                   id2label=id2label,
                                                   label2id=label2id)
Untrained_model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [43]:
from transformers import pipeline

pipe = pipeline(task="token-classification", model=Untrained_model.to(device), tokenizer=tokenizer, aggregation_strategy="simple")
pipe("My name is Niels and New York is a city")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity_group': 'geo',
  'score': 0.13686256,
  'word': 'my',
  'start': None,
  'end': None},
 {'entity_group': 'tim',
  'score': 0.12575148,
  'word': 'name',
  'start': None,
  'end': None},
 {'entity_group': 'geo',
  'score': 0.13910821,
  'word': 'is',
  'start': None,
  'end': None},
 {'entity_group': 'tim',
  'score': 0.17786944,
  'word': 'niels and',
  'start': None,
  'end': None},
 {'entity_group': 'geo',
  'score': 0.13867724,
  'word': 'new',
  'start': None,
  'end': None},
 {'entity_group': 'tim',
  'score': 0.14464562,
  'word': 'york',
  'start': None,
  'end': None},
 {'entity_group': 'tim',
  'score': 0.21768254,
  'word': 'is',
  'start': None,
  'end': None},
 {'entity_group': 'tim',
  'score': 0.12822677,
  'word': 'a',
  'start': None,
  'end': None},
 {'entity_group': 'gpe',
  'score': 0.15759417,
  'word': 'city',
  'start': None,
  'end': None}]

In [44]:
pipe = pipeline(task="token-classification", model=model.to(device), tokenizer=tokenizer, aggregation_strategy="simple")
pipe("My name is Niels and New York is a city")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity_group': 'per',
  'score': 0.673212,
  'word': 'ni',
  'start': None,
  'end': None},
 {'entity_group': 'per',
  'score': 0.70676476,
  'word': '##els',
  'start': None,
  'end': None},
 {'entity_group': 'geo',
  'score': 0.9735999,
  'word': 'new york',
  'start': None,
  'end': None}]