# NER Classification

### Method used: fine-tuned transformer model (BERT)

Group 37 - Text Mining Course 2024 | VU Universiteit Amsterdam


-------------

First, we import the required libraries.

In [1]:
import pandas as pd
from sklearn.metrics import classification_report
from simpletransformers.ner import NERModel



We load the fine-tuned model using simpletransformers.

In [4]:
model = NERModel(
    model_type="bert",
    model_name="dslim/bert-base-NER",
    use_cuda=False,
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Now we load the test set.

In [5]:
test = pd.read_csv('NER-test.tsv', sep='\t')

Let's see what the test set looks like.

In [6]:
test

Unnamed: 0,sentence id,token id,token,BIO NER tag
0,0,0,I,O
1,0,1,would,O
2,0,2,n't,O
3,0,3,be,O
4,0,4,caught,O
...,...,...,...,...
188,9,10,HOOOKED,O
189,9,11,from,O
190,9,12,the,O
191,9,13,beginning,O


We drop the columns that are not needed for the evaluation.

In [7]:
test.drop(columns=['sentence id', 'token id'], inplace=True)


We use the model to predict the NER tags

In [8]:
predictions, raw_outputs = model.predict(test['token'])

100%|██████████| 1/1 [00:02<00:00,  2.66s/it]
Running Prediction: 100%|██████████| 2/2 [00:07<00:00,  3.54s/it]
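Note that `test['token']` holds one token per row, so each token appears to be handed to the model as its own one-word input, without any sentence context. A minimal sketch of grouping the tokens by sentence first is shown below. The rows are hypothetical and only mimic the `sentence id`, `token id` and `token` columns of the TSV, and `group_tokens_by_sentence` is an illustrative helper, not part of any library. In the notebook, this grouping would have to happen before the `sentence id` column is dropped.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows mimicking the TSV columns: (sentence id, token id, token).
rows = [
    (0, 0, "I"),
    (0, 1, "would"),
    (0, 2, "n't"),
    (1, 0, "Great"),
    (1, 1, "book"),
]


def group_tokens_by_sentence(rows):
    """Return one list of tokens per sentence id, preserving row order."""
    # groupby only merges consecutive rows, which fits a file sorted by sentence id.
    return [
        [token for _, _, token in group]
        for _, group in groupby(rows, key=itemgetter(0))
    ]


sentences = group_tokens_by_sentence(rows)
print(sentences)
```

The resulting list of token lists could then be passed as `model.predict(sentences, split_on_space=False)`, the form simpletransformers accepts for pre-tokenized input. Doing so would likely change the predictions and therefore the scores reported in this notebook.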


Now we arrange the predictions into a DataFrame with the same columns as the test set. This lets us run a classification report against the gold tags and analyze the model's performance.

In [9]:
tokens = []
tags = []

# Extract tokens and tags
for token_list in predictions:
    for token_dict in token_list:
        for token, tag in token_dict.items():
            tokens.append(token)
            tags.append(tag)

# Create DataFrame
df_predictions = pd.DataFrame({
    'token': tokens,
    'BIO NER tag': tags
})

df_predictions

Unnamed: 0,token,BIO NER tag
0,I,O
1,would,O
2,n't,O
3,be,O
4,caught,O
...,...,...
188,HOOOKED,O
189,from,O
190,the,O
191,beginning,O
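The classification report compares the two tag columns row by row, so it is only meaningful if the predicted tokens line up with the gold tokens. A small sanity check along these lines can catch a shifted or re-tokenized sequence before scoring. `first_mismatch` is a hypothetical helper written for this illustration, and the token lists are toy data.

```python
def first_mismatch(gold_tokens, predicted_tokens):
    """Return the index of the first position where the sequences differ, or None."""
    for i, (gold, pred) in enumerate(zip(gold_tokens, predicted_tokens)):
        if gold != pred:
            return i
    # zip stops at the shorter sequence, so a length difference is a mismatch too.
    if len(gold_tokens) != len(predicted_tokens):
        return min(len(gold_tokens), len(predicted_tokens))
    return None


gold_tokens = ["I", "would", "n't", "be", "caught"]
print(first_mismatch(gold_tokens, ["I", "would", "n't", "be", "caught"]))
print(first_mismatch(gold_tokens, ["I", "wouldn't", "be", "caught"]))
```

In the notebook, the same check could be run as `first_mismatch(test['token'].tolist(), df_predictions['token'].tolist())`.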


Now we generate the report

In [10]:
report = classification_report(test['BIO NER tag'], df_predictions['BIO NER tag'])
print(report)

               precision    recall  f1-score   support

       B-DATE       0.00      0.00      0.00         1
        B-LOC       0.00      0.00      0.00         0
       B-MISC       0.00      0.00      0.00         0
        B-ORG       0.50      0.67      0.57         3
        B-PER       0.33      1.00      0.50         3
     B-PERSON       0.00      0.00      0.00         3
B-WORK_OF_ART       0.00      0.00      0.00         4
       I-DATE       0.00      0.00      0.00         1
        I-ORG       0.00      0.00      0.00         6
        I-PER       0.00      0.00      0.00         1
     I-PERSON       0.00      0.00      0.00         2
I-WORK_OF_ART       0.00      0.00      0.00         9
            O       0.90      0.99      0.94       160

     accuracy                           0.85       193
    macro avg       0.13      0.20      0.16       193
 weighted avg       0.76      0.85      0.80       193
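The per-label rows follow the usual token-level definitions: precision is the share of predicted tags of a label that are correct, recall is the share of gold tags of a label that were found, and F1 is their harmonic mean. The sketch below recomputes these quantities in plain Python on toy data, as an illustration of how such a table arises rather than a replacement for scikit-learn. `per_label_scores` is a hypothetical helper, and undefined ratios are set to 0.0 here by assumption.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Compute token-level precision, recall and F1 for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was wrong
            fn[g] += 1  # the gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        # Assumption: ratios with a zero denominator are reported as 0.0.
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy data: the gold tag B-PERSON and the predicted tag B-PER never match.
gold = ["O", "B-PERSON", "O", "B-ORG", "O"]
predicted = ["O", "B-PER", "O", "B-ORG", "O"]
scores = per_label_scores(gold, predicted)
for label, (p, r, f) in scores.items():
    print(f"{label:>10}  P={p:.2f}  R={r:.2f}  F1={f:.2f}")
```

The same arithmetic explains the `B-PER` row of the report: a precision of 0.33 and a recall of 1.00 give an F1 of 2 × 0.33 × 1.00 / 1.33 ≈ 0.50.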



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
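The report lists both `B-PER` and `B-PERSON`, as well as labels such as `B-WORK_OF_ART` and `B-DATE` that are never predicted correctly. This suggests that the gold annotations and the model do not share one tag inventory: `dslim/bert-base-NER` predicts the CoNLL-2003 types PER, ORG, LOC and MISC. One possible way to make the comparison fairer is to harmonize the entity types before scoring. The sketch below assumes a mapping from `PERSON` to `PER` only; `TYPE_MAP` and `harmonize` are hypothetical names, and which other types to merge or leave out is a design decision the notebook does not settle.

```python
# Hypothetical mapping from gold entity types to the model's tag set.
# Only PERSON -> PER is suggested by the report; anything else is a design choice.
TYPE_MAP = {"PERSON": "PER"}


def harmonize(tag, type_map=TYPE_MAP):
    """Rewrite the entity type of a BIO tag, keeping the B-/I- prefix."""
    if tag == "O":
        return tag
    prefix, _, entity_type = tag.partition("-")
    return f"{prefix}-{type_map.get(entity_type, entity_type)}"


gold = ["O", "B-PERSON", "I-PERSON", "B-WORK_OF_ART", "B-ORG"]
harmonized = [harmonize(tag) for tag in gold]
print(harmonized)
```

In the notebook, this could be applied to the gold column, for example with `test['BIO NER tag'].map(harmonize)`, before calling `classification_report`.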
