# **Textual Data Analysis - Exercise - 5**


---


## **Name: Ayesha Zafar**
## **Date: 04/02/2025**


---

### NER inference using a sequence labeling model

In this exercise, your task is to extract named entities from the Finnish/English news data collection using a fine-tuned sequence labeling model, investigate its predictions, and calculate simple NE statistics.

The Finnish/English news data collection is available here: http://dl.turkunlp.org/TKO_8964_2023/news-*.jsonl.

If you do the exercise using Finnish data, the suggested fine-tuned model is https://huggingface.co/Kansallisarkisto/finbert-ner. For English, there are many options, but e.g. https://huggingface.co/dslim/bert-base-NER is a reasonable choice.

The specific tasks are:

1) Read the model page to figure out which datasets were used to train the model, and which entities the model includes.

2) Run inference on the news data, and verify whether the model produces invalid label sequences (hint: it does if you run it on a sufficient amount of data). Here you do not need to take the subword-to-token mapping into account; you can directly check the label sequence of the subwords (raw predictions). Print statistics for the most common invalid transitions. Hint: If you run the inference using a pipeline, it may hide some of the predictions from you. Set the pipeline parameters so that you get access to the raw predictions.

3) Read about the `aggregation_strategy` parameter for token classification pipelines (sometimes the source code is the best place to get information...). Based on your reading, select a suitable value (or, if you run the inference without pipelines, write a simple function that implements some simple aggregation strategy), run the inference, and collect the predicted named entities. What is the most common entity type in your data, and what are the most common entities?

It's totally fine to downsample the data, e.g. 50 documents is more than enough and can easily be processed on CPU. With a GPU runtime, one can process a substantial amount of data.

---



Step 1. Installing required libraries

In [15]:
!pip install transformers datasets jsonlines



Step 2. Importing necessary libraries

In [30]:
import json
import jsonlines
import requests
import pandas as pd
from collections import Counter
from transformers import pipeline

Step 3. Fetching the English news dataset

In [31]:
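url = "http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl"
response = requests.get(url, timeout=60)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
data = response.text.splitlines()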

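Step 4. Parsing JSON lines with a check for malformed lines and printing the first document

In [32]:
documents = []
for line in data:
    # Skip blank or malformed lines instead of aborting the whole run
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue
    # Keep only records that actually carry a text field
    if isinstance(obj, dict) and "text" in obj:
        documents.append(obj["text"])

# Downsample: 50 documents are enough for a CPU run
documents = documents[:50]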

print(f"Total documents: {len(documents)}")
print("First document:")
print(documents[0])

Total documents: 50
First document:
Finland's government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon. 
 "There are still many open questions that need to be answered. At this point, it is impossible to promise that the pass will come or when it will come," Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting. 
 "The government has given the green light to the Covid pass and preparations will continue," Marin added. 
 Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that there was broad agreement between the coalition parties over the need for the certificate. 
 "It [the pass] is an important tool so that we will not need restrictions any more," Lintilä said. 
 The government also decided at Thursday afternoon's meeting to offer coronavirus vaccines to all 12- to 15-year-olds, starting as early as nex

Step 5. Loading NER pipeline with aggregation disabled to get raw predictions

In [33]:
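# aggregation_strategy="none" returns one prediction per subword.
# The pipeline still drops "O" predictions by default (ignore_labels=["O"]);
# pass ignore_labels=[] to also receive the "O" subwords.
ner_pipeline = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="none",
)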

raw_predictions = ner_pipeline(documents)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
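
With the default settings, the token-classification pipeline filters out predictions labelled `O` (its `ignore_labels` parameter defaults to `["O"]`), so the raw output above contains only entity-tagged subwords. The following is a minimal sketch of one way to recover a gap-free label sequence afterwards. It assumes each prediction dict carries the `index` (subword position) and `entity` keys that the pipeline emits with `aggregation_strategy="none"`. The helper name `fill_missing_with_o` and the example data are hypothetical, not taken from the run above.

```python
def fill_missing_with_o(doc_preds):
    """Rebuild a gap-free label sequence from filtered pipeline output.

    Assumes every dict has an integer "index" (subword position) and an
    "entity" label. Positions missing between the first and the last
    reported subword are treated as "O".
    """
    if not doc_preds:
        return []
    by_index = {pred["index"]: pred["entity"] for pred in doc_preds}
    first, last = min(by_index), max(by_index)
    return [by_index.get(i, "O") for i in range(first, last + 1)]


# Hand-made example in the shape of the pipeline output (not real predictions)
example = [
    {"index": 3, "entity": "B-PER"},
    {"index": 4, "entity": "I-PER"},
    {"index": 7, "entity": "I-ORG"},
]
full_labels = fill_missing_with_o(example)
print(full_labels)
```

The sequence only spans the range between the first and the last entity-tagged subword, which is enough for checking transitions. Constructing the pipeline with `ignore_labels=[]` should make this reconstruction unnecessary.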


Step 6. Checking for invalid label sequences and printing most common invalid transitions

In [34]:
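invalid_transitions = []

for doc_preds in raw_predictions:
    # The pipeline drops "O" predictions by default (ignore_labels=["O"]), so
    # `labels` holds only entity-tagged subwords, and neighbours in this list
    # are not necessarily adjacent in the text.
    labels = [pred["entity"] for pred in doc_preds]
    for i in range(1, len(labels)):
        prev_label, curr_label = labels[i - 1], labels[i]

        # O -> I-X: an entity continues without having begun
        if prev_label == "O" and curr_label.startswith("I-"):
            invalid_transitions.append((prev_label, curr_label))
        # B-X -> I-Y: the continuation type differs from the type that was opened
        elif prev_label.startswith("B-") and curr_label.startswith("I-"):
            prev_entity = prev_label.split("-", 1)[1]
            curr_entity = curr_label.split("-", 1)[1]
            if prev_entity != curr_entity:
                invalid_transitions.append((prev_label, curr_label))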

transition_counter = Counter(invalid_transitions)
print("Most common invalid transitions:")
for transition, count in transition_counter.most_common():
    print(f"{transition}: {count}")

Most common invalid transitions:
('B-MISC', 'I-ORG'): 6
('B-PER', 'I-ORG'): 4
('B-ORG', 'I-LOC'): 2
('B-LOC', 'I-ORG'): 1
('B-MISC', 'I-LOC'): 1
('B-ORG', 'I-PER'): 1
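
The check above covers two patterns: `O` followed by `I-X`, and `B-X` followed by `I-Y`. Under the BIO (IOB2) convention, an `I-X` label is valid only directly after `B-X` or `I-X`, so `I-X` followed by `I-Y` and an `I-X` at the very start of a document are invalid as well. Below is a minimal sketch of a more complete check, assuming the model's labels follow IOB2 and that the label list still contains the `O` labels. The function name `invalid_bio_transitions` and the example sequence are hypothetical.

```python
from collections import Counter


def invalid_bio_transitions(labels):
    """Return (previous, current) label pairs that violate the BIO scheme.

    An I-X label is valid only directly after B-X or I-X. The start of the
    sequence is treated as if it were preceded by "O".
    """
    invalid = []
    prev = "O"
    for curr in labels:
        if curr.startswith("I-"):
            curr_type = curr.split("-", 1)[1]
            # "O" has no entity type, so any I-X after it is invalid
            prev_type = prev.split("-", 1)[1] if "-" in prev else None
            if prev_type != curr_type:
                invalid.append((prev, curr))
        prev = curr
    return invalid


# Hand-made label sequence (not model output)
labels = ["O", "I-ORG", "B-PER", "I-PER", "I-LOC", "O", "B-MISC", "I-ORG"]
counts = Counter(invalid_bio_transitions(labels))
for transition, count in counts.most_common():
    print(f"{transition}: {count}")
```

Tracking only the previous label keeps the check to a single pass, and the same function can be applied to each document's label list before merging the counts.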


Step 7. Reloading pipeline with aggregation strategy as 'simple'

In [35]:
ner_pipeline = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",
)

aggregated_predictions = ner_pipeline(documents)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
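
For readers running inference without a pipeline, the grouping idea behind a simple aggregation strategy can be sketched in plain Python. This is a rough approximation written for illustration, not the library's implementation: it assumes per-subword predictions with `entity` and `word` keys, labels that are either `O` or of the form `B-X`/`I-X`, and WordPiece-style `##` continuation markers. The function name `aggregate_simple` and the example subwords are hypothetical.

```python
def aggregate_simple(preds):
    """Group consecutive subword predictions into entity spans.

    A new span starts at every B- label or whenever the entity type changes;
    an "O" label closes the current span. Pieces starting with "##" are glued
    to the preceding piece, other pieces are joined with a space.
    """
    spans = []
    current = None  # span being built: {"entity_group": str, "pieces": [str]}
    for pred in preds:
        label = pred["entity"]
        if label == "O":
            current = None
            continue
        prefix, ent_type = label.split("-", 1)
        if current is None or prefix == "B" or current["entity_group"] != ent_type:
            current = {"entity_group": ent_type, "pieces": []}
            spans.append(current)
        current["pieces"].append(pred["word"])

    results = []
    for span in spans:
        text = ""
        for piece in span["pieces"]:
            if piece.startswith("##"):
                text += piece[2:]
            else:
                text += (" " if text else "") + piece
        results.append({"entity_group": span["entity_group"], "word": text})
    return results


# Hand-made subword predictions (not real tokenizer or model output)
example = [
    {"entity": "B-PER", "word": "Sanna"},
    {"entity": "I-PER", "word": "Marin"},
    {"entity": "O", "word": "visited"},
    {"entity": "B-LOC", "word": "Hel"},
    {"entity": "I-LOC", "word": "##sin"},
    {"entity": "I-LOC", "word": "##ki"},
]
aggregated = aggregate_simple(example)
print(aggregated)
```

Because the grouping follows the predicted labels only, a word whose subwords receive different entity types is split into several spans, which is one way fragments can end up in the entity counts.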


Step 8. Collecting predicted named entities and printing most frequent types and entities

In [36]:
entities = []
for doc_preds in aggregated_predictions:
    for pred in doc_preds:
        entities.append((pred["entity_group"], pred["word"]))

entity_counter = Counter([entity[0] for entity in entities])
print("Most common entity types:")
for entity_type, count in entity_counter.most_common():
    print(f"{entity_type}: {count}")

entity_word_counter = Counter([entity[1] for entity in entities])
print("\nMost common entities:")
for entity_word, count in entity_word_counter.most_common(10):
    print(f"{entity_word}: {count}")

Most common entity types:
ORG: 358
PER: 324
LOC: 291
MISC: 144

Most common entities:
Finland: 93
Finnish: 34
Helsinki: 29
Covid: 19
Yle: 14
Y: 13
Afghanistan: 12
Afghan: 11
A: 10
V: 10
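
The single-letter entries in the list above (`Y`, `A`, `V`) appear to be subword fragments rather than complete names, and the overall ranking mixes all entity types. The following is a minimal sketch of a per-type ranking that skips probable fragments. It assumes the `(entity_type, word)` pairs collected in Step 8. The function name `top_entities_by_type`, its `min_length` threshold, and the example pairs are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def top_entities_by_type(entities, n=3, min_length=2):
    """Return the n most frequent entity strings for each entity type.

    `entities` is a list of (entity_type, word) pairs. Words shorter than
    `min_length` characters or starting with "##" are skipped as probable
    subword fragments.
    """
    per_type = defaultdict(Counter)
    for entity_type, word in entities:
        if len(word) < min_length or word.startswith("##"):
            continue
        per_type[entity_type][word] += 1
    return {etype: counter.most_common(n) for etype, counter in per_type.items()}


# Hand-made pairs in the shape of the `entities` list (not the real counts)
example = [
    ("LOC", "Finland"),
    ("LOC", "Finland"),
    ("LOC", "Helsinki"),
    ("ORG", "Yle"),
    ("ORG", "Y"),
    ("PER", "Marin"),
    ("MISC", "##vid"),
]
top = top_entities_by_type(example)
for entity_type, ranked in top.items():
    print(entity_type, ranked)
```

A length filter is a blunt heuristic that would also drop genuine one-character names. The word-level aggregation strategies of the pipeline (`first`, `average`, `max`) are an alternative worth comparing, since they assign a single label per word.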
