Submission for<br>
Exercise task 5<br>
of UTU course TKO_8964-3006<br>
Textual Data Analysis<br>
by Botond Ortutay<br>

---

**Instructions:**

NER inference using a sequence labeling model

In this exercise, your task is to extract named entities from the Finnish/English news data collection using fine-tuned sequence labeling model, investigate its predictions, and calculate simple NE statistics.

The Finnish/English news data collection is available here: http://dl.turkunlp.org/TKO_8964_2023/news-*.jsonl.

If you do the exercise using Finnish data, the suggested fine-tuned model is https://huggingface.co/Kansallisarkisto/finbert-ner. For English, there are many options, but e.g. https://huggingface.co/dslim/bert-base-NER is a reasonable choice.

The specific tasks are:

1) Read the model page to figure out which datasets were used to train the model, and which entities the model includes.

2) Run inference on the news data, and verify whether the model produces invalid label sequences (hint: it does if you run on some amount of data). Here you do not need to take into account subwords to tokens -mapping, but you can directly check the label sequence of subwords (raw predictions). Print statistics for the most common invalid transitions. Hint: If you run the inference using pipeline, it may hide some of the predictions from you. Set the pipeline parameters so that you get access to raw predictions.

3) Read about the `aggregation_strategy` parameter for token classification pipelines (sometimes source code is the best place to get information...). Based on your reading, select a suitable parameter (or in case you run the inference without using pipelines, write a simple function to implement some simple aggregation strategy), run the inference, and collect predicted named entities. What is the most common entity type in your data and what are the most common entities?

It's totally fine to downsample the data, e.g. 50 documents is more than enough and can be easily done on CPU. With GPU runtime, one can run substantial amount of data.

---

**Library & environment tomfoolery:**

<mark>**NOTE:**</mark> This notebook assumes that `jsonl`, a fairly uncommon library, is installed in the Python environment.<br> (Installation: `pip3 install py-jsonl`)

In [1]:
# Libraries
import copy                         # For deep copying
import jsonl                        # For loading & handling jsonl files
import random                       # For random prints
import torch                        # For GPU compatibility
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# checking GPU availability
print(torch.cuda.is_available())    # should output True



True


**Downloading, exploring & importing data:**

Downloading (using command line tools):

In [2]:
#NOTE: bash code here
!echo Downloading data from TurkuNLP
!echo
!wget http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl
!echo
!echo Printing all files in current directory to check data has downloaded
!echo
!ls

Downloading data from TurkuNLP



7[1A[1G[27G[Files: 0  Bytes: 0  [0 B/s] Re]87[2A[1G[27G[http://dl.turkunlp.org/TKO_896]87[1S[3A[1G[0JSaving 'news-en-2021.jsonl'
Printing all files in current directory to check data has downloaded

empty.ipynb	      exercise_task_3.ipynb  news-en-2021.jsonl
exercise_task_2.md    exercise_task_3.pdf    short_en_data.txt.utf-8
exercise_task_2.pdf   exercise_task_5.ipynb
exercise_task_3.html  fin_data.txt.utf-8


Exploring dataset to get a better idea of what we're dealing with (using command line tools):

In [3]:
#NOTE: bash code here
# Printing a random sample from the dataset to look what kind of data is there
!echo A random json object from \"news-en-2021.jsonl\":
!shuf -n 1 news-en-2021.jsonl
!echo
!echo ---
!echo
# Determining amount of json objects in news-en-2021.jsonl
!echo Amount of json objects \(lines\) in \"news-en-2021.jsonl\":
!<news-en-2021.jsonl wc -l 

A random json object from "news-en-2021.jsonl":

---

Amount of json objects (lines) in "news-en-2021.jsonl":
1059


Importing to Python:

In [4]:
newsEnIterator = jsonl.load("news-en-2021.jsonl")
# newsEnIterator is now a generator object (https://wiki.python.org/moin/Generators).
#
# What you need to know to follow the rest of this code:
#  - A jsonl file stores several json objects in one file, one per line.
#  - news-en-2021.jsonl has 1059 json objects (see the wc -l output above).
#  - newsEnIterator yields these objects one by one, much like iterators in C++
#    (https://www.w3schools.com/cpp/cpp_iterators.asp), which I have worked with before.
#  - Each call to next(newsEnIterator) moves the iterator forward by one object.
#    This works 1059 times; after that, next() raises the StopIteration exception.
#  - print(next(newsEnIterator)) prints the next object.
#
# To demonstrate, I print the very first json object with Python and then with Bash.
# Afterwards I reload the file, because a generator cannot be moved backwards.
print(next(newsEnIterator))                          # printing first json object in Python
newsEnIterator = jsonl.load("news-en-2021.jsonl")    # reloading jsonfile

{'summary': 'The decisions follow a meeting of government ministers at the House of the Estates on Thursday afternoon.', 'tags': ['Kotimaan uutiset'], 'text': 'Finland\'s government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon. \n "There are still many open questions that need to be answered. At this point, it is impossible to promise that the pass will come or when it will come," Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting. \n "The government has given the green light to the Covid pass and preparations will continue," Marin added. \n Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that there was broad agreement between the coalition parties over the need for the certificate. \n "It [the pass] is an important tool so that we will not need restrictions any more," Lintilä said. \n The governme

In [5]:
#NOTE: bash code here
# Using bash to print the first json object
!head -1 news-en-2021.jsonl

{"summary": "The decisions follow a meeting of government ministers at the House of the Estates on Thursday afternoon.", "tags": ["Kotimaan uutiset"], "text": "Finland's government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon. \n \"There are still many open questions that need to be answered. At this point, it is impossible to promise that the pass will come or when it will come,\" Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting. \n \"The government has given the green light to the Covid pass and preparations will continue,\" Marin added. \n Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that there was broad agreement between the coalition parties over the need for the certificate. \n \"It [the pass] is an important tool so that we will not need restrictions any more,\" Lintilä said. \n The gov

As one can see, both methods accessed the same json object. This demonstrates the `jsonl` library and its rather old-school way of iterating through a jsonl file. I'll use this iterator to loop through the jsonl file while performing classifications.
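
For readers without the `jsonl` package, the same one-object-at-a-time iteration can be sketched with the standard library alone. This is only a minimal sketch and not what this notebook uses: `iter_jsonl` is a hypothetical helper name, and it assumes one JSON object per line, as in `news-en-2021.jsonl`.

```python
import json

def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:                      # skip blank lines
                yield json.loads(line)

# Usage (assumes the file downloaded above is in the current directory):
# for newsObject in iter_jsonl("news-en-2021.jsonl"):
#     print(newsObject["text"][:80])
```

Because the helper is a generator function, it supports both `next()` and a plain `for` loop, and like `newsEnIterator` it has to be recreated to start again from the first object.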

**Functions:**

In [6]:
def pipePrint(sentence, pipe):
    """
    Runs sentence through pipeline and prints relevant data
    ---
    In:
    sentence    str                      sentence run through pipeline
    pipe        transformers.pipeline    pipeline to run sentence through    tested with "dslim/bert-base-NER" (model & tokenizer); not guaranteed to work elsewhere
    """
    print(sentence + ":")
    for entity in pipe(sentence):
        # With an aggregation strategy the label is stored under "entity_group";
        # with aggregation_strategy="none" it is stored under "entity"
        label = entity.get("entity_group", entity.get("entity"))
        # Only printing whatever I'm interested in
        print(f"Found: {entity['word']}, category: {label}, score:{entity['score']:.4f}")
    print()

---

**Task 1.** Read the model page to figure out which datasets were used to train the model, and which entities the model includes.

**Written answer:** I am going to use the [bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model and work on the English-language news items, as suggested by the exercise instructions.

According to [its Hugging Face page](https://huggingface.co/dslim/bert-base-NER), the model was trained on the [CoNLL-2003 Named Entity Recognition](https://aclanthology.org/W03-0419.pdf) dataset and can recognize four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).

---

**Task 2.** Run inference on the news data, and verify whether the model produces invalid label sequences (hint: it does if you run on some amount of data). Here you do not need to take into account subwords to tokens -mapping, but you can directly check the label sequence of subwords (raw predictions). Print statistics for the most common invalid transitions. Hint: If you run the inference using pipeline, it may hide some of the predictions from you. Set the pipeline parameters so that you get access to raw predictions.

In [8]:
# At this point the data has already been downloaded at the "Downloading, exploring & importing data" -stage
# Data accessible via newsEnIterator

# Pipeline setup
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nerPipe = pipeline("ner", model, tokenizer=tokenizer, device=0, aggregation_strategy="simple")
# device=0 -> use gpu
#
# aggregation_strategy="simple"    This is relevant for task 3!
#
# output format for nerPipe(str): list of {'entity_group': str, 'score': float, 'word': str, 'start': int, 'end': int}
print("")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0





In [9]:
# Testing with a few examples
"""
Sentence one = "Maria studies Italian culture at the Columbia University in New York City"
This includes:
LOC: New York City
ORG: Columbia University
PER: Maria
MISC: Italian    

(It was very hard to come up with a MISC word, so I looked at the original CoNLL paper (https://aclanthology.org/W03-0419.pdf) and they used this as an example)
"""
pipePrint("Maria studies Italian culture at the Columbia University in New York City", nerPipe)

"""
Sentence two = "One of my classmates, Toby, copied this video and sent it to our class's WhatsApp group, claiming he made it himself."
This includes:
PER: Toby
    
WhatsApp is either ORG or MISC
"""
pipePrint("One of my classmates, Toby, copied this video and sent it to our class's WhatsApp group, claiming he made it himself.", nerPipe)

Maria studies Italian culture at the Columbia University in New York City:
Found: Maria, category: PER, score:0.9958
Found: Italian, category: MISC, score:0.9997
Found: Columbia University, category: ORG, score:0.9979
Found: New York City, category: LOC, score:0.9994

One of my classmates, Toby, copied this video and sent it to our class's WhatsApp group, claiming he made it himself.:
Found: Toby, category: PER, score:0.9990
Found: WhatsApp, category: ORG, score:0.9694



At this point my pipeline appears to work and correctly predict everything. Let's test it on the news data.

In [10]:
newsItem = next(newsEnIterator)["text"]
for sentence in newsItem.split("."):
    pipePrint(sentence,nerPipe)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Finland's government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon:
Found: Finland, category: LOC, score:0.9998
Found: Covid, category: MISC, score:0.9104
Found: House of the Estates, category: ORG, score:0.9978
Found: Helsinki, category: LOC, score:0.9994

 
 "There are still many open questions that need to be answered:

 At this point, it is impossible to promise that the pass will come or when it will come," Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting:
Found: Sanna Marin, category: PER, score:0.8838
Found: SDP, category: ORG, score:0.9977

 
 "The government has given the green light to the Covid pass and preparations will continue," Marin added:
Found: Covid, category: MISC, score:0.8608
Found: Marin, category: PER, score:0.9995

 
 Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that th

Having run a single news item through the pipeline and analyzed the output by hand, I made the following observations:<br>
<br>
Firstly: there are no visible errors, at least not of the kind that would break the system. The instructions asked me to track "invalid transitions". I'm not sure what this means or how to collect these, but at least based on this single article it probably won't be with `try`/`except`...<br>
<br>
Secondly: there was one odd mistake. It relates to this piece of output:
```
Minister of Economic Affairs  Mika Lintilä  (Cen) told reporters immediately after the meeting that there was broad agreement between the coalition parties over the need for the certificate:
Found: Economic, category: ORG, score:0.6392
Found: Mika Lintilä, category: PER, score:0.9744
Found: Cen, category: ORG, score:0.9911

```
Here "Economic" is recognized as ORG, which it isn't. It is part of the job title "Minister of Economic Affairs". Whether job titles should count as named entities in a system like this is a valid question. I found a (non-scientific) [article](https://medium.com/@ziprecruiter.engineering/named-entity-recognition-ner-of-short-unstructured-job-search-queries-6b265ec0fb) where job titles are recognized as named entities, but that NER application was built specifically for analyzing job search queries, so job titles are more relevant there. Even if we decide that job titles count here, the correct categorization would be `Minister of Economic Affairs: MISC`. Furthermore, "Prime Minister" should then also be categorized as MISC, which it clearly isn't. So I'll call this a false positive.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
However, I'm unsure how to spot these kinds of mistakes automatically when analyzing, say, all the news in `news-en-2021.jsonl`. One option would be to look at the confidence score and save, say, everything below 0.80 separately for analysis by hand, but I decided not to do that here.
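
The confidence-score idea could look roughly like the sketch below. It was not run on the news data: `split_by_confidence` is a hypothetical helper, the 0.80 threshold is the arbitrary value mentioned above, and the example entities are copied from the output discussed in this section.

```python
def split_by_confidence(entities, threshold=0.80):
    """Split pipeline entities into (confident, uncertain) by their score.

    Hypothetical helper; `entities` is a list of dicts with a "score" key,
    as returned by the NER pipeline.
    """
    confident = [e for e in entities if e["score"] >= threshold]
    uncertain = [e for e in entities if e["score"] < threshold]
    return confident, uncertain

# Entities taken from the "Minister of Economic Affairs" sentence
example = [
    {"word": "Economic", "entity_group": "ORG", "score": 0.6392},
    {"word": "Mika Lintilä", "entity_group": "PER", "score": 0.9744},
    {"word": "Cen", "entity_group": "ORG", "score": 0.9911},
]
confident, uncertain = split_by_confidence(example)
print([e["word"] for e in uncertain])   # only "Economic" falls below 0.80
```

A threshold like this would flag the "Economic" false positive for manual review, but it cannot catch mistakes the model makes with high confidence.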

Either way: my pipeline has been shown to work with news items mined from the jsonl no problem. So now: let's run through them all and save the generated results somewhere!

In [11]:
newsEnIterator = jsonl.load("news-en-2021.jsonl")           # Iterator reset
data = []                                                   # One {"newsItem": str, "entities": list} dict per article

for newsObject in newsEnIterator:                           # Ends automatically when the iterator runs out of data
    newsItem = newsObject["text"]

    # Performing NER for current news article
    minedEntities = []
    for sentence in newsItem.split("."):                    # Naive sentence split on every full stop
        for entity in nerPipe(sentence):                    # Every time an entity is found, save it to mined
            entity["score"] = float(entity["score"])        # To get rid of: Type Error: Object of type float32 is not JSON serializable; The original datatype of entity["score"] was numpy.float32
            minedEntities.append(entity)

    # Saving the news item together with all entities mined from it
    data.append({"newsItem": newsItem, "entities": minedEntities})
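
The loop above stores aggregated entities, which hides the raw `B-`/`I-` labels that the "invalid label sequences" in Task 2 presumably refer to. One common reading, assumed here, is the IOB2 scheme: an `I-X` label is only valid directly after `B-X` or `I-X` of the same type. The sketch below counts violations of that rule in a plain list of labels. `invalid_transitions` is a hypothetical helper and has not been run on the news data.

```python
from collections import Counter

def invalid_transitions(labels):
    """Count invalid IOB2 transitions in one label sequence (hypothetical helper).

    A transition (previous, current) is counted as invalid when `current`
    is "I-X" but `previous` is neither "B-X" nor "I-X". The start of the
    sequence is treated like an "O" label.
    """
    counts = Counter()
    prev = "O"
    for cur in labels:
        if cur.startswith("I-"):
            prev_type = prev[2:] if prev != "O" else None
            if prev_type != cur[2:]:
                counts[(prev, cur)] += 1
        prev = cur
    return counts

# Toy label sequence with two invalid transitions: O -> I-PER and I-ORG -> I-LOC
toy = ["O", "I-PER", "B-ORG", "I-ORG", "I-LOC", "O", "B-PER", "I-PER"]
print(invalid_transitions(toy).most_common())
```

To feed it real predictions, the pipeline would have to return every subword token, for example with `aggregation_strategy="none"` and `ignore_labels=[]` (by default the pipeline drops tokens labelled `O`), and the label of each token would then be read from its `"entity"` key. Summing the returned counters over all sentences and calling `most_common()` would give the statistics Task 2 asks for.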

Just to examine the results:
Let's print a few randomly!

In [13]:
for i in range(2):
    prindex = random.randrange(len(data))    # Random valid index into data (0 .. len(data)-1)
    print(data[prindex]["newsItem"])
    for entity in data[prindex]["entities"]:
        print("Found: " + entity["word"] + ", category: " + entity["entity_group"] + ", score:" "{:.4f}".format(entity["score"]))

The Finnish Wildlife Agency has issued four exemptions to a ban on hunting wolves. 
 The decision will allow for the culling of a total of 18 wolves next year and is aimed at regulating the growth of Finland's wolf population, the agency said in a press release on Wednesday. 
 Figures provided by the Natural Resources Institute Finland (Luke) earlier this year revealed that the wolf population had hit  its highest level  for over 100 years. 
 The agency's exemptions will permit the hunting of eight wolves from the Kuhmo Saunajärvi region of eastern Finland and two from the Liminka-Lumijoki area on the western coast, as well as five wolves from the southwestern region of Kauhajoki-Karvia and three from the Somero, Nummi-Pusula and Tammela areas, also in the southwest. 
 Each region has received a separate exemption from the national ban on the hunting of the endangered species. 
 The exemptions will begin on 1 February 2022 and will be valid for 15 days, with each hunting group limited 

---

**Task 3.** Read about the `aggregation_strategy` parameter for token classification pipelines (sometimes source code is the best place to get information...). Based on your reading, select a suitable parameter (or in case you run the inference without using pipelines, write a simple function to implement some simple aggregation strategy), run the inference, and collect predicted named entities. What is the most common entity type in your data and what are the most common entities?

**On** `aggregation_strategy` **:**

Answer written based on the following sources:
 - https://discuss.huggingface.co/t/ner-tag-aggregation-stratergy/14199
 - https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/token_classification.py
 - a little experimentation of my own

So basically: aggregation strategies decide how the predictions for the individual (sub)tokens of an entity get grouped together. There are five aggregation strategies: "none", "simple", "first", "average" and "max". The important thing to realize is that "none" does no aggregation, so every subword token is reported separately and its category keeps the position prefix (B- for the beginning of an entity, I- for its continuation). The other strategies merge the tokens so that each named entity gets a single tag; "first", "average" and "max" additionally decide how to pick one label for a word whose subword tokens disagree. This is much easier to understand from outputs, so here are some from my experiments to demonstrate:<br>
<br>
`pipePrint` output with pipe using `aggregation_strategy="none"`:
```
One of my classmates, Toby, copied this video and sent it to our class's WhatsApp group, claiming he made it himself.:
Found: Toby, category: B-PER, score:0.9990
Found: What, category: B-ORG, score:0.9490
Found: ##s, category: I-ORG, score:0.9713
Found: ##A, category: I-ORG, score:0.9902
Found: ##pp, category: I-ORG, score:0.9669
```
<br>

`pipePrint` output with pipe using `aggregation_strategy="simple"`:
```
One of my classmates, Toby, copied this video and sent it to our class's WhatsApp group, claiming he made it himself.:
Found: Toby, category: PER, score:0.9990
Found: WhatsApp, category: ORG, score:0.9694
```
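
To make the grouping step concrete, here is a minimal sketch of a "simple"-style aggregation written by hand. It is an illustration under stated assumptions, not the pipeline's actual implementation: `group_tokens` is a hypothetical function, it expects raw token predictions including `O` tokens, it joins WordPiece continuations by stripping the `##` prefix, and it uses the mean token score as the entity score. The `O` token in the example is made up; the other values come from the raw output shown above.

```python
def group_tokens(tokens):
    """Merge raw B-/I- token predictions into entities (hypothetical sketch).

    tokens: list of dicts with keys "entity" (e.g. "B-ORG" or "O"),
            "word" and "score", in sentence order.
    """
    groups = []
    current = None
    for tok in tokens:
        label = tok["entity"]
        if label == "O":                  # an O token ends the current entity
            current = None
            continue
        prefix, etype = label.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == etype:
            piece = tok["word"]
            # "##" marks a WordPiece continuation of the previous token
            current["word"] += piece[2:] if piece.startswith("##") else " " + piece
            current["scores"].append(tok["score"])
        else:                             # B- tag or a change of type starts a new entity
            current = {"entity_group": etype, "word": tok["word"], "scores": [tok["score"]]}
            groups.append(current)
    return [
        {"entity_group": g["entity_group"], "word": g["word"],
         "score": sum(g["scores"]) / len(g["scores"])}
        for g in groups
    ]

raw = [
    {"entity": "B-PER", "word": "Toby", "score": 0.9990},
    {"entity": "O", "word": "copied", "score": 0.9999},   # made-up O token
    {"entity": "B-ORG", "word": "What", "score": 0.9490},
    {"entity": "I-ORG", "word": "##s", "score": 0.9713},
    {"entity": "I-ORG", "word": "##A", "score": 0.9902},
    {"entity": "I-ORG", "word": "##pp", "score": 0.9669},
]
for entity in group_tokens(raw):
    print(entity)
```

Averaging the four ORG token scores gives roughly 0.969, which is in line with the score reported for "WhatsApp" in the `"simple"` output.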

Based on my research & experimentation, I chose `aggregation_strategy="simple"` and performed NER on `news-en-2021.jsonl` (Task 2.). The results are saved in `data`.

Counting entity types:

In [24]:
entityTypeCounter = {"LOC": 0, "ORG":0, "PER": 0, "MISC": 0}
entityCounter = {}

# Counting
for newsItem in data:
    for entity in newsItem["entities"]:
        entityTypeCounter[entity["entity_group"]] += 1
        if entity["word"] in entityCounter.keys():
            entityCounter[entity["word"]] += 1
        else:
            entityCounter[entity["word"]] = 1

# Results
print("All entity categories and their quantities:", end=" ")
print(entityTypeCounter)

All entity categories and their quantities: {'LOC': 9043, 'ORG': 10662, 'PER': 12219, 'MISC': 5761}
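
According to the counts above, PER is the most common entity type (12219 occurrences). The most common individual entities are collected in `entityCounter` but not printed. A sketch of how they could be listed with `collections.Counter` follows. `most_common_entities` is a hypothetical helper, and the toy data only mimics the structure of `data`; it does not show real results.

```python
from collections import Counter

def most_common_entities(data, n=10):
    """Return (entity type counts, top-n entities) for a list of entries.

    Hypothetical helper; each entry is a dict with an "entities" list whose
    items have "word" and "entity_group" keys, like the entries in `data`.
    """
    type_counts = Counter()
    entity_counts = Counter()
    for newsItem in data:
        for entity in newsItem["entities"]:
            type_counts[entity["entity_group"]] += 1
            # Count word and type together so that e.g. a PER and a LOC
            # sharing the same surface form are kept apart
            entity_counts[(entity["word"], entity["entity_group"])] += 1
    return type_counts.most_common(), entity_counts.most_common(n)

# Toy data with the same structure as `data` (not real results)
toy_data = [
    {"newsItem": "...", "entities": [
        {"word": "Finland", "entity_group": "LOC"},
        {"word": "Marin", "entity_group": "PER"},
    ]},
    {"newsItem": "...", "entities": [
        {"word": "Finland", "entity_group": "LOC"},
        {"word": "Helsinki", "entity_group": "LOC"},
    ]},
]
types, top = most_common_entities(toy_data, n=3)
print(types)
print(top)
```

Calling `most_common_entities(data)` on the collected results would list the ten most frequent entities together with their types.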
