This notebook is most convenient to run in Google Colab.

# Introduction

Let's assume we have a business task:

We have ski resorts and their descriptions. The descriptions mention nearby mountains, and we want to extract the names of those mountains. We can then place them on a map so that users can see how close particular mountains are and decide whether they can plan hikes there. The extracted names can therefore feed a number of useful features in our project.

Such a task splits into two cases: in the first we know the specific region we are working in, and in the second we do not know the region at all, that is, we work globally.
The two cases call for different approaches. Let's start with the case where we know the approximate location, for example the Alps or the Carpathian Mountains.

Let's discuss how we can solve such tasks. If the region is fairly small, we can look for official state registers of the mountains in it. Such lists usually exist, because states maintain geographical registers. With a complete list of the region's mountains, we can simply go through it and check whether each name appears in our text (P.S. We implement this option).
There is also a more basic option: search for keywords such as mountain, peak, hill, etc., and take a couple of words before or after each keyword. This may already be enough to solve our problem.
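The keyword option is not implemented in this notebook, so here is only a minimal sketch of how it could look. The keyword list, the `NAME` pattern, and the function name `find_names_near_keywords` are my own assumptions for illustration. The sketch assumes English text in which mountain names are capitalized.

```python
import re
from typing import List

# Hypothetical keyword list; extend it for the region you work with.
KEYWORDS = ("Mount", "Mountain", "Mountains", "Peak", "Hill", "Hills")

# One or more capitalized words, optionally hyphenated (e.g. "Sapun-Hora").
NAME = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"


def find_names_near_keywords(text: str) -> List[str]:
    """Return capitalized word groups that directly precede or follow a keyword."""
    keyword_group = "|".join(KEYWORDS)
    patterns = [
        rf"\b(?:{keyword_group})\s+({NAME})",  # keyword first: "Mount Hoverla"
        rf"({NAME})\s+(?:{keyword_group})\b",  # keyword last: "Petros Peak"
    ]
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            name = match.group(1)
            if name not in found:
                found.append(name)
    return found


print(find_names_near_keywords(
    "We climbed Mount Hoverla and later rested below Petros Peak."
))
```

Such rules are easy to interpret, but they miss names that appear without a keyword and can pick up capitalized words such as "The" at the start of a sentence, so they would need extra filtering in practice.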

As my teacher taught me: don't reach for machine learning in tasks that can be solved with ordinary rules. Rules are easy to interpret from a business point of view, and we immediately see where the mistakes come from and what the model lacks.

Unfortunately, this option does not work for a global search, if only because it would be very difficult to find the registries of every country.

The next easiest option is to find a pre-trained model that already performs our task, for example by searching for models on Hugging Face.

It could be an NER model (P.S. We implement this option), a QA model (P.S. We implement this option), or an LLM that can already perform our task at a sufficient level.

The next step, if nothing suits us, is to take a pre-trained model from Hugging Face and then fine-tune it for our task (P.S. We implement this option).

Finally, if the quality of these models is still insufficient, or they are too slow for us, we can train our own model from scratch.

# Dataset creation

## Parse data from database

As mentioned earlier, I decided to start data preparation by finding a database or a list of the mountains in the region.

In solving this problem, I decided to limit myself to the Ukrainian mountains.

I expected such a resource to be much easier to find, but I did find one: https://mountain.land.kiev.ua/list.html
Let's take the data from this site (you can find a neater implementation of this code in Data_collection/parse_ukr_peaks.py).

In [3]:
import requests
from bs4 import BeautifulSoup

import pandas as pd


"""
Parse the site with the names of most Ukrainian peaks and mountains
Site link: https://mountain.land.kiev.ua/list.html
"""
# URL of the webpage to scrape
url = 'https://mountain.land.kiev.ua/list.html'

# Send a GET request to the URL (the timeout keeps the cell from hanging if the site does not respond)
response = requests.get(url, timeout=30)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "lxml")

    # Find table rows in page content
    table = soup.find("table", class_="w100")
    table_rows = table.find("tbody").find_all("tr")

    # Extract the mountain name text from table
    mountain_names = []
    for row in table_rows:
        mount_name = row.find_all("td")[1].text
        mountain_names.append(mount_name)

    # Create a DataFrame from the collected mountain names
    result_dataset = pd.DataFrame(mountain_names, columns=['mountain_name'])
    # Drop entries whose name contains "висота" (Ukrainian for "height")
    result_dataset = result_dataset[result_dataset.mountain_name.apply(lambda x: "висота" not in x.lower())]
    result_dataset.to_csv("mountain_names_ukr.csv", index=False)
else:
    print(f"Failed to retrieve content: Status code {response.status_code}")

In [4]:
result_dataset

Unnamed: 0,mountain_name
0,Говерла
1,Бребенескул
2,Чорна Гора (Піп -Іван)
3,Петрос
4,Гутин Томнатек
...,...
1691,Яйла-Баш
1692,Каматра
1693,Сапун-гора
1694,Аганин-Бурун


I translated `mountain_names_ukr.csv` into English using a web-based translator and saved the result as `mountain_names_eng.csv`:

In [5]:
data_eng = pd.read_csv("mountain_names_eng.csv")

In [6]:
data_eng

Unnamed: 0,mountain_name
0,Hoverla
1,Brebeneskul
2,Chorna Hora (Pip Ivan)
3,Petros
4,Gutyn Tomnatyk
...,...
1343,Yayla-Bash
1344,Kamatra
1345,Sapun-Hora
1346,Aganyn-Burun


## Generate texts for our names using the OpenAI API

We will write a program that sends requests to the OpenAI API with a specific set of mountain names and receives, as the chat response, a list of texts containing those names.

In the code below I have reduced the number of generated texts because this notebook is just a demo.

(You can find the implementation of this code in Data_collection/create_dataset_by_gpt.py).
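The code in this section passes the chat reply straight to `json.loads`, which raises an exception if the model wraps the JSON in extra text. Here is a minimal sketch of a more tolerant parser. The helper name `parse_model_json` is hypothetical and is not part of the original script. The sketch assumes the reply contains at most one JSON object.

```python
import json
from typing import Dict


def parse_model_json(raw: str) -> Dict[str, str]:
    """Parse the text between the first "{" and the last "}" of a chat reply.

    Returns an empty dict when no valid JSON object can be recovered, so one
    bad reply does not abort the whole generation loop.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return {}
    if not isinstance(parsed, dict):
        return {}
    # Keep only string-to-string pairs, which is the shape the prompt asks for.
    return {k: v for k, v in parsed.items() if isinstance(v, str)}


reply = 'Here is the result:\n{"Hoverla": "Hoverla is the highest peak in Ukraine."}\nDone.'
print(parse_model_json(reply))
```

With such a helper, a malformed reply for one bucket of names yields no rows instead of stopping the run.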

In [None]:
# ! pip install openai

In [13]:
from openai import OpenAI

import json
import pandas as pd

from typing import List


def create_prompt(names: List[str]) -> str:
    """Add our data to the instructions for generating a dataset"""

    prompt = """create a text of 2-3 sentences with meaningful context in English which should contain the mountain names.
    create such text for each mountain name that is in this list:
    """ + f"{names}" + """
    Return the results as a Json file: {
    "mountain name 1": "generated text 1",
    "mountain name 2": "generated text 2",
    "mountain name 3": "generated text 3",
    ....
    }
    don't use the word mountains in every sentence you make!"""

    return prompt


def main():
    """
        Creates the dataset in mountain name - generated text format
        containing the names of the mountains and the text that contains these names
        from list of the mountains names
    """
    # Get users api_key
    api_key = input("Enter OpenAI api_key: ")
    # Initialize OpenAI client
    client = OpenAI(
        # defaults to os.environ.get("OPENAI_API_KEY")
        api_key=api_key,
    )

    # Get set of mountain name
    name_set = pd.read_csv("mountain_names_eng.csv")
    name_set = name_set.mountain_name.values

    # Initialize lists for results
    name_list = []
    text_list = []

    for bucket_id in range(0, 1301, 500):
        # Demo setting: take 3 consecutive names starting at every 500th index
        # (the original comment described 20 names every 50 names)
        name_subset = name_set[bucket_id:bucket_id + 3]
        # Create prompt for GPT
        prompt = create_prompt(name_subset)
        # Request a chat completion from the OpenAI API
        chat_completion = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model="gpt-3.5-turbo",
            temperature=0.6
        )
        # Extract the content of the response
        chat_result = chat_completion.choices[0].message.content
        # Parse the response from JSON
        data_json = json.loads(chat_result)
        for name in data_json:
            name_list.append(name)
            text_list.append(data_json[name])

    # Write the collected data
    data_df = pd.DataFrame(
        {
            "peak_name": name_list,
            "text": text_list,
        }
    )
    data_df.to_csv("dataset_NER_mountain.csv", index=False)


if __name__ == "__main__":
    main()

Enter OpenAI api_key: [redacted]


In [14]:
data_after_gpt = pd.read_csv("dataset_NER_mountain.csv")

In [15]:
data_after_gpt

Unnamed: 0,peak_name,text
0,Hoverla,Hoverla is the highest peak in Ukraine and off...
1,Brebeneskul,Brebeneskul is a stunning mountain located in ...
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje..."
3,Hostryluv,Hostryluv is a majestic peak that offers breat...
4,Katran-Yakkan-Tepe,"The summit of Katran-Yakkan-Tepe stands tall, ..."
5,Halechky Velyki,Halechky Velyki is a hidden gem nestled amidst...
6,Biyuk-Guba-Tepe,"Biyuk-Guba-Tepe stands tall and majestic, offe..."
7,Vedmezha,"Vedmezha is a hidden gem, nestled amidst lush ..."
8,Litovyshche,"Litovyshche is a picturesque peak, adorned wit..."


I am not fully satisfied with the generated dataset because the texts are quite monotonous, but since this is just a demonstration, I decided to leave it as is.

# Implementation of simple solutions

## Searching for mountain names from a given list

Since we have a list of all the mountains in our region, we can simply go through it and check whether each name occurs in our texts.

Let's implement this.

(You can find the implementation of this code in Solving_pipeline/pipeline1_search_name_from_given_set.py).

In [36]:
import pandas as pd

from typing import List


def find_mount_name_p1(sentence: str, mount_name_list: List[str]) -> List[str]:
    """
    This function searches for mountain names within a given sentence.

    Parameters:
    sentence (str): The sentence in which to search for mountain names.
    mount_name_list (List[str]): A list of mountain names to search for.

    Returns:
    List[str]: A list of mountain names found in the sentence.
    """
    result_mount_name = []
    # Iterate through each mountain name in the list
    for mount_name in mount_name_list:
        # If the mountain name is found in the sentence, add it to the result list
        if mount_name in sentence:
            result_mount_name.append(mount_name)
    return result_mount_name


def extract_entity_name_p1(sentence: str) -> List[str]:
    """
    Extracts mountain names from a sentence using a predefined list of names.

    Parameters:
    sentence (str): The sentence from which to extract mountain names.

    Returns:
    List[str]: A list of extracted mountain names.
    """
    mount_name_list = pd.read_csv("mountain_names_eng.csv")
    mount_name_list = mount_name_list.mountain_name.values
    result = find_mount_name_p1(sentence, mount_name_list)
    return result


def main_p1():
    # Prompt the user to input a sentence
    sentence = input("Enter sentence with name of the Ukrainian Carpathians peak: ")
    # Extract mountain names from the sentence
    result = extract_entity_name_p1(sentence)
    # Print results
    print(f"Found {len(result)} mountain peaks:")
    for mount_name in result:
        print(mount_name)


if __name__ == "__main__":
    main_p1()

Enter sentence with name of the Ukrainian Carpathians peak: Hoverla is the highest peak in Ukraine, offering breathtaking views of the Carpathian Mountains.
Found 1 mountain peaks:
Hoverla


In [17]:
main_p1()

Enter sentence with name of the Ukrainian Carpathians peak: Chorna Hora, also known as Pip Ivan, is a majestic peak in the Ukrainian Carpathians, famous for its abandoned observatory.
Found 5 mountain peaks:
Pip Ivan
Chorna
Hor
Hora
Chorna Hora


As we can see, the algorithm works, but it also returns extra matches such as "Hor", because a plain substring check fires inside longer words and names. To improve the results, we need to clean up the general list of mountains and add rules that filter out unwanted matches.
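One possible filtering rule, shown here only as a sketch, is to match names as whole words and to prefer the longest name when matches overlap. The function name `find_whole_names` is hypothetical and not part of the original pipeline file. The names in the example are the ones from the output above plus Hoverla.

```python
import re
from typing import Iterable, List


def find_whole_names(sentence: str, names: Iterable[str]) -> List[str]:
    """Return names that occur as whole words, preferring longer names on overlap."""
    taken = []  # character spans already claimed by a longer name
    found = []
    # Try longer names first so "Chorna Hora" wins over "Chorna" and "Hora".
    for name in sorted(set(names), key=len, reverse=True):
        # The lookarounds require that no word character touches the match.
        pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
        for match in re.finditer(pattern, sentence):
            start, end = match.span()
            overlaps = any(start < t_end and t_start < end for t_start, t_end in taken)
            if not overlaps:
                taken.append((start, end))
                if name not in found:
                    found.append(name)
    return found


sentence = (
    "Chorna Hora, also known as Pip Ivan, is a majestic peak in the "
    "Ukrainian Carpathians, famous for its abandoned observatory."
)
names = ["Pip Ivan", "Chorna", "Hor", "Hora", "Chorna Hora", "Hoverla"]
print(find_whole_names(sentence, names))
```

On this sentence the sketch keeps only "Chorna Hora" and "Pip Ivan". The list itself would still need cleaning, since very short or generic entries can match ordinary words.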

## Pretrained NER model

This is the solution that first came to mind when I read the task description.

Just take a model from Hugging Face and use it.

(You can find implementation of this code in this file: Solving_pipeline/pipeline2_NER_from_huggingface.py)

In [18]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

from typing import List


def convert_answer_to_words_p2(sentence: str, ner_results: List[dict]) -> List[str]:
    """
    Converts NER results to a list of location names.

    Parameters:
    sentence (str): The original sentence.
    ner_results (List[dict]): The results from the NER model, containing identified entities.

    Returns:
    List[str]: A list of location names extracted from the sentence.
    """
    result = []
    current_entity = ""
    for entity in ner_results:
        # Check for the beginning of a location name
        if entity["entity"] == "B-LOC":
            # If a name is already being built, store it and start a new one
            if current_entity:
                result.append(current_entity)
            current_entity = sentence[entity["start"]:entity["end"]]
        # If this token continues the current name, append it
        elif entity["entity"] == "I-LOC" and current_entity:
            # WordPiece continuations start with "##" and are glued without a space
            if not entity["word"].startswith("##"):
                current_entity += " "
            current_entity += sentence[entity["start"]:entity["end"]]
    # Add the last name to the result list
    if current_entity:
        result.append(current_entity)
    return result


def extract_entity_name_p2(sentence: str) -> List[str]:
    """
       Extracts location names from a sentence using a pretrained NER model.

       Parameters:
       sentence (str): The sentence from which to extract location names.

       Returns:
       List[str]: A list of extracted location names.
   """
    # Load a pretrained tokenizer and model for token classification
    tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
    model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
    # Create a NER pipeline using the loaded model and tokenizer
    ner_cls = pipeline("ner", model=model, tokenizer=tokenizer)
    # Get NER results from the sentence
    ner_results = ner_cls(sentence)
    # Convert NER results to a list of words
    result = convert_answer_to_words_p2(sentence, ner_results)
    return result


def main_p2():
    # Prompt the user to input a sentence
    sentence = input("Enter sentence with name of the Ukrainian Carpathians peak: ")
    # Extract mountain names from the sentence
    result = extract_entity_name_p2(sentence)
    # Print results
    print(f"Found {len(result)} locations:")
    for mount_name in result:
        print(mount_name)


if __name__ == "__main__":
    main_p2()


Enter sentence with name of the Ukrainian Carpathians peak: Hoverla is the highest peak in Ukraine, offering breathtaking views of the Carpathian Mountains.


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Found 4 locations:
Hoverla
Ukraine
Car
pathian Mountains


As we can see, this model also works, but it has drawbacks. It is designed to recognize any location, not specifically mountains, so it returns every location mentioned in the text, such as Ukraine in the example above. It can also split one name into pieces, as happened with "Car" and "pathian Mountains".
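The split into "Car" and "pathian Mountains" suggests that a word piece was tagged as the start of a new entity. This is my assumption, since the raw predictions are not shown. Here is a minimal sketch of a merge step that relies on character offsets instead of trusting the B-/I- prefixes. The function name `merge_location_pieces` and the hand-written predictions are hypothetical. As far as I know, the `pipeline` call also accepts an `aggregation_strategy` argument that groups word pieces for you.

```python
from typing import Dict, List


def merge_location_pieces(sentence: str, ner_results: List[Dict]) -> List[str]:
    """Merge LOC tokens into entities using character offsets."""
    spans = []  # [start, end] character spans of the entities found so far
    for token in ner_results:
        if not token["entity"].endswith("LOC"):
            continue
        # A "##" piece that starts where the previous span ends is the same word.
        glued = (
            token["word"].startswith("##")
            and bool(spans)
            and token["start"] == spans[-1][1]
        )
        # An I-LOC token at most one character after the previous span continues it.
        continues = (
            token["entity"].startswith("I-")
            and bool(spans)
            and token["start"] <= spans[-1][1] + 1
        )
        if glued or continues:
            spans[-1][1] = token["end"]
        else:
            spans.append([token["start"], token["end"]])
    return [sentence[start:end] for start, end in spans]


sentence = (
    "Hoverla is the highest peak in Ukraine, offering breathtaking views "
    "of the Carpathian Mountains."
)
# Hand-written predictions that imitate the output above (not real model output).
ner_results = [
    {"entity": "B-LOC", "word": "Hoverla", "start": 0, "end": 7},
    {"entity": "B-LOC", "word": "Ukraine", "start": 31, "end": 38},
    {"entity": "B-LOC", "word": "Car", "start": 75, "end": 78},
    {"entity": "B-LOC", "word": "##pathian", "start": 78, "end": 85},
    {"entity": "I-LOC", "word": "Mountains", "start": 86, "end": 95},
]
print(merge_location_pieces(sentence, ner_results))
```

This only repairs the splitting. The model still returns every location, so the result would have to be filtered further, for example against the list of known mountain names.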

## Pretrained QA model

This option is interesting because, without fine-tuning, it can look specifically for the names of peaks. However, there is a downside: it returns only one name for the entire text.

(You can find the implementation of this code in Solving_pipeline/pipeline3_QA_from_huggingface.py).

P.S. Of course, we could devise ways of splitting a text so that several names can be found in it. However, this may reduce the quality of the model's answers.
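One such way of splitting, sketched here under simple assumptions, is to ask the question once per sentence and collect the distinct answers. The names `split_sentences`, `extract_per_sentence`, and `answer_fn` are hypothetical. `answer_fn` stands for any function that takes a text and returns one answer string. The sentence splitter is deliberately naive, and a stub replaces the model so the logic can be run without downloading anything.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def extract_per_sentence(text: str, answer_fn: Callable[[str], str]) -> List[str]:
    """Ask the QA function once per sentence and collect distinct non-empty answers."""
    answers = []
    for sentence in split_sentences(text):
        answer = answer_fn(sentence).strip()
        if answer and answer not in answers:
            answers.append(answer)
    return answers


# Stub that stands in for the QA model: it just returns the first word.
def first_word(sentence: str) -> str:
    return sentence.split()[0]


print(extract_per_sentence(
    "Hoverla is the highest peak. Petros is nearby! Hoverla is popular.",
    first_word,
))
```

In the notebook, `answer_fn` could be a function like `extract_entity_name_p3`, although it would be better to create the pipeline once instead of on every call. A sentence with no peak can still produce an answer, which is one reason quality may drop.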

In [19]:
from transformers import pipeline


def extract_entity_name_p3(sentence: str) -> str:
    """
        Extracts peak names from a sentence using a question-answering model.

        Parameters:
        sentence (str): The sentence from which to extract peak names.

        Returns:
        str: The extracted peak name(s) from the sentence.
    """
    model_name = "deepset/roberta-base-squad2"
    # Initialize a pipeline for question-answering
    nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
    # Formulate the question-answering input with the question and the provided sentence as context
    qa_input = {
        'question': 'What peak names were used in this text?',
        'context': sentence,
    }
    # Get the answer from the model
    res = nlp(qa_input)
    # Return model answer
    return res["answer"]


def main_p3():
    # Prompt the user to input a sentence
    sentence = input("Enter sentence with name of the Ukrainian Carpathians peak: ")
    # Extract mountain names from the sentence
    result = extract_entity_name_p3(sentence)
    # Print results
    print(result)


if __name__ == "__main__":
    main_p3()


Enter sentence with name of the Ukrainian Carpathians peak: Hoverla is the highest peak in Ukraine, offering breathtaking views of the Carpathian Moun


Hoverla


In [20]:
main_p3()

Enter sentence with name of the Ukrainian Carpathians peak: Chorna Hora, also known as Pip Ivan, is a majestic peak in the Ukrainian Carpathians, famous for its abandoned observatory.
Chorna Hora


# Fine-tune a model from Hugging Face

I have a proper implementation of this pipeline (Fine_tune_NER/fine_tune_huggingface_NER_model.py), but for demonstration purposes I deliberately split that single file into cells that are easier to read in a Jupyter notebook.

## Data preparation

Here, I tried to transform my dataset into a standard format for NER tasks as quickly as possible.

We add the text_list and target columns:

text_list - a list of words and punctuation marks.

target - a list indicating, for each item of text_list, whether it is part of the target entity.

In [None]:
! pip install datasets
! pip install -q evaluate seqeval
! pip install accelerate -U
! pip install transformers -U

In [2]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split

from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
import evaluate
from huggingface_hub import notebook_login

from typing import List, Tuple

In [3]:
data = pd.read_csv("dataset_NER_mountain.csv")

In [4]:
data

Unnamed: 0,peak_name,text
0,Hoverla,"Hoverla is the highest peak in Ukraine, offeri..."
1,Brebeneskul,Brebeneskul is a beautiful mountain located in...
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje..."
3,Petros,"Petros is a rugged mountain with steep cliffs,..."
4,Gutyn Tomnatyk,Gutyn Tomnatyk is a hidden gem among the Carpa...
...,...,...
535,Shuparka,Shuparka is a favorite destination for nature ...
536,Bychova,Bychova is a challenging peak that rewards hik...
537,Mahera,Mahera is a sacred mountain revered by locals ...
538,Koziy Hory,Koziy Hory is a paradise for outdoor enthusias...


In [5]:
def separate_words_punctuation(text: str) -> List[str]:
    """
        Splits a given text into words and punctuation marks.

        Parameters:
        text (str): The text to be split.

        Returns:
        List[str]: A list of words and punctuation marks.
    """
    # Using regular expression to separate words and punctuation
    return re.findall(r"[\w']+|[.,!?;]", text)


# Create a column containing the text divided into individual words and punctuation marks
data["text_list"] = data.text.apply(separate_words_punctuation)

In [6]:
data

Unnamed: 0,peak_name,text,text_list
0,Hoverla,"Hoverla is the highest peak in Ukraine, offeri...","[Hoverla, is, the, highest, peak, in, Ukraine,..."
1,Brebeneskul,Brebeneskul is a beautiful mountain located in...,"[Brebeneskul, is, a, beautiful, mountain, loca..."
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje...","[Chorna, Hora, ,, also, known, as, Pip, Ivan, ..."
3,Petros,"Petros is a rugged mountain with steep cliffs,...","[Petros, is, a, rugged, mountain, with, steep,..."
4,Gutyn Tomnatyk,Gutyn Tomnatyk is a hidden gem among the Carpa...,"[Gutyn, Tomnatyk, is, a, hidden, gem, among, t..."
...,...,...,...
535,Shuparka,Shuparka is a favorite destination for nature ...,"[Shuparka, is, a, favorite, destination, for, ..."
536,Bychova,Bychova is a challenging peak that rewards hik...,"[Bychova, is, a, challenging, peak, that, rewa..."
537,Mahera,Mahera is a sacred mountain revered by locals ...,"[Mahera, is, a, sacred, mountain, revered, by,..."
538,Koziy Hory,Koziy Hory is a paradise for outdoor enthusias...,"[Koziy, Hory, is, a, paradise, for, outdoor, e..."


In [7]:
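# Find the start and end index of the peak name in the text
# (str.find returns -1 when the name does not occur verbatim in the text)
data["peak_loc"] = data.apply(
    lambda x: (x["text"].find(x["peak_name"]), x["text"].find(x["peak_name"]) + len(x["peak_name"])),
    axis=1
)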

In [8]:
data

Unnamed: 0,peak_name,text,text_list,peak_loc
0,Hoverla,"Hoverla is the highest peak in Ukraine, offeri...","[Hoverla, is, the, highest, peak, in, Ukraine,...","(0, 7)"
1,Brebeneskul,Brebeneskul is a beautiful mountain located in...,"[Brebeneskul, is, a, beautiful, mountain, loca...","(0, 11)"
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje...","[Chorna, Hora, ,, also, known, as, Pip, Ivan, ...","(-1, 21)"
3,Petros,"Petros is a rugged mountain with steep cliffs,...","[Petros, is, a, rugged, mountain, with, steep,...","(0, 6)"
4,Gutyn Tomnatyk,Gutyn Tomnatyk is a hidden gem among the Carpa...,"[Gutyn, Tomnatyk, is, a, hidden, gem, among, t...","(0, 14)"
...,...,...,...,...
535,Shuparka,Shuparka is a favorite destination for nature ...,"[Shuparka, is, a, favorite, destination, for, ...","(0, 8)"
536,Bychova,Bychova is a challenging peak that rewards hik...,"[Bychova, is, a, challenging, peak, that, rewa...","(0, 7)"
537,Mahera,Mahera is a sacred mountain revered by locals ...,"[Mahera, is, a, sacred, mountain, revered, by,...","(0, 6)"
538,Koziy Hory,Koziy Hory is a paradise for outdoor enthusias...,"[Koziy, Hory, is, a, paradise, for, outdoor, e...","(0, 10)"


In [9]:
def add_word_starts(text: str, text_list: List[str]) -> List[Tuple[str, int]]:
    """
        Finds the starting index of each word in the text.

        Parameters:
        text (str): The original text.
        text_list (List[str]): A list of words to find in the text.

        Returns:
        List[Tuple[str, int]]: A list of tuples containing words and their starting indices.
    """
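    result = []
    cursor = 0
    for word in text_list:
        # Search from the end of the previous token so that a short token
        # (e.g. "in") is not matched inside the preceding word (e.g. "mountain")
        start = text.find(word, cursor)
        result.append((word, start))
        cursor = start + len(word)
    return result


# Find the starting index of each word in the text.
data["text_set"] = data.apply(lambda x: add_word_starts(x["text"], x["text_list"]), axis=1)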

In [10]:
data

Unnamed: 0,peak_name,text,text_list,peak_loc,text_set
0,Hoverla,"Hoverla is the highest peak in Ukraine, offeri...","[Hoverla, is, the, highest, peak, in, Ukraine,...","(0, 7)","[(Hoverla, 0), (is, 8), (the, 11), (highest, 1..."
1,Brebeneskul,Brebeneskul is a beautiful mountain located in...,"[Brebeneskul, is, a, beautiful, mountain, loca...","(0, 11)","[(Brebeneskul, 0), (is, 12), (a, 15), (beautif..."
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje...","[Chorna, Hora, ,, also, known, as, Pip, Ivan, ...","(-1, 21)","[(Chorna, 0), (Hora, 7), (,, 11), (also, 13), ..."
3,Petros,"Petros is a rugged mountain with steep cliffs,...","[Petros, is, a, rugged, mountain, with, steep,...","(0, 6)","[(Petros, 0), (is, 7), (a, 10), (rugged, 12), ..."
4,Gutyn Tomnatyk,Gutyn Tomnatyk is a hidden gem among the Carpa...,"[Gutyn, Tomnatyk, is, a, hidden, gem, among, t...","(0, 14)","[(Gutyn, 0), (Tomnatyk, 6), (is, 15), (a, 18),..."
...,...,...,...,...,...
535,Shuparka,Shuparka is a favorite destination for nature ...,"[Shuparka, is, a, favorite, destination, for, ...","(0, 8)","[(Shuparka, 0), (is, 9), (a, 12), (favorite, 1..."
536,Bychova,Bychova is a challenging peak that rewards hik...,"[Bychova, is, a, challenging, peak, that, rewa...","(0, 7)","[(Bychova, 0), (is, 8), (a, 11), (challenging,..."
537,Mahera,Mahera is a sacred mountain revered by locals ...,"[Mahera, is, a, sacred, mountain, revered, by,...","(0, 6)","[(Mahera, 0), (is, 7), (a, 10), (sacred, 12), ..."
538,Koziy Hory,Koziy Hory is a paradise for outdoor enthusias...,"[Koziy, Hory, is, a, paradise, for, outdoor, e...","(0, 10)","[(Koziy, 0), (Hory, 6), (is, 11), (a, 14), (pa..."


In [11]:
def add_target(peak_loc: Tuple[int, int], text_set: List[Tuple[str, int]]) -> List[int]:
    """
        Marks words in the text set as target entities based on their location.

        Parameters:
        peak_loc (Tuple[int, int]): The start and end indices of the target entity in the text.
        text_set (List[Tuple[str, int]]): A list of tuples containing words and their starting indices.

        Returns:
        List[int]: A list indicating whether each word in the text set is part of the target entity.
    """
    result = [0] * len(text_set)
    for i, word in enumerate(text_set):
        if peak_loc[0] <= word[1] < peak_loc[1]:
            result[i] = 1
    return result


# Create target markup for NER model
data["target"] = data.apply(lambda x: add_target(x[3], x[4]), axis=1)

In [12]:
data

Unnamed: 0,peak_name,text,text_list,peak_loc,text_set,target
0,Hoverla,"Hoverla is the highest peak in Ukraine, offeri...","[Hoverla, is, the, highest, peak, in, Ukraine,...","(0, 7)","[(Hoverla, 0), (is, 8), (the, 11), (highest, 1...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
1,Brebeneskul,Brebeneskul is a beautiful mountain located in...,"[Brebeneskul, is, a, beautiful, mountain, loca...","(0, 11)","[(Brebeneskul, 0), (is, 12), (a, 15), (beautif...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,Chorna Hora (Pip Ivan),"Chorna Hora, also known as Pip Ivan, is a maje...","[Chorna, Hora, ,, also, known, as, Pip, Ivan, ...","(-1, 21)","[(Chorna, 0), (Hora, 7), (,, 11), (also, 13), ...","[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,Petros,"Petros is a rugged mountain with steep cliffs,...","[Petros, is, a, rugged, mountain, with, steep,...","(0, 6)","[(Petros, 0), (is, 7), (a, 10), (rugged, 12), ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,Gutyn Tomnatyk,Gutyn Tomnatyk is a hidden gem among the Carpa...,"[Gutyn, Tomnatyk, is, a, hidden, gem, among, t...","(0, 14)","[(Gutyn, 0), (Tomnatyk, 6), (is, 15), (a, 18),...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...
535,Shuparka,Shuparka is a favorite destination for nature ...,"[Shuparka, is, a, favorite, destination, for, ...","(0, 8)","[(Shuparka, 0), (is, 9), (a, 12), (favorite, 1...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
536,Bychova,Bychova is a challenging peak that rewards hik...,"[Bychova, is, a, challenging, peak, that, rewa...","(0, 7)","[(Bychova, 0), (is, 8), (a, 11), (challenging,...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
537,Mahera,Mahera is a sacred mountain revered by locals ...,"[Mahera, is, a, sacred, mountain, revered, by,...","(0, 6)","[(Mahera, 0), (is, 7), (a, 10), (sacred, 12), ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
538,Koziy Hory,Koziy Hory is a paradise for outdoor enthusias...,"[Koziy, Hory, is, a, paradise, for, outdoor, e...","(0, 10)","[(Koziy, 0), (Hory, 6), (is, 11), (a, 14), (pa...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [13]:
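# Manually fix the labels of the "Chorna Hora (Pip Ivan)" row: the full name with
# parentheses is not in the text, so peak_loc starts at -1 and the labels are wrong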
data["target"] = data.target.apply(
    lambda x: [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    if x == [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    else x
)
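The manual fix above handles a single row. Here is a sketch of a more general labelling step. It is not what the notebook does: it assumes that a parenthesised part of a name is an alias, and it marks every occurrence of every alias. The function names `split_aliases` and `label_tokens` are hypothetical. The tokenizer regex is the same one used in `separate_words_punctuation`.

```python
import re
from typing import List


def split_aliases(peak_name: str) -> List[str]:
    """Split 'Chorna Hora (Pip Ivan)' into ['Chorna Hora', 'Pip Ivan']."""
    parts = re.split(r"[()]", peak_name)
    return [part.strip() for part in parts if part.strip()]


def label_tokens(text: str, peak_name: str) -> List[int]:
    """Return a 0/1 tag per token, marking every occurrence of every alias."""
    tokens = list(re.finditer(r"[\w']+|[.,!?;]", text))
    tags = [0] * len(tokens)
    for alias in split_aliases(peak_name):
        for hit in re.finditer(re.escape(alias), text):
            for i, token in enumerate(tokens):
                # A token belongs to the entity when it starts inside the matched span.
                if hit.start() <= token.start() < hit.end():
                    tags[i] = 1
    return tags


print(label_tokens(
    "Chorna Hora, also known as Pip Ivan, is a majestic peak.",
    "Chorna Hora (Pip Ivan)",
))
```

For this sentence the tags start with 1, 1, 0, 0, 0, 0, 1, 1, which is the same pattern as the manual correction.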

In [14]:
# Let's split the data into training (60%), validation (20%), and testing (20%) sets
data_train, data_test = train_test_split(data, test_size=0.2)
data_train, data_valid = train_test_split(data_train, test_size=0.25)

In [15]:
def transform_df(data: pd.DataFrame) -> List[dict]:
    """
        Transforms a DataFrame into a format suitable for the Hugging Face datasets.

        Parameters:
        data (pd.DataFrame): The DataFrame to be transformed.

        Returns:
        List[dict]: A list of dictionaries with 'id', 'tokens', and 'ner_tags'.
    """
    result = []
    for id_, row in enumerate(data[["text_list", "target"]].values):
        result.append(
            {
                "id": id_,
                "tokens": row[0],
                "ner_tags": row[1]
            }
        )
    return result


# Transform the DataFrames to lists of dicts
data_train_dict = transform_df(data_train)
data_test_dict = transform_df(data_test)
data_valid_dict = transform_df(data_valid)

In [16]:
# Convert the lists of dicts into Hugging Face datasets
train_dataset = Dataset.from_list(data_train_dict)
test_dataset = Dataset.from_list(data_test_dict)
valid_dataset = Dataset.from_list(data_valid_dict)

# Create a DatasetDict to hold the datasets
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': valid_dataset,
    'test': test_dataset,
})

In [17]:
dataset_dict["train"]

Dataset({
    features: ['id', 'tokens', 'ner_tags'],
    num_rows: 324
})

In [18]:
dataset_dict["train"][0]

{'id': 0,
 'tokens': ["Koloska's",
  'majestic',
  'silhouette',
  'dominates',
  'the',
  'horizon',
  ',',
  'beckoning',
  'explorers',
  'to',
  'conquer',
  'its',
  'lofty',
  'heights',
  '.'],
 'ner_tags': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

Our data is ready for tokenization, so we can proceed to the training stage.

## Training stage

In [19]:
# Define tag names for NER
tag_names = ["O", "Peak"]

In [20]:
# Initialize the tokenizer with a pretrained model
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

An important step in this task is tokenizing our data for Transformer models. We need to tokenize the source texts and carry our word-level labels over to the resulting tokens.
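To make the label transfer concrete, here is a toy sketch that runs without a tokenizer. The `word_ids` list is written by hand and assumes a hypothetical split of "Hoverla" into three pieces surrounded by two special tokens. The real split depends on the tokenizer's vocabulary. The function name `align_tags` is also hypothetical.

```python
from typing import List, Optional


def align_tags(word_ids: List[Optional[int]], word_tags: List[int]) -> List[int]:
    """Give each word's tag to its first sub-token and -100 to everything else."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            # Special token or continuation piece: excluded from the loss.
            aligned.append(-100)
        else:
            aligned.append(word_tags[wid])
        previous = wid
    return aligned


# Words: ["Hoverla", "is", "high"] with word-level tags [1, 0, 0].
# Assumed sub-tokens: [CLS] Ho ##ver ##la is high [SEP]
word_ids = [None, 0, 0, 0, 1, 2, None]
print(align_tags(word_ids, [1, 0, 0]))
```

The value -100 is used because it is the default index that the PyTorch cross-entropy loss ignores, so only the first piece of each word contributes to training.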

In [21]:
def tokenize_and_align_tags(records, tokenizer):
    """
    Tokenize pre-split words and align the word-level tags with the resulting sub-tokens

    source: https://github.com/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_Fine_Tuning_NER_Model.ipynb
    """
    # Tokenize the input words. This will break words into subtokens if necessary.
    # For instance, "ChatGPT" might become ["Chat", "##G", "##PT"].
    tokenized_results = tokenizer(records["tokens"], truncation=True, is_split_into_words=True)

    input_tags_list = []

    # Iterate through each set of tags in the records.
    for i, given_tags in enumerate(records["ner_tags"]):
        # Get the word IDs corresponding to each token. This tells us to which original word each token corresponds.
        word_ids = tokenized_results.word_ids(batch_index=i)

        previous_word_id = None
        input_tags = []

        # For each token, determine which tag it should get.
        for wid in word_ids:
            # If the token does not correspond to any word (e.g., it's a special token), set its tag to -100.
            if wid is None:
                input_tags.append(-100)
            # If the token corresponds to a new word, use the tag for that word.
            elif wid != previous_word_id:
                input_tags.append(given_tags[wid])
            # If the token is a subtoken (i.e., part of a word we've already tagged), set its tag to -100.
            else:
                input_tags.append(-100)
            previous_word_id = wid

        input_tags_list.append(input_tags)

    # Add the assigned tags to the tokenized results.
    # Hugging Face transformers use the 'labels' field of a dataset to compute the loss.
    tokenized_results["labels"] = input_tags_list

    return tokenized_results


# Tokenize the dataset and align the tags
tokenized_dataset_dict = dataset_dict.map(
    lambda records: tokenize_and_align_tags(records, tokenizer=tokenizer),
    batched=True
)
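
To make the alignment rule easier to follow, here is a minimal sketch that applies it to a hand-written `word_ids` list instead of a real tokenizer output. The helper name `align_tags` and the particular sub-token split are assumptions made up for illustration; the real split depends on the tokenizer's vocabulary.

```python
from typing import List, Optional


def align_tags(word_ids: List[Optional[int]], word_tags: List[int]) -> List[int]:
    """Give each word's tag to its first sub-token and mask everything else with -100."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None or wid == previous:
            # Special token, or a continuation of a word that is already tagged
            aligned.append(-100)
        else:
            # First sub-token of a new word
            aligned.append(word_tags[wid])
        previous = wid
    return aligned


# Hypothetical split of the words ["Koloska's", "majestic"]:
# [CLS], four sub-tokens for word 0, one sub-token for word 1, [SEP]
word_ids = [None, 0, 0, 0, 0, 1, None]
word_tags = [1, 0]  # 1 = "Peak", 0 = "O"

aligned = align_tags(word_ids, word_tags)
print(aligned)  # [-100, 1, -100, -100, -100, 0, -100]
```

Only one position per word keeps a real label, so the loss and the metrics are computed per word rather than per sub-token.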

In [22]:
# Create a data collator for token classification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Create dictionaries to map IDs to labels and vice versa
id2label = dict(enumerate(tag_names))
label2id = dict(zip(id2label.values(), id2label.keys()))

In [23]:
# Initialize the model for token classification
model = AutoModelForTokenClassification.from_pretrained(
    "dslim/bert-base-NER",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)

# Load the seqeval metric for evaluation
seqeval = evaluate.load("seqeval")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at dslim/bert-base-NER and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([2]) in the model instantiated
- classifier.weight: found shape torch.Size([9, 768]) 

In [24]:
# Login to Hugging Face Hub if pushing the model to the hub
# if push_to_hub:
#     notebook_login()

P.S. If you get an error involving accelerate and transformers, restart the runtime.

In [25]:
# Set training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="my_model_example",
    logging_steps=10,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [28]:
def compute_metrics(p, tag_names, seqeval):
    """
    Calculate metrics on the validation dataset during model training

    source: https://github.com/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_Fine_Tuning_NER_Model.ipynb
    """
    # p is the results containing a list of predictions and a list of labels
    # Unpack the predictions and true labels from the input tuple 'p'.
    predictions_list, labels_list = p

    # Convert the raw prediction scores into tag indices by selecting the tag with the highest score for each token.
    predictions_list = np.argmax(predictions_list, axis=2)

    # Filter out the '-100' labels that were used to ignore certain tokens (like sub-tokens or special tokens).
    # Convert the numeric tags in 'predictions' and 'labels' back to their string representation using 'tag_names'.
    # Only consider tokens that have tags different from '-100'.
    true_predictions = [
        [tag_names[p] for (p, l) in zip(predictions, labels) if l != -100]
        for predictions, labels in zip(predictions_list, labels_list)
    ]
    true_tags = [
        [tag_names[l] for (p, l) in zip(predictions, labels) if l != -100]
        for predictions, labels in zip(predictions_list, labels_list)
    ]

    # Evaluate the predictions using the 'seqeval' library, which is commonly used for sequence labeling tasks like NER.
    # This provides metrics like precision, recall, and F1 score for sequence labeling tasks.
    results = seqeval.compute(predictions=true_predictions, references=true_tags)

    # Return the evaluated metrics as a dictionary.
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [29]:
# Initialize the Trainer with the model, training arguments, dataset, tokenizer, and metrics computation
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_dict["train"],
    eval_dataset=tokenized_dataset_dict["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=lambda p: compute_metrics(p, tag_names, seqeval),
)

In [30]:
# Start the training process
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.002,0.007955,0.981818,1.0,0.990826,0.998579




TrainOutput(global_step=41, training_loss=0.00752930832077698, metrics={'train_runtime': 212.4362, 'train_samples_per_second': 1.525, 'train_steps_per_second': 0.193, 'total_flos': 5756694171840.0, 'train_loss': 0.00752930832077698, 'epoch': 1.0})

I tried many combinations of training parameters. However, my dataset turned out to be too monotonous, so none of these changes improved the result, and the best option was simply to train for one epoch with the parameters shown above.

In my main solution, I also uploaded the model to the Hugging Face Hub.

## Model Inference

In [31]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

from typing import List


def convert_answer_to_words_p4(sentence: str, ner_results: List[dict]) -> List[str]:
    """
      Reconstructs words from token information provided by NER results.

      This function iterates through tokens identified by the NER model and reconstructs
      the original words based on their start and end positions in the sentence.

      Parameters:
      sentence (str): The original sentence.
      ner_results (List[dict]): The results from NER model, containing identified tokens with their positions.

      Returns:
      List[str]: A list of reconstructed words from the sentence.
    """
    words = []
    current_word = ""
    last_end = -1
    # Iterate through each token information provided by NER
    for token_info in ner_results:
        start, end = token_info['start'], token_info['end']

        # Check if the current token follows immediately after the previous token
        if start == last_end:
            current_word += sentence[start:end]
        elif start == last_end + 1:
            # If the difference in indices is one, it represents a space
            current_word += ' ' + sentence[start:end]
        else:
            # If there is a separate word, add it to the list
            if current_word:
                words.append(current_word.strip())
            current_word = sentence[start:end]

        last_end = end

    # Add the last collected word if it exists
    if current_word:
        words.append(current_word.strip())

    return words


def extract_entity_name_p4(sentence: str) -> List[str]:
    """
        Extracts entity names from a sentence using a pretrained NER model.

        Parameters:
        sentence (str): The sentence from which to extract entity names.

        Returns:
        List[str]: A list of extracted entity names.
    """
    # Load a pretrained tokenizer and model for token classification
    tokenizer = AutoTokenizer.from_pretrained("ruba12/mountain_ner_test_quantum")
    model = AutoModelForTokenClassification.from_pretrained("ruba12/mountain_ner_test_quantum")
    # Create a NER pipeline using the loaded model and tokenizer
    ner_cls = pipeline("ner", model=model, tokenizer=tokenizer)
    # Get NER results from the sentence
    ner_results = ner_cls(sentence)
    # Convert NER results to a list of words
    result = convert_answer_to_words_p4(sentence, ner_results)
    return result


def main_p4():
    # Prompt the user to input a sentence
    sentence = input("Enter sentence with name of the Ukrainian Carpathians peak: ")
    # Extract mountain names from the sentence
    result = extract_entity_name_p4(sentence)
    # Print results
    print(f"Found {len(result)} locations:")
    for mount_name in result:
        print(mount_name)


if __name__ == "__main__":
    main_p4()


Enter sentence with name of the Ukrainian Carpathians peak: Hoverla is the highest peak in Ukraine, offering breathtaking views of the Carpathian Moun


Found 2 locations:
Hoverla
Mo


In [32]:
main_p4()

Enter sentence with name of the Ukrainian Carpathians peak: Chorna Hora, also known as Pip Ivan, is a majestic peak in the Ukrainian Carpathians, famous for its abandoned observatory.
Found 2 locations:
Chorna Hora
Pip Ivan
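
The merging logic in `convert_answer_to_words_p4` can be tried without loading a model. The sketch below is a compact variant, not the function above: it takes hand-written `(start, end)` character offsets (the sub-token split is an assumption for illustration) and, unlike the original, keeps whatever character sits in a one-character gap instead of always inserting a space. The name `merge_token_spans` is hypothetical.

```python
from typing import List, Tuple


def merge_token_spans(sentence: str, token_spans: List[Tuple[int, int]]) -> List[str]:
    """Merge spans that touch or are one character apart, then slice the sentence."""
    merged: List[Tuple[int, int]] = []
    for start, end in token_spans:
        if merged and start - merged[-1][1] in (0, 1):
            # Same word (gap 0) or next word of the same name (gap 1)
            merged[-1] = (merged[-1][0], end)
        else:
            # A gap larger than one character starts a new entity
            merged.append((start, end))
    return [sentence[s:e] for s, e in merged]


sentence = "Chorna Hora, also known as Pip Ivan, is a majestic peak."
# Assumed offsets for the tokens "Ch", "##orna", "Hora", "Pip", "Ivan"
token_spans = [(0, 2), (2, 6), (7, 11), (27, 30), (31, 35)]

names = merge_token_spans(sentence, token_spans)
print(names)  # ['Chorna Hora', 'Pip Ivan']
```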


# Comparison of results

Let's get the predictions of each of our pipelines and measure their running time on the validation dataset.

In [33]:
test_data = pd.read_csv("/content/valid_data.csv")

In [34]:
test_data

Unnamed: 0,peak_name,text,text_list,peak_loc,text_set,target
0,Vesnyarka,"Vesnyarka is a tranquil haven, where wildflowe...","['Vesnyarka', 'is', 'a', 'tranquil', 'haven', ...","(0, 9)","[('Vesnyarka', 0), ('is', 10), ('a', 13), ('tr...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,Tarkes-Oba,Tarkes-Oba is a scenic mountain with rolling h...,"['Tarkes', 'Oba', 'is', 'a', 'scenic', 'mounta...","(0, 10)","[('Tarkes', 0), ('Oba', 7), ('is', 11), ('a', ...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
2,Khamysh,The challenging ascent of Khamysh rewards clim...,"['The', 'challenging', 'ascent', 'of', 'Khamys...","(26, 33)","[('The', 0), ('challenging', 4), ('ascent', 16...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,Khalyh-Buruk,Khalyh-Buruk's rugged terrain and untamed beau...,"['Khalyh', ""Buruk's"", 'rugged', 'terrain', 'an...","(0, 12)","[('Khalyh', 0), (""Buruk's"", 7), ('rugged', 15)...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,Matochyv,Matochyv's rolling hills and tranquil valleys ...,"[""Matochyv's"", 'rolling', 'hills', 'and', 'tra...","(0, 8)","[(""Matochyv's"", 0), ('rolling', 11), ('hills',...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
...,...,...,...,...,...,...
103,Grobost,"Grobost is a majestic peak, standing tall agai...","['Grobost', 'is', 'a', 'majestic', 'peak', ','...","(0, 7)","[('Grobost', 0), ('is', 8), ('a', 11), ('majes...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"
104,Vilkhovata,Vilkhovata is a stunning mountain that boasts ...,"['Vilkhovata', 'is', 'a', 'stunning', 'mountai...","(0, 10)","[('Vilkhovata', 0), ('is', 11), ('a', 14), ('s...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
105,Lysynka,Lysynka mountain's tranquil slopes and untouch...,"['Lysynka', ""mountain's"", 'tranquil', 'slopes'...","(0, 7)","[('Lysynka', 0), (""mountain's"", 8), ('tranquil...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
106,Bilyklya,"Bilyklya is a tranquil mountain, where one can...","['Bilyklya', 'is', 'a', 'tranquil', 'mountain'...","(0, 8)","[('Bilyklya', 0), ('is', 9), ('a', 12), ('tran...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]"


In [46]:
%time test_data["res_pipe1"] = test_data.text.apply(extract_entity_name_p1)

CPU times: user 446 ms, sys: 13 ms, total: 459 ms
Wall time: 952 ms


In [47]:
%time test_data["res_pipe2"] = test_data.text.apply(extract_entity_name_p2)

CPU times: user 2min 26s, sys: 13.1 s, total: 2min 39s
Wall time: 3min 5s


In [49]:
%time test_data["res_pipe3"] = test_data.text.apply(extract_entity_name_p3)

CPU times: user 2min 59s, sys: 4.43 s, total: 3min 3s
Wall time: 3min 21s


In [51]:
%time test_data["res_pipe4"] = test_data.text.apply(extract_entity_name_p4)

CPU times: user 2min 14s, sys: 3.12 s, total: 2min 18s
Wall time: 2min 38s
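
Note that `extract_entity_name_p4` loads the tokenizer and the model on every call, so the timing for that pipeline includes reloading them for each row. A common remedy is to load once and reuse. The sketch below shows the pattern with a stand-in loader so that it runs without downloading anything; `make_cached_extractor` and `fake_loader` are hypothetical names, and in the notebook the loader would be whatever builds the `pipeline("ner", ...)` object.

```python
from functools import lru_cache
from typing import Callable, List


def make_cached_extractor(
    load_pipeline: Callable[[], Callable[[str], List[str]]]
) -> Callable[[str], List[str]]:
    """Wrap an expensive loader so the pipeline is built only on first use."""

    @lru_cache(maxsize=1)
    def get_pipeline():
        return load_pipeline()

    def extract(sentence: str) -> List[str]:
        return get_pipeline()(sentence)

    return extract


load_calls = []


def fake_loader():
    # Stand-in for the expensive model loading step (assumption for illustration)
    load_calls.append(1)
    return lambda sentence: sentence.split()[:1]


extract = make_cached_extractor(fake_loader)
results = [extract(s) for s in ["Hoverla is high", "Petros is steep"]]
print(results, len(load_calls))  # [['Hoverla'], ['Petros']] 1
```

With this pattern the measured time would reflect inference only, which makes the comparison with the first pipeline fairer.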


In [52]:
test_data

Unnamed: 0,peak_name,text,text_list,peak_loc,text_set,target,res_pipe1,res_pipe2,res_pipe3,res_pipe4
0,Vesnyarka,"Vesnyarka is a tranquil haven, where wildflowe...","['Vesnyarka', 'is', 'a', 'tranquil', 'haven', ...","(0, 9)","[('Vesnyarka', 0), ('is', 10), ('a', 13), ('tr...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",[Vesnyarka],"[V, esnyarka]",Vesnyarka,[Vesnyarka]
1,Tarkes-Oba,Tarkes-Oba is a scenic mountain with rolling h...,"['Tarkes', 'Oba', 'is', 'a', 'scenic', 'mounta...","(0, 10)","[('Tarkes', 0), ('Oba', 7), ('is', 11), ('a', ...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",[Tarkes-Oba],"[Tarkes, Oba]",Tarkes-Oba,[Tarkes-Oba]
2,Khamysh,The challenging ascent of Khamysh rewards clim...,"['The', 'challenging', 'ascent', 'of', 'Khamys...","(26, 33)","[('The', 0), ('challenging', 4), ('ascent', 16...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",[Khamysh],[Khamysh],Khamysh,[Khamysh]
3,Khalyh-Buruk,Khalyh-Buruk's rugged terrain and untamed beau...,"['Khalyh', ""Buruk's"", 'rugged', 'terrain', 'an...","(0, 12)","[('Khalyh', 0), (""Buruk's"", 7), ('rugged', 15)...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",[Khalyh-Buruk],"[Khalyh, Buruk]",adrenaline,[Khalyh-Buruk]
4,Matochyv,Matochyv's rolling hills and tranquil valleys ...,"[""Matochyv's"", 'rolling', 'hills', 'and', 'tra...","(0, 8)","[(""Matochyv's"", 0), ('rolling', 11), ('hills',...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",[Matochyv],[],valleys,[Matochyv]
...,...,...,...,...,...,...,...,...,...,...
103,Grobost,"Grobost is a majestic peak, standing tall agai...","['Grobost', 'is', 'a', 'majestic', 'peak', ','...","(0, 7)","[('Grobost', 0), ('is', 8), ('a', 11), ('majes...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",[Grobost],[Grobost],Grobost,[Grobost]
104,Vilkhovata,Vilkhovata is a stunning mountain that boasts ...,"['Vilkhovata', 'is', 'a', 'stunning', 'mountai...","(0, 10)","[('Vilkhovata', 0), ('is', 11), ('a', 14), ('s...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[Vilkhovata, Vilkhova]",[Vilkhovata],Vilkhovata,[Vilkhovata]
105,Lysynka,Lysynka mountain's tranquil slopes and untouch...,"['Lysynka', ""mountain's"", 'tranquil', 'slopes'...","(0, 7)","[('Lysynka', 0), (""mountain's"", 8), ('tranquil...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",[Lysynka],"[L, ysynka]",Lysynka mountain,[Lysynka]
106,Bilyklya,"Bilyklya is a tranquil mountain, where one can...","['Bilyklya', 'is', 'a', 'tranquil', 'mountain'...","(0, 8)","[('Bilyklya', 0), ('is', 9), ('a', 12), ('tran...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",[Bilyklya],[Bilyklya],Bilyklya,[Bilyklya]


Let's write a function that calculates precision, recall, and F1 score for our results.

In [60]:
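def calculate_metrics(y_true, y_pred):
    """
    Compute precision, recall, and F1 by comparing each true name with one prediction per row.

    If a prediction is a list, only its first element is scored. A wrong prediction counts
    as a false positive, and a missing prediction (empty list or None) as a false negative.
    """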
    # Initialize counts for true positives, false positives, and false negatives
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    # Convert y_pred to a flat list if it's a list of lists
    y_pred_flat = []
    for pred in y_pred:
        if isinstance(pred, list):
            # If pred is a non-empty list, take the first element as the prediction
            y_pred_flat.append(pred[0] if pred else None)
        else:
            y_pred_flat.append(pred)

    # Calculate true positives, false positives, and false negatives
    for true, pred in zip(y_true, y_pred_flat):
        if true == pred:
            true_positives += 1
        elif pred is not None:
            false_positives += 1
        else:
            false_negatives += 1

    # Calculate precision, recall, and F1 score
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return precision, recall, f1

In [61]:
precision, recall, f1 = calculate_metrics(test_data['peak_name'], test_data['res_pipe1'])
print(precision, recall, f1)

0.9074074074074074 1.0 0.9514563106796117


In [62]:
precision, recall, f1 = calculate_metrics(test_data['peak_name'], test_data['res_pipe2'])
print(precision, recall, f1)

0.5882352941176471 0.684931506849315 0.6329113924050633


In [63]:
precision, recall, f1 = calculate_metrics(test_data['peak_name'], test_data['res_pipe3'])
print(precision, recall, f1)

0.7592592592592593 1.0 0.8631578947368421


In [64]:
precision, recall, f1 = calculate_metrics(test_data['peak_name'], test_data['res_pipe4'])
print(precision, recall, f1)

0.9907407407407407 1.0 0.9953488372093023
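
Because `calculate_metrics` scores only the first element of each prediction list, extra predictions such as `Vilkhova` in `[Vilkhovata, Vilkhova]` are not penalized. As a possible alternative, here is a sketch of a set-based variant that counts every predicted name. It is not the metric used for the numbers above, the name `set_based_metrics` is hypothetical, and it assumes every prediction is a list of strings (the third pipeline returns plain strings and would need to be wrapped in a list first).

```python
from typing import Iterable, List, Tuple


def set_based_metrics(
    y_true: Iterable[str], y_pred: Iterable[List[str]]
) -> Tuple[float, float, float]:
    """Score every predicted name: hits are TP, extras are FP, missed true names are FN."""
    tp = fp = fn = 0
    for true_name, predicted in zip(y_true, y_pred):
        predicted_set = set(predicted)
        if true_name in predicted_set:
            tp += 1
        else:
            fn += 1
        # Every predicted name other than the true one is a false positive
        fp += len(predicted_set - {true_name})
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["Vilkhovata", "Matochyv", "Khamysh"]
y_pred = [["Vilkhovata", "Vilkhova"], [], ["Khamysh"]]

precision, recall, f1 = set_based_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```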


# Conclusions

To summarize, we compared four pipelines:

- **First Pipeline**: Knowing the region in which the mountains are located, we compiled a list of the mountains in that region and scanned each text for the names on that list.
- **Second Pipeline**: We employed a pre-trained NER model from Hugging Face.
- **Third Pipeline**: We used a pre-trained Question-Answering model to extract the mountain name.
- **Fourth Pipeline**: We fine-tuned a NER model from Hugging Face on our dataset.


| name | wall time | precision | recall | f1 |
|---|---|---|---|---|
| pipe1 | 952 ms | 0.907 | 1.000 | 0.951 |
| pipe2 | 3min 5s | 0.588 | 0.685 | 0.633 |
| pipe3 | 3min 21s | 0.759 | 1.000 | 0.863 |
| pipe4 | 2min 38s | 0.991 | 1.000 | 0.995 |

P.S. These scores are likely biased, because my dataset turned out to be quite specific.

As we can see, the fourth pipeline achieves the best result, which is to be expected because it is the only one fine-tuned on our data. The first pipeline also deserves attention: at under one second versus roughly three minutes for the others, it is by far the fastest. The third pipeline is worth mentioning too, because it works without any fine-tuning and still produces good results, especially in terms of recall.

Let's go over the pros and cons of the models again and discuss how they can be improved.

**First Pipeline**

pros: by far the fastest

cons: cannot find mountains that are missing from its list of names

improve: Add more mountain names in this region, as well as remove unnecessary words in the generation, and clean the source dataset of names.
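
One further cleanup that the results table hints at: the first pipeline returned `[Vilkhovata, Vilkhova]` for a text about Vilkhovata, which suggests that a shorter name was matched inside a longer one. The first pipeline's code is not repeated in this section, so the following is only a sketch of one way to avoid such partial matches, using whole-word regular expression matching. The function names and the short name list are assumptions for illustration.

```python
import re
from typing import Iterable, List


def build_name_pattern(names: Iterable[str]) -> re.Pattern:
    """Compile one pattern that matches any listed name only as a whole word."""
    # Longest names first, so a multi-word name wins over a shorter prefix
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds reject matches glued to other word characters
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


def find_names(text: str, pattern: re.Pattern) -> List[str]:
    """Return every whole-word occurrence of a listed name, in text order."""
    return pattern.findall(text)


pattern = build_name_pattern(["Vilkhova", "Vilkhovata", "Tarkes-Oba"])
found = find_names("Vilkhovata is a stunning mountain near Tarkes-Oba.", pattern)
print(found)  # ['Vilkhovata', 'Tarkes-Oba']
```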

**Second Pipeline**

improve: we already fine-tuned this model to obtain the fourth pipeline

**Third Pipeline**

pros: works well out of the box

cons: does not work when the text contains more than one mountain name

improve: We can pre-clean the input datasets to remove unwanted triggers for certain words. Also, we can fine-tune this model.

**Fourth Pipeline**

pros: the best result

cons: sometimes splits a name into pieces (for example, it returned "Mo" in the first inference example above)

improve: It’s hard to say due to the bias of my dataset
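
Independently of the dataset, the fragment problem could be softened in post-processing by expanding every predicted character span to the boundaries of the word that contains it. This is only a sketch of the idea under simple assumptions (words are separated by whitespace, and surrounding punctuation is stripped); `expand_to_word` is a hypothetical helper, not part of the notebook.

```python
def expand_to_word(sentence: str, start: int, end: int) -> str:
    """Grow a character span left and right until it covers a whole word."""
    while start > 0 and not sentence[start - 1].isspace():
        start -= 1
    while end < len(sentence) and not sentence[end].isspace():
        end += 1
    # Drop punctuation attached to the word edges
    return sentence[start:end].strip(".,;:!?\"'()")


sentence = "Hoverla is the highest peak in Ukraine."
# Assume the model tagged only the first five characters of the name
whole_word = expand_to_word(sentence, 0, 5)
print(whole_word)  # Hoverla
```

This does not fix wrong predictions, but it keeps the returned string a complete word.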


P.S. We could also try to implement a solution to this problem using OpenAI and GPT-4.