In [8]:
import pandas as pd
from transformers import pipeline, RobertaForTokenClassification, RobertaTokenizerFast

In [29]:
def read_words_from_file(file_path):
    with open(file_path, 'r') as file:
        return set(line.strip() for line in file)

def delete_all_stop_words(text, stop_words):
    # Replace every non-alphanumeric character with a space, then split on whitespace
    words = ''.join(char if char.isalnum() or char.isspace() else ' ' for char in text).split()
    return [word for word in words if word.lower() not in stop_words]

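def process_texts_through_pipeline(texts, ner):
    # `ner` is the token-classification pipeline; the name avoids shadowing
    # the imported transformers.pipeline factory.
    stop_words = read_words_from_file("stop_words.txt")

    for i, text in enumerate(texts):
        print(f"Text {i+1}:")
        # A list of words goes in, so one list of entity groups per word comes out
        words_output = ner(delete_all_stop_words(text, stop_words))

        for groups in words_output:
            if not groups:  # skip words for which nothing was returned
                continue
            # Only the first group of each word is reported
            first = groups[0]
            label = "Mountain" if first['entity_group'] == 'LABEL_1' else "Non-mountain"
            print(f"{first['word']}, class: {label}")
        print("")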
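
The filtering step can be tried on its own, without loading the model. The sketch below repeats `delete_all_stop_words` and runs it on one of the test sentences. The stop-word set is a hypothetical stand-in, since the contents of `stop_words.txt` are not shown in this notebook.

```python
def delete_all_stop_words(text, stop_words):
    # Replace punctuation with spaces, split on whitespace, drop stop words
    words = ''.join(ch if ch.isalnum() or ch.isspace() else ' ' for ch in text).split()
    return [word for word in words if word.lower() not in stop_words]

# Hypothetical stop-word set (assumption: the real stop_words.txt is not shown)
stop_words = {"the", "and", "he", "i"}

filtered = delete_all_stop_words("He visited the Grand Canyon and hiked the Alps.", stop_words)
print(filtered)

exclaimed = delete_all_stop_words("I like Mount Hoverla!!!!!", stop_words)
print(exclaimed)
```

With this stand-in set, the first call yields the same five words that appear in the output for Text 3, and the trailing exclamation marks in the second sentence are dropped before the words reach the model.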

In [13]:
model = RobertaForTokenClassification.from_pretrained("UkrKreuzritter/NER_mountain")
tokenizer = RobertaTokenizerFast.from_pretrained("UkrKreuzritter/NER_mountain")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Device set to use cuda:0
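
When the pipeline receives a list of strings, it returns one list of entity groups per input string, and a single word can be split into several groups if its sub-tokens receive different labels. The processing loop above reports only the first group of each word. The sketch below shows one possible alternative: call a word a mountain if any of its groups carries `LABEL_1`. The helper `word_label` is hypothetical, and `sample_output` is hand-written stand-in data with made-up scores and labels, not real model output.

```python
def word_label(groups, positive="LABEL_1"):
    """Label one word from its list of aggregated entity groups."""
    if not groups:
        return None  # nothing was returned for this input
    hit = any(g["entity_group"] == positive for g in groups)
    return "Mountain" if hit else "Non-mountain"

# Stand-in for pipeline output on three inputs (assumption: illustrative values only)
sample_output = [
    [
        {"entity_group": "LABEL_1", "score": 0.90, "word": " Himalay", "start": 0, "end": 7},
        {"entity_group": "LABEL_0", "score": 0.60, "word": "as", "start": 7, "end": 9},
    ],
    [{"entity_group": "LABEL_0", "score": 0.99, "word": " form", "start": 0, "end": 4}],
    [],
]

labels = [word_label(groups) for groups in sample_output]
print(labels)
```

Whether "any group" or "first group" is the better rule depends on how the model tends to split words, so this is a design option to evaluate rather than a fix.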


In [35]:
texts = [
    "The Rocky Plains lie near the Appalachian Mountains.",
    "Climbing the corporate ladder was her Mount Everest.",
    "He visited the Grand Canyon and hiked the Alps.",
    "The Andes, Himalayas, and Rockies form vast mountain ranges.",
    "Mount Fuji near Tokyo symbolizes Japan.",
    "The Carpathians sustain diverse life.",
    "Mount Rainier looms over the valley, hidden by clouds.",
    "His determination mirrored the steadfast Rockies.",
    "I read about mountains on book a time ago",
    "The Great Plains lie between the Rockies and the Appalachian Mountains.",
    "I like Mount Hoverla!!!!!",
    "You live in US so long, have you ever visited Rushmore?"
]

In [36]:
process_texts_through_pipeline(texts, ner_pipeline)

Text 1:
 Rocky, class: Non-mountain
 Plains, class: Mountain
 lie, class: Non-mountain
 near, class: Non-mountain
 Appalachian, class: Mountain
 Mountains, class: Mountain

Text 2:
 Climbing, class: Non-mountain
 corporate, class: Non-mountain
 ladder, class: Non-mountain
 Mount, class: Mountain
 Everest, class: Mountain

Text 3:
 visited, class: Non-mountain
 Grand, class: Non-mountain
 Canyon, class: Mountain
 hiked, class: Non-mountain
 Alps, class: Mountain

Text 4:
 Andes, class: Mountain
 Himalay, class: Mountain
 Rockies, class: Mountain
 form, class: Non-mountain
 vast, class: Non-mountain
 mountain, class: Non-mountain
 ranges, class: Non-mountain

Text 5:
 Mount, class: Mountain
 Fuji, class: Non-mountain
 near, class: Non-mountain
 Tokyo, class: Non-mountain
 symbolizes, class: Non-mountain
 Japan, class: Non-mountain

Text 6:
 Carpathians, class: Mountain
 sustain, class: Non-mountain
 diverse, class: Non-mountain
 life, class: Non-mountain

Text 7:
 Mount, class: Mountain


# <center> Conclusions </center>

## Pros:
1. The model effectively identifies widely recognized mountain names (e.g., Alps, Andes, Carpathians).
2. It correctly ignores generic mentions of mountains (e.g., "mountains" in "I read about mountains" is not classified as a mountain name).

## Cons:
1. The model struggles with local or lesser-known mountain names (e.g., Hoverla).
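
These observations could be backed by numbers. The sketch below computes word-level precision and recall from the predictions printed above for Texts 1, 3, and 5. The gold labels are hand-assigned for illustration only (assumption: a word counts as positive when it is part of a mountain or mountain-range name), and the helper `precision_recall` is hypothetical.

```python
def precision_recall(rows):
    """rows: iterable of (word, predicted_is_mountain, gold_is_mountain)."""
    tp = sum(1 for _, pred, gold in rows if pred and gold)
    fp = sum(1 for _, pred, gold in rows if pred and not gold)
    fn = sum(1 for _, pred, gold in rows if gold and not pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Predictions copied from the output for Texts 1, 3 and 5; gold labels are assumptions
rows = [
    ("Rocky", False, False), ("Plains", True, False), ("lie", False, False),
    ("near", False, False), ("Appalachian", True, True), ("Mountains", True, True),
    ("visited", False, False), ("Grand", False, False), ("Canyon", True, False),
    ("hiked", False, False), ("Alps", True, True),
    ("Mount", True, True), ("Fuji", False, True), ("near", False, False),
    ("Tokyo", False, False), ("symbolizes", False, False), ("Japan", False, False),
]

precision, recall = precision_recall(rows)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Three sentences are far too few for a reliable estimate; a properly annotated evaluation set would be needed to draw firm conclusions.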