Let's demonstrate how the classifier works!

In [30]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

In [31]:
# large_model = AutoModelForTokenClassification.from_pretrained("./models/pretrained_model")
# tokenizer = AutoTokenizer.from_pretrained("dslim/bert-large-NER")
large_model = AutoModelForTokenClassification.from_pretrained("dieumerci/mountain-recognition-ner")
tokenizer = AutoTokenizer.from_pretrained("dieumerci/mountain-recognition-ner")
classifier = pipeline("ner", model=large_model, tokenizer=tokenizer)

In [32]:
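def text_to_tags(txt):
    """Tag each word of txt as 'mountain' or 'other'."""
    label2tag = {'LABEL_0': 'other', 'LABEL_1': 'mountain'}
    tags = []
    words = []
    word = ''
    for elem in classifier(txt):
        if not elem['word'].startswith('##'):
            # A new word starts: record its tag and flush the previous word.
            tags.append(label2tag[elem['entity']])
            if word != '':
                words.append(word)
            word = elem['word']
        else:
            # WordPiece continuation: strip the '##' prefix and append.
            word += elem['word'][2:]
    if word != '':
        words.append(word)
    # Keyed by word: a repeated word collapses into one entry (last tag wins).
    return {word: tag for word, tag in zip(words, tags)}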
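
Because `text_to_tags` returns a dict keyed by word, a word or punctuation mark that occurs several times in a sentence shows up only once in the output. Below is a minimal sketch of an order-preserving variant that returns a list of pairs instead. It assumes the pipeline output is a list of dicts with `word` and `entity` keys, as used above. The name `merge_wordpieces` and the hand-written `fake_output` are hypothetical and only illustrate the merging step without loading the model.

```python
def merge_wordpieces(tokens, label2tag=None):
    """Merge '##' continuation pieces and return a list of (word, tag) pairs."""
    if label2tag is None:
        label2tag = {'LABEL_0': 'other', 'LABEL_1': 'mountain'}
    pairs = []
    for tok in tokens:
        piece = tok['word']
        if piece.startswith('##') and pairs:
            # Continuation piece: glue it to the previous word, keep that word's tag.
            word, tag = pairs[-1]
            pairs[-1] = (word + piece[2:], tag)
        else:
            pairs.append((piece, label2tag[tok['entity']]))
    return pairs

# Hand-written stand-in for classifier(text); not real model output.
fake_output = [
    {'word': 'Den', 'entity': 'LABEL_1'},
    {'word': '##ali', 'entity': 'LABEL_1'},
    {'word': ',', 'entity': 'LABEL_0'},
    {'word': 'Jaya', 'entity': 'LABEL_1'},
    {'word': ',', 'entity': 'LABEL_0'},
]
print(merge_wordpieces(fake_output))
# [('Denali', 'mountain'), (',', 'other'), ('Jaya', 'mountain'), (',', 'other')]
```

With the real model, `merge_wordpieces(classifier(text))` should give the same words and tags as `text_to_tags(text)`, but with every occurrence kept in sentence order.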

We start with a simple example: a sentence containing a single mountain name, with context that helps to recognize it.

In [33]:
text = "Denali's snow-capped peak offers a breathtaking spectacle and beckons adventurers from around the globe."
text_to_tags(text)

{'Denali': 'mountain',
 "'": 'other',
 's': 'other',
 'snow': 'other',
 '-': 'other',
 'capped': 'other',
 'peak': 'other',
 'offers': 'other',
 'a': 'other',
 'breathtaking': 'other',
 'spectacle': 'other',
 'and': 'other',
 'beckons': 'other',
 'adventurers': 'other',
 'from': 'other',
 'around': 'other',
 'the': 'other',
 'globe': 'other',
 '.': 'other'}

As we see, the classifier did well on this simple example.

In [34]:
text = "I visited the Classification Mountains when I was a child."
text_to_tags(text)

{'I': 'other',
 'visited': 'other',
 'the': 'other',
 'Classification': 'mountain',
 'Mountains': 'mountain',
 'when': 'other',
 'was': 'other',
 'a': 'other',
 'child': 'other',
 '.': 'other'}

Here we have a name for a mountain range that I've made up for the purpose of the demo. Clearly no such name could have appeared in the training set, yet the model is not baffled.

In [35]:
text = "Everest is a mountain."
text_to_tags(text)

{'Everest': 'mountain',
 'is': 'other',
 'a': 'other',
 'mountain': 'other',
 '.': 'other'}

Note that in the previous example, 'Classification Mountains' was the name of a mountain range, so both words had to be labeled as 'mountain'. In this sentence, the word 'mountain' is not part of a name; it refers to the category instead. So it is correct that the model does not label it as 'mountain'.

In [14]:
text = "Just as other puppies, Kilimanjaro is very playful and funny."
text_to_tags(text)

{'Just': 'other',
 'as': 'other',
 'other': 'other',
 'puppies': 'other',
 ',': 'other',
 'Kilimanjaro': 'other',
 'is': 'other',
 'very': 'other',
 'playful': 'other',
 'and': 'other',
 'funny': 'other',
 '.': 'other'}

Here I use a famous mountain name to refer to something that is clearly not a mountain: one can infer from the context that Kilimanjaro in this sentence is the name of a puppy. The model sees through my trickery.

In [22]:
text = "Mountains."
text_to_tags(text)

{'Mountains': 'mountain', '.': 'other'}

Now I have finally managed to deceive the model. 'Mountains' can only be part of a mountain name when it accompanies another word, as in 'the Rocky Mountains'. Without context, we should infer that it simply denotes the category. The likely cause is casing: below I show that the same sentence with 'mountains' in lowercase yields the correct result.

In [23]:
text = "mountains."
text_to_tags(text)

{'mountains': 'other', '.': 'other'}
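
To probe the casing hypothesis on more than one sentence, a small helper can tag a text and its lowercased copy and report the words whose tag changes. This is only a sketch: `casing_flips` and `toy_tagger` are hypothetical names, and `tag_fn` is assumed to be any callable that returns a list of `(word, tag)` pairs in sentence order. The toy tagger stands in for the real model so that the example runs without downloading it.

```python
def casing_flips(tag_fn, text):
    """Return (word, tag, lowercased_tag) for words whose tag changes when lowercased."""
    original = tag_fn(text)
    lowered = tag_fn(text.lower())
    if len(original) != len(lowered):
        # Lowercasing changed the word split, so positions cannot be compared.
        raise ValueError("tokenization changed with casing; cannot align words")
    return [
        (word, tag, low_tag)
        for (word, tag), (_, low_tag) in zip(original, lowered)
        if tag != low_tag
    ]

# Toy stand-in for the real model: tags every capitalized word as 'mountain'.
def toy_tagger(text):
    return [(w, 'mountain' if w[0].isupper() else 'other') for w in text.split()]

print(casing_flips(toy_tagger, "Mountains are tall"))
# [('Mountains', 'mountain', 'other')]
```

With the real classifier, one could pass a wrapper such as `lambda t: list(text_to_tags(t).items())`, keeping in mind that the dict drops repeated words, in which case the length check may fail.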

The other cue the model can use to resolve this is, of course, context. In the sentence below, the word 'Mountains' starts with a capital letter, but the context makes it clear that we are not speaking of a specific name, and the model picks up on it.

In [24]:
text = "Mountains can be of varying height."
text_to_tags(text)

{'Mountains': 'other',
 'can': 'other',
 'be': 'other',
 'of': 'other',
 'varying': 'other',
 'height': 'other',
 '.': 'other'}

A few more examples.

In [19]:
text = "Finally, at the top of our list is Mount Fuji in Japan."
text_to_tags(text)

{'Finally': 'other',
 ',': 'other',
 'at': 'other',
 'the': 'other',
 'top': 'other',
 'of': 'other',
 'our': 'other',
 'list': 'other',
 'is': 'other',
 'Mount': 'mountain',
 'Fuji': 'mountain',
 'in': 'other',
 'Japan': 'other',
 '.': 'other'}

In [16]:
text = "Next on our list is Denali Peak, also known as Mount McKinley, in Alaska."
text_to_tags(text)

{'Next': 'other',
 'on': 'other',
 'our': 'other',
 'list': 'other',
 'is': 'other',
 'Denali': 'mountain',
 'Peak': 'mountain',
 ',': 'other',
 'also': 'other',
 'known': 'other',
 'as': 'other',
 'Mount': 'mountain',
 'McKinley': 'mountain',
 'in': 'other',
 'Alaska': 'other',
 '.': 'other'}
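
Since the model tags individual words, multi-word names such as 'Denali Peak' or 'Mount McKinley' have to be reassembled from adjacent 'mountain' tags. A minimal sketch, assuming the tags are available as `(word, tag)` pairs in sentence order; the name `mountain_names` is hypothetical, and the pairs below are copied by hand from the output above rather than produced by the model here.

```python
from itertools import groupby

def mountain_names(pairs):
    """Join each run of consecutive 'mountain'-tagged words into one name."""
    names = []
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        if tag == 'mountain':
            names.append(' '.join(word for word, _ in run))
    return names

pairs = [
    ('is', 'other'), ('Denali', 'mountain'), ('Peak', 'mountain'),
    (',', 'other'), ('also', 'other'), ('known', 'other'), ('as', 'other'),
    ('Mount', 'mountain'), ('McKinley', 'mountain'), (',', 'other'),
]
print(mountain_names(pairs))
# ['Denali Peak', 'Mount McKinley']
```

The two labels carry no begin/inside distinction, so two different names that stand directly next to each other would be merged into one by this approach.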

In [28]:
text = "Everest, Aconcagua, Denali, Kilimanjaro, Massif, and Jaya are all very tall."
text_to_tags(text)

{'Everest': 'mountain',
 ',': 'mountain',
 'Aconcagua': 'mountain',
 'Denali': 'mountain',
 'Kilimanjaro': 'mountain',
 'Massif': 'mountain',
 'and': 'other',
 'Jaya': 'mountain',
 'are': 'other',
 'all': 'other',
 'very': 'other',
 'tall': 'other',
 '.': 'other'}

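The output lists only one comma, but this is not a tokenizer issue: `text_to_tags` returns a dict keyed by word, so the repeated commas collapse into a single entry that keeps the tag of the last one. Every name is classified correctly, but the 'mountain' tag on the comma shows that the model mislabeled at least the last comma in the list.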

In [29]:
text = "Standing on the peak of Qwerty, I feel as though in the sky!"
text_to_tags(text)

{'Standing': 'other',
 'on': 'other',
 'the': 'other',
 'peak': 'other',
 'of': 'other',
 'Qwerty': 'mountain',
 ',': 'other',
 'I': 'other',
 'feel': 'other',
 'as': 'other',
 'though': 'other',
 'in': 'other',
 'sky': 'other',
 '!': 'other'}