# Data Science

## Task 1. Natural Language Processing. Named entity recognition


In this task, we need to train a named entity recognition (NER) model that identifies mountain names in text. To do this, you need to:

* Find or create a dataset with labeled mountain names.

* Select a suitable model architecture for NER.

* Train or fine-tune the model.

* Prepare demo code or a notebook showing the inference results.

  
The output for this task should contain:  

* **Jupyter notebook that explains the process of the dataset creation.**
* Dataset including all artifacts it consists of.
* Link to model weights.
* Python script (.py) for model training.
* Python script (.py) for model inference.
* Jupyter notebook with demo.


**Jupyter notebook that explains the process of the dataset creation.**

In [1]:
# Import the required libraries
import os
import spacy
import pickle

In [2]:
print(spacy.__version__)

3.7.2


In [3]:
# Initial data
mountain_names = {
    "Everest", "K2", "Matterhorn", "Denali", "Kangchenjunga", "Annapurna",
    "Aconcagua", "Kilimanjaro", "Elbrus", "Mont Blanc", "Mount Fuji", "Mount McKinley",
    "Mount Rainier", "Mount Whitney", "Grand Teton", "Pikes Peak"
}

texts = [
    "Everest is the highest mountain in the world.",
    "Denali, also known as Mount McKinley, is the highest peak in North America.",
    "The stunning Matterhorn is located in the Pennine Alps on the border between Switzerland and Italy.",
    "Kangchenjunga is the third highest mountain in the world.",
    "Aconcagua is the highest peak in South America.",
    "Everest, standing proudly in the heart of the Himalayas, is a formidable challenge for mountaineers seeking to conquer its lofty peak and experience the breathtaking vistas from the top of the world.",
    "K2 is the second-highest mountain in the world, after Everest.",
    "The summit of Kilimanjaro is well-known for its breathtaking views.",
    "Elbrus is the highest mountain in Europe, situated in the western Caucasus mountain range.",
    "Mont Blanc is the highest mountain in the Alps, situated on the border between France and Italy.",
    "Mount Fuji, an iconic stratovolcano, is Japan's highest peak and a symbol of natural beauty.",
    "Mount McKinley, also known as Denali, is the tallest mountain in North America, located in Alaska.",
    "Mount Rainier, a massive stratovolcano, dominates the skyline in the state of Washington.",
    "Mount Whitney, the highest summit in the contiguous United States, is part of the Sierra Nevada mountain range.",
    "The Matterhorn, with its distinctive pyramid shape, is a famous peak in the Pennine Alps on the Swiss-Italian border.",
    "Grand Teton, part of the Teton Range in Wyoming, is known for its stunning and rugged mountain scenery.",
    "Pikes Peak, standing tall in the Rocky Mountains, is one of Colorado's most famous fourteeners."
]
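
Before labeling, it can help to check how well the name list and the texts cover each other. The snippet below is a minimal sketch on a small stand-in sample (the sample sentences and the variable names `unused_names` and `unlabeled_texts` are illustrative assumptions, not part of the original notebook). It lists names that never occur in any text and texts that contain no known name.

```python
# Small stand-in sample; in the notebook, use the full mountain_names and texts.
mountain_names = {"Everest", "K2", "Mont Blanc", "Pikes Peak"}
texts = [
    "K2 is the second-highest mountain in the world, after Everest.",
    "Mont Blanc is the highest mountain in the Alps.",
    "The weather changed quickly during the hike.",
]

# Names that never appear in any text contribute no training examples.
unused_names = sorted(
    name for name in mountain_names if not any(name in text for text in texts)
)

# Texts without any known name end up with an empty entity list.
unlabeled_texts = [
    text for text in texts if not any(name in text for name in mountain_names)
]

print("Unused names:", unused_names)
print("Texts without a mountain:", unlabeled_texts)
```

With the data from the cell above, the same check would show, for example, whether a name such as "Annapurna" is actually mentioned in any sentence.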

In [4]:
# Create a blank English spaCy pipeline (no pretrained components)
nlp = spacy.blank("en")

# Adding a component for recognizing named entities
ner = nlp.add_pipe("ner")

# Preparing training data
train_data = []

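for text in texts:
    entities = []
    doc = nlp.make_doc(text)

    # Sorted so the entity order is reproducible (set iteration order is arbitrary)
    for mountain_name in sorted(mountain_names):
        # Record every occurrence of the name, not only the first one
        start = text.find(mountain_name)
        while start != -1:
            end = start + len(mountain_name)
            # Entities are stored as (start_char, end_char, label) character offsets
            entities.append((start, end, "MOUNTAIN"))
            start = text.find(mountain_name, end)

    train_data.append((doc, {"entities": sorted(entities)}))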
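
The entities collected in this cell are character offsets, not per-token tags. To illustrate how such offsets relate to IOB tags, here is a minimal sketch that uses a simple regex tokenizer as a stand-in for spaCy's tokenizer. The function name `offsets_to_iob` and the tokenization rule are assumptions made for this example; spaCy provides its own conversion helpers in `spacy.training`.

```python
import re


def offsets_to_iob(text, entities):
    """Map (start, end, label) character spans onto simple regex tokens."""
    # Words and single punctuation marks, each with its character span
    tokens = [
        (m.group(), m.start(), m.end())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for ent_start, ent_end, label in entities:
            if ent_start <= tok_start and tok_end <= ent_end:
                # The first token of a span gets B-, the following ones get I-
                prefix = "B" if tok_start == ent_start else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return [tok for tok, _, _ in tokens], tags


text = "Mont Blanc is the highest mountain in the Alps."
tokens, tags = offsets_to_iob(text, [(0, 10, "MOUNTAIN")])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here "Mont" is tagged `B-MOUNTAIN`, "Blanc" is tagged `I-MOUNTAIN`, and every other token is tagged `O`.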

In [5]:
# Path to the directory to save files
train_data_path = './data/'

# Create the directory if it does not exist
os.makedirs(train_data_path, exist_ok=True)

# Saving training data with pickle
with open(os.path.join(train_data_path, 'train_data.pkl'), 'wb') as file:
    pickle.dump(train_data, file)
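
A pickle file that contains spaCy `Doc` objects can only be loaded in an environment with a compatible spaCy installation. As an optional, more portable alternative (an assumption of this sketch, not something the original notebook does), the raw text and the entity offsets can also be stored as JSON Lines. The sample records, the file name `train_data.jsonl`, and the temporary directory below are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical sample in the (text, entities) shape used in this notebook
records = [
    ("Everest is the highest mountain in the world.", [(0, 7, "MOUNTAIN")]),
    (
        "K2 is the second-highest mountain in the world, after Everest.",
        [(0, 2, "MOUNTAIN"), (54, 61, "MOUNTAIN")],
    ),
]

# A temporary directory stands in for './data/' in this sketch
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "train_data.jsonl")

# Write one JSON object per line
with open(out_path, "w", encoding="utf-8") as f:
    for text, entities in records:
        f.write(json.dumps({"text": text, "entities": entities}) + "\n")

# Read the file back and check that each span still points at a mountain name
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

for row in loaded:
    # JSON has no tuple type, so each span comes back as a list
    for start, end, label in row["entities"]:
        print(label, repr(row["text"][start:end]))
```

Because the file holds only text and offsets, the `Doc` objects can be rebuilt later with `nlp.make_doc(text)` in the training script.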