**All code in this notebook runs in the Kaggle environment because training the model requires a powerful GPU. In this section, we prepare the dataset for training, train the model, and evaluate its performance.**

First, we install the libraries needed for model training and evaluation:

- `bitsandbytes`: Provides 8-bit quantization, which reduces the memory needed to load and train large models.
- `wandb`: A tool for tracking and visualizing experiments in real time, which helps monitor model performance during training.
- `accelerate`: Handles device placement and distributed or mixed-precision training, making it easier to use the GPUs available on platforms like Kaggle.
- `transformers[torch]` and `datasets`: Provided by Hugging Face, these libraries give access to pre-trained models and diverse datasets, streamlining the process of fine-tuning models on custom data.
- `peft`: Enables parameter-efficient fine-tuning of large models, reducing computational costs while maintaining model quality.

These installations set up the necessary environment for the model preparation, training, and evaluation stages.

In [42]:
!pip install -U bitsandbytes
!pip install wandb
!pip install accelerate
!pip install -q transformers[torch] datasets
!pip install peft

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




Import the `wandb` library and log in to our Weights & Biases account using an API key.

In [43]:
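import os

import wandb

# Read the API key from the environment (for example, exported from a Kaggle secret)
# rather than hard-coding it in the notebook.
wandb.login(key=os.environ["WANDB_API_KEY"])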



True

Initializing a new `wandb` run logs training progress in the "mountain-recognition" project. 

In [44]:
wandb.init(
    # set the wandb project where this run will be logged
    project="mountain-recognition",

    # track hyperparameters and run metadata
    config={
    "learning_rate": 0.02,
    "architecture": "transformer",
    "dataset": "mountain_names_recognition",
    "epochs": 3,
    }
)

The code defines a function to load CoNLL files into pandas DataFrames, facilitating the preparation of data for model training.

- **`load_conll_file(file_path)`**: This function reads a CoNLL file and extracts tokens and their corresponding NER tags into two separate lists for each sentence. It creates a DataFrame with columns `tokens` and `ner_tags`, allowing for easy manipulation and analysis of the data.

- **File Reading**: The function processes the file line by line, using empty lines to identify new sentences. If the file doesn’t end with an empty line, the last sentence's tokens and tags are also included.

- **Loading Training and Testing Data**: The function is applied to the specified training and testing file paths, creating `train_coll` and `test_coll` DataFrames. This organization of data is crucial for effectively training and evaluating the model later in the project.

- **DataFrame Structure Check**: The code includes commented-out lines to check the structure of the loaded DataFrames to ensure they are in the correct format for further processing.

In [45]:
import pandas as pd

# Function to load a CoNLL file into a DataFrame
def load_conll_file(file_path):
    tokens = []
    ner_tags = []

    # Reading the file line by line
    with open(file_path, 'r', encoding='utf-8') as file:
        current_sentence_tokens = []
        current_sentence_tags = []

        for line in file:
            line = line.strip()
            if not line:  # Empty line indicates a new sentence
                if current_sentence_tokens:
                    tokens.append(current_sentence_tokens)
                    ner_tags.append(current_sentence_tags)
                    current_sentence_tokens = []
                    current_sentence_tags = []
            else:
                token, ner_tag = line.split()
                current_sentence_tokens.append(token)
                current_sentence_tags.append(ner_tag)

        # Adding the last sentence if the file does not end with an empty line
        if current_sentence_tokens:
            tokens.append(current_sentence_tokens)
            ner_tags.append(current_sentence_tags)

    # Creating DataFrame with sentences
    df = pd.DataFrame({'tokens': tokens, 'ner_tags': ner_tags})
    return df

# Apply function to the train and test files
train_file_path = '/kaggle/input/mountains-names-recognition/train_file.conll'
test_file_path = '/kaggle/input/mountains-names-recognition/test_file.conll'

train_coll = load_conll_file(train_file_path)
test_coll = load_conll_file(test_file_path)

# Check if they are in the correct format
# print(train_coll.head())
# print(test_coll.head())
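
To make the expected input layout concrete, the sketch below applies the same blank-line rule to a tiny in-memory sample instead of a file. The sample sentences and the `B-MOUNT`/`I-MOUNT` tag names are assumptions for illustration only; the real tag set comes from the dataset files.

```python
import io

# Hypothetical two-sentence sample; the tag names are illustrative assumptions.
SAMPLE = """Everest B-MOUNT
is O
high O

We O
climbed O
Mont B-MOUNT
Blanc I-MOUNT
"""

def parse_conll(lines):
    """Group `token tag` lines into sentences separated by blank lines."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()
        tokens.append(token)
        tags.append(tag)
    if tokens:
        # The input did not end with a blank line
        sentences.append((tokens, tags))
    return sentences

sentences = parse_conll(io.StringIO(SAMPLE))
for tokens, tags in sentences:
    print(list(zip(tokens, tags)))
```

Each sentence becomes one pair of parallel lists, which corresponds to one row of the `tokens` and `ner_tags` columns in the DataFrame.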

In this section, we prepare the dataset for training by defining unique NER tags and converting the data into a format suitable for model processing.

- The `add_id` function is implemented to assign a unique identifier to each example in the dataset, which aids in tracking individual entries.

- The `encode_ner_tags` function transforms the NER tags into their integer representations using a `ClassLabel` mapping, ensuring the model can process these tags effectively.

- Unique NER tags are identified from both the training and testing datasets, creating a comprehensive set of labels for the model to recognize.

- Finally, we convert the pandas DataFrames into datasets compatible with our model training framework, adding the unique IDs and encoding the NER tags accordingly. This preparation is essential for efficient training and evaluation of the model.

In [46]:
from datasets import ClassLabel, Dataset

def add_id(example, idx):
    example["id"] = str(idx)
    return example

def encode_ner_tags(example):
    example["ner_tags"] = [ner_class_label.str2int(tag) for tag in example["ner_tags"]]
    return example

# Define all unique labels (tags) in the dataset
all_tags = set(tag for tags in train_coll["ner_tags"] for tag in tags)
all_tags.update(tag for tags in test_coll["ner_tags"] for tag in tags)

# Create ClassLabel for NER tags (sorted so that label ids are reproducible across runs)
ner_class_label = ClassLabel(names=sorted(all_tags))

train_dataset = Dataset.from_pandas(train_coll)
test_dataset = Dataset.from_pandas(test_coll)

train_dataset = train_dataset.map(add_id, with_indices=True)
test_dataset = test_dataset.map(add_id, with_indices=True)

# Convert NER tags to ClassLabel format
train_dataset = train_dataset.map(encode_ner_tags)
test_dataset = test_dataset.map(encode_ner_tags)

Map:   0%|          | 0/3716 [00:00<?, ? examples/s]

Map:   0%|          | 0/928 [00:00<?, ? examples/s]

Map:   0%|          | 0/3716 [00:00<?, ? examples/s]

Map:   0%|          | 0/928 [00:00<?, ? examples/s]
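
A Python `set` of strings has no stable iteration order between interpreter runs, so the tag-to-id mapping is only reproducible if the tag names are put into a fixed order first. The pure-Python sketch below mirrors what the `ClassLabel` mapping does, assuming a hypothetical tag set; `label2id` and `id2label` are illustrative names, not part of the notebook.

```python
# Hypothetical tag lists; the real ones come from the `ner_tags` columns.
train_tags = [["O", "B-MOUNT", "I-MOUNT"], ["O", "O"]]
test_tags = [["B-MOUNT", "O"]]

# Collect every tag seen in either split
all_tags = {tag for tags in train_tags + test_tags for tag in tags}

# Sorting fixes the tag -> id assignment across runs
label_names = sorted(all_tags)
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Encode the string tags as integers, one list per sentence
encoded = [[label2id[tag] for tag in tags] for tags in train_tags]
print(label_names)
print(encoded)
```

Keeping the same ordering at training and inference time matters because the saved model only stores integer class ids.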

Define the features of our datasets to ensure they are structured correctly for model training. The `Features` object specifies the structure, including sequences for `tokens` and `ner_tags`, along with a unique identifier `id`. The `cast` method is then applied to both the training and testing datasets to transform them according to this defined structure, ensuring consistency for training and evaluation.

In [47]:
from datasets import Features, Sequence
features = Features({
    'tokens': Sequence(feature=train_dataset.features['tokens'].feature),
    'ner_tags': Sequence(feature=ner_class_label),
    'id': train_dataset.features['id']
})

train_dataset = train_dataset.cast(features)
test_dataset = test_dataset.cast(features.copy())

Casting the dataset:   0%|          | 0/3716 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/928 [00:00<?, ? examples/s]

We load a pre-trained BERT tokenizer and model for Named Entity Recognition (NER). `AutoTokenizer` converts text into model inputs, and `AutoModelForTokenClassification` loads the model with a token-classification head. Starting from `"dslim/bert-base-NER"`, a BERT checkpoint that has already been fine-tuned for NER, gives the model a useful initialization, and `load_in_8bit=True` loads the weights in 8-bit precision, which reduces memory usage and enables training on limited hardware.

In [48]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

ner_model = AutoModelForTokenClassification.from_pretrained(
    "dslim/bert-base-NER",
    load_in_8bit=True
)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
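
The deprecation warning in the output asks for a `BitsAndBytesConfig` object passed through the `quantization_config` argument. A minimal sketch of that call follows, assuming a recent `transformers` release, an installed `bitsandbytes`, and a CUDA GPU; the helper name `load_quantized_ner_model` is hypothetical.

```python
MODEL_NAME = "dslim/bert-base-NER"

def load_quantized_ner_model(model_name=MODEL_NAME):
    """Load the token-classification model with 8-bit weights."""
    # Imported inside the function so that defining the helper
    # does not require transformers to be importable yet.
    from transformers import AutoModelForTokenClassification, BitsAndBytesConfig

    # Same 8-bit setting as `load_in_8bit=True`, expressed as a config object
    quantization_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForTokenClassification.from_pretrained(
        model_name,
        quantization_config=quantization_config,
    )
```

Calling `ner_model = load_quantized_ner_model()` should load the same quantized model without triggering the deprecation warning.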


We implement a function to tokenize the dataset and align the NER labels accordingly. The `tokenize_and_align_labels` function uses the tokenizer to process the `tokens` field, ensuring that the inputs are properly prepared for the model. It handles special tokens and aligns the corresponding NER labels to the tokenized outputs, ensuring each token has the correct label or is marked as `-100` for tokens that don't correspond to any word.

The function is then applied to both the training and testing datasets in a batched manner for efficiency. Finally, we set the format of the datasets for PyTorch, specifying that the relevant columns (`input_ids`, `attention_mask`, and `labels`) should be formatted as tensors, enabling compatibility with the model during training.

In [49]:
from transformers import DataCollatorForTokenClassification

# Tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        is_split_into_words=True,
        truncation=True,
        padding='max_length'  # or True for dynamic padding
    )

    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens have no word index
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Label for the first token of the word
                label_ids.append(label[word_idx])
            else:
                # Label for the subsequent tokens in a word
                label_ids.append(label[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply the function to the datasets
train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
test_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)

# Set the format for PyTorch tensors
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/3716 [00:00<?, ? examples/s]

Map:   0%|          | 0/928 [00:00<?, ? examples/s]
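
The alignment logic is easier to follow on a single sentence. The sketch below reproduces it without a tokenizer, assuming a hypothetical subword split and hypothetical label ids; the `label_all_subtokens` switch is an illustrative addition that shows the common alternative of labeling only the first subtoken of each word.

```python
IGNORE_INDEX = -100  # label value that the loss function skips

def align_labels(word_ids, word_labels, label_all_subtokens=True):
    """Expand word-level labels to token-level labels."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens and padding have no source word
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx or label_all_subtokens:
            # First subtoken of a word, or every subtoken if requested
            aligned.append(word_labels[word_idx])
        else:
            # Later subtokens are ignored when only the first is labeled
            aligned.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return aligned

# Hypothetical tokenization of ["Mount", "Kilimanjaro", "rises"]:
# [CLS] Mount Kil ##iman ##jar ##o rises [SEP] [PAD]
word_ids = [None, 0, 1, 1, 1, 1, 2, None, None]
word_labels = [0, 1, 2]  # hypothetical label ids, one per word

print(align_labels(word_ids, word_labels))
print(align_labels(word_ids, word_labels, label_all_subtokens=False))
```

With `label_all_subtokens=True` the result matches the notebook's function, where both branches assign the word's label to every subtoken.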

Now we can inspect the model architecture.

In [50]:
print(ner_model)

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear8bitLt(in_features=768, out_features=768, bias=True)
              (key): Linear8bitLt(in_features=768, out_features=768, bias=True)
              (value): Linear8bitLt(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear8bitLt(in_features=768, out_features=768, bias=True)
              (LayerNorm): Lay

We configure and fine-tune the NER model using LoRA (Low-Rank Adaptation) to improve performance while optimizing resource usage. The `LoraConfig` sets parameters such as rank, scaling, target layers, and dropout rate for the LoRA adapters.

The model is wrapped with these adapters, allowing for efficient training. A data collator is created to handle dynamic padding and batch formatting during training.

A custom logging callback is defined to log the training loss to Weights & Biases (WandB) at specified intervals, enabling real-time monitoring of the training process.

The `train_model` function initializes the training arguments, including batch sizes, number of epochs, and logging settings. It then sets up a `Trainer` instance with the model, datasets, and the custom data collator, and begins the training process. The training function is executed to fine-tune the model on the provided datasets.

In [52]:
from peft import LoraConfig, get_peft_model
from transformers import DataCollatorForTokenClassification, TrainingArguments, Trainer
from transformers import TrainerCallback

# Define LoRA configuration
lora_config = LoraConfig(
    r=4,  # LoRA rank
    lora_alpha=32,  # Scaling parameter
    target_modules=["query", "key", "value"],  # Layers for LoRA
    lora_dropout=0.1, # LoRA dropout rate
)

# Wrap the model with LoRA adapters
peft_model = get_peft_model(ner_model, lora_config)

# Create data_collator
data_collator = DataCollatorForTokenClassification(tokenizer)

class LoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and 'loss' in logs:
            # Log train_loss to WandB every logging_steps
            wandb.log({"train_loss": logs["loss"]})

# Updated train_model function
def train_model(model, train_dataset, test_dataset):
    """Fine-tune a quantized model with LoRA adapters."""
    training_args = TrainingArguments(
        remove_unused_columns=False,
        output_dir='/kaggle/working/results',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=128,
        warmup_steps=100,
        weight_decay=0.01,
        logging_dir='/kaggle/working/logs',
        fp16=False,
        logging_steps=50,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        data_collator=data_collator,
        callbacks=[LoggingCallback()]
    )

    trainer.train()

# Run training function
train_model(peft_model, train_dataset, test_dataset)


| Step | Training Loss |
| --- | --- |
| 50 | 9.9719 |
| 100 | 7.945 |
| 150 | 2.4252 |
| 200 | 0.7381 |
| 250 | 0.3518 |
| 300 | 0.2347 |
| 350 | 0.1818 |
| 400 | 0.1576 |
| 450 | 0.1316 |
| 500 | 0.1266 |
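
To see why LoRA is cheap here, the arithmetic below counts the adapter parameters implied by the configuration (`r=4`, adapters on `query`, `key`, and `value`) and the printed architecture (12 layers, 768-dimensional projections). It is a back-of-the-envelope sketch that counts only the LoRA matrices and assumes the standard LoRA scaling of `lora_alpha / r`.

```python
hidden_size = 768   # in_features == out_features of query/key/value
num_layers = 12     # number of BertLayer blocks in the printed architecture
rank = 4            # r in LoraConfig
lora_alpha = 32     # lora_alpha in LoraConfig
target_modules = ("query", "key", "value")

# Each adapted layer adds A (rank x in) and B (out x rank), both without bias
params_per_adapter = rank * hidden_size + hidden_size * rank
adapted_layers = num_layers * len(target_modules)
lora_params = params_per_adapter * adapted_layers

# Frozen weights and biases of the same linear layers, for comparison
frozen_params = (hidden_size * hidden_size + hidden_size) * adapted_layers

# Factor applied to the low-rank update before it is added to the output
scaling = lora_alpha / rank

print(f"LoRA parameters: {lora_params:,}")
print(f"Frozen q/k/v parameters: {frozen_params:,}")
print(f"Ratio: {lora_params / frozen_params:.2%}")
print(f"Scaling: {scaling}")
```

The adapters add roughly 1% of the parameters of the projections they modify, which is what keeps fine-tuning affordable on a single Kaggle GPU.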


We set up the model saving process by creating a directory to store the trained model and tokenizer on Kaggle. The `os.makedirs` function ensures that the directory is created if it doesn't already exist.

The trained LoRA model and the tokenizer are saved to the specified path, allowing for easy retrieval later. Additionally, the WandB configuration is saved in a YAML file, providing a record of the training settings used during the model's fine-tuning.

To facilitate downloading, we create a compressed ZIP archive of the saved model directory, making it easier to manage and transfer the files after training is complete.

In [54]:
import os
import yaml

# Path to save the model on Kaggle
model_save_path = "/kaggle/working/trained_model"
os.makedirs(model_save_path, exist_ok=True)

# Saving the model and tokenizer
peft_model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Saving the W&B configuration in a YAML file
config_path = os.path.join(model_save_path, "wandb_config.yaml")
with open(config_path, "w") as f:
    yaml.dump(dict(wandb.config), f)

# Creating an archive for easy downloading
!zip -r /kaggle/working/trained_model.zip /kaggle/working/trained_model



updating: kaggle/working/trained_model/ (stored 0%)
updating: kaggle/working/trained_model/special_tokens_map.json (deflated 42%)
updating: kaggle/working/trained_model/vocab.txt (deflated 49%)
updating: kaggle/working/trained_model/README.md (deflated 66%)
updating: kaggle/working/trained_model/adapter_model.safetensors (deflated 7%)
updating: kaggle/working/trained_model/tokenizer.json (deflated 70%)
updating: kaggle/working/trained_model/wandb_config.yaml (deflated 61%)
updating: kaggle/working/trained_model/tokenizer_config.json (deflated 76%)
updating: kaggle/working/trained_model/adapter_config.json (deflated 52%)
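
The archive can also be built without a shell command. The sketch below uses `shutil.make_archive` from the standard library on a throwaway directory; the helper name `archive_directory` and the placeholder file are assumptions for illustration, not part of the notebook.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

def archive_directory(directory, archive_stem):
    """Zip `directory` into `<archive_stem>.zip` and return the archive path."""
    directory = Path(directory)
    return shutil.make_archive(
        str(archive_stem),
        "zip",
        root_dir=str(directory.parent),  # archive paths are relative to this folder
        base_dir=directory.name,         # only this folder is included
    )

# Demonstration on a temporary directory with a placeholder file
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "trained_model"
    model_dir.mkdir()
    (model_dir / "adapter_config.json").write_text("{}")

    archive_path = archive_directory(model_dir, Path(tmp) / "trained_model_archive")
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```

In the notebook, the equivalent call would take `model_save_path` as the directory and a path under `/kaggle/working` as the archive stem.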


We check for GPU availability using PyTorch. If a GPU is available, it is set as the device; otherwise, the CPU is used. This ensures that the model will leverage GPU acceleration if possible, enhancing training and inference speed.

The model is then moved to the selected device (GPU or CPU), allowing it to perform computations on the appropriate hardware. This step is crucial for optimizing performance, especially during model training and evaluation.

In [55]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Moving the model to GPU if available
model.to(device)


Using device: cuda


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): lora.Linear(
                (base_layer): Linear(in_features=768, out_features=768, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=

The model evaluation function computes the loss and accuracy on a given dataset. It uses a `DataLoader` to handle batching and leverages GPU for efficient computation. The evaluation process involves iterating through the dataset, predicting labels, and calculating the loss while ignoring special tokens.

The function tracks the total loss, correct predictions, and total predictions throughout the evaluation process. It also utilizes the `tqdm` library for visual progress updates and measures the total evaluation time.

Finally, the average loss and accuracy are calculated and displayed for both the training and test datasets, providing insights into the model's performance.

In [56]:
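import time

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

def evaluate_model(model, dataset, data_collator, batch_size=16):
    """Return the average loss and token-level accuracy (%) on `dataset`."""
    model.eval()  # Disable dropout during evaluation
    dataloader = DataLoader(dataset, collate_fn=data_collator, batch_size=batch_size)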
    
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0

    # Visualization using tqdm and tracking time
    start_time = time.time()
    for batch in tqdm(dataloader, desc="Evaluating", unit="batch"):
        # Moving data to GPU
        inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
        labels = batch["labels"].to(device)

        with torch.no_grad():
            outputs = model(**inputs, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            # Calculating Loss
            total_loss += loss.item()

            # Calculating Accuracy
            predictions = torch.argmax(logits, dim=-1)
            mask = labels != -100  # Ignoring special tokens
            correct_predictions += (predictions[mask] == labels[mask]).sum().item()
            total_predictions += mask.sum().item()

    # Calculating average loss and accuracy
    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions * 100

    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f"Evaluation completed in {elapsed_time:.2f} seconds.")
    
    return avg_loss, accuracy

# Performing calculations for the training and test sets
train_loss, train_accuracy = evaluate_model(model, train_dataset, data_collator)
print(f"Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy:.2f}%")

test_loss, test_accuracy = evaluate_model(model, test_dataset, data_collator)
print(f"Test Loss: {test_loss:.4f}, Test Accuracy: {test_accuracy:.2f}%")


Evaluating: 100%|██████████| 233/233 [01:02<00:00,  3.73batch/s]


Evaluation completed in 62.44 seconds.
Train Loss: 0.0599, Train Accuracy: 98.29%


Evaluating: 100%|██████████| 58/58 [00:15<00:00,  3.73batch/s]

Evaluation completed in 15.56 seconds.
Test Loss: 0.0601, Test Accuracy: 98.36%
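
The accuracy computation only counts positions whose label is not `-100`. A pure-Python sketch of the same masking on hypothetical predictions and labels, without tensors:

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the metric

def masked_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Token-level accuracy in percent over positions with a real label."""
    correct = 0
    total = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == ignore_index:
                continue  # special token or padding
            total += 1
            correct += int(pred == label)
    return 100.0 * correct / total if total else 0.0

# Hypothetical batch of two padded sequences
labels = [[-100, 0, 1, 1, -100], [-100, 2, 2, -100, -100]]
predictions = [[2, 0, 1, 2, 2], [0, 2, 2, 1, 1]]

print(masked_accuracy(predictions, labels))
```

Five positions carry a real label and four of them are predicted correctly; predictions at ignored positions have no effect on the result.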





The model reaches a token-level accuracy of 98.29% on the training set and 98.36% on the test set, with nearly identical losses (0.0599 and 0.0601). Because performance is consistent across both datasets, the model does not appear to be overfitted and seems to generalize well to unseen sentences. Note that token-level accuracy counts every labeled token, including those outside any mountain name, so entity-level metrics such as precision, recall, and F1 would give a more complete picture. Further refinement and additional data could improve accuracy and robustness.
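
Token-level accuracy can look high whenever most tokens carry the non-entity tag, so a per-label breakdown is a useful complement. The sketch below illustrates the effect with a degenerate predictor; the class balance of this dataset is not shown in the notebook, so the labels and proportions are purely hypothetical.

```python
from collections import Counter

def per_label_recall(predictions, labels):
    """Fraction of tokens of each true label that were predicted correctly."""
    correct, support = Counter(), Counter()
    for pred, label in zip(predictions, labels):
        support[label] += 1
        if pred == label:
            correct[label] += 1
    return {label: correct[label] / support[label] for label in support}

# Hypothetical flat token labels: 18 "O" tokens and 2 mountain tokens
labels = ["O"] * 18 + ["B-MOUNT", "I-MOUNT"]
# A degenerate model that predicts "O" everywhere
predictions = ["O"] * 20

accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
recall = per_label_recall(predictions, labels)

print(f"accuracy = {accuracy:.2f}")
print(recall)
```

In this toy case accuracy is 0.90 even though no mountain token is found, which is why per-label or entity-level scores are worth reporting alongside accuracy.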

In [64]:
wandb.finish()

Run history:

| Metric | History |
| --- | --- |
| train/epoch | ▁▂▂▃▃▄▄▅▅▆▆▇▇█ |
| train/global_step | ▁▁▂▂▂▂▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▇▇▇▇█ |
| train/grad_norm | ▄█▂▂▁▂▂▁▁▁▁▁▁ |
| train/learning_rate | ▄█▇▇▆▅▅▄▄▃▂▂▁ |
| train/loss | █▇▃▁▁▁▁▁▁▁▁▁▁ |
| train_loss | █▇▃▁▁▁▁▁▁▁▁▁▁ |

Run summary:

| Metric | Value |
| --- | --- |
| total_flos | 2920695406202880.0 |
| train/epoch | 3.0 |
| train/global_step | 699.0 |
| train/grad_norm | 1.97479 |
| train/learning_rate | 0.0 |
| train/loss | 0.0859 |
| train_loss | 1.61764 |
| train_runtime | 541.5028 |
| train_samples_per_second | 20.587 |
| train_steps_per_second | 1.291 |


Calling `wandb.finish()` closes the current W&B run. It finalizes logging so that all metrics, visualizations, and configurations are uploaded to the W&B server, which prevents data loss and makes the complete results available in the W&B dashboard.