# NER Notebook


This notebook is designed to be able to train a pre-trained model on an Nepali NER dataset. 

##### Sections:

There are four sections in this notebook:

1. Installations: this is where we do installation for relevant dependencies
2. Imports: here, we perform imports for all the dependencies needed
3. Utility Classes and Functions: here, we define utility classes and functions that will help us train
4. Training: Here, the actual training process is done

### NB: Please run the entire cells in the notebooks as they are. The only section that can be modified is the training section. The parts of the code that can be modified are clearly explained in the the training section. 


### 1. Installations

In [1]:
!pip install transformers
!pip install seqeval
!pip install ptvsd

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 12.2 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 38.4 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 46.0 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 45.0 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[K     |████████████████████████████████| 61 kB 526 kB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting

### 2. Imports

In [2]:
import argparse
import glob
import logging
import os
import random
from collections import defaultdict, Counter

import torch
import numpy as np

from spacy import displacy
from scipy.sparse import save_npz, load_npz
from seqeval.metrics import f1_score, precision_score, recall_score
from torch import LongTensor
from torch import nn, optim
from torch.nn import CrossEntropyLoss
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange

from transformers import (
    WEIGHTS_NAME,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoConfig,
    AutoTokenizer,
    AutoModel,
    AdamW,
    BertConfig,
    BertForTokenClassification,
    BertTokenizer,
    CamembertConfig,
    CamembertForTokenClassification,
    CamembertTokenizer,
    DistilBertConfig,
    DistilBertForTokenClassification,
    DistilBertTokenizer,
    RobertaConfig,
    RobertaForTokenClassification,
    RobertaTokenizer,
    XLMRobertaConfig,
    XLMRobertaForTokenClassification,
    XLMRobertaTokenizer,
    get_linear_schedule_with_warmup,
    get_constant_schedule_with_warmup,
)
from transformers import pipeline

try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

logger = logging.getLogger("Nep_NER_Log")
logging.basicConfig(level=logging.DEBUG)

MODEL_CLASSES = {
    "bert": (BertConfig, BertForTokenClassification, BertTokenizer),
    "roberta": (RobertaConfig, RobertaForTokenClassification, RobertaTokenizer),
    "distilbert": (DistilBertConfig, DistilBertForTokenClassification, DistilBertTokenizer),
    "camembert": (CamembertConfig, CamembertForTokenClassification, CamembertTokenizer),
    "xlmroberta": (XLMRobertaConfig, XLMRobertaForTokenClassification, XLMRobertaTokenizer),
}

### 3. Utility classes and functions 

Here, we write utility classes and functions that we will use for training. You can just run all the cells below. 

**PLEASE DO NOT MAKE ANY CHANGES IN THIS SECTION** 

We begin by writing custom datasets for our NER task

In [3]:
class InputExample(object):
    """A single training/test example for token classification."""

    def __init__(self, guid, words, labels):
        """Constructs a InputExample.
        Args:
            guid: Unique id for the example.
            words: list. The words of the sequence.
            labels: (Optional) list. The labels for each word of the sequence. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.words = words
        self.labels = labels

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_ids):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_ids = label_ids


Next, we define the train and evaluation functions

In [4]:
def train(args, train_dataset, model, tokenizer, labels, pad_token_label_id):
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
    else:
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        },
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    ]
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
    )

    # Check if saved optimizer or scheduler states exist
    if os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt")) and os.path.isfile(
            os.path.join(args.model_name_or_path, "scheduler.pt")
    ):
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
        )

    # Train!
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", len(train_dataset))
    logger.info("  Num Epochs = %d", args.num_train_epochs)
    logger.info("  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
    logger.info(
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        args.train_batch_size
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )
    logger.info("  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
    logger.info("  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if os.path.exists(args.model_name_or_path):
        # set global_step to gobal_step of last saved checkpoint from model path
        try:
            global_step = int(args.model_name_or_path.split("-")[-1].split("/")[0])
        except ValueError:
            global_step = 0
        epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
        steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

        logger.info("  Continuing training from checkpoint, will skip to saved global_step")
        logger.info("  Continuing training from epoch %d", epochs_trained)
        logger.info("  Continuing training from global step %d", global_step)
        logger.info("  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)

    tr_loss, logging_loss = 0.0, 0.0
    model.zero_grad()
    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    )
    set_seed(args)  # Added here for reproductibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1
                continue

            model.train()
            batch = tuple(t.to(args.device) for t in batch)
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
            if args.model_type != "distilbert":
                inputs["token_type_ids"] = (
                    batch[2] if args.model_type in ["bert", "xlnet"] else None
                )  # XLM and RoBERTa don"t use segment_ids

            outputs = model(**inputs)
            loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            loss.backward()

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

                scheduler.step()  # Update learning rate schedule
                optimizer.step()
                model.zero_grad()
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                            args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev")
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
                    if not os.path.exists(output_dir):
                        os.makedirs(output_dir)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training
                    model_to_save.save_pretrained(output_dir)
                    tokenizer.save_pretrained(output_dir)

                    torch.save(args, os.path.join(output_dir, "training_args.bin"))
                    logger.info("Saving model checkpoint to %s", output_dir)

                    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
                    torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
                    logger.info("Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
                epoch_iterator.close()
                break
        if args.max_steps > 0 and global_step > args.max_steps:
            train_iterator.close()
            break

    if args.local_rank in [-1, 0]:
        tb_writer.close()

    return global_step, tr_loss / global_step

In [5]:
def evaluate(args, model, tokenizer, labels, pad_token_label_id, mode, prefix=""):
    eval_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode=mode)

    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly
    eval_sampler = SequentialSampler(eval_dataset) if args.local_rank == -1 else DistributedSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!
    logger.info("***** Running evaluation %s *****", prefix)
    logger.info("  Num examples = %d", len(eval_dataset))
    logger.info("  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0
    preds = None
    out_label_ids = None
    model.eval()
    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        batch = tuple(t.to(args.device) for t in batch)

        with torch.no_grad():
            inputs = {"input_ids": batch[0], "attention_mask": batch[1], "labels": batch[3]}
            if args.model_type != "distilbert":
                inputs["token_type_ids"] = (
                    batch[2] if args.model_type in ["bert", "xlnet"] else None
                )  # XLM and RoBERTa don"t use segment_ids
            outputs = model(**inputs)
            tmp_eval_loss, logits = outputs[:2]

            if args.n_gpu > 1:
                tmp_eval_loss = tmp_eval_loss.mean()  # mean() to average on multi-gpu parallel evaluating

            eval_loss += tmp_eval_loss.item()
        nb_eval_steps += 1
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs["labels"].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

    eval_loss = eval_loss / nb_eval_steps
    preds = np.argmax(preds, axis=2)

    label_map = {i: label for i, label in enumerate(labels)}

    out_label_list = [[] for _ in range(out_label_ids.shape[0])]
    preds_list = [[] for _ in range(out_label_ids.shape[0])]

    for i in range(out_label_ids.shape[0]):
        for j in range(out_label_ids.shape[1]):
            if out_label_ids[i, j] != pad_token_label_id:
                out_label_list[i].append(label_map[out_label_ids[i][j]])
                preds_list[i].append(label_map[preds[i][j]])

    results = {
        "loss": eval_loss,
        "precision": precision_score(out_label_list, preds_list),
        "recall": recall_score(out_label_list, preds_list),
        "f1": f1_score(out_label_list, preds_list),
    }

    logger.info("***** Eval results %s *****", prefix)
    for key in sorted(results.keys()):
        logger.info("  %s = %s", key, str(results[key]))

    return results, preds_list


Next, we define functions that will help us load and preprocess the examples. 

In [6]:
def load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode):
    if args.local_rank not in [-1, 0] and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Load data features from cache or dataset file
    cached_features_file = os.path.join(
        args.data_dir,
        "cached_{}_{}_{}".format(
            mode, list(filter(None, args.model_name_or_path.split("/"))).pop(), str(args.max_seq_length)
        ),
    )
    if os.path.exists(cached_features_file) and not args.overwrite_cache:
        logger.info("Loading features from cached file %s", cached_features_file)
        features = torch.load(cached_features_file)
    else:
        logger.info("Creating features from dataset file at %s", args.data_dir)
        examples = read_examples_from_file(args.data_dir, mode)
        features = convert_examples_to_features(
            examples,
            labels,
            args.max_seq_length,
            tokenizer,
            cls_token_at_end=bool(args.model_type in ["xlnet"]),
            # xlnet has a cls token at the end
            cls_token=tokenizer.cls_token,
            cls_token_segment_id=2 if args.model_type in ["xlnet"] else 0,
            sep_token=tokenizer.sep_token,
            sep_token_extra=bool(args.model_type in ["roberta"]),
            # roberta uses an extra separator b/w pairs of sentences, cf. github.com/pytorch/fairseq/commit/1684e166e3da03f5b600dbb7855cb98ddfcd0805
            pad_on_left=bool(args.model_type in ["xlnet"]),
            # pad on the left for xlnet
            pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
            pad_token_segment_id=4 if args.model_type in ["xlnet"] else 0,
            pad_token_label_id=pad_token_label_id,
        )
        if args.local_rank in [-1, 0]:
            logger.info("Saving features into cached file %s", cached_features_file)
            torch.save(features, cached_features_file)

    if args.local_rank == 0 and not evaluate:
        torch.distributed.barrier()  # Make sure only the first process in distributed training process the dataset, and the others will use the cache

    # Convert to Tensors and build dataset
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_ids for f in features], dtype=torch.long)

    dataset = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
    return dataset


In [7]:
def read_examples_from_file(data_dir, mode):
    file_path = os.path.join(data_dir, "{}.txt".format(mode))
    guid_index = 1
    examples = []
    with open(file_path, encoding="utf-8") as f:
        words = []
        labels = []
        for line in f:
            line = line.strip()
            if len(line) < 2  or line == "\n":
                print(line, words)
                if words:
                    examples.append(InputExample(guid="{}-{}".format(mode, guid_index), words=words, labels=labels))
                    guid_index += 1
                    words = []
                    labels = []
            else:
                splits = line.split(" ")
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1].replace("\n", ""))
                else:
                    # Examples could have no label for mode = "test"
                    labels.append("O")
        if words:
            examples.append(InputExample(guid="{}-{}".format(mode, guid_index), words=words, labels=labels))
    return examples


def convert_examples_to_features(
    examples,
    label_list,
    max_seq_length,
    tokenizer,
    cls_token_at_end=False,
    cls_token="[CLS]",
    cls_token_segment_id=1,
    sep_token="[SEP]",
    sep_token_extra=False,
    pad_on_left=False,
    pad_token=0,
    pad_token_segment_id=0,
    pad_token_label_id=-100,
    sequence_a_segment_id=0,
    mask_padding_with_zero=True,
):
    """ Loads a data file into a list of `InputBatch`s
        `cls_token_at_end` define the location of the CLS token:
            - False (Default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
            - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
        `cls_token_segment_id` define the segment id associated to the CLS token (0 for BERT, 2 for XLNet)
    """

    label_map = {label: i for i, label in enumerate(label_list)}

    features = []
    for (ex_index, example) in enumerate(examples):
        #print(ex_index, len(example.words))
        if ex_index % 10000 == 0:
            logger.info("Writing example %d of %d", ex_index, len(examples))

        tokens = []
        label_ids = []
        for word, label in zip(example.words, example.labels):
            word_tokens = tokenizer.tokenize(word)
            tokens.extend(word_tokens)
            # Use the real label id for the first token of the word, and padding ids for the remaining tokens
            label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))

        # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
        special_tokens_count = 3 if sep_token_extra else 2
        if len(tokens) > max_seq_length - special_tokens_count:
            tokens = tokens[: (max_seq_length - special_tokens_count)]
            label_ids = label_ids[: (max_seq_length - special_tokens_count)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids:   0   0   0   0  0     0   0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens += [sep_token]
        label_ids += [pad_token_label_id]
        if sep_token_extra:
            # roberta uses an extra separator b/w pairs of sentences
            tokens += [sep_token]
            label_ids += [pad_token_label_id]
        segment_ids = [sequence_a_segment_id] * len(tokens)

        if cls_token_at_end:
            tokens += [cls_token]
            label_ids += [pad_token_label_id]
            segment_ids += [cls_token_segment_id]
        else:
            tokens = [cls_token] + tokens
            label_ids = [pad_token_label_id] + label_ids
            segment_ids = [cls_token_segment_id] + segment_ids

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding_length = max_seq_length - len(input_ids)
        if pad_on_left:
            input_ids = ([pad_token] * padding_length) + input_ids
            input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask
            segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids
            label_ids = ([pad_token_label_id] * padding_length) + label_ids
        else:
            input_ids += [pad_token] * padding_length
            input_mask += [0 if mask_padding_with_zero else 1] * padding_length
            segment_ids += [pad_token_segment_id] * padding_length
            label_ids += [pad_token_label_id] * padding_length

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        try:
            assert len(label_ids) == max_seq_length
        except:
            continue

        if ex_index < 5:
            logger.info("*** Example ***")
            logger.info("guid: %s", example.guid)
            logger.info("tokens: %s", " ".join([str(x) for x in tokens]))
            logger.info("input_ids: %s", " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s", " ".join([str(x) for x in input_mask]))
            logger.info("segment_ids: %s", " ".join([str(x) for x in segment_ids]))
            logger.info("label_ids: %s", " ".join([str(x) for x in label_ids]))

        features.append(
            InputFeatures(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids, label_ids=label_ids)
        )
    return features

def get_labels(path):
    if path:
        with open(path, "r") as f:
            labels = f.read().splitlines()
        if "O" not in labels:
            labels = ["O"] + labels
        return labels
    else:
        return ["O", "B-DATE", "I-DATE", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

Next, we define a function to set the seed and the function to start the actual training

In [8]:
def set_seed(args):
    """Set seed for training"""
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

In [9]:
def start_training(args):
    """
    Start the actual training process
    """
    if (
        os.path.exists(args.output_dir)
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
    ):
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
                args.output_dir
            )
        )

    # Setup distant debugging if needed
    if args.server_ip and args.server_port:
        # Distant debugging - see https://code.visualstudio.com/docs/python/debugging#_attach-to-a-local-script
        import ptvsd

        print("Waiting for debugger attach")
        ptvsd.enable_attach(address=(args.server_ip, args.server_port), redirect_output=True)
        ptvsd.wait_for_attach()

    # Setup CUDA, GPU & distributed training
    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
        args.n_gpu = torch.cuda.device_count()
    else:  # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)
        torch.distributed.init_process_group(backend="nccl")
        args.n_gpu = 1
    args.device = device

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
    )

    # Set seed
    set_seed(args)

    # Prepare CONLL-2003 task
    labels = get_labels(args.labels)
    num_labels = len(labels)
    # Use cross entropy ignore index as padding label id so that only real label ids contribute to the loss later
    pad_token_label_id = CrossEntropyLoss().ignore_index

    # Load pretrained model and tokenizer
    if args.local_rank not in [-1, 0]:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    args.model_type = args.model_type.lower()
    config_class, model_class, tokenizer_class = AutoConfig, AutoModelForTokenClassification, AutoTokenizer #MODEL_CLASSES[args.model_type]

    config = config_class.from_pretrained(
    args.config_name if args.config_name else args.model_name_or_path,
    num_labels=num_labels,
    id2label={str(i): label for i, label in enumerate(labels)},
    label2id={label: i for i, label in enumerate(labels)},
    cache_dir=args.cache_dir if args.cache_dir else None,
    )
    tokenizer = tokenizer_class.from_pretrained(
        args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
        #do_lower_case=args.do_lower_case,
        cache_dir=args.cache_dir if args.cache_dir else None,
        #use_fast=args.use_fast,
    )
    model = model_class.from_pretrained(
        args.model_name_or_path,
        from_tf=bool(".ckpt" in args.model_name_or_path),
        config=config,
        cache_dir=args.cache_dir if args.cache_dir else None,
    )

    if args.local_rank == 0:
        torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

    model.to(args.device)

    logger.info("Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode="train")
        #train_dataset = load_examples(args, mode="train")
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
        #global_step, tr_loss = train_ner(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Fine-tuning
    if args.do_finetune:
        tokenizer = tokenizer_class.from_pretrained(args.input_dir, do_lower_case=args.do_lower_case)
        model = model_class.from_pretrained(args.input_dir)
        model.to(args.device)
        result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test")
        train_dataset = load_and_cache_examples(args, tokenizer, labels, pad_token_label_id, mode="train")

        # train_dataset = load_examples(args, mode="train")
        global_step, tr_loss = train(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
        # global_step, tr_loss = train_ner(args, train_dataset, model, tokenizer, labels, pad_token_label_id)
        logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
    if (args.do_train or args.do_finetune) and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
        # Create output directory if needed
        if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(args.output_dir)

        logger.info("Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training
        model_to_save.save_pretrained(args.output_dir)
        tokenizer.save_pretrained(args.output_dir)

        # Good practice: save your training arguments together with the trained model
        torch.save(args, os.path.join(args.output_dir, "training_args.bin"))

    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            )
            logging.getLogger("pytorch_transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
        logger.info("Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            model = model_class.from_pretrained(checkpoint)
            model.to(args.device)
            result, _ = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="dev", prefix=global_step)
            if global_step:
                result = {"{}_{}".format(global_step, k): v for k, v in result.items()}
            results.update(result)
        output_eval_file = os.path.join(args.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            for key in sorted(results.keys()):
                writer.write("{} = {}\n".format(key, str(results[key])))

    if args.do_predict and args.local_rank in [-1, 0]:
        tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
        model = model_class.from_pretrained(args.output_dir)
        model.to(args.device)
        result, predictions = evaluate(args, model, tokenizer, labels, pad_token_label_id, mode="test")
        # Save results
        output_test_results_file = os.path.join(args.output_dir, "test_results.txt")
        with open(output_test_results_file, "w") as writer:
            for key in sorted(result.keys()):
                writer.write("{} = {}\n".format(key, str(result[key])))
        # Save predictions
        output_test_predictions_file = os.path.join(args.output_dir, "test_predictions.txt")
        with open(output_test_predictions_file, "w") as writer:
            with open(os.path.join(args.data_dir, "test.txt"), "r") as f:
                example_id = 0
                for line in f:
                    if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                        writer.write(line)
                        if not predictions[example_id]:
                            example_id += 1
                    elif predictions[example_id]:
                        output_line = line.split()[0] + " " + predictions[example_id].pop(0) + "\n"
                        writer.write(output_line)
                    else:
                        logger.warning("Maximum sequence length exceeded: No prediction for '%s'.", line.split()[0])

    logger.info(results)


### 4. Training

Here, we perform the actual training process after defining the training arguments. 

In [10]:
import argparse

In [11]:
def get_args():
    """
    Get training arguments
    """
    parser = argparse.ArgumentParser()
    # Required parameters
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        help="The input data dir. Should contain the training files for the CoNLL-2003 NER task.",
    )
    parser.add_argument(
        "--model_type",
        default=None,
        type=str,
        #help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()),
    )
    parser.add_argument(
        "--model_name_or_path",
        default=None,
        type=str,
        help="Path to pre-trained model or shortcut name selected in the list: " + ", ",
    )
    parser.add_argument(
        "--input_dir",
        default=None,
        type=str,
        required=False,
        help="The input model directory.",
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        type=str,
        help="The output directory where the model predictions and checkpoints will be written.",
    )

    # Other parameters
    parser.add_argument(
        "--labels",
        default="",
        type=str,
        help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.",
    )
    parser.add_argument(
        "--config_name", default="", type=str, help="Pretrained config name or path if not the same as model_name"
    )
    parser.add_argument(
        "--tokenizer_name",
        default="",
        type=str,
        help="Pretrained tokenizer name or path if not the same as model_name",
    )
    parser.add_argument(
        "--cache_dir",
        default="",
        type=str,
        help="Where do you want to store the pre-trained models downloaded from s3",
    )
    parser.add_argument(
        "--max_seq_length",
        default=128,
        type=int,
        help="The maximum total input sequence length after tokenization. Sequences longer "
        "than this will be truncated, sequences shorter will be padded.",
    )
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_finetune", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.")
    parser.add_argument("--do_predict", action="store_true", help="Whether to run predictions on the test set.")
    parser.add_argument(
        "--evaluate_during_training",
        action="store_true",
        help="Whether to run evaluation during training at each logging step.",
    )
    parser.add_argument(
        "--do_lower_case", action="store_true", help="Set this flag if you are using an uncased model."
    )

    parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.")
    parser.add_argument(
        "--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for evaluation."
    )
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help="Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.")
    parser.add_argument(
        "--num_train_epochs", default=3.0, type=float, help="Total number of training epochs to perform."
    )
    parser.add_argument(
        "--max_steps",
        default=-1,
        type=int,
        help="If > 0: set total number of training steps to perform. Override num_train_epochs.",
    )
    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")

    parser.add_argument("--logging_steps", type=int, default=500, help="Log every X updates steps.")
    parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.")
    parser.add_argument(
        "--eval_all_checkpoints",
        action="store_true",
        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number",
    )
    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
    parser.add_argument(
        "--overwrite_output_dir", action="store_true", help="Overwrite the content of the output directory"
    )
    parser.add_argument(
        "--overwrite_cache", action="store_true", help="Overwrite the cached training and evaluation sets"
    )
    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")

    parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank")
    parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
    parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
    
    return parser.parse_known_args()

**CHANGES CAN BE MADE HERE:**

Note that the argments with comments may need to be modified. The remaining arguments can be left as they are, as these are good defaults. 

Hence, you should start by only supplying the data directory and output directory. 

In [12]:
!git clone https://github.com/dadelani/nepali-ner.git

Cloning into 'nepali-ner'...
remote: Enumerating objects: 96, done.[K
remote: Counting objects: 100% (96/96), done.[K
remote: Compressing objects: 100% (73/73), done.[K
remote: Total 96 (delta 18), reused 86 (delta 12), pack-reused 0[K
Unpacking objects: 100% (96/96), done.


# Training on NAAMII NER

In [13]:
data_path = '/content/nepali-ner/data/labeled/naami_ner'

BERT model training:

In [17]:
args, _ = get_args()
args.data_dir = data_path  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "naamii-ner" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "bert-base-multilingual-cased"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = True
args.do_eval = True
args.do_predict = True

In [18]:
# confirm your cuda devices before setting this command
#!export CUDA_VISIBLE_DEVICES=1,2,3
start_training(args)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:filelock:Attempting to acquire lock 139620621130448 on /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c.lock
DEBUG:filelock:Lock 139620621130448 acquired on /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c.lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggin

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

DEBUG:filelock:Attempting to release lock 139620621130448 on /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c.lock
DEBUG:filelock:Lock 139620621130448 released on /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c.lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:filelock:Attempting to acquire lock 139629180218960 on /root/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f.lock
DEBUG:filelock:Lock 139629180218960 acquired on /root/.cache/huggingface/transformers

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

DEBUG:filelock:Attempting to release lock 139629180218960 on /root/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f.lock
DEBUG:filelock:Lock 139629180218960 released on /root/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f.lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggin

Downloading:   0%|          | 0.00/972k [00:00<?, ?B/s]

DEBUG:filelock:Attempting to release lock 139620620987920 on /root/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29.lock
DEBUG:filelock:Lock 139620620987920 released on /root/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29.lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/tokenizer.json HTTP/1.1" 200 0
DEBUG:filelock:Attempting to acquire lock 139620621131216 on /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24.lock
DEBUG:filelock:Lock 139620621131216 acquired on /root/.cache/huggingface/transformers/46880f

Downloading:   0%|          | 0.00/1.87M [00:00<?, ?B/s]

DEBUG:filelock:Attempting to release lock 139620621131216 on /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24.lock
DEBUG:filelock:Lock 139620621131216 released on /root/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b14d4eda0784d.b33e51591f94f17c238ee9b1fac75b96ff2678cbaed6e108feadb3449d18dc24.lock
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/added_tokens.json HTTP/1.1" 404 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/special_tokens_map.json HTTP/1.1" 404 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib

Downloading:   0%|          | 0.00/681M [00:00<?, ?B/s]

DEBUG:filelock:Attempting to release lock 139620612383440 on /root/.cache/huggingface/transformers/0a3fd51713dcbb4def175c7f85bddc995d5976ce1dde327f99104e4d33069f17.aa7be4c79d76f4066d9b354496ea477c9ee39c5d889156dd1efb680643c2b052.lock
DEBUG:filelock:Lock 139620612383440 released on /root/.cache/huggingface/transformers/0a3fd51713dcbb4def175c7f85bddc995d5976ce1dde327f99104e4d33069f17.aa7be4c79d76f4066d9b354496ea477c9ee39c5d889156dd1efb680643c2b052.lock
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task 

 ['फेरि', 'तिनै', 'पाकिस्तानी', 'हरू', 'शेख', 'मुजिबर', 'रहमान', 'लाई', '‘', 'अलगाववादी', '’', 'भन्छन्', ',', 'जबकि', 'बंगलादेश', 'मा', 'राष्ट्र', 'का', 'संस्थापक', 'पिता', 'वा', 'बंगबन्धु', 'मानिन्छ', '।']
 ['सिक्किम', 'को', 'भारत', 'मा', 'विलयन', 'किन', 'राष्ट्रघात', ',', 'हस्तक्षेप', 'र', 'विस्तारवाद', 'हो', 'तर', 'तिब्बत', 'चीन', 'चाहिं', 'हैन', 'एकता', '?']
 ['दोस्रो', 'विश्वयुद्ध', 'ले', 'कोरिया', 'लाई', 'मात्र', 'हैन', ',', 'जर्मनी', 'पनि', 'दुई', 'राज्य', 'मा', 'बाँडे', 'को', 'थियो', '।']
 ['पूर्वी', 'जर्मनी', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'सँगै', 'फेरि', 'जर्मन', 'एकीकरण', 'भयो', '।']
 ['शायद', 'उत्तरकोरिया', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'का', 'दिन', 'कोरिया', 'पनि', 'एकीकरण', 'हुनेछ', '।']
 ['तर', ',', 'यी', 'उदाहरण', 'बाट', 'उठ्ने', 'केही', 'प्रश्न', 'छन्-', 'पूर्वी', 'र', 'पश्चिम', 'जर्मनी', 'दुई', 'फरक', '-', 'राज्य', 'हुँदा', 'के', 'को', 'राष्ट्रियता', 'पनि', 'थियो', 'वा', 'एउटै', '?']
 ['कोरिया', 'अहिले', 'उत्तर', 'र', 'दक्षिण', 'दुई', 'वटा', 'हुँदा', 'यी', '‘

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_train_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running training *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Num Epochs = 10
INFO:Nep_NER_Log:  Instantaneous batch size per GPU = 32
INFO:Nep_NER_Log:  Total train batch size (w. parallel, distributed & accumulation) = 32
INFO:Nep_NER_Log:  Gradient Accumulation steps = 1
INFO:Nep_NER_Log:  Total optimization steps = 40
Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.92s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03,  1.81s/it][A
Iteration:  75%|███████▌  | 3/4 [00:05<00:01,  1.78s/it][A
Iteration: 100%|██████████| 4/4 [00:05<00:00,  1.45s/it]
Epoch:  10%|█         | 1/10 [00:05<00:52,  5.80s/it]
Iteration:   0%|          | 0/4 [00:00<?, ?it/s][A
Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.74s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03, 

 ['मलेसिया', 'का', 'नेपाली', 'व्यवसायी', 'बिरु', 'पाठक', 'ले', 'थप', 'राहत', 'पठाउन', 'लागि', 'सहयोग', 'रकम', 'संकलन', 'भै', 'रहेको', 'जानकारी', 'दिए', '।']
 ['यसै', 'बिच', ',', 'भुकम्प', 'मा', 'परी', 'काठमाडौ', 'श्रीमती', 'र', 'छोरी', 'गुमाए', 'का', 'नागी', '३', 'पाचथर', 'रत्न', 'मेयाङ्बो', 'को', 'परिवार', 'लाई', 'राहत', 'बितरण', 'गर्न', 'नेपाल', 'पुगे', 'नेपाली', 'व्यवसायी', 'समाज', 'मलेसिया', 'टोली', 'ले', '२०', 'हजार', 'रुपैया', 'सहयोग', 'गरे', 'छ', '।']
 ['काठमाडैं', 'मा', 'खाजाघर', 'संचालन', 'गरे', 'का', 'मेयाङ्बो', 'की', 'श्रीमती', 'कमला', 'र', '८', 'बर्षिया', 'छोरी', 'रक्षा', 'को', 'बैशाख', '१२', 'भुकम्प', 'परी', 'ज्यान', 'गए', 'थियो', '।']
 ['मृतक', 'का', 'परिवार', 'लाई', 'भेटेर', 'आज', '२०', 'हजार', 'रुपैंया', 'हस्तान्तरण', 'गरे', 'को', 'समाज', 'सदस्य', 'देवेन्द्र', 'खापुङ', 'ले', 'बताए', '।']
 ['भूकम्प', 'पीडित', 'लाई', 'राहत', 'सामग्री', ',', 'घाइते', 'को', 'उपचार', 'तथा', 'स्रोत', 'र', 'साधन', 'जुटाउने', 'मुख्य', 'लक्ष्य', 'रहे', 'उद्धार', 'टोली', 'का', 'संयोजक', 'डा', '.'

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_dev_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 13/13 [00:02<00:00,  5.43it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.5194274028629857
INFO:Nep_NER_Log:  loss = 0.4196198227313849
INFO:Nep_NER_Log:  precision = 0.5059760956175299
INFO:Nep_NER_Log:  recall = 0.5336134453781513
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled/naami_ner
INFO:Nep_NER_Log:Writing example 0 of 310
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-1
INFO:Nep_NER_Log:tokens: [CLS] वर्ष म ##ना ##इ ##र ##ह ##ँ ##दा नेपाल आ ##उने पर ##्य ##टक ले प ##टक [UNK] वा ##ता ##वरण क ##सरी बना ##उने भन्ने सन् ##दर ##्भ मा काम गर्न ##ु आवश्यक छ । [SEP]
INFO:Nep_NER_Log:input_ids: 101 24332 889 16380 34231 1154

 ['वर्ष', 'मनाइरहँदा', 'नेपाल', 'आउने', 'पर्यटक', 'ले', 'पटक', '–', 'वातावरण', 'कसरी', 'बनाउने', 'भन्ने', 'सन्दर्भ', 'मा', 'काम', 'गर्नु', 'आवश्यक', 'छ', '।']
 ['किन', 'की', 'नेपाल', 'को', 'समृद्धि', 'आधार', 'नै', 'पर्यटन', 'उद्योग', 'हो', '।']
 ['सडक', 'यातायात', 'का', 'कारण', 'परम्परागत', 'पदयात्रा', 'मार्ग', 'लोप', 'हुने', 'अवस्था', 'मा', 'पुगे', 'छन्', '।']
 ['एससीटी', 'ले', 'युनियन', 'पे', 'इन्टरनेशन', 'सँग', 'सहकार्य', 'सुरु', 'गरि', 'अत्याधुनिक', '‘', 'कन्ट्याक्ट', 'लेस', '’', 'कार्ड', 'को', 'लञ्चिङ', 'गरे', 'छ', '।']
 ['एससीटी', 'का', 'अध्यक्ष', 'देबिप्रकाश', 'भट्टचन', 'ले', 'नेपाली', 'कार्ड', 'र', 'अत्याधुनिक', 'सुमेत', 'भएका', 'प्रयोगकर्ता', 'लाई', 'यो', 'प्रयोग', 'गर्न', 'आग्रह', 'गरे', '।']
 ['यस', 'मा', 'प्रभु', 'ग्रुप', 'का', 'साथै', 'आइएमइ', 'समुह', 'र', 'हिमालयन', 'बैंक', 'ले', 'समेत', 'लगानी', 'गरे', 'छन्', '।']
 ['यस', 'का', 'लागि', 'मुख्य', 'कोर', 'नेटवर्क', 'काठमाडौं', 'र', 'बुटवल', 'मा', 'रहने', 'छ', '।']
 ['नेपाल', 'टेलिकम', 'ले', 'मुलुक', 'मै', 'पहिलो', 'पटक', 'म

INFO:Nep_NER_Log:guid: test-4
INFO:Nep_NER_Log:tokens: [CLS] एस ##सी ##टी ले य ##ुन ##ियन प ##े इन ##्टर ##ने ##शन स ##ँ ##ग स ##ह ##कार ##्य स ##ुरु ग ##र ##ि अ ##त्या ##ध ##ुन ##िक [UNK] क ##न् ##ट ##्या ##क ##्ट ले ##स [UNK] का ##र्ड को ल ##ञ ##् ##च ##ि ##ङ गरे छ । [SEP]
INFO:Nep_NER_Log:input_ids: 101 55046 27117 58580 57865 890 45649 31332 885 11554 30114 54071 13466 61820 898 28462 19741 898 17110 25561 14251 898 56902 867 11549 12878 851 76752 27694 45649 13671 100 865 41013 14835 17279 12151 25695 57865 13432 100 11081 52905 11267 893 111203 20429 16940 12878 81489 79263 871 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

# Training on wikiann_np

In [16]:
data_path = '/content/nepali-ner/data/labeled/wikiann_ne'

BERT model training:

In [17]:
args, _ = get_args()
args.data_dir = data_path  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "wikiann_np" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "bert-base-multilingual-cased"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = True
args.do_eval = True
args.do_predict = True

In [18]:
# confirm your cuda devices before setting this command
#!export CUDA_VISIBLE_DEVICES=1,2,3
start_training(args)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-b

 ['राष्ट्रीय', 'प्रौद्योगिकी', 'संस्थान', ',', 'त्रिपुरा']
 ['२०३०', 'सालको', 'मदन', 'पुरस्कार', 'पाएका', 'थिए।']
 ['द', 'रोलिङ्ग', 'स्टोन्स']
 ['उहाँको', 'जन्म', 'अछाम', 'जिल्लामा', 'भएको', 'हो|']
 ["'", "''", 'रोहित', 'शर्मा', "''", "'"]
 ['एसियाली', 'क्रिकेट', 'परिषद']
 ['तिपालेटा', 'गा.वि.स', '.']
 ['महेन्द्रनगर', 'कञ्चनपुर', 'जिल्लाको', 'एक', 'शहर।']
 ["'", "''", 'भानुमती', "''", "'", 'नेपालको', 'पश्चिमाञ्चल', 'विकास', 'क्षेत्र', 'अन्तर्गत', 'गण्डकी', 'अञ्चलमा', 'पर्ने', 'तनहुँ', 'जिल्लामा', 'अवस्थित', 'एक', 'गाउँ', 'विकास', 'समिति', 'हो', '।']
 ['संयुक्त', 'राज्य', 'अमेरिकामा', 'पर्ने', 'एक', 'राज्य।']
 ['इजिप्ट]]', ',', 'लिबिया', ',', 'ट्युनिसिया', ',', 'अल्जेरिया']
 ['**', "''", 'श्रीलंकाका', 'विश्व', 'सम्पदा', 'क्षेत्रहरूको', 'सूची', "''"]
 ['उनी', 'प्राकृतिक', 'दृश्यहरूकाे', 'वर्णन', 'गर्न', 'पनि', 'दक्ष', 'रहेका', 'पाइन्छन्', '।', 'पूर्वीय', 'अाचार्यहरूले', 'माघलार्इ', 'उपमामा', 'कालिदास', 'सरह', ',', 'अर्थगाैरवमा', 'महाकाव्यकार', 'भारवि', 'सरह', 'र', 'पदलालित्यमा', 'दण्डी',

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/wikiann_ne/cached_train_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running training *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Num Epochs = 10
INFO:Nep_NER_Log:  Instantaneous batch size per GPU = 32
INFO:Nep_NER_Log:  Total train batch size (w. parallel, distributed & accumulation) = 32
INFO:Nep_NER_Log:  Gradient Accumulation steps = 1
INFO:Nep_NER_Log:  Total optimization steps = 40
Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.97s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03,  1.79s/it][A
Iteration:  75%|███████▌  | 3/4 [00:05<00:01,  1.75s/it][A
Iteration: 100%|██████████| 4/4 [00:05<00:00,  1.42s/it]
Epoch:  10%|█         | 1/10 [00:05<00:51,  5.70s/it]
Iteration:   0%|          | 0/4 [00:00<?, ?it/s][A
Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.68s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03,

 ['सन्', '१९०५-१९०७', 'को', 'प्रथम', 'रुसी', 'पुँजीजीवी', 'प्रजातन्त्रवादी', 'क्रान्ति']
 ['सिसहनिया', ',', 'दाङ']
 ["'", "''", 'ब्रसेल्स', "''", "'", 'बेल्जियमको', 'राजधानी', 'शहर', 'हो', '।']
 ['नेपाली', 'सैनिक', 'विमान', 'सेवा']
 ['मजदुरहरूको', 'पार्टी', '(', 'उरुग्वे', ')']
 ["'", "''", 'गिरिजा', 'प्रसाद', 'कोइराला', "''", "'", '(', '४', '/', '५', ')', '(', '१९२५-२०१०', ')']
 ['क्विन्सल्याण्ड', 'न्यू', 'साउथ', 'वेल्स', ',']
 ["'", "''", 'इकोलोजी', "''", "'"]
 ['====संघीय', 'लोकतान्त्रिक', 'गणतन्त्रमा', 'भएका', 'प्रधानमन्त्रीहरु', '(', '२००८-हालसम्म', ')', '====']
 ['स्वतन्त्र', 'प्रजातान्त्रिक', 'संघ']
 ['तृतीय', '(', 'कम्युनिष्ट', ')', 'इन्टरनेसनल']
 ['उत्कल', 'यूनिभर्सिटी', 'अफ', 'कल्चर', 'भुवनेश्वर']
 ['*ब्रे', 'वायटले', 'रोमन', 'रेन्ज', 'लाई', 'हरायो']
 ['====', 'द', 'अंडरटेकर', 'संग', 'लामो', 'भ्कगडा', '====']
 ['२०४५', 'सालको', 'मदन', 'पुरस्कार', 'प्राप्त', 'गरेका', 'थिए।']
 ['महेन्द्र', 'वीर', 'विक्रम', 'शाह']
 ['डिल्लीरमण', 'रेग्मी', '-', 'गृह', 'मन्त्री']
 ['भारतीय', 'जनता

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/wikiann_ne/cached_dev_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 13/13 [00:02<00:00,  5.41it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.7656250000000001
INFO:Nep_NER_Log:  loss = 0.347856189769048
INFO:Nep_NER_Log:  precision = 0.7480916030534351
INFO:Nep_NER_Log:  recall = 0.784
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled/wikiann_ne
INFO:Nep_NER_Log:Writing example 0 of 100
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-1
INFO:Nep_NER_Log:tokens: [CLS] नेपाल म ##ज ##द ##ुर कि ##सा ##न पार्टी २ स ##िट [SEP]
INFO:Nep_NER_Log:input_ids: 101 29953 889 17413 15552 52398 14117 35127 11453 85554 924 898 80858 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

 ['नेपाल', 'मजदुर', 'किसान', 'पार्टी', '२', 'सिट']
 ['नेपाल', 'कम्युनिष्ट', 'पार्टी', '(', 'एकीकृत', 'मार्क्सवादी-लेनिनवादी', ')']
 ['अन्तर्राष्ट्रिय', 'नागरिक', 'उड्डयन', 'सङ्गठन']
 ['पपुवा', 'न्युगिनी', 'राष्ट्रिय', 'क्रिकेट', 'टिम']
 ['चेल्सी', 'फुटबल', 'क्लब']
 ["'", "''", 'हरकठवा', "''", "'", 'सर्लाही', 'जिल्लामा', 'अवस्थित', 'एक', 'गाविस', 'हो।']
 ["'", "''", 'गिरिजा', 'प्रसाद', 'कोइराला', "''", "'", '(', '२', '/', '५', ')', '(', '१९२५-२०१०', ')']
 ['मजार', 'शरीफ', '-', '३००', ',', '६००']
 ['*१०', 'मार्च', '२००८', '–', 'सुवास', 'घिसिङ', 'दागोपापबाट', 'पदच्युत।']
 ['स्टेट', 'बैंक', 'अफ', 'मैसूर']
 ["'", "''", 'खिलराज', 'रेग्मी', "''", "'", '(', 'कार्यकारी', ')', '(', '१९४९-', ')']
 ['उदित', 'नारायण', 'झा']
 ['हाट', ',', 'बैतडी']
 ['संयुक्त', 'राज्य', 'अमेरिकामा', 'पर्ने', 'एक', 'राज्य।']
 ['उनको', 'मृत्यु', 'पछि', ',', 'माधवकुमार', 'नेपाल', 'उनको', 'पार्टीको', 'नेताको', 'रूपमा', 'आए।']
 ['डा.', 'तुलसी', 'गिरी', '-', 'अध्यक्ष-परराष्ट्र', 'तथा', 'राजदरबारसम्बन्धी']
 ['तोलीजैसी', 'गा

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/wikiann_ne/cached_test_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 13/13 [00:02<00:00,  5.58it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.6693227091633466
INFO:Nep_NER_Log:  loss = 0.37733195148981535
INFO:Nep_NER_Log:  precision = 0.631578947368421
INFO:Nep_NER_Log:  recall = 0.711864406779661
INFO:Nep_NER_Log:{'loss': 0.347856189769048, 'precision': 0.7480916030534351, 'recall': 0.784, 'f1': 0.7656250000000001}


# Training on wikiann_hi

In [21]:
data_path = '/content/nepali-ner/data/labeled/wikiann_hi'

BERT model training:

In [22]:
args, _ = get_args()
args.data_dir = data_path  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "wikiann_hi" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "bert-base-multilingual-cased"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = True
args.do_eval = True
args.do_predict = True

In [23]:
# confirm your cuda devices before setting this command
#!export CUDA_VISIBLE_DEVICES=1,2,3
start_training(args)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-b

 ['टैपी', 'ने', 'अपने', 'उत्पादों', 'को', 'एशिया', 'के', 'अपतटीय', 'भागों', 'में', 'भेजने', 'का', 'फैसला', 'लिया।', 'उन्होंने', 'प्रोमोशन', 'के', 'लिए', 'मैडोना', 'को', 'अपने', 'साथ', 'मिलाया।']
 ['केष्टो', 'मुखर्जी', '-', 'गुलु']
 ['कादर', 'ख़ान', '-', 'मैने्जर']
 ['(', 'जम्मू', 'और', 'कश्मीर', ')']
 ['गुलशन', 'ग्रोवर', '-', 'अतिथि', 'भूमिका']
 ['एअर', 'इंडिया', 'क्षेत्रीय']
 ["'", "''", 'भालचंद्र', 'नेमाडे', "''", "'"]
 ['भारतीय', 'अंतरिक्ष', 'अनुसंधान', 'संगठन']
 ["'", "''", 'बुल्गारिया', "''", "'"]
 ['**', 'अल्जीयर्स', '(', 'दूतावास', ')']
 ['डेनियला', 'हंचुकोवा', 'ने', 'स्वेतलाना', 'कुज़नेतसोवा', 'को', '6–3', ',', '6–4', 'से', 'हराया।']
 ['भारतीय', 'दूरसंचार', 'विनियामक', 'प्राधिकरण']
 ['वैद्यनाथ', 'मन्दिर', ',', 'देवघर']
 ['रॉनित', 'रॉय', '(', 'के', '.']
 ['किशनगंज', ',', 'बिहार', 'का', 'एक', 'प्रखण्ड।']
 ['रब', 'से', 'सोणा', 'इश्\u200dक']
 ['और', 'मेसाचुसेट्स', 'प्रौद्योगिक', 'संस्थान', '.']
 ["'", "''", 'सज़ेस्टोंचोवा', "''", "'", 'पोलैंड']
 ['**', 'पारामरिबो', '(', 'दूतावास', 

INFO:Nep_NER_Log:Writing example 0 of 5000
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: train-1
INFO:Nep_NER_Log:tokens: [CLS] ट ##ै ##पी ने अपने उ ##त ##् ##पा ##द ##ों को ए ##शिया के अ ##प ##त ##टी ##य भाग ##ों में भ ##ेज ##ने का फ ##ै ##स ##ला लिया । उन्होंने प ##्रो ##म ##ो ##शन के लिए म ##ै ##ड ##ोन ##ा को अपने साथ म ##िला ##या । [SEP]
INFO:Nep_NER_Log:input_ids: 101 875 18438 104062 13088 19346 855 11845 20429 42035 15552 11497 11267 860 87084 10412 851 18187 11845 58580 13874 26803 11497 10532 888 73184 13466 11081 886 18438 13432 14334 37444 920 27640 885 63750 13841 13718 61820 10412 13182 889 18438 20691 73354 11208 11267 19346 16208 889 33156 15168 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 


 ['पंजाब', 'कृषि', 'विश्वविद्यालय', ',', 'लुधियाना']
 ['अनुप्रेषित', 'मोढेरा', 'सूर्य', 'मंदिर']
 ['वर्जीनिया', 'रुआनो', 'पास्कुआल', 'पाओला', 'सुआरेज़']
 ['सायन', 'पर्वत', 'शृंखला']
 ['सिक्किम', '(', '1', ')', '===']
 ['गणित', 'विज्ञान', 'संस्थान', ',', 'चेन्नई']
 ['लीमा', ',', 'पेरू', 'की', 'गलियों', 'में', ',', 'उसे', 'अपनी', 'शक्तियों', 'को', 'पता', 'चलता', 'है', '.']
 ['संयुक्त', 'राज्य', 'अमेरिका']
 ['कनाडा', 'या', 'बरमूडा', 'का', 'नागरिक', 'हो', ',', 'या']
 ['लोकमान्य', 'तिलक', 'टर्मिनस', 'रेलवे', 'स्टेशन']
 ['इण्डियन', 'ओवरसीज़', 'बैंक']
 ['दिल्ली', 'नौयडा', 'टॉल', 'ब्रिज']
 ['दरभंगा', ',', 'बिहार', 'का', 'एक', 'प्रखण्ड।']
 ['भारतीय', 'जीवन', 'बीमा', 'निगम']
 ['अरुणा', 'ईरानी', '-', 'चँदा']
 ['अनुप्रेषित', 'सोहन', 'सिंह', 'भकना']
 ['यशवन्त', 'सिंह', 'परमार', 'औद्यानिकी', 'एवं', 'वानिकी', 'विश्वविद्यालय', ',', 'सोलन']
 ['माहिम', 'की', 'खाड़ी']
 ['ताशकन्द', '(', '1930', 'से', 'पहले', 'समरक़न्द', ')']
 ['भारतीय', 'प्रबंधन', 'संस्थान', ',', 'शिलांग']
 ['थाणे', ',', 'भारत', 'और', 'न

INFO:Nep_NER_Log:tokens: [CLS] ( ज ##म् ##म ##ू और क ##श ##्मी ##र ) [SEP]
INFO:Nep_NER_Log:input_ids: 101 113 872 45753 13841 15778 10977 865 21835 76881 11549 114 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

 ['हावड़ा', 'जंक्शन', 'रेलवे', 'स्टेशन']
 ['त्वय्फेल्फोंतैन', '(', 'Twyfelfontein', ')']
 ["'", "''", 'विलियम', 'क्लिंटन', "''", "'", '(', 'born', '1946', ')']
 ['पूर्णिया', ',', 'बिहार', 'का', 'एक', 'प्रखण्ड।']
 ['अनुप्रेषित', 'क़ुतुबुद्दीन', 'मुबारक़', 'ख़िलजी']
 ['दक्षिणी', 'गोलार्ध', 'की', 'सर्दियों', 'में']
 ['नई', 'दिल्ली', 'रेलवे', 'स्टेशन']
 ['अनुप्रेषित', 'ओल्ड', 'ट्रैफर्ड', 'क्रिकेट', 'मैदान']
 ['72°', '35', "'", 'E', 'अहमदाबाद', ',']
 ['भारतीय', 'स्टेट', 'बैंक']
 ['पश्चिम', 'बंगाल', 'राष्ट्रीय', 'न्यायिक', 'विज्ञान', 'विश्वविद्यालय']
 ['हैदराबाद', '--', '1080', 'किमी', ',']
 ['गेब्रियल', 'गार्सिया', 'मार्ख़ेस']
 ['भारत', '(', '1992', 'से', ')']
 ['शंकर', 'अन्तर्राष्ट्रीय', 'गुड़िया', 'संग्रहालय', ',', 'नई', 'दिल्ली']
 ['बड़े', 'अच्छे', 'लगते', 'हैं']
 ['मध्य', 'प्रदेश', '(', '264', ')']
 ['भारतीय', 'दूरसंचार', 'विनियामक', 'प्राधिकरण']
 ['रियो', 'डि', 'जेनेरो']
 ['ऑस्टिन', ',', 'टेक्सास']
 ['किंग्स्टन', ',', 'नॉर्फ़ोक', 'द्वीप']
 ['राष्ट्रीय', 'राजमार्ग', '६२']
 ['जयन्त', 'वि

INFO:Nep_NER_Log:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:label_ids: -100 5 -100 -100 6 -100 -100 6 6 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -100 -

 ['नन्दा', 'देवी', 'पर्वत']
 ['===', 'आगरा', 'में', 'आमंत्रण', 'और', 'पलायन', '===']
 ["'", "''", 'अटलांटा', 'फाल्कन्स', "''", "'"]
 ['कोटा', ',', 'राजस्थान']
 ['संयुक्त', 'राष्ट्र', 'खाद्य', 'एवं', 'कृषि', 'संगठन']
 ['किसकी', 'आजादी', ',', 'किसका', 'जश्न', 'दैनिक', 'जागरण', 'का', 'एक', 'लेख।']
 ['राधा', 'कृष्ण', 'मंदिर', ',', 'कानपुर']
 ['रोहतक', 'लोक', 'सभा', 'निर्वाचन', 'क्षेत्र']
 ['प्लाज़्मा', '(', 'भौतिकी', ')']
 ['इम्फाल', 'पूर्व', 'जिला']
 ['अनुप्रेषित', '2014', 'राष्ट्रमण्डल', 'खेल']
 ['अनुप्रेषित', 'के॰', 'जे॰', 'येशुदास']
 ['पुनर्प्रेषित', 'अलेक्जेण्डर', 'वॉन', 'हम्बोल्ट']
 ["'", "''", 'गुजरात', "''", "'"]
 ['विदर्भ', 'क्रिकेट', 'एसोसिएशन', 'ग्राउंड', ',', 'नागपुर']
 ["'", "''", 'टेन', 'स्पोर्ट्स', "''", "'"]
 ['अनुप्रेषित', 'रिओ', 'दे', 'ला', 'प्लाता']
 ['उधम', 'सिंह', 'नगर', 'जिला']
 ['अनुप्रेषित', 'मिस्र', 'के', 'सशस्त्र', 'बल']
 ["''", 'छत्तीसगढ़', "''", "'"]
 ['राष्ट्रीय', 'तापविद्युत', 'निगम', 'लिमिटेड']
 ['डबिंग]]', '(', 'श्रेय', 'दिया', 'डीवीडी', 'रिलीज', 'पर', ')']


INFO:Nep_NER_Log:guid: test-2
INFO:Nep_NER_Log:tokens: [CLS] = = = आ ##गर ##ा में आम ##ंत्रण और प ##ला ##यन = = = [SEP]
INFO:Nep_NER_Log:input_ids: 101 134 134 134 852 45086 11208 10532 70575 107229 10977 885 14334 65469 134 134 134 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

# Training on singh_ner

In [24]:
data_path = '/content/nepali-ner/data/labeled/singh_ner'

BERT model training:

In [25]:
args, _ = get_args()
args.data_dir = data_path  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "singh_ner" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "bert-base-multilingual-cased"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = True
args.do_eval = True
args.do_predict = True

In [26]:
# confirm your cuda devices before setting this command
#!export CUDA_VISIBLE_DEVICES=1,2,3
start_training(args)

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-base-multilingual-cased/resolve/main/tokenizer_config.json HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/bert-base-multilingual-cased HTTP/1.1" 200 940
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /bert-b

 ['बिते', 'को', 'अढाइ', 'दशक', 'को', 'अवधि', 'मा', 'अत्यधिक', 'उद्योग', 'हरू', 'चीन', 'तथा', 'नवोदित', 'औद्योगिक', 'राष्ट्र', 'हरूमा', 'सारएि', '।']
 ['समग्र', 'देश', 'लाई', 'एउटै', 'एकाइ', 'मानेर', 'राजनीतिक', 'संरचना', 'हरू', 'बनाउने', 'परम्पराविपरीत', 'पछिल्लो', 'अवधि', 'मा', 'जातीय', 'अल्पसङ्ख्यक', 'क्षेत्रीय', 'समुदाय', 'दलित', 'धार्मिक', 'र', 'लैङ्गिक', 'समानता', 'का', 'लागि', 'सङ्र्घषरत', 'समूह', 'हरूलाई', 'समेटेर', 'नवमार्क्सवादी', 'हरूले', 'निर्माण', 'गरे', 'को', 'वैचारकि', 'डिस्कोर्स', 'ले', 'नेपाल', 'मा', 'भिन्न', 'स्थिति', 'सिर्जना', 'गरे', 'को', 'छ', '।']
 ['संसदीय', 'प्रणाली', 'का', 'बारेमा', 'प्रचण्ड', 'को', 'अभिव्यक्ति', 'लाई', 'काङ्ग्रेस', 'ले', 'अन्तर्राष्ट्रियकरण', 'गर्न', 'खोज्यो', '।']
 ['भूपरविेष्ठित', 'मुलुक', 'ले', 'पाउनु', 'पर्ने', 'व्यापार', 'तथा', 'पारवहन', 'सन्धि', 'कोलकाता', 'बन्दरगाह', 'को', 'समस्या', 'निकासी', 'तथा', 'पैठारी', 'मा', 'दोहोरो', 'मापदण्ड', 'आदि', 'पनि', 'विवादित', 'विषय', 'हरू', 'छन्', '।']
 ['नेपाल', 'ले', 'याक', 'को', 'विकास', 'गर्न', 'सके

INFO:Nep_NER_Log:Writing example 0 of 2301
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: train-1
INFO:Nep_NER_Log:tokens: [CLS] ब ##ित ##े को अ ##ढ ##ा ##इ दशक को अ ##व ##धि मा अ ##त्य ##धि ##क उ ##द्य ##ोग हर ##ू चीन तथा न ##व ##ो ##द ##ित औ ##द्य ##ोग ##िक राष्ट्र हर ##ू ##मा स ##ार ##ए ##ि । [SEP]
INFO:Nep_NER_Log:input_ids: 101 887 13184 11554 11267 851 111204 11208 34231 103202 11267 851 15070 32831 32629 851 91618 32831 12151 855 97110 81555 85465 15778 91582 14862 884 15070 13718 15552 13184 864 97110 81555 13671 72998 85465 15778 12347 898 19885 22599 12878 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

['समाचार', 'समिति', 'ले', 'बाबुनिया', 'क्षेत्र', 'मा', 'शनिवार', 'सर', 'कारी', 'सैनिक', 'तथा', 'छापामार', 'बीच', 'भए', 'को', 'सङ्घर्ष', 'मा', 'करीब', 'टाइगर्स', 'छापामार', 'मारिए', 'को', 'उल्लेख', 'गरे', 'को', 'छ', '।']
 ['कहि', 'ले', 'दृश्य', 'र', 'अक्सर', 'अदृश्य', 'रूपमा', 'आइरहे', 'को', 'परविर्तन', 'को', 'प्रवाह', 'भावी', 'नेपाल', 'को', 'जीवन', 'मा', 'दूरगामी', 'महत्त्व', 'राख्ने', 'खाल', 'का', 'छन्', '।']
 ['समिति', 'को', 'उपाध्यक्ष', 'मा', 'महमद', 'स्माइल', ',', 'सचिव', 'मा', 'रामप्रसाद', 'निरौला', ',', 'कोषाध्यक्ष', 'मा', 'महिचन्द्र', 'शाह', 'तथा', 'सदस्य', 'हरू', 'मा', 'रमेश', 'सुवेदी', ',', 'धनराज', 'भट्टराई', ',', 'गोपाल', 'पोखरेल', ',', 'राजेन्द्र', 'के.', 'सी.', ',', 'रमेश', 'दाहाल', ',', 'सुरेश', 'शाह', ',', 'चक्रबहादुर', 'क्षेत्री', ',', 'निर', 'मय', 'नारायण', 'मल्लिक', 'र', 'नारायण', 'पोखरेल', 'रहनु', 'भए', 'को', 'छ', '।']
 ['त्यहाँ', 'स्थित', 'जमीर', 'राइन', 'ले', 'सो', 'कुलो', 'बाट', 'प्राप्त', 'आय', 'निजि', 'रुप', 'मा', 'खाने', 'गरे', 'को', 'कुरा', 'चर्चा', 'मा', 'आए'

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/singh_ner/cached_train_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running training *****
INFO:Nep_NER_Log:  Num examples = 2301
INFO:Nep_NER_Log:  Num Epochs = 10
INFO:Nep_NER_Log:  Instantaneous batch size per GPU = 32
INFO:Nep_NER_Log:  Total train batch size (w. parallel, distributed & accumulation) = 32
INFO:Nep_NER_Log:  Gradient Accumulation steps = 1
INFO:Nep_NER_Log:  Total optimization steps = 720
Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:   1%|▏         | 1/72 [00:01<02:05,  1.77s/it][A
Iteration:   3%|▎         | 2/72 [00:03<02:02,  1.75s/it][A
Iteration:   4%|▍         | 3/72 [00:05<02:00,  1.75s/it][A
Iteration:   6%|▌         | 4/72 [00:06<01:58,  1.75s/it][A
Iteration:   7%|▋         | 5/72 [00:08<01:56,  1.74s/it][A
Iteration:   8%|▊         | 6/72 [00:10<01:54,  1.74s/it][A
Iteration:  10%|▉         | 7/72 [00:12<01:53,  1.74s/it][A
Iteration:  11%|█   

 ['विम्बल्डन', 'को', 'सेमिफाइनल', 'मा', 'सेरेना', 'विलियम्स', 'सँगको', 'हार', 'पछि', 'मस्', 'को', 'को', 'घर', 'मै', 'अधिकांश', 'स', 'मय', 'बिताए', 'की', 'तेस्रो', 'वरियता', 'की', 'इलेना', 'डेमेन्टिएभा', 'ले', 'एनकाइथाभोंग', 'लाई', 'र', 'ले', 'हराइन्', '।']
 ['मिहिनेत', 'का', 'साथ', 'पढ्दै', 'आए', 'की', 'सुमित्रा', 'को', 'विद्यालय', 'को', 'टिफिन', 'टाइम', 'र', 'उन', 'को', 'वस्तुभाउ', 'को', 'खाने', 'वेला', 'एउटै', 'हुन्छ', '।']
 ['खेलकुद', 'सम्बन्धी', 'आधिकारकि', 'मन्त्रालय', 'हुँदाहुँदै', 'गृह', 'मन्त्रालय', 'को', 'हस्तक्षेप', 'किन', 'त', 'गृह', 'को', 'भनाइ', 'छ', 'जिल्ला', 'प्रशासन', 'कार्यालय', 'को', 'मुद्दा', 'भएका', 'ले', 'मन्त्रालय', 'ले', 'हस्तक्षेप', 'गरे', 'को', 'हो', '।']
 ['वीर', 'खुम्चिएर', 'बस्ने', 'तन्नेरी', 'थिएनन्', '।']
 ['चिकित्सक', 'को', 'लापरबाही', 'का', 'कारण', 'दृष्टि', 'गुमाए', 'का', 'राजु', 'महत', 'लाई', 'लाख', 'रुपिया', 'क्षतिपर्ूर्ति', 'गराउन', 'होस्', 'या', 'चिकित्सक', 'ले', 'दम', 'भएकी', 'युवती', 'लाई', 'स्वस्थ', 'लेखिदिएर', 'इजरायल', 'पठाए', 'पछि', 'उस', 'ले'

INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: dev-5
INFO:Nep_NER_Log:tokens: [CLS] च ##िक ##ित ##्स ##क को ला ##पर ##बा ##ही का कारण द ##ृष्टि ग ##ु ##मा ##ए का र ##ाज ##ु म ##हत ला ##ई लाख रुप ##िया क ##्ष ##ति ##पर ##् ##ूर ##्ति ग ##रा ##उन हो ##स ##् या च ##िक ##ित ##्स ##क ले द ##म भ ##ए ##की य ##ु ##वत ##ी ला ##ई स ##्व ##स ##्थ ले ##ख ##ि ##द ##ि ##एर इ ##ज ##रा ##य ##ल प ##ठा ##ए पछि उस ले तो ##के को ठ ##ा ##उ मा काम न ##पा ##एर य ##ौ ##न दूर ##ाचा ##र का कारण च ##ेत ##ना स ##म ##ेत ग ##ु ##मा ##ए की ब ##स ##्ने ##त थ ##र की य ##ु ##वत ##ी को क ##्ष ##ति ##पर ##् ##ूर ##्ति का म ##ु ##द ##्दा ज ##्यो ##ति ल ##ड ##्द ##ै ##छ ##न् । [SEP]
INFO:Nep_NER_Log:input_ids: 101 870 13671 13184 18869 12151 11267 21147 61533 57381 24667 11081 23640 882 98334 867 14070 12347 22599 11081 891 44735 14070 889 108775 21147 15801 83690 72954 21394 865 76979 24877 61533 20429 45028 32953 867 31277 62132 13220 13432 20429 12194 870 13671 13184 18869 12151 57865 882 13841 888 22599 24962 

 ['पत्रकार', 'थापा', 'को', 'हत्या', 'आरोप', 'मा', 'पक्राउ', 'परे', 'का', 'प्रतिवादी', 'हरुका', 'तर्फ', 'बाट', 'दुई', 'वकिल', 'ले', 'विहीबार', 'बहस', 'गरे', 'का', 'छन्', '।']
 ['अध्यक्ष', 'गोविन्द', 'मल्ल', 'माथि', 'गंभीर', 'आरोप', '।']
 ['सभामुख', 'दमन', 'ढुंगाना', 'चुनिने', 'निश्चित', '।']
 ['तर', 'प्रहरी', 'द्वारा', 'पक्राउ', 'परसिके', 'का', 'अभियुक्त', 'हरूको', 'बयान', 'रोक्न', 'महान्यायाधिवक्ता', 'मार्फत', 'निर्देशन', 'दिए', 'का', 'प्रधानमन्त्री', 'भट्टराई', 'त्यस', 'मा', 'सफल', 'हुन', 'सकेनन्', '।']
 ['त्यसरी', 'राजनीतिक', 'अभ्यास', 'र', 'अनुभवहरूद्बारा', 'स्थानीय', 'नेतृत्व', 'को', 'विकास', 'मा', 'सफल', 'रहे', 'का', 'स्थानीय', 'निकाय', 'हरू', 'लाई', 'सम्विधान', 'मा', 'राष्ट्रिय', 'परिषद', 'मा', 'प्रतिनिधि', 'पठाउने', 'सम्म', 'को', 'क', 'ओटा', 'राखिए', 'को', 'छ', 'तर', 'वर्तमान', 'सरकार', 'स्थानीय', 'स्वशासन', 'का', 'निकाय', 'हरू', 'को', 'निर्वाचन', 'र', 'गठन', 'गर्ने', 'कुरा', 'सम्म', 'सोचिरहे', 'को', 'छैन', '।']
 ['सभासद्', 'खड्', 'का', 'ले', 'नियमापत्ति', 'गर्दा', 'सभामुख', 'ने

INFO:Nep_NER_Log:Writing example 0 of 658
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-1
INFO:Nep_NER_Log:tokens: [CLS] प ##त्र ##कार था ##पा को ह ##त्या आर ##ोप मा प ##क ##्रा ##उ पर ##े का प्रति ##वादी हर ##ुक ##ा तर ##्फ ब ##ाट दु ##ई व ##कि ##ल ले वि ##ही ##बार ब ##ह ##स गरे का छन् । [SEP]
INFO:Nep_NER_Log:input_ids: 101 885 27373 25561 13794 42035 11267 899 76752 107144 65430 32629 885 12151 77809 111194 12213 11554 11081 54930 63291 85465 97571 11208 32824 75315 887 70608 11854 15801 895 66943 11714 57865 55190 24667 105012 887 17110 13432 79263 11081 35140 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

['सुमित्रा', 'छोराछोरी', 'सँगै', 'बसेर', 'पढ्न', 'थाले', 'पछि', 'कक्षा', 'को', 'वातावरण', 'नै', 'सपि्रए', 'को', 'प्रधानाध्यापक', 'राम', 'बहादुर', 'केसी', 'ले', 'बताए', '।']
 ['नेता', 'थापा', 'ले', 'जातीय', 'सँरचना', 'सहित', 'को', 'सँघीय', 'राज्य', 'असम्भव', 'भएका', 'ले', 'सबै', 'लाई', 'मान्य', 'हुने', 'गरि', 'सँघीय', 'राज्य', 'स्थापना', 'गर्नुपर्ने', 'मा', 'जोड', 'दिए', '।']
 ['भारत', 'को', 'आशिर्वाद', 'बिना', 'सत्तारोहण', 'को', 'उकालो', 'चढ्न', 'सकिदैन', 'भन्ने', 'माओवादी', 'मनस्थीति', 'लाई', 'भारत', 'ले', 'धेरै', 'नजिक', 'बाट', 'छामि', 'सकेको', 'छ', '।']
 ['माओवादी', 'ले', 'जति', 'बोलि', 'फेर्छ', 'त्यो', 'स्वयं', 'उसै', 'लाई', 'घाटा', 'हुने', 'उन', 'ले', 'बताए', '।']
 ['तानसेन', 'मा', 'वामपन्थी', 'नेंता', 'को', 'आचरण', 'कस्तो', 'होला', 'भन्ने', 'पनि', 'धेरै', 'लाई', 'जिज्ञासा', 'थियो', '।']
 ['करिब', 'घण्टा', 'को', 'कार्यक्रम', 'संघ', 'का', 'सदस्य', 'श्री', 'मालिक', 'शर्', 'मा', 'को', 'धन्यवाद', 'ज्ञापन', 'पछि', 'समाप्त', 'भयो', '।']
 ['वर्ष', 'को', 'उत्कृष्ट', 'खेलाडी', 'महिला', 'तथ

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/singh_ner/cached_test_bert-base-multilingual-cased_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 658
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 83/83 [00:15<00:00,  5.40it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.85953293830603
INFO:Nep_NER_Log:  loss = 0.12776944380469554
INFO:Nep_NER_Log:  precision = 0.8491735537190083
INFO:Nep_NER_Log:  recall = 0.8701482004234298
INFO:Nep_NER_Log:{'loss': 0.109506253175953, 'precision': 0.8623481781376519, 'recall': 0.891213389121339, 'f1': 0.8765432098765433}


# Transfer Learning

Trained in Hindi NER and evaluated in NAAMII NER 

In [29]:
args, _ = get_args()
args.data_dir = "/content/nepali-ner/data/labeled/naami_ner"  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "wikiann_hi" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "./wikiann_hi"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = False
args.do_eval = True
args.do_predict = True
start_training(args)

INFO:Nep_NER_Log:Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/content/nepali-ner/data/labeled/naami_ner', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=5e-05, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='./wikiann_hi', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=10, output_dir='wikiann_hi', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
INFO:Nep_NER_Log:Evaluate the following checkpoints: ['wikiann_hi']
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled

 ['मलेसिया', 'का', 'नेपाली', 'व्यवसायी', 'बिरु', 'पाठक', 'ले', 'थप', 'राहत', 'पठाउन', 'लागि', 'सहयोग', 'रकम', 'संकलन', 'भै', 'रहेको', 'जानकारी', 'दिए', '।']
 ['यसै', 'बिच', ',', 'भुकम्प', 'मा', 'परी', 'काठमाडौ', 'श्रीमती', 'र', 'छोरी', 'गुमाए', 'का', 'नागी', '३', 'पाचथर', 'रत्न', 'मेयाङ्बो', 'को', 'परिवार', 'लाई', 'राहत', 'बितरण', 'गर्न', 'नेपाल', 'पुगे', 'नेपाली', 'व्यवसायी', 'समाज', 'मलेसिया', 'टोली', 'ले', '२०', 'हजार', 'रुपैया', 'सहयोग', 'गरे', 'छ', '।']
 ['काठमाडैं', 'मा', 'खाजाघर', 'संचालन', 'गरे', 'का', 'मेयाङ्बो', 'की', 'श्रीमती', 'कमला', 'र', '८', 'बर्षिया', 'छोरी', 'रक्षा', 'को', 'बैशाख', '१२', 'भुकम्प', 'परी', 'ज्यान', 'गए', 'थियो', '।']
 ['मृतक', 'का', 'परिवार', 'लाई', 'भेटेर', 'आज', '२०', 'हजार', 'रुपैंया', 'हस्तान्तरण', 'गरे', 'को', 'समाज', 'सदस्य', 'देवेन्द्र', 'खापुङ', 'ले', 'बताए', '।']
 ['भूकम्प', 'पीडित', 'लाई', 'राहत', 'सामग्री', ',', 'घाइते', 'को', 'उपचार', 'तथा', 'स्रोत', 'र', 'साधन', 'जुटाउने', 'मुख्य', 'लक्ष्य', 'रहे', 'उद्धार', 'टोली', 'का', 'संयोजक', 'डा', '.'

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_dev_wikiann_hi_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 13/13 [00:02<00:00,  5.31it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.509090909090909
INFO:Nep_NER_Log:  loss = 0.7729910566256597
INFO:Nep_NER_Log:  precision = 0.5544554455445545
INFO:Nep_NER_Log:  recall = 0.47058823529411764
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled/naami_ner
INFO:Nep_NER_Log:Writing example 0 of 310
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-1
INFO:Nep_NER_Log:tokens: [CLS] वर्ष म ##ना ##इ ##र ##ह ##ँ ##दा नेपाल आ ##उने पर ##्य ##टक ले प ##टक [UNK] वा ##ता ##वरण क ##सरी बना ##उने भन्ने सन् ##दर ##्भ मा काम गर्न ##ु आवश्यक छ । [SEP]
INFO:Nep_NER_Log:input_ids: 101 24332 889 16380 34231 11549 17110 28462 2550

 ['वर्ष', 'मनाइरहँदा', 'नेपाल', 'आउने', 'पर्यटक', 'ले', 'पटक', '–', 'वातावरण', 'कसरी', 'बनाउने', 'भन्ने', 'सन्दर्भ', 'मा', 'काम', 'गर्नु', 'आवश्यक', 'छ', '।']
 ['किन', 'की', 'नेपाल', 'को', 'समृद्धि', 'आधार', 'नै', 'पर्यटन', 'उद्योग', 'हो', '।']
 ['सडक', 'यातायात', 'का', 'कारण', 'परम्परागत', 'पदयात्रा', 'मार्ग', 'लोप', 'हुने', 'अवस्था', 'मा', 'पुगे', 'छन्', '।']
 ['एससीटी', 'ले', 'युनियन', 'पे', 'इन्टरनेशन', 'सँग', 'सहकार्य', 'सुरु', 'गरि', 'अत्याधुनिक', '‘', 'कन्ट्याक्ट', 'लेस', '’', 'कार्ड', 'को', 'लञ्चिङ', 'गरे', 'छ', '।']
 ['एससीटी', 'का', 'अध्यक्ष', 'देबिप्रकाश', 'भट्टचन', 'ले', 'नेपाली', 'कार्ड', 'र', 'अत्याधुनिक', 'सुमेत', 'भएका', 'प्रयोगकर्ता', 'लाई', 'यो', 'प्रयोग', 'गर्न', 'आग्रह', 'गरे', '।']
 ['यस', 'मा', 'प्रभु', 'ग्रुप', 'का', 'साथै', 'आइएमइ', 'समुह', 'र', 'हिमालयन', 'बैंक', 'ले', 'समेत', 'लगानी', 'गरे', 'छन्', '।']
 ['यस', 'का', 'लागि', 'मुख्य', 'कोर', 'नेटवर्क', 'काठमाडौं', 'र', 'बुटवल', 'मा', 'रहने', 'छ', '।']
 ['नेपाल', 'टेलिकम', 'ले', 'मुलुक', 'मै', 'पहिलो', 'पटक', 'म

INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-4
INFO:Nep_NER_Log:tokens: [CLS] एस ##सी ##टी ले य ##ुन ##ियन प ##े इन ##्टर ##ने ##शन स ##ँ ##ग स ##ह ##कार ##्य स ##ुरु ग ##र ##ि अ ##त्या ##ध ##ुन ##िक [UNK] क ##न् ##ट ##्या ##क ##्ट ले ##स [UNK] का ##र्ड को ल ##ञ ##् ##च ##ि ##ङ गरे छ । [SEP]
INFO:Nep_NER_Log:input_ids: 101 55046 27117 58580 57865 890 45649 31332 885 11554 30114 54071 13466 61820 898 28462 19741 898 17110 25561 14251 898 56902 867 11549 12878 851 76752 27694 45649 13671 100 865 41013 14835 17279 12151 25695 57865 13432 100 11081 52905 11267 893 111203 20429 16940 12878 81489 79263 871 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0

Trained in wikiann_ne and evaluated in NAAMII NER 

In [20]:
args, _ = get_args()
args.data_dir = "/content/nepali-ner/data/labeled/naami_ner"  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "wikiann_np" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "./wikiann_np"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = False
args.do_eval = True
args.do_predict = True
start_training(args)

INFO:Nep_NER_Log:Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/content/nepali-ner/data/labeled/naami_ner', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=5e-05, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='./wikiann_np', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=10, output_dir='wikiann_np', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
INFO:Nep_NER_Log:Evaluate the following checkpoints: ['wikiann_np']
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled

 ['मलेसिया', 'का', 'नेपाली', 'व्यवसायी', 'बिरु', 'पाठक', 'ले', 'थप', 'राहत', 'पठाउन', 'लागि', 'सहयोग', 'रकम', 'संकलन', 'भै', 'रहेको', 'जानकारी', 'दिए', '।']
 ['यसै', 'बिच', ',', 'भुकम्प', 'मा', 'परी', 'काठमाडौ', 'श्रीमती', 'र', 'छोरी', 'गुमाए', 'का', 'नागी', '३', 'पाचथर', 'रत्न', 'मेयाङ्बो', 'को', 'परिवार', 'लाई', 'राहत', 'बितरण', 'गर्न', 'नेपाल', 'पुगे', 'नेपाली', 'व्यवसायी', 'समाज', 'मलेसिया', 'टोली', 'ले', '२०', 'हजार', 'रुपैया', 'सहयोग', 'गरे', 'छ', '।']
 ['काठमाडैं', 'मा', 'खाजाघर', 'संचालन', 'गरे', 'का', 'मेयाङ्बो', 'की', 'श्रीमती', 'कमला', 'र', '८', 'बर्षिया', 'छोरी', 'रक्षा', 'को', 'बैशाख', '१२', 'भुकम्प', 'परी', 'ज्यान', 'गए', 'थियो', '।']
 ['मृतक', 'का', 'परिवार', 'लाई', 'भेटेर', 'आज', '२०', 'हजार', 'रुपैंया', 'हस्तान्तरण', 'गरे', 'को', 'समाज', 'सदस्य', 'देवेन्द्र', 'खापुङ', 'ले', 'बताए', '।']
 ['भूकम्प', 'पीडित', 'लाई', 'राहत', 'सामग्री', ',', 'घाइते', 'को', 'उपचार', 'तथा', 'स्रोत', 'र', 'साधन', 'जुटाउने', 'मुख्य', 'लक्ष्य', 'रहे', 'उद्धार', 'टोली', 'का', 'संयोजक', 'डा', '.'

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_dev_wikiann_np_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 13/13 [00:02<00:00,  5.31it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.4470588235294118
INFO:Nep_NER_Log:  loss = 0.4845678290495506
INFO:Nep_NER_Log:  precision = 0.5080213903743316
INFO:Nep_NER_Log:  recall = 0.39915966386554624
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled/naami_ner


 ['वर्ष', 'मनाइरहँदा', 'नेपाल', 'आउने', 'पर्यटक', 'ले', 'पटक', '–', 'वातावरण', 'कसरी', 'बनाउने', 'भन्ने', 'सन्दर्भ', 'मा', 'काम', 'गर्नु', 'आवश्यक', 'छ', '।']
 ['किन', 'की', 'नेपाल', 'को', 'समृद्धि', 'आधार', 'नै', 'पर्यटन', 'उद्योग', 'हो', '।']
 ['सडक', 'यातायात', 'का', 'कारण', 'परम्परागत', 'पदयात्रा', 'मार्ग', 'लोप', 'हुने', 'अवस्था', 'मा', 'पुगे', 'छन्', '।']
 ['एससीटी', 'ले', 'युनियन', 'पे', 'इन्टरनेशन', 'सँग', 'सहकार्य', 'सुरु', 'गरि', 'अत्याधुनिक', '‘', 'कन्ट्याक्ट', 'लेस', '’', 'कार्ड', 'को', 'लञ्चिङ', 'गरे', 'छ', '।']
 ['एससीटी', 'का', 'अध्यक्ष', 'देबिप्रकाश', 'भट्टचन', 'ले', 'नेपाली', 'कार्ड', 'र', 'अत्याधुनिक', 'सुमेत', 'भएका', 'प्रयोगकर्ता', 'लाई', 'यो', 'प्रयोग', 'गर्न', 'आग्रह', 'गरे', '।']
 ['यस', 'मा', 'प्रभु', 'ग्रुप', 'का', 'साथै', 'आइएमइ', 'समुह', 'र', 'हिमालयन', 'बैंक', 'ले', 'समेत', 'लगानी', 'गरे', 'छन्', '।']
 ['यस', 'का', 'लागि', 'मुख्य', 'कोर', 'नेटवर्क', 'काठमाडौं', 'र', 'बुटवल', 'मा', 'रहने', 'छ', '।']
 ['नेपाल', 'टेलिकम', 'ले', 'मुलुक', 'मै', 'पहिलो', 'पटक', 'म

INFO:Nep_NER_Log:Writing example 0 of 310
INFO:Nep_NER_Log:*** Example ***
INFO:Nep_NER_Log:guid: test-1
INFO:Nep_NER_Log:tokens: [CLS] वर्ष म ##ना ##इ ##र ##ह ##ँ ##दा नेपाल आ ##उने पर ##्य ##टक ले प ##टक [UNK] वा ##ता ##वरण क ##सरी बना ##उने भन्ने सन् ##दर ##्भ मा काम गर्न ##ु आवश्यक छ । [SEP]
INFO:Nep_NER_Log:input_ids: 101 24332 889 16380 34231 11549 17110 28462 25501 29953 852 65261 12213 14251 76826 57865 885 76826 100 37038 13537 72989 865 110572 60701 65261 76775 16701 71478 90218 32629 28043 54248 14070 73012 871 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

 ['भारत', 'मा', 'मात्रै', 'नभइ', 'डोनियर', 'का', 'उत्पादन', 'हरु', 'अन्य', '२८', 'वटा', 'देश', 'हरुमा', 'पनि', 'निर्यात', 'हुँदै', 'आए', 'छन्', '।']
 ['त्यसै', 'ले', 'डोनियर', 'हरेक', 'ठाउँ', 'का', 'स्थानीय', 'व्यक्तित्व', 'हरुलाई', 'नै', 'आफ्नो', 'प्रतिनिधि', 'बनाउने', 'गरे', 'को', 'अग्रवाल', 'बताए', '।']
 ['डोनियर', 'ले', 'आफु', 'उत्पादन', 'गरे', 'को', 'कपडा', 'हरु', 'र', 'आफ्नो', 'सुनियोजित', 'विस्तृत', 'संजाल', 'बाट', 'भारत', 'भरि', 'का', 'सबै', 'मानिस', 'हरुलाई', 'प्रभावित', 'अग्रवाल', 'दबी', '।']
 ['पछिल्लो', 'समय', 'मा', 'स्वास्थ्य', 'क्षेत्र', 'आक्रामक', 'रुप', 'लगानी', 'गरि', 'रहेका', 'डा', '.', 'ज्योति', 'यस', 'लाई', 'अर्को', 'जीवन', 'का', 'लागि', 'तयारी', 'ब्याख्या', 'गर्छन्', '।']
 ['ग्राण्डी', 'अस्पताल', 'र', 'सिटी', 'मा', 'गरी', 'करिव', 'डेढ', 'अर्ब', 'रुपैयाँ', 'लगानी', 'गरे', 'का', 'ज्योति', 'स्वास्थ्य', 'क्षेत्र', 'निरन्तर', 'घाटा', 'खाँदा', 'पनि', 'अझै', 'गर्न', 'उत्साहित', 'देखिन्छन्', '।']
 ['हाम्रा', 'बुबा', '(', 'मणिहर्ष', 'ज्योति', ')', 'ले', '५२', 'वर्ष', 'अगाडि

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_test_wikiann_np_164
INFO:Nep_NER_Log:***** Running evaluation  *****
INFO:Nep_NER_Log:  Num examples = 310
INFO:Nep_NER_Log:  Batch size = 8
Evaluating: 100%|██████████| 39/39 [00:07<00:00,  5.30it/s]
INFO:Nep_NER_Log:***** Eval results  *****
INFO:Nep_NER_Log:  f1 = 0.4049416609471516
INFO:Nep_NER_Log:  loss = 0.5529774075899369
INFO:Nep_NER_Log:  precision = 0.4704944178628389
INFO:Nep_NER_Log:  recall = 0.35542168674698793
INFO:Nep_NER_Log:{'loss': 0.4845678290495506, 'precision': 0.5080213903743316, 'recall': 0.39915966386554624, 'f1': 0.4470588235294118}


Trained in Singh NER Dataset and evaluated in NAAMII NER 

In [31]:
args, _ = get_args()
args.data_dir = "/content/nepali-ner/data/labeled/naami_ner"  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.output_dir = "singh_ner" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "./singh_ner"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
# args.do_finetune = True
args.do_train = False
args.do_eval = True
args.do_predict = True
start_training(args)

INFO:Nep_NER_Log:Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/content/nepali-ner/data/labeled/naami_ner', device=device(type='cuda'), do_eval=True, do_finetune=False, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir=None, labels='', learning_rate=5e-05, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='./singh_ner', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=10, output_dir='singh_ner', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
INFO:Nep_NER_Log:Evaluate the following checkpoints: ['singh_ner']
INFO:Nep_NER_Log:Creating features from dataset file at /content/nepali-ner/data/labeled/na

 ['मलेसिया', 'का', 'नेपाली', 'व्यवसायी', 'बिरु', 'पाठक', 'ले', 'थप', 'राहत', 'पठाउन', 'लागि', 'सहयोग', 'रकम', 'संकलन', 'भै', 'रहेको', 'जानकारी', 'दिए', '।']
 ['यसै', 'बिच', ',', 'भुकम्प', 'मा', 'परी', 'काठमाडौ', 'श्रीमती', 'र', 'छोरी', 'गुमाए', 'का', 'नागी', '३', 'पाचथर', 'रत्न', 'मेयाङ्बो', 'को', 'परिवार', 'लाई', 'राहत', 'बितरण', 'गर्न', 'नेपाल', 'पुगे', 'नेपाली', 'व्यवसायी', 'समाज', 'मलेसिया', 'टोली', 'ले', '२०', 'हजार', 'रुपैया', 'सहयोग', 'गरे', 'छ', '।']
 ['काठमाडैं', 'मा', 'खाजाघर', 'संचालन', 'गरे', 'का', 'मेयाङ्बो', 'की', 'श्रीमती', 'कमला', 'र', '८', 'बर्षिया', 'छोरी', 'रक्षा', 'को', 'बैशाख', '१२', 'भुकम्प', 'परी', 'ज्यान', 'गए', 'थियो', '।']
 ['मृतक', 'का', 'परिवार', 'लाई', 'भेटेर', 'आज', '२०', 'हजार', 'रुपैंया', 'हस्तान्तरण', 'गरे', 'को', 'समाज', 'सदस्य', 'देवेन्द्र', 'खापुङ', 'ले', 'बताए', '।']
 ['भूकम्प', 'पीडित', 'लाई', 'राहत', 'सामग्री', ',', 'घाइते', 'को', 'उपचार', 'तथा', 'स्रोत', 'र', 'साधन', 'जुटाउने', 'मुख्य', 'लक्ष्य', 'रहे', 'उद्धार', 'टोली', 'का', 'संयोजक', 'डा', '.'

INFO:Nep_NER_Log:guid: dev-5
INFO:Nep_NER_Log:tokens: [CLS] भ ##ूक ##म् ##प पी ##ड ##ित ला ##ई र ##ाह ##त स ##ाम ##ग्री , घ ##ा ##इ ##ते को उ ##प ##चार तथा स ##्रो ##त र स ##ा ##धन ज ##ुट ##ा ##उने मुख्य ल ##क्ष ##्य रहे उ ##द्ध ##ार ट ##ोली का सं ##य ##ोज ##क ड ##ा . म ##दन उ ##प ##्र ##ेत ##ी ले ब ##ता ##ए । [SEP]
INFO:Nep_NER_Log:input_ids: 101 888 101412 45753 18187 85166 20691 13184 21147 15801 891 66921 11845 898 49362 75147 117 868 11208 34231 17203 11267 855 18187 36208 14862 898 63750 11845 891 898 11208 51799 872 90434 11208 65261 31169 893 32662 14251 33555 855 43228 19885 875 54317 11081 28466 13874 72850 12151 877 11208 119 889 49978 855 18187 18321 53316 10914 57865 887 13537 22599 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

 ['वर्ष', 'मनाइरहँदा', 'नेपाल', 'आउने', 'पर्यटक', 'ले', 'पटक', '–', 'वातावरण', 'कसरी', 'बनाउने', 'भन्ने', 'सन्दर्भ', 'मा', 'काम', 'गर्नु', 'आवश्यक', 'छ', '।']
 ['किन', 'की', 'नेपाल', 'को', 'समृद्धि', 'आधार', 'नै', 'पर्यटन', 'उद्योग', 'हो', '।']
 ['सडक', 'यातायात', 'का', 'कारण', 'परम्परागत', 'पदयात्रा', 'मार्ग', 'लोप', 'हुने', 'अवस्था', 'मा', 'पुगे', 'छन्', '।']
 ['एससीटी', 'ले', 'युनियन', 'पे', 'इन्टरनेशन', 'सँग', 'सहकार्य', 'सुरु', 'गरि', 'अत्याधुनिक', '‘', 'कन्ट्याक्ट', 'लेस', '’', 'कार्ड', 'को', 'लञ्चिङ', 'गरे', 'छ', '।']
 ['एससीटी', 'का', 'अध्यक्ष', 'देबिप्रकाश', 'भट्टचन', 'ले', 'नेपाली', 'कार्ड', 'र', 'अत्याधुनिक', 'सुमेत', 'भएका', 'प्रयोगकर्ता', 'लाई', 'यो', 'प्रयोग', 'गर्न', 'आग्रह', 'गरे', '।']
 ['यस', 'मा', 'प्रभु', 'ग्रुप', 'का', 'साथै', 'आइएमइ', 'समुह', 'र', 'हिमालयन', 'बैंक', 'ले', 'समेत', 'लगानी', 'गरे', 'छन्', '।']
 ['यस', 'का', 'लागि', 'मुख्य', 'कोर', 'नेटवर्क', 'काठमाडौं', 'र', 'बुटवल', 'मा', 'रहने', 'छ', '।']
 ['नेपाल', 'टेलिकम', 'ले', 'मुलुक', 'मै', 'पहिलो', 'पटक', 'म

INFO:Nep_NER_Log:input_ids: 101 24332 889 16380 34231 11549 17110 28462 25501 29953 852 65261 12213 14251 76826 57865 885 76826 100 37038 13537 72989 865 110572 60701 65261 76775 16701 71478 90218 32629 28043 54248 14070 73012 871 920 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:Nep_NER_Log:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

# Fine Tuning

Trained in Hindi NER and fine tuned in NAAMII NER 

In [32]:
args, _ = get_args()
args.data_dir = "/content/nepali-ner/data/labeled/naami_ner"  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.input_dir = "./wikiann_hi"
args.output_dir = "hi2np_finetune" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "./wikiann_hi"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
args.do_finetune = True
args.do_train = False
args.do_eval = True
args.do_predict = True
start_training(args)

INFO:Nep_NER_Log:Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/content/nepali-ner/data/labeled/naami_ner', device=device(type='cuda'), do_eval=True, do_finetune=True, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir='./wikiann_hi', labels='', learning_rate=5e-05, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='./wikiann_hi', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=10, output_dir='hi2np_finetune', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
INFO:Nep_NER_Log:Loading features from cached file /content/nepali-ner/data/labeled/naami_ner/cached_test_wikiann_hi_164
INFO:Nep_NER_Log:*****

 ['फेरि', 'तिनै', 'पाकिस्तानी', 'हरू', 'शेख', 'मुजिबर', 'रहमान', 'लाई', '‘', 'अलगाववादी', '’', 'भन्छन्', ',', 'जबकि', 'बंगलादेश', 'मा', 'राष्ट्र', 'का', 'संस्थापक', 'पिता', 'वा', 'बंगबन्धु', 'मानिन्छ', '।']
 ['सिक्किम', 'को', 'भारत', 'मा', 'विलयन', 'किन', 'राष्ट्रघात', ',', 'हस्तक्षेप', 'र', 'विस्तारवाद', 'हो', 'तर', 'तिब्बत', 'चीन', 'चाहिं', 'हैन', 'एकता', '?']
 ['दोस्रो', 'विश्वयुद्ध', 'ले', 'कोरिया', 'लाई', 'मात्र', 'हैन', ',', 'जर्मनी', 'पनि', 'दुई', 'राज्य', 'मा', 'बाँडे', 'को', 'थियो', '।']
 ['पूर्वी', 'जर्मनी', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'सँगै', 'फेरि', 'जर्मन', 'एकीकरण', 'भयो', '।']
 ['शायद', 'उत्तरकोरिया', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'का', 'दिन', 'कोरिया', 'पनि', 'एकीकरण', 'हुनेछ', '।']
 ['तर', ',', 'यी', 'उदाहरण', 'बाट', 'उठ्ने', 'केही', 'प्रश्न', 'छन्-', 'पूर्वी', 'र', 'पश्चिम', 'जर्मनी', 'दुई', 'फरक', '-', 'राज्य', 'हुँदा', 'के', 'को', 'राष्ट्रियता', 'पनि', 'थियो', 'वा', 'एउटै', '?']
 ['कोरिया', 'अहिले', 'उत्तर', 'र', 'दक्षिण', 'दुई', 'वटा', 'हुँदा', 'यी', '‘

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_train_wikiann_hi_164
INFO:Nep_NER_Log:***** Running training *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Num Epochs = 10
INFO:Nep_NER_Log:  Instantaneous batch size per GPU = 32
INFO:Nep_NER_Log:  Total train batch size (w. parallel, distributed & accumulation) = 32
INFO:Nep_NER_Log:  Gradient Accumulation steps = 1
INFO:Nep_NER_Log:  Total optimization steps = 40
INFO:Nep_NER_Log:  Continuing training from checkpoint, will skip to saved global_step
INFO:Nep_NER_Log:  Continuing training from epoch 0
INFO:Nep_NER_Log:  Continuing training from global step 0
INFO:Nep_NER_Log:  Will skip the first 0 steps in the first epoch
Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.72s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03,  1.74s/it][A
Iteration:  75%|███████▌  | 3/4 [00:05<00:01,  1.74s/it][A
Iteration: 100%|████████

Trained in Singh NER and fine tuned in NAAMII NER 

In [33]:
args, _ = get_args()
args.data_dir = "/content/nepali-ner/data/labeled/naami_ner"  # to-change: supply data directory
# args.input_dir = "naamii_ner"
args.input_dir = "./singh_ner"
args.output_dir = "singh2naamii_finetune" # to-change: supply output directory
args.model_type = "bert"
args.model_name_or_path = "./singh_ner"
# args.model_name_or_path = "./hi_ner_bert"
args.max_seq_length = 164
args.num_train_epochs = 10
args.per_gpu_train_batch_size = 32
args.save_steps = 10000
args.seed = 1
args.do_finetune = True
args.do_train = False
args.do_eval = True
args.do_predict = True
start_training(args)

INFO:Nep_NER_Log:Training/evaluation parameters Namespace(adam_epsilon=1e-08, cache_dir='', config_name='', data_dir='/content/nepali-ner/data/labeled/naami_ner', device=device(type='cuda'), do_eval=True, do_finetune=True, do_lower_case=False, do_predict=True, do_train=False, eval_all_checkpoints=False, evaluate_during_training=False, gradient_accumulation_steps=1, input_dir='./singh_ner', labels='', learning_rate=5e-05, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_seq_length=164, max_steps=-1, model_name_or_path='./singh_ner', model_type='bert', n_gpu=1, no_cuda=False, num_train_epochs=10, output_dir='singh2naamii_finetune', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=8, per_gpu_train_batch_size=32, save_steps=10000, seed=1, server_ip='', server_port='', tokenizer_name='', warmup_steps=0, weight_decay=0.0)
INFO:Nep_NER_Log:Loading features from cached file /content/nepali-ner/data/labeled/naami_ner/cached_test_singh_ner_164
INFO:Nep_NER_Log:*

 ['फेरि', 'तिनै', 'पाकिस्तानी', 'हरू', 'शेख', 'मुजिबर', 'रहमान', 'लाई', '‘', 'अलगाववादी', '’', 'भन्छन्', ',', 'जबकि', 'बंगलादेश', 'मा', 'राष्ट्र', 'का', 'संस्थापक', 'पिता', 'वा', 'बंगबन्धु', 'मानिन्छ', '।']
 ['सिक्किम', 'को', 'भारत', 'मा', 'विलयन', 'किन', 'राष्ट्रघात', ',', 'हस्तक्षेप', 'र', 'विस्तारवाद', 'हो', 'तर', 'तिब्बत', 'चीन', 'चाहिं', 'हैन', 'एकता', '?']
 ['दोस्रो', 'विश्वयुद्ध', 'ले', 'कोरिया', 'लाई', 'मात्र', 'हैन', ',', 'जर्मनी', 'पनि', 'दुई', 'राज्य', 'मा', 'बाँडे', 'को', 'थियो', '।']
 ['पूर्वी', 'जर्मनी', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'सँगै', 'फेरि', 'जर्मन', 'एकीकरण', 'भयो', '।']
 ['शायद', 'उत्तरकोरिया', 'को', 'साम्यवादी', 'सत्ता', 'ढले', 'का', 'दिन', 'कोरिया', 'पनि', 'एकीकरण', 'हुनेछ', '।']
 ['तर', ',', 'यी', 'उदाहरण', 'बाट', 'उठ्ने', 'केही', 'प्रश्न', 'छन्-', 'पूर्वी', 'र', 'पश्चिम', 'जर्मनी', 'दुई', 'फरक', '-', 'राज्य', 'हुँदा', 'के', 'को', 'राष्ट्रियता', 'पनि', 'थियो', 'वा', 'एउटै', '?']
 ['कोरिया', 'अहिले', 'उत्तर', 'र', 'दक्षिण', 'दुई', 'वटा', 'हुँदा', 'यी', '‘

INFO:Nep_NER_Log:Saving features into cached file /content/nepali-ner/data/labeled/naami_ner/cached_train_singh_ner_164
INFO:Nep_NER_Log:***** Running training *****
INFO:Nep_NER_Log:  Num examples = 100
INFO:Nep_NER_Log:  Num Epochs = 10
INFO:Nep_NER_Log:  Instantaneous batch size per GPU = 32
INFO:Nep_NER_Log:  Total train batch size (w. parallel, distributed & accumulation) = 32
INFO:Nep_NER_Log:  Gradient Accumulation steps = 1
INFO:Nep_NER_Log:  Total optimization steps = 40
INFO:Nep_NER_Log:  Continuing training from checkpoint, will skip to saved global_step
INFO:Nep_NER_Log:  Continuing training from epoch 0
INFO:Nep_NER_Log:  Continuing training from global step 0
INFO:Nep_NER_Log:  Will skip the first 0 steps in the first epoch
Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Iteration:  25%|██▌       | 1/4 [00:01<00:05,  1.70s/it][A
Iteration:  50%|█████     | 2/4 [00:03<00:03,  1.72s/it][A
Iteration:  75%|███████▌  | 3/4 [00:05<00:01,  1.72s/it][A
Iteration: 100%|█████████

# Inference

In [34]:
model = AutoModelForTokenClassification.from_pretrained("./singh_ner")

In [37]:
tokenizer = AutoTokenizer.from_pretrained("./singh_ner")

In [38]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")

In [45]:
exp = "मेरो नाम राम हो, म काठमाडौँ नेपाल मा बस्छु । "

In [46]:
def infer_and_visualize(input_text):
  entities = nlp(input_text)

  # Convert to format required for spacy displacy
  data = {}
  data["text"] = input_text
  data["ents"] = []

  for ent in entities:
    data["ents"].append({
        "start": ent["start"],
        "end": ent["end"],
        "label": ent["entity_group"]
    })
  displacy.render(data, style="ent", jupyter=True, manual=True)

In [47]:
infer_and_visualize(exp)