<a href="https://colab.research.google.com/github/ftmthb/NER/blob/main/ner_india_train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The notebook flow

* **Environment setup**:
This consists of:
  * creating a virtual environment in which we install the necessary libraries, so that they don't have to be reinstalled in each session.
  * Once that is done, we only need to append the path of the virtual environment's site-packages directory to sys.path at the beginning of each session, which is roughly equivalent to activating the environment.

* **Preprocess**: The goal of this part is to <ins> create training and development datasets </ins>. To achieve that we'll:
  * import the JSON data: train and development (validation)
  * get the indices of the documents <ins>with annotations</ins>, in other words those that contain special entities (named entities)
  * clean the data: the cleaning function removes special characters that the tokenizer ignores or treats as whitespace, such as '\xad', and sequences that would waste token space. For example, separators like '===' and '---' are tokenized separately and consume token slots unnecessarily. Since the tokenizer handles whitespace itself, we don't need to remove it during cleaning.

  * define a function that computes the new position of a special entity from the position given in the annotations: cleaning shifts character positions, and we need the updated positions later to keep the labels aligned.
  * define the model and the tokenizer, then tokenize the data. Tokenization gives a list of tokens for each document.
  * define a function that returns the token positions of a special entity from its character position in the text. This function is used to align the labels with the tokens.
  
  * create the per-token label lists from the token lists. (The tokens are a list of lists: one list of tokens per document.)
  * gather all the labels into labels_list, which is then encoded as integers
  * create the datasets:
    * with 0 for cls, sep and pad
    * with -100 for cls, sep and pad

* **Process**:
  * load the datasets
  * fine-tune the model on the different datasets
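
When a model fine-tuned on the -100 variant is evaluated, positions labelled -100 are typically dropped before entity-level metrics are computed. The following is a minimal sketch of that filtering step, assuming predictions and labels are plain lists of integer ids and that `id2label` is the inverse of the label encoding built during preprocessing. The function name `decode_for_eval` is hypothetical and does not come from the notebook.

```python
def decode_for_eval(pred_ids, label_ids, id2label, ignore_index=-100):
    """Drop positions whose gold label is ignore_index and map ids to label strings."""
    true_preds, true_labels = [], []
    for preds, labels in zip(pred_ids, label_ids):
        # keep only (prediction, gold) pairs at positions that are not ignored
        kept = [(p, l) for p, l in zip(preds, labels) if l != ignore_index]
        true_preds.append([id2label[p] for p, _ in kept])
        true_labels.append([id2label[l] for _, l in kept])
    return true_preds, true_labels


# toy example: cls and sep positions carry -100 in the gold labels
id2label = {0: 'O', 1: 'B-ORG', 2: 'I-ORG'}
preds, golds = decode_for_eval([[0, 1, 2, 0]], [[-100, 1, 2, -100]], id2label)
print(preds, golds)
```

The two returned lists have the same shape, which is what sequence-labelling metrics generally expect.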


# Conclusion


* The model performs better when we don't assign -100 to cls, sep and pad
* Some labels (GPE, OTHER_PERSON) give the following warning

    &emsp; | seems not to be NE tag   \\
After checking, these labels are present both in IOB/BIO format (we have B-GPE, I-GPE, B-OTHER_PERSON and I-OTHER_PERSON among the labels) and in raw format (we also have GPE and OTHER_PERSON),
which could be the source of the confusion.
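
One possible way to avoid mixing the two formats, shown only as a sketch and not as something the notebook does, would be to rewrite bare labels as single-token B- tags before encoding them. The helper name `to_bio` is hypothetical.

```python
def to_bio(labels):
    """Prefix bare entity labels (e.g. 'GPE') with 'B-'; leave 'O', 'B-*' and 'I-*' unchanged."""
    fixed = []
    for lab in labels:
        if lab == 'O' or lab.startswith(('B-', 'I-')):
            fixed.append(lab)
        else:
            fixed.append('B-' + lab)
    return fixed


print(to_bio(['O', 'GPE', 'B-ORG', 'I-ORG', 'OTHER_PERSON']))
```

Applied to every label list before the labels are collected, this would leave only `O`, `B-*` and `I-*` entries in the label set.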


# 1. Environment setup

## Create and activate venv to avoid installing libraries in every session

In [47]:
!pip install virtualenv

Collecting virtualenv
  Downloading virtualenv-20.26.2-py3-none-any.whl (3.9 MB)
Collecting distlib<1,>=0.3.7 (from virtualenv)
  Downloading distlib-0.3.8-py2.py3-none-any.whl (468 kB)
Installing collected packages: distlib, virtualenv
Successfully installed distlib-0.3.8 virtualenv-20.26.2


In [48]:
!virtualenv /content/drive/MyDrive/virtual_env


created virtual environment CPython3.10.12.final.0-64 in 26673ms
  creator CPython3Posix(dest=/content/drive/MyDrive/virtual_env, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: MarkupSafe==2.1.5, PyYAML==6.0.1, accelerate==0.30.1, aiohttp==3.9.5, aiosignal==1.3.1, async_timeout==4.0.3, attrs==23.2.0, certifi==2024.2.2, charset_normalizer==3.3.2, datasets==2.19.1, dill==0.3.8, evaluate==0.4.2, filelock==3.14.0, frozenlist==1.4.1, fsspec==2024.3.1, huggingface_hub==0.23.0, idna==3.7, jinja2==3.1.4, joblib==1.4.2, mpmath==1.3.0, multidict==6.0.5, multiprocess==0.70.16, networkx==3.3, numpy==1.26.4, nvidia_cublas_cu12==12.1.3.1, nvidia_cuda_cupti_cu12==12.1.105, nvidia_cuda_nvrtc_cu12==12.1.105, nvidia_cuda_runtime_cu12==12.1.105, nvidia_cudnn_cu12==8.9.2.26, nvidia_cufft_cu12==11.0.2.54, nvidia_curand_cu12==10.3.2.106, nvidia_cus

In [49]:
import sys
sys.path.append("/content/drive/MyDrive/virtual_env/lib/python3.10/site-packages")

### install libraries

#### datasets

In [52]:
!source /content/drive/MyDrive/virtual_env/bin/activate; pip install datasets --upgrade

Collecting datasets
  Downloading datasets-2.19.2-py3-none-any.whl.metadata (19 kB)
Collecting requests>=2.32.1 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Downloading datasets-2.19.2-py3-none-any.whl (542 kB)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Installing collected packages: requests, datasets
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstalling requests-2.31.0:
      Successfully uninstalled requests-2.31.0
  Attempting uninstall: datasets
    Found existing installation: datasets 2.19.1
    Uninstalling datasets-2.19.1:
      Successfully uninstalled datasets-2.19.1
Successfully installed datasets-2.19.2 requests-2.32.3


#### transformers [torch]

In [None]:
!source /content/drive/MyDrive/virtual_env/bin/activate; pip install transformers[torch]

Collecting transformers[torch]
  Downloading transformers-4.40.2-py3-none-any.whl.metadata (137 kB)
Collecting regex!=2019.12.17 (from transformers[torch])
  Downloading regex-2024.5.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers[torch])
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.1 (from transformers[torch])
  Downloading s

#### accelerate

In [None]:
!source /content/drive/MyDrive/virtual_env/bin/activate; pip install accelerate -U



#### seqeval

In [None]:
!source /content/drive/MyDrive/virtual_env/bin/activate; pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=911957a1f4c77fac1770f1f20b38337d07b0c706572a68b5950cc7c9f4c6ac67
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


#### evaluate

In [None]:
!source /content/drive/MyDrive/virtual_env/bin/activate; pip install evaluate

# 2. Preprocess

#### append path
 (equivalent to activating the virtualenv at the beginning of every session, to avoid reinstalling libraries)
 (only run this if the installation was done in a previous session; in other words, check first that the appended path is not already in sys.path)

In [2]:
import sys
sys.path.append("/content/drive/MyDrive/virtual_env/lib/python3.10/site-packages")
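
The check mentioned above can be folded into a small helper so the cell is safe to re-run. This is a sketch: the helper name `add_to_sys_path` is hypothetical, and the path is the one used in this notebook.

```python
import sys

VENV_SITE_PACKAGES = "/content/drive/MyDrive/virtual_env/lib/python3.10/site-packages"


def add_to_sys_path(path):
    """Append path to sys.path only if it is not already there; return True if it was added."""
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


add_to_sys_path(VENV_SITE_PACKAGES)
```

Calling the helper a second time leaves `sys.path` unchanged and returns `False`.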

### Data import
Keep only the documents with annotations, then clean the text.

In [3]:
import json

with open(r"/content/drive/MyDrive/ner_india/NER_TRAIN_JUDGEMENT.json") as file:
    json_data = json.load(file)
len(json_data)

9435

In [4]:
import json
with open(r"/content/drive/MyDrive/ner_india/NER_DEV_PREAMBLE.json") as file:
    dev_data = json.load(file)
print(len(dev_data))

125


#### only taking the indices with annotations

In [5]:
with_annot_indices = []
for indice in range(len(json_data)):
  for annot in json_data[indice]['annotations']:
    if annot['result']!= []:
      with_annot_indices.append(indice)

print('num of dcts with no annot:', len(json_data) - len(with_annot_indices))

num of dcts with no annot: 2177
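
The same selection can be written as a comprehension. This sketch assumes the document structure used above (an `annotations` list whose items have a `result` list). Unlike the loop, it adds each index at most once even if a document has several annotation blocks. The function name and the toy data are hypothetical.

```python
def annotated_indices(documents):
    """Return the indices of documents that have at least one non-empty annotation result."""
    return [
        i for i, doc in enumerate(documents)
        if any(annot['result'] for annot in doc['annotations'])
    ]


# toy data mimicking the JSON layout
toy_docs = [
    {'annotations': [{'result': []}]},
    {'annotations': [{'result': [{'value': {'start': 0, 'end': 3, 'labels': ['ORG']}}]}]},
    {'annotations': [{'result': []}, {'result': [{'value': {'start': 1, 'end': 2, 'labels': ['GPE']}}]}]},
]
print(annotated_indices(toy_docs))
```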


#### clean the text data

Technically, there is no need to remove whitespace: the tokenizer does not take it into account, so it has no effect on the prepared data. We only need to remove special characters such as \xad, \xa0, etc. They can be located with re.finditer(r'[\x80-\xff]', text), which finds every occurrence. To make sure all special characters are covered, we need to look for every such character in the text data.
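
A minimal sketch of such a scan, counting every character in the \x80-\xff range across a list of texts. The function name `count_special_chars` and the sample strings are hypothetical. Note that this range does not include control characters such as \x13, which would need their own pattern.

```python
import re
from collections import Counter


def count_special_chars(texts, pattern=r'[\x80-\xff]'):
    """Count every character matching pattern across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(m.group() for m in re.finditer(pattern, text))
    return counts


sample = ["soft\xadhyphen and non\xa0breaking\xa0space", "plain ascii"]
print(count_special_chars(sample))
```

The keys of the returned counter give the list of characters that the cleaning function needs to handle.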

In [6]:
import re
def clean_text_data(text_data):
    replacements = [
        (r'\.{2,}', ' '),
        (r'\={2,}', ' '),
        (r'\-{2,}', ' '),
        (r'\xad', ' '),
        (r'\xa0', ' '),
        (r'\x80', ''),
        (r'\x9d', ''),
        (r'\x13', ' '),
        ]

    for old, new in replacements:
      text_data = re.sub(old, new, text_data)
    return text_data

dev_text_data = []
for line in dev_data:
    dev_text_data.append(clean_text_data(line['data']['text']))

text_data = []
plain_text_data = []
# for line in json_data:
for idx in with_annot_indices:
    text_data.append(clean_text_data(json_data[idx]['data']['text']))
    plain_text_data.append(json_data[idx]['data']['text'])

#### get new position in clean data function

In [7]:
def get_new_position(old_text, new_text, pos):
  # find the position of a special entity in the cleaned text from its position in the raw text (given in the annotations)

  if new_text[pos] == old_text[pos]: #pos did not change
    return pos

  else: # we actually need to update the pos

    special_entity = old_text[pos]
    if clean_text_data(special_entity).split() == []:
      return 'not so special entity'

    if special_entity[0:2] == '--':
      special_entity = special_entity[2:] # the NE starts with '--', which the cleaning function removes
      pos = slice(pos.start + 2, pos.stop)

    if special_entity[0:1] == '-' and special_entity[1:2] != '-':
      special_entity = special_entity[1:] # the NE starts with '-' and is preceded by '--' in data['text']; the cleaning function removes the whole run
      pos = slice(pos.start + 1, pos.stop)

    in_old_text = [m.start() for m in re.finditer(re.escape(special_entity), old_text)]
    in_new_text = [m.start() for m in re.finditer(re.escape(clean_text_data(special_entity)), new_text)]

    if len(in_new_text) != len(in_old_text) and special_entity != '.':  # occurrence counts differ: written differently, with whitespace or special characters
      # diagnostic block: this branch should not be reached
      print(indice, "\n not same len")
      print('\t special_entity: ', special_entity, 'at pos', pos)
      print('in_old_text: ', in_old_text, '\t in_new_text: ',in_new_text)

    else:

      all_occ = dict(zip(in_old_text,in_new_text))

      start = all_occ[pos.start]

      end = start+len(clean_text_data(old_text[pos]))#-1

      # check
      if clean_text_data(old_text[pos]) != new_text[slice(start, end)]:
        print('not similar', old_text[pos], '\t', new_text[slice(start, end)])


      return slice(start,end)
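
For comparison, here is a sketch of a different technique, not the one used above: clean the text in a single pass while recording, for every raw character index, where it lands in the cleaned text. It assumes that applying the same rules as `clean_text_data` in one left-to-right pass is acceptable. The names `clean_with_map` and `old_to_new` are hypothetical.

```python
import re

# group 1: sequences replaced by a single space; group 2: characters deleted
PATTERN = re.compile(r'(\.{2,}|={2,}|-{2,}|[\xad\xa0\x13])|([\x80\x9d])')


def clean_with_map(text):
    """Return (cleaned_text, old_to_new), where old_to_new[i] is the index in the
    cleaned text corresponding to index i of the raw text (len(text) + 1 entries)."""
    out, old_to_new = [], []
    new_len, last = 0, 0
    for m in PATTERN.finditer(text):
        # copy the untouched characters before the match
        for k in range(last, m.start()):
            old_to_new.append(new_len)
            out.append(text[k])
            new_len += 1
        # every raw index inside the match maps to the start of its replacement
        old_to_new.extend([new_len] * (m.end() - m.start()))
        repl = ' ' if m.group(1) is not None else ''
        out.append(repl)
        new_len += len(repl)
        last = m.end()
    for k in range(last, len(text)):
        old_to_new.append(new_len)
        out.append(text[k])
        new_len += 1
    old_to_new.append(new_len)  # entry for the end-of-text position
    return ''.join(out), old_to_new


raw = "See ==== Section 3 ... of the Act"
start = raw.index("Section 3")
end = start + len("Section 3")
cleaned, old_to_new = clean_with_map(raw)
new_span = slice(old_to_new[start], old_to_new[end])
print(repr(cleaned[new_span]))
```

With such a map, an annotated span is translated by two lookups, without searching for occurrences of the entity text.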


### model and tokenizer
Initialize the tokenizer, then tokenize the train and dev data.

In [49]:
model = 'dslim/bert-base-NER'
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model)

#### tokenize train data

In [50]:
tokenized = tokenizer(text_data,  is_split_into_words = False, return_offsets_mapping = True, add_special_tokens = True, truncation = True, padding = 'max_length', max_length=512, stride=128, return_overflowing_tokens=True,
 )
tokens = [tokenizer.convert_ids_to_tokens(tokenizedx) for tokenizedx in tokenized.input_ids]

#### tokenization of the dev data with truncation and return overflowing

In [51]:
dev_tokenized = tokenizer(dev_text_data,  is_split_into_words = False,  return_offsets_mapping = True, add_special_tokens = True,  padding = 'max_length', truncation = True, max_length=512, stride=128, return_overflowing_tokens=True, )

dev_tokens = [tokenizer.convert_ids_to_tokens(tokenizedx) for tokenizedx in dev_tokenized.input_ids]
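
With `return_overflowing_tokens=True`, a long document is split into several overlapping chunks, and `overflow_to_sample_mapping` records which document each chunk came from. The idea can be sketched in plain Python. This only illustrates the windowing on lists of tokens; it is not the tokenizer's implementation, and it ignores special tokens and padding. The names `chunk_with_overlap`, `window` and `overlap` are hypothetical.

```python
def chunk_with_overlap(docs, window, overlap):
    """Split each token list into windows of at most `window` tokens, where
    consecutive windows share `overlap` tokens. Also return, for each chunk,
    the index of the document it came from."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks, sample_mapping = [], []
    for doc_idx, tokens in enumerate(docs):
        start = 0
        while True:
            chunks.append(tokens[start:start + window])
            sample_mapping.append(doc_idx)
            if start + window >= len(tokens):
                break  # this window reached the end of the document
            start += step
    return chunks, sample_mapping


docs = [list(range(10)), list(range(3))]
chunks, sample_mapping = chunk_with_overlap(docs, window=4, overlap=2)
print(chunks)
print(sample_mapping)
```

In this sketch `sample_mapping.index(1)` gives the position of the first chunk of document 1, which is the kind of lookup the dev labelling cell performs on `overflow_to_sample_mapping`.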

## Dataset preparation
Get token positions, then assign labels to the train and dev data.

#### get token pos

In [52]:
def get_token_pos(mappers, start_str, end_str):
  # get position in tokens list from annotations & offset_mapping
  for idx, (start, end) in enumerate(mappers):
    if idx<= mappers.index(max(mappers)):
        if start <= start_str:
            idx_start = idx
        if end == end_str:
            idx_end = idx+1
        if end < end_str:
            idx_end = idx+2

  pos = slice(idx_start, idx_end)
  return pos
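
For comparison, here is a sketch of an overlap-based variant, which is not the function above: a token belongs to the entity if its character range intersects the entity's range. It assumes that special and padding tokens carry the offset (0, 0), so they are skipped by the `s < e` test. The function name and the hand-written offsets are hypothetical.

```python
def char_span_to_token_slice(offsets, start_char, end_char):
    """Return the slice of token indices whose (start, end) offsets overlap
    the character span [start_char, end_char), or None if no token does."""
    hits = [
        i for i, (s, e) in enumerate(offsets)
        if s < e and s < end_char and e > start_char
    ]
    if not hits:
        return None
    return slice(hits[0], hits[-1] + 1)


# offsets for "[CLS] High Court of Delhi [SEP]" over the text "High Court of Delhi"
offsets = [(0, 0), (0, 4), (5, 10), (11, 13), (14, 19), (0, 0)]
print(char_span_to_token_slice(offsets, 0, 10))   # span of "High Court"
print(char_span_to_token_slice(offsets, 14, 19))  # span of "Delhi"
```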

### Label assignment

#### assign labels on train data

In [53]:
#initialize all tokens labels as 'O': Outside
tokens_labels = [['O' for mapper in tokenized["offset_mapping"][i]] for i in range(len(tokenized["offset_mapping"]))]

 #then update based on the str position of the special entity and the get token pos (from str to tokens)

for i, indice in enumerate(with_annot_indices):
  # print(indice)
  mappers = tokenized["offset_mapping"][i]
  raw_text = json_data[indice]['data']['text']

  for annot in json_data[indice]['annotations']:    #list of all the entities that are part of the deal
    for result in annot['result']:
      deb_str = result['value']['start']
      fin_str = result['value']['end']
      clean_pos = get_new_position(raw_text, text_data[i], slice(deb_str,fin_str) )
      if type(clean_pos) == slice:
        pos = get_token_pos(mappers, clean_pos.start, clean_pos.stop )

        label = result['value']['labels'][0]          #[0] to remove from list
        if tokens[i][pos][0] == result['value']['text']:   #when the special entity == token, not cut
          tokens_labels[i][pos.start] = label

        else:
          tokens_labels[i][pos.start] = 'B-'+label     #only put B- when there is an I-
          for x in range(pos.start+1, pos.stop):
            if x <len(tokens_labels[i]):
              tokens_labels[i][x] = 'I-'+label

        if ''.join(tokens[i][pos]).replace('##', '') != ''.join(text_data[i][clean_pos].split()):
          print(indice, i, pos,  text_data[i][clean_pos], pos, tokens[i][pos])

      else:
        print(clean_pos)


not so special entity
1423 1103 slice(40, 44, None) September  30,	 slice(40, 44, None) ['September', '30', ',', '1989']
3008 2335 slice(40, 48, None) Clause 8(vi)

 slice(40, 48, None) ['Claus', '##e', '8', '(', 'v', '##i', ')', '(']
3917 3038 slice(44, 53, None) section 3(1)(c)	  slice(44, 53, None) ['section', '3', '(', '1', ')', '(', 'c', ')', 'of']
4657 3606 slice(66, 74, None) Articles 14 and 19 (1)
 slice(66, 74, None) ['Articles', '14', 'and', '19', '(', '1', ')', '(']
5100 3940 slice(27, 35, None) Cr.P.C. 

  slice(27, 35, None) ['C', '##r', '.', 'P', '.', 'C', '.', 'C']
5824 4510 slice(13, 38, None) 'Paupuk Kannu Anni v. Thoppayya Mudaliar', (J) :   slice(13, 38, None) ["'", 'Pa', '##up', '##uk', 'Ka', '##nn', '##u', 'Ann', '##i', 'v', '.', 'T', '##hop', '##pa', '##yya', 'Mu', '##dal', '##iar', "'", ',', '(', 'J', ')', ':', 'Claus']
6042 4687 slice(34, 40, None) Rahmania Coffee Works. slice(34, 40, None) ['Rahman', '##ia', 'Coffee', 'Works', '.', '32']
6144 4778 slice(27, 29
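
The B-/I- tagging step itself can be isolated in a small helper. This sketch always writes a B- tag on the first token, including for single-token entities, which differs from the train cell above (it writes the bare label in that case). The helper name `assign_bio` is hypothetical.

```python
def assign_bio(labels, token_slice, label):
    """Write 'B-<label>' on the first token of the slice and 'I-<label>' on the rest,
    clipping the slice to the length of the label list. Modifies labels in place."""
    stop = min(token_slice.stop, len(labels))
    if token_slice.start >= stop:
        return labels  # the slice falls outside the list: nothing to tag
    labels[token_slice.start] = 'B-' + label
    for x in range(token_slice.start + 1, stop):
        labels[x] = 'I-' + label
    return labels


labels = ['O'] * 6
assign_bio(labels, slice(1, 4), 'COURT')
assign_bio(labels, slice(5, 8), 'GPE')  # clipped at the end of the list
print(labels)
```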

#### assign labels on dev data

In [81]:
# label assignment on dev data, handling entities repeated across overlapping chunks

dev_tokens_labels = [['O' for mapper in dev_tokenized["offset_mapping"][i]] for i in range(len(dev_tokenized["offset_mapping"]))]

# an entity near the end of one chunk also appears in the stride (overlap) at the start of the next chunk,
# so the offset mapping of each chunk is used to find the token index and label the entity in both places


for indice in range(len(dev_data)):
  # print(indice)
  tkn_indice = dev_tokenized['overflow_to_sample_mapping'].index(indice)
  # print('tkn_indice: ', tkn_indice)

  for annot in dev_data[indice]['annotations']: #list of all the entities that are part of the deal
    for result in annot['result']:
      deb_str = result['value']['start']
      fin_str = result['value']['end']
      label = result['value']['labels'][0]

      raw_text = dev_data[indice]['data']['text']
      special_entity = result['value']['text']
      clean_pos = get_new_position(raw_text,dev_text_data[indice], slice(deb_str,fin_str) ) # position in the cleaned text, to be matched against the offset mapping

      if dev_tokenized['overflow_to_sample_mapping'].count(indice) > 1:

        # the text was split into chunks, so use the offset mapping to check whether the special entity is repeated in them

        for i in range(dev_tokenized['overflow_to_sample_mapping'].count(indice)+1) :
          if clean_pos.start > dev_tokenized["offset_mapping"][tkn_indice+i+1][2][0] and clean_pos.stop < max(dev_tokenized["offset_mapping"][tkn_indice+i+1])[1]:
              #the special entity was found inside the following chunk

            mapper = dev_tokenized["offset_mapping"][tkn_indice+i+1]
            token_pos =  get_token_pos(mapper, clean_pos.start, clean_pos.stop)

            dev_tokens_labels[tkn_indice+i+1][token_pos.start] = 'B-'+label

            for x in range(token_pos.start+1, token_pos.stop):
              if x <len(dev_tokens_labels[tkn_indice+i+1]):
                dev_tokens_labels[tkn_indice+i+1][x] = 'I-'+label


          if clean_pos.start > dev_tokenized["offset_mapping"][tkn_indice+i][2][0] and clean_pos.stop < max(dev_tokenized["offset_mapping"][tkn_indice+i])[1]:
            #update the clean_pos too
            tkn_indice += i
            break

      mapper = dev_tokenized["offset_mapping"][tkn_indice]
      token_pos =  get_token_pos(mapper, clean_pos.start, clean_pos.stop)

      #check similarity

      if ''.join(dev_tokens[tkn_indice][token_pos]).replace('##', '') != ''.join(dev_text_data[indice][clean_pos].split()):
        print('mismatch at:', indice,tkn_indice, special_entity, dev_tokens[tkn_indice][token_pos], '\t', token_pos, '\t', slice(deb_str,fin_str),  clean_pos )

      dev_tokens_labels[tkn_indice][token_pos.start] = 'B-'+label

      for x in range(token_pos.start+1, token_pos.stop):
        if x <len(dev_tokens_labels[tkn_indice]):
          dev_tokens_labels[tkn_indice][x] = 'I-'+label

        else:
          if x!=512 :
            print('not simillar,  \t', 'indice: \t', indice, '\t x: \t', x )


mismatch at: 17 20 High Court For The State Of Telangana
 ['High', 'Court', 'For', 'The', 'State', 'Of', 'Telangana', 'At'] 	 slice(3, 11, None) 	 slice(7, 45, None) slice(7, 45, None)


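This is not a real mismatch: the entity is located correctly, and the token slice only includes one extra trailing token ('At').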

## Datasets creation
Build labels_list, encode the labels as integers (int_labels), and generate the datasets.

### labels lists

#### encode labels (str -> int)

In [55]:
ne_distribution = {}
## objective: {'ne' : num_occ}
for labels in tokens_labels:
  for idx in range(len(labels)):
    if labels[idx] in ne_distribution.keys():
      ne_distribution[labels[idx]] +=1
    else:
      ne_distribution[labels[idx]] = 1

for labels in dev_tokens_labels:
  for idx in range(len(labels)):
    if labels[idx] in ne_distribution.keys():
      ne_distribution[labels[idx]] +=1
    else:
      ne_distribution[labels[idx]] = 1


print(len(ne_distribution ))
import pandas as pd

pd.DataFrame.from_dict(dict(sorted(ne_distribution.items(), key = lambda x: x[1], reverse=True)),  orient='index').head(3)

40


Unnamed: 0,0
O,3678669
I-PRECEDENT,24789
I-CASE_NUMBER,9546
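
The same count can be obtained more compactly with `collections.Counter`. This is a sketch on toy label lists; the function name `label_distribution` is hypothetical.

```python
from collections import Counter
from itertools import chain


def label_distribution(*label_lists):
    """Count labels across several datasets, each given as a list of per-document label lists."""
    per_document = chain.from_iterable(label_lists)    # all per-document lists
    return Counter(chain.from_iterable(per_document))  # all individual labels


train_labels = [['O', 'B-ORG'], ['O']]
dev_labels = [['O', 'I-ORG']]
dist = label_distribution(train_labels, dev_labels)
print(dist.most_common(1))
```

`most_common()` returns the labels sorted by frequency, which replaces the manual sort before building the DataFrame.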


In [56]:
labels_list = list(ne_distribution.keys())
len(labels_list)

40

In [57]:
int_labels = {}

for i, lab in enumerate(labels_list):
  int_labels[lab] = i
print(int_labels)

{'O': 0, 'B-ORG': 1, 'I-ORG': 2, 'B-OTHER_PERSON': 3, 'I-OTHER_PERSON': 4, 'B-WITNESS': 5, 'I-WITNESS': 6, 'GPE': 7, 'B-STATUTE': 8, 'I-STATUTE': 9, 'B-DATE': 10, 'I-DATE': 11, 'B-PROVISION': 12, 'I-PROVISION': 13, 'B-COURT': 14, 'I-COURT': 15, 'B-PRECEDENT': 16, 'I-PRECEDENT': 17, 'B-GPE': 18, 'I-GPE': 19, 'B-CASE_NUMBER': 20, 'I-CASE_NUMBER': 21, 'ORG': 22, 'B-PETITIONER': 23, 'I-PETITIONER': 24, 'B-JUDGE': 25, 'I-JUDGE': 26, 'WITNESS': 27, 'B-RESPONDENT': 28, 'I-RESPONDENT': 29, 'STATUTE': 30, 'RESPONDENT': 31, 'OTHER_PERSON': 32, 'DATE': 33, 'JUDGE': 34, 'PETITIONER': 35, 'PROVISION': 36, 'CASE_NUMBER': 37, 'B-LAWYER': 38, 'I-LAWYER': 39}
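
The mapping above depends on the order in which labels are first encountered. A sketch of a deterministic alternative keeps 'O' at 0 and sorts the remaining labels, and also builds the inverse mapping needed to decode predictions. This is an assumption about what might be convenient, not what the notebook does, and `build_label_maps` is a hypothetical name.

```python
def build_label_maps(labels):
    """Return (label2id, id2label) with 'O' mapped to 0 and the other labels sorted alphabetically."""
    ordered = ['O'] + sorted(set(labels) - {'O'})
    label2id = {lab: i for i, lab in enumerate(ordered)}
    id2label = {i: lab for lab, i in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(['O', 'I-ORG', 'B-ORG', 'B-GPE', 'O'])
print(label2id)
print(id2label)
```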


#### int train labels

In [58]:
import copy
int_tokens_labels = copy.deepcopy(tokens_labels)

for i in range(len(tokens_labels)):
  for j in range(len(tokens_labels[i])):
    int_tokens_labels[i][j] = int_labels[tokens_labels[i][j]]

#### int dev labels list

In [59]:
import copy
int_dev_tokens_labels = copy.deepcopy(dev_tokens_labels)

for i in range(len(dev_tokens_labels)):
  for j in range(len(dev_tokens_labels[i])):
    int_dev_tokens_labels[i][j] = int_labels[dev_tokens_labels[i][j]]

### generate datasets

#### generate the train dataset from with_annot_indices

In [60]:
def no_ne_gen():
  i = 0
  for indice in range(len(with_annot_indices)):
    yield {'labels' : int_tokens_labels[i], 'input_ids': tokenized['input_ids'][i] , 'attention_mask': tokenized['attention_mask'][i]}
    i+=1

from datasets import Dataset
with_ne_train_dt = Dataset.from_generator(no_ne_gen)
with_ne_train_dt

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 7258
})

#### generate the dev dataset

In [61]:
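def dev_gen():
  # note: this yields one row per document in dev_data, i.e. only the first len(dev_data) chunks;
  # overflow chunks stored at higher indices of dev_tokenized are not included
  i = 0
  for line in dev_data: #   'id':i,  'tokens': dev_tokens[i],
    yield {'labels' : int_dev_tokens_labels[i], 'input_ids': dev_tokenized['input_ids'][i] , 'attention_mask': dev_tokenized['attention_mask'][i]}
    i+=1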

# from datasets import Dataset
dev_dt = Dataset.from_generator(dev_gen)
dev_dt

Generating train split: 0 examples [00:00, ? examples/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 125
})

#### display function

In [3]:
from datasets import ClassLabel, Sequence
import pandas as pd
from IPython.display import display, HTML

def display_dt(dataset, num_examples=10):
    assert num_examples <= len(dataset);

    df = pd.DataFrame(dataset[:num_examples])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))


In [62]:
with_ne_train_dt_no_100 = with_ne_train_dt
dev_dt_no_100 = dev_dt

In [None]:
display_dt(with_ne_train_dt_no_100, 2)

Unnamed: 0,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 113, 128, 114, 1212, 2747, 15027, 1616, 1118, 1103, 3096, 1732, 1164, 1126, 3990, 1104, 19003, 119, 122, 117, 1955, 117, 3413, 117, 2260, 1113, 14304, 1334, 1104, 3475, 19890, 1403, 2950, 3300, 1104, 1134, 170, 6307, 5633, 1110, 5452, 1120, 185, 119, 1969, 1104, 15187, 3051, 112, 188, 2526, 1520, 117, 3560, 21812, 4702, 7402, 1115, 1122, 1108, 2272, 1106, 4891, 1121, 24535, 117, 16890, 24287, 111, 3291, 119, 1113, 1103, 3142, 1104, 1117, 13455, 170, 3238, 4551, 1110, 1508, 1118, 1366, 1113, 1115, 6307, 5633, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
1,"[0, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 1124, 1108, 1145, 1455, 2480, 138, 4873, 1161, 133, 8492, 1705, 134, 107, 4610, 168, 3087, 107, 25021, 134, 107, 8492, 168, 126, 107, 135, 15531, 1592, 1302, 119, 2724, 1545, 118, 24044, 1104, 1772, 127, 133, 120, 8492, 135, 14812, 2149, 117, 1534, 118, 1107, 118, 1644, 1104, 1103, 10281, 2077, 10380, 1121, 22515, 17670, 9962, 1389, 5329, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


In [44]:
display_dt(dev_dt_no_100, 2)

Unnamed: 0,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 38, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, ...]","[101, 109, 199, 1969, 115, 1130, 1109, 1693, 2031, 2096, 6175, 1335, 1203, 6175, 110, 13063, 10517, 1113, 131, 1955, 119, 5004, 119, 10351, 116, 6603, 119, 138, 8661, 119, 5311, 1545, 120, 1857, 111, 140, 1306, 1302, 1116, 119, 3993, 11964, 1477, 120, 1857, 117, 15722, 25631, 120, 10351, 117, 3236, 16382, 1571, 120, 10351, 13078, 11037, 3291, 4492, 119, 138, 24756, 9180, 4737, 131, 1828, 119, 156, 119, 153, 119, 18489, 117, 1828, 119, 15619, 5443, 6583, 144, 2312, 1830, 14518, 117, 1828, 119, 11896, 1197, 5329, 1105, 1828, 119, 153, 13148, 6610, 5329, 14812, 1179, 11487, 117, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,"[0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 28, 29, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 122, 7178, 1895, 1130, 1109, 3732, 2031, 2096, 1726, 3145, 138, 24756, 8052, 23915, 4889, 15906, 3145, 13969, 1302, 119, 3102, 1545, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 26409, 11049, 2096, 1410, 114, 20967, 1197, 2687, 7418, 1394, 1324, 12189, 15513, 1394, 1324, 795, 138, 24756, 9180, 159, 1116, 119, 1426, 2096, 15019, 111, 2926, 1116, 119, 795, 11336, 20080, 16838, 9857, 1556, 3145, 13969, 1302, 119, 3102, 1559, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 27724, 19297, 2096, 1410, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


#### assign -100 to cls, sep and pad

In [63]:
def affect_minus_100(dt):
  # find the positions of 101, 102 and 0 (cls, sep and pad) in input_ids and set the label at those positions to -100,
  # without adding -100 to labels_list
  input_ids = dt['input_ids']
  labels = dt['labels']

  new_labs = labels.copy()

  for idx, input_id in enumerate(input_ids):
    if input_id in [101, 102, 0]:
      new_labs[idx] = -100

  return {'labels': new_labs}
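
The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it do not contribute to the loss. The effect of the masking on a single row can be sketched in plain Python. The function name `mask_special_positions` and the toy row are hypothetical; the ids 101, 102 and 0 are the ones used above.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_special_positions(input_ids, labels, special_ids=(101, 102, 0)):
    """Return a copy of labels with IGNORE_INDEX wherever input_ids holds a special id."""
    return [
        IGNORE_INDEX if tok in special_ids else lab
        for tok, lab in zip(input_ids, labels)
    ]


row = {'input_ids': [101, 1130, 1109, 102, 0, 0], 'labels': [0, 14, 15, 0, 0, 0]}
print(mask_special_positions(row['input_ids'], row['labels']))
```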



In [50]:
display_dt(dev_dt_with_100, 2)

Unnamed: 0,labels,input_ids,attention_mask
0,"[-100, 0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 38, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, ...]","[101, 109, 199, 1969, 115, 1130, 1109, 1693, 2031, 2096, 6175, 1335, 1203, 6175, 110, 13063, 10517, 1113, 131, 1955, 119, 5004, 119, 10351, 116, 6603, 119, 138, 8661, 119, 5311, 1545, 120, 1857, 111, 140, 1306, 1302, 1116, 119, 3993, 11964, 1477, 120, 1857, 117, 15722, 25631, 120, 10351, 117, 3236, 16382, 1571, 120, 10351, 13078, 11037, 3291, 4492, 119, 138, 24756, 9180, 4737, 131, 1828, 119, 156, 119, 153, 119, 18489, 117, 1828, 119, 15619, 5443, 6583, 144, 2312, 1830, 14518, 117, 1828, 119, 11896, 1197, 5329, 1105, 1828, 119, 153, 13148, 6610, 5329, 14812, 1179, 11487, 117, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,"[-100, 0, 0, 0, 0, 0, 14, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 28, 29, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 122, 7178, 1895, 1130, 1109, 3732, 2031, 2096, 1726, 3145, 138, 24756, 8052, 23915, 4889, 15906, 3145, 13969, 1302, 119, 3102, 1545, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 26409, 11049, 2096, 1410, 114, 20967, 1197, 2687, 7418, 1394, 1324, 12189, 15513, 1394, 1324, 795, 138, 24756, 9180, 159, 1116, 119, 1426, 2096, 15019, 111, 2926, 1116, 119, 795, 11336, 20080, 16838, 9857, 1556, 3145, 13969, 1302, 119, 3102, 1559, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 27724, 19297, 2096, 1410, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


#### save datasets

In [None]:
# O, cls, sep and pad labelled as 0
with_ne_train_dt.save_to_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_no_100')
dev_dt.save_to_disk('/content/drive/MyDrive/ner_india/dev_dtst_no_100')


# cls, sep and pad labelled as -100 (O stays 0)
with_ne_train_dt.map(affect_minus_100).save_to_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_with-100')
dev_dt.map(affect_minus_100).save_to_disk('/content/drive/MyDrive/ner_india/dev_dtst_with-100')


In [64]:
with_ne_train_dt_with_100 = with_ne_train_dt.map(affect_minus_100)
dev_dt_with_100 = dev_dt.map(affect_minus_100)

Map:   0%|          | 0/7258 [00:00<?, ? examples/s]

Map:   0%|          | 0/125 [00:00<?, ? examples/s]
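The two saved variants differ only in how the special positions are labelled, and that changes which tokens the loss sees. The sketch below illustrates this in plain Python. It relies on the PyTorch cross-entropy default of `ignore_index=-100`. The helper names `mask_special_tokens` and `mean_cross_entropy` are hypothetical (this is not the notebook's `affect_minus_100`), and the ids 0, 101 and 102 are assumed to be `[PAD]`, `[CLS]` and `[SEP]` as in the `dslim/bert-base-NER` vocabulary.

```python
import math

IGNORE_INDEX = -100
SPECIAL_IDS = {0, 101, 102}  # assumed ids of [PAD], [CLS], [SEP]

def mask_special_tokens(input_ids, labels):
    """Replace the label of every special token with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok in SPECIAL_IDS else lab
            for tok, lab in zip(input_ids, labels)]

def mean_cross_entropy(probs_of_true_label, labels):
    """Average -log p(true label) over the positions that are not ignored."""
    kept = [-math.log(p) for p, lab in zip(probs_of_true_label, labels)
            if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

# Toy example: two real tokens followed by [SEP] and padding.
input_ids = [101, 1130, 1109, 102, 0, 0]
labels = [0, 14, 15, 0, 0, 0]
probs = [0.99, 0.5, 0.5, 0.99, 0.99, 0.99]  # made-up model confidence per position

masked = mask_special_tokens(input_ids, labels)
print(masked)                             # [-100, 14, 15, -100, -100, -100]
print(mean_cross_entropy(probs, masked))  # about 0.69: only the real tokens count
print(mean_cross_entropy(probs, labels))  # about 0.24: easy special tokens dilute the loss
```

Padding positions are easy to predict as `O`, so counting them lowers the average loss and raises token accuracy without any change in how entities are recognised. This may explain part of the gap between the runs with and without -100 further down.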

# 3. Process

#### load datasets
and display them

In [None]:
import sys
sys.path.append("/content/drive/MyDrive/virtual_env/lib/python3.10/site-packages")

#### display function

In [2]:
from datasets import ClassLabel, Sequence
import pandas as pd
from IPython.display import display, HTML

def display_dt(dataset, num_examples=10):
    assert num_examples <= len(dataset)

    df = pd.DataFrame(dataset[:num_examples])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))
# show_random_elements(mapped_dataset)

#### loading

In [3]:
from datasets import load_from_disk
with_ne_train_dtst_with_100 = load_from_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_with-100')
display_dt(with_ne_train_dtst_with_100, 2)

Unnamed: 0,labels,input_ids,attention_mask
0,"[-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, ...]","[101, 113, 128, 114, 1212, 2747, 15027, 1616, 1118, 1103, 3096, 1732, 1164, 1126, 3990, 1104, 19003, 119, 122, 117, 1955, 117, 3413, 117, 2260, 1113, 14304, 1334, 1104, 3475, 19890, 1403, 2950, 3300, 1104, 1134, 170, 6307, 5633, 1110, 5452, 1120, 185, 119, 1969, 1104, 15187, 3051, 112, 188, 2526, 1520, 117, 3560, 21812, 4702, 7402, 1115, 1122, 1108, 2272, 1106, 4891, 1121, 24535, 117, 16890, 24287, 111, 3291, 119, 1113, 1103, 3142, 1104, 1117, 13455, 170, 3238, 4551, 1110, 1508, 1118, 1366, 1113, 1115, 6307, 5633, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
1,"[-100, 0, 0, 0, 0, 0, 3, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 4, 0, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, ...]","[101, 1124, 1108, 1145, 1455, 2480, 138, 4873, 1161, 133, 8492, 1705, 134, 107, 4610, 168, 3087, 107, 25021, 134, 107, 8492, 168, 126, 107, 135, 15531, 1592, 1302, 119, 2724, 1545, 118, 24044, 1104, 1772, 127, 133, 120, 8492, 135, 14812, 2149, 117, 1534, 118, 1107, 118, 1644, 1104, 1103, 10281, 2077, 10380, 1121, 22515, 17670, 9962, 1389, 5329, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


In [4]:
# from datasets import load_from_disk
dev_dtst_with_100 = load_from_disk('/content/drive/MyDrive/ner_india/dev_dtst_with-100')
display_dt(dev_dtst_with_100, 2)

Unnamed: 0,labels,input_ids,attention_mask
0,"[-100, 0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 38, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, ...]","[101, 109, 199, 1969, 115, 1130, 1109, 1693, 2031, 2096, 6175, 1335, 1203, 6175, 110, 13063, 10517, 1113, 131, 1955, 119, 5004, 119, 10351, 116, 6603, 119, 138, 8661, 119, 5311, 1545, 120, 1857, 111, 140, 1306, 1302, 1116, 119, 3993, 11964, 1477, 120, 1857, 117, 15722, 25631, 120, 10351, 117, 3236, 16382, 1571, 120, 10351, 13078, 11037, 3291, 4492, 119, 138, 24756, 9180, 4737, 131, 1828, 119, 156, 119, 153, 119, 18489, 117, 1828, 119, 15619, 5443, 6583, 144, 2312, 1830, 14518, 117, 1828, 119, 11896, 1197, 5329, 1105, 1828, 119, 153, 13148, 6610, 5329, 14812, 1179, 11487, 117, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,"[-100, 0, 0, 0, 0, 0, 14, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 28, 29, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 122, 7178, 1895, 1130, 1109, 3732, 2031, 2096, 1726, 3145, 138, 24756, 8052, 23915, 4889, 15906, 3145, 13969, 1302, 119, 3102, 1545, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 26409, 11049, 2096, 1410, 114, 20967, 1197, 2687, 7418, 1394, 1324, 12189, 15513, 1394, 1324, 795, 138, 24756, 9180, 159, 1116, 119, 1426, 2096, 15019, 111, 2926, 1116, 119, 795, 11336, 20080, 16838, 9857, 1556, 3145, 13969, 1302, 119, 3102, 1559, 2096, 17881, 1475, 113, 10789, 4253, 3929, 2096, 156, 1233, 1643, 113, 140, 114, 1302, 119, 27724, 19297, 2096, 1410, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


##### labels_list

In [9]:
labels_list = ['O', 'B-ORG', 'I-ORG', 'B-OTHER_PERSON', 'I-OTHER_PERSON', 'B-WITNESS', 'I-WITNESS', 'GPE', 'B-STATUTE', 'I-STATUTE', 'B-DATE', 'I-DATE', 'B-PROVISION', 'I-PROVISION', 'B-COURT', 'I-COURT', 'B-PRECEDENT', 'I-PRECEDENT', 'B-GPE', 'I-GPE', 'B-CASE_NUMBER', 'I-CASE_NUMBER', 'ORG', 'B-PETITIONER', 'I-PETITIONER', 'B-JUDGE', 'I-JUDGE', 'WITNESS', 'B-RESPONDENT', 'I-RESPONDENT', 'STATUTE', 'RESPONDENT', 'OTHER_PERSON', 'DATE', 'JUDGE', 'PETITIONER', 'PROVISION', 'CASE_NUMBER', 'B-LAWYER', 'I-LAWYER']

int_labels = {}
for i, lab in enumerate(labels_list):
  int_labels[lab] = i

print(int_labels)

{'O': 0, 'B-ORG': 1, 'I-ORG': 2, 'B-OTHER_PERSON': 3, 'I-OTHER_PERSON': 4, 'B-WITNESS': 5, 'I-WITNESS': 6, 'GPE': 7, 'B-STATUTE': 8, 'I-STATUTE': 9, 'B-DATE': 10, 'I-DATE': 11, 'B-PROVISION': 12, 'I-PROVISION': 13, 'B-COURT': 14, 'I-COURT': 15, 'B-PRECEDENT': 16, 'I-PRECEDENT': 17, 'B-GPE': 18, 'I-GPE': 19, 'B-CASE_NUMBER': 20, 'I-CASE_NUMBER': 21, 'ORG': 22, 'B-PETITIONER': 23, 'I-PETITIONER': 24, 'B-JUDGE': 25, 'I-JUDGE': 26, 'WITNESS': 27, 'B-RESPONDENT': 28, 'I-RESPONDENT': 29, 'STATUTE': 30, 'RESPONDENT': 31, 'OTHER_PERSON': 32, 'DATE': 33, 'JUDGE': 34, 'PETITIONER': 35, 'PROVISION': 36, 'CASE_NUMBER': 37, 'B-LAWYER': 38, 'I-LAWYER': 39}


In [51]:
tokenizer

BertTokenizerFast(name_or_path='dslim/bert-base-NER', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [54]:
model

'dslim/bert-base-NER'

In [26]:
import transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer, DataCollatorForTokenClassification, Trainer, TrainingArguments, pipeline

In [None]:
# Load seqeval.
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

# Create the list with the tags.

# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]
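`seqeval` scores whole entities rather than single tokens, which is why precision, recall and F1 can stay low while accuracy looks high. The sketch below is a simplified, strict reading of BIO tags written for illustration only. It is not seqeval's implementation, and the function names `extract_entities` and `entity_scores` are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO sequence; end is exclusive."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # the sentinel closes a trailing span
        if tag.startswith('I-') and current == tag[2:]:
            continue                              # still inside the current entity
        if current is not None:                   # any other tag closes the open entity
            spans.add((current, start, i))
            start, current = None, None
        if tag.startswith('B-'):                  # only B- opens a new entity
            start, current = i, tag[2:]
    return spans

def entity_scores(true_tags, pred_tags):
    """Entity-level precision, recall and F1, plus token-level accuracy."""
    true_spans = extract_entities(true_tags)
    pred_spans = extract_entities(pred_tags)
    correct = len(true_spans & pred_spans)        # exact match on type and boundaries
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
    return precision, recall, f1, accuracy

true = ['O', 'B-COURT', 'I-COURT', 'I-COURT', 'O', 'O', 'O', 'O']
pred = ['O', 'B-COURT', 'I-COURT', 'O', 'O', 'O', 'O', 'O']
print(entity_scores(true, pred))  # (0.0, 0.0, 0.0, 0.875)
```

One wrong boundary is enough to lose the whole entity, even though seven of the eight tokens are tagged correctly.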

#### model

In [40]:
# Load seqeval.
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

# Create the list with the tags.

# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [67]:
model_checkpoint = 'dslim/bert-base-NER'
batch_size = 16

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    model_name,
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    remove_unused_columns=False
)

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list), ignore_mismatched_sizes=True)


data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dt_with_100,
    eval_dataset= dev_dt_with_100,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at dslim/bert-base-NER and are newly initialized because the shapes did not match:
- classifier.bias: found shape torch.Size([9]) in the checkpoint and torch.Size([40]) in the model instantiated
- classifier.weight: found shape torch.Size([9, 768])

#### train on created datasets

In [68]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.979785,0.062182,0.142091,0.086507,0.554392
2,0.352500,2.026945,0.069888,0.15639,0.096605,0.582652
3,0.134600,2.195241,0.074827,0.164433,0.102851,0.578243


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1362, training_loss=0.204241961284531, metrics={'train_runtime': 1993.8901, 'train_samples_per_second': 10.92, 'train_steps_per_second': 0.683, 'total_flos': 5691430232801280.0, 'train_loss': 0.204241961284531, 'epoch': 3.0})
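The `global_step` value can be checked against the dataset size. The sketch below is a small arithmetic check, assuming a single device and no gradient accumulation. The function name `total_training_steps` is hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch may be smaller
    return steps_per_epoch * num_epochs

# 7258 training examples, batch size 16, 3 epochs (values taken from this notebook)
print(total_training_steps(num_examples=7258, batch_size=16, num_epochs=3))  # 1362
```

This gives 454 steps per epoch. If the default logging interval of 500 steps was in effect, the first training-loss log would fall in the second epoch, which may explain the "No log" entry for epoch 1.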

##### train on datasets without -100

In [76]:
display_dt(with_ne_train_dt, 1)

Unnamed: 0,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 113, 128, 114, 1212, 2747, 15027, 1616, 1118, 1103, 3096, 1732, 1164, 1126, 3990, 1104, 19003, 119, 122, 117, 1955, 117, 3413, 117, 2260, 1113, 14304, 1334, 1104, 3475, 19890, 1403, 2950, 3300, 1104, 1134, 170, 6307, 5633, 1110, 5452, 1120, 185, 119, 1969, 1104, 15187, 3051, 112, 188, 2526, 1520, 117, 3560, 21812, 4702, 7402, 1115, 1122, 1108, 2272, 1106, 4891, 1121, 24535, 117, 16890, 24287, 111, 3291, 119, 1113, 1103, 3142, 1104, 1117, 13455, 170, 3238, 4551, 1110, 1508, 1118, 1366, 1113, 1115, 6307, 5633, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


In [77]:
display_dt(dev_dt, 1)

Unnamed: 0,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 23, 24, 24, 24, 24, 0, 0, 0, 0, 0, 0, 0, 38, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, 0, 0, 38, 39, 39, 0, 0, 0, 38, 39, 39, 39, 39, 39, 39, 0, ...]","[101, 109, 199, 1969, 115, 1130, 1109, 1693, 2031, 2096, 6175, 1335, 1203, 6175, 110, 13063, 10517, 1113, 131, 1955, 119, 5004, 119, 10351, 116, 6603, 119, 138, 8661, 119, 5311, 1545, 120, 1857, 111, 140, 1306, 1302, 1116, 119, 3993, 11964, 1477, 120, 1857, 117, 15722, 25631, 120, 10351, 117, 3236, 16382, 1571, 120, 10351, 13078, 11037, 3291, 4492, 119, 138, 24756, 9180, 4737, 131, 1828, 119, 156, 119, 153, 119, 18489, 117, 1828, 119, 15619, 5443, 6583, 144, 2312, 1830, 14518, 117, 1828, 119, 11896, 1197, 5329, 1105, 1828, 119, 153, 13148, 6610, 5329, 14812, 1179, 11487, 117, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


In [78]:
trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dt,  #without -100
    eval_dataset= dev_dt,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.058191,0.081013,0.171582,0.11006,0.787469
2,0.012700,1.061291,0.071856,0.160858,0.099338,0.800188
3,0.005100,1.182585,0.075868,0.168007,0.104532,0.786109


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1362, training_loss=0.00772375130968472, metrics={'train_runtime': 2107.2855, 'train_samples_per_second': 10.333, 'train_steps_per_second': 0.646, 'total_flos': 5691430232801280.0, 'train_loss': 0.00772375130968472, 'epoch': 3.0})

#### train on loaded datasets

In [3]:
from datasets import load_from_disk

with_ne_train_dtst_with_100 = load_from_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_with-100')
dev_dtst_with_100 = load_from_disk('/content/drive/MyDrive/ner_india/dev_dtst_with-100')


In [4]:
# imports needed when this cell runs in a fresh session
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification, Trainer, TrainingArguments

model_checkpoint = 'dslim/bert-base-NER'
batch_size = 16

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    model_name,
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    remove_unused_columns=False
)

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list), ignore_mismatched_sizes=True)


data_collator = DataCollatorForTokenClassification(tokenizer)

# use the datasets loaded from disk in the previous cell
trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dtst_with_100,
    eval_dataset= dev_dtst_with_100,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

##### with -100

In [73]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.913024,0.061827,0.138517,0.085494,0.569425
2,0.358400,2.040764,0.070224,0.159964,0.097601,0.573069
3,0.131900,2.176579,0.072043,0.157283,0.098821,0.577968


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1362, training_loss=0.2050603230794271, metrics={'train_runtime': 1997.2391, 'train_samples_per_second': 10.902, 'train_steps_per_second': 0.682, 'total_flos': 5691430232801280.0, 'train_loss': 0.2050603230794271, 'epoch': 3.0})

##### train without -100 in int labels

In [79]:
with_ne_train_dtst_without_100 = load_from_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_no_100')
dev_dtst_without_100 = load_from_disk('/content/drive/MyDrive/ner_india/dev_dtst_no_100')

In [80]:
trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dtst_without_100,
    eval_dataset= dev_dtst_without_100,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.384539,0.068261,0.158177,0.095366,0.776656
2,0.003200,1.493755,0.072276,0.174263,0.102174,0.770266
3,0.002400,1.420731,0.079513,0.175156,0.109375,0.786641


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1362, training_loss=0.0026925060017343487, metrics={'train_runtime': 2109.9367, 'train_samples_per_second': 10.32, 'train_steps_per_second': 0.646, 'total_flos': 5691430232801280.0, 'train_loss': 0.0026925060017343487, 'epoch': 3.0})

In [12]:
# similar to previous but with fewer warnings
trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dtst_without_100,
    eval_dataset= dev_dtst_without_100,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.99017,0.022895,0.058088,0.032845,0.768219
2,0.079000,0.988958,0.061098,0.141197,0.08529,0.788016
3,0.015600,1.059599,0.066041,0.151028,0.091898,0.784813




TrainOutput(global_step=1362, training_loss=0.037758320963855355, metrics={'train_runtime': 2181.3382, 'train_samples_per_second': 9.982, 'train_steps_per_second': 0.624, 'total_flos': 5691430232801280.0, 'train_loss': 0.037758320963855355, 'epoch': 3.0})

# Get rid of the labels that don't follow the BIO format

#### Data preparation

In this part, we'll import the data, define a function that gets rid of the raw labels that were ignored (the ones without a B- or I- prefix) by simply adding B- in front of them, set the model parameters and train.

In [2]:
import sys
sys.path.append("/content/drive/MyDrive/virtual_env/lib/python3.10/site-packages")

Since the model scored better on the data without the -100 label (accuracy around 0.79 against 0.58, with a similar F1), we continue with those datasets.

In [5]:
from datasets import load_from_disk
with_ne_train_dtst_without_100 = load_from_disk('/content/drive/MyDrive/ner_india/with_ne_train_dtst_no_100')
dev_dtst_without_100 = load_from_disk('/content/drive/MyDrive/ner_india/dev_dtst_no_100')

In [10]:
labels_list = ['O', 'B-ORG', 'I-ORG', 'B-OTHER_PERSON', 'I-OTHER_PERSON', 'B-WITNESS', 'I-WITNESS', 'GPE', 'B-STATUTE', 'I-STATUTE', 'B-DATE', 'I-DATE', 'B-PROVISION', 'I-PROVISION', 'B-COURT', 'I-COURT', 'B-PRECEDENT', 'I-PRECEDENT', 'B-GPE', 'I-GPE', 'B-CASE_NUMBER', 'I-CASE_NUMBER', 'ORG', 'B-PETITIONER', 'I-PETITIONER', 'B-JUDGE', 'I-JUDGE', 'WITNESS', 'B-RESPONDENT', 'I-RESPONDENT', 'STATUTE', 'RESPONDENT', 'OTHER_PERSON', 'DATE', 'JUDGE', 'PETITIONER', 'PROVISION', 'CASE_NUMBER', 'B-LAWYER', 'I-LAWYER']

int_labels = {}
for i, lab in enumerate(labels_list):
  int_labels[lab] = i

print(int_labels)

{'O': 0, 'B-ORG': 1, 'I-ORG': 2, 'B-OTHER_PERSON': 3, 'I-OTHER_PERSON': 4, 'B-WITNESS': 5, 'I-WITNESS': 6, 'GPE': 7, 'B-STATUTE': 8, 'I-STATUTE': 9, 'B-DATE': 10, 'I-DATE': 11, 'B-PROVISION': 12, 'I-PROVISION': 13, 'B-COURT': 14, 'I-COURT': 15, 'B-PRECEDENT': 16, 'I-PRECEDENT': 17, 'B-GPE': 18, 'I-GPE': 19, 'B-CASE_NUMBER': 20, 'I-CASE_NUMBER': 21, 'ORG': 22, 'B-PETITIONER': 23, 'I-PETITIONER': 24, 'B-JUDGE': 25, 'I-JUDGE': 26, 'WITNESS': 27, 'B-RESPONDENT': 28, 'I-RESPONDENT': 29, 'STATUTE': 30, 'RESPONDENT': 31, 'OTHER_PERSON': 32, 'DATE': 33, 'JUDGE': 34, 'PETITIONER': 35, 'PROVISION': 36, 'CASE_NUMBER': 37, 'B-LAWYER': 38, 'I-LAWYER': 39}


In [9]:
labels_list.index('OTHER_PERSON')

32

In [5]:
def get_rid_of_raw_labels(dtst, labels_list = labels_list ):
  # this function takes one example of the dataset and replaces the raw labels
  # 'GPE' and 'OTHER_PERSON' (no B-/I- prefix) with 'B-GPE' and 'B-OTHER_PERSON'

  labels = dtst['labels']
  new_labs = labels.copy()

  int_lab_gpe = labels_list.index('GPE')
  int_lab_b_gpe = labels_list.index('B-GPE')
  int_lab_pers = labels_list.index('OTHER_PERSON')
  int_lab_b_pers = labels_list.index('B-OTHER_PERSON')

  for idx, int_lab in enumerate(new_labs):
    if int_lab == int_lab_gpe:
      new_labs[idx] = int_lab_b_gpe

    if int_lab == int_lab_pers:
      new_labs[idx] = int_lab_b_pers

  return {'labels': new_labs}
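`labels_list` contains other un-prefixed labels besides `GPE` and `OTHER_PERSON` (for example `ORG`, `WITNESS` or `STATUTE`). Should those ever appear in the data, the same idea can be generalised. The sketch below is one possible variant, not the function used in this notebook. The names `build_raw_label_remap` and `remap_raw_labels` are hypothetical, and the short label list is only an excerpt for illustration.

```python
def build_raw_label_remap(labels_list):
    """Map the index of every un-prefixed label to the index of its B- form."""
    remap = {}
    for idx, name in enumerate(labels_list):
        if name == 'O' or name.startswith(('B-', 'I-')):
            continue                      # already a valid BIO tag
        b_name = 'B-' + name
        if b_name in labels_list:         # only remap when a B- form exists
            remap[idx] = labels_list.index(b_name)
    return remap

def remap_raw_labels(example, remap):
    """Apply the remap to one example; other values (including -100) are kept."""
    return {'labels': [remap.get(lab, lab) for lab in example['labels']]}

# Excerpt of a label list, for illustration only.
demo_labels = ['O', 'B-GPE', 'I-GPE', 'GPE', 'OTHER_PERSON', 'B-OTHER_PERSON']
demo_remap = build_raw_label_remap(demo_labels)
print(demo_remap)                                                  # {3: 1, 4: 5}
print(remap_raw_labels({'labels': [0, 3, 2, 4, -100]}, demo_remap))
# {'labels': [0, 1, 2, 5, -100]}
```

With a `datasets` dataset, the second function could be passed to `map` together with `fn_kwargs={'remap': remap}`.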


In [6]:
train_dtst = with_ne_train_dtst_without_100.map(get_rid_of_raw_labels)
dev_dtst = dev_dtst_without_100.map(get_rid_of_raw_labels)

Map:   0%|          | 0/125 [00:00<?, ? examples/s]

In [7]:
train_dtst.save_to_disk('/content/drive/MyDrive/ner_india/train_dtst_no_gpe_other_pers')
dev_dtst.save_to_disk('/content/drive/MyDrive/ner_india/dev_dtst_no_gpe_other_pers')

Saving the dataset (0/1 shards):   0%|          | 0/7258 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/125 [00:00<?, ? examples/s]

In [6]:
import transformers
from transformers import AutoModelForTokenClassification, AutoTokenizer, DataCollatorForTokenClassification, Trainer, TrainingArguments, pipeline

In [7]:
model = 'dslim/bert-base-NER'
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model)

tokenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [8]:
# Load seqeval.
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

# Create the list with the tags.

# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division = 0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

In [None]:
model_checkpoint = 'dslim/bert-base-NER'
batch_size = 16

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    model_name,
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
    remove_unused_columns=False
)

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list), ignore_mismatched_sizes=True)


data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset= train_dtst,
    eval_dataset= dev_dtst,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

#### train without -100, gpe, other_person

In [18]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.945327,0.03701,0.090259,0.052495,0.771734
2,0.078100,0.962562,0.063327,0.141197,0.087438,0.791609
3,0.015000,1.06147,0.063984,0.142091,0.088235,0.785016


  _warn_prf(average, modifier, msg_start, len(result))


TrainOutput(global_step=1362, training_loss=0.03710419526009132, metrics={'train_runtime': 2155.4654, 'train_samples_per_second': 10.102, 'train_steps_per_second': 0.632, 'total_flos': 5691430232801280.0, 'train_loss': 0.03710419526009132, 'epoch': 3.0})

We can see slight improvements in the first two epochs compared with the previous run, but the final scores (F1 of about 0.09) are still not enough.

In [19]:
trainer.evaluate()

  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 1.061469554901123,
 'eval_precision': 0.06398390342052314,
 'eval_recall': 0.14209115281501342,
 'eval_f1': 0.08823529411764705,
 'eval_accuracy': 0.785015625,
 'eval_runtime': 5.3534,
 'eval_samples_per_second': 23.35,
 'eval_steps_per_second': 1.494,
 'epoch': 3.0}

#### further work: adjusting the warning

In [20]:
# Load seqeval.
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

# Create the list with the tags.

# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division = 0 )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [21]:
#no gpe, other_pers, and no warnings
trainer = Trainer(
    model,
    args,
    train_dataset= train_dtst,
    eval_dataset= dev_dtst,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [22]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,1.131759,0.069307,0.15639,0.096048,0.779984
2,0.009200,1.066657,0.072605,0.173369,0.102348,0.798438
3,0.005500,1.224778,0.075194,0.173369,0.104893,0.789047


TrainOutput(global_step=1362, training_loss=0.006668241801240896, metrics={'train_runtime': 2168.1802, 'train_samples_per_second': 10.043, 'train_steps_per_second': 0.628, 'total_flos': 5691430232801280.0, 'train_loss': 0.006668241801240896, 'epoch': 3.0})

In [23]:
trainer.evaluate()

{'eval_loss': 1.2247782945632935,
 'eval_precision': 0.07519379844961241,
 'eval_recall': 0.17336907953529937,
 'eval_f1': 0.10489321438226548,
 'eval_accuracy': 0.789046875,
 'eval_runtime': 5.1753,
 'eval_samples_per_second': 24.153,
 'eval_steps_per_second': 1.546,
 'epoch': 3.0}

### Different model

We'll still need to re-tokenize the data, since the new model comes with its own tokenizer.

Run the cells below after running the first cells in process, except for the model and tokenizer cells.

#### model config

In [8]:
from transformers import AutoTokenizer
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

#### tokenize train data

In [9]:
tokenized = tokenizer(
    text_data,
    is_split_into_words=False,
    return_offsets_mapping=True,
    add_special_tokens=True,
    truncation=True,
    padding='max_length',
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)
tokens = [tokenizer.convert_ids_to_tokens(tokenizedx) for tokenizedx in tokenized.input_ids]

In [10]:
print(len(tokens) -  len(text_data), 'lines were added')


0 lines were added


#### tokenization of the dev data with truncation and return overflowing

In [11]:
dev_tokenized = tokenizer(
    dev_text_data,
    is_split_into_words=False,
    return_offsets_mapping=True,
    add_special_tokens=True,
    padding='max_length',
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)

dev_tokens = [tokenizer.convert_ids_to_tokens(tokenizedx) for tokenizedx in dev_tokenized.input_ids]

In [12]:
print(len(dev_tokens) - len(dev_data), 'lines were added')


18 lines were added
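Once `return_overflowing_tokens=True` adds rows, row `i` of the tokenizer output no longer has to belong to document `i`. Fast tokenizers return an `overflow_to_sample_mapping` entry that gives, for each row, the index of the document it was cut from. The sketch below groups rows by document. The helper name `chunks_per_document` is hypothetical and the mapping in the example is made up.

```python
from collections import defaultdict

def chunks_per_document(sample_mapping):
    """Group chunk (row) indices by the index of the document they were cut from."""
    groups = defaultdict(list)
    for chunk_idx, doc_idx in enumerate(sample_mapping):
        groups[doc_idx].append(chunk_idx)
    return dict(groups)

# Made-up mapping for illustration: document 1 was split into three chunks.
example_mapping = [0, 1, 1, 1, 2]
print(chunks_per_document(example_mapping))
# {0: [0], 1: [1, 2, 3], 2: [4]}
```

On the dev data this could be called as `chunks_per_document(dev_tokenized["overflow_to_sample_mapping"])`, so that each annotation is matched against the chunks of its own document rather than against the row with the same index.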


#### get token pos

In [13]:
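def get_token_pos(mappers, start_str, end_str):
  # Map a character span (start_str, end_str) from the annotations to a slice
  # of token indices, using the offset_mapping of one tokenized sequence.
  # Padding tokens map to (0, 0), so only the tokens up to the one with the
  # largest offset are considered.
  last_real_idx = mappers.index(max(mappers))
  for idx, (start, end) in enumerate(mappers[:last_real_idx + 1]):
    if start <= start_str:
      idx_start = idx        # last token that starts at or before the entity
    if end == end_str:
      idx_end = idx + 1      # this token ends exactly where the entity ends
    if end < end_str:
      idx_end = idx + 2      # token ends before the entity does, so include the next one

  return slice(idx_start, idx_end)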
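
To make the character-to-token mapping easier to follow, here is a small sketch on a hand-written offset mapping. It uses a plain overlap test rather than the logic of `get_token_pos`, so it is an alternative illustration and not the notebook's method; `char_span_to_token_slice` and the toy offsets are assumptions made up for the example.

```python
def char_span_to_token_slice(offsets, char_start, char_end):
    """Return the slice of tokens whose characters overlap [char_start, char_end)."""
    hits = [
        idx
        for idx, (tok_start, tok_end) in enumerate(offsets)
        if tok_end > tok_start                              # skip special/padding tokens mapped to (0, 0)
        and tok_start < char_end and tok_end > char_start   # token overlaps the entity span
    ]
    if not hits:
        return None
    return slice(hits[0], hits[-1] + 1)

# Toy offsets for "Hongkong Bank paid": [CLS], hong, ##kong, bank, paid, [SEP], [PAD]
toy_offsets = [(0, 0), (0, 4), (4, 8), (9, 13), (14, 18), (0, 0), (0, 0)]

entity_slice = char_span_to_token_slice(toy_offsets, 0, 13)  # characters of "Hongkong Bank"
print(entity_slice)
```

An overlap test never includes a token that lies entirely after the entity, which is one way to avoid the extra trailing token seen in some of the printed mismatches.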

### Label assignment

#### assign labels to the train data

In [14]:
# initialize every token label as 'O' (Outside)
tokens_labels = [['O' for mapper in tokenized["offset_mapping"][i]] for i in range(len(tokenized["offset_mapping"]))]

# then update the labels: each entity's character position is converted to a token position with get_token_pos

for i, indice in enumerate(with_annot_indices):
  # print(indice)
  mappers = tokenized["offset_mapping"][i]
  raw_text = json_data[indice]['data']['text']

  for annot in json_data[indice]['annotations']:    #list of all the entities that are part of the deal
    for result in annot['result']:
      deb_str = result['value']['start']
      fin_str = result['value']['end']
      clean_pos = get_new_position(raw_text, text_data[i], slice(deb_str,fin_str) )
      if isinstance(clean_pos, slice):
        pos = get_token_pos(mappers, clean_pos.start, clean_pos.stop )

        label = result['value']['labels'][0]          #[0] to remove from list
      #   if tokens[i][pos][0] == result['value']['text']:   #when the special entity == token, not cut
      #     tokens_labels[i][pos.start] = label

      #   else:

      # if '[UNK]' in tokens[i][pos]

        tokens_labels[i][pos.start] = 'B-'+label     # first token of the entity gets B-, the following ones get I-
        for x in range(pos.start+1, pos.stop):
          if x <len(tokens_labels[i]):
            tokens_labels[i][x] = 'I-'+label

        if ''.join(tokens[i][pos]).replace('##', '') != ''.join(text_data[i][clean_pos].split()).lower() and '[UNK]' not in tokens[i][pos] :
          print(indice, i, clean_pos,  text_data[i][clean_pos], pos, tokens[i][pos])
          break

      else:
        print(clean_pos)


not so special entity
1423 1103 slice(174, 189, None) September  30,	 slice(38, 42, None) ['september', '30', ',', '1989']
3008 2335 slice(206, 220, None) Clause 8(vi)

 slice(38, 44, None) ['clause', '8', '(', 'vi', ')', '(']
3917 3038 slice(170, 187, None) section 3(1)(c)	  slice(43, 52, None) ['section', '3', '(', '1', ')', '(', 'c', ')', 'of']
4657 3606 slice(272, 295, None) Articles 14 and 19 (1)
 slice(64, 72, None) ['articles', '14', 'and', '19', '(', '1', ')', '(']
5100 3940 slice(115, 126, None) Cr.P.C. 

  slice(27, 34, None) ['cr', '.', 'p', '.', 'c', '.', 'cc']
5824 4510 slice(39, 89, None) 'Paupuk Kannu Anni v. Thoppayya Mudaliar', (J) :   slice(13, 36, None) ["'", 'pau', '##pu', '##k', 'kan', '##nu', 'ann', '##i', 'v', '.', 'tho', '##ppa', '##yya', 'mud', '##alia', '##r', "'", ',', '(', 'j', ')', ':', 'clause']
6042 4687 slice(166, 189, None) Rahmania Coffee Works. slice(33, 39, None) ['rahman', '##ia', 'coffee', 'works', '.', '32']
6144 4778 slice(109, 116, None) Hari  

In [33]:
#0 0 slice(27, 30, None) Hongkong Bank slice(27, 30, None) ['hong', '##kong', 'bank']

''.join(tokens[0][27:30]).replace('##', '') ==  ''.join(text_data[0][90:103].split()).lower()

True
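
The B-/I- tagging step and the sanity check above can be isolated in a few lines. This is a minimal sketch on toy data, assuming WordPiece-style `##` continuation markers; `tag_span` and `tokens_match_text` are hypothetical helpers, not functions defined in this notebook.

```python
def tag_span(labels, token_slice, entity_type):
    """Write B-<type> on the first token of the span and I-<type> on the rest."""
    stop = min(token_slice.stop, len(labels))     # never write past the end of the sequence
    for idx in range(token_slice.start, stop):
        prefix = 'B-' if idx == token_slice.start else 'I-'
        labels[idx] = prefix + entity_type
    return labels

def tokens_match_text(span_tokens, entity_text):
    """Compare the joined word pieces with the entity text, ignoring spaces and case."""
    return ''.join(span_tokens).replace('##', '') == ''.join(entity_text.split()).lower()

toy_tokens = ['[CLS]', 'hong', '##kong', 'bank', 'paid', '[SEP]']
toy_labels = ['O'] * len(toy_tokens)

tag_span(toy_labels, slice(1, 4), 'ORG')
print(toy_labels)
print(tokens_match_text(toy_tokens[1:4], 'Hongkong Bank'))
```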

#### assign labels to the dev data

In [16]:
# label assignment for the dev data, where one document can span several overlapping chunks

dev_tokens_labels = [['O' for mapper in dev_tokenized["offset_mapping"][i]] for i in range(len(dev_tokenized["offset_mapping"]))]

# An entity located at the end of one chunk also appears in the stride (overlap) at the start of the next chunk.
# It has to be labelled in both chunks, so each chunk's own offset mapping is used to find the token indices.


for indice in range(len(dev_data)):
  # print(indice)
  tkn_indice = dev_tokenized['overflow_to_sample_mapping'].index(indice)
  # print('tkn_indice: ', tkn_indice)

  for annot in dev_data[indice]['annotations']: #list of all the entities that are part of the deal
    for result in annot['result']:
      deb_str = result['value']['start']
      fin_str = result['value']['end']
      label = result['value']['labels'][0]

      raw_text = dev_data[indice]['data']['text']
      special_entity = result['value']['text']
      clean_pos = get_new_position(raw_text,dev_text_data[indice], slice(deb_str,fin_str) ) # character span in the cleaned text, to be looked up in the offset mapping

      if dev_tokenized['overflow_to_sample_mapping'].count(indice) > 1:

        # the text was split into chunks, so check with the offset mapping whether the special entity is repeated in them

        for i in range(dev_tokenized['overflow_to_sample_mapping'].count(indice)+1) :
          if clean_pos.start > dev_tokenized["offset_mapping"][tkn_indice+i+1][2][0] and clean_pos.stop < max(dev_tokenized["offset_mapping"][tkn_indice+i+1])[1]:
              #the special entity was found inside the following chunk

            mapper = dev_tokenized["offset_mapping"][tkn_indice+i+1]
            token_pos =  get_token_pos(mapper, clean_pos.start, clean_pos.stop)

            dev_tokens_labels[tkn_indice+i+1][token_pos.start] = 'B-'+label

            for x in range(token_pos.start+1, token_pos.stop):
              if x <len(dev_tokens_labels[tkn_indice+i+1]):
                dev_tokens_labels[tkn_indice+i+1][x] = 'I-'+label


          if clean_pos.start > dev_tokenized["offset_mapping"][tkn_indice+i][2][0] and clean_pos.stop < max(dev_tokenized["offset_mapping"][tkn_indice+i])[1]:
            #update the clean_pos too
            tkn_indice += i
            break

      mapper = dev_tokenized["offset_mapping"][tkn_indice]
      token_pos =  get_token_pos(mapper, clean_pos.start, clean_pos.stop)

      #check similarity

      if ''.join(dev_tokens[tkn_indice][token_pos]).replace('##', '') != ''.join(dev_text_data[indice][clean_pos].split()).lower():
        print('mismatch at:', indice,tkn_indice, special_entity, dev_tokens[tkn_indice][token_pos], '\t', token_pos, '\t', slice(deb_str,fin_str),  clean_pos )

      dev_tokens_labels[tkn_indice][token_pos.start] = 'B-'+label

      for x in range(token_pos.start+1, token_pos.stop):
        if x <len(dev_tokens_labels[tkn_indice]):
          dev_tokens_labels[tkn_indice][x] = 'I-'+label

        else:
          if x!=512 :
            print('not similar,  \t', 'indice: \t', indice, '\t x: \t', x )


mismatch at: 17 19 High Court For The State Of Telangana
 ['high', 'court', 'for', 'the', 'state', 'of', 'telangana', 'at'] 	 slice(3, 11, None) 	 slice(7, 45, None) slice(7, 45, None)


This is not a real mismatch: the token slice only includes one extra trailing token ('at') after the entity.
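
The chunk lookup used for the dev data can be shown on a toy example. The sketch below is a simplified stand-in for the loop above, not a copy of it: `chunks_containing_span` is a hypothetical helper, and the sample mapping and offsets are invented to show one entity that falls in the overlap between two chunks.

```python
def chunks_containing_span(sample_mapping, offset_mappings, doc_idx, char_start, char_end):
    """Return the indices of the chunks of document doc_idx that fully contain the character span."""
    hits = []
    for chunk_idx, owner in enumerate(sample_mapping):
        if owner != doc_idx:
            continue
        # Keep only real tokens: special and padding tokens are mapped to (0, 0).
        real = [(start, end) for start, end in offset_mappings[chunk_idx] if end > start]
        if real and real[0][0] <= char_start and char_end <= real[-1][1]:
            hits.append(chunk_idx)
    return hits

toy_sample_mapping = [0, 0, 1]          # chunks 0 and 1 belong to document 0
toy_offset_mappings = [
    [(0, 0), (0, 5), (6, 10), (11, 15), (0, 0)],    # document 0, chunk 0: characters 0..15
    [(0, 0), (6, 10), (11, 15), (16, 20), (0, 0)],  # document 0, chunk 1: characters 6..20 (overlap)
    [(0, 0), (0, 4), (0, 0)],                       # document 1, single chunk
]

print(chunks_containing_span(toy_sample_mapping, toy_offset_mappings, 0, 11, 15))
print(chunks_containing_span(toy_sample_mapping, toy_offset_mappings, 0, 0, 5))
```

An entity that is returned for two chunks needs its B-/I- labels written in both, which is the redundancy the dev loop handles.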

## Datasets creation
This part builds the list of labels (`labels_list`), encodes the labels as integers (`int_labels`), and generates the datasets.

### labels lists

#### encode labels (str -> int)

In [18]:
ne_distribution = {}
# goal: {label: number of occurrences}
for labels in tokens_labels:
  for idx in range(len(labels)):
    if labels[idx] in ne_distribution.keys():
      ne_distribution[labels[idx]] +=1
    else:
      ne_distribution[labels[idx]] = 1

for labels in dev_tokens_labels:
  for idx in range(len(labels)):
    if labels[idx] in ne_distribution.keys():
      ne_distribution[labels[idx]] +=1
    else:
      ne_distribution[labels[idx]] = 1


print(len(ne_distribution ))
import pandas as pd

pd.DataFrame.from_dict(dict(sorted(ne_distribution.items(), key = lambda x: x[1], reverse=True)),  orient='index').head(3)

29


label,count
O,3685591
I-PRECEDENT,22792
I-CASE_NUMBER,8966


In [19]:
labels_list = list(ne_distribution.keys())
len(labels_list)

29

In [20]:
int_labels = {}

for i, lab in enumerate(labels_list):
  int_labels[lab] = i
print(int_labels)

{'O': 0, 'B-ORG': 1, 'I-ORG': 2, 'B-OTHER_PERSON': 3, 'I-OTHER_PERSON': 4, 'B-WITNESS': 5, 'I-WITNESS': 6, 'B-GPE': 7, 'B-STATUTE': 8, 'I-STATUTE': 9, 'B-DATE': 10, 'I-DATE': 11, 'B-PROVISION': 12, 'I-PROVISION': 13, 'B-COURT': 14, 'I-COURT': 15, 'B-PRECEDENT': 16, 'I-PRECEDENT': 17, 'I-GPE': 18, 'B-CASE_NUMBER': 19, 'I-CASE_NUMBER': 20, 'B-PETITIONER': 21, 'I-PETITIONER': 22, 'B-JUDGE': 23, 'I-JUDGE': 24, 'B-RESPONDENT': 25, 'I-RESPONDENT': 26, 'B-LAWYER': 27, 'I-LAWYER': 28}
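
The counting and encoding steps can be condensed with `collections.Counter`. This is only a compact sketch on toy labels; `label2id`, `id2label` and the toy data are names chosen for the example, and the resulting ids depend on the order in which labels are first seen, as in the cells above.

```python
from collections import Counter

# Toy label sequences standing in for tokens_labels / dev_tokens_labels.
toy_token_labels = [
    ['O', 'B-ORG', 'I-ORG'],
    ['O', 'B-DATE', 'O'],
]

# Count every tag; a Counter keeps keys in first-seen order.
distribution = Counter(tag for seq in toy_token_labels for tag in seq)

# str -> int and int -> str mappings.
label2id = {tag: idx for idx, tag in enumerate(distribution)}
id2label = {idx: tag for tag, idx in label2id.items()}

# Encode the label sequences without mutating the originals.
encoded = [[label2id[tag] for tag in seq] for seq in toy_token_labels]

print(distribution.most_common(1))
print(label2id)
print(encoded)
```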


#### int train labels

In [23]:
import copy
int_tokens_labels = copy.deepcopy(tokens_labels)

for i in range(len(tokens_labels)):
  for j in range(len(tokens_labels[i])):
    int_tokens_labels[i][j] = int_labels[tokens_labels[i][j]]

#### int dev labels list

In [24]:
import copy
int_dev_tokens_labels = copy.deepcopy(dev_tokens_labels)

for i in range(len(dev_tokens_labels)):
  for j in range(len(dev_tokens_labels[i])):
    int_dev_tokens_labels[i][j] = int_labels[dev_tokens_labels[i][j]]

### generate datasets

#### gen train dt with_annot_indices

In [25]:
def with_ne_gen():
  # one row per tokenized training document that has annotations
  for i in range(len(with_annot_indices)):
    yield {'labels' : int_tokens_labels[i], 'input_ids': tokenized['input_ids'][i] , 'attention_mask': tokenized['attention_mask'][i]}

from datasets import Dataset
with_ne_train_dt = Dataset.from_generator(with_ne_gen)
with_ne_train_dt

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 7258
})

#### gen dev dt

In [26]:
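def dev_gen():
  # NOTE: dev_tokenized holds one entry per chunk (documents plus overflow chunks),
  # while this loop runs once per document, so only the first len(dev_data) chunks are yielded.
  i = 0
  for line in dev_data: #   'id':i,  'tokens': dev_tokens[i],
    yield {'labels' : int_dev_tokens_labels[i], 'input_ids': dev_tokenized['input_ids'][i] , 'attention_mask': dev_tokenized['attention_mask'][i]}
    i+=1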

# from datasets import Dataset
dev_dt = Dataset.from_generator(dev_gen)
dev_dt

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 125
})
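
Since the dev tokenization produced 18 more chunks than there are documents, a generator that walks the chunks themselves would keep the overflow chunks too. The sketch below shows that idea on toy lists; `chunk_rows` is a hypothetical helper and not the generator used in this notebook, and the toy ids and labels are invented.

```python
def chunk_rows(encoded_labels, input_ids, attention_mask):
    """Yield one row per tokenized chunk, keeping the three lists aligned."""
    for labels, ids, mask in zip(encoded_labels, input_ids, attention_mask):
        yield {'labels': labels, 'input_ids': ids, 'attention_mask': mask}

# Three toy chunks: in this example the last two would come from the same long document.
toy_labels = [[0, 1, 0], [0, 2, 0], [0, 0, 0]]
toy_input_ids = [[101, 7, 102], [101, 8, 102], [101, 9, 102]]
toy_attention_mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

rows = list(chunk_rows(toy_labels, toy_input_ids, toy_attention_mask))
print(len(rows))
print(rows[2])
```

The same loop body could be used inside the generator passed to `Dataset.from_generator`, with the number of rows then matching the number of chunks.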

#### display function

In [27]:
from datasets import ClassLabel, Sequence
import pandas as pd
from IPython.display import display, HTML

def display_dt(dataset, num_examples=10):
    assert num_examples <= len(dataset);

    df = pd.DataFrame(dataset[:num_examples])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))


In [28]:
with_ne_train_dt_no_100 = with_ne_train_dt
dev_dt_no_100 = dev_dt

In [29]:
display_dt(with_ne_train_dt_no_100, 2)

index,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 1006, 1021, 1007, 2006, 3563, 23032, 2011, 1996, 6847, 2055, 2019, 4443, 1997, 12667, 1012, 1015, 1010, 2861, 1010, 4261, 1010, 3156, 2006, 12816, 2217, 1997, 4291, 25460, 2924, 4070, 1997, 2029, 1037, 6302, 6100, 2003, 6037, 2012, 1052, 1012, 2871, 1997, 14358, 4402, 1005, 1055, 3259, 2338, 1010, 4342, 19256, 4387, 7864, 2008, 2009, 2001, 3141, 2000, 5414, 2013, 20138, 1010, 10958, 21886, 1004, 2522, 1012, 2006, 1996, 3978, 1997, 2010, 12339, 1037, 4072, 2928, 2003, 2404, 2011, 2149, 2006, 2008, 6302, 6100, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"
1,"[0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[101, 2002, 2001, 2036, 2356, 3251, 12943, 3148, 1026, 8487, 2465, 1027, 1000, 5023, 1035, 3793, 1000, 8909, 1027, 1000, 8487, 1035, 1019, 1000, 1028, 13675, 2050, 2053, 1012, 28188, 1011, 16962, 1997, 2687, 1020, 1026, 1013, 8487, 1028, 10556, 3126, 1010, 2388, 1011, 1999, 1011, 2375, 1997, 1996, 10181, 2973, 10329, 2013, 16985, 4135, 14856, 5960, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]"


In [30]:
display_dt(dev_dt_no_100, 2)

index,labels,input_ids,attention_mask
0,"[0, 0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 22, 22, 22, 22, 0, 0, 0, 0, 0, 0, 27, 28, 28, 28, 28, 0, 0, 0, 27, 28, 28, 28, 28, 28, 0, 0, 0, 27, 28, 28, 0, 0, 0, 27, 28, 28, 28, 28, 0, 0, 0, 0, 25, 26, 26, 26, 26, ...]","[101, 1002, 1066, 2871, 1008, 1999, 1996, 2152, 2457, 1997, 6768, 2012, 2047, 6768, 1003, 2787, 2006, 1024, 2861, 1012, 5718, 1012, 10476, 1009, 6097, 1012, 10439, 1012, 5989, 2575, 1013, 2760, 1004, 4642, 16839, 1012, 4805, 12521, 2475, 1013, 2760, 1010, 15017, 23777, 1013, 10476, 1010, 28358, 2683, 2629, 1013, 10476, 11481, 5427, 2522, 5183, 1012, 10439, 24178, 2083, 1024, 2720, 1012, 1055, 1012, 1052, 1012, 17136, 1010, 2720, 1012, 2032, 6962, 6979, 11721, 14905, 11961, 1010, 2720, 1012, 6583, 2099, 5960, 1998, 2720, 1012, 5245, 6673, 5960, 22827, 13476, 1010, 13010, 1012, 6431, 23564, 7646, 6979, 8418, 2063, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"
1,"[0, 0, 0, 0, 0, 0, 14, 15, 15, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 22, 22, 22, 22, 22, 22, 22, 22, 22, 0, 0, 0, 0, 0, 25, 26, 26, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 22, 22, 0, 0, 0, 0, 0, 25, 26, 26, 26, 26, 26, ...]","[101, 1015, 3189, 3085, 1999, 1996, 4259, 2457, 1997, 2634, 2942, 23240, 7360, 2942, 5574, 2053, 1012, 3963, 2575, 1997, 25682, 1006, 17707, 2041, 1997, 22889, 2361, 1006, 1039, 1007, 2053, 1012, 23628, 12376, 1997, 2325, 1007, 19177, 2099, 3520, 11390, 2378, 2232, 8529, 2098, 11493, 2232, 1529, 10439, 24178, 5443, 1012, 2110, 1997, 14288, 1004, 2030, 2015, 1012, 1529, 25094, 2007, 2942, 5574, 2053, 1012, 3963, 2581, 1997, 25682, 1006, 17707, 2041, 1997, 22889, 2361, 1006, 1039, 1007, 2053, 1012, 24622, 19481, 1997, 2325, 1007, 2110, 2602, 3222, 1529, 10439, 24178, 5443, 1012, 6819, 7389, 7265, 11493, 2232, 5003, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]"


In [31]:
model_checkpoint

'distilbert-base-uncased'

In [65]:
# the datasets don't have -100 in their labels because there are no raw (unprefixed) labels: every entity label is preceded by B- or I-
# save_to_disk is a method of Dataset, so nothing has to be imported from datasets for it
with_ne_train_dt_no_100.save_to_disk('/content/drive/MyDrive/ner_india/dis_beart_tok_with_ne_train_dtst')
dev_dt_no_100.save_to_disk('/content/drive/MyDrive/ner_india/dis_bert_tok_dev_dtst')

#### trainer config

In [56]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(labels_list))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [58]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    eval_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

In [59]:
# Load seqeval.
import evaluate
import numpy as np

metric = evaluate.load("seqeval")

# The list of tags, labels_list, was created above.

# Function to compute precision, recall, F1 and accuracy.
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [labels_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [labels_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division = 0 )
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [61]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)
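
In this notebook every sequence is already padded to 512 tokens and each padding position carries the label 'O' (0), so padding counts towards both the loss and the accuracy. A common alternative, which this notebook does not use, is to set the label of padding positions to -100 so that they are ignored by the loss and filtered out by `compute_metrics`. The sketch below shows that masking on toy lists; `mask_padding` is a hypothetical helper.

```python
IGNORE_INDEX = -100   # value skipped by the `l != -100` filter in compute_metrics

def mask_padding(labels, attention_mask, ignore_index=IGNORE_INDEX):
    """Replace the label of every padding position (attention_mask == 0) with ignore_index."""
    return [lab if keep == 1 else ignore_index for lab, keep in zip(labels, attention_mask)]

toy_labels = [0, 1, 2, 0, 0, 0]          # last two positions are padding labelled 'O'
toy_attention_mask = [1, 1, 1, 1, 0, 0]

masked = mask_padding(toy_labels, toy_attention_mask)
print(masked)
```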

In [62]:
#no raw labels, no -100, and no warnings
trainer = Trainer(
    model,
    args,
    train_dataset= with_ne_train_dt_no_100,
    eval_dataset= dev_dt_no_100,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [63]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.859778,0.00915,0.020583,0.012668,0.77675
2,0.112800,0.846123,0.020319,0.048027,0.028557,0.786703
3,0.021000,0.885287,0.03233,0.074614,0.045113,0.789734


TrainOutput(global_step=1362, training_loss=0.05345885729124542, metrics={'train_runtime': 1128.0681, 'train_samples_per_second': 19.302, 'train_steps_per_second': 1.207, 'total_flos': 2846229431519232.0, 'train_loss': 0.05345885729124542, 'epoch': 3.0})

In [64]:
trainer.evaluate()

{'eval_loss': 0.8852869868278503,
 'eval_precision': 0.032329988851727984,
 'eval_recall': 0.07461406518010291,
 'eval_f1': 0.04511278195488722,
 'eval_accuracy': 0.789734375,
 'eval_runtime': 4.6073,
 'eval_samples_per_second': 27.131,
 'eval_steps_per_second': 1.736,
 'epoch': 3.0}
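
The gap between an accuracy near 0.79 and an F1 near 0.05 is easier to read with the two definitions side by side: accuracy is counted per token, while seqeval scores whole entities, so a prediction with a wrong boundary earns no credit. The sketch below is a simplified pure-Python illustration on made-up tags; `bio_spans` is a hypothetical helper and does not reproduce seqeval's exact handling of malformed tag sequences.

```python
def bio_spans(tags):
    """Collect (type, start, end) spans from a BIO sequence; only B- opens a span."""
    spans, start, kind = set(), None, None
    for idx, tag in enumerate(list(tags) + ['O']):      # the sentinel closes a trailing span
        continues = tag.startswith('I-') and kind == tag[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((kind, start, idx))               # close the span that was open
        if tag.startswith('B-'):
            start, kind = idx, tag[2:]
        else:
            start, kind = None, None
    return spans

gold = ['O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']
pred = ['O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O']   # one token short at the end

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

gold_spans, pred_spans = bio_spans(gold), bio_spans(pred)
correct = len(gold_spans & pred_spans)
precision = correct / len(pred_spans) if pred_spans else 0.0
recall = correct / len(gold_spans) if gold_spans else 0.0

print(token_accuracy, precision, recall)
```

Here seven of the eight tokens are right, yet the only predicted entity has the wrong end boundary, so entity-level precision and recall are both zero.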