<a href="https://colab.research.google.com/github/agcosmin/diacritics_adder/blob/main/diacritics_adder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Adding diacritics to Romanian text.

Romanian text is often written with the diacritic characters replaced by their "base" characters, e.g. 'masă' ("table" in English) is written as 'masa'. The correspondence between a diacritic character and its "plain" version is one-to-one per position: every diacritic character is replaced by exactly one plain character.

The one-to-one equivalence reduces the problem of adding diacritics to Romanian text to predicting whether each "plain" character should be replaced with a diacritic version.
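
As a minimal sketch of this equivalence (the names `PLAIN`, `strip_diacritics` and `candidate_positions` are illustrative and not part of the notebook), stripping diacritics is a character-for-character translation that preserves length, so restoring them only requires a decision at the positions that hold a, i, s or t:

```python
# Map each Romanian diacritic character to its plain counterpart.
PLAIN = str.maketrans("ăâîșțĂÂÎȘȚ", "aaistAAIST")


def strip_diacritics(text):
    """Replace every diacritic character with its plain version."""
    return text.translate(PLAIN)


def candidate_positions(plain_text):
    """Return the positions whose character could carry a diacritic."""
    return [i for i, ch in enumerate(plain_text) if ch in "aistAIST"]


original = "masă"
plain = strip_diacritics(original)
# The translation never changes the length of the text.
assert len(plain) == len(original)
print(plain)                       # masa
print(candidate_positions(plain))  # [1, 2, 3]
```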

We are not going to predict at the character level but at the token level.

First we learn a tokenizer vocabulary from Romanian text that contains versions of words both with and without diacritics. From each learned token we create a set of equivalent tokens that have diacritics. These constitute our replacement tokens and prediction targets.

We then fit a token classifier to predict the replacement tokens.
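
The idea of collecting equivalent tokens can be sketched on a toy input. The sketch below assumes the plain text has already been split into word pieces (the token list is hand-written here, not produced by a real tokenizer) and aligns each piece with the same character span of the original text; `collect_equivalents` is an illustrative name.

```python
import itertools
import re


def collect_equivalents(tokens, target):
    """Align word pieces of a plain sentence with the diacritic target.

    `tokens` are word pieces of the plain text ("##" marks a continuation),
    `target` is the original text with all spaces removed.
    Returns {plain_token: {diacritic_token, ...}}.
    """
    sizes = [len(re.sub(r"^##", "", token)) for token in tokens]
    starts = [0] + list(itertools.accumulate(sizes))
    assert starts[-1] == len(target), "length mismatch"
    equivalents = {}
    for token, start, size in zip(tokens, starts, sizes):
        piece = target[start:start + size]
        if token[-size:] != piece:
            # Keep the "##" prefix (if any) and swap in the diacritic piece.
            equivalents.setdefault(token, set()).add(token[:-size] + piece)
    return equivalents


# Hand-written word pieces for the plain words "pamant" and "masa".
tokens = ["pam", "##ant", "mas", "##a"]
target = "pământmasă"
print(collect_equivalents(tokens, target))
```

Tokens whose span already matches the target (here "mas") get no entry, which is why many vocabulary tokens end up without equivalent tokens.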

Retrieve and prepare the dataset. We use the Romanian corpus of the dataset published by:
Náplava, Jakub; Straka, Milan; Hajič, Jan and Straňák, Pavel, 2018, Corpus for training and evaluating diacritics restoration systems, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-2607.

In [1]:
!mkdir -p data
!cd data && curl --remote-name-all https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2607{/ro.zip}
!cd data && echo "a1d886a46f25c3b59404c6d15fba862d  ro.zip" | md5sum -c

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  718M  100  718M    0     0  23.2M      0  0:00:30  0:00:30 --:--:-- 24.9M
ro.zip: OK


In [6]:
!cd data && unzip ro.zip
!cd data/ro && xz --decompress target_dev.txt.xz  target_test.txt.xz  target_train.txt.xz

Archive:  ro.zip
 extracting: ro/target_test.txt.xz   
 extracting: ro/target_train.txt.xz  
 extracting: ro/target_dev.txt.xz    
 extracting: ro/statmt_2017_17_train_target_sentences.txt.xz  


Install the https://huggingface.co/ utilities for transformers.

In [2]:
!pip install transformers datasets

 

In [3]:
import re
import itertools
import pickle
import random

import datasets
import torch
import transformers

from tqdm.auto import tqdm

In [7]:
dataset = datasets.load_dataset('text', data_files={'train': ['data/ro/target_train.txt'], 'validate': 'data/ro/target_dev.txt', 'test': 'data/ro/target_test.txt'})

Using custom data configuration default-e8ad1b8f10f73f41


Downloading and preparing dataset text/default to /root/.cache/huggingface/datasets/text/default-e8ad1b8f10f73f41/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/default-e8ad1b8f10f73f41/0.0.0/acc32f2f2ef863c93c2f30c52f7df6cc9053a1c2230b8d7da0d210404683ca08. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [8]:
def replace_diacritics(line):
  line = re.sub(r'[ăâ]', 'a', line)
  line = re.sub(r'[ĂÂ]', 'A', line)
  line = re.sub(r'[î]', 'i', line)
  line = re.sub(r'[Î]', 'I', line)
  line = re.sub(r'[ș]', 's', line)
  line = re.sub(r'[Ș]', 'S', line)
  line = re.sub(r'[ț]', 't', line)
  line = re.sub(r'[Ț]', 'T', line)
  return line

def gen_mixed_words(lhs_line, rhs_line):
  lhs_mix = []
  rhs_mix = []
  for lhs_word, rhs_word in zip(lhs_line.split(' '), rhs_line.split(' ')):
    mid_point = len(lhs_word) // 2
    lhs_mix.append(lhs_word[0:mid_point] + rhs_word[mid_point:])
    rhs_mix.append(rhs_word[0:mid_point] + lhs_word[mid_point:])
  return " ".join(lhs_mix), " ".join(rhs_mix)


def preprocess_line(line):
  line = re.sub(r"[\W]", ' ', line)
  line = re.sub(r"\ +", ' ', line)
  return line
  
def train_corpus_gen(dataset, num_preloaded = 1000):
  # Walk the training split in chunks of `num_preloaded` lines.
  for l in range(0, len(dataset['train']), num_preloaded):
    lines_w_diacritics = [preprocess_line(line) for line in
                          dataset['train'][l : l + num_preloaded]['text']]
    lines_wo_diacritics = [replace_diacritics(line) for line in lines_w_diacritics]
    for line_wo, line_w in zip(lines_wo_diacritics, lines_w_diacritics):
      yield line_wo, line_w
      mixed_words = gen_mixed_words(line_w, line_wo)
      yield mixed_words[0], line_w
      yield mixed_words[1], line_w

def train_tokenizer_corpus_gen(dataset, num_preloaded = 1000):
  for line, target in train_corpus_gen(dataset, num_preloaded):
    yield f"{line} {target}"

In [9]:
tokenizer_corpus = train_tokenizer_corpus_gen(dataset)

base_tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-cased")
tokenizer = base_tokenizer.train_new_from_iterator(tokenizer_corpus,
                                                   len(base_tokenizer.vocab))
!mkdir -p models/tokenizers/encoder_tokenizer
tokenizer.save_pretrained("./models/tokenizers/encoder_tokenizer")

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]

('./models/tokenizers/encoder_tokenizer/tokenizer_config.json',
 './models/tokenizers/encoder_tokenizer/special_tokens_map.json',
 './models/tokenizers/encoder_tokenizer/vocab.txt',
 './models/tokenizers/encoder_tokenizer/added_tokens.json',
 './models/tokenizers/encoder_tokenizer/tokenizer.json')

In [10]:
tokenizer = transformers.AutoTokenizer.from_pretrained("./models/tokenizers/encoder_tokenizer")

We augment the dataset by mixing half of the word with diacritics with half without. 

For example, from the ground truth sentence "Sistemul de învățământ este la pământ." (in English: "The education system is down.") we generate 3 input sentences:
  * No diacritics: "Sistemul de invatamant este la pamant."
  * First word half with diacritics: "Sistemul de învățamant este la pămant."
  * Second word half with diacritics: "Sistemul de invatământ este la pamânt."
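
The augmentation can be reproduced on the example sentence with a small self-contained sketch. `mix_halves` is an illustrative name that mirrors what `gen_mixed_words` does; punctuation is left out because `preprocess_line` replaces it with spaces.

```python
def mix_halves(with_diacritics, without_diacritics):
    """Build two mixed sentences from aligned words of both versions."""
    first_half, second_half = [], []
    pairs = zip(with_diacritics.split(" "), without_diacritics.split(" "))
    for word_w, word_wo in pairs:
        mid = len(word_w) // 2
        # Diacritics kept only in the first half of the word.
        first_half.append(word_w[:mid] + word_wo[mid:])
        # Diacritics kept only in the second half of the word.
        second_half.append(word_wo[:mid] + word_w[mid:])
    return " ".join(first_half), " ".join(second_half)


truth = "Sistemul de învățământ este la pământ"
plain = truth.translate(str.maketrans("ăâîșț", "aaist"))
first, second = mix_halves(truth, plain)
print(plain)   # Sistemul de invatamant este la pamant
print(first)   # Sistemul de învățamant este la pămant
print(second)  # Sistemul de invatământ este la pamânt
```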

In [11]:
class Preprocessor():
  def __init__(self, augment=True):
    self._augment = augment
    
  def __call__(self, dataset):
    dataset['with'] = preprocess_line(dataset['text'])
    target = replace_diacritics(dataset['with'])
    mixed_words = []
    if self._augment:
      mixed_words = gen_mixed_words(dataset['with'], target)
    dataset['without'] = [target, *mixed_words]
    return dataset

preprocessor = Preprocessor(augment=True)

print(f"dataset = {dataset}")
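# Keep only lines that start with a letter. The range must be A-Z: the range A-z
# also matches the ASCII punctuation between 'Z' and 'a'.
dataset = dataset.filter(lambda sample: re.match(r"[a-zA-ZăâĂÂîÎșȘțȚ]", sample['text']) is not None)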
print(f"dataset = {dataset}")
dataset = dataset.filter(lambda sample: len(sample['text']) >= 20)
print(f"dataset = {dataset}")
preprocessed_dataset = dataset.map(preprocessor, remove_columns=['text'])
print(f"preprocessed_dataset = {preprocessed_dataset}")



dataset = DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 837647
    })
    validate: Dataset({
        features: ['text'],
        num_rows: 14897
    })
    test: Dataset({
        features: ['text'],
        num_rows: 30000
    })
})


  0%|          | 0/838 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/30 [00:00<?, ?ba/s]

dataset = DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 793299
    })
    validate: Dataset({
        features: ['text'],
        num_rows: 14468
    })
    test: Dataset({
        features: ['text'],
        num_rows: 29057
    })
})


  0%|          | 0/794 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

  0%|          | 0/30 [00:00<?, ?ba/s]

dataset = DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 771596
    })
    validate: Dataset({
        features: ['text'],
        num_rows: 14248
    })
    test: Dataset({
        features: ['text'],
        num_rows: 28471
    })
})


  0%|          | 0/771596 [00:00<?, ?ex/s]

  0%|          | 0/14248 [00:00<?, ?ex/s]

  0%|          | 0/28471 [00:00<?, ?ex/s]

preprocessed_dataset = DatasetDict({
    train: Dataset({
        features: ['with', 'without'],
        num_rows: 771596
    })
    validate: Dataset({
        features: ['with', 'without'],
        num_rows: 14248
    })
    test: Dataset({
        features: ['with', 'without'],
        num_rows: 28471
    })
})


In [12]:
def generate_equivalent_tokens(tokenizer, corpus):
  equivalent_tokens = {}
  tokens_size_map = {token : len(re.sub(r'##+', '', token))
    for token in tokenizer.vocab.keys()}
  for sample in tqdm(corpus['train']):
    target = re.sub(r'\ +', '', sample['with'])
    for input_text in sample['without']:
      tokenized_input = tokenizer.tokenize(input_text)
      tokens_size = [tokens_size_map[token] for token in tokenized_input]
      tokens_start = [0] + list(itertools.accumulate(tokens_size))
      assert len(target) == tokens_start[-1], "Length mismatch"
      for token, start, size in zip(tokenized_input, tokens_start, tokens_size):
        if token[-size:] != target[start:start + size]:
          equivalent_token = token[:-size] + target[start:start + size]
          equi_tokens = equivalent_tokens.get(token, set([equivalent_token]))
          equi_tokens.add(equivalent_token)
          equivalent_tokens[token] = equi_tokens
  return {token : list(equival) for token, equival in equivalent_tokens.items()}

equivalent_tokens = generate_equivalent_tokens(tokenizer, preprocessed_dataset)
with open('equivalent_tokens.pkl', 'wb') as f:
  pickle.dump(equivalent_tokens, f)

  0%|          | 0/771596 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1094 > 512). Running this sequence through the model will result in indexing errors


In [13]:
with open('equivalent_tokens.pkl', 'rb') as f:
  equivalent_tokens = pickle.load(f)

In [14]:
num_equivalent_token_classes = max([len(v) for _, v in equivalent_tokens.items()]) + 1
print(f"Num equivalence classes = {num_equivalent_token_classes}")

Num equivalence classes = 21


Tokens have different numbers of equivalent tokens with diacritics. For example, the token "##ras" can have the equivalent tokens {##răs, ##râs, ##raș, ##râș, ##răș}, whereas the token "##dop" does not have any equivalent tokens.

For each token we predict the probability of replacing it with each of its equivalent tokens. The number of token classes (labels) of the model is:

`num_equivalent_token_classes = max(num_equi_tokens(token) for token in tokenizer tokens) + 1`

The 0 class label is used to indicate that the token should not be changed.
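
A small worked example of the class count, using the "##ras" tokens from above plus one made-up entry ("##a" and its equivalents are hypothetical, chosen only for illustration):

```python
# Equivalent tokens per plain token. "##dop" has no entry at all.
toy_equivalent_tokens = {
    "##ras": ["##răs", "##râs", "##raș", "##râș", "##răș"],
    "##a": ["##ă", "##â"],  # hypothetical entry
}

# One class per equivalent token of the richest token, plus class 0.
num_classes = max(len(v) for v in toy_equivalent_tokens.values()) + 1
print(num_classes)  # 6

# Labels 1..n index a token's own equivalent tokens; 0 means "keep as is".
toy_labels = {
    token: {equi: i + 1 for i, equi in enumerate(equis)}
    for token, equis in toy_equivalent_tokens.items()
}
print(toy_labels["##a"])  # {'##ă': 1, '##â': 2}
```

With these toy tokens, labels 3 to 5 have no meaning for "##a", which is the situation the replacement strategy has to handle.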

Some tokens have fewer equivalent tokens than `num_equivalent_token_classes - 1`, so the highest-scoring label can be greater than the token's number of equivalent tokens. In that case we can choose one of two replacement strategies:
  * Default to the "do not change" label, 0
  * Choose the equivalent token with the highest score
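
The two strategies can be sketched as follows. This is an illustration under the assumption that `scores` holds one score per label for a single token, with index 0 meaning "do not change"; the function name and the strategy names are made up for the example.

```python
def choose_label(scores, num_equivalents, strategy="default_to_zero"):
    """Pick a usable label for one token.

    scores: one score per label, index 0 means "do not change".
    num_equivalents: how many equivalent tokens this token has.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    if best <= num_equivalents:
        return best  # the top label is valid for this token
    if strategy == "default_to_zero" or num_equivalents == 0:
        return 0     # out-of-range label: leave the token unchanged
    # Otherwise take the best-scoring label among the token's equivalents.
    return max(range(1, num_equivalents + 1), key=scores.__getitem__)


scores = [0.1, 0.2, 0.3, 0.4]
print(choose_label(scores, num_equivalents=3))                         # 3
print(choose_label(scores, num_equivalents=2))                         # 0
print(choose_label(scores, num_equivalents=2, strategy="best_valid"))  # 2
```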

Diacritics are not prevalent, so most tokens either have no equivalent tokens or stay unchanged, which skews the label distribution of the dataset towards the "do not change" label, 0. To balance the label distribution, a token that should stay unchanged and has fewer equivalent tokens than the maximum is given a random label greater than its number of equivalent tokens. This is possible because, under the first replacement strategy, such an out-of-range label is treated as "do not change".
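
A minimal sketch of this labeling rule for a single token, assuming the token's equivalents are given as a mapping from diacritic text to label; `assign_label` and its parameters are illustrative names:

```python
import random


def assign_label(token_equivalents, target_piece, num_classes, rng=random):
    """Label one token given the matching span of the ground truth.

    token_equivalents: {diacritic_text: label} for this token (may be empty).
    target_piece: the ground-truth characters covered by the token.
    """
    label = token_equivalents.get(target_piece)
    if label is not None:
        return label  # the token must be replaced by this equivalent
    if len(token_equivalents) == num_classes - 1:
        return 0      # every label is meaningful, so only 0 means "keep"
    # Any label above the token's own range also means "keep".
    return rng.randint(len(token_equivalents) + 1, num_classes - 1)


rng = random.Random(0)
equivalents_of_a = {"ă": 1, "â": 2}
print(assign_label(equivalents_of_a, "ă", num_classes=21, rng=rng))  # 1
keep_label = assign_label(equivalents_of_a, "a", num_classes=21, rng=rng)
print(3 <= keep_label <= 20)  # True
```

Spreading the "keep" cases over the unused labels means label 0 appears only for tokens that already use every class.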

In [15]:
def equivalent_token_lables(tokenizer, equivalent_tokens):
  return {tokenizer.convert_tokens_to_ids([token])[0] : {re.sub(r'##+', '', equi_token) : i + 1
                                for i, equi_token in enumerate(equi_tokens)}
      for token, equi_tokens in tqdm(equivalent_tokens.items(), "Equi classes")}

class EquivalentTokenMapper():
  def __init__(self, tokenizer, equivalent_tokens, random_label=True):
    self._tokenizer = tokenizer
    self._tokens_sizes = {value : len(re.sub(r'##+', '', token))
      for token, value in tqdm(tokenizer.vocab.items(), "Token sizes")}
    self._equivalent_classes = equivalent_token_lables(tokenizer,
                                                          equivalent_tokens)
    self._num_equi_classes = max([len(v) for _, v in equivalent_tokens.items()]) + 1
    self._random_label = random_label

  def __call__(self, dataset):
    target = dataset['with']
    target = re.sub(r'\ +', '', target)
    tokenized = self._tokenizer(dataset['without'],
                                           add_special_tokens=False,
                                           padding='do_not_pad',
                                           truncation=True)
    if len(tokenized['input_ids'][0]) == 0:
      print(dataset['without'])
      assert False
    all_labels = []
    for tokenized_input in tokenized['input_ids']:
      tokens_size = [self._tokens_sizes[token] for token in tokenized_input]
      tokens_start = [0] + list(itertools.accumulate(tokens_size))
      labels = []
      for t, (token, start, size) in enumerate(zip(tokenized_input, tokens_start, tokens_size)):
        equi_classes = self._equivalent_classes.get(token, {})
        label = equi_classes.get(target[start:start + size], None)
        if label is None:
          label = 0
          if self._random_label and len(equi_classes) != self._num_equi_classes - 1:
              label = random.randint(len(equi_classes) + 1, self._num_equi_classes - 1)
        labels.append(label)
      all_labels.append(labels)
    tokenized['labels'] = all_labels
    dataset['labels'] = all_labels
    dataset['input_ids'] = tokenized['input_ids']
    dataset['attention_mask'] = tokenized['attention_mask']
    return dataset

helper = EquivalentTokenMapper(tokenizer, equivalent_tokens)

Token sizes:   0%|          | 0/28996 [00:00<?, ?it/s]

Equi classes:   0%|          | 0/7945 [00:00<?, ?it/s]

In [16]:
tokenizer.save_vocabulary("./", "decoder")

('./decoder-vocab.txt',)

In [17]:
tokens_to_add = [token for tokens in tqdm(equivalent_tokens.values(), "Tokens to add to vocab")
 for token in tokens if tokenizer.convert_ids_to_tokens(tokenizer.convert_tokens_to_ids([token]))[0] == '[UNK]']
print(f"Num equivalent tokens added = {len(tokens_to_add)}")
with open("decoder-vocab.txt", "a") as decoder_vocab:
  decoder_vocab.write("\n".join(tokens_to_add))

Tokens to add to vocab:   0%|          | 0/7945 [00:00<?, ?it/s]

Num equivalent tokens added = 9437


In [18]:
decoder_tokenizer = transformers.DistilBertTokenizerFast("decoder-vocab.txt",
                                                         do_lower_case=False)

In [19]:
#tokenized_dataset = preprocessed_dataset.map(helper)
smaller_dataset = preprocessed_dataset['train'].shuffle(seed=42).select(range((3600 // 14 )* 100))
print(smaller_dataset)
tokenized_smaller_dataset = smaller_dataset.map(helper, remove_columns=['with', 'without'])
print(tokenized_smaller_dataset)

class TrainCollater():
  def __init__(self, pad_id, size):
    self._pad_id = pad_id
    self._size = size

  def __call__(self, samples):
    num_tensors = len(samples[0]['input_ids'])
    num_samples = len(samples)
    shape = (num_samples * num_tensors, self._size)
    input_ids = torch.ones(shape, dtype=torch.int64) * self._pad_id
    attention_mask = torch.zeros(shape, dtype=torch.int64)
    labels = torch.zeros(shape, dtype=torch.int64)
    for s, sample in enumerate(samples):
      for i, ids in enumerate(sample['input_ids']):
        input_ids[s * num_tensors + i, 0:ids.shape[0]] = ids
      for m, mask in enumerate(sample['attention_mask']):
        attention_mask[s * num_tensors + m, 0:mask.shape[0]] = mask
      for l, label in enumerate(sample['labels']):
        labels[s * num_tensors + l, 0:label.shape[0]] = label
    
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}

tokenized_smaller_dataset.set_format("torch")
dataloader = torch.utils.data.DataLoader(tokenized_smaller_dataset, collate_fn=TrainCollater(0, 512), batch_size=4)

Dataset({
    features: ['with', 'without'],
    num_rows: 25700
})


  0%|          | 0/25700 [00:00<?, ?ex/s]

Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 25700
})


In [20]:
def get_equivalent_ids(tokenizer, equivalent_tokens):
  return {tokenizer.convert_tokens_to_ids([token])[0] :
          {label + 1 : id for label, id in enumerate(tokenizer.convert_tokens_to_ids(equi_tokens))}
   for token, equi_tokens in tqdm(equivalent_tokens.items())
  }
equivalent_ids = get_equivalent_ids(decoder_tokenizer, equivalent_tokens)

  0%|          | 0/7945 [00:00<?, ?it/s]

In [21]:
def replace_tokens(input_ids, labels, equivalent_ids):
  for row in range(input_ids.shape[0]):
    for col in range(input_ids.shape[1]):
      if labels[row, col] != 0:
        input_ids[row, col] = equivalent_ids.get(input_ids[row, col].item(), {}).get(labels[row, col].item(), input_ids[row, col].item())
  return input_ids

In [23]:
model = transformers.AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-cased", num_labels=num_equivalent_token_classes)

Downloading:   0%|          | 0.00/251M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this 

In [None]:
model = transformers.AutoModelForTokenClassification.from_pretrained("/content/models/distil_bert_trained")

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
if torch.cuda.is_available():
  torch.cuda.empty_cache()
print(f"device = {device}")
model = model.to(device)
model = model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)
loss_weights = torch.ones(num_equivalent_token_classes)
# The dataset label distribution is unbalanced towards the 0 label.
# Give label 0 a small weight to keep the model from learning to predict
# label 0 for all inputs.
loss_weights[0] = 1e-4
optim_criterion = torch.nn.CrossEntropyLoss(weight=loss_weights).to(device)
num_training_steps = len(dataloader)
progress_bar = tqdm(range(num_training_steps), "Train")
lr_scheduler = transformers.get_scheduler(name="linear",
                                          optimizer=optimizer,
                                          num_warmup_steps=0,
                                          num_training_steps=num_training_steps)
running_loss_window = 500
running_loss = 0
for b, batch in enumerate(dataloader):
  batch = {k: v.to(device) for k, v in batch.items()}
  outputs = model(input_ids=batch['input_ids'],
                  attention_mask=batch['attention_mask'])
  loss = optim_criterion(outputs.logits.permute((0, 2, 1)), batch['labels'])
  running_loss += loss.item()  # accumulate a float, not the graph-attached tensor
  loss.backward()

  optimizer.step()
  lr_scheduler.step()
  optimizer.zero_grad()
  
  if b % running_loss_window == 0 and b > 0:
    print(f"{b:05d}: {running_loss / running_loss_window}")
    running_loss = 0
    model.save_pretrained("./checkpoints")
  progress_bar.update(1)

model.save_pretrained("./checkpoints")

In [None]:
test_dataset = preprocessed_dataset['test']
print(f"Test dataset: {test_dataset}")
tokenized_test_dataset = test_dataset.map(helper)
print(tokenized_test_dataset)


Test dataset: Dataset({
    features: ['with', 'without'],
    num_rows: 28471
})


  0%|          | 0/28471 [00:00<?, ?ex/s]

Dataset({
    features: ['with', 'without', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 28471
})


In [None]:
def gen_header_and_labels():
  header = [chr(char) for char in range(ord('a'), ord('z') + 1)] 
  header += [chr(char) for char in range(ord('A'), ord('Z') + 1)] 
  header += list("ăâĂÂîÎșȘțȚ")
  labels = {char : i for i, char in enumerate(header)}
  header += ["*"]
  return header, labels

def compute_confusion_matrix(dataset, model):
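  # Size the matrix from the header so that "*" gets its own row and column
  # for characters outside the alphabet.
  header, labels = gen_header_and_labels()
  num_labels = len(header)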
  conf_mat = torch.zeros((num_labels, num_labels), dtype=torch.int64)
  model = model.eval()
  for sample in tqdm(dataset, "Compute f1:"):
    for ids, mask in zip(sample['input_ids'], sample['attention_mask']):
      input_ids = torch.tensor([ids], dtype=torch.int64).to(device)
      attention_mask = torch.tensor([mask], dtype=torch.int64).to(device)
      with torch.no_grad():  # inference only, no gradients needed
        predicted = model(input_ids=input_ids, attention_mask=attention_mask)
      input_ids = input_ids.cpu()
      attention_mask = attention_mask.cpu()
      predicted.logits = predicted.logits.cpu()
      predicted_labels = torch.argmax(predicted.logits, dim=-1)
      input_ids = replace_tokens(input_ids, predicted_labels, equivalent_ids)
      decoded_text = decoder_tokenizer.batch_decode(input_ids)[0]
      target = re.sub(r'\ +', '', sample['with'])
      predicted = re.sub(r'\ +', '', decoded_text)
      for gt, pred in zip(target, predicted):
        conf_mat[labels.get(gt, num_labels - 1), labels.get(pred, num_labels - 1)] += 1
  return conf_mat

def compute_f1_score(conf_mat): 
  tp = torch.sum(torch.eye(conf_mat.size()[0]) * conf_mat, dim=1)
  fn = torch.sum((torch.ones(conf_mat.size()[0]) - torch.eye(conf_mat.size()[0])) * conf_mat, dim=1)
  fp = torch.sum((torch.ones(conf_mat.size()[0]) - torch.eye(conf_mat.size()[0])) * conf_mat, dim=0)
  f1_score = (2 * tp) / (2 * tp + fp + fn)
  return f1_score


def print_f1_score(f1_score):
  header, _ = gen_header_and_labels()
  print(f"F1 score: {torch.mean(f1_score[torch.logical_not(torch.isnan(f1_score))]) : .3f}")
  for i, f1 in enumerate(f1_score):
    print(f"{header[i]} : {f1 : .2f}")

def print_confusion_mat(conf_mat, threshold=0.0):
  header, _ = gen_header_and_labels()
  print(f"Confusion mat:")
  conf_mat = conf_mat / torch.sum(conf_mat, keepdim=True, dim=1)
  for row_i, row in enumerate(conf_mat):
    print(f"{header[row_i]}: " + " ".join([f"({header[i]}, {val : >5.2f})"
     for i, val in enumerate(row) if val > threshold]))

confusion_matrix = compute_confusion_matrix(tokenized_test_dataset, model)
f1_score = compute_f1_score(confusion_matrix)
print_f1_score(f1_score)
print_confusion_mat(confusion_matrix)

Compute f1::   0%|          | 0/28471 [00:00<?, ?it/s]

AttributeError: ignored
