We already have millions of labeled words (valid vs. invalid). Why don't we try to train a simple binary classifier that can help us identify valid words in new datasets?

In [1]:
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from transformers import DataCollatorWithPadding
from transformers import EarlyStoppingCallback
from transformers import get_scheduler
from torch.optim import AdamW
from evaluate import load




## Data

Let's load the valid and invalid word lists. We'll use `1.2M` words of each class to train the model.

In [2]:
# Load the list of invalid words. Shuffle them.
invalid_words = pd.read_csv('data/6_project_stork_invalid_words.txt.gz', encoding='windows-1251', header=None, compression='gzip').sample(frac=1)[0].tolist()
print(f"Number of invalid words: {len(invalid_words):,}")

# Load the names list (it also contains invalid words). Shuffle them.
names_and_invalid_words = pd.read_csv('data/6_project_stork_names.txt.gz', encoding='windows-1251', header=None, compression='gzip').sample(frac=1)[0].tolist()
names_and_invalid_words = invalid_words + names_and_invalid_words
print(f"Number of names & invalid words: {len(names_and_invalid_words):,}")

Number of invalid words: 16,425
Number of names & invalid words: 3,901,279


In [3]:
# Load valid words vocabulary. Shuffle the words.
vocab_words = pd.read_csv('data/words.txt.gz', encoding='windows-1251', header=None, compression='gzip', names=['word']).sample(frac=1)['word'].tolist()
print(f"Number of valid words: {len(vocab_words):,}")

Number of valid words: 1,230,247


## Convert to `Dataset`

Prepare the train set (`2.4M` examples in total).

In [4]:
# Positive classification dataset (1.2M words)
valid_train_words = vocab_words[:1200000]
# Negative classification dataset (1.2M words)
invalid_train_words = names_and_invalid_words[:1200000]
# Add labels to the datasets
valid_train_words_labeled = [{"text":w, "label":1} for w in valid_train_words]
invalid_train_words_labeled = [{"text":w, "label":0} for w in invalid_train_words]
# Merge two lists
train_ds = Dataset.from_list(valid_train_words_labeled + invalid_train_words_labeled)
# Print some results
print(train_ds.shuffle().take(3).to_list())

[{'text': 'вариетно', 'label': 1}, {'text': 'запеням', 'label': 1}, {'text': 'шевалария', 'label': 0}]


Prepare the test set (`60K` examples in total).

In [5]:
# Positive classification dataset (30K words)
valid_test_words = vocab_words[1200000:1230000]
# Negative classification dataset (30K words)
invalid_test_words = names_and_invalid_words[1200000:1230000]
# Add labels to the datasets
valid_test_words_labeled = [{"text":w, "label":1} for w in valid_test_words]
invalid_test_words_labeled = [{"text":w, "label":0} for w in invalid_test_words]
# Merge two lists
test_ds = Dataset.from_list(valid_test_words_labeled + invalid_test_words_labeled)
# Print some results
print(test_ds.shuffle().take(3).to_list())

[{'text': 'разшумях', 'label': 1}, {'text': 'попритъпявани', 'label': 1}, {'text': 'услаждал', 'label': 1}]
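
The positive and negative lists are sliced independently, so nothing in the cells above guarantees that a word doesn't end up with both labels or in both splits. A minimal sketch of a sanity check, assuming plain Python lists like the ones above (the helper names and the toy data are hypothetical, not part of the notebook):

```python
def find_label_conflicts(valid_words, invalid_words):
    """Return words that appear in both the positive and the negative list."""
    return sorted(set(valid_words) & set(invalid_words))


def find_train_test_overlap(train_words, test_words):
    """Return words that appear in both the train and the test split."""
    return sorted(set(train_words) & set(test_words))


# Toy data for illustration only (not taken from the real files)
toy_valid = ["риф", "мимо", "литиев"]
toy_invalid = ["шевалария", "мимо"]
print(find_label_conflicts(toy_valid, toy_invalid))  # ['мимо']
```

On the real data this could be called as `find_label_conflicts(valid_train_words, invalid_train_words)`; any word it returns is a candidate for label noise.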


## Tokenizer

We'll use `usmiva/bert-web-bg-cased`, since this language model was pre-trained specifically on Bulgarian text. It is also small enough to run well on a consumer-grade GPU.

In [6]:
# Initialize the BERT-based tokenizer
tokenizer = AutoTokenizer.from_pretrained("usmiva/bert-web-bg-cased")

# Tokenize the datasets. max_length caps the tokenized input length; the smaller it is, the faster the training.
tokenize_function = lambda x: tokenizer(x['text'], padding="max_length", truncation=True, max_length=37)
tokenized_datasets = {}
tokenized_datasets["train"] = train_ds.map(tokenize_function, batched=True)
tokenized_datasets["test"] = test_ds.map(tokenize_function, batched=True)

print(f"Number of train examples: {len(tokenized_datasets['train']):,}")
print(f"Number of test examples: {len(tokenized_datasets['test']):,}")
print(tokenized_datasets["train"][0])

Number of train examples: 2,400,000
Number of test examples: 60,000
{'text': 'стърсещи', 'label': 1, 'input_ids': [2, 27319, 1990, 2027, 3, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000, 30000], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
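
The example above uses only 5 of the 37 positions, so most of each row is padding. A rough sketch of how `max_length` could be chosen from the data, assuming we have attention masks like the one printed above (the helpers and the toy masks are hypothetical). To measure true lengths on the real data, the words would need to be tokenized without truncation first.

```python
import math


def token_lengths(attention_masks):
    """Number of real (non-padding) tokens per example."""
    return [sum(mask) for mask in attention_masks]


def length_percentile(lengths, q):
    """Smallest length that covers at least q percent of the examples (nearest rank)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Toy attention masks for illustration only
toy_masks = [
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
]
lengths = token_lengths(toy_masks)
print(lengths)                          # [5, 3, 7, 4]
print(length_percentile(lengths, 100))  # 7
```

A high percentile (say 99 or 100) of the real lengths gives a `max_length` that truncates few or no words while keeping the padded tensors small.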


## Model

We'll use the same model and fine-tune it for a binary classification task.

In [7]:
# Initialize a BERT model for binary classification
model_name = "usmiva/bert-web-bg-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # Binary classification

print(model.config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at usmiva/bert-web-bg-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "usmiva/bert-web-bg-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



## Prepare the training

Set the hyperparameters (e.g. learning rate, batch_size, optimizer, scheduler, etc.)

In [8]:
# Define training arguments
learning_rate = 5e-7
training_args = TrainingArguments(
    optim="adamw_torch",
    output_dir="./results",             # Directory for saving model checkpoints
    evaluation_strategy="steps",        # Evaluate every eval_steps steps (defaults to logging_steps, i.e. 50)
    learning_rate=learning_rate,        # Start with a small learning rate
    per_device_train_batch_size=256,    # Batch size per GPU (this is the max for my GPU)
    per_device_eval_batch_size=256,
    gradient_accumulation_steps=4,      # Increase batch size without increasing memory usage
    num_train_epochs=3,                 # Number of epochs
    weight_decay=0.01,                  # Regularization
    save_total_limit=2,                 # Limit checkpoints to save space
    load_best_model_at_end=True,        # Automatically load the best checkpoint
    logging_dir="./logs",               # Directory for logs
    logging_steps=50,                   # Log every 50 steps
    fp16=True                           # Enable mixed precision for faster training
)

print(training_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=50,
eval_strategy=IntervalStrategy.STEPS,
eval_us



In [9]:
# Load a metric (F1-score in this case)
metric = load("f1")

# Define a custom compute_metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)
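
To make the metric concrete, here is a dependency-free sketch of what `compute_metrics` reports, assuming the `f1` metric uses its default binary averaging with label `1` (valid) as the positive class. The function name and the toy logits are hypothetical.

```python
def binary_f1(logits, labels):
    """F1 for the positive class (label 1), computed from raw logits."""
    # Same as argmax over the last axis: pick the index of the larger logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy [invalid, valid] logits for illustration only
toy_logits = [[0.2, 1.1], [1.5, -0.3], [-0.4, 0.9], [0.8, 0.1]]
toy_labels = [1, 0, 0, 1]
print(binary_f1(toy_logits, toy_labels))  # 0.5
```

Here one valid word is found, one is missed, and one invalid word is accepted, so precision and recall are both 0.5.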

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [11]:
trainer = Trainer(
    model=model,                        # Pre-trained BERT model
    args=training_args,                 # Training arguments
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,        # Efficient batching
    compute_metrics=compute_metrics     # Custom metric
)

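# Set standard optimizer.
# NOTE: this optimizer and the scheduler below are never passed to the Trainer
# (that would require `optimizers=(optimizer, scheduler)` in the constructor),
# so training uses the optimizer and scheduler the Trainer builds from `training_args`.
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Scheduler setup. Warmup steps keep the model from losing its pretrained knowledge right away.
# NOTE: `num_training_steps` below counts examples, not optimizer steps.
scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=1000,
    num_training_steps=len(trainer.train_dataset) * training_args.num_train_epochs
)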

# Set early stopping in case the model is not improving
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=5))
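
For reference, the number of optimizer steps implied by the hyperparameters above can be estimated with a little arithmetic. This is only a sketch, assuming a single GPU; the helper is hypothetical and the Trainer's own rounding may differ by a step or two.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the effective batch size, steps per epoch and total optimizer steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Values from the cells above: 2.4M examples, batch size 256, 4 accumulation steps, 3 epochs
effective, per_epoch, total = optimizer_steps(2_400_000, 256, 4, 3)
print(effective, per_epoch, total)  # 1024 2344 7032
```

So a full run is roughly 7,000 optimizer steps, which is the scale a scheduler's `num_training_steps` and `num_warmup_steps` should be compared against.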



## Train

This step might take several hours.

In [12]:
# Start training
trainer.train()

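| Step | Training Loss | Validation Loss | F1 |
|------|---------------|-----------------|----------|
| 50 | 0.6732 | 0.608885 | 0.748664 |
| 100 | 0.5816 | 0.495133 | 0.811322 |
| 150 | 0.4861 | 0.404146 | 0.833057 |
| 200 | 0.4095 | 0.353762 | 0.849769 |
| 250 | 0.3665 | 0.323599 | 0.863571 |
| 300 | 0.3442 | 0.304882 | 0.873453 |
| 350 | 0.3246 | 0.293589 | 0.878792 |
| 400 | 0.312 | 0.281561 | 0.887645 |
| 450 | 0.3028 | 0.27387 | 0.889826 |
| 500 | 0.2963 | 0.266991 | 0.894267 |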


Could not locate the best model at ./results\checkpoint-3700\pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.


TrainOutput(global_step=3950, training_loss=0.26363600549818594, metrics={'train_runtime': 5523.4067, 'train_samples_per_second': 1303.543, 'train_steps_per_second': 1.273, 'total_flos': 7.690249392720384e+16, 'train_loss': 0.26363600549818594, 'epoch': 1.6852266666666667})
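
Training stopped at step 3950 (epoch 1.69 of 3), and the log refers to `checkpoint-3700` as the best model. The gap of 250 steps is exactly 5 evaluations at 50-step intervals, which is consistent with `early_stopping_patience=5` kicking in. Below is a simplified sketch of that patience logic; it is not the library's implementation (the real callback also supports a threshold and other metrics), and the toy losses are made up.

```python
def early_stopping_step(eval_losses, eval_every, patience):
    """Return (stop_step, best_step); stop_step is None if stopping never triggers."""
    best_loss = float("inf")
    best_step = None
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        step = i * eval_every
        if loss < best_loss:
            # New best evaluation: remember it and reset the patience counter
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return None, best_step


# Toy validation losses for illustration only
toy_losses = [0.30, 0.28, 0.27, 0.275, 0.28, 0.272, 0.29, 0.271]
print(early_stopping_step(toy_losses, eval_every=50, patience=5))  # (400, 150)
```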

## Test the trained model

Let's check and analyze the wrong predictions our model made on the test set.

In [19]:
predictions = trainer.predict(tokenized_datasets["test"])

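def is_valid(prediction, aggressiveness=0.1):
    # prediction holds the [invalid, valid] logits; the word counts as valid
    # only if the valid logit leads by more than `aggressiveness`
    delta = prediction[1] - prediction[0]
    return 1 if delta > aggressiveness else 0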

cnt = 0
for idx in range(len(tokenized_datasets["test"])):
    w = tokenized_datasets["test"][idx]
    prediction = predictions[0][idx]
    valid = is_valid(prediction)
    if valid != w['label']:
        print(f"{w['text']}, {w['label']}, {valid}, {prediction}")
        cnt += 1
print(cnt)

щекотлив, 1, 0, [ 0.52978516 -0.7163086 ]
бълболиш, 1, 0, [ 1.8408203 -1.9121094]
нешлифовано, 1, 0, [ 0.14770508 -0.51464844]
адаптоген, 1, 0, [ 1.9775391 -1.8105469]
акватинти, 1, 0, [ 0.4025879  -0.13232422]
секссъвет, 1, 0, [ 0.20129395 -0.22558594]
литиев, 1, 0, [ 1.7148438 -1.8154297]
мимо, 1, 0, [ 1.3837891 -1.4638672]
кореняко, 1, 0, [-0.14819336 -0.09259033]
пекторис, 1, 0, [ 1.7734375 -1.7275391]
многомашиннико, 1, 0, [-0.06585693 -0.14294434]
девоенизацийка, 1, 0, [ 0.59814453 -0.6557617 ]
чучура, 1, 0, [ 0.84228516 -1.0927734 ]
тепани, 1, 0, [ 0.0958252  -0.20397949]
четигигабайтов, 1, 0, [ 1.1484375  -0.88623047]
жановисти, 1, 0, [ 1.1259766 -1.0556641]
риф, 1, 0, [ 1.9130859 -1.8076172]
продухът, 1, 0, [-0.1348877  -0.45874023]
изкукуригаш, 1, 0, [ 0.18664551 -0.26416016]
еделвайси, 1, 0, [ 1.5058594 -1.5322266]
зубкано, 1, 0, [ 0.45776367 -0.7475586 ]
пъстървово, 1, 0, [-0.08917236 -0.03234863]
пейпал, 1, 0, [ 2.09375   -2.2617188]
спокойничък, 1, 0, [-0.12988281 -0.0791
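
The `aggressiveness` margin is easier to reason about as a probability. For two classes, the softmax probability of the valid class equals the sigmoid of the logit difference, so `delta > aggressiveness` is the same as `P(valid) > sigmoid(aggressiveness)`. A small sketch (the helper names are hypothetical):

```python
import math


def valid_probability(logits):
    """Softmax probability of class 1 (valid) from an [invalid, valid] logit pair."""
    delta = logits[1] - logits[0]
    return 1.0 / (1.0 + math.exp(-delta))


def probability_threshold(aggressiveness):
    """Probability cut-off equivalent to a given logit margin."""
    return 1.0 / (1.0 + math.exp(-aggressiveness))


print(round(probability_threshold(0.1), 3))  # 0.525
print(round(probability_threshold(2), 3))    # 0.881
# Logits printed above for the misclassified word 'бълболиш'
print(round(valid_probability([1.8408203, -1.9121094]), 3))
```

So the default margin of `0.1` accepts a word at roughly 52.5% confidence, while a margin of `2` requires about 88%.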

## Process Stork words

Let's validate this model by checking how it classifies the words from the `Stork` dataset with TF-IDF between $0.7$ and $0.75$. I expect 40-50% of the words in this segment to be valid.

In [14]:
# Let's process the words with TF-IDF above 0.70 and up to 0.75
stork_words_df = pd.read_csv('data/6_project_stork.csv.gz', encoding='windows-1251', compression='gzip', header=None, names=['word', 'tfidf'])
stork_words_set = set(stork_words_df[(stork_words_df['tfidf'] > 0.70) & (stork_words_df['tfidf'] <= 0.75)].word.tolist())
stork_words_to_check = sorted(list(stork_words_set - set(vocab_words)))
print(f"Number of words to check: {len(stork_words_to_check):,}")
print(stork_words_to_check[30:40])

Number of words to check: 4,200
['адиос', 'адипонектин', 'административнонаказателни', 'адренергични', 'адренергичните', 'адрияна', 'адсорбент', 'адсорбенти', 'адювант', 'аериране']


In [15]:
stork_word_ds = Dataset.from_list([{"text":w} for w in stork_words_to_check])
stork_word_ds_tokenized = stork_word_ds.map(tokenize_function)
predictions = trainer.predict(stork_word_ds_tokenized)
print(f"Number of predictions: {len(predictions[0])}")

Number of predictions: 4200


In [18]:
cnt = 0 
for idx in range(len(stork_word_ds_tokenized)):
    w = stork_word_ds_tokenized[idx]
    prediction = predictions[0][idx]
    valid = is_valid(prediction, aggressiveness=2)
    if valid:
        print(w['text'], prediction, valid)
        cnt += 1
print(f"Number of valid words: {cnt}")

аакимите [-2.15625    2.2695312] 1
або [-2.4765625  2.3984375] 1
абокат [-2.3847656  2.2246094] 1
абсолю [-2.6875     2.7226562] 1
абсорбация [-1.3867188  1.5146484] 1
аваз [-2.2695312  2.2617188] 1
автоагресия [-2.0546875  1.8798828] 1
автоклиматици [-2.7597656  2.6035156] 1
автоключар [-1.140625    0.91748047] 1
автохемотерапия [-1.0146484  1.0771484] 1
агапе [-2.640625   3.0039062] 1
аглаонема [-1.9951172  1.8457031] 1
агнозия [-2.9746094  2.9941406] 1
агонист [-2.2910156  2.0527344] 1
агрегиране [-1.84375    1.9736328] 1
агросектора [-1.3291016  1.0566406] 1
агях [-2.7324219  2.7519531] 1
адаптогените [-2.9082031  2.8769531] 1
адаптол [-2.421875   2.5839844] 1
адвент [-1.7558594  1.7773438] 1
адвенчър [-1.9736328  1.8564453] 1
аденовирусна [-2.1328125  2.1230469] 1
аденокарциномът [-1.2255859  1.1923828] 1
адио [-2.9902344  3.0410156] 1
адиос [-2.3789062  2.3105469] 1
адипонектин [-2.6074219  2.4511719] 1
административнонаказателни [-2.4570312  2.3691406] 1
адренергични [-1.2246094

* **OBSERVATION**: Unfortunately, this model still makes a lot of wrong predictions, so we can't fully automate the unknown word triage process. But it was worth trying. :D
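
One way to quantify this trade-off would be to sweep the `aggressiveness` margin on the labeled test set and look at precision and recall for the valid class. The sketch below assumes we already have the logit differences (`prediction[1] - prediction[0]`) and the labels; the helper and the toy numbers are hypothetical, not results from the model.

```python
def precision_recall_at(deltas, labels, aggressiveness):
    """Precision and recall for label 1 when accepting delta > aggressiveness."""
    predicted = [1 if d > aggressiveness else 0 for d in deltas]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    # Report 0.0 when a ratio is undefined (nothing accepted / no positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy logit differences and labels for illustration only
toy_deltas = [4.4, 3.1, 2.5, 0.6, -0.2, -3.7]
toy_labels = [1, 1, 0, 1, 0, 0]
for margin in (0.1, 2, 3):
    print(margin, precision_recall_at(toy_deltas, toy_labels, margin))
```

A larger margin should generally trade recall for precision, which is the direction that matters if accepted words are added to the vocabulary without manual review.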
