# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English, through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is actually highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus or raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targetting.

Bob has found out that at this time his team has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already has support for, so you can start by checking how well this solution does and modify it to obtain better results. You should not waste any efforts on trying to decrypt the data - this will not help you build a better classifier and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours using an L4 GPU as the compute resources of the company are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [1]:
from google.colab import userdata

read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

### Dependencies

In [2]:
import importlib
import torch, transformers

if '2.3.0' not in torch.__version__:
  !pip install torch==2.3.0
if transformers.__version__!='4.41.2':
  !pip install transformers==4.41.2

if importlib.util.find_spec('datasets') is None:
  !pip install datasets==2.18.0s
  !pip install evaluate==0.4.2
  !pip install accelerate -U


If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.

# Data

In [3]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)

Downloading readme:   0%|          | 0.00/393 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/19.4k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/19.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/218 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/218 [00:00<?, ? examples/s]

Downloading readme:   0%|          | 0.00/281 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/90.6M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/611245 [00:00<?, ? examples/s]

# Baseline

In [4]:
# load the pre-trained tokenizer and use it to process the data

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True, max_length=512)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/872k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.72M [00:00<?, ?B/s]

Map:   0%|          | 0/218 [00:00<?, ? examples/s]

Map:   0%|          | 0/218 [00:00<?, ? examples/s]

In [5]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

In [6]:
!pip install transformers[torch]
!pip install accelerate -U

Collecting accelerate>=0.21.0 (from transformers[torch])
  Using cached accelerate-0.32.1-py3-none-any.whl (314 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch->transformers[torch])
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch->transformers[torch])
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch->transformers[torch])
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
Collecting nvidia-cufft-cu12==11.0.2.54 (from torch->transformers[t

In [7]:
# # define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-uncased", num_labels=5
)


# training_args = TrainingArguments(
#     output_dir="basiline_bobai",
#     learning_rate=1e-5,
#     per_device_train_batch_size=64,
#     per_device_eval_batch_size=64,
#     num_train_epochs=20,
#     weight_decay=0.01,
#     eval_strategy="epoch",
#     save_strategy="epoch",
#     save_total_limit=5,
#     metric_for_best_model='f1',
#     load_best_model_at_end=True,
#     push_to_hub=True,
#     hub_strategy="checkpoint",
#     hub_token=write_access_token,
#     hub_private_repo=True,
#     hub_model_id='baseline_bobai'

# )

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=tokenized_data["train"],
#     eval_dataset=tokenized_data["dev"],
#     tokenizer=tokenizer,
#     data_collator=data_collator,
#     compute_metrics=compute_metrics,
# )

model.safetensors:   0%|          | 0.00/672M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# trainer.train()

目指せ脱Trainer

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from torch.functional import F
from tqdm.auto import tqdm
import numpy as np
import os

from transformers import get_scheduler

import evaluate

In [20]:
# モデルを訓練する

tokenized_data = tokenized_data.remove_columns(["text"])
tokenized_data = tokenized_data.rename_column("label", "labels")
tokenized_data.set_format("torch")

# dataset
train_dataset = tokenized_data["train"]
eval_dataset = tokenized_data["dev"]
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=0, pin_memory=True, collate_fn=data_collator)
eval_loader = DataLoader(eval_dataset, batch_size=64, num_workers=0, pin_memory=True, collate_fn=data_collator)
# num_workers=os.cpu_count()

print(train_dataset)

epochs = 20
num_training_steps = epochs * len(train_loader)
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_scheduler(name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 218
})


In [21]:
print(tokenized_data["train"].features)

{'labels': Value(dtype='int64', id=None), 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None), 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None)}


In [22]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

def train_one_epoch(model, scheduler, train_loader, criterion, optimizer):
    model.train()
    running_loss = 0.0
    for batch in tqdm(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

def evaluate_model(model, test_loader):
    model.eval()
    for batch in eval_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
    return f1.compute(predictions=predictions, references=batch["labels"], average='macro')

# Train and evaluate the model
model.to(device)
for i in range(epochs):
  train_one_epoch(model, scheduler, train_loader, criterion, optimizer)
  accuracy = evaluate_model(model, eval_loader)
  print(f'Epoch {i+1} {accuracy}')

  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 1 accuracy: {'f1': 0.09411764705882353}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 2 accuracy: {'f1': 0.15692307692307692}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 3 accuracy: {'f1': 0.14965034965034962}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 4 accuracy: {'f1': 0.17564102564102563}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 5 accuracy: {'f1': 0.15844155844155844}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 6 accuracy: {'f1': 0.16956521739130434}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 7 accuracy: {'f1': 0.16}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 8 accuracy: {'f1': 0.2065359477124183}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 9 accuracy: {'f1': 0.2318295739348371}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 10 accuracy: {'f1': 0.3230769230769231}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 11 accuracy: {'f1': 0.31990231990231993}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 12 accuracy: {'f1': 0.3230769230769231}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 13 accuracy: {'f1': 0.29797979797979796}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 14 accuracy: {'f1': 0.29411764705882354}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 15 accuracy: {'f1': 0.29411764705882354}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 16 accuracy: {'f1': 0.29411764705882354}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 17 accuracy: {'f1': 0.29411764705882354}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 18 accuracy: {'f1': 0.29732620320855613}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 19 accuracy: {'f1': 0.29732620320855613}%


  0%|          | 0/4 [00:00<?, ?it/s]

Epoch 20 accuracy: {'f1': 0.29732620320855613}%


# Inference

In [None]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = trainer.predict(tokenized_data[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')

In [None]:
# 精度をみる

print(dev_f1)
print()
print(predictions)
print(labels)

total = 0
correct = 0
for i in range(len(predictions)):
  if predictions[i] == labels[i]:
    correct += 1
  total += 1
print()
print(correct/total)

In [None]:
# write the predictions to a file
with open('{}_predictions.txt'.format(data_split), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions.tolist()]))