<a href="https://colab.research.google.com/github/chizuchizu/IOAI/blob/main/Task2/redrock_001_task2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Help BOBAI: Classify an unknown language

<img src="https://drive.google.com/uc?id=1Hvgrrah-T7yFTzDP002XuRodhyfY1Hju" width="750">

## Background
Bob's AI start-up, Bobai, builds AI solutions for other companies which have to process large volumes of text in their daily tasks. Bobai serve companies from all over the world, and they pride themselves on their ability to handle a variety of languages, from English, through Arabic to Mandarin. The secret to Bobai's success is that all of their products are based on a strong multilingual language encoder, mBERT. Bobai's infrastructure is actually highly optimized for this specific language encoder, which makes their products super fast and efficient, i.e. very attractive to clients.

## Task

But mBERT is trained on just 101 languages. So what happens when one of Bobai's biggest clients, Amoira, requests support for a new language X that is not among those 101 languages? Bob and his team have to find a way to meet this request, as they cannot risk losing the client.

The data Amoira has provided consists of a small labeled dataset for text classification and a larger corpus of raw text in the language.

To make things even more complicated, Amoira has encrypted the data, as they don't want to risk competitors finding out which new market they are targeting.

Bob's team currently has no bandwidth to develop this product, so he is asking for your help. He has shared the baseline solution he uses for languages that mBERT already supports, so you can start by checking how well it does and then modify it to obtain better results. Do not spend any effort trying to decrypt the data: this will not help you build a better classifier, and it will get you in trouble with Bob!

Your task is to build the best text classifier for language X that you can, while operating within the constraints of Bobai:

*   The classifier has to be based on mBERT (and cannot use any additional pre-trained language encoder).
*   The classifier has to train in under 8 hours on an L4 GPU, as the company's compute resources are limited.
*   The classifier has to perform inference on any random 500 data samples in under 5 minutes (Bobai will then apply their optimization tricks to bring this time even further down).
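
The inference constraint above can be checked empirically before submitting. The snippet below is a minimal sketch, not part of the original baseline: `fits_time_budget` and `dummy_predict` are hypothetical names, and the dummy predictor only stands in for a real call such as `trainer.predict` on 500 samples.

```python
import time

def fits_time_budget(predict_fn, samples, budget_seconds=300.0):
    """Run predict_fn on samples and report whether it stayed within the budget."""
    start = time.perf_counter()
    outputs = predict_fn(samples)
    elapsed = time.perf_counter() - start
    return outputs, elapsed, elapsed <= budget_seconds

# Hypothetical stand-in for the real model: always predicts class 0.
def dummy_predict(batch):
    return [0 for _ in batch]

samples = ["some text"] * 500  # 500 samples, as in the constraint
outputs, elapsed, ok = fits_time_budget(dummy_predict, samples)
print(f"{len(outputs)} predictions in {elapsed:.4f}s, within budget: {ok}")
```

Timing on the same GPU type that the constraint names (an L4) gives the most meaningful number; CPU timings will be far more pessimistic.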

## Deliverables

You need to submit:


*   Your model predictions on the test inputs that we will provide 48 hours before the deadline.
  * saved as a text file in the format shown at the bottom of the notebook
*   Your best trained model.
  * as a link to the Huggingface Hub (read up on `push_to_hub` [here](push_to_hub)).
*   Working code that can be used to reproduce your best trained model.
  * In this Colab notebook.


## Prerequisites


### HuggingFace configuration

The steps below need to be completed by the team leader:

1. Create a team account on [HuggingFace](https://huggingface.co/) using the Gmail account provided by the IOAI organizers.

2. Go to the [IOAI HuggingFace repo](https://huggingface.co/InternationalOlympiadAI) and request access to all datasets.

3. In settings, create two Access Tokens, one with read rights, one with write rights, and store those in [Colab Secrets](https://www.youtube.com/watch?v=q87i2LZbbPc) as `hf_read` and `hf_write`, respectively.

In [1]:
from google.colab import userdata

read_access_token = userdata.get('hf_read')
write_access_token = userdata.get('hf_write')

### Dependencies

In [2]:
import importlib
import torch, transformers

if '2.3.0' not in torch.__version__:
  !pip install torch==2.3.0
if transformers.__version__!='4.41.2':
  !pip install transformers==4.41.2

if importlib.util.find_spec('datasets') is None:
  !pip install datasets==2.18.0
  !pip install evaluate==0.4.2
  !pip install accelerate -U


If you've just installed `accelerate`, execute `Runtime > Restart session and run all` in the Colab UI menu above.
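
The version checks in the cell above rely on substring and string comparisons. As a hedged alternative, the sketch below reads installed versions through the standard-library `importlib.metadata` and ignores local build tags such as `+cu121`. The helper names `version_matches` and `needs_install` are illustrative and do not appear in the original notebook.

```python
from importlib import metadata

def version_matches(installed, wanted):
    """Compare release numbers, ignoring a local build tag such as '+cu121'."""
    return installed.split("+", 1)[0] == wanted

def needs_install(package, wanted):
    """Return True if the package is missing or its version differs from wanted."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return True
    return not version_matches(installed, wanted)

print(version_matches("2.3.0+cu121", "2.3.0"))   # True
print(version_matches("2.3.01", "2.3.0"))        # False
print(needs_install("surely-not-an-installed-package", "1.0"))  # True
```

In the notebook, a result of `True` from `needs_install("torch", "2.3.0")` would be the cue to run the corresponding `!pip install` line.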

# Data

In [3]:
# load the data

from datasets import load_dataset, Dataset, DatasetDict

classification_dataset = load_dataset('InternationalOlympiadAI/NLP_problem', token=read_access_token)
raw_text = load_dataset('InternationalOlympiadAI/NLP_problem_raw', token=read_access_token)

# Baseline

In [4]:
# load the pre-trained tokenizer and use it to process the data

from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-multilingual-uncased")

brahmi_to_devanagari = {
    '𑀓': 'क', '𑀔': 'ख', '𑀕': 'ग', '𑀖': 'घ', '𑀗': 'ङ', '𑀘': 'च', '𑀙': 'छ',
    '𑀚': 'ज', '𑀛': 'झ', '𑀜': 'ञ', '𑀝': 'ट', '𑀞': 'ठ', '𑀟': 'ड', '𑀠': 'ढ',
    '𑀡': 'ण', '𑀢': 'त', '𑀣': 'थ', '𑀤': 'द', '𑀥': 'ध', '𑀦': 'न', '𑀧': 'प',
    '𑀨': 'फ', '𑀩': 'ब', '𑀪': 'भ', '𑀫': 'म', '𑀬': 'य', '𑀭': 'र', '𑀮': 'ल',
    '𑀯': 'व', '𑀰': 'श', '𑀱': 'ष', '𑀲': 'स', '𑀳': 'ह', '𑁦':'०', '𑁣': '90'
}

def transliterate_brahmi_to_devanagari(text):
    # Map each Brahmi character to its Devanagari counterpart;
    # characters without a mapping are kept unchanged.
    return ''.join(brahmi_to_devanagari.get(char, char) for char in text)

transliteration_dict = {
    'क': 'k', 'ख': 'kh', 'ग': 'ga', 'घ': 'gh', 'ङ': 'ng',
    'च': 'k', 'छ': 'ch', 'ज': 'j', 'झ': 'jh', 'ञ': 'ny',
    'ट': 't', 'ठ': 'th', 'ड': 'd', 'ढ': 'dh', 'ण': 'n',
    'त': 't', 'थ': 'th', 'द': 'd', 'ध': 'dh', 'न': 'n',
    'प': 'p', 'फ': 'f', 'ब': 'b', 'भ': 'bh', 'म': 'm',
    'य': 'y', 'र': 'r', 'ल': 'l', 'व': 'v', 'श': 'sh',
    'ष': 's', 'स': 's', 'ह': 'h', 'क़': 'k', 'ख़': 'kh',
    'ग़': 'g', 'ऩ': 'n', 'ड़': 'd', 'ढ': 'dh', 'ढ़': 'rh',
    'ऱ': 'r', 'य़': 'ye', 'ळ': 'l', 'ऴ': 'll', 'फ़': 'f',
    'ज़': 'z', 'ऋ': 'ri', 'ा': 'aa', 'ि': 'i', 'ी': 'i',
    'ु': 'u', 'ू': 'u', 'ॅ': 'e', 'ॆ': 'e', 'े': 'e',
    'ै': 'ai', 'ॉ': 'o', 'ॊ': 'o', 'ो': 'o', 'ौ': 'au',
    'अ': 'a', 'आ': 'aa', 'इ': 'i', 'ई': 'i', 'उ': 'u',
    'ऊ': 'oo', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'au', 'ओ': 'o',
    'औ': 'au', 'ँ': 'n', 'ं': 'n', 'ः': 'ah', '़': 'e',
    '्': '', '०': '0', '१': '1', '२': '2', '३': '3',
    '४': '4', '५': '5', '६': '6', '७': '7', '८': '8',
    '९': '9', '।': '.', 'ऍ': 'e', 'ृ': 'ri', 'ॄ': 'rr',
    'ॠ': 'r', 'ऌ': 'l', 'ॣ': 'l', 'ॢ': 'l', 'ॡ': 'l',
    'ॿ': 'b', 'ॾ': 'd', 'ॽ': '', 'ॼ': 'j', 'ॻ': 'g',
    'ॐ': 'om', 'ऽ': "'", 'e.a': 'a', '\n': '\n'
}

def transliterate_text(text):
    for key, value in transliteration_dict.items():
        text = text.replace(key, value)
    return text

def transliterate_brahmi_to_latin(text):
    # NOTE: despite its name, this only maps Brahmi to Devanagari (it is the
    # same mapping as transliterate_brahmi_to_devanagari). Apply
    # transliterate_text() to the result to obtain Latin script.
    return transliterate_brahmi_to_devanagari(text)

def preprocess_function(examples):
    # Overwrite the "text" column with the Devanagari form, then tokenize.
    examples["text"] = [transliterate_brahmi_to_latin(text) for text in examples["text"]]
    return tokenizer(examples["text"], truncation=True)

tokenized_data = classification_dataset.map(preprocess_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [5]:
# define the evaluation metric

import evaluate
import numpy as np

f1 = evaluate.load("f1")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1.compute(predictions=predictions, references=labels, average='macro')
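
To make the metric concrete, here is a minimal NumPy sketch of what a macro-averaged F1 computes: one F1 score per class, then their unweighted mean. It is an illustration only. The helper `macro_f1` and the toy logits are made up for this example, and the sketch averages over all `num_labels` classes, which can differ from the `evaluate` result when some class appears in neither the predictions nor the labels.

```python
import numpy as np

def macro_f1(predictions, labels, num_labels):
    """Unweighted mean of per-class F1; a class with no hits at all scores 0."""
    scores = []
    for c in range(num_labels):
        tp = np.sum((predictions == c) & (labels == c))  # true positives
        fp = np.sum((predictions == c) & (labels != c))  # false positives
        fn = np.sum((predictions != c) & (labels == c))  # false negatives
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy logits for 4 examples and 3 classes (made-up numbers).
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.8, 0.1],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 1, 1, 2])

predictions = np.argmax(logits, axis=1)  # same reduction as compute_metrics
score = macro_f1(predictions, labels, num_labels=3)
print(predictions.tolist(), round(score, 4))  # [0, 1, 0, 2] 0.7778
```

Because every class contributes equally, a classifier that ignores a rare class is penalized more heavily than it would be under plain accuracy.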

In [6]:
!pip install transformers[torch]
!pip install accelerate -U



In [7]:
# define the model and the training configuration

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-multilingual-uncased", num_labels=5
)

training_args = TrainingArguments(
    output_dir="baseline_bobai",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=5,
    metric_for_best_model='f1',
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_strategy="checkpoint",
    hub_token=write_access_token,
    hub_private_repo=True,
    hub_model_id='baseline_bobai'

)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["dev"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-multilingual-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# execute the model training
trainer.train()

| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 1.600234 | 0.156712 |
| 2 | No log | 1.568969 | 0.181892 |
| 3 | No log | 1.54895 | 0.188739 |
| 4 | No log | 1.521567 | 0.184506 |
| 5 | No log | 1.481484 | 0.231446 |
| 6 | No log | 1.47472 | 0.248667 |
| 7 | No log | 1.453247 | 0.321109 |
| 8 | No log | 1.444793 | 0.270465 |
| 9 | No log | 1.427314 | 0.286884 |
| 10 | No log | 1.400559 | 0.372912 |


TrainOutput(global_step=80, training_loss=1.4687346458435058, metrics={'train_runtime': 145.5979, 'train_samples_per_second': 29.945, 'train_steps_per_second': 0.549, 'total_flos': 179968698224400.0, 'train_loss': 1.4687346458435058, 'epoch': 20.0})
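
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below recovers the approximate training-set size from the reported throughput and then the expected step count, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook). With only 80 optimizer steps in total, the `No log` entries for the training loss are most likely explained by the `Trainer` logging interval, which defaults to every 500 steps.

```python
import math

# Values copied from the TrainOutput and TrainingArguments above.
train_runtime = 145.5979        # seconds
samples_per_second = 29.945
num_epochs = 20
batch_size = 64                 # per_device_train_batch_size

# Assumption: one device, no gradient accumulation.
train_size = round(train_runtime * samples_per_second / num_epochs)
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

print(train_size, steps_per_epoch, total_steps)  # 218 4 80
```

The result matches `global_step=80`, so the run did complete all 20 epochs.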

# Inference

In [9]:
# run the trained model on a dev/test split
data_split = "dev"
eval_out = trainer.predict(tokenized_data[data_split])
predictions = eval_out.predictions.argmax(1)
labels = eval_out.label_ids
dev_f1 = f1.compute(predictions=predictions, references=labels, average='macro')

In [10]:
dev_f1

{'f1': 0.3804444818284412}

In [11]:
# write the predictions to a file
with open('{}_predictions.txt'.format(data_split), 'w') as outfile:
  outfile.write('\n'.join([str(p) for p in predictions.tolist()]))
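
Before submitting, it can be worth reading the predictions file back to confirm that it holds one integer label per line. The following is a small round-trip sketch using a temporary directory. The helper names and the example labels are hypothetical, and the file layout is simply the one produced by the cell above.

```python
import os
import tempfile

def write_predictions(path, predictions):
    """Write one label per line, without a trailing newline (as in the cell above)."""
    with open(path, "w") as outfile:
        outfile.write("\n".join(str(p) for p in predictions))

def read_predictions(path):
    """Read the file back into a list of integer labels."""
    with open(path) as infile:
        return [int(line) for line in infile.read().splitlines() if line.strip()]

example_predictions = [3, 0, 4, 1, 1]  # made-up labels for illustration

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dev_predictions.txt")
    write_predictions(path, example_predictions)
    restored = read_predictions(path)

print(restored == example_predictions)  # True
```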

# Transliterate

In [12]:
# Extract only the text strings from the data

# Column access returns the whole "text" column as a list of strings.
train_data_texts = list(tokenized_data["train"]["text"])
train_data_len = len(train_data_texts)

print(train_data_texts[0])

dev_data_texts = list(tokenized_data["dev"]["text"])
dev_data_len = len(dev_data_texts)

print(dev_data_texts[0])

ढचबनतभ० थच खचभचड० ढच दच हन ढनबच षचहचड ढचड न ढच
ढचबनतभ० थच खचभचड० ढच दच हन ढनबच षचहचड ढचड न ढच


In [13]:
brahmi_to_devanagari = {
    '𑀓': 'क', '𑀔': 'ख', '𑀕': 'ग', '𑀖': 'घ', '𑀗': 'ङ', '𑀘': 'च', '𑀙': 'छ',
    '𑀚': 'ज', '𑀛': 'झ', '𑀜': 'ञ', '𑀝': 'ट', '𑀞': 'ठ', '𑀟': 'ड', '𑀠': 'ढ',
    '𑀡': 'ण', '𑀢': 'त', '𑀣': 'थ', '𑀤': 'द', '𑀥': 'ध', '𑀦': 'न', '𑀧': 'प',
    '𑀨': 'फ', '𑀩': 'ब', '𑀪': 'भ', '𑀫': 'म', '𑀬': 'य', '𑀭': 'र', '𑀮': 'ल',
    '𑀯': 'व', '𑀰': 'श', '𑀱': 'ष', '𑀲': 'स', '𑀳': 'ह', '𑁦':'०', '𑁣': '90'
}

def transliterate_brahmi_to_latin(text):
    # NOTE: despite its name, this returns Devanagari; transliterate_text()
    # below performs the Devanagari-to-Latin step.
    return ''.join(brahmi_to_devanagari.get(char, char) for char in text)

transliteration_dict = {
    'क': 'k', 'ख': 'kh', 'ग': 'ga', 'घ': 'gh', 'ङ': 'ng',
    'च': 'k', 'छ': 'ch', 'ज': 'j', 'झ': 'jh', 'ञ': 'ny',
    'ट': 't', 'ठ': 'th', 'ड': 'd', 'ढ': 'dh', 'ण': 'n',
    'त': 't', 'थ': 'th', 'द': 'd', 'ध': 'dh', 'न': 'n',
    'प': 'p', 'फ': 'f', 'ब': 'b', 'भ': 'bh', 'म': 'm',
    'य': 'y', 'र': 'r', 'ल': 'l', 'व': 'v', 'श': 'sh',
    'ष': 's', 'स': 's', 'ह': 'h', 'क़': 'k', 'ख़': 'kh',
    'ग़': 'g', 'ऩ': 'n', 'ड़': 'd', 'ढ': 'dh', 'ढ़': 'rh',
    'ऱ': 'r', 'य़': 'ye', 'ळ': 'l', 'ऴ': 'll', 'फ़': 'f',
    'ज़': 'z', 'ऋ': 'ri', 'ा': 'aa', 'ि': 'i', 'ी': 'i',
    'ु': 'u', 'ू': 'u', 'ॅ': 'e', 'ॆ': 'e', 'े': 'e',
    'ै': 'ai', 'ॉ': 'o', 'ॊ': 'o', 'ो': 'o', 'ौ': 'au',
    'अ': 'a', 'आ': 'aa', 'इ': 'i', 'ई': 'i', 'उ': 'u',
    'ऊ': 'oo', 'ए': 'e', 'ऐ': 'ai', 'ऑ': 'au', 'ओ': 'o',
    'औ': 'au', 'ँ': 'n', 'ं': 'n', 'ः': 'ah', '़': 'e',
    '्': '', '०': '0', '१': '1', '२': '2', '३': '3',
    '४': '4', '५': '5', '६': '6', '७': '7', '८': '8',
    '९': '9', '।': '.', 'ऍ': 'e', 'ृ': 'ri', 'ॄ': 'rr',
    'ॠ': 'r', 'ऌ': 'l', 'ॣ': 'l', 'ॢ': 'l', 'ॡ': 'l',
    'ॿ': 'b', 'ॾ': 'd', 'ॽ': '', 'ॼ': 'j', 'ॻ': 'g',
    'ॐ': 'om', 'ऽ': "'", 'e.a': 'a', '\n': '\n'
}

def transliterate_text(text):
    for key, value in transliteration_dict.items():
        text = text.replace(key, value)
    return text


In [14]:
transliterate_train_data = []
# print(len(train_data_texts))
for brahmi in train_data_texts:
    devanagari = transliterate_brahmi_to_latin(brahmi)
    latin = transliterate_text(devanagari)
    transliterate_train_data.append(latin)

print(transliterate_train_data[2])

klsdkhhbh pk hknthtnnk dk dkskbhttd bh90dklth90n dhkn0bhd dhndttm dk h90d bhtt0
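
The two-step pipeline used here (Brahmi to Devanagari, then Devanagari to Latin) can also be written with `str.translate`. The sketch below uses a three-character subset of the notebook's mapping tables. The names `brahmi_subset`, `latin_subset` and `brahmi_to_latin` are illustrative only. Note that `str.maketrans` accepts only single-character keys, so multi-character keys in the full `transliteration_dict` (for example `'e.a'` or letters written with a combining nukta) would still need the `str.replace` loop.

```python
# Small illustrative subset of the notebook's two mapping tables.
brahmi_subset = {'𑀓': 'क', '𑀔': 'ख', '𑀦': 'न'}
latin_subset = {'क': 'k', 'ख': 'kh', 'न': 'n'}

# Build translation tables once; keys must be single characters.
to_devanagari = str.maketrans(brahmi_subset)
to_latin = str.maketrans(latin_subset)

def brahmi_to_latin(text):
    """Apply both mapping steps; unmapped characters pass through unchanged."""
    return text.translate(to_devanagari).translate(to_latin)

sample = '𑀓𑀔 𑀦'
print(brahmi_to_latin(sample))  # kkh n
```

Each `translate` call makes a single pass over the string, whereas the `replace` loop rescans the text once per dictionary entry, which may matter when processing the larger raw-text corpus.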
