<h1>Fine Tuning Of Pretrained MalayalamBERT Model</h1>

## 1. Installing Required Dependencies

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow
# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'
# transformers version at notebook update --- 2.9.1
# tokenizers version at notebook update --- 0.7.0

In [None]:
!pip install sh
!pip install numba
# -y answers conda's confirmation prompt, which a notebook cell cannot do interactively
!conda install -y pytorch torchvision cudatoolkit=10.2 -c pytorch

## 2. Find a dataset

First, let us find a corpus of text in Malayalam. Here we’ll use the Malayalam portion of the [OSCAR corpus](https://traces1.inria.fr/oscar/) from INRIA.
OSCAR is a huge multilingual corpus obtained by language classification and filtering of [Common Crawl](https://commoncrawl.org/) dumps of the Web.

<img src="https://huggingface.co/blog/assets/01_how-to-train/oscar.png" style="margin: auto; display: block; width: 260px;">


The final training corpus has a size of 5 GB, which is still small: the more data you can pretrain on, the better your model's results will be.



In [None]:
from sh import gunzip
!wget -c https://traces1.inria.fr/oscar/files/compressed-orig/ml.txt.gz 
gunzip('ml.txt.gz')
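
If the `sh` package is not available, the same decompression step can be sketched with the standard library alone. This is only an illustration: `gunzip_file` is a hypothetical helper name, and unlike the `gunzip` command it leaves the `.gz` archive in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src: str, dst: str, chunk_size: int = 1024 * 1024) -> None:
    """Stream-decompress `src` (.gz) into `dst` without loading it all into memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)


# Only decompress if the archive downloaded above is actually present
if Path("ml.txt.gz").exists():
    gunzip_file("ml.txt.gz", "ml.txt")
```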

In [None]:
!wget -c https://raw.githubusercontent.com/eliasedwin7/BertBible/master/bible.txt

## 3. Setting Up Weights & Biases for Monitoring

In [None]:
!pip install wandb 

In [None]:
!wandb login 30b67a19185f32fd2eedde1e86e341eb34c7f381

In [None]:
# Init wandb

In [None]:
import wandb
wandb.init(project="malayalamberto")

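In [None]:
# `!VAR=value` only sets the variable in a throwaway subshell;
# `%env` sets it for the notebook process itself
%env WANDB_API_KEY=30b67a19185f32fd2eedde1e86e341eb34c7f381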

## 4. Train a tokenizer

We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000.

We recommend training a byte-level BPE (rather than, say, a WordPiece tokenizer like BERT's) because it starts building its vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more `<unk>` tokens!).
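
To make the idea concrete, here is a toy sketch of byte-level BPE training in plain Python. It is not the `tokenizers` implementation (which also pre-tokenizes on whitespace and honours `min_frequency`); `train_toy_bpe` is a hypothetical name, and the sample text is chosen only for illustration. It starts from single UTF-8 bytes and repeatedly merges the most frequent adjacent pair.

```python
from collections import Counter


def train_toy_bpe(text: str, num_merges: int):
    """Greedy byte-level BPE on a single string; returns (tokens, merges)."""
    # Base alphabet: every character is first broken into its UTF-8 bytes
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # nothing worth merging any more
            break
        merges.append((left, right))
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq, merges


sample = "കേരളം കേരളം കേരളം"
tokens, merges = train_toy_bpe(sample, num_merges=20)
print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
```

Because the base alphabet covers all 256 byte values, any input can be encoded, which is why no `<unk>` fallback is needed.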


In [None]:
%%time 
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

Now let's save files to disk

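In [None]:
# -p avoids an error if the directory already exists
!mkdir -p MalayalamBERTo
# tokenizers 0.7.0 API; from tokenizers 0.8 onwards use tokenizer.save_model("MalayalamBERTo")
tokenizer.save("MalayalamBERTo")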

We now have both a `vocab.json`, which is a list of the most frequent tokens ranked by frequency, and a `merges.txt` list of merges.

What is great is that our tokenizer is optimized for Malayalam. Compared to a generic tokenizer trained for English, more native words are represented by a single, unsplit token. We also represent sequences in a more efficient manner. 

Here’s how you can use it in tokenizers, including handling the RoBERTa special tokens – of course, you’ll also be able to use it directly from transformers.

In [1]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./MalayalamBERTo/vocab.json",
    "./MalayalamBERTo/merges.txt",
)

In [2]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [3]:
tokenizer.encode("ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനമാണ് കേരളം.")

Encoding(num_tokens=47, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [4]:
tokenizer.encode("ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനമാണ് കേരളം.").tokens

['<s>',
 'à´ĩà´¨',
 'àµį',
 'à´¤',
 'àµį',
 'à´¯à´¯',
 'àµģ',
 'à´Ł',
 'àµĨ',
 'Ġà´¤',
 'àµĨ',
 'à´ķ',
 'àµį',
 'à´ķ',
 'àµģ',
 'à´ªà´Ł',
 'à´¿',
 'à´ŀ',
 'àµį',
 'à´ŀ',
 'à´¾',
 'à´±',
 'àµĨ',
 'Ġà´ħà´±',
 'àµį',
 'à´±à´¤',
 'àµį',
 'à´¤',
 'àµģ',
 'à´³',
 'àµį',
 'à´³',
 'Ġà´¸',
 'à´Ĥ',
 'à´¸',
 'àµį',
 'à´¥',
 'à´¾',
 'à´¨à´®',
 'à´¾',
 'à´£',
 'àµį',
 'Ġà´ķ',
 'àµĩ',
 'à´°à´³',
 'à´Ĥ.',
 '</s>']
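
The tokens above look garbled because a byte-level BPE prints each UTF-8 byte as a stand-in printable character. The sketch below rebuilds a byte-to-character table in the style used by GPT-2 and maps token strings back to Malayalam. It assumes this tokenizer uses the same table; the helper names are hypothetical.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in keep}
    # Every other byte (space, control characters, ...) is shifted above U+00FF
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def encode_text(text):
    """Show how a string looks in the byte-level alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def decode_tokens(tokens):
    """Turn byte-level token strings back into readable text."""
    raw = bytes(CHAR_TO_BYTE[ch] for tok in tokens for ch in tok)
    return raw.decode("utf-8", errors="replace")


# The last four content tokens of the output above
print(decode_tokens(["Ġà´ķ", "àµĩ", "à´°à´³", "à´Ĥ."]))
```

Under that assumption, the last four content tokens decode to the end of the example sentence, ` കേരളം.`, with `Ġ` standing for the leading space.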

## 5. Train a language model from scratch

As the model is BERT-like, we’ll train it on the task of masked language modeling, i.e. predicting how to fill arbitrary tokens that we randomly mask in the dataset. The masking itself is taken care of by the data collator we configure later in this section.

In [None]:
# Check that we have a GPU
!nvidia-smi

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

In [None]:
# Empty the GPU cache
import torch
torch.cuda.empty_cache()

In [None]:
# Release the GPU context via numba
from numba import cuda
cuda.select_device(0)
cuda.close()

### We'll define the following config for the model

In [None]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Now let's re-create our tokenizer in transformers

In [None]:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./MalayalamBERTo",model_max_length=512)

Finally let's initialize our model.

**Important:**

As we are training from scratch, we only initialize from a config, not from an existing pretrained model or checkpoint.

In [None]:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

In [None]:
model.num_parameters()
# => 84 million parameters
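
As a sanity check on that figure, the parameter count can be estimated by hand. The sketch below assumes the defaults that the config above does not set (`hidden_size=768`, `intermediate_size=3072`), a standard BERT-style layer layout, output weights tied to the input embeddings, and an encoder that includes a pooler layer; newer library versions may omit the pooler. The function name is hypothetical.

```python
def estimate_roberta_mlm_params(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_hidden_layers=6,
    type_vocab_size=1,
    hidden_size=768,         # assumed default, not set in the config above
    intermediate_size=3072,  # assumed default, not set in the config above
    with_pooler=True,        # assumption: the encoder carries a pooler layer
):
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weights + bias

    layer_norm = 2 * hidden_size  # scale + shift
    embeddings = (
        (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
        + layer_norm
    )
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm  # Q, K, V, output
    feed_forward = (
        linear(hidden_size, intermediate_size)
        + linear(intermediate_size, hidden_size)
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)
    pooler = linear(hidden_size, hidden_size) if with_pooler else 0
    # LM head: dense + layer norm + output bias (decoder weights are tied)
    lm_head = linear(hidden_size, hidden_size) + layer_norm + vocab_size
    return embeddings + encoder + pooler + lm_head


print(f"{estimate_roberta_mlm_params():,}")
```

Note that `num_attention_heads` does not appear: splitting the hidden size across heads does not change the number of weights. Almost half of the total sits in the 52,000-entry embedding matrix.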

### Now let's build our training Dataset

We'll build our dataset by applying our tokenizer to our text file.

Here, as we only have one text file, we don't even need to customize our `Dataset`. We'll just use the `LineByLineTextDataset` out-of-the-box.

In [None]:
%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./ml.txt",
    block_size=128,
)
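
Conceptually, a line-by-line dataset treats every non-empty line of the file as one training example and truncates it to `block_size` tokens. The sketch below illustrates that idea only. `toy_encode` is a hypothetical stand-in for the real tokenizer (one id per UTF-8 byte), and `load_line_by_line` is not the library's implementation.

```python
import os
import tempfile


def toy_encode(line, block_size):
    # Hypothetical stand-in for the real tokenizer: one id per UTF-8 byte
    return list(line.encode("utf-8"))[:block_size]


def load_line_by_line(file_path, block_size=128):
    """Return one truncated id list per non-blank line of the file."""
    examples = []
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line and not line.isspace():
                examples.append(toy_encode(line, block_size))
    return examples


# Tiny demonstration on a temporary file instead of the real corpus
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("കേരളം\n\n   \nsecond line\n")
    demo = load_line_by_line(path, block_size=8)
    print(len(demo), [len(x) for x in demo])
```

One consequence of this design is that long lines lose everything past `block_size` tokens, and the whole file is tokenized up front, which can take a while on a multi-gigabyte corpus.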

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
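
What the collator does with `mlm_probability=0.15` can be sketched in plain Python. The version below follows the usual BERT recipe: about 15% of tokens are selected, and of those 80% become `<mask>`, 10% become a random token, and 10% stay unchanged. It is an illustration, not the library code. The function name, the `-100` ignore label, and the special-token ids used in the example are assumptions.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # Special tokens are never masked; most other tokens are left alone too
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # assumed "ignore this position" label
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels


# Assumed ids for illustration: <s>=0, </s>=2, <mask>=4
ids = [0] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 2}, seed=0)
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

The loss is computed only at the selected positions, so the model cannot simply copy its input.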

Finally, we are all set to initialize our Trainer

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./MalayalamBERTo",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)
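
To get a feel for how long training will run and which checkpoints survive, the arguments above can be turned into simple arithmetic. This is a rough sketch that assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; the example corpus size of one million lines is purely hypothetical.

```python
import math


def training_steps(num_examples, batch_size=64, num_epochs=3, num_gpus=1):
    """Total optimizer steps, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_gpus))
    return steps_per_epoch * num_epochs


def checkpoints_kept(total_steps, save_steps=10_000, save_total_limit=2):
    """Step numbers of the checkpoints still on disk at the end of training."""
    saved = list(range(save_steps, total_steps + 1, save_steps))
    return saved[-save_total_limit:]


total = training_steps(1_000_000)  # hypothetical corpus of one million lines
print(total, checkpoints_kept(total))
```

With `save_total_limit=2`, older checkpoints are deleted as new ones are written, so only the two most recent remain.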

### Start training

In [None]:
%%time
trainer.train()

#### 🎉 Save final model (+ tokenizer + config) to disk

In [None]:
trainer.save_model("./MalayalamBERTo")

## 6. Check that the LM actually trained

Aside from looking at the training and eval losses going down, the easiest way to check whether our language model is learning anything interesting is via the `FillMaskPipeline`.

Pipelines are simple wrappers around tokenizers and models, and the 'fill-mask' one will let you input a sequence containing a masked token (here, `<mask>`) and return a list of the most probable filled sequences, with their probabilities.
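
The ranking step behind those probabilities is a softmax over the model's scores at the masked position, followed by a top-k selection. The sketch below shows just that step on made-up numbers. `top_k_fill` is a hypothetical helper, and the real pipeline additionally runs the tokenizer and model and rebuilds the filled sequence.

```python
import math


def top_k_fill(logits, k=5):
    """Softmax over vocabulary logits, then the k most probable token ids."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"token": i, "score": probs[i]} for i in ranked]


# Made-up logits for a five-token vocabulary
for candidate in top_k_fill([0.5, 3.0, 1.0, 2.0, -1.0], k=3):
    print(candidate)
```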



In [6]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./MalayalamBERTo",
    tokenizer="./MalayalamBERTo"
)

In [9]:
fill_mask("വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപ<mask>രിച്ചു.")

[{'sequence': '<s> വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപിരിച്ചു.</s>',
  'score': 0.9935956597328186,
  'token': 265},
 {'sequence': '<s> വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപൊരിച്ചു.</s>',
  'score': 0.004073238465934992,
  'token': 316},
 {'sequence': '<s> വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപെരിച്ചു.</s>',
  'score': 0.0008503152057528496,
  'token': 276},
 {'sequence': '<s> വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപോരിച്ചു.</s>',
  'score': 0.0005664682248607278,
  'token': 288},
 {'sequence': '<s> വെളിച്ചം നല്ലതു എന്നു ദൈവം കണ്ടു ദൈവം വെളിച്ചവും ഇരുളും തമ്മിൽ വേർപാരിച്ചു.</s>',
  'score': 0.00036152597749605775,
  'token': 270}]

In [10]:
fill_mask("ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാ<mask>മാണ് കേരളം.")

[{'sequence': '<s> ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനമമാണ് കേരളം.</s>',
  'score': 0.3131335377693176,
  'token': 445},
 {'sequence': '<s> ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനങമാണ് കേരളം.</s>',
  'score': 0.2956354022026062,
  'token': 408},
 {'sequence': '<s> ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനമാണ് കേരളം.</s>',
  'score': 0.13453751802444458,
  'token': 266},
 {'sequence': '<s> ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനവമാണ് കേരളം.</s>',
  'score': 0.08703269064426422,
  'token': 556},
 {'sequence': '<s> ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനതമാണ് കേരളം.</s>',
  'score': 0.03202608972787857,
  'token': 320}]

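Ok, simple syntax/grammar mostly works: the first prompt is completed correctly with a score above 0.99, while for the second prompt the correct completion only ranks third, with a score of about 0.13.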


## 7. Exporting the model out of SageMaker

In [None]:
# Install zip before using it
!apt-get install -y zip unzip

In [None]:
!zip -r MalayalamBERT_Final.zip MalayalamBERTo/

In [None]:
ls -l --block-size=M

In [None]:
! zip MalayalamBERT_Final.zip --out BERT_Final.zip -s 100m

In [None]:
! zip MalayalamBERT3000.zip --out BERT_330000.zip -s 100m

In [None]:
!tar -czvf BERT_33.tar.gz checkpoint-330000/

In [None]:
!tar -czvf BERT_F.tar.gz MalayalamBERTo/

In [None]:
!split -b 100m "BERT_33.tar.gz" "BERT33.tar.gz.part-"

In [None]:
!split -b 100m "BERT_F.tar.gz" "BERTF.tar.gz.part-"

In [None]:
# To reassemble the archives on the target machine, concatenate the parts in order:
#   cat BERT33.tar.gz.part-* > BERT_33.tar.gz
#   cat BERTF.tar.gz.part-* > BERT_F.tar.gz
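
When moving large archives in pieces, it is worth checking that the reassembled file is byte-identical to the original. The sketch below mirrors the `split` / `cat` steps in Python and compares SHA-256 digests. The helper names and the numeric part suffixes are hypothetical (the `split` command uses `aa`, `ab`, ... instead).

```python
import hashlib
import os
import tempfile
from pathlib import Path


def split_file(path, part_size, prefix):
    """Write consecutive chunks of `path` to files named prefix0000, prefix0001, ..."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(part_size)
            if not chunk:
                break
            part = Path(f"{prefix}{index:04d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, dst):
    """Concatenate the parts in name order, like `cat prefix* > dst`."""
    with open(dst, "wb") as out:
        for part in sorted(parts):
            out.write(Path(part).read_bytes())


def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


# Small demonstration with a temporary file standing in for the archive
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "archive.bin")
    Path(original).write_bytes(os.urandom(2500))
    parts = split_file(original, part_size=1000, prefix=os.path.join(tmp, "archive.part-"))
    restored = os.path.join(tmp, "restored.bin")
    join_files(parts, restored)
    num_parts = len(parts)
    digests_match = sha256(original) == sha256(restored)
    print(num_parts, digests_match)
```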

## 8. Sharing our model 🎉

Finally, since we had a nice model, we decided to share it with the community:

- We uploaded our model to the Hugging Face model hub using the CLI: `transformers-cli upload`

Our model has a page on http://huggingface.co/models and everyone can load it using `AutoModel.from_pretrained("eliasedwin7/MalayalamBERTo")`.

[![tb](./capture.png)](https://huggingface.co/eliasedwin7/MalayalamBERTo)

