<a href="https://colab.research.google.com/github/xprilion/muril-indicvarna-build-with-ai-sample/blob/main/MurilBertIndicVarna1k_SentimentModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Muril - IndicVarna (tiny) Sentiment Classification model

In this notebook, we show you how to get started with training a multilingual sentiment model using the Google Muril tokenizer and Dynopii's IndicVarna dataset.

For the sake of presenting this at a conference, we're working with the `tiny` subset of the dataset.

This notebook can run on any free GPU runtime in Google Colab.

## Install necessary libraries and load them

In [1]:
%%capture
!pip install datasets
!pip install transformers[torch]

In [2]:
import torch

In [3]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, load_metric, Dataset

In [4]:
import pandas as pd
import numpy as np

## Load the dataset.

In [5]:
dataset = load_dataset("dynopii/IndicVarna-1k-tiny") # For full, use "dynopii/IndicVarna-100k"


## [Optional] Observe the dataset

In [6]:
df = dataset["train"].to_pandas()

In [7]:
df.head()

| | text | label | uuid |
|---|---|---|---|
| 0 | i go online i always feel pissed | 0 | 69a98787-ea52-4eb6-af19-293ea5baf87b-en |
| 1 | मैं ऑनलाइन जाता हूं और हमेशा परेशान रहता हूं | 0 | 69a98787-ea52-4eb6-af19-293ea5baf87b-hi |
| 2 | আমি অনলাইনে যাই আমি সবসময় বিরক্ত বোধ করি | 0 | 69a98787-ea52-4eb6-af19-293ea5baf87b-bn |
| 3 | நான் ஆன்லைனில் செல்கிறேன், நான் எப்போதும் கோபம... | 0 | 69a98787-ea52-4eb6-af19-293ea5baf87b-ta |
| 4 | मी ऑनलाइन जातो मला नेहमी राग येतो | 0 | 69a98787-ea52-4eb6-af19-293ea5baf87b-mr |


## Create label map

And reverse 🔃 it!

In [8]:
label_map = {'positive': 2, 'negative': 0, 'neutral': 1}

In [9]:
reverse_label_map = {v: k for k, v in label_map.items()}
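
As a quick sanity check, the short sketch below rebuilds both dictionaries and prints the id-to-name lookup that the prediction step relies on later. It only restates the two cells above; the loop and the final assertion are additions for illustration.

```python
# Same mapping as in the notebook cells above.
label_map = {'positive': 2, 'negative': 0, 'neutral': 1}
reverse_label_map = {v: k for k, v in label_map.items()}

# Class ids (as the model emits them) map back to label names.
for class_id in sorted(reverse_label_map):
    print(class_id, "->", reverse_label_map[class_id])
# 0 -> negative
# 1 -> neutral
# 2 -> positive

# Inverting a dict is only safe when every value (class id) is unique.
assert len(reverse_label_map) == len(label_map)
```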

## Load the Tokenizer

More about muril at: https://huggingface.co/google/muril-base-cased

In [10]:
tokenizer = BertTokenizer.from_pretrained("google/muril-base-cased")


In [11]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
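
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real BERT-style tokenizer also keeps its special tokens in place when truncating. The pad id of 0 matches the `padding_idx=0` shown in the model summary later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop overflow tokens
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # 0 when the input was long enough
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids = [104, 2001, 2002, 105]        # hypothetical token ids
long_ids = list(range(1, 11))             # hypothetical, 10 ids

print(pad_or_truncate(short_ids, max_length=6))
# ([104, 2001, 2002, 105, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate(long_ids, max_length=6))
# ([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1])
```

Padding every example to the same length keeps batching simple, at the cost of extra computation on pad positions for short texts.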

## Tokenize the dataset

In [12]:
tokenized_dataset = dataset["train"].map(tokenize_function, batched=True)


In [13]:
tokenized_dataset

Dataset({
    features: ['text', 'label', 'uuid', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1020
})

## Prepare dataset for the classification model pipeline

In [14]:
tokenized_dataset = tokenized_dataset.remove_columns(["uuid"])
tokenized_dataset.set_format("torch")

In [15]:
train_test_split = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]

## Load Muril classification pre-trained model

In [16]:
model = BertForSequenceClassification.from_pretrained("google/muril-base-cased", num_labels=3)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google/muril-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
!rm -rf /content/results

## Set training config

In [35]:
training_args = TrainingArguments(
    output_dir="./results",
    logging_steps=10,
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    eval_steps=20,
    save_steps=20,
    learning_rate=2e-5,
    num_train_epochs=10,
    weight_decay=0.01,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)



In [36]:
metric = load_metric("accuracy")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [37]:
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
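
For intuition, the sketch below reproduces what `compute_metrics` boils down to: take the highest-scoring class per row, then count how often it matches the label. It is a pure-Python stand-in rather than the metric library's code, and the logits and labels are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, label_ids))
    return correct / len(label_ids)


# Hypothetical logits for 4 examples over 3 classes.
logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 0.1, 1.5],    # predicted class 2
    [-0.5, 0.9, 0.4],   # predicted class 1
    [1.2, 0.8, 1.1],    # predicted class 0
]
label_ids = [0, 2, 1, 2]

print(accuracy(logits, label_ids))  # 0.75 (3 of 4 correct)
```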

In [38]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

## Start training!

In [39]:
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 20 | 0.6824 | 0.701155 | 0.921569 |
| 40 | 0.6477 | 0.652602 | 0.946078 |
| 60 | 0.6101 | 0.610726 | 0.955882 |
| 80 | 0.5736 | 0.604098 | 0.941176 |
| 100 | 0.5464 | 0.573735 | 0.95098 |
| 120 | 0.5208 | 0.552058 | 0.955882 |
| 140 | 0.5097 | 0.541849 | 0.95098 |
| 160 | 0.5022 | 0.552211 | 0.95098 |


TrainOutput(global_step=170, training_loss=0.5756817845737233, metrics={'train_runtime': 217.4674, 'train_samples_per_second': 37.523, 'train_steps_per_second': 0.782, 'total_flos': 2147005488660480.0, 'train_loss': 0.5756817845737233, 'epoch': 10.0})
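
The `global_step=170` reported above can be reproduced with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation (neither is set in the training config), and that the 80/20 split of the 1,020 rows leaves 816 training examples.

```python
import math

total_examples = 1020      # rows in the tiny dataset
test_fraction = 0.2        # test_size passed to train_test_split
batch_size = 48            # per_device_train_batch_size
epochs = 10                # num_train_epochs
eval_every = 20            # eval_steps

test_examples = round(total_examples * test_fraction)
train_examples = total_examples - test_examples

# One optimizer step per batch; a partial last batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which an evaluation is logged during training.
eval_steps_logged = list(range(eval_every, total_steps + 1, eval_every))

print(train_examples, test_examples)   # 816 204
print(steps_per_epoch, total_steps)    # 17 170
print(eval_steps_logged)               # [20, 40, 60, 80, 100, 120, 140, 160]
```

The eight logged steps line up with the eight rows of the training table.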

## Evaluate

Hehe. Yeah, just 204 samples in the test dataset 😅

In [40]:
eval_results = trainer.evaluate()
eval_results

{'eval_loss': 0.5418485403060913,
 'eval_accuracy': 0.9509803921568627,
 'eval_runtime': 1.1833,
 'eval_samples_per_second': 172.399,
 'eval_steps_per_second': 4.225,
 'epoch': 10.0}
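
The number of evaluated samples is not printed directly, but it can be recovered from the metrics above. This is a rough sketch that simply multiplies the reported throughput by the reported runtime and rounds to the nearest integer.

```python
# Values copied from the eval_results output above.
eval_accuracy = 0.9509803921568627
eval_runtime = 1.1833                 # seconds
eval_samples_per_second = 172.399

# samples = throughput * time, rounded because both inputs are rounded.
eval_samples = round(eval_runtime * eval_samples_per_second)

# Number of correct predictions implied by the accuracy.
correct = round(eval_accuracy * eval_samples)

print(eval_samples, correct)  # 204 194
```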

## [Optional] Store model to Hugging Face

However tiny the model is, training it took some compute time, so we'll store it!

You'll need a Hugging Face token for this. If you don't have one set up, feel free to skip to the next section!

In [24]:
from huggingface_hub import HfApi

In [26]:
api = HfApi()

In [27]:
model.push_to_hub("dynopii/muril-indicvarna-tiny-sentiment")


CommitInfo(commit_url='https://huggingface.co/dynopii/muril-indicvarna-tiny-sentiment/commit/c4ef211e02d24c2efd37779bf030aa16e4a5e8b8', commit_message='Upload BertForSequenceClassification', commit_description='', oid='c4ef211e02d24c2efd37779bf030aa16e4a5e8b8', pr_url=None, pr_revision=None, pr_num=None)

In [28]:
tokenizer.push_to_hub("dynopii/muril-indicvarna-tiny-sentiment")


CommitInfo(commit_url='https://huggingface.co/dynopii/muril-indicvarna-tiny-sentiment/commit/162359adb5a7d880ef12640a0faa303ec34642ff', commit_message='Upload tokenizer', commit_description='', oid='162359adb5a7d880ef12640a0faa303ec34642ff', pr_url=None, pr_revision=None, pr_num=None)

## Environmental Impact

We love this little green blob we live on, so before anything else, let's check our environmental impact!

Experiments were conducted using Google Cloud Platform, which has a carbon efficiency of 0.5 kgCO$_2$eq/kWh. A cumulative 15 minutes of computation was performed on hardware of type A100 PCIe 40GB (TDP of 250 W).

Total emissions are estimated to be 0.04 kgCO$_2$eq of which 100 percent were directly offset by the cloud provider.
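
For a back-of-the-envelope check, the figures quoted above can be multiplied out directly. The sketch assumes the GPU draws its full TDP for the whole 15 minutes and ignores everything else (CPU, cooling, data-centre overhead), so it is only an order-of-magnitude estimate and need not match the reported value exactly.

```python
power_kw = 250 / 1000        # A100 PCIe 40GB TDP, from the text
hours = 15 / 60              # 15 minutes of computation
carbon_intensity = 0.5       # kgCO2eq per kWh, from the text

energy_kwh = power_kw * hours
emissions_kg = energy_kwh * carbon_intensity

print(f"{energy_kwh:.4f} kWh, {emissions_kg:.3f} kgCO2eq")
# 0.0625 kWh, 0.031 kgCO2eq
```

This lands at roughly 0.03 kgCO$_2$eq, the same order of magnitude as the 0.04 kgCO$_2$eq estimate quoted above.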

## Observe the model we trained

In [29]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# Move the model to the device
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(197285, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1

## Prediction time! Create a predict function

In [30]:
def predict(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Move the inputs to the same device as the model
    inputs = {key: value.to(device) for key, value in inputs.items()}

    # Perform inference
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the logits and predicted class
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=1).item()
    # return predicted_class_id
    predicted_label = reverse_label_map[predicted_class_id]
    return predicted_label
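
`predict` returns only the winning label. If a confidence score is also wanted, one option is to pass the logits through a softmax first. The sketch below shows that step in pure Python with hypothetical logits; it is an optional extension, not part of the function above.

```python
import math

reverse_label_map = {0: 'negative', 1: 'neutral', 2: 'positive'}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                          # subtract max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2, -0.4, 1.3]                       # hypothetical model output
probs = softmax(logits)

# Same decision rule as torch.argmax over the logits.
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)

print(reverse_label_map[predicted_class_id])    # positive
print(round(sum(probs), 6))                     # 1.0
```

Softmax preserves the ordering of the logits, so the predicted label is the same with or without it.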

## Predictions below!

In [31]:
predict("नहीं जी")

'negative'

In [32]:
predict("हां जी हां जी बोलिये")

'positive'

In [33]:
predict("আমি জানতে চাই না")

'negative'

In [34]:
predict("আমি জানতে চাই")

'neutral'