# Model Controller Tutorial: EnviBert model

> This notebook contains some examples of how to use the EnviBert-based models in this NLP library

- skip_showdoc: true
- skip_exec: true

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

# Load data

We will reuse the sample data in [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main.html) to experiment with the models

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
import os
from transformers import DataCollatorWithPadding
import torch

Define some necessary text augmentations and text transformations

> For Text Transformation

In [None]:
txt_tfms=[text_normalize]

> For Text Augmentation

In [None]:
over_nonown_tfm = partial(sampling_with_condition,query='Source=="non owned"',frac=0.5,seed=42,apply_to_all=False)
over_nonown_tfm.__name__ = 'Oversampling Non Owned'

over_own_tfm = partial(sampling_with_condition,query='Source=="owned"',frac=2,seed=42,apply_to_all=False)
over_own_tfm.__name__ = 'Oversampling Owned'

over_hc_tfm = partial(sampling_with_condition,query='Source=="hc search"',frac=2.5,seed=42,apply_to_all=False)
over_hc_tfm.__name__ = 'Oversampling HC search'

remove_accent_tfm = partial(remove_vnmese_accent,frac=1,seed=42,apply_to_all=True)
remove_accent_tfm.__name__ = 'Add No-Accent Text'

aug_tfms = [over_nonown_tfm,over_own_tfm,over_hc_tfm,remove_accent_tfm]
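
Each `partial` above is given a `__name__` by hand because `functools.partial` objects do not carry one, and the processing log shown later prints a name for every augmentation step (for example `----- Oversampling Non Owned -----`), presumably taken from that attribute. Here is a minimal sketch of the pattern, using a hypothetical stand-in function `repeat_rows` that is not part of the library:

```python
from functools import partial

def repeat_rows(rows, times=2):
    """Hypothetical augmentation: repeat every row `times` times."""
    return [row for row in rows for _ in range(times)]

double_tfm = partial(repeat_rows, times=2)

# A partial object has no __name__ until we assign one.
had_name_before = hasattr(double_tfm, '__name__')
double_tfm.__name__ = 'Double Rows'

augmented = double_tfm(['a', 'b'])
print(had_name_before, double_tfm.__name__, augmented)
```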

Create a TextDataMain object

In [None]:
DATA_PATH = Path('sample_data')

In [None]:
tdm = TextDataMain.from_csv(DATA_PATH/'sample_large.csv',
                            return_df=False,
                            main_content='Content',
                            metadatas='Source',
                            label_names='L1',
                            val_ratio=0.2,
                            split_cols='L1',
                            content_tfms = txt_tfms,
                            aug_tfms = aug_tfms,
                            process_metadatas=True,
                            seed=42,
                            shuffle_trn=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


In [None]:
tdm.df.tail()

Unnamed: 0,Source,Content,L1,L2
2264,Google Play,Đăng xuất tài khoản thì không đăng nhập lại được.,Shopee account,Sign up/Log in
2265,Non Owned,Triển lãm Thương mại Điện tử Việt Nam với sự t...,Others,Branding
2266,Google Play,Như cứtttttttt,Others,Cannot defined
2267,HC search,áo khoác,Others,Cannot defined
2268,Non Owned,[https://shopee.vn/jocastore.vn](https://shope...,Others,Cannot defined


Define our tokenizer for EnviBert

In [None]:
cache_dir=Path('./envibert_tokenizer')
tokenizer = SourceFileLoader("envibert.tokenizer", 
                             str(cache_dir/'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)

In [None]:
# # EnviBert needs a data collator to work. We will save this as an attribute in TDM
# data_collator = DataCollatorWithPadding(tokenizer,padding=True,max_length=512)
# tdm.set_data_collator(data_collator)

Create our DatasetDict from TextDataMain (as our `ModelController` class can also work with DatasetDict)

In [None]:
main_ddict= tdm.to_datasetdict(tokenizer,
                                max_length=512,
                              )

-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
----- Label Encoding -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3928.47it/s]


-------------------- Train Test Split --------------------
Previous Validation Percentage: 20.009%
- Before leak check
Size: 454
- After leak check
Size: 447
- Number of rows leaked: 7, or 1.54% of the original validation (or test) data
Current Validation Percentage: 19.7%
-------------------- Text Augmentation --------------------
Train data size before augmentation: 1822
----- Oversampling Non Owned -----
Train data size after THIS augmentation: 2020
----- Oversampling Owned -----
Train data size after THIS augmentation: 2248
----- Oversampling HC search -----
Train data size after THIS augmentation: 2390
----- Add No-Accent Text -----


100%|████████████████████████████████████| 2390/2390 [00:00<00:00, 19058.57it/s]


Train data size after THIS augmentation: 4780
Train data size after ALL augmentation: 4780
-------------------- Map Tokenize Function --------------------


Map:   0%|          | 0/4780 [00:00<?, ? examples/s]

Map:   0%|          | 0/447 [00:00<?, ? examples/s]
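
The percentages in the log above can be checked by hand. The short calculation below uses only the row counts printed in the log. Reading the final doubling (2390 to 4780) as the effect of `frac=1` with `apply_to_all=True` in the no-accent augmentation is an inference from those settings.

```python
total_rows = 2269   # rows reported by the text-transformation progress bar
val_before = 454    # validation size before the leak check
val_after = 447     # validation size after the leak check
train_rows = 1822   # train size before augmentation

leaked = val_before - val_after
pct_val_before = round(100 * val_before / total_rows, 3)
pct_leaked = round(100 * leaked / val_before, 2)
pct_val_after = round(100 * val_after / total_rows, 1)
print(leaked, pct_val_before, pct_leaked, pct_val_after)

# Train and validation together still account for every row.
splits_cover_all = (train_rows + val_after == total_rows)

# The last augmentation adds one no-accent copy per row, doubling the train set.
train_after_no_accent = 2390 * 2
print(splits_cover_all, train_after_no_accent)
```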

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4780
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 447
    })
})

Let's see some examples of outputs the TDM produces:

In [None]:
two_steps_tokenization_explain('Tôi đặt hàng mà chẳng thấy giao 1 năm rồi.Làm với chả ăn,chán 🤬 🤬 🤬 🤬 🤬 🤬',
                               tokenizer,content_tfms=[text_normalize])

----- Text Transformation Explained -----
--- Raw sentence ---
Tôi đặt hàng mà chẳng thấy giao 1 năm rồi.Làm với chả ăn,chán 🤬 🤬 🤬 🤬 🤬 🤬
--- text_normalize ---
Tôi đặt hàng mà chẳng thấy giao 1 năm rồi . Làm với chả ăn , chán 🤬 🤬 🤬 🤬 🤬 🤬

----- Tokenizer Explained -----
--- Input ---
Tôi đặt hàng mà chẳng thấy giao 1 năm rồi . Làm với chả ăn , chán 🤬 🤬 🤬 🤬 🤬 🤬

--- Tokenized results --- 
{'input_ids': [0, 842, 642, 114, 145, 1371, 289, 363, 139, 93, 539, 5, 3798, 39, 7225, 380, 4, 5925, 3529, 3, 3529, 3, 3529, 3, 3529, 3, 3529, 3, 3529, 3, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

--- Results from tokenizer.convert_ids_to_tokens ---
['<s>', '▁Tôi', '▁đặt', '▁hàng', '▁mà', '▁chẳng', '▁thấy', '▁giao', '▁1', '▁năm', '▁rồi', '▁.', '▁Làm', '▁với', '▁chả', '▁ăn', '▁,', '▁chán', '▁', '<unk>', '▁', '<unk>', 
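
Note that every emoji in the sentence ends up as `<unk>`. One quick way to measure how much of an input the vocabulary fails to cover is to count the unknown-token id. The sketch below copies `input_ids` from the output above and takes `3` as the `<unk>` id, read off by aligning the ids with the token list. With a HuggingFace tokenizer the id is normally available as `tokenizer.unk_token_id`.

```python
input_ids = [0, 842, 642, 114, 145, 1371, 289, 363, 139, 93, 539, 5, 3798, 39,
             7225, 380, 4, 5925, 3529, 3, 3529, 3, 3529, 3, 3529, 3, 3529, 3,
             3529, 3, 2]
UNK_ID = 3  # assumption: read off the printed ids/tokens above

unk_count = input_ids.count(UNK_ID)
unk_ratio = unk_count / len(input_ids)
print(unk_count, len(input_ids), round(unk_ratio, 3))
```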

In [None]:
tdm.tokenizer_explain_single(tokenizer) # Pick a random text in train set to explain

----- Tokenizer Explained -----
--- Input ---
owned - # ShopeePaychamguibaiSS3 # ShopeePay1111 link : https://bit.ly/UuDaiShopeePay11-11 Username : chipheo2306 Thương hiệu l’oreal rất hay chạy sale những ngày lễ lớn và nhất ngày 11.11 có thể sẽ có voucher giảm 50 % giảm tối đa 100 k nếu săn được voucher mình rất muốn sản phẩm ￼ Nước tẩy trang cho mọi loại da L'Oreal Paris 3 in1 Micellar Water 400 ml bởi sản phẩm giá thành bình dân nếu săn sale nữa thì càng rẻ cho dung tích lớn 400 ml lận và sản phẩm có chứa công nghệ mi-xen đột phá Chiết xuất thảo dược và Glycerin bổ sung độ ẩm cho da Hút tất cả bụi bẩn , cặn dơ của lớp make-up mà không gây khô da . Với công nghệ mới , mang đến các tẩy trang , làm sạch , giữ ẩm và dưỡng mềm da đồng thời chỉ trong một sản phẩm . Mình luôn ưu tiên thanh toán qua ShopeePay để thanh toán được tiện lợi và nhanh chóng

--- Tokenized results --- 
{'input_ids': [0, 3507, 13, 2481, 1888, 51603, 53097, 1501, 1509, 10976, 4020, 2327, 31682, 1245, 2481, 1888, 5160

# Model Experiment: EnviBert Single-Head Classification

In [None]:
from that_nlp_library.models.roberta.classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score

comet_ml is installed but `COMET_API_KEY` is not set.


In [None]:
import os

In [None]:
# Specify the GPU (or a comma-separated list of GPUs) to use for training
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

## Train EnviBert using TDM

In [None]:
tdm.label_lists

[['Buyer complained seller',
  'Commercial',
  'Delivery',
  'Feature',
  'Order/Item',
  'Others',
  'Payment',
  'Return/Refund',
  'Services',
  'Shopee account']]

In [None]:
num_classes = len(tdm.label_lists[0]) # 10
num_classes

10

Let's define our model and model controller. First, we will initialize the pretrained `body` model

In [None]:
from transformers.models.roberta.modeling_roberta import RobertaModel

In [None]:
model_name='nguyenvulebinh/envibert'

In [None]:
envibert_body = RobertaModel.from_pretrained(model_name)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Then we can define a simple class as the head for our classification task, something like this:

In [None]:
class SimpleClassificationHead(torch.nn.Module):
    def __init__(self,
                 config, # HuggingFace model configuration
                 classifier_dropout=0.1, # Dropout ratio (for dropout layer right before the last nn.Linear)
                 num_labels=None, # Number of output labels. Every classification head class must accept this exact argument
                ):
        super().__init__()
        self.dropout = torch.nn.Dropout(classifier_dropout)
        self.out_proj = torch.nn.Linear(config.hidden_size, num_labels)
    def forward(self, inp, **kwargs):
        x = inp
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

In [None]:
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':tdm.is_multilabel, # False
    'is_multihead':tdm.is_multihead, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=False, # since we are not using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture


And we can start training our model

In [None]:
lr = 8e-5
bs=4
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1
0,No log,1.059704,0.350748,0.677852
1,No log,1.007712,0.462641,0.697987
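
The macro F1 score (about 0.46) is well below the accuracy (about 0.70). A gap like this is typical when some classes are rare, because macro F1 gives every class the same weight regardless of its size. The pure-Python sketch below illustrates the effect with hypothetical labels. It is not the library's `compute_metrics_classification`, only a plain re-implementation of the two metrics.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced data: the model always predicts the majority class.
y_true = ['Others'] * 8 + ['Delivery'] * 2
y_pred = ['Others'] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(acc, round(f1, 4))
```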


### Logging your training

You can log your training using HuggingFace:

- Supported platforms are "azure_ml", "comet_ml", "mlflow", "neptune", "tensorboard", "clearml" and "wandb"

- References:

    - https://huggingface.co/docs/transformers/v4.28.0/en/main_classes/trainer#transformers.TrainingArguments
    
    - https://docs.wandb.ai/guides/integrations/huggingface#:~:text=Logging%20your%20Hugging%20Face%20model,every%20save_steps%20in%20the%20TrainingArguments%20.

```python
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
               hf_report_to='wandb'
              )
```

You can save your model weights at the end of your training

```python
controller.trainer.model.save_pretrained('./sample_weights/model_progress')
```

Or you can save your weights at every epoch during training

```python
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=True,
               o_dir='sample_weights',
               compute_metrics=compute_metrics_classification,
              )
```

## Train EnviBert with tokenized DatasetDict

This part assumes you already have a tokenized DatasetDict. You must have your tokenizer as well

In [None]:
tokenizer

RobertaTokenizer(name_or_path='', vocab_size=59993, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True)

Note that your DatasetDict must contain tokens besides raw text (which typically includes 'input_ids', 'token_type_ids', 'attention_mask')
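
Before training, it can help to confirm that every split carries these columns. The helper below, `missing_token_columns`, is hypothetical and not part of the library. It works on a plain mapping from split name to column names, which a HuggingFace `DatasetDict` exposes as `main_ddict.column_names`. Treating all three columns as required is an assumption based on the note above.

```python
REQUIRED_TOKEN_COLUMNS = ('input_ids', 'token_type_ids', 'attention_mask')

def missing_token_columns(column_names, required=REQUIRED_TOKEN_COLUMNS):
    """Return {split: [missing columns]} for splits that lack a required column."""
    missing = {}
    for split, columns in column_names.items():
        absent = [col for col in required if col not in columns]
        if absent:
            missing[split] = absent
    return missing

# Column layout copied from the `main_ddict` printout in this notebook.
columns = {
    'train': ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
    'validation': ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
}
ok = missing_token_columns(columns)
raw_only = missing_token_columns({'train': ['text', 'label']})
print(ok, raw_only)
```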

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4780
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 447
    })
})

In [None]:
num_classes = 10

In [None]:
model_name='nguyenvulebinh/envibert'
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':False, # False
    'is_multihead':False, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

In [None]:
envibert_body = RobertaModel.from_pretrained(model_name)

In [None]:
model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=False, # since we are not using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,
                             metric_funcs=metric_funcs)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading body weights. This assumes the body is the very first first-layer block of your custom architecture


In [None]:
lr = 8e-5
bs=4
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               ddict=main_ddict, # Put in your tokenized datasetdict here
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
               tokenizer=tokenizer,
               label_names='L1'
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1
0,No log,1.059704,0.350748,0.677852
1,No log,1.007712,0.462641,0.697987


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/model_progress')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'is_multilabel': False,
 'is_multihead': False,
 'head_class_sizes': 10,
 'head_class': __main__.SimpleClassificationHead,
 'classifier_dropout': 0.1}

In [None]:
trained_model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/model_progress'), 
                                          output_hidden_states=False,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

Some weights of the model checkpoint at sample_weights/model_progress were not used when initializing RobertaBaseForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
trained_model

RobertaBaseForSequenceClassification(
  (body_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(59993, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
        

In [None]:
metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(trained_model,tdm,metric_funcs)

### Predict Train/Validation set

Make predictions on the entire validation set

In [None]:
df_val = controller.predict_ddict(ds_type='validation')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/447 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,3,google play,Commercial,0.837496
1,google play - Zq,5,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,5,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,6,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,2,google play,Delivery,0.758327


To convert the label index to string, we can use the ```label_lists``` attribute of tdm

In [None]:
df_val['label']= df_val['label'].apply(lambda x: tdm.label_lists[0][x]).values

df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,Feature,google play,Commercial,0.837496
1,google play - Zq,Others,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,Others,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,Payment,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,Delivery,google play,Delivery,0.758327


You can compute the metric yourself to check that it closely matches the value from the last training epoch above

In [None]:
f1_score(df_val.label,df_val.pred_L1,average='macro')

0.4634417008698494

You can also make predictions on the entire training set by changing the argument ```ds_type``` to "train"

### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We will remove all the labels and unnecessary columns

In [None]:
true_label = df_test['L1'].values

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)

In [None]:
df_test.head()

Unnamed: 0,Source,Content
0,Google Play,"App ncc lúc nào cx lag đơ, phần tìm kiếm thì v..."
1,Non Owned,..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope...
2,Google Play,Mắc gì người ta đặt hàng toàn lỗi 😃????
3,Owned,#GhienShopeePayawardT8 Khi bạn chơi shopee quá...
4,Google Play,Rất bức xúc khi dùng . mã giảm giá người dùng ...


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3954.30it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 0
- Number of rows leaked: 2269, or 100.00% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Remember the ***Leak Check*** we did in TextDataMain? Our ```df_test``` has 2269 rows, and the check reports that all 2269 rows are leaked (100%), which is correct because this test dataset is the same sample CSV that our training and validation data came from.
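
A leak check of this kind can be sketched as an exact-match lookup of each test text against the training texts. The function `split_leaked` and the toy sentences below are hypothetical. The library's actual check may compare different columns or normalize the text first.

```python
def split_leaked(test_texts, train_texts):
    """Separate test texts into (kept, leaked) by exact match against the training texts."""
    seen = set(train_texts)
    kept = [t for t in test_texts if t not in seen]
    leaked = [t for t in test_texts if t in seen]
    return kept, leaked

# Hypothetical toy data: the "test set" is drawn from the training data itself.
train_texts = ['giao hàng chậm', 'app bị lag', 'áo khoác']
test_texts = ['app bị lag', 'áo khoác']

kept, leaked = split_leaked(test_texts, train_texts)
leak_pct = 100 * len(leaked) / len(test_texts)
print(len(kept), len(leaked), leak_pct)
```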

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed + transformed (but not augmented) the same way as the validation set. Now we can start making the prediction

In [None]:
controller = ModelController(trained_model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
df_result.head()

Unnamed: 0,text,Source,pred_L1,pred_prob_L1
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,Feature,0.878221
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,Others,0.930981
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,Feature,0.849374
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,Commercial,0.915552
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,Shopee account,0.571471


Let's quickly check the f1 score to make sure everything works correctly

In [None]:
f1_score(true_label,df_result.pred_L1,average='macro')

0.5303012712104336

Since we are getting the predictions on the entire training+validation set, the F1 score is expected to be slightly higher than validation's F1 score.

We can even predict top k results

In [None]:
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',topk=3)
df_result.head()

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L1_top1,pred_L1_top2,pred_L1_top3,pred_prob_L1_top1,pred_prob_L1_top2,pred_prob_L1_top3
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,"[3, 5, 9]","[0.87822074, 0.023822138, 0.02159522]",Feature,Others,Shopee account,0.878221,0.023822,0.021595
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,"[5, 1, 0]","[0.9309808, 0.015578598, 0.009805982]",Others,Commercial,Buyer complained seller,0.930981,0.015579,0.009806
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,"[3, 5, 9]","[0.8493735, 0.050054528, 0.021759989]",Feature,Others,Shopee account,0.849374,0.050055,0.02176
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,"[1, 6, 7]","[0.9155516, 0.01255093, 0.010521941]",Commercial,Payment,Return/Refund,0.915552,0.012551,0.010522
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,"[9, 3, 1]","[0.57147133, 0.25687057, 0.03061041]",Shopee account,Feature,Commercial,0.571471,0.256871,0.03061
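
In the table above, `pred_L1` holds the top-k class indices and the `pred_L1_top*` columns hold the matching class names. The sketch below shows one way such a ranking can be derived from raw model scores. The logits are hypothetical, and the use of a softmax is an assumption about how the probabilities are produced for single-label classification.

```python
import math

class_names = ['Buyer complained seller', 'Commercial', 'Delivery', 'Feature',
               'Order/Item', 'Others', 'Payment', 'Return/Refund', 'Services',
               'Shopee account']

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Indices of the k largest probabilities, highest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

# Hypothetical logits for one example (not taken from the trained model).
logits = [0.1, 1.0, 0.2, 2.5, 0.0, 0.3, 0.1, 0.2, 0.1, 3.3]
probs = softmax(logits)
top_idx = top_k(probs, k=3)
top_names = [class_names[i] for i in top_idx]
print(top_idx, top_names)
```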


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we have some metadatas, we need to define a dictionary (to imitate a DatasetDict)
raw_content={
    'Source': 'Google play',
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,topk=1)
df_result

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 4718.00it/s]


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1
0,google play - Tôi không thích Shopee . Tại vì ...,google play,Feature,0.876843
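
The `text` column shows what happened to the dictionary input: the `Source` value was lowercased and placed in front of the content, separated by ` - `. The hypothetical helper below, `prepend_metadata`, mimics only that step. The real pipeline also applies the content transformations (here `text_normalize`, which spaces out the punctuation), so its output differs from this sketch.

```python
def prepend_metadata(content, metadata_values, sep=' - '):
    """Lowercase each metadata value and place it in front of the main content."""
    parts = [str(value).lower().strip() for value in metadata_values]
    return sep.join(parts + [content])

raw_content = {
    'Source': 'Google play',
    'Content': 'Tôi không thích Shopee.',
}
text = prepend_metadata(raw_content['Content'], [raw_content['Source']])
print(text)
```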


## Predict using trained model, using tokenized DatasetDict

### Load trained model

In [None]:
num_classes = 10

model_name='nguyenvulebinh/envibert'
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':False, # False
    'is_multihead':False, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

In [None]:
trained_model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/model_progress'), 
                                          output_hidden_states=False,
                                          seed=42,
                                          model_kwargs = _model_kwargs)


metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,metric_funcs=metric_funcs) # notice that we don't use tdm here, so metric_funcs is passed by keyword

Some weights of the model checkpoint at sample_weights/model_progress were not used when initializing RobertaBaseForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict validation set

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4780
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask', 'pred_L1', 'pred_prob_L1'],
        num_rows: 447
    })
})

In [None]:
my_label_name = 'L1'
my_class_predefined = ['Buyer complained seller',
 'Commercial',
 'Delivery',
 'Feature',
 'Order/Item',
 'Others',
 'Payment',
 'Return/Refund',
 'Services',
 'Shopee account']

In [None]:
df_val = controller.predict_ddict(main_ddict,
                                  ds_type='validation',
                                  is_multilabel=False,
                                  tokenizer=tokenizer,
                                  label_names = my_label_name,
                                  class_names_predefined=my_class_predefined
                                  )

-------------------- Start making predictions --------------------


Map:   0%|          | 0/447 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,3,google play,Commercial,0.837496
1,google play - Zq,5,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,5,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,6,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,2,google play,Delivery,0.758327
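
Without a TDM, the `label` column stays as integer indices while `pred_L1` already holds class names, so the labels need decoding before the two can be compared. A small sketch using the five rows shown above and the predefined class list:

```python
class_names = ['Buyer complained seller', 'Commercial', 'Delivery', 'Feature',
               'Order/Item', 'Others', 'Payment', 'Return/Refund', 'Services',
               'Shopee account']

# First five validation rows from the table above: (label index, predicted class name).
rows = [(3, 'Commercial'), (5, 'Others'), (5, 'Others'), (6, 'Delivery'), (2, 'Delivery')]

decoded = [class_names[idx] for idx, _ in rows]
hits = sum(class_names[idx] == pred for idx, pred in rows)
print(decoded, hits)
```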


### Predict test set

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask', 'pred_L1', 'pred_prob_L1'],
        num_rows: 2269
    })
})

Without tdm (which stores the preprocessing pipeline), you have to preprocess your test data yourself, in the same way as your validation set, which can be cumbersome. In short, you need to produce the test DatasetDict `test_ddict` containing processed `'input_ids', 'token_type_ids', 'attention_mask'`, then call

```python
df_results = controller.predict_ddict(ddict=test_ddict,
                                      ds_type='test',
                                      is_multilabel=False,
                                      tokenizer=tokenizer,
                                      label_names = my_label_name,
                                      class_names_predefined=my_class_predefined     
                                     )
```