# Model Controller Tutorial: Classification

> This notebook walks through an end-to-end process: preprocessing and tokenizing your text, then building a classification model based on the RoBERTa architecture

- skip_showdoc: true
- skip_exec: true

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
import os
from transformers import DataCollatorWithPadding,RobertaTokenizer
import torch
import nlpaug.augmenter.char as nac
from datasets import load_dataset
import random

# Create a TextDataController object

We will reuse the data and the preprocessing steps from [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main.html)

In [None]:
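def nlp_aug_stochastic(x,aug=None,p=0.5):
    # Augment x, then keep each augmented result with probability p (otherwise keep the original)
    results = aug.augment(x)
    # Single string: aug.augment returns a list, so take its first element
    if not isinstance(x,list): return results[0] if random.random()<p else x
    # Batch of strings: decide independently for each item
    return [a if random.random()<p else b for a,b in zip(results,x)]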

In [None]:
aug = nac.KeyboardAug(aug_char_max=3,aug_char_p=0.1,aug_word_p=0.07)
nearby_aug_func = partial(nlp_aug_stochastic,aug=aug,p=0.5)
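
To see what the wrapper does without installing `nlpaug`, here is a minimal sketch that swaps in a hypothetical `UppercaseAug` class (it is not part of nlpaug; it only mimics an `augment` method that returns a list). With `p=1` every input is replaced by its augmented version, and with `p=0` the input comes back untouched.

```python
import random
from functools import partial

class UppercaseAug:
    """Hypothetical stand-in for an nlpaug augmenter: `augment` returns a list of strings."""
    def augment(self, x):
        texts = x if isinstance(x, list) else [x]
        return [t.upper() for t in texts]

def nlp_aug_stochastic(x, aug=None, p=0.5):
    # Same logic as the cell above
    results = aug.augment(x)
    if not isinstance(x, list):
        return results[0] if random.random() < p else x
    return [a if random.random() < p else b for a, b in zip(results, x)]

always = partial(nlp_aug_stochastic, aug=UppercaseAug(), p=1.0)
never = partial(nlp_aug_stochastic, aug=UppercaseAug(), p=0.0)

single_aug = always("nice dress")                 # augmented, since random() < 1.0 always holds
single_raw = never("nice dress")                  # original, since random() < 0.0 never holds
batch_aug = always(["nice dress", "poor fit"])    # list in, list out
print(single_aug, single_raw, batch_aug)
```

Any value of `p` between 0 and 1 mixes augmented and original texts, which is why the training set ends up only partially perturbed.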

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')

tdc = TextDataController(dset,
                         main_text='Review Text',
                         label_names='Department Name',
                         filter_dict={'Review Text': lambda x: x is not None,
                                      'Department Name': lambda x: x is not None,
                                     },
                         metadatas=['Title','Division Name'],
                         content_transformations=[text_normalize,str.lower],
                         content_augmentations= [nearby_aug_func,str.lower], 
                         # add "str.lower" here because nearby_aug might return uppercase characters
                         val_ratio=0.2,
                         batch_size=1000,
                         seed=42,
                         convert_training_to_iterable=False,
                         num_proc=16,
                         verbose=True
                        )

Found cached dataset csv (/home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1)


Define our tokenizer for Roberta

In [None]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

Process and tokenize our dataset

In [None]:
tdc.process_and_tokenize(tokenizer,max_length=512,shuffle_trn=True)

Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-e9c222e72d2d212d_*_of_00016.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-d1cd989e4813d263_*_of_00016.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-8db764ea689c4a2f_*_of_00016.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1/cache-5cfa4d6b4bbf77b8_*_of_00016.arrow
Loading cached processed dataset at /home/quan/.cache/huggingface/datasets/csv/sample_data-f893627565d98cd2/0.0.0/6954658bab

-------------------- Start Main Text Processing --------------------
-------------------- Data Filtering --------------------
----- Do <lambda> on Review Text -----
----- Do <lambda> on Department Name -----
Done
----- Metadata Simple Processing & Concatenating to Main Content -----
Done
----- Label Encoding -----
Done
-------------------- Text Transformation --------------------
----- text_normalize -----
----- lower -----
Done
-------------------- Train Test Split --------------------
Validation split based on val_ratio
Done
-------------------- Dropping unused features --------------------
Done
- Number of rows leaked: 0, which is 0.00% of training set
-------------------- Text Augmentation --------------------
----- nlp_aug_stochastic -----


Map (num_proc=16):   0%|          | 0/18102 [00:00<?, ? examples/s]

----- lower -----


Map (num_proc=16):   0%|          | 0/18102 [00:00<?, ? examples/s]

Done
-------------------- Shuffling train set --------------------
Done
-------------------- Tokenization --------------------


Map:   0%|          | 0/18102 [00:00<?, ? examples/s]

Map:   0%|          | 0/4526 [00:00<?, ? examples/s]

Done


In [None]:
tdc.main_ddict

DatasetDict({
    train: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 18102
    })
    validation: Dataset({
        features: ['Title', 'Review Text', 'Division Name', 'Department Name', 'label', 'input_ids', 'attention_mask'],
        num_rows: 4526
    })
})
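
The `filter_dict` argument maps a column name to a predicate. As a rough plain-Python sketch of the idea (an assumption about the semantics, not the library's implementation), a row is kept only when every predicate returns `True` for its column. The toy rows and the `apply_filters` helper below are hypothetical.

```python
# Toy rows standing in for the CSV; the column names match the cell above
rows = [
    {'Review Text': 'Love this dress', 'Department Name': 'Dresses'},
    {'Review Text': None, 'Department Name': 'Tops'},
    {'Review Text': 'Runs small', 'Department Name': None},
]

filter_dict = {
    'Review Text': lambda x: x is not None,
    'Department Name': lambda x: x is not None,
}

def apply_filters(rows, filter_dict):
    # Assumption: a row survives only if every column predicate returns True
    return [row for row in rows
            if all(keep(row[col]) for col, keep in filter_dict.items())]

kept = apply_filters(rows, filter_dict)
print(len(kept), kept[0]['Department Name'])
```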

Let's see one example of how those content transformations and augmentations affect our input

In [None]:
sample_txt = 'This is not what I expected 🤬. I gulped when I put this in my bag during retailer days because the price was still too much ... but thought this has to be wonderful to charge so much,right??'

In [None]:
two_steps_tokenization_explain(sample_txt,tokenizer,
                               content_tfms=[text_normalize,str.lower],
                               aug_tfms=[partial(nlp_aug_stochastic,aug=aug,p=1),str.lower]
                              )

		------- Text Transformation Explained -------
----- Raw sentence -----
This is not what I expected 🤬. I gulped when I put this in my bag during retailer days because the price was still too much ... but thought this has to be wonderful to charge so much,right??

----- Content Transformations (on both train and test) -----
--- text_normalize ---
This is not what I expected 🤬 . I gulped when I put this in my bag during retailer days because the price was still too much ... but thought this has to be wonderful to charge so much , right ? ?

--- lower ---
this is not what i expected 🤬 . i gulped when i put this in my bag during retailer days because the price was still too much ... but thought this has to be wonderful to charge so much , right ? ?


----- Augmentations (on train only) -----
--- nlp_aug_stochastic ---
tNis is not what i expected 🤬. i gulpeR when i put this in my bag during rrtailer Cays because the price was still too much. .. but thought this has to be wonderful to charg

# Model Experiment: EnviBert Single-Head Classification

In [None]:
from that_nlp_library.models.roberta.classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score

comet_ml is installed but `COMET_API_KEY` is not set.


In [None]:
import os

In [None]:
# Specify which GPU (or comma-separated list of GPUs) to use for training
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
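
To expose more than one GPU, the same variable takes a comma-separated list of device indices. A small sketch, assuming a machine with at least two GPUs (the `gpu_ids` list is illustrative); the variable should be set before CUDA is first initialized in the process.

```python
import os

gpu_ids = [0, 1]  # hypothetical device indices; adjust to your machine
os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)
print(os.environ['CUDA_VISIBLE_DEVICES'])
```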

## Train EnviBert using TDM

In [None]:
tdm.label_lists

[['Buyer complained seller',
  'Commercial',
  'Delivery',
  'Feature',
  'Order/Item',
  'Others',
  'Payment',
  'Return/Refund',
  'Services',
  'Shopee account']]

In [None]:
num_classes = len(tdm.label_lists[0]) # 10
num_classes

10

Let's define our model and model controller. First, we will initialize the pretrained `body` model

In [None]:
from transformers.models.roberta.modeling_roberta import RobertaModel

In [None]:
model_name='nguyenvulebinh/envibert'

In [None]:
envibert_body = RobertaModel.from_pretrained(model_name)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Then we can define a simple class as the head for our classification task, something like this:

In [None]:
class SimpleClassificationHead(torch.nn.Module):
    def __init__(self,
                 config, # HuggingFace model configuration
                 classifier_dropout=0.1, # Dropout ratio (for dropout layer right before the last nn.Linear)
                 num_labels=None, # Number of output labels. Every classification head class must accept this exact argument
                ):
        super().__init__()
        self.dropout = torch.nn.Dropout(classifier_dropout)
        self.out_proj = torch.nn.Linear(config.hidden_size, num_labels)
    def forward(self, inp, **kwargs):
        x = inp
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

In [None]:
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':tdm.is_multilabel, # False
    'is_multihead':tdm.is_multihead, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=False, # since we are not using 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture


And we can start training our model

In [None]:
lr = 8e-5
bs=4
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1
0,No log,1.059704,0.350748,0.677852
1,No log,1.007712,0.462641,0.697987
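
The table reports a macro F1 well below accuracy. The following pure-Python sketch (not the library's code; it follows the standard definitions behind `f1_score(average='macro')` and `accuracy_score`) uses made-up labels to show how class imbalance can produce that kind of gap: a model that always predicts the majority class scores well on accuracy but gets zero F1 on the minority class.

```python
def accuracy(y_true, y_pred):
    # Fraction of exact matches
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1, so rare classes count as much as common ones
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up, imbalanced toy data: the "model" always predicts the majority class
y_true = ['Others'] * 8 + ['Delivery'] * 2
y_pred = ['Others'] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f'accuracy={acc:.4f} macro_f1={f1:.4f}')
```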


### Logging your training

You can log your training using HuggingFace:

- Supported platforms are "azure_ml", "comet_ml", "mlflow", "neptune", "tensorboard", "clearml" and "wandb"

- References:

    - https://huggingface.co/docs/transformers/v4.28.0/en/main_classes/trainer#transformers.TrainingArguments
    
    - https://docs.wandb.ai/guides/integrations/huggingface#:~:text=Logging%20your%20Hugging%20Face%20model,every%20save_steps%20in%20the%20TrainingArguments%20.

```python
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
               hf_report_to='wandb'
              )
```

You can save your model weights at the end of your training

```python
controller.trainer.model.save_pretrained('./sample_weights/model_progress')
```

Or you can save your weights at every epoch during your training

```python
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=True,
               o_dir='sample_weights',
               compute_metrics=compute_metrics_classification,
              )
```

## Train EnviBert with tokenized DatasetDict

This part assumes you already have a tokenized DatasetDict. You also need the tokenizer that goes with it

In [None]:
tokenizer

RobertaTokenizer(name_or_path='', vocab_size=59993, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True)

Note that your DatasetDict must contain the tokenized fields in addition to the raw text (typically 'input_ids', 'token_type_ids' and 'attention_mask')

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4780
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 447
    })
})
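
Before training on a DatasetDict you built yourself, it can help to check that the tokenized columns are really there. A minimal sketch, assuming `input_ids` and `attention_mask` are the minimum required set (that choice and the `missing_columns` helper are illustrative, not part of the library). On a real dataset the column list would come from `main_ddict['train'].column_names`.

```python
REQUIRED_COLUMNS = {'input_ids', 'attention_mask'}  # assumption: minimum set for a RoBERTa-style model

def missing_columns(column_names, required=REQUIRED_COLUMNS):
    # Return the required columns that are absent, in sorted order
    return sorted(required - set(column_names))

# Column lists copied from the output above, plus a raw-text-only variant
train_columns = ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask']
raw_only_columns = ['text', 'label', 'Source']

ok = missing_columns(train_columns)
bad = missing_columns(raw_only_columns)
print(ok, bad)
```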

In [None]:
num_classes = 10

In [None]:
model_name='nguyenvulebinh/envibert'
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':False, # False
    'is_multihead':False, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

In [None]:
envibert_body = RobertaModel.from_pretrained(model_name)

In [None]:
model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=False, # since we are not using 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,
                             metric_funcs=metric_funcs)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Loading body weights. This assumes the body is the very first first-layer block of your custom architecture


In [None]:
lr = 8e-5
bs=4
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               ddict=main_ddict, # Put in your tokenized datasetdict here
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
               tokenizer=tokenizer,
               label_names='L1'
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1
0,No log,1.059704,0.350748,0.677852
1,No log,1.007712,0.462641,0.697987


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/model_progress')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'is_multilabel': False,
 'is_multihead': False,
 'head_class_sizes': 10,
 'head_class': __main__.SimpleClassificationHead,
 'classifier_dropout': 0.1}

In [None]:
trained_model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/model_progress'), 
                                          output_hidden_states=False,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

Some weights of the model checkpoint at sample_weights/model_progress were not used when initializing RobertaBaseForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
trained_model

RobertaBaseForSequenceClassification(
  (body_model): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(59993, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-5): 6 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
        

In [None]:
metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(trained_model,tdm,metric_funcs)

### Predict Train/Validation set

Make predictions on the entire validation set

In [None]:
df_val = controller.predict_ddict(ds_type='validation')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/447 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,3,google play,Commercial,0.837496
1,google play - Zq,5,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,5,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,6,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,2,google play,Delivery,0.758327


To convert each label index to its string name, we can use the ```label_lists``` attribute of tdm

In [None]:
df_val['label']= df_val['label'].apply(lambda x: tdm.label_lists[0][x]).values

df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,Feature,google play,Commercial,0.837496
1,google play - Zq,Others,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,Others,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,Payment,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,Delivery,google play,Delivery,0.758327
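
The lookup itself is plain list indexing. The sketch below repeats it without `tdm`, using the class list printed earlier and the five label indices from the first table, and also builds the reverse mapping in case you need to go from names back to indices (`name_to_id` is an illustrative name).

```python
label_list = ['Buyer complained seller', 'Commercial', 'Delivery', 'Feature',
              'Order/Item', 'Others', 'Payment', 'Return/Refund',
              'Services', 'Shopee account']

label_ids = [3, 5, 5, 6, 2]  # the 'label' column of the first five validation rows

# Index -> name: position in the list is the encoded label
label_names = [label_list[i] for i in label_ids]

# Name -> index: the reverse lookup
name_to_id = {name: i for i, name in enumerate(label_list)}

print(label_names)
```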


You can compute the metric yourself to check that it is in line with the score from the last training epoch above

In [None]:
f1_score(df_val.label,df_val.pred_L1,average='macro')

0.4634417008698494

You can also make predictions on the entire training set by changing the ```ds_type``` argument to "train"

### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We will remove all the labels and unnecessary columns

In [None]:
true_label = df_test['L1'].values

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)

In [None]:
df_test.head()

Unnamed: 0,Source,Content
0,Google Play,"App ncc lúc nào cx lag đơ, phần tìm kiếm thì v..."
1,Non Owned,..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope...
2,Google Play,Mắc gì người ta đặt hàng toàn lỗi 😃????
3,Owned,#GhienShopeePayawardT8 Khi bạn chơi shopee quá...
4,Google Play,Rất bức xúc khi dùng . mã giảm giá người dùng ...


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3954.30it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 0
- Number of rows leaked: 2269, or 100.00% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Remember the ***Leak Check*** we did in TextDataMain? Our ```df_test``` has 2269 rows, and the check reports all 2269 rows as leaked (100%), which is expected because this test set reuses the data the model was trained and validated on.
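
Conceptually, a leak check counts how many test texts already appear in the training data. The sketch below is one simple way to do that with exact string matching; the library's actual matching rule is not shown in this notebook, so treat `leak_report` and the toy texts as assumptions.

```python
def leak_report(train_texts, test_texts):
    # Count test rows whose text is already present in the training set
    seen = set(train_texts)
    leaked = [t for t in test_texts if t in seen]
    pct = 100 * len(leaked) / len(test_texts) if test_texts else 0.0
    return len(leaked), pct

train_texts = ['google play - app rất chậm', 'owned - giao hàng nhanh']

# Test set that simply reuses the training rows: everything leaks
n_leaked, pct_leaked = leak_report(train_texts, list(train_texts))
print(f'- Number of rows leaked: {n_leaked}, or {pct_leaked:.2f}%')

# Test set with an unseen row: nothing leaks
n_fresh, pct_fresh = leak_report(train_texts, ['non owned - bài viết mới'])
```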

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed and transformed (but not augmented) the same way as the validation set. Now we can start making predictions

In [None]:
controller = ModelController(trained_model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
df_result.head()

Unnamed: 0,text,Source,pred_L1,pred_prob_L1
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,Feature,0.878221
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,Others,0.930981
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,Feature,0.849374
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,Commercial,0.915552
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,Shopee account,0.571471


Let's quickly check the f1 score to make sure everything works correctly

In [None]:
f1_score(true_label,df_result.pred_L1,average='macro')

0.5303012712104336

Since we are predicting on the entire training + validation set, the F1 score is expected to be slightly higher than the validation F1 score.

We can even predict top k results

In [None]:
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',topk=3)
df_result.head()

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L1_top1,pred_L1_top2,pred_L1_top3,pred_prob_L1_top1,pred_prob_L1_top2,pred_prob_L1_top3
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,"[3, 5, 9]","[0.87822074, 0.023822138, 0.02159522]",Feature,Others,Shopee account,0.878221,0.023822,0.021595
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,"[5, 1, 0]","[0.9309808, 0.015578598, 0.009805982]",Others,Commercial,Buyer complained seller,0.930981,0.015579,0.009806
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,"[3, 5, 9]","[0.8493735, 0.050054528, 0.021759989]",Feature,Others,Shopee account,0.849374,0.050055,0.02176
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,"[1, 6, 7]","[0.9155516, 0.01255093, 0.010521941]",Commercial,Payment,Return/Refund,0.915552,0.012551,0.010522
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,"[9, 3, 1]","[0.57147133, 0.25687057, 0.03061041]",Shopee account,Feature,Commercial,0.571471,0.256871,0.03061


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we have some metadata, we need to define a dictionary (to imitate a DatasetDict)
raw_content={
    'Source': 'Google play',
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,topk=1)
df_result

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 4718.00it/s]


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1
0,google play - Tôi không thích Shopee . Tại vì ...,google play,Feature,0.876843


## Predict using trained model, using tokenized DatasetDict

### Load trained model

In [None]:
num_classes = 10

model_name='nguyenvulebinh/envibert'
_model_kwargs={
    # overall model hyperparams
    'is_multilabel':False, # False
    'is_multihead':False, # False
    'head_class_sizes':num_classes,
    'head_class': SimpleClassificationHead,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

In [None]:
trained_model = model_init_classification(model_class = RobertaBaseForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/model_progress'), 
                                          output_hidden_states=False,
                                          seed=42,
                                          model_kwargs = _model_kwargs)


metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,metric_funcs=metric_funcs) # notice that we don't use tdm here; metric_funcs is passed by keyword, as in the training section

Some weights of the model checkpoint at sample_weights/model_progress were not used when initializing RobertaBaseForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaBaseForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict validation set

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4780
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask', 'pred_L1', 'pred_prob_L1'],
        num_rows: 447
    })
})

In [None]:
my_label_name = 'L1'
my_class_predefined = ['Buyer complained seller',
 'Commercial',
 'Delivery',
 'Feature',
 'Order/Item',
 'Others',
 'Payment',
 'Return/Refund',
 'Services',
 'Shopee account']

In [None]:
df_val = controller.predict_ddict(main_ddict,
                                  ds_type='validation',
                                  is_multilabel=False,
                                  tokenizer=tokenizer,
                                  label_names = my_label_name,
                                  class_names_predefined=my_class_predefined
                                  )

-------------------- Start making predictions --------------------


Map:   0%|          | 0/447 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1
0,google play - Chơi gam rất là lác,3,google play,Commercial,0.837496
1,google play - Zq,5,google play,Others,0.927602
2,non owned - Làn sóng kỹ thuật số và sự lựa chọ...,5,non owned,Others,0.918241
3,google play - Hàng quốc tế không còn ship COD ...,6,google play,Delivery,0.804539
4,google play - Quá tệ . Giao hàng chậm như rùa ...,2,google play,Delivery,0.758327


### Predict test set

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask', 'pred_L1', 'pred_prob_L1'],
        num_rows: 2269
    })
})

It would be cumbersome to preprocess your test data the same way you preprocessed your validation set without tdm (which stores the preprocessing pipeline). In short, you need to produce the test DatasetDict `test_ddict` containing the processed `'input_ids', 'token_type_ids', 'attention_mask'`, then call

```python
df_results = controller.predict_ddict(ddict=test_ddict,
                                      ds_type='test',
                                      is_multilabel=False,
                                      tokenizer=tokenizer,
                                      label_names = my_label_name,
                                      class_names_predefined=my_class_predefined     
                                     )
```
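
If you do have to rebuild the text yourself, the preprocessing amounts to prepending the metadata and then applying the content transformations in order, before tokenizing. The sketch below is an approximation under stated assumptions: the `' - '` separator and lowercased metadata are inferred from the prediction tables above, and `str.strip` and `str.lower` stand in for the real transformations. Both helpers are hypothetical names, not library functions.

```python
def concat_metadata(row, main_text, metadatas):
    # Assumption inferred from the outputs above: metadata is lowercased and joined with ' - '
    parts = [str(row[m]).lower() for m in metadatas] + [row[main_text]]
    return ' - '.join(parts)

def apply_transformations(text, transformations):
    # Apply each transformation in order, feeding the result to the next one
    for tfm in transformations:
        text = tfm(text)
    return text

row = {'Source': 'Google Play', 'Content': 'App bị LAG '}  # toy test row

text = concat_metadata(row, main_text='Content', metadatas=['Source'])
text = apply_transformations(text, [str.strip, str.lower])  # stand-in transformations
print(text)

# Next step (not run here): tokenizer(text, truncation=True, max_length=512)
# would produce the 'input_ids' and 'attention_mask' the model expects.
```

The order matters: whatever sequence was used for the training and validation sets has to be replayed unchanged on the test set.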