# Model Controller Tutorial: EnviBert model with Conditional Probability

> This notebook contains some examples of how to use the EnviBert-based models in this NLP library

- skip_showdoc: true
- skip_exec: true

We will walk through other cases of classification: multi-head and multi-label. Since the goal is to showcase the capability of this library in these cases, this walkthrough won't be as detailed as [this tutorial](https://anhquan0412.github.io/that-nlp-library/model_main_envibert.html)

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

## Load data

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
from transformers import DataCollatorWithPadding
import pandas as pd

Define some necessary text augmentations and text transformations

> For Text Transformation

In [None]:
txt_tfms=[text_normalize]

> For Text Augmentation

In [None]:
over_nonown_tfm = partial(sampling_with_condition,query='Source=="non owned"',frac=0.5,seed=42,apply_to_all=False)
over_nonown_tfm.__name__ = 'Oversampling Non Owned'

over_own_tfm = partial(sampling_with_condition,query='Source=="owned"',frac=2,seed=42,apply_to_all=False)
over_own_tfm.__name__ = 'Oversampling Owned'

over_hc_tfm = partial(sampling_with_condition,query='Source=="hc search"',frac=2.5,seed=42,apply_to_all=False)
over_hc_tfm.__name__ = 'Oversampling HC search'

remove_accent_tfm = partial(remove_vnmese_accent,frac=1,seed=42,apply_to_all=True)
remove_accent_tfm.__name__ = 'Add No-Accent Text'

aug_tfms = [over_nonown_tfm,over_own_tfm,over_hc_tfm,remove_accent_tfm]
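
The pattern used above (freeze the arguments with `partial`, then attach a `__name__` so the augmentation can be reported by name in the logs) can be sketched with the standard library alone. `oversample_where` is a hypothetical stand-in written for this illustration; it is not the library's `sampling_with_condition`, and its behavior (appending `frac` times the matching rows) is an assumption based on the growing train sizes reported in the logs.

```python
import random
from functools import partial

def oversample_where(rows, predicate, frac, seed):
    """Hypothetical helper: append a random sample of the matching rows."""
    rng = random.Random(seed)
    matching = [r for r in rows if predicate(r)]
    n_extra = int(len(matching) * frac)
    # Sample with replacement so that frac > 1 is allowed
    extra = [rng.choice(matching) for _ in range(n_extra)] if matching else []
    return rows + extra

rows = [{"Source": "owned", "Content": "a"},
        {"Source": "owned", "Content": "b"},
        {"Source": "non owned", "Content": "c"}]

over_owned = partial(oversample_where,
                     predicate=lambda r: r["Source"] == "owned",
                     frac=2, seed=42)
# partial objects have no __name__ by default, so we attach one for logging
over_owned.__name__ = "Oversampling Owned"

augmented = over_owned(rows)
print(over_owned.__name__, len(rows), "->", len(augmented))
```

Because every transformation in the list has the same one-argument call shape after `partial`, a pipeline can simply loop over the list and apply each one in turn.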

Let's load and preprocess our data

In [None]:
DATA_PATH = Path('secret_data')

In [None]:
df = TextDataMain.from_csv(DATA_PATH/'buyer_listening_with_all_raw_data_w2223.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    65804
dtype: int64
DataFrame contains duplicated values!
-----> Number of duplications: 7 rows


In [None]:
df.head(3)

Unnamed: 0,Week,Group,Source,Content,L1,L2,L3,L4,is_valid,iteration
0,1.0,Google Play,Google Play,Tại sao cứ hiện thông báo,Services,Shopee communication channels,Annoying pop-up ads,Non-tech,,1
1,1.0,Google Play,Google Play,Mlem,Others,Cannot defined,-,-,,1
2,1.0,Google Play,Google Play,1 số sản phẩm trong giỏ hàng vừa đc cập nhật t...,Feature,Cart & Order,Cart issues/suggestions,Tech,,1


Quick preprocess of data and train/validation split. Due to custom logic, we will sample our data here instead of using the `train_ratio` from the `to_datasetdict` function

In [None]:
df_rare = df[df.L2.isin(['Chatbot', 'Commercial Others'])].copy()

df_final = pd.concat([df.query('iteration==1').sample(500,random_state=42),
                      df.query('iteration>=7 & iteration<13').sample(1200,random_state=42),
                      df_rare,
                      df.query('iteration>=13'),
                     ],axis=0).reset_index(drop=True)

val_idxs = df_final[df_final.iteration>=13].index.values # from week 9
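
Since the frame is rebuilt with `reset_index(drop=True)`, the index values collected in `val_idxs` are plain row positions. A minimal sketch of the same idea on a list of toy records (the records and the `CUTOFF` name are made up for illustration):

```python
# Toy records standing in for rows of df_final (values are made up)
records = [
    {"Content": "old feedback", "iteration": 1},
    {"Content": "mid feedback", "iteration": 8},
    {"Content": "new feedback A", "iteration": 13},
    {"Content": "new feedback B", "iteration": 14},
]

CUTOFF = 13  # same threshold as the `iteration>=13` filter
# Positions of the rows that go to the validation set
val_positions = [i for i, r in enumerate(records) if r["iteration"] >= CUTOFF]
val_set = set(val_positions)
# Everything else stays in the training set
train_positions = [i for i in range(len(records)) if i not in val_set]
print(train_positions, val_positions)
```

Splitting by iteration rather than at random keeps the most recent data for validation, which is closer to how the model would be used on new feedback.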

In [None]:
tdm = TextDataMain(df_final,
                    main_content='Content',
                    metadatas='Source',
                    label_names=['L1','L2'],
                    val_ratio=val_idxs,
                    split_cols='L1',
                    content_tfms = txt_tfms,
                    aug_tfms = aug_tfms,
                    process_metadatas=True,
                    seed=42,
                    shuffle_trn=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    498
dtype: int64


Define our tokenizer for EnviBert

In [None]:
cache_dir=Path('./envibert_tokenizer')
tokenizer = SourceFileLoader("envibert.tokenizer", 
                             str(cache_dir/'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)

Create our DatasetDict from TextDataMain (as our `ModelController` class can also work with DatasetDict)

In [None]:
main_ddict= tdm.to_datasetdict(tokenizer,
                               max_length=512,
                               )

-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
----- Label Encoding -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 6649/6649 [00:01<00:00, 3594.75it/s]


-------------------- Train Test Split --------------------
Previous Validation Percentage: 74.101%
- Before leak check
Size: 4927
- After leak check
Size: 4885
- Number of rows leaked: 42, or 0.85% of the original validation (or test) data
Current Validation Percentage: 73.47%
-------------------- Text Augmentation --------------------
Train data size before augmentation: 1764
----- Oversampling Non Owned -----
Train data size after THIS augmentation: 2229
----- Oversampling Owned -----
Train data size after THIS augmentation: 2789
----- Oversampling HC search -----
Train data size after THIS augmentation: 2904
----- Add No-Accent Text -----


100%|████████████████████████████████████| 2904/2904 [00:00<00:00, 10563.57it/s]


Train data size after THIS augmentation: 5808
Train data size after ALL augmentation: 5808
-------------------- Map Tokenize Function --------------------


Map:   0%|          | 0/5808 [00:00<?, ? examples/s]

Map:   0%|          | 0/4885 [00:00<?, ? examples/s]
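
The "leak check" lines in the log report validation rows that were dropped (42 of 4927, or 0.85%). A minimal sketch of such a check, assuming a row counts as leaked when its processed text also appears in the training set; `drop_leaked` is a hypothetical helper, and the library's actual rule may differ:

```python
def drop_leaked(train_texts, val_texts):
    """Keep only validation texts that never occur in the training texts."""
    seen = set(train_texts)
    kept = [t for t in val_texts if t not in seen]
    return kept, len(val_texts) - len(kept)

# Made-up texts; the first validation text duplicates a training text
train_texts = ["google play - app lag", "owned - giao hàng nhanh"]
val_texts = ["google play - app lag", "google play - mlem", "owned - ok"]

kept, n_leaked = drop_leaked(train_texts, val_texts)
pct = 100 * n_leaked / len(val_texts)
print(f"leaked: {n_leaked} ({pct:.2f}%)")
```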

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5808
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4885
    })
})

In [None]:
main_ddict['validation']['label'][:5]

[[5, 12], [5, 12], [5, 12], [2, 23], [3, 5]]
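
Each `label` entry holds one class index per head, in the order given by `label_names` (`[L1 index, L2 index]`). A small sketch of how such pairs can be produced from string labels, assuming the class lists are sorted alphabetically (this matches the label names and indices printed later in the tutorial, but it is still an assumption; both helper functions are hypothetical):

```python
def build_label_lists(rows, label_names):
    """Sorted unique class names per label column (one list per head)."""
    return [sorted({r[name] for r in rows}) for name in label_names]

def encode(row, label_names, label_lists):
    """Turn the string labels of one row into a list of class indices."""
    return [label_lists[h].index(row[name]) for h, name in enumerate(label_names)]

rows = [
    {"L1": "Services", "L2": "Shopee communication channels"},
    {"L1": "Others", "L2": "Cannot defined"},
    {"L1": "Feature", "L2": "Cart & Order"},
]
label_names = ["L1", "L2"]
label_lists = build_label_lists(rows, label_names)
encoded = [encode(r, label_names, label_lists) for r in rows]
print(label_lists)
print(encoded)
```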

# Model Experiment: EnviBert Multi-Head Classification (with Hidden Layer Concatenation)

In [None]:
from that_nlp_library.models.roberta.conditional_prob_classifiers import *
from that_nlp_library.models.roberta.classifiers import ConcatHeadSimple
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score
import os
import torch
from transformers.models.roberta.modeling_roberta import RobertaModel

This specifies the GPU (or list of GPUs) to use for training

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

## Build Conditional Mask

In [None]:
df_labels = tdm.df[tdm.label_names]

In [None]:
df_labels.head()

Unnamed: 0,L1,L2
0,1,59
1,5,12
2,1,59
3,1,43
4,5,12


In [None]:
standard_mask = build_standard_condition_mask(df_labels,*tdm.label_names)

In [None]:
standard_mask.shape

torch.Size([10, 76])

Let's examine the first row of the mask

In [None]:
standard_mask[0]

tensor([ True, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False, False,
         True, False, False, False, False, False, False, False, False, False,
         True, False, False, False, False, False, False, False, False, False,
        False,  True,  True, False, False, False, False, False, False, False,
        False, False,  True,  True,  True, False, False, False, False, False,
        False, False, False, False, False, False])

Slice the first portion (level 1) and print the label strings where the mask is True

In [None]:
for i in torch.where(standard_mask[0][:len(tdm.label_lists[0])]==True)[0]:
    print(tdm.label_lists[0][i])

Buyer complained seller


Slice the second portion (level 2) and print the label strings where the mask is True. The results are the sub-categories of the level 1 class `Buyer complained seller`

In [None]:
for i in torch.where(standard_mask[0][len(tdm.label_lists[0]):]==True)[0]:
    print(tdm.label_lists[1][i])

Customer service (didn't respond/impolite)
Illegal/counterfeit products
Product description
Product quality
Sellers cancelled order without any advanced notice/reason
Sellers cheated Buyers (Sellers tried to reach me outside of Shopee App)
Sellers packed fake orders


## Train EnviBert (with hidden layer concatenation), using TDM

In [None]:
model_name='nguyenvulebinh/envibert'
envibert_body = RobertaModel.from_pretrained(model_name)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's create our model controller

In [None]:
_model_kwargs={
    # overall model hyperparams
    'size_l1':len(tdm.label_lists[0]),
    'size_l2':len(tdm.label_lists[1]),
    'standard_mask':standard_mask,
    'layer2concat':4,
    'head_class': ConcatHeadSimple,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

model = model_init_classification(model_class = RobertaHSCCProbSequenceClassification,
                                  cpoint_path = model_name, 
                                  output_hidden_states=True, # required since we are using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture
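
`output_hidden_states=True` makes the body return the hidden states of every layer, which a concatenation head needs. The sketch below assumes that `layer2concat=4` means "concatenate the first-token vector of the last 4 layers"; both that reading and the concatenation order are assumptions, and `concat_cls_states` is a hypothetical helper using nested lists in place of tensors:

```python
def concat_cls_states(hidden_states, n_layers):
    """Concatenate the first-token vector of the last `n_layers` layers."""
    picked = hidden_states[-n_layers:]
    features = []
    for layer in picked:
        features.extend(layer[0])  # layer[0] is the first token's vector
    return features

# 5 "layers", 3 tokens each, hidden size 2 (numbers are made up)
hidden_states = [
    [[float(layer_idx), float(layer_idx) + 0.5] for _ in range(3)]
    for layer_idx in range(5)
]
features = concat_cls_states(hidden_states, n_layers=4)
print(len(features), features)
```

Under this reading, the classification head receives a vector of size `n_layers * hidden_size` instead of `hidden_size`.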


And we can start training our model

In [None]:
lr = 8.2e-5
bs=4
wd=0.01
epochs= 4

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,0.262642,0.085617,0.394678,0.024601,0.195292
2,0.177300,0.248724,0.098848,0.392835,0.040627,0.222723
3,0.177300,0.260753,0.103095,0.393245,0.043971,0.242784
4,0.042900,0.270646,0.104099,0.392631,0.045365,0.24176


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/my_model')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'size_l1': 10,
 'size_l2': 66,
 'standard_mask': tensor([[ True, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False,
          False, False, False, False, False, False, False, False, False, False,
           True, False, False, False, False, False, False, False, False, False,
           True, False, False, False, False, False, False, False, False, False,
          False,  True,  True, False, False, False, False, False, False, False,
          False, False,  True,  True,  True, False, False, False, False, False,
          False, False, False, False, False, False],
         [False,  True, False, False, False, False, False, False, False, False,
          False, False, False, False,  True, False, False, False, False, False,
          False, False,  True, False, False, False,  True, False, False, False,
          False, False, False, False, False, False, False, False,  True,  True,
          False, F

In [None]:
trained_model = model_init_classification(model_class = RobertaHSCCProbSequenceClassification,
                                          cpoint_path = Path('./sample_weights/my_model'), 
                                          output_hidden_states=True,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,tdm,metric_funcs)

Some weights of the model checkpoint at sample_weights/my_model were not used when initializing RobertaHSCCProbSequenceClassification: ['body_model.pooler.dense.bias', 'body_model.pooler.dense.weight']
- This IS expected if you are initializing RobertaHSCCProbSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHSCCProbSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict Train/Validation set

Make predictions on the whole validation set

In [None]:
df_val = controller.predict_ddict(ds_type='validation',batch_size=8)

-------------------- Start making predictions --------------------


Map:   0%|          | 0/4885 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,google play - lam phien,"[5, 12]",google play,Others,0.289139,Promotions,0.553118
1,google play - .. t . À mà họ nữ ưu m,"[5, 12]",google play,Others,0.547048,Cannot defined,0.769387
2,google play - Cc lùa dao,"[5, 12]",google play,Others,0.480877,Cannot defined,0.680652
3,google play - Mặt hàng sp mình đều nhỡ với Gia...,"[2, 23]",google play,Others,0.534984,Cannot defined,0.597015
4,google play - Chưa tối ưu tốt cho Android Oppo...,"[3, 5]",google play,Others,0.564101,Cannot defined,0.852026


To convert the label index to string, we can use the ```label_lists``` attribute of tdm

In [None]:
import pandas as pd

In [None]:
df_val[['label_L1','label_L2']] = pd.DataFrame(df_val.label.tolist(), index= df_val.index)

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2,label_L1,label_L2
0,google play - lam phien,"[5, 12]",google play,Others,0.289139,Promotions,0.553118,5,12
1,google play - .. t . À mà họ nữ ưu m,"[5, 12]",google play,Others,0.547048,Cannot defined,0.769387,5,12
2,google play - Cc lùa dao,"[5, 12]",google play,Others,0.480877,Cannot defined,0.680652,5,12
3,google play - Mặt hàng sp mình đều nhỡ với Gia...,"[2, 23]",google play,Others,0.534984,Cannot defined,0.597015,2,23
4,google play - Chưa tối ưu tốt cho Android Oppo...,"[3, 5]",google play,Others,0.564101,Cannot defined,0.852026,3,5


In [None]:
df_val['label_L1']= df_val['label_L1'].apply(lambda x: tdm.label_lists[0][x]).values
df_val['label_L2']= df_val['label_L2'].apply(lambda x: tdm.label_lists[1][x]).values

In [None]:
f1_score(df_val.label_L1,df_val.pred_L1,average='macro'),f1_score(df_val.label_L2,df_val.pred_L2,average='macro')

(0.10407897411613716, 0.045350502179752714)
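
The macro F1 scores are much lower than the accuracy scores reported during training (about 0.10 versus 0.39 for L1). Macro F1 averages the per-class F1 without weighting by class frequency, so classes the model never predicts correctly pull the score down. A small standard-library sketch of the metric on made-up labels (`macro_f1` is written for illustration; the tutorial itself uses `sklearn.metrics.f1_score`):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up example: the model over-predicts the majority class "Others"
y_true = ["Others", "Others", "Feature", "Delivery"]
y_pred = ["Others", "Others", "Others", "Delivery"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```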

### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We will remove all the labels and unnecessary columns

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)

In [None]:
df_test.head()

Unnamed: 0,Source,Content
0,Google Play,"App ncc lúc nào cx lag đơ, phần tìm kiếm thì v..."
1,Non Owned,..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope...
2,Google Play,Mắc gì người ta đặt hàng toàn lỗi 😃????
3,Owned,#GhienShopeePayawardT8 Khi bạn chơi shopee quá...
4,Google Play,Rất bức xúc khi dùng . mã giảm giá người dùng ...


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3951.52it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 2080
- Number of rows leaked: 189, or 8.33% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed + transformed (but not augmented) the same way as the validation set. Now we can start making the prediction

In [None]:
controller = ModelController(model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
df_result.head()

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,Others,0.537956,Seller,0.182006
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,Others,0.592825,Cannot defined,0.822957
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,Others,0.435749,Dispute,0.628866
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,Commercial,0.906357,Shopee Programs,0.785214
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,Others,0.266986,Promotions,0.322534


We can even predict top k results

In [None]:
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',topk=3)
df_result.head()

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2,pred_L1_top1,pred_L1_top2,pred_L1_top3,pred_prob_L1_top1,pred_prob_L1_top2,pred_prob_L1_top3,pred_L2_top1,pred_L2_top2,pred_L2_top3,pred_prob_L2_top1,pred_prob_L2_top2,pred_prob_L2_top3
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,"[5, 2, 1]","[0.5379561, 0.1072576, 0.10138138]","[51, 17, 35]","[0.18200605, 0.16479525, 0.14931145]",Others,Delivery,Commercial,0.537956,0.107258,0.101381,Seller,Contact Agent,Order cancelled,0.182006,0.164795,0.149311
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,"[5, 3, 2]","[0.59282476, 0.11958726, 0.093639985]","[12, 5, 15]","[0.82295734, 0.07322491, 0.020063285]",Others,Feature,Delivery,0.592825,0.119587,0.09364,Cannot defined,App performance,Chatbot,0.822957,0.073225,0.020063
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,"[5, 3, 2]","[0.43574947, 0.11776977, 0.11724175]","[25, 43, 35]","[0.62886584, 0.17175609, 0.044966247]",Others,Feature,Delivery,0.435749,0.11777,0.117242,Dispute,Promotions,Order cancelled,0.628866,0.171756,0.044966
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,"[1, 6, 5]","[0.9063568, 0.035481054, 0.014191813]","[59, 63, 45]","[0.78521365, 0.13409148, 0.024677942]",Commercial,Payment,Others,0.906357,0.035481,0.014192,Shopee Programs,ShopeePay,Return/Refund Others,0.785214,0.134091,0.024678
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,"[5, 1, 7]","[0.26698554, 0.24040055, 0.12578185]","[43, 35, 17]","[0.3225337, 0.21654503, 0.12851809]",Others,Commercial,Return/Refund,0.266986,0.240401,0.125782,Promotions,Order cancelled,Contact Agent,0.322534,0.216545,0.128518
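
In the top-k output, `pred_L1` and `pred_prob_L1` hold the k best class indices and their probabilities, and the `_top1` to `_top3` columns spread them out with the indices mapped to label strings. A minimal sketch of that selection on a made-up probability vector (`topk` is a hypothetical helper, not the library's implementation):

```python
def topk(probs, k):
    """Return (indices, values) of the k largest probabilities, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return order, [probs[i] for i in order]

label_list = ["Commercial", "Delivery", "Feature", "Others"]  # toy label list
probs = [0.10, 0.11, 0.25, 0.54]                              # made-up probabilities

idxs, vals = topk(probs, k=3)
names = [label_list[i] for i in idxs]
print(idxs, vals, names)
```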


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we use metadata, we pass a dictionary (imitating one row of a DatasetDict)
raw_content={
    'Source': 'Google play',
    'iteration':21,
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,topk=1)
df_result

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 4793.49it/s]


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,google play - Tôi không thích Shopee . Tại vì ...,google play,Others,0.544602,Cannot defined,0.789728
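
The `text` column shows the metadata value lowercased and prepended to the content (`google play - ...`), while keys that are not declared as metadata, such as `iteration`, do not appear. A sketch of that step under those assumptions; `concat_metadata` is a hypothetical helper, and the text normalization applied by the pipeline is not reproduced here:

```python
def concat_metadata(record, main_content, metadatas):
    """Lowercase each metadata value and prepend it to the main text."""
    if isinstance(record, str):       # no metadata: the text is used as-is
        return record
    parts = [str(record[m]).lower() for m in metadatas]
    parts.append(record[main_content])
    return " - ".join(parts)

raw_content = {"Source": "Google play", "iteration": 21, "Content": "App rất chậm"}
text = concat_metadata(raw_content, main_content="Content", metadatas=["Source"])
print(text)
```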
