# Model Controller Tutorial: EnviBert model (Single Head)

> This notebook contains some examples of how to use the EnviBert-based models in this NLP library

- skip_showdoc: true
- skip_exec: true

We will walk through other cases of classification: multi-head and multi-label. Since the goal is to showcase the capability of this library in these cases, this notebook won't be as detailed as [this tutorial](https://anhquan0412.github.io/that-nlp-library/model_main_envibert.html)

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

## Load data

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
from transformers import DataCollatorWithPadding
import pandas as pd

Define some necessary text augmentations and text transformations

> For Text Transformation

In [None]:
txt_tfms=[text_normalize]

> For Text Augmentation

In [None]:
over_nonown_tfm = partial(sampling_with_condition,query='Source=="non owned"',frac=0.5,seed=42,apply_to_all=False)
over_nonown_tfm.__name__ = 'Oversampling Non Owned'

over_own_tfm = partial(sampling_with_condition,query='Source=="owned"',frac=2,seed=42,apply_to_all=False)
over_own_tfm.__name__ = 'Oversampling Owned'

over_hc_tfm = partial(sampling_with_condition,query='Source=="hc search"',frac=2.5,seed=42,apply_to_all=False)
over_hc_tfm.__name__ = 'Oversampling HC search'

remove_accent_tfm = partial(remove_vnmese_accent,frac=1,seed=42,apply_to_all=True)
remove_accent_tfm.__name__ = 'Add No-Accent Text'

aug_tfms = [over_nonown_tfm,over_own_tfm,over_hc_tfm,remove_accent_tfm]
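
As a rough illustration of the pattern above, the sketch below binds arguments with `functools.partial` and then assigns a `__name__`, because a `partial` object does not inherit one from the function it wraps. `oversample_by_query` and the toy DataFrame are hypothetical stand-ins written for this example. They are not the library's `sampling_with_condition`, whose internals may differ.

```python
from functools import partial
import pandas as pd

def oversample_by_query(df, query, frac, seed=42):
    # Hypothetical helper: append a random sample of the rows matching `query`.
    matched = df.query(query)
    # Sampling more rows than exist (frac > 1) requires replacement.
    extra = matched.sample(frac=frac, replace=frac > 1, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

toy = pd.DataFrame({
    'Source': ['owned', 'owned', 'non owned', 'non owned'],
    'Content': ['a', 'b', 'c', 'd'],
})

over_owned = partial(oversample_by_query, query='Source=="owned"', frac=2, seed=42)
over_owned.__name__ = 'Oversampling Owned'  # readable label, e.g. for log messages

augmented = over_owned(toy)
print(over_owned.__name__, len(toy), '->', len(augmented))
```

The names assigned here appear to be the ones printed in the augmentation log later in this notebook, which is a good reason to keep them descriptive.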

Let's load and preprocess our data

In [None]:
DATA_PATH = Path('secret_data')

In [None]:
df = TextDataMain.from_csv(DATA_PATH/'buyer_listening_with_all_raw_data_w2223.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    65804
dtype: int64
DataFrame contains duplicated values!
-----> Number of duplications: 7 rows


In [None]:
df.head(3)

Unnamed: 0,Week,Group,Source,Content,L1,L2,L3,L4,is_valid,iteration
0,1.0,Google Play,Google Play,Tại sao cứ hiện thông báo,Services,Shopee communication channels,Annoying pop-up ads,Non-tech,,1
1,1.0,Google Play,Google Play,Mlem,Others,Cannot defined,-,-,,1
2,1.0,Google Play,Google Play,1 số sản phẩm trong giỏ hàng vừa đc cập nhật t...,Feature,Cart & Order,Cart issues/suggestions,Tech,,1


Quickly preprocess the data and create the train/validation split. Because the split follows custom logic, we sample the data here instead of using the `train_ratio` from the `to_datasetdict` function

In [None]:
df_rare = df[df.L2.isin(['Chatbot', 'Commercial Others'])].copy()

df_final = pd.concat([df.query('iteration==1').sample(500,random_state=42),
                      df.query('iteration>=7 & iteration<13').sample(1200,random_state=42),
                      df_rare,
                      df.query('iteration>=13'),
                     ],axis=0).reset_index(drop=True)

val_idxs = df_final[df_final.iteration>=13].index.values # from week 9
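
To make the indexing step concrete, here is a minimal sketch on a toy DataFrame (the data and the `toy_*` names are made up for illustration). It shows why `reset_index(drop=True)` matters: after the reset, the index values of the filtered rows are their positions in the concatenated frame, so they can be passed on as validation indices.

```python
import pandas as pd

toy = pd.DataFrame({
    'Content': ['a', 'b', 'c', 'd', 'e', 'f'],
    'iteration': [1, 1, 7, 8, 13, 14],
})

toy_final = pd.concat([
    toy.query('iteration==1').sample(1, random_state=42),  # subsample old data
    toy.query('iteration>=7 & iteration<13'),               # mid-range data
    toy.query('iteration>=13'),                             # newest data
], axis=0).reset_index(drop=True)

# After reset_index, index values are positions within toy_final
toy_val_idxs = toy_final[toy_final.iteration >= 13].index.values
print(toy_val_idxs)
```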

In [None]:
tdm = TextDataMain(df_final,
                    main_content='Content',
                    metadatas='Source',
                    label_names='L1',
                    val_ratio=val_idxs,
                    split_cols='L1',
                    content_tfms = txt_tfms,
                    aug_tfms = aug_tfms,
                    process_metadatas=True,
                    seed=42,
                    cols_to_keep=['Content','Source','iteration','L1'], 
                   # Note that the text column (e.g.`Content`) must be the first item in the `cols_to_keep`
                    shuffle_trn=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    498
dtype: int64


Define our tokenizer for EnviBert

In [None]:
cache_dir=Path('./envibert_tokenizer')
tokenizer = SourceFileLoader("envibert.tokenizer", 
                             str(cache_dir/'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)

Create our DatasetDict from TextDataMain (as our `ModelController` class can also work with DatasetDict)

In [None]:
main_ddict= tdm.to_datasetdict(tokenizer,
                               max_length=512)

-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
----- Label Encoding -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 6649/6649 [00:01<00:00, 3640.53it/s]


-------------------- Train Test Split --------------------
Previous Validation Percentage: 74.101%
- Before leak check
Size: 4927
- After leak check
Size: 4885
- Number of rows leaked: 42, or 0.85% of the original validation (or test) data
Current Validation Percentage: 73.47%
-------------------- Text Augmentation --------------------
Train data size before augmentation: 1764
----- Oversampling Non Owned -----
Train data size after THIS augmentation: 2229
----- Oversampling Owned -----
Train data size after THIS augmentation: 2789
----- Oversampling HC search -----
Train data size after THIS augmentation: 2904
----- Add No-Accent Text -----


100%|████████████████████████████████████| 2904/2904 [00:00<00:00, 10205.97it/s]


Train data size after THIS augmentation: 5808
Train data size after ALL augmentation: 5808
-------------------- Map Tokenize Function --------------------


Map:   0%|          | 0/5808 [00:00<?, ? examples/s]

Map:   0%|          | 0/4885 [00:00<?, ? examples/s]
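
The "leak check" lines in the log above report validation rows being dropped. A minimal sketch of one way such a check could work is shown below, assuming a leak means an exact text match between the training and validation splits. The library's actual criterion is not shown in this notebook and may differ, and the toy data is made up.

```python
import pandas as pd

train = pd.DataFrame({'text': ['good app', 'slow delivery', 'bad ui']})
valid = pd.DataFrame({'text': ['slow delivery', 'great voucher', 'bad ui', 'login fails']})

# Flag validation rows whose text also appears in the training split
leaked_mask = valid['text'].isin(train['text'])
valid_clean = valid[~leaked_mask].reset_index(drop=True)

n_leaked = int(leaked_mask.sum())
print(f'Rows leaked: {n_leaked}, or {n_leaked / len(valid):.2%} of the validation data')
```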

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'iteration', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5808
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'iteration', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4885
    })
})

# Model Experiment: EnviBert Single-Head Classification (with Hidden Layer Concatenation)

In [None]:
from that_nlp_library.models.roberta.classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score
import os
from transformers.models.roberta.modeling_roberta import RobertaModel

comet_ml is installed but `COMET_API_KEY` is not set.


This specifies a GPU (or a list of GPUs) for training

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
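
For the multi-GPU case, the variable takes a comma-separated list of device indices. The sketch below assumes a machine with at least two GPUs (`gpu_ids` is a hypothetical choice). The variable has to be set before any CUDA-using library initializes the GPU, otherwise it has no effect.

```python
import os

gpu_ids = [0, 1]  # hypothetical: the first two GPUs on the machine

# Must run before any CUDA-using library initializes the GPU
os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)
print(os.environ['CUDA_VISIBLE_DEVICES'])
```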

## Train EnviBert (with hidden layer concatenation), using TDM

Let's create our model controller

In [None]:
model_name='nguyenvulebinh/envibert'
envibert_body = RobertaModel.from_pretrained(model_name)
num_classes = len(tdm.label_lists[0])

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
_model_kwargs={
    # overall model hyperparams
    'layer2concat':4,
    'is_multilabel':tdm.is_multilabel, # False
    'is_multihead':tdm.is_multihead, # False
    'head_class_sizes':num_classes,
    'head_class': ConcatHeadSimple,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                  cpoint_path = model_name, 
                                  output_hidden_states=True, # required, since we are using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture
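
To give an intuition for hidden layer concatenation, here is a NumPy-only sketch. It assumes that `layer2concat=4` means the first-token vectors of the last four hidden layers are concatenated before the classification head. That reading is an assumption, and the real `RobertaHiddenStateConcatForSequenceClassification` may select or pool the layers differently. All sizes below are toy values.

```python
import numpy as np

rng = np.random.default_rng(42)
batch_size, seq_len, hidden_size = 2, 5, 8
num_layers = 6       # toy value, not EnviBert's real depth
layer2concat = 4

# One array per layer, mimicking the tuple a transformer returns when
# output_hidden_states=True (embedding output + one entry per layer)
hidden_states = [rng.normal(size=(batch_size, seq_len, hidden_size))
                 for _ in range(num_layers + 1)]

# Assumption: take the first-token vector of each of the last `layer2concat` layers
first_tokens = [h[:, 0, :] for h in hidden_states[-layer2concat:]]
concat_features = np.concatenate(first_tokens, axis=-1)
print(concat_features.shape)  # (2, 32)
```

Under this assumption the classification head sees `layer2concat * hidden_size` features per example, which is why `output_hidden_states=True` is needed when building the model.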


And we can start training our model

In [None]:
lr = 8.2e-5
bs=4
wd=0.01
epochs= 4

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_classification,
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1
1,No log,1.372778,0.344335,0.621085
2,0.789100,1.3751,0.450448,0.637871
3,0.789100,1.668144,0.501494,0.67042
4,0.136500,1.773287,0.503311,0.671238


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/my_model')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'layer2concat': 4,
 'is_multilabel': False,
 'is_multihead': False,
 'head_class_sizes': 10,
 'head_class': that_nlp_library.models.roberta.classifiers.ConcatHeadSimple,
 'classifier_dropout': 0.1}

In [None]:
trained_model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/my_model'), 
                                          output_hidden_states=True,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,tdm,metric_funcs)

Some weights of the model checkpoint at sample_weights/my_model were not used when initializing RobertaHiddenStateConcatForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict Train/Validation set

Make predictions on the entire validation set

In [None]:
df_val = controller.predict_ddict(ds_type='validation',batch_size=8)

-------------------- Start making predictions --------------------


Map:   0%|          | 0/4885 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,iteration,pred_L1,pred_prob_L1
0,google play - lam phien,5,google play,13,Feature,0.991197
1,google play - .. t . À mà họ nữ ưu m,5,google play,13,Others,0.850162
2,google play - Cc lùa dao,5,google play,13,Others,0.998358
3,google play - Mặt hàng sp mình đều nhỡ với Gia...,2,google play,13,Delivery,0.984496
4,google play - Chưa tối ưu tốt cho Android Oppo...,3,google play,13,Feature,0.983986


To convert the label index to string, we can use the ```label_lists``` attribute of tdm

In [None]:
df_val['label']= df_val['label'].apply(lambda x: tdm.label_lists[0][x]).values

In [None]:
f1_score(df_val.label,df_val.pred_L1,average='macro')

0.5034901410417664
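
Macro F1 averages the per-class F1 scores with equal weight, so a class that is rarely predicted correctly pulls the score down even when overall accuracy looks reasonable. That is one possible reason for the gap between the F1 and accuracy columns in the training table. The toy labels below are made up to show the effect.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ['Feature', 'Feature', 'Feature', 'Feature', 'Payment']
y_pred = ['Feature', 'Feature', 'Feature', 'Feature', 'Feature']

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(f'accuracy={acc:.3f}, macro F1={macro_f1:.3f}')
```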

### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We add the required columns that were defined during training, and remove all the labels

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)
df_test['iteration']=df_val.iteration.max()+1

In [None]:
df_test.head()

Unnamed: 0,Source,Content,iteration
0,Google Play,"App ncc lúc nào cx lag đơ, phần tìm kiếm thì v...",21
1,Non Owned,..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope...,21
2,Google Play,Mắc gì người ta đặt hàng toàn lỗi 😃????,21
3,Owned,#GhienShopeePayawardT8 Khi bạn chơi shopee quá...,21
4,Google Play,Rất bức xúc khi dùng . mã giảm giá người dùng ...,21


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3120.31it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 2080
- Number of rows leaked: 189, or 8.33% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'iteration', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed + transformed (but not augmented) the same way as the validation set. Now we can start making the prediction

In [None]:
controller = ModelController(model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test')

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
df_result.head()

Unnamed: 0,text,Source,iteration,pred_L1,pred_prob_L1
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,21,Feature,0.997287
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,21,Others,0.999744
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,21,Feature,0.959076
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,21,Commercial,0.999123
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,21,Feature,0.997365


We can even predict top k results

In [None]:
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',topk=3)
df_result.head()

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,iteration,pred_L1,pred_prob_L1,pred_L1_top1,pred_L1_top2,pred_L1_top3,pred_prob_L1_top1,pred_prob_L1_top2,pred_prob_L1_top3
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,21,"[3, 9, 5]","[0.99728715, 0.0009524937, 0.00058025133]",Feature,Shopee account,Others,0.997287,0.000952,0.00058
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,21,"[5, 1, 3]","[0.9997445, 8.3330306e-05, 4.366889e-05]",Others,Commercial,Feature,0.999744,8.3e-05,4.4e-05
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,21,"[3, 5, 9]","[0.95907587, 0.03484495, 0.0030704038]",Feature,Others,Shopee account,0.959076,0.034845,0.00307
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,21,"[1, 6, 3]","[0.99912256, 0.0005796181, 6.469468e-05]",Commercial,Payment,Feature,0.999123,0.00058,6.5e-05
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,21,"[3, 9, 5]","[0.9973652, 0.0006989358, 0.00043120742]",Feature,Shopee account,Others,0.997365,0.000699,0.000431
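
The top-k columns above can be reproduced in spirit with a few lines of NumPy. This is only a sketch of the general technique on made-up probabilities and a hypothetical three-class `label_list`. It is not the code `predict_ddict` runs internally.

```python
import numpy as np

label_list = ['Commercial', 'Delivery', 'Feature']  # hypothetical label names
probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.2, 0.3]])
k = 2

# Sort class indices by descending probability and keep the first k
topk_idx = np.argsort(-probs, axis=1)[:, :k]
topk_prob = np.take_along_axis(probs, topk_idx, axis=1)
topk_labels = [[label_list[i] for i in row] for row in topk_idx]
print(topk_idx.tolist(), topk_prob.tolist(), topk_labels)
```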


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we have some metadata, we need to define a dictionary (to imitate a DatasetDict)
raw_content={
    'Source': 'Google play',
    'iteration':21,
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,topk=1)
df_result

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 4981.36it/s]


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,iteration,pred_L1,pred_prob_L1
0,google play - Tôi không thích Shopee . Tại vì ...,google play,21,Feature,0.997374


In [None]:
raw_content={
    'Source': ['Google play','Owned'],
    'iteration':[21,21],
    'Content':['Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc',
               'App này xài được. Mua đồ rẻ ghê, được voucher nhiều nữa']
            }
df_result = controller.predict_raw_text(raw_content,topk=2)
df_result

100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 6887.20it/s]


Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,iteration,pred_L1,pred_prob_L1,pred_L1_top1,pred_L1_top2,pred_prob_L1_top1,pred_prob_L1_top2
0,google play - Tôi không thích Shopee . Tại vì ...,google play,21,"[3, 9]","[0.99737394, 0.00059575593]",Feature,Shopee account,0.997374,0.000596
1,"owned - App này xài được . Mua đồ rẻ ghê , đượ...",owned,21,"[1, 3]","[0.99925727, 0.00019265412]",Commercial,Feature,0.999257,0.000193
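
The three input shapes used above (a plain string, a dictionary of scalars, and a dictionary of lists) can all be viewed as rows of a table. The sketch below shows one way to normalize them into a DataFrame. `raw_to_frame` is a hypothetical helper written for this illustration, and how `predict_raw_text` handles its input internally is not shown in this notebook.

```python
import pandas as pd

def raw_to_frame(raw_content, text_col='Content'):
    # Hypothetical helper, not part of the library
    if isinstance(raw_content, str):
        raw_content = {text_col: raw_content}
    # Wrap scalar values so that every column is a list of equal length
    as_lists = {k: v if isinstance(v, list) else [v] for k, v in raw_content.items()}
    return pd.DataFrame(as_lists)

single = raw_to_frame({'Source': 'Google play', 'iteration': 21, 'Content': 'App rất chậm'})
batch = raw_to_frame({'Source': ['Google play', 'Owned'],
                      'iteration': [21, 21],
                      'Content': ['App rất chậm', 'Mua đồ rẻ ghê']})
plain = raw_to_frame('App rất chậm')
print(len(single), len(batch), list(plain.columns))
```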
