# Model Controller Tutorial: EnviBert model (Multi Label)

> This notebook contains some examples of how to use the EnviBert-based models in this NLP library

- skip_showdoc: true
- skip_exec: true

We will walk through other cases of classification: multi-head and multi-label. Since the goal is to showcase the capability of this library in these cases, this notebook won't be as detailed as [this tutorial](https://anhquan0412.github.io/that-nlp-library/model_main_envibert.html)

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

## Load data

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
import os
from transformers import DataCollatorWithPadding
import pandas as pd

Define some necessary text augmentations and text transformations

> For Text Transformation

In [None]:
txt_tfms=[text_normalize]

> For Text Augmentation

In [None]:
over_nonown_tfm = partial(sampling_with_condition,query='Source=="non owned"',frac=0.5,seed=42,apply_to_all=False)
over_nonown_tfm.__name__ = 'Oversampling Non Owned'

over_own_tfm = partial(sampling_with_condition,query='Source=="owned"',frac=2,seed=42,apply_to_all=False)
over_own_tfm.__name__ = 'Oversampling Owned'

over_hc_tfm = partial(sampling_with_condition,query='Source=="hc search"',frac=2.5,seed=42,apply_to_all=False)
over_hc_tfm.__name__ = 'Oversampling HC search'

remove_accent_tfm = partial(remove_vnmese_accent,frac=1,seed=42,apply_to_all=True)
remove_accent_tfm.__name__ = 'Add No-Accent Text'

aug_tfms = [over_nonown_tfm,over_own_tfm,over_hc_tfm,remove_accent_tfm]
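Each augmentation above is a `functools.partial` object, and such objects do not carry a `__name__` of their own. That is presumably why a name is assigned by hand: the same names appear later in the processing log. A minimal sketch, using a hypothetical stand-in function `sample_rows` that is not part of the library:

```python
from functools import partial

def sample_rows(rows, frac=1.0):
    """Hypothetical stand-in: keep the first `frac` share of the rows."""
    return rows[: int(len(rows) * frac)]

half_tfm = partial(sample_rows, frac=0.5)

# A partial object has no __name__ until we assign one.
has_name_before = hasattr(half_tfm, "__name__")
half_tfm.__name__ = "Keep Half"
has_name_after = hasattr(half_tfm, "__name__")

result = half_tfm([1, 2, 3, 4])
print(has_name_before, has_name_after, half_tfm.__name__, result)
```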

Create a TextDataMain object

In [None]:
DATA_PATH = Path('secret_data')

df = TextDataMain.from_csv(DATA_PATH/'buyer_listening_with_all_raw_data_w2223.csv',return_df=True)

#Quick preprocess of data and train/validation split. 
#Due to custom logic, we will sample our data here instead of using the `train_ratio` from the `to_datasetdict` function

df_rare = df[df.L2.isin(['Chatbot', 'Commercial Others'])].copy()

df_final = pd.concat([df.query('iteration==1').sample(500,random_state=42),
                      df.query('iteration>=7 & iteration<13').sample(1200,random_state=42),
                      df_rare,
                      df.query('iteration>=13'),
                     ],axis=0).reset_index(drop=True)

val_idxs = df_final[df_final.iteration>=13].index.values # from week 9

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    65804
dtype: int64
DataFrame contains duplicated values!
-----> Number of duplications: 7 rows


Since the problem is not multi-label by itself, we will turn it into one by combining the L1 and L2 labels of each row into a single list

In [None]:
df_final['L1L2'] = df_final[['L1','L2']].values.tolist()

In [None]:
df_final.head(5)

| | Week | Group | Source | Content | L1 | L2 | L3 | L4 | is_valid | iteration | L1L2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46.0 | Google Play | Google Play | Kog zô dx xóa tải lại củg kog zô dx luôn vãi c... | Feature | App performance | Lag when browsing in general | Tech | | 1 | [Feature, App performance] |
| 1 | 32.0 | Google Play | Google Play | Tôi mêt moi voi cái ung dung nay quá lam sao x... | Others | Cannot defined | - | - | | 1 | [Others, Cannot defined] |
| 2 | 10.0 | Google Play | Google Play | Lag vc fix đi shoper | Feature | App performance | Lag when browsing in general | Tech | | 1 | [Feature, App performance] |
| 3 | 20.0 | Google Play | Google Play | Gỡ | Others | Cannot defined | - | - | | 1 | [Others, Cannot defined] |
| 4 | 31.0 | Google Play | Google Play | Ko mở đc chán nản, báo update, update ko dc, v... | Feature | App performance | App installment problem | Tech | | 1 | [Feature, App performance] |


In [None]:
tdm = TextDataMain(df_final,
                    main_content='Content',
                    metadatas='Source',
                    label_names='L1L2',
                    val_ratio=val_idxs,
                    split_cols='L2',
                    content_tfms = txt_tfms,
                    aug_tfms = aug_tfms,
                    process_metadatas=True,
                    seed=42,
                    shuffle_trn=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    498
dtype: int64


Define our tokenizer for EnviBert

In [None]:
cache_dir=Path('./envibert_tokenizer')
tokenizer = SourceFileLoader("envibert.tokenizer", 
                             str(cache_dir/'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)
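`SourceFileLoader.load_module()` is deprecated in recent Python versions. If it raises a warning or fails in your environment, the same file can be loaded through `importlib.util`. The sketch below shows the pattern on a throwaway file; `load_module_from_path` and `DemoTokenizer` are hypothetical names used only for illustration, and the real tokenizer file is not touched:

```python
import importlib.util
import tempfile
from pathlib import Path

def load_module_from_path(module_name, file_path):
    """Load a Python source file as a module without `load_module()`."""
    spec = importlib.util.spec_from_file_location(module_name, str(file_path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demo with a temporary file standing in for `envibert_tokenizer.py`.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_tokenizer.py"
    demo_file.write_text("class DemoTokenizer:\n    name = 'demo'\n")
    demo_module = load_module_from_path("demo_tokenizer", demo_file)
    demo_name = demo_module.DemoTokenizer.name

print(demo_name)
```

With the real file, the call would presumably look like `load_module_from_path("envibert_tokenizer", cache_dir/'envibert_tokenizer.py').RobertaTokenizer(cache_dir)`.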

Create our DatasetDict from TextDataMain (as our `ModelController` class can also work with DatasetDict)

In [None]:
main_ddict= tdm.to_datasetdict(tokenizer,
                               max_length=512)

-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
----- Label Encoding -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 6649/6649 [00:01<00:00, 3621.52it/s]


-------------------- Train Test Split --------------------
Previous Validation Percentage: 74.101%
- Before leak check
Size: 4927
- After leak check
Size: 4885
- Number of rows leaked: 42, or 0.85% of the original validation (or test) data
Current Validation Percentage: 73.47%
-------------------- Text Augmentation --------------------
Train data size before augmentation: 1764
----- Oversampling Non Owned -----
Train data size after THIS augmentation: 2229
----- Oversampling Owned -----
Train data size after THIS augmentation: 2789
----- Oversampling HC search -----
Train data size after THIS augmentation: 2904
----- Add No-Accent Text -----


100%|█████████████████████████████████████| 2904/2904 [00:00<00:00, 9960.57it/s]


Train data size after THIS augmentation: 5808
Train data size after ALL augmentation: 5808
-------------------- Map Tokenize Function --------------------



In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5808
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4885
    })
})
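The validation split holds 4885 rows rather than the 4927 reported before the leak check, because 42 rows that also occur in the training data were dropped (42/4927 ≈ 0.85%). The library's exact matching rule is not shown in this notebook; the sketch below assumes a plain exact-text match and uses hypothetical names and toy data:

```python
def drop_leaked_rows(train_texts, val_texts):
    """Return the validation texts that do not also occur in the training texts."""
    seen = set(train_texts)
    kept = [text for text in val_texts if text not in seen]
    return kept, len(val_texts) - len(kept)

train = ["app is slow", "late delivery", "good"]
val = ["late delivery", "cannot log in", "good", "voucher failed"]

kept, n_leaked = drop_leaked_rows(train, val)
pct = 100 * n_leaked / len(val)  # share of the original validation rows
print(f"Rows leaked: {n_leaked}, or {pct:.2f}% of the original validation data")
```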

In [None]:
print(main_ddict['validation']['label'][:2])

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
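Each label is stored as a multi-hot vector: one slot per class, with a 1 for every class attached to the row. The sketch below illustrates the idea on a toy vocabulary. The alphabetical ordering is an assumption (it happens to agree with the order of the label strings shown in this notebook), and `multi_hot` is a hypothetical helper, not a library function:

```python
def multi_hot(labels, vocab):
    """Encode a list of label names as a 0/1 vector over `vocab`."""
    index = {name: i for i, name in enumerate(vocab)}
    vector = [0] * len(vocab)
    for name in labels:
        vector[index[name]] = 1
    return vector

# Toy vocabulary; the real one is built by the library from the L1 and L2 values.
vocab = sorted({"Feature", "App performance", "Others", "Cannot defined", "Delivery"})
encoded = multi_hot(["Others", "Cannot defined"], vocab)
decoded = [vocab[i] for i, flag in enumerate(encoded) if flag == 1]
print(vocab)
print(encoded, decoded)
```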


# Model Experiment: EnviBert Multi-Label Classification (with Hidden Layer Concatenation)

In [None]:
from that_nlp_library.models.roberta.classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score
import os
from transformers.models.roberta.modeling_roberta import RobertaModel

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

## Train EnviBert (with hidden layer concatenation), using TDM

In [None]:
model_name='nguyenvulebinh/envibert'
envibert_body = RobertaModel.from_pretrained(model_name)
num_classes = len(tdm.label_lists[0])

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's create our model controller

In [None]:
_model_kwargs={
    # overall model hyperparams
    'layer2concat':4,
    'is_multilabel':tdm.is_multilabel, # True
    'is_multihead':tdm.is_multihead, # False
    'head_class_sizes':num_classes,
    'head_class': ConcatHeadSimple,
    # classification head hyperparams
    'classifier_dropout':0.1 
}

model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                  cpoint_path = model_name, 
                                  output_hidden_states=True, # required, since we are using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score] # we will use both f1_macro and accuracy score as metrics
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture
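For intuition, hidden layer concatenation usually means taking the first-token vector from each of the last few encoder layers and joining them into one feature vector for the classification head. This is why `output_hidden_states=True` is needed: the body must return every layer, not just the last one. The sketch below is a plain-Python illustration under that assumption; the actual `ConcatHeadSimple` implementation may differ in its details:

```python
def concat_cls_from_last_layers(hidden_states, n_layers):
    """Concatenate the first-token vector of the last `n_layers` layers.

    `hidden_states` is a list of layers; each layer is a list of token vectors.
    """
    concatenated = []
    for layer in hidden_states[-n_layers:]:
        concatenated.extend(layer[0])  # index 0 is the first (<s>/CLS) token
    return concatenated

# Toy example: 6 layers, 3 tokens, hidden size 2.
hidden_states = [[[float(layer), float(tok)] for tok in range(3)] for layer in range(6)]
features = concat_cls_from_last_layers(hidden_states, n_layers=4)
print(len(features), features)
```

With `layer2concat` set to 4, the head would then receive a vector four times the hidden size.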


And we can start training our model. Pay attention to the compute_metrics function

In [None]:
lr = 6e-5
bs=4
wd=0.01
epochs= 4

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=partial(compute_metrics_classification,
                                       is_multilabel=tdm.is_multilabel,
                                       multilabel_threshold=0.5)
              )
# Epoch	Training Loss	Validation Loss	F1 Score L1l2	Accuracy Score L1l2
# 1	No log	0.082577	0.028576	0.111771
# 2	0.090200	0.076037	0.057607	0.223337
# 3	0.090200	0.072825	0.088066	0.245241
# 4	0.023500	0.073391	0.095394	0.253019



| Epoch | Training Loss | Validation Loss | F1 Score L1l2 | Accuracy Score L1l2 |
|---|---|---|---|---|
| 1 | No log | 0.085917 | 0.021178 | 0.111771 |
| 2 | 0.098200 | 0.078571 | 0.053714 | 0.21085 |
| 3 | 0.098200 | 0.074122 | 0.078145 | 0.231116 |
| 4 | 0.027900 | 0.073689 | 0.079069 | 0.237666 |


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))
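In a multi-label setting, metrics are computed on thresholded predictions rather than on an argmax. scikit-learn's `accuracy_score` on multi-label indicator input is the subset accuracy, where a row counts as correct only if every label matches, so it tends to be a strict number. The sketch below assumes sigmoid probabilities and a `>=` comparison against `multilabel_threshold`; the library's own `compute_metrics_classification` may differ:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_multilabel(logits, threshold=0.5):
    """Turn one row of logits into a 0/1 vector by thresholding sigmoid probabilities."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

def subset_accuracy(y_true, y_pred):
    """Share of rows whose predicted vector matches the true vector exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

logits = [[2.0, -3.0, 1.5], [0.2, -1.0, -2.0], [-4.0, -4.0, -4.0]]
y_true = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [predict_multilabel(row) for row in logits]
acc = subset_accuracy(y_true, y_pred)
print(y_pred, acc)
```

Only the first toy row matches exactly, even though the second row gets two of its three slots right.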


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/my_model')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'layer2concat': 4,
 'is_multilabel': True,
 'is_multihead': False,
 'head_class_sizes': 76,
 'head_class': that_nlp_library.models.roberta.classifiers.ConcatHeadSimple,
 'classifier_dropout': 0.1}

In [None]:
trained_model = model_init_classification(model_class = RobertaHiddenStateConcatForSequenceClassification,
                                          cpoint_path = Path('./sample_weights/my_model'), 
                                          output_hidden_states=True,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,tdm,metric_funcs)

Some weights of the model checkpoint at sample_weights/my_model were not used when initializing RobertaHiddenStateConcatForSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHiddenStateConcatForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict Train/Validation set

Make predictions on the entire validation set. Note that you can set the probability threshold.

In [None]:
df_val = controller.predict_ddict(ds_type='validation',multilabel_threshold=0.5,batch_size=8)

-------------------- Start making predictions --------------------



In [None]:
df_val.head()

| | text | label | Source | pred_L1L2 | pred_prob_L1L2 | pred_L1L2_string |
|---|---|---|---|---|---|---|
| 0 | google play - lam phien | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... | google play | [False, False, False, False, False, False, Fal... | [0.00043046146, 0.00057243364, 0.0015474476, 0... | Cannot defined,Others |
| 1 | google play - .. t . À mà họ nữ ưu m | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... | google play | [False, False, False, False, False, False, Fal... | [0.00048226005, 0.0006290678, 0.001644549, 0.0... | Cannot defined,Others |
| 2 | google play - Cc lùa dao | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ... | google play | [False, False, False, False, False, False, Fal... | [0.0004853605, 0.00060451776, 0.0016006823, 0.... | Cannot defined,Others |
| 3 | google play - Mặt hàng sp mình đều nhỡ với Gia... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | google play | [False, False, False, False, False, False, Fal... | [0.004979573, 0.007575649, 0.0147191025, 0.001... | Delivery |
| 4 | google play - Chưa tối ưu tốt cho Android Oppo... | [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | google play | [False, False, False, False, False, False, Fal... | [0.00052510685, 0.0007405444, 0.0019072617, 0.... | Cannot defined,Others |


To convert the label indices to strings, we can use the ```label_lists``` attribute of `tdm`

In [None]:
import pandas as pd
import numpy as np

In [None]:
def get_label_str_multilabel(row):
    # Positions of the active (1) entries; np.asarray also handles plain Python lists
    indices=np.where(np.asarray(row)==1)[0]
    return ','.join([tdm.label_lists[0][i] for i in indices])

In [None]:
df_val['label_str']=df_val['label'].apply(get_label_str_multilabel)

In [None]:
df_val[['label_str','pred_L1L2_string']]

| | label_str | pred_L1L2_string |
|---|---|---|
| 0 | Cannot defined,Others | Cannot defined,Others |
| 1 | Cannot defined,Others | Cannot defined,Others |
| 2 | Cannot defined,Others | Cannot defined,Others |
| 3 | Delivery,Delivery time | Delivery |
| 4 | App performance,Feature | Cannot defined,Others |
| ... | ... | ... |
| 4880 | Commercial,Flash Sale/Campaigns | Cannot defined,Others |
| 4881 | Delivery,Driver | |
| 4882 | Other payment methods,Payment | Cannot defined,Others |
| 4883 | Commercial,Flash Sale/Campaigns | Cannot defined,Others |
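To quantify how often the two columns agree, the comma-joined strings can be compared as sets, so that label order does not matter. An empty prediction string (as in row 4881) presumably means that no class probability reached the threshold. The helper below is a hypothetical sketch and uses three pairs copied from the table:

```python
def to_label_set(label_str):
    """Split a comma-joined label string into a set, ignoring empty parts."""
    return {part for part in label_str.split(",") if part}

def same_label_set(true_str, pred_str):
    """True when both strings name exactly the same labels."""
    return to_label_set(true_str) == to_label_set(pred_str)

pairs = [
    ("Cannot defined,Others", "Cannot defined,Others"),
    ("Delivery,Delivery time", "Delivery"),
    ("Delivery,Driver", ""),
]
matches = [same_label_set(t, p) for t, p in pairs]
exact_match_rate = sum(matches) / len(matches)
print(matches, exact_match_rate)
```

On the full dataframe, the same function could be applied row by row over `label_str` and `pred_L1L2_string`.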


### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We will remove all the labels and unnecessary columns

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)

In [None]:
df_test.head()

| | Source | Content |
|---|---|---|
| 0 | Google Play | App ncc lúc nào cx lag đơ, phần tìm kiếm thì v... |
| 1 | Non Owned | ..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope... |
| 2 | Google Play | Mắc gì người ta đặt hàng toàn lỗi 😃???? |
| 3 | Owned | #GhienShopeePayawardT8 Khi bạn chơi shopee quá... |
| 4 | Google Play | Rất bức xúc khi dùng . mã giảm giá người dùng ... |


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|█████████████████████████████████████| 2269/2269 [00:00<00:00, 3972.37it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 2080
- Number of rows leaked: 189, or 8.33% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------



In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed and transformed (but not augmented) in the same way as the validation set. Now we can start making predictions

In [None]:
# controller = ModelController(model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',multilabel_threshold=0.5)

-------------------- Start making predictions --------------------



In [None]:
df_result.head()

| | text | Source | pred_L1L2 | pred_prob_L1L2 | pred_L1L2_string |
|---|---|---|---|---|---|
| 0 | google play - App ncc lúc nào cx lag đơ , phần... | google play | [False, False, False, False, False, True, Fals... | [0.0025301736, 0.0055451845, 0.044778068, 0.00... | App performance,Feature |
| 1 | non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn... | non owned | [False, False, False, False, False, False, Fal... | [0.00043563428, 0.00053482514, 0.0008672758, 0... | Cannot defined,Others |
| 2 | google play - Mắc gì người ta đặt hàng toàn lỗ... | google play | [False, False, False, False, False, False, Fal... | [0.0018123146, 0.0027692849, 0.0063686953, 0.0... | |
| 3 | owned - # GhienShopeePayawardT8 Khi bạn chơi s... | owned | [False, False, False, False, False, False, Fal... | [0.0011939979, 0.0015298241, 0.0022472697, 0.0... | Commercial,Shopee Programs |
| 4 | google play - Rất bức xúc khi dùng . mã giảm g... | google play | [False, False, False, False, False, False, Fal... | [0.002711329, 0.006989172, 0.05613184, 0.00163... | Feature |


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we have some metadatas, we need to define a dictionary (to imitate a DatasetDict)
raw_content={
    'Source': 'Google play',
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,multilabel_threshold=0.5)
df_result

100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 4951.95it/s]



| | text | Source | pred_L1L2 | pred_prob_L1L2 | pred_L1L2_string |
|---|---|---|---|---|---|
| 0 | google play - Tôi không thích Shopee . Tại vì ... | google play | [False, False, False, False, False, True, Fals... | [0.0017681926, 0.003631191, 0.018489178, 0.001... | App performance,Feature |
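Judging from the `text` column, the metadata value is lower-cased and placed in front of the content with a ` - ` separator before the text is normalized and tokenized. The sketch below mimics only that visible behaviour; `prepend_metadata` is a hypothetical helper and the library's actual processing may do more:

```python
def prepend_metadata(record, metadata_keys, content_key, sep=" - "):
    """Lower-case each metadata value and join it in front of the main content."""
    parts = [str(record[key]).lower() for key in metadata_keys]
    return sep.join(parts + [record[content_key]])

raw = {"Source": "Google play", "Content": "App rất chậm"}
text = prepend_metadata(raw, ["Source"], "Content")
print(text)
```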
