# Model Controller Tutorial: EnviBert model with Deep Hierarchical Classification

> This notebook contains examples of how to use the EnviBert-based models in this NLP library

- skip_showdoc: true
- skip_exec: true

We will walk through other classification cases: multi-head and multi-label. Since the goal is to showcase the library's capability in these cases, this notebook is not as detailed as [this tutorial](https://anhquan0412.github.io/that-nlp-library/model_main_envibert.html)

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

## Load data

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main import *

In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from importlib.machinery import SourceFileLoader
from transformers import DataCollatorWithPadding
import pandas as pd
import numpy as np

Define some necessary text augmentations and text transformations

> For Text Transformation

In [None]:
txt_tfms=[text_normalize]

> For Text Augmentation

In [None]:
over_nonown_tfm = partial(sampling_with_condition,query='Source=="non owned"',frac=0.5,seed=42,apply_to_all=False)
over_nonown_tfm.__name__ = 'Oversampling Non Owned'

over_own_tfm = partial(sampling_with_condition,query='Source=="owned"',frac=2,seed=42,apply_to_all=False)
over_own_tfm.__name__ = 'Oversampling Owned'

over_hc_tfm = partial(sampling_with_condition,query='Source=="hc search"',frac=2.5,seed=42,apply_to_all=False)
over_hc_tfm.__name__ = 'Oversampling HC search'

remove_accent_tfm = partial(remove_vnmese_accent,frac=1,seed=42,apply_to_all=True)
remove_accent_tfm.__name__ = 'Add No-Accent Text'

aug_tfms = [over_nonown_tfm,over_own_tfm,over_hc_tfm,remove_accent_tfm]
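Each augmentation above is a `functools.partial` object with a `__name__` assigned by hand, and those names later show up in the processing log (for example `----- Oversampling Non Owned -----`). The sketch below shows the pattern on a toy function; `scale` and `describe` are hypothetical helpers for illustration only and are not part of the library.

```python
from functools import partial

def scale(values, factor=1.0):
    # Hypothetical stand-in for an augmentation function
    return [v * factor for v in values]

def describe(tfm):
    # Fall back to the class name when no __name__ is available
    return getattr(tfm, "__name__", type(tfm).__name__)

double_tfm = partial(scale, factor=2.0)
double_tfm.__name__ = "Double Values"  # readable label for logs

print(describe(double_tfm))
print(double_tfm([1, 2, 3]))
```

Assigning the name right after building the partial keeps the label next to the parameters it describes.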

Let's load and preprocess our data

In [None]:
DATA_PATH = Path('secret_data')

In [None]:
df = TextDataMain.from_csv(DATA_PATH/'buyer_listening_with_all_raw_data_w2223.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    65804
dtype: int64
DataFrame contains duplicated values!
-----> Number of duplications: 7 rows


In [None]:
df.head(3)

Unnamed: 0,Week,Group,Source,Content,L1,L2,L3,L4,is_valid,iteration
0,1.0,Google Play,Google Play,Tại sao cứ hiện thông báo,Services,Shopee communication channels,Annoying pop-up ads,Non-tech,,1
1,1.0,Google Play,Google Play,Mlem,Others,Cannot defined,-,-,,1
2,1.0,Google Play,Google Play,1 số sản phẩm trong giỏ hàng vừa đc cập nhật t...,Feature,Cart & Order,Cart issues/suggestions,Tech,,1


Quickly preprocess the data and create the train/validation split. Due to custom logic, we sample the data here instead of using the `train_ratio` from the `to_datasetdict` function

In [None]:
df_rare = df[df.L2.isin(['Chatbot', 'Commercial Others'])].copy()

df_final = pd.concat([df.query('iteration==1').sample(500,random_state=42),
                      df.query('iteration>=7 & iteration<13').sample(1200,random_state=42),
                      df_rare,
                      df.query('iteration>=13'),
                     ],axis=0).reset_index(drop=True)

val_idxs = df_final[df_final.iteration>=13].index.values # from week 9

In [None]:
tdm = TextDataMain(df_final,
                    main_content='Content',
                    metadatas='Source',
                    label_names=['L1','L2'],
                    val_ratio=val_idxs,
                    split_cols='L2',
                    content_tfms = txt_tfms,
                    aug_tfms = aug_tfms,
                    process_metadatas=True,
                    seed=42,
                    shuffle_trn=True)

----- Input Validation Precheck -----
DataFrame contains missing values!
-----> List of columns and the number of missing values for each
is_valid    498
dtype: int64


Define our tokenizer for EnviBert

In [None]:
cache_dir=Path('./envibert_tokenizer')
tokenizer = SourceFileLoader("envibert.tokenizer", 
                             str(cache_dir/'envibert_tokenizer.py')).load_module().RobertaTokenizer(cache_dir)

Create our DatasetDict from TextDataMain (as our `ModelController` class can also work with DatasetDict)

In [None]:
main_ddict= tdm.to_datasetdict(tokenizer,
                               max_length=512,
                               )

-------------------- Start Main Text Processing --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
----- Label Encoding -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|██████████| 6649/6649 [00:01<00:00, 3560.77it/s]


-------------------- Train Test Split --------------------
Previous Validation Percentage: 74.101%
- Before leak check
Size: 4927
- After leak check
Size: 4885
- Number of rows leaked: 42, or 0.85% of the original validation (or test) data
Current Validation Percentage: 73.47%
-------------------- Text Augmentation --------------------
Train data size before augmentation: 1764
----- Oversampling Non Owned -----
Train data size after THIS augmentation: 2229
----- Oversampling Owned -----
Train data size after THIS augmentation: 2789
----- Oversampling HC search -----
Train data size after THIS augmentation: 2904
----- Add No-Accent Text -----


100%|██████████| 2904/2904 [00:00<00:00, 10469.09it/s]


Train data size after THIS augmentation: 5808
Train data size after ALL augmentation: 5808
-------------------- Map Tokenize Function --------------------


Map:   0%|          | 0/5808 [00:00<?, ? examples/s]

Map:   0%|          | 0/4885 [00:00<?, ? examples/s]
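The log above reports a leak check that shrinks the validation set from 4927 to 4885 rows. A minimal sketch of that idea, assuming a row counts as leaked when its exact text also appears in the training set (the library's actual rule may differ; `remove_leaked_rows` is a hypothetical name):

```python
def remove_leaked_rows(train_texts, val_texts):
    """Drop validation texts that also occur in the training texts."""
    seen = set(train_texts)
    kept = [text for text in val_texts if text not in seen]
    return kept, len(val_texts) - len(kept)

train = ["app is slow", "great deals"]
val = ["app is slow", "delivery was late", "cannot log in"]

kept, n_leaked = remove_leaked_rows(train, val)
print(f"Rows leaked: {n_leaked}, or {n_leaked / len(val):.2%} of validation")
```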

In [None]:
main_ddict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5808
    })
    validation: Dataset({
        features: ['text', 'label', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4885
    })
})

In [None]:
main_ddict['validation']['label'][:5]

[[5, 12], [5, 12], [5, 12], [2, 23], [3, 5]]
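Each validation label is a pair of integer indices, one per head (`L1`, then `L2`). The sketch below shows one way such pairs can be produced, assuming each head's label list is the sorted set of its unique strings (this matches the index order visible in the prediction tables later on, but it is an assumption about the library, and `encode_multihead` is a hypothetical helper).

```python
def encode_multihead(rows):
    """Encode rows of label strings into per-head integer indices."""
    n_heads = len(rows[0])
    # One sorted list of unique label strings per head
    label_lists = [sorted({row[h] for row in rows}) for h in range(n_heads)]
    lookups = [{name: i for i, name in enumerate(lst)} for lst in label_lists]
    encoded = [[lookups[h][row[h]] for h in range(n_heads)] for row in rows]
    return label_lists, encoded

rows = [
    ("Feature", "App performance"),
    ("Others", "Cannot defined"),
    ("Feature", "Cart & Order"),
]
label_lists, encoded = encode_multihead(rows)
print(label_lists)
print(encoded)
```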

# Model Experiment: EnviBert Multi-Head Classification (with Hidden Layer Concatenation)

In [None]:
from that_nlp_library.models.roberta.deep_hierarchical_classifiers import *
from that_nlp_library.model_main import *
from sklearn.metrics import f1_score, accuracy_score
import os
from transformers.models.roberta.modeling_roberta import RobertaModel
import torch

This specifies which GPU (or list of GPUs) to use for training

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

## Build DHC Conditional Mask

In [None]:
df_labels = tdm.df[tdm.label_names]

In [None]:
df_labels.head()

Unnamed: 0,L1,L2
0,1,59
1,5,12
2,1,59
3,1,43
4,5,12


In [None]:
dhc_mask = build_DHC_conditional_mask(df_labels,*tdm.label_names)

In [None]:
dhc_mask.shape

torch.Size([10, 66])

Inspect the first row of the mask (for the level 1 label "Buyer complained seller"): an entry is 1 where the level 2 label belongs to this level 1 label

In [None]:
dhc_mask[0]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
        1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Take the level 2 positions where this row of the mask is 1 and print the corresponding label strings

In [None]:
for i in torch.where(dhc_mask[0]==1)[0]:
    print(tdm.label_lists[1][i])

Customer service (didn't respond/impolite)
Illegal/counterfeit products
Product description
Product quality
Sellers cancelled order without any advanced notice/reason
Sellers cheated Buyers (Sellers tried to reach me outside of Shopee App)
Sellers packed fake orders
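The mask has one row per level 1 label and one column per level 2 label. A minimal sketch of how such a mask can be derived from observed `(L1, L2)` index pairs, using plain lists instead of tensors; this illustrates the idea and is not the implementation of `build_DHC_conditional_mask`:

```python
def build_conditional_mask(pairs, n_l1, n_l2):
    """mask[i][j] is 1.0 when level 2 label j was observed under level 1 label i."""
    mask = [[0.0] * n_l2 for _ in range(n_l1)]
    for l1, l2 in pairs:
        mask[l1][l2] = 1.0
    return mask

def is_consistent(mask, l1, l2):
    """Check that a predicted (L1, L2) pair respects the hierarchy."""
    return mask[l1][l2] == 1.0

# Toy label pairs: duplicates are harmless
pairs = [(0, 1), (0, 2), (1, 0), (1, 3), (0, 1)]
mask = build_conditional_mask(pairs, n_l1=2, n_l2=4)

print(mask[0])
print(is_consistent(mask, 1, 2))
```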


## Train EnviBert (with hidden layer concatenation), using TDM

In [None]:
model_name='nguyenvulebinh/envibert'
envibert_body = RobertaModel.from_pretrained(model_name)

Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
_model_kwargs={
    'dhc_mask':dhc_mask,
    'classifier_dropout':0.1,
    'last_hidden_size':768,  
    'linear_l1_size':389,
    'linear_l2_size':417,
    'lloss_weight':1.0,
    'dloss_weight':0.8,
    'layer2concat':4,
}

model = model_init_classification(model_class = RobertaHSCDHCSequenceClassification,
                                  cpoint_path = model_name, 
                                  output_hidden_states=True, # since we are using the 'hidden layer concatenation' technique
                                  seed=42,
                                  body_model=envibert_body,
                                  model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdm,metric_funcs)

Loading body weights. This assumes the body is the very first first-layer block of your custom architecture
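`output_hidden_states=True` makes the body return the hidden states of every layer, which the concatenation technique needs. The toy sketch below assumes that `layer2concat=4` means joining the first-token vectors of the last four layers into one feature vector; both the interpretation and the helper `concat_last_layers` are assumptions, and the library may select or order layers differently.

```python
def concat_last_layers(hidden_states, n_layers):
    """Concatenate the last n_layers per-layer vectors into one feature vector."""
    selected = hidden_states[-n_layers:]
    return [value for vector in selected for value in vector]

hidden_size = 3  # toy size; the config above uses 768
# One first-token vector per layer, filled with the layer index for readability
hidden_states = [[float(layer)] * hidden_size for layer in range(7)]

features = concat_last_layers(hidden_states, n_layers=4)
print(len(features))  # n_layers * hidden_size
```

Under this assumption, a hidden size of 768 with four layers would give the classification heads a 3072-dimensional input.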


And we can start training our model

In [None]:
lr = 9e-5
bs=4
wd=0.01
epochs= 4

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_separate_singleheads,
              )



Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,9.726516,0.240784,0.53654,0.032214,0.232958
2,7.853000,9.363424,0.409587,0.610645,0.067087,0.328147
3,7.853000,9.608892,0.463691,0.668987,0.092133,0.390993
4,6.093800,9.744572,0.47358,0.663664,0.099278,0.415353


In [None]:
controller.trainer.model.save_pretrained('./sample_weights/my_model')

## Predict using trained model, using TDM

### Load trained model

In [None]:
_model_kwargs

{'dhc_mask': tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
          1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0.,

In [None]:
trained_model = model_init_classification(model_class = RobertaHSCDHCSequenceClassification,
                                          cpoint_path = Path('./sample_weights/my_model'), 
                                          output_hidden_states=True,
                                          seed=42,
                                          model_kwargs = _model_kwargs)

metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(trained_model,tdm,metric_funcs)

Some weights of the model checkpoint at sample_weights/my_model were not used when initializing RobertaHSCDHCSequenceClassification: ['body_model.pooler.dense.weight', 'body_model.pooler.dense.bias']
- This IS expected if you are initializing RobertaHSCDHCSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHSCDHCSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Predict Train/Validation set

Make predictions on the entire validation set

In [None]:
df_val = controller.predict_ddict(ds_type='validation',is_dhc=True,batch_size=8)

-------------------- Start making predictions --------------------


Map:   0%|          | 0/4885 [00:00<?, ? examples/s]

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,google play - lam phien,"[5, 12]",google play,Feature,0.880263,App performance,0.696242
1,google play - .. t . À mà họ nữ ưu m,"[5, 12]",google play,Feature,0.781973,App performance,0.684697
2,google play - Cc lùa dao,"[5, 12]",google play,Others,0.986321,Cannot defined,0.979061
3,google play - Mặt hàng sp mình đều nhỡ với Gia...,"[2, 23]",google play,Delivery,0.910066,Delivery time,0.616661
4,google play - Chưa tối ưu tốt cho Android Oppo...,"[3, 5]",google play,Others,0.5883,Cannot defined,0.617693


To convert the label indices to strings, we can use the `label_lists` attribute of `tdm`

In [None]:
import pandas as pd

In [None]:
df_val[['label_L1','label_L2']] = pd.DataFrame(df_val.label.tolist(), index= df_val.index)

In [None]:
df_val.head()

Unnamed: 0,text,label,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2,label_L1,label_L2
0,google play - lam phien,"[5, 12]",google play,Feature,0.880263,App performance,0.696242,5,12
1,google play - .. t . À mà họ nữ ưu m,"[5, 12]",google play,Feature,0.781973,App performance,0.684697,5,12
2,google play - Cc lùa dao,"[5, 12]",google play,Others,0.986321,Cannot defined,0.979061,5,12
3,google play - Mặt hàng sp mình đều nhỡ với Gia...,"[2, 23]",google play,Delivery,0.910066,Delivery time,0.616661,2,23
4,google play - Chưa tối ưu tốt cho Android Oppo...,"[3, 5]",google play,Others,0.5883,Cannot defined,0.617693,3,5


In [None]:
df_val['label_L1']= df_val['label_L1'].apply(lambda x: tdm.label_lists[0][x]).values
df_val['label_L2']= df_val['label_L2'].apply(lambda x: tdm.label_lists[1][x]).values

In [None]:
f1_score(df_val.label_L1,df_val.pred_L1,average='macro'),f1_score(df_val.label_L2,df_val.pred_L2,average='macro')

(0.47357980728239585, 0.09925006584798883)
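The level 2 macro F1 is far below the level 2 accuracy reported during training. Macro averaging gives every class the same weight, so classes the model never predicts correctly pull the score down regardless of how rare they are. A small worked example with a hand-written macro F1 (a sketch for illustration; the notebook itself uses `sklearn.metrics.f1_score`):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# One dominant class and two rare ones; the model always predicts "a"
y_true = ["a"] * 8 + ["b", "c"]
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, round(score, 4))
```

Here the accuracy is 0.8, while the two missed classes each contribute an F1 of 0 and bring the macro average down to roughly 0.30.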

### Predict Test set

We will go through details on how to make a prediction on a completely new and raw dataset using our trained model. For now, let's reuse the sample csv and pretend it's our test set

In [None]:
df_test = TextDataMain.from_csv(Path('sample_data')/'sample_large.csv',return_df=True)

----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 16 rows


We keep the columns required by the training process and remove all the labels

In [None]:
df_test = df_test.drop(['L1','L2'],axis=1)

In [None]:
df_test.head()

Unnamed: 0,Source,Content
0,Google Play,"App ncc lúc nào cx lag đơ, phần tìm kiếm thì v..."
1,Non Owned,..❗️ GÓC THANH LÝ Tính ra rẻ hơn cả mua #Shope...
2,Google Play,Mắc gì người ta đặt hàng toàn lỗi 😃????
3,Owned,#GhienShopeePayawardT8 Khi bạn chơi shopee quá...
4,Google Play,Rất bức xúc khi dùng . mã giảm giá người dùng ...


We will create a DatasetDict for this test dataframe

In [None]:
test_ddict = tdm.get_test_datasetdict_from_df(df_test)

-------------------- Getting Test Set --------------------
----- Input Validation Precheck -----
DataFrame contains duplicated values!
-----> Number of duplications: 19 rows
-------------------- Start Test Set Transformation --------------------
----- Metadata Simple Processing & Concatenating to Main Content -----
-------------------- Text Transformation --------------------
----- text_normalize -----


100%|██████████| 2269/2269 [00:00<00:00, 3940.94it/s]


-------------------- Test Leak Checking --------------------
- Before leak check
Size: 2269
- After leak check
Size: 2080
- Number of rows leaked: 189, or 8.33% of the original validation (or test) data
-------------------- Construct DatasetDict --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
test_ddict

DatasetDict({
    test: Dataset({
        features: ['text', 'Source', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2269
    })
})

Our test data has been processed and transformed (but not augmented) the same way as the validation set. Now we can start making predictions

In [None]:
controller = ModelController(model,tdm)
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',is_dhc=True)

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

In [None]:
df_result.head()

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,Feature,0.903318,App performance,0.67487
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,Others,0.998537,Cannot defined,0.995626
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,Feature,0.715968,App performance,0.529486
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,Commercial,0.989516,Shopee Programs,0.993086
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,Feature,0.840763,Apply Voucher,0.310494


We can also return the top-k predictions for each head

In [None]:
df_result = controller.predict_ddict(ddict=test_ddict,ds_type='test',topk=3,is_dhc=True)
df_result.head()

-------------------- Start making predictions --------------------


Map:   0%|          | 0/2269 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2,pred_L1_top1,pred_L1_top2,pred_L1_top3,pred_prob_L1_top1,pred_prob_L1_top2,pred_prob_L1_top3,pred_L2_top1,pred_L2_top2,pred_L2_top3,pred_prob_L2_top1,pred_prob_L2_top2,pred_prob_L2_top3
0,"google play - App ncc lúc nào cx lag đơ , phần...",google play,"[3, 9, 4]","[0.90331787, 0.070163645, 0.013683925]","[5, 6, 64]","[0.6748701, 0.12654907, 0.05017495]",Feature,Shopee account,Order/Item,0.903318,0.070164,0.013684,App performance,Apply Voucher,Sign up/Log in,0.67487,0.126549,0.050175
1,non owned - .. ❗ ️ GÓC THANH LÝ Tính ra rẻ hơn...,non owned,"[5, 3, 2]","[0.9985366, 0.0007172561, 0.00040264206]","[12, 9, 5]","[0.99562645, 0.0017475911, 0.0010105681]",Others,Feature,Delivery,0.998537,0.000717,0.000403,Cannot defined,Branding,App performance,0.995626,0.001748,0.001011
2,google play - Mắc gì người ta đặt hàng toàn lỗ...,google play,"[3, 5, 4]","[0.7159677, 0.2576615, 0.010411925]","[5, 12, 6]","[0.52948564, 0.31033903, 0.07395215]",Feature,Others,Order/Item,0.715968,0.257661,0.010412,App performance,Cannot defined,Apply Voucher,0.529486,0.310339,0.073952
3,owned - # GhienShopeePayawardT8 Khi bạn chơi s...,owned,"[1, 6, 0]","[0.9895163, 0.009675543, 0.00036532234]","[59, 63, 29]","[0.993086, 0.003024094, 0.0012311569]",Commercial,Payment,Buyer complained seller,0.989516,0.009676,0.000365,Shopee Programs,ShopeePay,Games/Minigames,0.993086,0.003024,0.001231
4,google play - Rất bức xúc khi dùng . mã giảm g...,google play,"[3, 9, 4]","[0.8407633, 0.102363825, 0.034727756]","[6, 5, 64]","[0.3104941, 0.3058795, 0.09680278]",Feature,Shopee account,Order/Item,0.840763,0.102364,0.034728,Apply Voucher,App performance,Sign up/Log in,0.310494,0.30588,0.096803
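The `pred_*_topN` columns pair each of the k highest-probability indices with its label string. A minimal sketch of that selection on toy logits, using a hand-written softmax; the label list and logits are made up for illustration, and this is not the library's code path:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk(probs, k):
    """Return the indices and probabilities of the k largest entries."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return order, [probs[i] for i in order]

labels = ["Commercial", "Delivery", "Feature", "Others"]  # toy label list
logits = [0.1, 1.2, 3.0, 2.0]

probs = softmax(logits)
top_idx, top_probs = topk(probs, k=3)
top_names = [labels[i] for i in top_idx]
print(top_names, [round(p, 3) for p in top_probs])
```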


If we just want to make a prediction on a small amount of data (single sentence, or a few sentences), we can use `ModelController.predict_raw_text`

In [None]:
# Since we have some metadatas, we need to define a dictionary (to imitate a DatasetDict)
raw_content={
    'Source': 'Google play',
    'Content':'Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'
}

If we don't use metadata, we can use something like this: 

```raw_content='Tôi không thích Shopee.Tại vì dùng app rất chậm,lag banh nhà lầu, thậm chí log in còn không đc'```

In [None]:
df_result = controller.predict_raw_text(raw_content,topk=1,is_dhc=True)
df_result

100%|██████████| 1/1 [00:00<00:00, 4777.11it/s]


Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Unnamed: 0,text,Source,pred_L1,pred_prob_L1,pred_L2,pred_prob_L2
0,google play - Tôi không thích Shopee . Tại vì ...,google play,Feature,0.975475,App performance,0.919566
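In the output, the `Source` value appears lowercased and joined to the content with `" - "`. A minimal sketch of that metadata step, inferred only from the outputs shown in this notebook; `prepend_metadata` is a hypothetical helper, not the library's function:

```python
def prepend_metadata(record, content_key="Content", metadata_keys=("Source",), sep=" - "):
    """Lowercase each metadata value and join it in front of the main content."""
    metas = [str(record[key]).strip().lower() for key in metadata_keys]
    return sep.join(metas + [record[content_key]])

raw_content = {
    "Source": "Google play",
    "Content": "App is slow and I cannot log in",
}
text = prepend_metadata(raw_content)
print(text)
```

When no metadata is used, the raw string would be passed through unchanged, which is why a plain `str` input also works.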


# Try different model architecture (skip)

In [None]:
# from that_nlp_library.models.classifiers import *
from that_nlp_library.models.deep_hierarchical_classifiers import *
from that_nlp_library.model_main import *
from that_nlp_library.models.classifiers import *

In [None]:
from sklearn.metrics import f1_score, accuracy_score
import os
import torch

In [None]:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [None]:
df_labels = tdm.df[tdm.label_names]

dhc_mask = build_DHC_conditional_mask(df_labels,*tdm.label_names)

In [None]:
model_name='nguyenvulebinh/envibert'

_model_kwargs={
    'dhc_mask':dhc_mask,
    'classifier_dropout':0.1,
    'last_hidden_size':768,  
    'linear_l1_size':389,
    'linear_l2_size':417,
    'lloss_weight':1.0,
    'dloss_weight':0.8,
    
}

model = model_init_classification(model_class = RobertaHSCDHCSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=True, # since we are using 'hidden layer concatenation'
                                  seed=42,
                                  model_kwargs = _model_kwargs)
metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdm,metric_funcs)

lr = 7e-5
bs=8
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_separate_singleheads,
               head_sizes=[len(tdm.label_lists[0]),len(tdm.label_lists[1])]
              )


Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaHSCDHCSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaHSCDHCSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHSCDHCSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaHSCDHCSequenceClassification were not initialized from the model checkpoint at nguyenvulebinh/envibert and are newly initialized: ['linear_L2_logit.bias', 'linear_L1.weight', 'linear_L2_logit.weight', 'linear_L1_logit.bi

Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,7.445967,0.55545,0.740786,0.152149,0.603176
2,7.725500,7.214032,0.628025,0.763686,0.213925,0.654158




In [None]:
model_name='nguyenvulebinh/envibert'

_model_kwargs={
    'dhc_mask':dhc_mask,
    'classifier_dropout':0.1,
    'last_hidden_size':768,  
    'linear_l1_size':389,
    'linear_l2_size':417,
    'lloss_weight':1.0,
    'dloss_weight':0,
    
}

model = model_init_classification(model_class = RobertaHSCDHCSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=True, # since we are using 'hidden layer concatenation'
                                  seed=42,
                                  model_kwargs = _model_kwargs)
metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdm,metric_funcs)

lr = 7e-5
bs=8
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_separate_singleheads,
               head_sizes=[len(tdm.label_lists[0]),len(tdm.label_lists[1])]
              )




Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,2.323869,0.54626,0.737066,0.142331,0.594609
2,2.539800,2.057805,0.617023,0.762307,0.216919,0.656498




In [None]:
model_name='nguyenvulebinh/envibert'

_model_kwargs={
    'dhc_mask':dhc_mask,
    'lloss_weight':1.0,
    'dloss_weight':0.8,
    
}

model = model_init_classification(model_class = RobertaHSCSimpleDHCSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=True, # since we are using 'hidden layer concatenation'
                                  seed=42,
                                  model_kwargs = _model_kwargs)
metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdm,metric_funcs)

lr = 7e-5
bs=8
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_separate_singleheads,
               head_sizes=[len(tdm.label_lists[0]),len(tdm.label_lists[1])]
              )


Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaHSCSimpleDHCSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaHSCSimpleDHCSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHSCSimpleDHCSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaHSCSimpleDHCSequenceClassification were not initialized from the model checkpoint at nguyenvulebinh/envibert and are newly initialized: ['linear_L2_logit.bias', 'linear_L2_logit.weight', 'linear_L1_logi

Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,7.617188,0.609794,0.74112,0.087502,0.542332
2,7.910900,7.381319,0.67041,0.768282,0.133683,0.608692




In [None]:
model_name='nguyenvulebinh/envibert'

_model_kwargs={
    'dhc_mask':dhc_mask,
    'lloss_weight':1.0,
    'dloss_weight':0,
    
}

model = model_init_classification(model_class = RobertaHSCSimpleDHCSequenceClassification,
                                  cpoint_path = 'nguyenvulebinh/envibert', 
                                  output_hidden_states=True, # since we are using 'hidden layer concatenation'
                                  seed=42,
                                  model_kwargs = _model_kwargs)
metric_funcs = [partial(f1_score,average='macro'),accuracy_score]
controller = ModelController(model,tdm,metric_funcs)

lr = 7e-5
bs=8
wd=0.01
epochs= 2

controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               save_checkpoint=False,
               compute_metrics=compute_metrics_separate_singleheads,
               head_sizes=[len(tdm.label_lists[0]),len(tdm.label_lists[1])]
              )


Some weights of the model checkpoint at nguyenvulebinh/envibert were not used when initializing RobertaHSCSimpleDHCSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaHSCSimpleDHCSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaHSCSimpleDHCSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaHSCSimpleDHCSequenceClassification were not initialized from the model checkpoint at nguyenvulebinh/envibert and are newly initialized: ['linear_L2_logit.bias', 'linear_L2_logit.weight', 'linear_L1_logi

Epoch,Training Loss,Validation Loss,F1 Score L1,Accuracy Score L1,F1 Score L2,Accuracy Score L2
1,No log,2.511302,0.60482,0.737568,0.080265,0.540827
2,2.747700,2.248143,0.66746,0.765692,0.131651,0.606895


