### Introduction

Data augmentation experiments on the Apple Care 2 dataset highlighted several factors that help ensure the quality of the unlabeled dataset, which in turn improves model performance.

Experiment details are in [2022_01_13 biweekly discussion.pptx](https://jira.wisers.com:18090/download/attachments/82808396/2022_01_13%20biweekly%20discussion.pptx?version=1&modificationDate=1642063801000&api=v2).

For the code design, see [Confluence Proposed Module](https://jira.wisers.com:18090/display/RES/Proposed+Module2).

In [1]:
import os
src_dir = '../nlp_pipeline'
os.getcwd()

'/home/developer/Users/hinova/canton-target-sentiment/notebooks'

In [2]:
os.chdir(src_dir)
from nlp_pipeline.pipeline import Pipeline

### Model Generation

- Train a demo model to predict labels for the unlabeled data in this demonstration.

If you already have a trained model path:
- comment out the cell below
- comment out the last cell (it removes the demo model)

In [3]:
os.system(f"python run.py --config_dir={'../config/examples/sequence_classification/BERT_AVG_explain'}") 

None **********************************
/home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../config/examples/sequence_classification/BERT_AVG_explain/model *********************************************************
/home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../config/examples/sequence_classification/BERT_AVG_explain/model *****
/home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../config/examples/sequence_classification/BERT_AVG_explain/model/tokenizer


2022-02-28 03:58:10 ***** Args *****
2022-02-28 03:58:10    task: sequence_classification
2022-02-28 03:58:10    device: 0
2022-02-28 03:58:10    data: {'output_dir': '../config/examples/sequence_classification/BERT_AVG_explain', 'data_dir': '../data/datasets/sample/sequence_classification', 'train': 'train_sample.json', 'dev': 'train_sample.json', 'test': 'train_sample.json'}
2022-02-28 03:58:10    text_prepro: {'steps': ['utf8_replace', 'simplified_chinese', 'lower_case', 'full_to_half']}
2022-02-28 03:58:10    eval: {'batch_size': 64, 'model_file': 'model.pt'}
2022-02-28 03:58:10    train: {'model_class': 'BERT_AVG', 'seed': 42, 'log_steps': 100, 'batch_size': 32, 'final_model': 'best', 'optimization_metric': 'macro_f1', 'early_stop': 5}
2022-02-28 03:58:10    model_params: {'num_train_epochs': 2, 'embedding_trainable': True, 'output_hidden_act_func': 'PReLU', 'output_hidden_dim': 128, 'tokenizer_name': 'clue/albert_chinese_tiny', 'pretrained_lm': 'clue/albert_chinese_tiny'}
2022-02

['run.yaml', 'model.yaml', 'tokenizer']


2022-02-28 03:58:15 ***** Initializing model *****
2022-02-28 03:58:15   Task = sequence_classification
2022-02-28 03:58:15   Model class = BERT_AVG
2022-02-28 03:58:16 ***** Loading pretrained language model *****
2022-02-28 03:58:16   Pretrained BERT = 'clue/albert_chinese_tiny'
2022-02-28 03:58:22 ***** Loading data *****
2022-02-28 03:58:22   Data path = /home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../data/datasets/sample/sequence_classification/train_sample.json
3it [00:00, 82.61it/s]
2022-02-28 03:58:22   Loaded samples = 3
2022-02-28 03:58:22 ***** Loading data *****
2022-02-28 03:58:22   Data path = /home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../data/datasets/sample/sequence_classification/train_sample.json
3it [00:00, 118.25it/s]
2022-02-28 03:58:22   Loaded samples = 3
2022-02-28 03:58:22 ***** Running training *****
2022-02-28 03:58:22   Num examples = 3
2022-02-28 03:58:22   Num Epochs = 2
2022-02-28 03:58:22   Sampler = 
2022-02-2

0

### Arguments

Users can either sample according to the label ratio of the train set or request a specific label ratio. The choice depends on whether the following argument is provided:

- label_ratio (provide it only when a specific label ratio is required)

Remarks: a model directory is required, because the model predicts labels for the unlabeled data before sampling. The directory must include:
- model directory
    - run.yaml
    - model.yaml
    - label_to_id.json
    - model.pt
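
A quick pre-flight check can catch a missing file before the pipeline is initialized. The sketch below is illustrative only: `missing_model_files` is a hypothetical helper, not part of `nlp_pipeline`, and which directory to pass is an assumption (the logs in this notebook suggest the files sit in a `model` subfolder of the model directory).

```python
import os

# File names taken from the list above; the helper itself is hypothetical.
REQUIRED_MODEL_FILES = ("run.yaml", "model.yaml", "label_to_id.json", "model.pt")

def missing_model_files(directory, required=REQUIRED_MODEL_FILES):
    """Return the required file names that are absent from `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]

# Possible usage (assumed layout):
# missing_model_files(os.path.join(model_dir, 'model'))
```

An empty list means every required file was found.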

In [4]:
# comment out this line to sample according to the label ratio of the train set
label_ratio = {'-1': 0.4, '0': 0.2, '1': 0.4} # optional 

# unlabel path (json file name included)
unlabel_path = '../data/datasets/sample/sequence_classification/unlabeled_sample.json' 

# model directory, must include above files
model_dir = '../config/examples/sequence_classification/BERT_AVG_explain' # required argument 

# save directory
save_dir = '../data/datasets/sample/sequence_classification' # required 
save_data_file = 'sampled_unlabel_data.json' # required 
save_logit_file = 'sampled_unlabel_logits.pkl' # required 

# sample size and certainty
sample_size = 10 # required, integer and smaller than size of unlabeled data
certainty = 0 # optional, only select the data where max(p) >= certainty,
                # p is the predicted probabilities (0 - 1) over the label space
                # default certainty is 0

device = 0
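
To make the `certainty` threshold concrete, here is a small sketch with made-up probabilities (not model output). It mirrors the filter used later in this notebook, which keeps an example when its largest class probability is at least `certainty`. The name `certain_indices` is hypothetical.

```python
import numpy as np

# Hypothetical probabilities for four examples over three labels.
probs = np.array([
    [0.50, 0.30, 0.20],
    [0.34, 0.33, 0.33],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
])

def certain_indices(prob_np, certainty):
    """Indices of rows whose largest probability is >= certainty."""
    return np.flatnonzero(prob_np.max(axis=1) >= certainty).tolist()

kept_default = certain_indices(probs, 0)    # default threshold: nothing is filtered out
kept_strict = certain_indices(probs, 0.5)   # keeps rows 0, 2 and 3
```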

### Pipeline version

##### Load data

In [5]:
import json

# comment out train_raw_data and label if label_ratio is user-defined

# train_raw_data = json.load(open(f"../data/datasets/sample/sequence_classification/train_sample.json", 'r'))
# label = [str(train_raw_data[i]['label']) for i in range(len(train_raw_data))]
# print('The first three labels of trainset: \n', label[:3])

# load the unlabeled data and close the file afterwards
with open(unlabel_path, 'r') as infile:
    unlabel_raw_data = json.load(infile)

##### get label ratio

In [6]:
def get_label_ratio(label = None):
    '''
        input:
        - label: list of labels, or None to use the user-defined label_ratio

        output:
        - label_ratio: dict mapping each label to its share (rounded to 2 decimals)
    '''
    if label is None:
        if 'label_ratio' in globals():
            return label_ratio
        raise ValueError('Provide a label list or define label_ratio first.')
    result = {}
    for key in label:
        # accumulate each label's share of the list
        if key not in result:
            result[key] = 0
        result[key] = result[key] + 1/len(label)

    for key in result.keys():
        result[key] = round(result[key], 2)
    return result

# comment out if the label ratio is user-defined
# label_ratio = get_label_ratio(label)

# comment out if following the train set label ratio
label_ratio = get_label_ratio(None)

print('label ratio of train set: \n',label_ratio)

label ratio of train set: 
 {'-1': 0.4, '0': 0.2, '1': 0.4}


##### Run pipeline (Predict)

In [7]:

pipeline = Pipeline(
    model_dir=model_dir, 
    device=device,
)

2022-02-28 03:58:24 ***** Existing model is provided. *****
2022-02-28 03:58:24   Model directory = ../config/examples/sequence_classification/BERT_AVG_explain
2022-02-28 03:58:24 ***** Initializing pipeline *****
2022-02-28 03:58:24 ***** Loading tokenizer *****
2022-02-28 03:58:24   Tokenizer source = 'transformers'
2022-02-28 03:58:24 ***** Initializing model *****
2022-02-28 03:58:24   Task = sequence_classification
2022-02-28 03:58:24   Model class = BERT_AVG
2022-02-28 03:58:24   Model path = ../config/examples/sequence_classification/BERT_AVG_explain/model/model.pt


None **********************************
/home/developer/Users/hinova/canton-target-sentiment/nlp_pipeline/../config/examples/sequence_classification/BERT_AVG_explain/model *********************************************************
../config/examples/sequence_classification/BERT_AVG_explain/model/tokenizer
['run.yaml', 'model.yaml', 'tokenizer', 'label_to_id.json', 'model.pt']


2022-02-28 03:58:25 ***** Loading pretrained language model *****
2022-02-28 03:58:25   Pretrained BERT = 'clue/albert_chinese_tiny'


In [8]:
print("Input:")
print(unlabel_raw_data[0])

output = pipeline.predict(
    data_dict=unlabel_raw_data[0],
)

print("Output:")
print(output)

Input:
{'content': '\n\n2月9日，網上反映“一醫院領導拒絕戴口罩，途經卡點引發爭執”的視頻，新鄭市委高度重視，對此事進行了初步調查核實：\n\n2月8日22：00，新鄭市第三人民醫院副院長楚明輝從集中留觀隔離點結束工作返家途中，在龍湖雙湖大道疫情卡點接受檢查時，與卡點工作人員發生爭執，拒戴口罩，存在不當言行，造成了不良影響。新鄭市衛健委已經責成新鄭市第三人民醫院暫停楚明輝副院長職務。新鄭市紀委監委已成立調查組進行調查，調查結果及時向社會公佈。\n\n編輯：王淑\n\n聯繫記者\n'}
Output:
{'prediction_id': 0, 'prediction': '1', 'logits': [0.06629689037799835, -0.08389206230640411, 0.03349928930401802]}
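
The `logits` in the output are unnormalized scores. One way to read them as probabilities is a softmax, which is what the next cell does with PyTorch; the NumPy version below is only an equivalent sketch using the logits printed above.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D sequence of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()            # shift so the largest exponent is 0
    e = np.exp(z)
    return e / e.sum()

# Logits copied from the example output above.
logits = [0.06629689037799835, -0.08389206230640411, 0.03349928930401802]
probs = softmax(logits)
# The probabilities sum to 1 and keep the ordering of the logits.
```

Because these logits are close together, the resulting probabilities are all near one third, so a `certainty` threshold such as 0.5 would exclude this example.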


In [9]:
import torch
import torch.nn.functional as F
def predict_label(dataset, pipeline):
    '''
        input:
        - dataset: list
        - pipeline
        
        output:
        - list
    '''
    result = []
    for raw_data in dataset:
        output = pipeline.predict(
            data_dict=raw_data,
        )
        output['probabilities'] = F.softmax(torch.tensor(output["logits"]), dim=-1).cpu().tolist()
        result.append(output)
    return result
prediction = predict_label(unlabel_raw_data, pipeline)

In [10]:
pseudo_label_id = [pred['prediction'] for pred in prediction]
pseudo_label_ratio = get_label_ratio(pseudo_label_id)
print('pseudo label ratio of unlabel set: \n',pseudo_label_ratio)

pseudo label ratio of unlabel set: 
 {'1': 0.69, '-1': 0.28, '0': 0.03}


In [11]:
import numpy as np
(unique, counts) = np.unique(np.array(pseudo_label_id), return_counts=True)
frequencies = np.asarray((unique, counts)).T
print('Frequency of pseudo label (left column label, right column count): \n', frequencies)

Frequency of pseudo label (left column label, right column count): 
 [['-1' '18']
 ['0' '2']
 ['1' '44']]


### sampling 



In [12]:
import json
import random
import numpy as np

def collect_probability(prediction):
    # collect probability of prediction
    prob_ls = []
    for batch_pred in prediction:
        prob_ls.append(batch_pred['probabilities'])
    prob_np = np.array(prob_ls)
    return prob_np

def get_sampled_idx(prob_np, label_ratio, certainty, sample_size, label_to_id):
    # sample size for labels
    ss_idx = []
    label_collection = {}
    remain_size = sample_size
    summary = {}
    print('Important: Sampling Statistics')

    for i, key in enumerate(label_ratio.keys()):
        key_id = label_to_id[key]

        # sample size computation
        if i != len(label_ratio.keys()) - 1:
            # sample size follows label ratio
            key_size = int(sample_size * label_ratio[key])
            remain_size = remain_size - key_size
        else:
            key_size = remain_size

        # basic information of label data
        summary[key] = []
        label_idx = np.argwhere((prob_np.argmax(axis=1)==key_id)).flatten()
        summary[key].append(label_idx.shape[0])

        # certainty index
        key_certain_idx = np.argwhere((prob_np.argmax(axis=1)==key_id) & (prob_np.max(axis=1)>=certainty)).flatten()
        summary[key].append(key_certain_idx.shape[0])

        summary[key].append(key_size)

        # warning if not able to sample enough data (filtered size is smaller than required size)
        if key_size > key_certain_idx.shape[0]:
            print('\t(Warning: only sample ',key_certain_idx.shape[0], ' example(s) for label ',key,' because required size > filtered size)')
            if key_size <= label_idx.shape[0]:
                print('label ',key,':\t(Suggested Certainty for label ', key,': ', np.sort(prob_np.max(axis=1)[(prob_np.argmax(axis=1)==key_id)])[-key_size], ')')
            else:
                print('label ',key,':\t(Suggested Ratio for label ', key,': ', label_idx.shape[0]/sample_size,')')
            key_size = key_certain_idx.shape[0]

        # append sampled index to list
        ss_idx = ss_idx + (random.sample(key_certain_idx.tolist(), key_size))
        label_collection[key] = key_size

    summary['total'] = ['', len(ss_idx), sample_size]
    print("\nSummary Table:\n{:<25} {:<25} {:<25} {:<25}".format('label\\size','all','filtered (certainty>'+str(certainty)+')','required'))
    for k, v in summary.items():
        total, filtered, required = v
        print("{:<25} {:<25} {:<25} {:<25}".format(k, total, filtered, required))
    return ss_idx

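def extract_data(data, idx):
    # select the sampled unlabeled examples by index
    unlabel_data = np.array(data)
    return unlabel_data[idx].tolist()

def extract_logits(prob_np, idx):
    # select the sampled rows; note these are softmax probabilities, not raw logits
    return prob_np[idx].tolist()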


##### Sampling Statistics

This summary is important for checking the sampling results. The required size for each label is the sample size times the label ratio. Generally, the total number of sampled examples (the `total` row of the filtered column) should equal the required size; otherwise a warning is printed.

- required size > label size: consider editing the label ratio
- required size <= label size and required size > filtered size (certainty filter applied): consider editing the certainty
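
The required sizes and the two suggestions can be sketched in isolation as follows. `label_quotas` and `suggestion` are hypothetical names that restate the logic of `get_sampled_idx` above, assuming a non-empty `label_ratio`; the last label absorbs whatever is left after truncation so the quotas add up to `sample_size`.

```python
def label_quotas(label_ratio, sample_size):
    """Required size per label: truncate each share, give the remainder to the last label."""
    quotas = {}
    remaining = sample_size
    labels = list(label_ratio)
    for label in labels[:-1]:
        quotas[label] = int(sample_size * label_ratio[label])
        remaining -= quotas[label]
    quotas[labels[-1]] = remaining
    return quotas

def suggestion(required, label_size, filtered_size):
    """Return which setting to revisit when the filtered pool is too small, else None."""
    if required <= filtered_size:
        return None
    return "edit certainty" if required <= label_size else "edit label ratio"

quotas = label_quotas({'-1': 0.4, '0': 0.2, '1': 0.4}, 10)
# quotas == {'-1': 4, '0': 2, '1': 4}
```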

In [13]:
def sampling(unlabel_dataset, prediction, label_ratio, certainty, sample_size, label_to_id):
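    '''
        input:
        - unlabel_dataset: list
        - prediction: list of prediction dicts (each with 'probabilities')
        - label_ratio: dict
        - certainty: float
        - sample_size: int
        - label_to_id: dict

        output:
        - sampled_data: list
        - sampled_logits: list of probabilities for the sampled data
    '''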
    # collect probability of prediction
    prob_np = collect_probability(prediction)

    # get sampled index
    idx = get_sampled_idx(prob_np, label_ratio, certainty, sample_size, label_to_id)

    # indexing unlabel data
    sampled_data = extract_data(unlabel_dataset, idx)

    # indexing unlabel logits
    sampled_logits = extract_logits(prob_np, idx)

    return sampled_data, sampled_logits

sampled_data, sampled_logits = sampling(
    unlabel_dataset = unlabel_raw_data, 
    prediction = prediction, 
    label_ratio = label_ratio, 
    certainty = certainty,
    sample_size = sample_size,
    label_to_id = pipeline.args.label_to_id)

Important: Sampling Statistics

Summary Table:
label\size                all                       filtered (certainty>0)    required                 
-1                        18                        18                        4                        
0                         2                         2                         2                        
1                         44                        44                        4                        
total                                               10                        10                       


### Overview

In [14]:
print('Total size of sampled data: ', len(sampled_data))
print('Overview of first sampled data', sampled_data[0],'\n')
print('Total size of sampled logits: ', len(sampled_logits))
print('Overview of first sampled logits', sampled_logits[0])

Total size of sampled data:  10
Overview of first sampled data {'content': '【#紅十字會總會赴武漢工作組#：堅決徹底整改】中國紅十字會黨組書記、常務副會長梁惠玲率領總會工作組于２月１日晚奔赴武漢，調查處置輿情反映有關問題，依法規範捐贈款物接受使用和信息公開工作。對疫情防控工作進行再調度、再部署、再動員，要求深刻汲取捐贈款物管理失職失責的慘痛教訓，迅速開展自查自糾，採取切實管用措施，堅決徹底整改到位。（@人民日報 ）\n'} 

Total size of sampled logits:  10
Overview of first sampled logits [0.3454711437225342, 0.30857840180397034, 0.3459504544734955]


### save dataset and logits

In [15]:
def save_data(sample_data: list, save_path: str):
    with open(save_path, 'w') as outfile:
        json.dump(sample_data, outfile)

def save_logit(sample_logits: list, save_path: str):
    import pickle
    with open(save_path, 'wb') as outfile:
        pickle.dump(sample_logits, outfile)

# join with a path separator so the files land inside save_dir
save_data_path = os.path.join(save_dir, save_data_file)
save_logit_path = os.path.join(save_dir, save_logit_file)
save_data(sampled_data, save_data_path)
save_logit(sampled_logits, save_logit_path)

### review saved data

In [16]:
with open(save_data_path, 'r') as infile:
    result = json.load(infile)
print('Total num of samples: ',len(result))
print('First sample of result: ', result[0])

Total num of samples:  10
First sample of result:  {'content': '【#紅十字會總會赴武漢工作組#：堅決徹底整改】中國紅十字會黨組書記、常務副會長梁惠玲率領總會工作組于２月１日晚奔赴武漢，調查處置輿情反映有關問題，依法規範捐贈款物接受使用和信息公開工作。對疫情防控工作進行再調度、再部署、再動員，要求深刻汲取捐贈款物管理失職失責的慘痛教訓，迅速開展自查自糾，採取切實管用措施，堅決徹底整改到位。（@人民日報 ）\n'}
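
The saved probabilities can be reviewed the same way. The sketch below round-trips a small list through `pickle` in a temporary directory; `load_logit`, the file name, and the values are placeholders, and in this notebook the path would be `save_logit_path`.

```python
import os
import pickle
import tempfile

def load_logit(path):
    """Load a pickled list of per-example scores (counterpart of save_logit)."""
    with open(path, 'rb') as infile:
        return pickle.load(infile)

# Placeholder values standing in for sampled_logits.
example = [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'example_logits.pkl')
    with open(tmp_path, 'wb') as outfile:
        pickle.dump(example, outfile)
    restored = load_logit(tmp_path)
```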


### Export variables for unittest
- test_length
- test_pseudo_label_ratio
- test_certainty
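
A downstream unit test might check the exported values along these lines. This is a sketch under assumptions: `check_exported_values` is a hypothetical helper, the input values are illustrative, and the exact assertions used by the project's test suite are not shown in this notebook.

```python
def check_exported_values(length, label_ratio, min_certainty, sample_size, certainty):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if length > sample_size:
        problems.append("more samples than requested")
    # Ratios are rounded to two decimals, so allow a small tolerance.
    if abs(sum(label_ratio.values()) - 1.0) > 0.05:
        problems.append("label ratios do not sum to 1")
    if min_certainty < certainty:
        problems.append("a sample falls below the certainty threshold")
    return problems

# Illustrative values only.
problems = check_exported_values(
    length=10,
    label_ratio={'-1': 0.4, '0': 0.2, '1': 0.4},
    min_certainty=0.35,
    sample_size=10,
    certainty=0,
)
```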

In [17]:
def get_min_certainty(prob):
    min_prob = 1.0
    for p in prob:
        if max(p) < min_prob:
            min_prob = max(p)
    return min_prob

prediction = predict_label(result, pipeline)
result_probability = [pred['probabilities'] for pred in prediction]
min_certainty = get_min_certainty(result_probability)
result_label_id = [pred['prediction'] for pred in prediction]
result_label_ratio = get_label_ratio(result_label_id)


In [18]:
import scrapbook as sb
sb.glue("length", len(result))
sb.glue("label_ratio", result_label_ratio)
sb.glue("min_certainty", min_certainty)



### remove file

This part removes the saved files and cleans the directory. Skip the cells below if you want to keep the saved data and model.
- code 0: successful removal
- code 256: failed removal
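
The value 256 is the raw wait status that `os.system` returns on Unix when `rm` exits with status 1. A portable alternative, sketched below with a hypothetical `remove_path` helper, avoids the shell and reports success as a boolean.

```python
import os
import shutil
import tempfile

def remove_path(path):
    """Delete a file or directory tree; return True if something was removed."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)
        return True
    if os.path.lexists(path):
        os.remove(path)
        return True
    return False

# Demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'sampled.json')
    open(target, 'w').close()
    first = remove_path(target)    # True: the file existed
    second = remove_path(target)   # False: nothing left to remove
```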

##### remove saved result

In [19]:
print(os.system(f"rm {save_data_path}"))
print(os.system(f"rm {save_logit_path}"))

0
0


##### remove trained model (if a demo model exists)

In [20]:
print(os.system(f"rm -rf {model_dir}/result"))
print(os.system(f"rm -rf {model_dir}/model"))
print(os.system(f"rm -rf {model_dir}/logs"))
print(os.system(f"rm {model_dir}/log"))

0
0
0
0
