### Introduction

A data augmentation experiment on the Apple Care 2 dataset highlighted factors that help ensure the quality of the unlabeled dataset and thereby improve model performance. This notebook samples unlabeled data by label ratio and certainty.

Experiment details are in [2022_01_13 biweekly discussion.pptx](https://jira.wisers.com:18090/download/attachments/82808396/2022_01_13%20biweekly%20discussion.pptx?version=1&modificationDate=1642063801000&api=v2).

For the code design, see [Confluence Proposed Module](https://jira.wisers.com:18090/display/RES/Proposed+Module2).

In [1]:
import os
os.chdir('../src/')
os.getcwd()

'/home/developer/Users/hinova/canton-target-sentiment/src'

### Model Generation

A trained model is required to predict labels for the unlabeled data before sampling. Skip this step if you already have a trained model directory containing the following files:
- model directory
    - run.yaml
    - model.yaml
    - label_to_id.json
    - model.pt
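
Before skipping training, it can help to confirm that the directory really holds the four files listed above. The following is a minimal sketch; `REQUIRED_FILES` and `missing_model_files` are hypothetical names introduced for illustration and are not part of the repository.

```python
from pathlib import Path

# Files the notebook expects in a trained model directory (taken from the list above).
REQUIRED_FILES = ("run.yaml", "model.yaml", "label_to_id.json", "model.pt")

def missing_model_files(model_dir):
    """Return the required files that are absent from model_dir (hypothetical helper)."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from `missing_model_files(model_dir)` suggests the training step can be skipped.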

The cell below runs the training process and returns a status code:
- code 0: training succeeded
- code 256: training failed

In [2]:
os.system(f"python run.py --config_dir='../config/examples/sequence_classification/BERT_AVG'")

0
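
On Unix, `os.system` returns the wait status rather than the exit code itself, which is why a process exiting with code 1 is reported as 256. As a hedged alternative, `subprocess.run` exposes the exit code directly. The sketch below uses stand-in commands instead of `run.py`, and `run_and_report` is a hypothetical helper name.

```python
import subprocess
import sys

def run_and_report(cmd):
    """Run cmd (a list of arguments) and return its exit code; 0 means success."""
    completed = subprocess.run(cmd)
    return completed.returncode

# Stand-in commands; in the notebook the command would be the run.py call above.
ok = run_and_report([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_report([sys.executable, "-c", "raise SystemExit(1)"])
```

Here `ok` is the exit code of a successful process and `failed` that of a process that exited with code 1.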

### Arguments

Users can either sample according to the label ratio of the train set or request a specific label ratio. The choice depends on whether the following optional argument is provided:

- label_ratio (only if a specific label ratio is desired)


Optional Arguments

In [3]:
# user can sample label ratio based on the ratio of train set, or request desired label ratio

# label_ratio = {'-1': 0.3, '0': 0.4, '1': 0.3}
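
A user-supplied `label_ratio` is easy to get wrong (a misspelled label, ratios that do not sum to 1), so a small check before sampling may be worthwhile. This is a sketch only: `validate_label_ratio` is a hypothetical helper, and the example mapping is illustrative; the real mapping is loaded from `label_to_id.json`.

```python
import math

def validate_label_ratio(label_ratio, label_to_id):
    """Raise ValueError if label_ratio has unknown labels or does not sum to 1."""
    unknown = sorted(set(label_ratio) - set(label_to_id))
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    total = sum(label_ratio.values())
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"ratios sum to {total}, expected 1")
    return True

# Illustrative mapping (assumption); the real one comes from label_to_id.json.
example_label_to_id = {"1": 0, "0": 1, "-1": 2}
validate_label_ratio({"-1": 0.3, "0": 0.4, "1": 0.3}, example_label_to_id)
```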

Required Arguments

In [4]:
# model directory, must include above files
model_dir = '../config/examples/sequence_classification/BERT_AVG/model' # model inference

# unlabel path (json file name included)
unlabel_path = '../data/datasets/sample/sequence_classification/unlabeled_sample.json' 

# save directory
save_dir = '../data/datasets/sample/sequence_classification'
save_data_file = 'sampled_unlabel_data.json'
save_logit_file = 'sampled_unlabel_logits.pkl'

# sample size and certainty
sample_size = 6 # required, integer smaller than the size of the unlabeled data
certainty = 0.36 # required, only select data where max(p) >= certainty,
                # p is the predicted probabilities (0 - 1) over the label space
                # default certainty is 0
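
To make the certainty filter concrete, the sketch below applies the same rule (keep an example when its largest predicted probability is at least `certainty`) to made-up probabilities. It uses plain Python lists rather than the NumPy arrays used later in the notebook, and `certain_indices` is a hypothetical helper name.

```python
def certain_indices(probabilities, certainty=0.0):
    """Return indices of rows whose largest probability is >= certainty."""
    return [i for i, p in enumerate(probabilities) if max(p) >= certainty]

# Made-up probabilities over three labels, for illustration only.
example_probs = [
    [0.34, 0.33, 0.33],  # max 0.34: dropped at certainty 0.36
    [0.10, 0.20, 0.70],  # max 0.70: kept
    [0.36, 0.32, 0.32],  # max 0.36: kept, the threshold is inclusive
]
kept = certain_indices(example_probs, certainty=0.36)
```

With the default `certainty` of 0, every example passes the filter.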

In [5]:
import json
from pathlib import Path
from utils import load_yaml
class arg():
    def __init__(self, model_dir: str):
        # run_yaml configuration
        run_config = load_yaml(Path(model_dir) / "run.yaml")
        self.task = run_config['task']
        self.data_config = run_config['data']
        self.data_dir = Path(self.data_config['data_dir'])
        self.prepro_config = run_config['text_prepro']
        self.eval_config = run_config['eval']
        self.train_config = run_config['train']
        self.device = run_config['device']
        model_class = self.train_config['model_class']

        # model_yaml configuration
        self.model_config = load_yaml(Path(model_dir) / "model.yaml")[model_class]
        self.model_config['pretrained_lm_from_prev'] = model_dir

        # label_to_id mapping and its inverse (id -> label)
        with open(Path(model_dir) / 'label_to_id.json', 'r', encoding='utf-8') as infile:
            self.label_to_id = json.load(infile)
        self.label_to_id_inv = {v: k for k, v in self.label_to_id.items()}

        # directory where the model is located
        self.model_dir = Path(model_dir)

args = arg(model_dir)

In [6]:
print('actual arguments: \n', args.__dict__)

actual arguments: 
 {'task': 'sequence_classification', 'data_config': {'output_dir': '../config/examples/sequence_classification/BERT_AVG', 'data_dir': '../data/datasets/sample/sequence_classification', 'train': 'train_sample.json', 'dev': 'train_sample.json', 'test': 'train_sample.json'}, 'data_dir': PosixPath('../data/datasets/sample/sequence_classification'), 'prepro_config': {'steps': ['utf8_replace', 'simplified_chinese', 'lower_case', 'full_to_half']}, 'eval_config': {'batch_size': 64, 'model_file': 'model.pt'}, 'train_config': {'model_class': 'BERT_AVG', 'kd': {'use_kd': False, 'teacher_dir': '../output/post_sentiment_20210707_bert_avg/model', 'loss_type': 'mse', 'soft_lambda': 0.5, 'kl_T': 5}, 'seed': 42, 'log_steps': 100, 'batch_size': 32, 'final_model': 'best', 'optimization_metric': 'macro_f1', 'early_stop': 5}, 'device': 0, 'model_config': {'max_length': 256, 'tokenizer_source': 'transformers', 'tokenizer_name': 'bert-base-chinese', 'pretrained_lm': 'bert-base-chinese', 'o

### Get Label Ratio of Train Set

Skip this part if the user provides the label_ratio argument.

In [7]:
# if following ratio of train set
from tokenizer import get_tokenizer
from dataset import get_dataset
    
tokenizer = get_tokenizer(args = args) 
train_dataset = get_dataset(dataset="train", tokenizer=tokenizer, args=args)

print('train dataset size: ', len(train_dataset))
    

../config/examples/sequence_classification/BERT_AVG/model
['run.yaml', 'model.yaml', 'tokenizer', 'label_to_id.json', 'model.pt']


3it [00:00, 62.49it/s]

train dataset size:  3





In [8]:
label = [train_dataset[i]['label'].item() for i in range(len(train_dataset))]
print('The first three labels of trainset: \n', label[:3])

The first three labels of trainset: 
 [0, 1, 2]


In [9]:
def get_label_ratio(label):
    '''
        input:
        - label: list

        output:
        - label_ratio: dict
    '''
    result = {}
    for i in label:
        key = args.label_to_id_inv[i]
        if key not in result:
            result[key] = 0
        result[key] = result[key] + 1/len(label)
    return result

label_ratio = get_label_ratio(label)
print('label ratio of train set: \n',label_ratio)

label ratio of train set: 
 {'1': 0.3333333333333333, '0': 0.3333333333333333, '-1': 0.3333333333333333}
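
`get_label_ratio` reads the global `args` and adds `1/len(label)` once per example. A variant that takes the id-to-label mapping explicitly and divides each count once may be easier to reuse and test. The sketch below is one such variant; `label_ratio_from_ids` is a hypothetical name and the mapping is an illustrative one consistent with the output above.

```python
from collections import Counter

def label_ratio_from_ids(label_ids, id_to_label):
    """Map label ids to names and return each name's share of the list."""
    counts = Counter(id_to_label[i] for i in label_ids)
    total = len(label_ids)
    return {name: count / total for name, count in counts.items()}

# Illustrative mapping (assumption), consistent with the output above.
example_id_to_label = {0: "1", 1: "0", 2: "-1"}
example_ratio = label_ratio_from_ids([0, 1, 2], example_id_to_label)
```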


### Generate Pseudo Prediction Label

In [10]:
args.data_config['data_dir'] = '/'.join(unlabel_path.split('/')[:-1])
args.data_dir = Path(args.data_config['data_dir'])
args.data_config['unlabeled'] = unlabel_path.split('/')[-1]
unlabel_dataset = get_dataset(dataset="unlabeled", tokenizer=tokenizer, args=args)
print('unlabel dataset size: ', len(unlabel_dataset))

64it [00:00, 172.61it/s]

unlabel dataset size:  64





In [11]:
from model import get_model
model = get_model(args)

In [12]:
from torch.utils.data import DataLoader
from trainer import prediction_step
from itertools import chain

def predict_label(dataset, model):
    '''
        input:
        - dataset: torch.dataset
        - model: torch.model
        
        output:
        - list
    '''
    dataloader = DataLoader(
        dataset,
        shuffle=False,
        batch_size=args.eval_config["batch_size"],
        # collate_fn=eval_dataset.pad_collate,
    )

    results = []
    for batch in dataloader:
        result = prediction_step(model, batch, args=args)
        results.append(result)
    return results

prediction = predict_label(unlabel_dataset, model)
pseudo_label = prediction[0]['prediction']
print('The first three labels of unlabel: \n', pseudo_label[:3])

The first three labels of unlabel: 
 ['-1', '-1', '-1']
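
Note that `prediction[0]` covers only the first batch. That is sufficient here because the 64 unlabeled examples fit into a single batch of 64, but a larger dataset would span several batches. A minimal sketch for collecting one field across all batches follows, assuming each batch result is a dict whose values are lists (as the output above suggests); `flatten_field` is a hypothetical helper.

```python
from itertools import chain

def flatten_field(batches, field):
    """Concatenate batch[field] over all batches into one flat list."""
    return list(chain.from_iterable(batch[field] for batch in batches))

# Made-up batch results, for illustration only.
example_batches = [
    {"prediction": ["-1", "0"]},
    {"prediction": ["1"]},
]
all_labels = flatten_field(example_batches, "prediction")
```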


##### Overview of Pseudo Labels

In [13]:
pseudo_label_id = prediction[0]['prediction_id']
pseudo_label_ratio = get_label_ratio(pseudo_label_id)
print('pseudo label ratio of unlabel set: \n',pseudo_label_ratio)

pseudo label ratio of unlabel set: 
 {'-1': 0.984375, '0': 0.015625}


In [14]:
import numpy as np
(unique, counts) = np.unique(np.array(pseudo_label), return_counts=True)
frequencies = np.asarray((unique, counts)).T
print('Frequency of pseudo label (left column label, right column count): \n', frequencies)

Frequency of pseudo label (left column label, right column count): 
 [['-1' '63']
 ['0' '1']]


In [15]:
label_ratio

{'1': 0.3333333333333333, '0': 0.3333333333333333, '-1': 0.3333333333333333}

### Sampling
- label ratio
- certainty

In [16]:
import json
import random
import numpy as np

def collect_probability(prediction):
    # collect probability of prediction
    prob_ls = []
    for batch_pred in prediction:
            prob_ls = prob_ls + batch_pred['probabilities']
    prob_np = np.array(prob_ls)
    return prob_np

def get_sampled_idx(prob_np, label_ratio, certainty, sample_size):
    # sample size for labels
    ss_idx = []
    label_collection = {}
    remain_size = sample_size

    for i, key in enumerate(label_ratio.keys()):
        key_id = args.label_to_id[key]

        # sample size computation
        if i != len(label_ratio.keys()) - 1:
            # sample size follows label ratio
            key_size = int(sample_size * label_ratio[key])
            remain_size = remain_size - key_size
        else:
            key_size = remain_size

        # basic information of label data
        print('label: ',key)
        label_idx = np.argwhere((prob_np.argmax(axis=1)==key_id)).flatten()
        print('unlabeled data size of label', key, ': ', label_idx.shape[0])

        # certainty index
        key_certain_idx = np.argwhere((prob_np.argmax(axis=1)==key_id) & (prob_np.max(axis=1)>=certainty)).flatten()
        print('filtered data size of label', key, ' that certainty >= ',certainty,': ', key_certain_idx.shape[0])
        print('required data size of label ',key,': ',key_size)

        # warning if not able to sample enough data (filtered size is smaller than required size)
        if key_size > key_certain_idx.shape[0]:
            print('\t(Warning: only sample ',key_certain_idx.shape[0], ' example(s) for label ',key,' because required size > filtered size)')
            if key_size <= label_idx.shape[0]:
                # sort the max probabilities so that [-key_size] is the key_size-th largest value
                print('\t(Suggested Certainty for label ', key,': ', np.sort(prob_np.max(axis=1)[(prob_np.argmax(axis=1)==key_id)])[-key_size], ')')
            else:
                print('\t(Suggested Ratio for label ', key,': ', label_idx.shape[0]/sample_size,')')
            key_size = key_certain_idx.shape[0]

        # append sampled index to list
        ss_idx = ss_idx + (random.sample(key_certain_idx.tolist(), key_size))
        label_collection[key] = key_size
        print('\n')

    # ss_idx = random.sample(ss_idx, sample_size)
    print('Final sampled size: ', len(ss_idx))
    print('Final label count: ',label_collection)
    return ss_idx

def extract_data(path, idx):
    # load the unlabeled data and keep only the sampled indices
    with open(path, 'rb') as infile:
        unlabel_data = np.array(json.load(infile))
    return unlabel_data[idx].tolist()

def extract_logits(prop_np, idx):
    return prop_np[idx].tolist()


In [17]:
def sampling(unlabel_path, prediction, label_ratio, certainty, sample_size):
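    '''
        input:
        - unlabel_path: str, path to the unlabeled json file
        - prediction: list of per-batch prediction results
        - label_ratio: dict
        - certainty: float
        - sample_size: int

        output:
        - sampled_data: list
        - sampled_logits: list
    '''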
    # collect probability of prediction
    prob_np = collect_probability(prediction)

    # get sampled index
    idx = get_sampled_idx(prob_np, label_ratio, certainty, sample_size)

    # indexing unlabel data
    sampled_data = extract_data(unlabel_path, idx)

    # indexing unlabel logits
    sampled_logits = extract_logits(prob_np, idx)

    return sampled_data, sampled_logits

sampled_data, sampled_logits = sampling(
    unlabel_path = unlabel_path, 
    prediction = prediction, 
    label_ratio = label_ratio, 
    certainty = certainty,
    sample_size = 10)  # note: overrides the sample_size = 6 defined in the arguments

label:  1
unlabeled data size of label 1 :  0
filtered data size of label 1  that certainty >=  0.36 :  0
required data size of label  1 :  3
	(Suggested Ratio for label  1 :  0.0 )


label:  0
unlabeled data size of label 0 :  1
filtered data size of label 0  that certainty >=  0.36 :  0
required data size of label  0 :  3
	(Suggested Ratio for label  0 :  0.1 )


label:  -1
unlabeled data size of label -1 :  63
filtered data size of label -1  that certainty >=  0.36 :  63
required data size of label  -1 :  4


Final sampled size:  4
Final label count:  {'1': 0, '0': 0, '-1': 4}
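
The required sizes 3, 3 and 4 in the output above follow from how `get_sampled_idx` splits the sample size: every label except the last gets `int(sample_size * ratio)`, and the last label receives whatever remains. The sketch below isolates that allocation step; `allocate_sizes` is a hypothetical name, and the function assumes a non-empty `label_ratio`.

```python
def allocate_sizes(label_ratio, sample_size):
    """Split sample_size across labels: floor for all but the last, remainder to the last."""
    sizes = {}
    remaining = sample_size
    labels = list(label_ratio)
    for label in labels[:-1]:
        sizes[label] = int(sample_size * label_ratio[label])
        remaining -= sizes[label]
    sizes[labels[-1]] = remaining
    return sizes

example_sizes = allocate_sizes({"1": 1 / 3, "0": 1 / 3, "-1": 1 / 3}, sample_size=10)
```

Because the remainder always goes to the last key, the order of keys in `label_ratio` affects which label receives the extra examples.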


### Overview

In [18]:
print('Total size of sampled data: ', len(sampled_data))
print('Overview of first sampled data', sampled_data[0],'\n')
print('Total size of sampled logits: ', len(sampled_logits))
print('Overview of first sampled logits', sampled_logits[0])

Total size of sampled data:  4
Overview of first sampled data {'content': '【#27人一月檢測樣本超3000# 他們的戰場，在實驗室】#周刊君與你共同戰疫# 武漢肺科醫院有這樣一個檢驗團隊，他們藏身實驗室內，每天與看不見的病毒打交道。團隊27名同志，已經連續晝夜奮戰長達一個多月，累計檢測的新冠病毒樣本數量超過3000多例。他們每批次的提取檢測，有10多個步驟，全程要高度集中精神，尤其最容易感染的標本處理及核酸提取階段，絲毫不能分神。詳戳↓ @央視財經 http://t.cn/A6hvao74\n'} 

Total size of sampled logits:  4
Overview of first sampled logits [0.3296697437763214, 0.23053191602230072, 0.43979835510253906]


### Save Dataset and Logits

In [19]:
def save_data(sample_data: list, save_path: str):
    with open(save_path, 'w') as outfile:
        json.dump(sample_data, outfile)

def save_logit(sample_logits: list, save_path: str):
    import pickle
    with open(save_path, 'wb') as outfile:
        pickle.dump(sample_logits, outfile)

# join with a path separator; save_dir has no trailing slash
save_data_path = os.path.join(save_dir, save_data_file)
save_logit_path = os.path.join(save_dir, save_logit_file)
save_data(sampled_data, save_data_path)
save_logit(sampled_logits, save_logit_path)
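
Since `save_dir` has no trailing slash, how the directory and file name are combined determines where the files end up. The short sketch below compares plain string concatenation with `os.path.join` using the values from the arguments cell; the `example_*` names are introduced only for illustration.

```python
import os

example_dir = "../data/datasets/sample/sequence_classification"
example_file = "sampled_unlabel_data.json"

concatenated = example_dir + example_file          # the separator is lost
joined = os.path.join(example_dir, example_file)   # the separator is inserted
```

`concatenated` names a file beside the directory, while `joined` names a file inside it.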

### Review Saved Data

In [20]:
with open(save_data_path, 'rb') as outfile:
    result = json.load(outfile)
print(len(result))
print(result[0])

4
{'content': '【#27人一月檢測樣本超3000# 他們的戰場，在實驗室】#周刊君與你共同戰疫# 武漢肺科醫院有這樣一個檢驗團隊，他們藏身實驗室內，每天與看不見的病毒打交道。團隊27名同志，已經連續晝夜奮戰長達一個多月，累計檢測的新冠病毒樣本數量超過3000多例。他們每批次的提取檢測，有10多個步驟，全程要高度集中精神，尤其最容易感染的標本處理及核酸提取階段，絲毫不能分神。詳戳↓ @央視財經 http://t.cn/A6hvao74\n'}


### Remove File

This part removes the saved files to clean up the directory. Skip it if you want to keep the saved data.
- code 0: removal succeeded
- code 256: removal failed

In [21]:
print(os.system(f"rm {save_data_path}"))
print(os.system(f"rm {save_logit_path}"))

0
0
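
Shelling out to `rm` works on Unix only. As a portable alternative, the files could be removed with `pathlib`; the sketch below demonstrates this on a throwaway file, and `remove_if_exists` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path

def remove_if_exists(path):
    """Delete path if it is a file; return True if a file was removed."""
    target = Path(path)
    if target.is_file():
        target.unlink()
        return True
    return False

# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sampled_unlabel_data.json"
    demo.write_text("[]")
    first = remove_if_exists(demo)   # the file existed
    second = remove_if_exists(demo)  # the file is already gone
```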
