We used the dataset titled "Korean-English Parallel Corpus of Specialized Domains":

https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=111

However, access is restricted to Korean nationals.

The dataset is structured as follows:

> data
>> Training
>>> ko2en_training_json.zip
>>>> ko2en_medical_1_training.json (and so on)

>> Validation
>>> ko2en_validation_json.zip
>>>> ko2en_medical_1_validation.json (and so on)
 
Each domain file consists of records with the following key-value pairs (I added an English translation to each key):

&nbsp; "sid": 3,\
&nbsp; "분야(domain)": "의료/보건",\
&nbsp; "한국어(ko)": "쿠퍼만지수의 대표 증상인 안면홍조, 손발저림, 신경과민, 우울증, 가슴 두근거림, 근관절통, 피로 등에서 뛰어난 효과를 나타냈다고 회사 측은 설명했다.",\
&nbsp; "영어(en)": "The company explained that it had an excellent effect on the representative symptoms of Kupperman's index such as hot flashes, numbness in the hands and feet, nervousness, depression, palpitations in the chest, muscle joint pain, and fatigue.",\
&nbsp; "한국어_어절수(ko_word_count)": 18,\
&nbsp; "영어_단어수(en_word_count)": 37,\
&nbsp; "길이_분류(length_category)": 3,\
&nbsp; "난이도(difficulty)": "중",\
&nbsp; "수행기관(work_done_by)": "에버트란"

In [1]:
debug = True # decide debug mode

Dependency conflicts are unlikely in most environments, but torch>=2.0.0 is required to use `optimum.bettertransformer`.

Requirements:

!pip install transformers datasets tqdm optimum unbabel-comet scikit-learn

Each domain dataset under `input_path = './domain_datasets'` must already be split into train and test sets (the code below loads the `'train'` split).

In [2]:
import os
from transformers import set_seed
from torch import device as set_device

set_seed(42) # seed 40, 41, 42 were used
batch_size = 8 # batch size that fits on an RTX 3090 (8~10 for 16 GB of VRAM)
device = set_device('cuda')
pretrained_model_path = 'facebook/nllb-200-distilled-1.3B'

input_path = './domain_datasets'
output_path = './pruned_datasets'

# create the output directories; exist_ok avoids failing (and skipping the
# remaining directories) when one of them already exists
for domain_name in ['travel', 'sports', 'law', 'medical']:
    os.makedirs(f'{output_path}/{domain_name}', exist_ok=True)

Import the necessary libraries.

In [3]:
from pruning_fuctions import el2n_algorithm, entropy_algorithm, get_offset_mapping
from transformers import M2M100ForConditionalGeneration, AutoTokenizer, DataCollatorWithPadding, AutoModelForTokenClassification, AutoModel, TokenClassificationPipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import Dataset, load_from_disk
from tqdm import tqdm
from torch.utils.data import DataLoader
import torch
from optimum.bettertransformer import BetterTransformer
from comet import download_model, load_from_checkpoint
import torch.nn.functional as F
from sklearn.cluster import KMeans
import gc

Define the model and tokenizers.

In [4]:
pretrained_model = M2M100ForConditionalGeneration.from_pretrained(pretrained_model_path).eval()
pretrained_model = BetterTransformer.transform(pretrained_model)
pretrained_model.to(device)
NLLB_tokenizer_src  = AutoTokenizer.from_pretrained(pretrained_model_path, src_lang='kor_Hang')
NLLB_tokenizer_tgt = AutoTokenizer.from_pretrained(pretrained_model_path, src_lang='eng_Latn')

# max_length is set to about mean + 3 sigma of the token lengths for efficient handling
def tokenize_source(row):
    return NLLB_tokenizer_src(row['ko'], truncation=True, max_length=72, return_offsets_mapping=True)

def tokenize_target(row):
    return NLLB_tokenizer_tgt(row['en'], truncation=True, max_length=144, return_offsets_mapping=True)

The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
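The `max_length` values in the tokenizing functions are described as mean + 3 sigma of the token lengths. A minimal sketch of how such a cap could be derived, assuming the population standard deviation is used (the notebook does not say which estimator was chosen); `length_cap` and the toy lengths are hypothetical:

```python
import math
import statistics


def length_cap(token_lengths, n_sigma=3):
    """Round mean + n_sigma * std of the token lengths up to an integer cap."""
    mean = statistics.fmean(token_lengths)
    std = statistics.pstdev(token_lengths)  # population standard deviation (assumption)
    return math.ceil(mean + n_sigma * std)


# Toy token counts, not measured on the real corpus.
toy_lengths = [20, 25, 30, 35, 40]
cap = length_cap(toy_lengths)
print(cap)  # 52
```

Sequences longer than the cap are truncated by the tokenizer, so a cap of this kind trades a small amount of lost text for a bounded memory footprint per batch.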


Define the dataset loading and tokenizing function.

In [5]:
def set_dataset(which_domain):
    train_dataset = load_from_disk(f'{input_path}/{which_domain}')['train']
    if debug:  # in debug mode, keep only 1 of 500 shards
        train_dataset = train_dataset.shard(500, 0)
    tokenized_train_dataset = train_dataset.map(tokenize_source, batched=True)
    tokenized_train_dataset = tokenized_train_dataset.map(lambda row : {'len': len(row['input_ids'])})

    # for efficient pruning, we sorted data by length
    sorted_train_dataset = tokenized_train_dataset.sort('len', reverse=True)

    sorted_tokenized_answer = sorted_train_dataset.map(tokenize_target, batched=True)
    src_sentences = sorted_train_dataset['ko']
    tgt_sentences = sorted_train_dataset['en']
    sorted_tokenized_answer = sorted_tokenized_answer.remove_columns(['ko', 'en', 'attention_mask','len'])
    sorted_input_dataset = sorted_train_dataset.remove_columns(['ko', 'en', 'offset_mapping', 'len'])

    return sorted_input_dataset, sorted_tokenized_answer, src_sentences, tgt_sentences
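Sorting by length before batching keeps sequences of similar length together, so each batch needs less padding. The toy calculation below illustrates the effect under the assumption that every batch is padded to its longest member, which is what a padding collator does; `padded_tokens` and the lengths are hypothetical:

```python
def padded_tokens(lengths, batch_size):
    """Total number of tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Toy token counts, alternating short and long sentences.
toy_lengths = [5, 70, 8, 65, 6, 72, 7, 60]
unsorted_cost = padded_tokens(toy_lengths, batch_size=4)
sorted_cost = padded_tokens(sorted(toy_lengths, reverse=True), batch_size=4)
print(unsorted_cost, sorted_cost)  # 568 320
```

Sorting in descending order also means the most memory-hungry batch comes first, so an out-of-memory error would surface at the start of the run rather than near the end.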

Define the collator and dataloader.

In [6]:
def set_dataloader(input_dataset, tokenizer):
    collator = DataCollatorWithPadding(tokenizer, return_tensors='pt')
    dataloader = DataLoader(input_dataset, batch_size, collate_fn=collator, pin_memory=True)
    return dataloader

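Define the translation function. It also collects the EL2N scores, the entropies, and the token offset mappings that the NER-based methods use later.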

In [7]:
def translate(model, dataloader, sorted_tokenized_answer, tgt_lang_id):
    n = 0
    translated_sentences = []
    el2n_list = []
    entropy_list = []
    outputs_offset_mappings = []

    with torch.no_grad():
        for inputs in tqdm(dataloader):
            inputs = {k: v.to(device) for k, v in inputs.items()}
            batch_len = inputs['input_ids'].shape[0]

            translated_tokens = model.generate(
                **inputs, forced_bos_token_id=tgt_lang_id, max_length=144,
                return_dict_in_generate=True, output_scores=True)

            # offset mappings of translated tokens
            res_decoded_originals = [NLLB_tokenizer_tgt.convert_ids_to_tokens(res_token)
                                     for res_token in translated_tokens.sequences]
            outputs_offset_mappings += [get_offset_mapping(res_decoded_original)
                                       for res_decoded_original in res_decoded_originals]

            # logit score for logit method
            scores = torch.stack([x for x in translated_tokens.scores]).movedim(0, 1)
            tokenized_answer_split = sorted_tokenized_answer[n:n + batch_len]['input_ids']
            answer_ids_onehot = [F.one_hot(torch.tensor(x), num_classes=scores.shape[2]).to(device) for x in
                                 tokenized_answer_split]

            el2n_list+=el2n_algorithm(scores, translated_tokens.sequences, answer_ids_onehot)
            entropy_list+=entropy_algorithm(scores, translated_tokens.sequences)

            # 'translated' outputs
            translated_sentences += NLLB_tokenizer_tgt.batch_decode(translated_tokens.sequences, skip_special_tokens=True)
            n += batch_len

    return translated_sentences, el2n_list, entropy_list, outputs_offset_mappings
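`el2n_algorithm` and `entropy_algorithm` are imported from a separate module and are not shown in this notebook. The sketch below illustrates the two quantities for a single token position as they are commonly defined: EL2N as the L2 norm of the difference between the predicted probabilities and the one-hot reference, and entropy as the Shannon entropy of the predicted distribution. It is plain Python on toy logits; the real functions work on batched tensors and may differ in detail, for example in how generated positions are aligned with the reference:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def el2n(logits, gold_index):
    """L2 norm of (predicted probabilities - one-hot reference)."""
    probs = softmax(logits)
    return math.sqrt(sum((p - (1.0 if i == gold_index else 0.0)) ** 2
                         for i, p in enumerate(probs)))


def entropy(logits):
    """Shannon entropy (in nats) of the predicted distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform prediction over a toy vocabulary of 4
confident = [10.0, 0.0, 0.0, 0.0]  # strongly prefers token 0
print(el2n(uncertain, 0), entropy(uncertain))
print(el2n(confident, 0), entropy(confident))
```

Both scores are high for the uniform prediction and close to zero for the confident, correct one, which is why they can serve as per-example difficulty signals.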

# Do translate
Translate each domain and save the metrics that are cheap to compute (EL2N and entropy).
**translated_dict** serves as a global dictionary of translations and scores, keyed by domain.

In [8]:
domains = ['medical','sports','law','travel']
translated_dict = {x:{} for x in domains}
tgt_lang_id = NLLB_tokenizer_tgt.convert_tokens_to_ids("eng_Latn")  # lang_code_to_id is deprecated in recent transformers releases
pretrained_model = M2M100ForConditionalGeneration.from_pretrained(pretrained_model_path).eval()
pretrained_model = BetterTransformer.transform(pretrained_model)
pretrained_model.to(device)

for which_domain in tqdm(domains):
    sorted_input_dataset, sorted_tokenized_answer, src_sentences, tgt_sentences = set_dataset(which_domain)
    dataloader = set_dataloader(sorted_input_dataset, NLLB_tokenizer_src)

    # translate!!
    translated_sentences, el2n_list, entropy_list, outputs_offset_mappings \
        = translate(pretrained_model, dataloader, sorted_tokenized_answer, tgt_lang_id)

    el2n_scores = [x.mean().item() for x in el2n_list]
    entropy_scores = [x.mean().item() for x in entropy_list]
    entropy_for_NE = [x for x in entropy_list]
    translated_dict[which_domain] = {
        'translated':translated_sentences, 'output_offset_mappings':outputs_offset_mappings,
        'el2n':el2n_scores, 'entropy':entropy_scores, 'entropy_for_NE':entropy_for_NE,
        'src':src_sentences, 'tgt':tgt_sentences}
    gc.collect()
    torch.cuda.empty_cache()

del pretrained_model
gc.collect()
torch.cuda.empty_cache()

Define the save function, then save the entropy and EL2N scores.

In [9]:
def save(which_domain, which_method):
    saving_dict = {'translated':translated_dict[which_domain]['translated'],
                   which_method:translated_dict[which_domain][which_method],
                   'src':translated_dict[which_domain]['src'],
                   'tgt':translated_dict[which_domain]['tgt']
                   }
    saving_dataset_obj = Dataset.from_dict(saving_dict)
    saving_dataset_obj.save_to_disk(f'{output_path}/{which_domain}/{which_method}')

for which_domain in domains:
    save(which_domain, 'el2n')
    save(which_domain, 'entropy')

# NER_based_methods
Define the NER function and other utilities.

In [10]:
def NER(EN_NER_pipeline, which_domain):
    translated_dataset = Dataset.from_dict({'tgt': translated_dict[which_domain]['tgt']})

    ner_list = []
    for out in tqdm(EN_NER_pipeline(KeyDataset(translated_dataset, "tgt"), batch_size=32), total=len(translated_dataset)):
        ner_list.append(out)

    ner_dict = {n: x for n, x in enumerate(ner_list)}
    for key in ner_dict.keys():
        if which_domain == 'medical':
            ner_dict[key] = [[x['start'], x['end']] for x in ner_dict[key] if x['entity_group'] not in
                             ['Age', 'Date', 'Frequency', 'Duration',
                              'Distance', 'Mass', 'Sex', 'Lab_value', 'Time', 'Coreference']]
        else:
            ner_dict[key] = [[x['start'], x['end']] for x in ner_dict[key]]

    return ner_dict
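With `aggregation_strategy="simple"`, the token-classification pipeline returns one dict per entity, including `entity_group`, `start`, and `end`. The sketch below replays the filtering step of `NER` on a hand-written pipeline output, so it runs without downloading a model. `entity_spans` is a hypothetical helper, and the entities (including the `Medication` label) are toy values:

```python
# Entity groups that the medical branch of NER() discards.
EXCLUDED_GROUPS = {'Age', 'Date', 'Frequency', 'Duration', 'Distance',
                   'Mass', 'Sex', 'Lab_value', 'Time', 'Coreference'}


def entity_spans(entities, excluded=frozenset()):
    """Keep the [start, end] character span of every entity whose group is not excluded."""
    return [[e['start'], e['end']] for e in entities if e['entity_group'] not in excluded]


# Hand-written stand-in for the pipeline output on "She took aspirin for 3 days".
toy_entities = [
    {'entity_group': 'Medication', 'word': 'aspirin', 'start': 9, 'end': 16, 'score': 0.99},
    {'entity_group': 'Duration', 'word': '3 days', 'start': 21, 'end': 27, 'score': 0.98},
]

medical_spans = entity_spans(toy_entities, EXCLUDED_GROUPS)
general_spans = entity_spans(toy_entities)
print(medical_spans)  # [[9, 16]]
print(general_spans)  # [[9, 16], [21, 27]]
```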

In [11]:
# keep the named-entity spans that overlap at least one translated token
def find_indices2(ref, special):
    res = []

    for s_interval in special:
        for r_interval in ref:
            if s_interval[1] >= r_interval[0] and s_interval[0] <= r_interval[1]:
                res.append(s_interval)
                break
    return res

# find the indices of the NLLB tokens whose offsets fall inside each NER span
def find_indices_reverse_indices(outputs_offset_mappings, final_indices):
    res = []
    for ner_range in final_indices:
        sub_list = []
        for idx, offset in enumerate(outputs_offset_mappings):
            if offset[1] <= ner_range[0]:  # If the offset is completely before the NER range
                continue
            elif offset[0] >= ner_range[1]:  # If the offset is completely after the NER range
                break
            else:
                sub_list.append(idx)
        res.append(sub_list)
    return res
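A small worked example may make the two helpers easier to follow. The functions are copied from the cell above so that the block runs on its own. The offsets stand for the tokens of a toy output "Take aspirin daily", with "aspirin" split into two subword tokens; the exact format returned by `get_offset_mapping` is not shown in this notebook, so `(start, end)` character pairs are an assumption:

```python
def find_indices2(ref, special):
    res = []
    for s_interval in special:
        for r_interval in ref:
            if s_interval[1] >= r_interval[0] and s_interval[0] <= r_interval[1]:
                res.append(s_interval)
                break
    return res


def find_indices_reverse_indices(outputs_offset_mappings, final_indices):
    res = []
    for ner_range in final_indices:
        sub_list = []
        for idx, offset in enumerate(outputs_offset_mappings):
            if offset[1] <= ner_range[0]:
                continue
            elif offset[0] >= ner_range[1]:
                break
            else:
                sub_list.append(idx)
        res.append(sub_list)
    return res


# Toy token offsets: "Take" | "aspir" | "in" | "daily"
token_offsets = [(0, 4), (5, 10), (10, 12), (13, 18)]
# Toy NER spans: "aspirin", plus a span that lies beyond the end of the output
ner_spans = [[5, 12], [30, 35]]

kept_spans = find_indices2(token_offsets, ner_spans)
token_indices = find_indices_reverse_indices(token_offsets, kept_spans)
flat_indices = [j for sub in token_indices for j in sub]
print(kept_spans)     # [[5, 12]]
print(token_indices)  # [[1, 2]]
print(flat_indices)   # [1, 2]
```

The flattened indices are what the next cell uses to pick the entropies of the tokens that belong to named entities.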

Compute and save the NER-based scores.

'd4data/biomedical-ner-all' works pretty well in the medical domain.

'RashidNLP/NER-Deberta' works in the general domain.

In [12]:
for which_domain in tqdm(domains):
    # the medical domain uses a biomedical NER model; the other domains use a general-purpose one
    NER_model_path = 'd4data/biomedical-ner-all' if which_domain == 'medical' else 'RashidNLP/NER-Deberta'
    ner_model = AutoModelForTokenClassification.from_pretrained(NER_model_path).eval()
    ner_tokenizer = AutoTokenizer.from_pretrained(NER_model_path, model_max_length=144)
    EN_NER_pipeline = TokenClassificationPipeline(model=ner_model, tokenizer=ner_tokenizer,
                                                  aggregation_strategy="simple", device=device)

    NER_predicted_dict = NER(EN_NER_pipeline, which_domain)
    entropy_NE_mean_list = []
    entropy_NE_list = []
    for key in NER_predicted_dict.keys():
        Distorted_NEs_indices = find_indices2(
            translated_dict[which_domain]['output_offset_mappings'][key], NER_predicted_dict[key])
        Distorted_NEs_indices = find_indices_reverse_indices(
            translated_dict[which_domain]['output_offset_mappings'][key], Distorted_NEs_indices)
        Distorted_NEs_indices = [j for sub in Distorted_NEs_indices for j in sub]
        if len(Distorted_NEs_indices) == 0:
            entropy_NE_list.append(-1)
            entropy_NE_mean_list.append(-1)
            continue

        entropy_NE_list.append(translated_dict[which_domain]['entropy_for_NE'][key][Distorted_NEs_indices].max().item())
        entropy_NE_mean_list.append(translated_dict[which_domain]['entropy_for_NE'][key][Distorted_NEs_indices].mean().item())

    translated_dict[which_domain]['entropy_NE'] = entropy_NE_list
    translated_dict[which_domain]['entropy_NE_mean']  = entropy_NE_mean_list
    save(which_domain, 'entropy_NE')
    save(which_domain, 'entropy_NE_mean')

    del ner_model
    gc.collect()
    torch.cuda.empty_cache()

# Embeddings_based_methods


In [13]:
def k_means(model, model_path, which_domain, label_size=128, batch_size=128):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    def tokenize(row):
        return tokenizer(row['src'], truncation=True, max_length=72) # 36+50, mean + 4 sigma

    src_sentences_dataset = Dataset.from_dict({'src':translated_dict[which_domain]['src']})
    tokenized_src_dataset = src_sentences_dataset.map(tokenize, batched=True)
    tokenized_src_dataset = tokenized_src_dataset.remove_columns(['src', 'token_type_ids'])
    dataloader = set_dataloader(tokenized_src_dataset, tokenizer)

    embeddings_list = []
    with torch.no_grad():
        for inputs in tqdm(dataloader):
            inputs = {k: v.to(device) for k, v in inputs.items()}
            embeddings, _ = model(**inputs, return_dict=False)
            embeddings_list.append(embeddings[:,0,:].cpu())

    embeddings_tensors = torch.cat(embeddings_list, axis=0)
    kmeans = KMeans(n_clusters=label_size, random_state=42, n_init="auto").fit(embeddings_tensors)
    label_list = kmeans.labels_.tolist()
    centers = torch.tensor(kmeans.cluster_centers_)

    eu_val_list = []
    for n, cluster_idx in enumerate(label_list):
        dist = (embeddings_tensors[n] - centers[cluster_idx]).pow(2).sum().sqrt()
        eu_val_list.append(dist.item())

    return eu_val_list
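The score returned by `k_means` is each sentence's Euclidean distance to the centre of the cluster it was assigned to. The sketch below reproduces only that last step in plain Python, on toy 2-D vectors with labels and centres given by hand instead of fitted by `KMeans`; `distances_to_centroids` is a hypothetical name:

```python
import math


def distances_to_centroids(embeddings, labels, centers):
    """Euclidean distance from each embedding to the centre of its assigned cluster."""
    return [math.dist(vec, centers[label]) for vec, label in zip(embeddings, labels)]


# Toy data: two points in cluster 0 (centre at their mean) and one point alone in cluster 1.
toy_embeddings = [[0.0, 0.0], [6.0, 8.0], [10.0, 10.0]]
toy_labels = [0, 0, 1]
toy_centers = [[3.0, 4.0], [10.0, 10.0]]

dists = distances_to_centroids(toy_embeddings, toy_labels, toy_centers)
print(dists)  # [5.0, 5.0, 0.0]
```

A small distance marks a sentence that is typical of its cluster, and a large one marks an outlier. Which end of that range is kept when pruning is not decided in this notebook.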

Compute and save the embedding-based scores.

'BM-K/KoSimCSE-roberta-multitask' works pretty well for Korean.

'sentence-transformers/LaBSE' is good at multilingual sentence embeddings.

In [14]:
multilingual_model_path = 'sentence-transformers/LaBSE'
monolingual_model_path = 'BM-K/KoSimCSE-roberta-multitask'

for which_domain in tqdm(domains):
    multilingual_embeds_model = AutoModel.from_pretrained(multilingual_model_path)
    multilingual_embeds_model.to(device)
    multilingual_embeds_val = k_means(multilingual_embeds_model, multilingual_model_path, which_domain)

    monolingual_embeds_model = AutoModel.from_pretrained(monolingual_model_path)
    monolingual_embeds_model.to(device)
    monolingual_embeds_val = k_means(monolingual_embeds_model, monolingual_model_path, which_domain)

    translated_dict[which_domain]['self_sup_multi'] = multilingual_embeds_val
    translated_dict[which_domain]['self_sup_mono']  = monolingual_embeds_val
    save(which_domain, 'self_sup_multi')
    save(which_domain, 'self_sup_mono')

    del multilingual_embeds_model
    del monolingual_embeds_model
    gc.collect()
    torch.cuda.empty_cache()


# Reference-free COMET-based methods

In [15]:
def prune_by_comet(comet_model, which_domain):
    dataset_for_comet = Dataset.from_dict(
        {'src': translated_dict[which_domain]['src'], 'mt': translated_dict[which_domain]['translated'],
         'tgt': translated_dict[which_domain]['tgt']})
    dataset_for_comet_list = dataset_for_comet.to_list()

    with torch.no_grad():
        model_output = comet_model.predict(dataset_for_comet_list, batch_size=4, gpus=1, num_workers=0)

    comet_scores = [x for x in model_output['scores']]
    return comet_scores

Compute and save the COMET-based scores.

'Unbabel/wmt23-cometkiwi-da-xl' is a recent model and was used for the experiments; the cell below loads 'Unbabel/wmt22-cometkiwi-da'.

In [17]:
comet_model_path = 'Unbabel/wmt22-cometkiwi-da'
comet_model_path = download_model(comet_model_path)
comet_model = load_from_checkpoint(comet_model_path)

for which_domain in domains:
    comet_scores = prune_by_comet(comet_model, which_domain)
    translated_dict[which_domain]['refree_comet']  = comet_scores
    save(which_domain, 'refree_comet')
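This notebook only computes and saves the scores; the pruning step itself is not shown. As a hedged sketch of how a saved score column could be turned into a subset, the hypothetical helper below keeps a fixed fraction of the examples ranked by score. Both the fraction and the direction (highest or lowest first) are assumptions that depend on the method and the experiment:

```python
def keep_top_fraction(scores, fraction, highest_first=True):
    """Return the sorted indices of the `fraction` of examples ranked first by score."""
    n_keep = max(1, round(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=highest_first)
    return sorted(order[:n_keep])


# Toy scores, not real COMET outputs.
toy_scores = [0.81, 0.42, 0.93, 0.10, 0.67]
kept_high = keep_top_fraction(toy_scores, fraction=0.4)
kept_low = keep_top_fraction(toy_scores, fraction=0.4, highest_first=False)
print(kept_high)  # [0, 2]
print(kept_low)   # [1, 3]
```

For the NER-based columns, note that sentences without any named entity were given the placeholder score -1, so they would need separate handling before ranking.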
