# Fine-Pruning with a Sparse Trainer

In this notebook, we'll see how `nn_pruning` combines techniques from [movement pruning](https://arxiv.org/abs/2005.07683) and structured pruning to produce compact Transformers that can run inference faster than their dense counterparts, with little impact on accuracy. Let's get started!

In [4]:
# pip install transformers datasets

In [5]:
# pip install accelerate -U

In [6]:
# pip install nn_pruning

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import random
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm

import datasets
from datasets import Dataset
import transformers
transformers.logging.set_verbosity_error()

from transformers import BertForSequenceClassification, BertTokenizer

## Import dataset
Import dataset in DatasetDict format.

In [9]:
train_dataset_path = '/content/train_preprocess.tsv'
valid_dataset_path = '/content/valid_preprocess.tsv'
test_dataset_path = '/content/test_preprocess_masked_label.tsv'

In [10]:
df_train = pd.read_table(train_dataset_path, header=None)
df_valid = pd.read_table(valid_dataset_path, header=None)
df_test = pd.read_table(test_dataset_path, header=None)

df_train = df_train.rename(columns={0: "text", 1: "label"})
df_valid = df_valid.rename(columns={0: "text", 1: "label"})
df_test = df_test.rename(columns={0: "text", 1: "label"})

labels = {'positive': 0, 'neutral': 1, 'negative': 2}
df_train['label'] = df_train['label'].map(labels)
df_valid['label'] = df_valid['label'].map(labels)
df_test['label'] = df_test['label'].map(labels)
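`Series.map` with a dictionary yields `NaN` for any value that has no key in the dictionary. The file name `test_preprocess_masked_label.tsv` suggests the test labels are masked, so it may be worth checking which raw labels fall outside the mapping before relying on the mapped column. The sketch below uses plain Python lists; `find_unmapped_labels` and the placeholder value `"masked"` are hypothetical and not part of the original notebook.

```python
# Hypothetical helper for illustration; not part of the original notebook.
labels = {'positive': 0, 'neutral': 1, 'negative': 2}

def find_unmapped_labels(raw_labels, mapping):
    """Return the distinct raw labels with no entry in `mapping`, in first-seen order."""
    unmapped = []
    for label in raw_labels:
        if label not in mapping and label not in unmapped:
            unmapped.append(label)
    return unmapped

# Toy column values; "masked" stands in for whatever placeholder a masked file uses.
sample = ['positive', 'negative', 'masked', 'neutral', 'masked']
unmapped = find_unmapped_labels(sample, labels)
print(unmapped)
```

On the real data, the same check could be run on `df_test[1].unique()` before the `rename` and `map` calls.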

In [11]:
train_dataset = Dataset.from_pandas(df_train)
valid_dataset = Dataset.from_pandas(df_valid)
test_dataset = Dataset.from_pandas(df_test)

In [12]:
dd = datasets.DatasetDict({"train":train_dataset, "validation":valid_dataset, "test":test_dataset})
dd

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 11000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 500
    })
})

## Tokenize the inputs
Before we can fine-prune any models, the first thing we need to do is tokenize and encode the `text` field of each example. `nn_pruning` currently supports fine-pruning for BERT models, so we'll use the BERT-base checkpoint `indobenchmark/indobert-base-p1`. After fixing the random seed and selecting a device, we load its tokenizer as follows:

In [13]:
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

In [14]:
# Set random seed
set_seed(2023)

In [15]:
# Define device
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

device

device(type='cuda')

In [16]:
# Load Tokenizer
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

def tokenize_and_encode(examples):
    return tokenizer(examples['text'], max_length=512, truncation=True)

dd_enc = dd.map(tokenize_and_encode, batched=True)

In [17]:
dd_enc

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 11000
    })
    validation: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 500
    })
})

## Creating a sparse trainer

The next thing to do is create a trainer that can handle the fine-pruning and evaluation steps for us. In `nn_pruning` this is done via the `sparse_trainer.SparseTrainer` [mixin class](https://realpython.com/inheritance-composition-python/#mixing-features-with-mixin-classes) that provides extra methods for `transformers.Trainer` to "patch" or sparsify pretrained models and implement the various pruning techniques discussed in the movement pruning paper.

To keep things simple, we'll override the `compute_loss` function to ignore knowledge distillation and just return the cross-entropy loss.

In [18]:
from transformers import Trainer
from nn_pruning.sparse_trainer import SparseTrainer

class PruningTrainer(SparseTrainer, Trainer):
    def __init__(self, sparse_args, *args, **kwargs):
        Trainer.__init__(self, *args, **kwargs)
        SparseTrainer.__init__(self, sparse_args)

    def compute_loss(self, model, inputs, return_outputs=False):
        """
        We override the default loss in SparseTrainer because it throws an
        error when run without distillation
        """
        outputs = model(**inputs)

        # Save past state if it exists
        # TODO: this needs to be fixed and made cleaner later.
        if self.args.past_index >= 0:
            self._past = outputs[self.args.past_index]

        # We don't use .loss here since the model may return tuples instead of ModelOutput.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
        self.metrics["ce_loss"] += float(loss)
        self.loss_counter += 1
        return (loss, outputs) if return_outputs else loss
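To make the mixin pattern concrete, here is a minimal sketch with stand-in classes. `BaseTrainer`, `SparseMixin`, and `DemoTrainer` are hypothetical names that only mimic the shape of `Trainer`, `SparseTrainer`, and `PruningTrainer`. It shows why the mixin is listed first in the class definition (its methods take precedence in the method resolution order) and why both `__init__` methods are called explicitly.

```python
class BaseTrainer:
    """Stand-in for transformers.Trainer (hypothetical, for illustration only)."""
    def __init__(self, model=None):
        self.model = model

    def compute_loss(self, batch):
        return "base loss"


class SparseMixin:
    """Stand-in for SparseTrainer: adds its own state and overrides behaviour."""
    def __init__(self, sparse_args):
        self.sparse_args = sparse_args

    def compute_loss(self, batch):
        return "sparse loss"


class DemoTrainer(SparseMixin, BaseTrainer):
    def __init__(self, sparse_args, *args, **kwargs):
        # Each parent has a different signature, so both are initialised explicitly
        BaseTrainer.__init__(self, *args, **kwargs)
        SparseMixin.__init__(self, sparse_args)


trainer_demo = DemoTrainer({"final_threshold": 0.5}, model="toy-model")
mro_names = [cls.__name__ for cls in DemoTrainer.__mro__]
print(mro_names)
print(trainer_demo.compute_loss(batch=None))
```

Because the mixin comes first in the method resolution order, its `compute_loss` is the one that runs unless the subclass overrides it, which is exactly what `PruningTrainer` does above.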

Note that `SparseTrainer` expects `sparse_args` in its `__init__` method. These arguments are analogous to `transformers.TrainingArguments` and specify which pruning method is applied, whether knowledge distillation is activated, the associated hyperparameters, and more. Let's take a look at the defaults:

In [19]:
from nn_pruning.patch_coordinator import SparseTrainingArguments

sparse_args = SparseTrainingArguments()
sparse_args

SparseTrainingArguments(mask_scores_learning_rate=0.01, dense_pruning_method='topK', attention_pruning_method='topK', ampere_pruning_method='disabled', attention_output_with_dense=True, bias_mask=True, mask_init='constant', mask_scale=0.0, dense_block_rows=1, dense_block_cols=1, attention_block_rows=1, attention_block_cols=1, initial_threshold=1.0, final_threshold=0.5, initial_warmup=1, final_warmup=2, initial_ampere_temperature=0.0, final_ampere_temperature=20.0, regularization='disabled', regularization_final_lambda=0.0, attention_lambda=1.0, dense_lambda=1.0, distil_teacher_name_or_path=None, distil_alpha_ce=0.5, distil_alpha_teacher=0.5, distil_temperature=2.0, final_finetune=False, layer_norm_patch=False, layer_norm_patch_steps=50000, layer_norm_patch_start_delta=0.99, gelu_patch=False, gelu_patch_steps=50000, linear_min_parameters=0.005, rewind_model_name_or_path=None)

The main hyperparameters to tweak for fine-pruning are:

* `dense_pruning_method` / `attention_pruning_method`: determine how the matrix of mask scores is calculated for the dense/attention layers. Each can take one of the following values:
    * `l0`: $L_0$ regularization
    * `magnitude`: magnitude pruning
    * `topK`: movement pruning
    * `sigmoied_threshold`: soft movement pruning
* `initial_threshold`: the initial value of the masking threshold for scheduling. Set this to 1 when using `topK` (where it is the initial density) or 0 when using `sigmoied_threshold` (where it is the cutoff)
* `final_threshold`: the final value of the masking threshold. When using `topK`, this is the final density. With `sigmoied_threshold`, a good choice is 0.1
* `initial_warmup`: runs `initial_warmup` * `warmup_steps` steps of threshold warm-up, during which the threshold stays at its `initial_threshold` value (sparsity schedule)
* `final_warmup`: runs `final_warmup` * `warmup_steps` steps of threshold cool-down, during which the threshold stays at its `final_threshold` value (sparsity schedule)
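The interaction between the two thresholds and the two warm-up phases can be sketched as a small function. This is an illustration modelled on the cubic sparsity schedule described in the movement pruning paper, not a copy of the `nn_pruning` implementation, which may differ in its details. The function name `scheduled_threshold` is hypothetical.

```python
def scheduled_threshold(step, total_steps, warmup_steps,
                        initial_threshold=1.0, final_threshold=0.5,
                        initial_warmup=1, final_warmup=2):
    """Threshold at `step`: flat, then cubic decay, then flat again."""
    start = initial_warmup * warmup_steps             # end of the initial plateau
    end = total_steps - final_warmup * warmup_steps   # start of the final plateau
    if step <= start:
        return initial_threshold
    if step > end:
        return final_threshold
    progress = (step - start) / (end - start)         # goes from 0 to 1 during the decay
    return final_threshold + (initial_threshold - final_threshold) * (1 - progress) ** 3


# Illustrative values: 13750 total steps, 1375 warm-up steps, final_warmup=3
for step in (0, 1375, 5500, 9625, 13750):
    value = scheduled_threshold(step, 13750, 1375, final_warmup=3)
    print(step, round(value, 4))
```

With `topK`, the threshold is the fraction of weights kept, so a schedule of this shape prunes quickly at first and then slows down as it approaches the final density.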

For our example, let's use `topK` movement pruning and remove 50% of the weights in the encoder. We'll apply a form of "hybrid pruning" by performing block pruning on the attention layers and adding the `1d_alt` argument for the dense layers, which prunes alternating rows and columns and produces better results:

In [20]:
hyperparams = {
    "dense_pruning_method": "topK:1d_alt",
    "attention_pruning_method": "topK",
    "initial_threshold": 1.0,
    "final_threshold": 0.5,
    "initial_warmup": 1,
    "final_warmup": 3,
    "attention_block_rows":32,
    "attention_block_cols":32,
    "attention_output_with_dense": 0
}

for k,v in hyperparams.items():
    if hasattr(sparse_args, k):
        setattr(sparse_args, k, v)
    else:
        print(f"sparse_args does not have argument {k}")
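To illustrate what `topK` with a final threshold of 0.5 means, the sketch below builds a binary mask that keeps the half of the entries with the highest scores. In movement pruning these scores are learned during fine-tuning rather than derived from weight magnitudes, and with block pruning one score would cover a whole block (for example 32x32 weights) rather than a single weight. The helper `topk_mask` and the toy score matrix are hypothetical.

```python
def topk_mask(scores, density):
    """Return a 0/1 mask keeping the `density` fraction of entries with the highest scores."""
    flat = [s for row in scores for s in row]
    keep = int(round(density * len(flat)))
    # Flat indices sorted by score, highest first
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    kept = set(order[:keep])
    n_cols = len(scores[0])
    return [[1 if r * n_cols + c in kept else 0 for c in range(n_cols)]
            for r in range(len(scores))]


# Toy importance scores for a 2x4 weight matrix
scores = [[0.9, -0.2, 0.1, 0.4],
          [-0.5, 0.7, 0.0, 0.3]]
mask = topk_mask(scores, density=0.5)
print(mask)
```

During the forward pass the weight matrix is multiplied element-wise by such a mask, so the masked weights contribute nothing to the output.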

In addition to the pruning hyperparameters, we also need the usual training parameters like learning rate, batch size and so on. These can be configured using `transformers.TrainingArguments` as follows:

In [21]:
from transformers import TrainingArguments

batch_size = 8
learning_rate = 3e-6
num_train_epochs = 10
logging_steps = len(dd_enc["train"]) // batch_size  # one log per epoch
warmup_steps = (logging_steps * num_train_epochs) // 10  # 10% of the total steps, kept as an integer

args = TrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="epoch",
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=learning_rate,
    load_best_model_at_end=True,
    save_strategy='epoch',
    metric_for_best_model='accuracy',
    seed=2023,
    weight_decay=0.01,
    logging_steps=logging_steps,
    disable_tqdm=False,
    report_to=None,
    warmup_steps=warmup_steps
)
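As a sanity check on the step arithmetic, the numbers above can be worked out by hand. The sketch assumes a single device and no gradient accumulation, so that one optimizer step consumes exactly one batch; the pruning hyperparameters are the ones set in `hyperparams` earlier.

```python
train_examples = 11000
batch_size = 8
num_train_epochs = 10
initial_warmup, final_warmup = 1, 3  # values from the pruning hyperparameters

steps_per_epoch = train_examples // batch_size
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = total_steps // 10  # 10% of all steps

# Steps spent on each plateau of the threshold schedule, and on the decay in between
initial_plateau = initial_warmup * warmup_steps
final_plateau = final_warmup * warmup_steps
decay_steps = total_steps - initial_plateau - final_plateau

print(steps_per_epoch, total_steps, warmup_steps)
print(initial_plateau, decay_steps, final_plateau)
```

Under these assumptions the threshold is annealed over 8250 of the 13750 steps, and the last three epochs train at the final density.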

> Tip: a key ingredient for getting good results with movement pruning is to prune the model slowly by training for several epochs and including some amount of linear warmup (6-10% of the total steps is a good heuristic).

## Patching a Dense Model
To enable movement pruning, we need masked versions of BERT-base that can compute the adaptive mask in the forward pass. The way this is done in `nn_pruning` is via the `ModelPatchingCoordinator` class:

In [22]:
import torch
from nn_pruning.patch_coordinator import ModelPatchingCoordinator

mpc = ModelPatchingCoordinator(
    sparse_args=sparse_args,
    device=device,
    cache_dir="checkpoints",
    logit_names="logits",
    teacher_constructor=BertForSequenceClassification)

This class has several methods that control how pruning is applied during training and how to convert a pruned model into a format that is compatible with the `transformers` API for running inference. The first thing we need to do is "patch" our dense model, which can be achieved with the `ModelPatchingCoordinator.patch_model` function as follows:

In [23]:
bert_model = BertForSequenceClassification.from_pretrained("indobenchmark/indobert-base-p1").to(device)
mpc.patch_model(bert_model)

bert_model.save_pretrained("/content/models/patched")

## Fine-pruning
We almost have all the ingredients needed to fine-prune our model! The only thing missing is the `compute_metrics` function for our trainer, so let's load the `accuracy` metric from `datasets` to measure the performance of our model:

In [24]:
import numpy as np
from datasets import load_metric

accuracy_score = load_metric('accuracy')

def compute_metrics(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)

The last thing to do is instantiate our trainer:

In [25]:
trainer = PruningTrainer(
    sparse_args=sparse_args,
    args=args,
    model=bert_model,
    train_dataset=dd_enc["train"],
    eval_dataset=dd_enc["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Next, we register the patch coordinator with the trainer:


In [26]:
trainer.set_patch_coordinator(mpc)

Finally, we fine-prune:

In [27]:
trainer.train()



| Epoch | Training Loss | Validation Loss | Loss | Accuracy | Runtime | Samples Per Second | Steps Per Second | Threshold | Regu Lambda | Ampere Temperature |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5728 | 0.548191 | 0.549839 | 0.809524 | 25.2042 | 49.992 | 6.269 | 0.5 | 0.0 | 20.0 |
| 2 | 0.2292 | 0.421472 | 0.422693 | 0.859524 | 25.2838 | 49.834 | 6.249 | 0.5 | 0.0 | 20.0 |
| 3 | 0.19 | 0.314459 | 0.31537 | 0.894444 | 25.1402 | 50.119 | 6.285 | 0.5 | 0.0 | 20.0 |
| 4 | 0.1522 | 0.304795 | 0.305758 | 0.922222 | 24.8563 | 50.691 | 6.357 | 0.5 | 0.0 | 20.0 |
| 5 | 0.1305 | 0.318965 | 0.319975 | 0.935714 | 24.7585 | 50.892 | 6.382 | 0.5 | 0.0 | 20.0 |
| 6 | 0.1145 | 0.36904 | 0.37021 | 0.930952 | 25.0949 | 50.209 | 6.296 | 0.5 | 0.0 | 20.0 |
| 7 | 0.093 | 0.410497 | 0.411799 | 0.930159 | 25.2614 | 49.879 | 6.255 | 0.5 | 0.0 | 20.0 |
| 8 | 0.0819 | 0.429327 | 0.430689 | 0.928571 | 25.1979 | 50.004 | 6.27 | 0.5 | 0.0 | 20.0 |
| 9 | 0.065 | 0.406985 | 0.408276 | 0.93254 | 25.2833 | 49.835 | 6.249 | 0.5 | 0.0 | 20.0 |
| 10 | 0.0578 | 0.407292 | 0.408584 | 0.930159 | 25.4592 | 49.491 | 6.206 | 0.5 | 0.0 | 20.0 |


TrainOutput(global_step=13750, training_loss=0.16869874156605114, metrics={'train_runtime': 4263.1886, 'train_samples_per_second': 25.802, 'train_steps_per_second': 3.225, 'total_flos': 4135131884495568.0, 'train_loss': 0.16869874156605114, 'eval_threshold': 0.5, 'eval_regu_lambda': 0.0, 'eval_ampere_temperature': 20.0, 'epoch': 10.0})

In [28]:
output_model_path = "/content/drive/MyDrive/Model/finepruned"
trainer.save_model(output_model_path)

## Optimising for inference
Once a model has been fine-pruned, the weights that are masked during the forward pass can be set to zero and pruned once and for all, which reduces the amount of information to store. This is achieved by applying the `ModelPatchingCoordinator.compile_model` function, which transforms the model in-place and makes it compatible with `transformers`:

In [29]:
mpc.compile_model(trainer.model)

(1, 144)

However, this alone won't give us any speed-up during inference, because dense matrix multiplication does not get faster just because more values are zero. To take care of this, `nn_pruning` provides an `optimize_model` function that removes the zeroed rows and columns from the model and produces a pruned model that has fewer parameters (and is therefore faster at inference):

In [30]:
from nn_pruning.inference_model_patcher import optimize_model

prunebert_model = optimize_model(trainer.model, "dense")

removed heads 0, total_heads=143, percentage removed=0.0
bert.encoder.layer.0.intermediate.dense, sparsity = 50.00
bert.encoder.layer.0.output.dense, sparsity = 50.00
bert.encoder.layer.1.intermediate.dense, sparsity = 50.00
bert.encoder.layer.1.output.dense, sparsity = 50.00
bert.encoder.layer.2.intermediate.dense, sparsity = 50.00
bert.encoder.layer.2.output.dense, sparsity = 50.00
bert.encoder.layer.3.intermediate.dense, sparsity = 50.00
bert.encoder.layer.3.output.dense, sparsity = 50.00
bert.encoder.layer.4.intermediate.dense, sparsity = 50.00
bert.encoder.layer.4.output.dense, sparsity = 50.00
bert.encoder.layer.5.intermediate.dense, sparsity = 50.00
bert.encoder.layer.5.output.dense, sparsity = 50.00
bert.encoder.layer.6.intermediate.dense, sparsity = 50.00
bert.encoder.layer.6.output.dense, sparsity = 50.00
bert.encoder.layer.7.intermediate.dense, sparsity = 50.00
bert.encoder.layer.7.output.dense, sparsity = 50.00
bert.encoder.layer.8.intermediate.dense, sparsity = 50.00
bert.
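The idea behind removing zeroes can be sketched on a toy feed-forward block. If an entire row of the first weight matrix is zero, the corresponding hidden unit always outputs zero, so both that row and the matching column of the second matrix can be dropped without changing the result. The sketch below is a simplified illustration in plain Python: it omits biases, uses ReLU instead of GELU, and the helper names are hypothetical. It is not how `optimize_model` is implemented.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(W1, W2, x):
    """Toy feed-forward block without biases: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))

def compact(W1, W2):
    """Drop hidden units whose W1 row is entirely zero, with the matching W2 column."""
    keep = [i for i, row in enumerate(W1) if any(w != 0.0 for w in row)]
    W1_small = [W1[i] for i in keep]
    W2_small = [[row[i] for i in keep] for row in W2]
    return W1_small, W2_small

# 4 hidden units, two of which have been pruned to zero (50% row sparsity)
W1 = [[1.0, -2.0, 0.5],
      [0.0, 0.0, 0.0],
      [0.5, 1.0, -1.0],
      [0.0, 0.0, 0.0]]
W2 = [[1.0, 2.0, -1.0, 0.5],
      [0.0, -1.0, 2.0, 1.0]]
x = [3.0, 1.0, 2.0]

W1_small, W2_small = compact(W1, W2)
dense_out = ffn(W1, W2, x)
compact_out = ffn(W1_small, W2_small, x)
params_before = sum(len(r) for r in W1) + sum(len(r) for r in W2)
params_after = sum(len(r) for r in W1_small) + sum(len(r) for r in W2_small)
print(dense_out, compact_out, params_before, params_after)
```

This is why structured sparsity (whole rows, columns, or blocks) translates into real speed-ups on standard hardware, whereas scattered zeroes in a dense matrix do not.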

We can also see what fraction of total parameters remain in our pruned model:

In [31]:
prunebert_model.num_parameters() / bert_model.num_parameters()

0.771989124140676

To see what kind of inference gains our pruned model provides, let's write a simple function that computes the average latency from several runs involving a text to be classified:

In [35]:
from time import perf_counter

def compute_latencies(model,
                      text='Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
                      ):
    inputs = tokenizer(text, max_length=512, truncation=True, return_tensors="pt")
    latencies = []

    # Warmup
    for _ in range(10):
        _ = model(**inputs)

    # Timed runs
    for _ in range(100):
        start_time = perf_counter()
        _ = model(**inputs)
        latency = perf_counter() - start_time
        latencies.append(latency)

    # Compute run statistics once, after all timed runs have finished
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    # The escaped backslash prints "+\-" without relying on an invalid escape sequence
    print(f"Average latency (ms) - {time_avg_ms:.2f} +\\- {time_std_ms:.2f}")
    return {"time_avg_ms": time_avg_ms, "time_std_ms": time_std_ms}
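Latency measurements on a shared machine tend to be noisy, and a few slow runs can dominate the mean and standard deviation. As a complementary sketch, the helper below times an arbitrary callable and also reports the median and the 95th percentile, which are less sensitive to outliers. `time_callable` is a hypothetical helper that uses only the standard library; the toy workload stands in for a model call.

```python
import statistics
from time import perf_counter

def time_callable(fn, warmup=10, runs=100):
    """Time `fn` and return summary statistics in milliseconds."""
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        samples_ms.append(1000 * (perf_counter() - start))
    samples_ms.sort()
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        # Simple nearest-rank style percentile on the sorted samples
        "p95_ms": samples_ms[int(0.95 * (len(samples_ms) - 1))],
        "std_ms": statistics.pstdev(samples_ms),  # population std, like np.std
    }

# Toy workload standing in for a model forward pass
stats = time_callable(lambda: sum(range(10_000)), warmup=2, runs=20)
print(stats)
```

For a model, the callable could be something like `lambda: model(**inputs)`, ideally wrapped in `torch.no_grad()` so that autograd bookkeeping is not included in the timing.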

Let's use this function to calculate the latency of our pruned model:

In [36]:
latencies = {}
latencies["prunebert"] = compute_latencies(prunebert_model.to("cpu"))

Average latency (ms) - 273.43 +\- 76.85


## Load model and inference
Next, we reload the saved model from disk to check that it still works when used in another environment.

In [9]:
device = torch.device("cpu")
load_model_path = "/content/drive/MyDrive/Model/finepruned"

loaded_model = BertForSequenceClassification.from_pretrained(load_model_path).to(device)

In [12]:
from nn_pruning.patch_coordinator import SparseTrainingArguments
from nn_pruning.patch_coordinator import ModelPatchingCoordinator

sparse_args = SparseTrainingArguments()
hyperparams = {
    "initial_threshold": 1.0,
    "final_threshold": 0.5,
    "initial_warmup": 1,
    "final_warmup": 3,
    "attention_block_rows":32,
    "attention_block_cols":32,
    "attention_output_with_dense": 0
}

for k,v in hyperparams.items():
    if hasattr(sparse_args, k):
        setattr(sparse_args, k, v)
    else:
        print(f"sparse_args does not have argument {k}")

mpc = ModelPatchingCoordinator(
    sparse_args=sparse_args,
    device=device,
    cache_dir="checkpoints",
    logit_names="logits",
    teacher_constructor=BertForSequenceClassification)

In [15]:
bert_model = BertForSequenceClassification.from_pretrained("indobenchmark/indobert-base-p1").to(device)

In [16]:
print("BERT model params before patched:", bert_model.num_parameters())
print("Pruned model params before patched:", loaded_model.num_parameters())

BERT model params before patched: 124445189
Pruned model params before patched: 124445189


In [17]:
mpc.patch_model(bert_model)
mpc.compile_model(loaded_model)



(0, 144)

In [18]:
print("Model params after patched:", bert_model.num_parameters())
print("Pruned model params after patched:", loaded_model.num_parameters())

Model params after patched: 181095941
Pruned model params after patched: 124445189


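After patching, the parameter count of `bert_model` increases from 124,445,189 to 181,095,941. This is expected: patching attaches learnable mask scores to the prunable layers, and these scores are registered as additional model parameters. `loaded_model` was never patched, so its parameter count is unchanged.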
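A quick arithmetic check is consistent with the extra parameters being mask scores. The sketch below assumes BERT-base dimensions (hidden size 768, intermediate size 3072, 12 layers), one score per weight for the feed-forward layers (the default 1x1 dense blocks, since this reload configuration does not set `dense_pruning_method`), and one score per 32x32 block for the four attention projections in each layer. These are assumptions about how patching allocates scores, not statements taken from the `nn_pruning` documentation.

```python
hidden, intermediate, layers = 768, 3072, 12

# Feed-forward layers: two matrices per layer, one score per weight (1x1 blocks)
dense_scores = layers * 2 * hidden * intermediate

# Attention: query, key, value and output projections, one score per 32x32 block
block = 32
attention_scores = layers * 4 * (hidden // block) ** 2

extra = dense_scores + attention_scores
observed = 181095941 - 124445189  # parameter counts printed above
print(dense_scores, attention_scores, extra, observed)
```

Under these assumptions the predicted number of mask scores equals the observed increase in parameters.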

In [20]:
from nn_pruning.inference_model_patcher import optimize_model

prune_loaded_model = optimize_model(loaded_model, "dense")

removed heads 0, total_heads=144, percentage removed=0.0
bert.encoder.layer.0.intermediate.dense, sparsity = 0.00
bert.encoder.layer.0.output.dense, sparsity = 0.00
bert.encoder.layer.1.intermediate.dense, sparsity = 0.00
bert.encoder.layer.1.output.dense, sparsity = 0.00
bert.encoder.layer.2.intermediate.dense, sparsity = 0.00
bert.encoder.layer.2.output.dense, sparsity = 0.00
bert.encoder.layer.3.intermediate.dense, sparsity = 0.00
bert.encoder.layer.3.output.dense, sparsity = 0.00
bert.encoder.layer.4.intermediate.dense, sparsity = 0.00
bert.encoder.layer.4.output.dense, sparsity = 0.00
bert.encoder.layer.5.intermediate.dense, sparsity = 0.00
bert.encoder.layer.5.output.dense, sparsity = 0.00
bert.encoder.layer.6.intermediate.dense, sparsity = 0.00
bert.encoder.layer.6.output.dense, sparsity = 0.00
bert.encoder.layer.7.intermediate.dense, sparsity = 0.00
bert.encoder.layer.7.output.dense, sparsity = 0.00
bert.encoder.layer.8.intermediate.dense, sparsity = 0.00
bert.encoder.layer.8.o

In [21]:
prune_loaded_model.num_parameters()

124445189

In [22]:
import timeit

def benchmark(f, name=""):
    # warmup
    for _ in range(10):
        f()
    seconds_per_iter = timeit.timeit(f, number=100) / 100
    print(
        f"{name}:",
        f"{seconds_per_iter * 1000:.3f} ms",
    )

    return seconds_per_iter * 1000

In [24]:
tokenizer = BertTokenizer.from_pretrained(load_model_path)

text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
inputs = tokenizer.encode(text)
inputs = torch.LongTensor(inputs).view(1, -1).to("cpu")

In [25]:
speed_prune = benchmark(lambda: prune_loaded_model(inputs), "Pruned BERT")

Pruned BERT: 92.641 ms


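The reloaded model runs inference without errors. Note, however, that `optimize_model` reported 0.00 sparsity for every layer and the parameter count stayed at 124,445,189, so unlike `prunebert_model` above, this reloaded copy has not been structurally shrunk. With that caveat in mind, let's compare its overall performance to the dense fine-tuned version of the model.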

## Comparison Study
Here we compare the performance of the dense fine-tuned BERT and the fine-pruned BERT.

In [26]:
ft_model_path = "/content/drive/MyDrive/Model/"

# Load tokenizer and model
tokenizer_ft = BertTokenizer.from_pretrained(ft_model_path)
model_ft = BertForSequenceClassification.from_pretrained(ft_model_path).to("cpu")

In [27]:
size_ft = model_ft.num_parameters()
print(f"Dense params: {size_ft}")

Dense params: 124443651


In [28]:
size_pr = prune_loaded_model.num_parameters()
print(f"Pruned params: {size_pr}")

Pruned params: 124445189


In [29]:
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
inputs = tokenizer.encode(text)
inputs = torch.LongTensor(inputs).view(1, -1).to("cpu")

In [30]:
speed_dense = benchmark(lambda: model_ft(inputs), "Dense BERT")
speed_prune = benchmark(lambda: prune_loaded_model(inputs), "Pruned BERT")

Dense BERT: 97.863 ms
Pruned BERT: 77.152 ms


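The pruned model is faster than the dense one by about 21 ms (77.2 ms vs. 97.9 ms), which is roughly a 1.27x speed-up in this run. Keep in mind that the same pruned model measured 92.6 ms in the previous benchmark, so these single-sentence CPU timings vary considerably between runs.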

In [32]:
test_dataset_path = "/content/test_preprocess.tsv"
df_test = pd.read_table(test_dataset_path, header=None)
df_test.rename(columns={0: "text", 1: "label"}, inplace=True)
df_test.head()

| | text | label |
|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative |


In [33]:
def infer(text):
  print(text)
  i2w = {0: 'positive', 1: 'neutral', 2: 'negative'}
  inputs = tokenizer_ft.encode(text)
  inputs = torch.LongTensor(inputs).view(1, -1).to(model_ft.device)

  logits = model_ft(inputs)[0]
  label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
  return i2w[label]

In [34]:
df_test['pred_dense'] = df_test['text'].apply(infer)
df_test.head()

kemarin gue datang ke tempat makan baru yang ada di dago atas . gue kira makanan nya enak karena harga nya mahal . ternyata , boro-boro . tidak mau lagi deh ke tempat itu . sudah mana tempat nya juga tidak nyaman banget , terlalu sempit .
kayak nya sih gue tidak akan mau balik lagi ke tempat itu . gila , ya , gue enggak ngerti kenapa tempat nya dibiarkan panas . sudah begitu kotor pula . kalau panas kepanasan , kalau hujan kehujanan . harus nya sih tidak ada restoran yang kayak gitu . tidak tahu deh apa yang mereka jual .
kalau dipikir-pikir , sebenarnya tidak ada yang bisa dibanggakan dari jokowi . pertama , dia tidak bisa nepatin janji . kedua , kerjaan nya selalu pencitraan . ketiga , dia tidak pro rakyat . sudahlah . ku sudah terlanjur kecewa .
ini pertama kalinya gua ke bank buat ngurusin pembuatan rekening baru . nama nya juga orang pertama kali ya baru ke bank , gua kena semprot . kelihatan banget pelayanan pelanggan - nya tidak suka gua banyak bertanya . amit-amit . padahal itu

| | text | label | pred_dense |
|---|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative | negative |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative | negative |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative | negative |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative | negative |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative | negative |


In [37]:
def infer_pr(text):
  print(text)
  i2w = {0: 'positive', 1: 'neutral', 2: 'negative'}
  inputs = tokenizer.encode(text)
  inputs = torch.LongTensor(inputs).view(1, -1).to(prune_loaded_model.device)

  logits = prune_loaded_model(inputs)[0]
  label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()
  return i2w[label]

In [38]:
df_test['pred_pruned'] = df_test['text'].apply(infer_pr)
df_test.head()

| | text | label | pred_dense | pred_pruned |
|---|---|---|---|---|
| 0 | kemarin gue datang ke tempat makan baru yang a... | negative | negative | negative |
| 1 | kayak nya sih gue tidak akan mau balik lagi ke... | negative | negative | negative |
| 2 | kalau dipikir-pikir , sebenarnya tidak ada yan... | negative | negative | negative |
| 3 | ini pertama kalinya gua ke bank buat ngurusin ... | negative | negative | negative |
| 4 | waktu sampai dengan gue pernah disuruh ibu lat... | negative | negative | negative |


In [40]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

d = {
    "Accuracy": [accuracy_score(df_test['label'], df_test['pred_dense']),
                 accuracy_score(df_test['label'], df_test['pred_pruned']),],
    "Precision":[precision_score(df_test['label'], df_test['pred_dense'], average="macro"),
                 precision_score(df_test['label'], df_test['pred_pruned'], average="macro")],
    "Recall":   [recall_score(df_test['label'], df_test['pred_dense'], average="macro"),
                 recall_score(df_test['label'], df_test['pred_pruned'], average="macro")],
    "F1":       [f1_score(df_test['label'], df_test['pred_dense'], average="macro"),
                 f1_score(df_test['label'], df_test['pred_pruned'], average="macro")]
}

df_comp = pd.DataFrame.from_dict(d)
df_comp = df_comp.rename(index={0: 'Dense', 1: 'Pruned'})
df_comp['Inference Time'] = [speed_dense, speed_prune]
df_comp.to_csv('comparison.csv', index=False)
df_comp

| | Accuracy | Precision | Recall | F1 | Inference Time (ms) |
|---|---|---|---|---|---|
| Dense | 0.916 | 0.91558 | 0.875811 | 0.890512 | 97.863438 |
| Pruned | 0.928 | 0.910219 | 0.913741 | 0.911848 | 77.152169 |
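Since precision, recall, and F1 are computed with `average="macro"`, each class counts equally regardless of how many examples it has, which is why these metrics can move more than accuracy does. The sketch below computes macro F1 by hand on toy labels to make that explicit. `macro_f1` is a hypothetical helper written for illustration, and the toy data is not taken from the test set.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy, imbalanced example: one mistake on the small "neutral" class
y_true = ['negative'] * 6 + ['neutral'] * 2 + ['positive'] * 2
y_pred = ['negative'] * 6 + ['negative', 'neutral'] + ['positive'] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_macro = macro_f1(y_true, y_pred)
print(accuracy, round(f1_macro, 4))
```

In this toy case a single error on a minority class leaves accuracy at 0.9 but pulls macro F1 noticeably lower, so small differences in the table above may come down to a handful of minority-class examples.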


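After fine-pruning, the model is faster in this benchmark (77.2 ms vs. 97.9 ms per inference) and also scores higher on accuracy (0.928 vs. 0.916), recall (0.914 vs. 0.876), and F1 (0.912 vs. 0.891), while precision is slightly lower (0.910 vs. 0.916).

These results suggest that fine-pruning can be a viable way to reduce the inference time of a BERT model without sacrificing accuracy, although the margins are small and come from a single test set and a single-sentence CPU benchmark.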