In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Pytorch_memlab
**A library for memory profiling. Uses torch.cuda.memory_stats() inside.**

In [None]:
#!pip install git+https://github.com/stonesjtu/pytorch_memlab

In [None]:
!pip install torch transformers pytorch_memlab

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# **Memory Profiler**

The memory profiler is a modification of Python's line_profiler. It reports the memory usage for each line of code in the specified function or method.

In [None]:
import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
prof.display()


active_bytes (all, peak),reserved_bytes (all, peak),line,code
0.00B,0.00B,7,def outer():
40.00K,2.00M,8,"linear = torch.nn.Linear(100, 100).cuda()"
80.00K,2.00M,9,"linear2 = torch.nn.Linear(100, 100).cuda()"
120.00K,2.00M,10,inner()

active_bytes (all, peak),reserved_bytes (all, peak),line,code
80.00K,2.00M,4,def inner():
120.00K,2.00M,5,"torch.nn.Linear(100, 100).cuda()"
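
The 40.00K steps in the `outer` table can be checked by hand. The sketch below assumes float32 parameters (4 bytes each) and assumes that the CUDA caching allocator rounds each allocation up to a multiple of 512 bytes. `rounded_bytes` and `linear_bytes` are hypothetical helper names, not part of pytorch_memlab.

```python
BLOCK = 512  # assumed allocator granularity in bytes


def rounded_bytes(numel, itemsize=4):
    """Bytes needed for `numel` elements, rounded up to a multiple of BLOCK."""
    raw = numel * itemsize
    return -(-raw // BLOCK) * BLOCK  # ceiling division


def linear_bytes(in_features, out_features):
    """Weight plus bias of a Linear layer, each counted as its own allocation."""
    weight = rounded_bytes(in_features * out_features)
    bias = rounded_bytes(out_features)
    return weight + bias


per_layer = linear_bytes(100, 100)
print(per_layer, per_layer / 1024)  # bytes and KiB for one Linear(100, 100)
for n in (1, 2, 3):
    print(f"after {n} layer(s): {n * per_layer / 1024:.2f}K")
```

Under these assumptions each `Linear(100, 100)` accounts for exactly 40 KiB, which matches the 40.00K, 80.00K and 120.00K peaks in the table.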


# **Memory Reporter**

The Memory Profiler only reports overall memory usage line by line. The Memory Reporter provides lower-level information.

The Memory Reporter iterates over all Tensor objects and looks up the underlying Storage of each one, so it reports the actual memory usage rather than the surface Tensor.size.

In [None]:
import torch
from pytorch_memlab import MemReporter
linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()

In [None]:
reporter.print_stats()

Element type                                            Size  Used MEM


In [None]:
reporter.report()

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 1049600 	Used Memory: 4.00M
The allocated memory on cuda:0: 4.00M
-------------------------------------------------------------------------------
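
The totals in this report follow from the parameter shapes. Note that "Total Tensors" is an element count, not a number of tensor objects. The sketch below assumes float32 storage; `summarize` is a hypothetical helper.

```python
from math import prod


def summarize(shapes, itemsize=4):
    """Return (total element count, total size in MiB) for the given shapes."""
    numel = sum(prod(shape) for shape in shapes)
    return numel, numel * itemsize / 2**20


# weight and bias of torch.nn.Linear(1024, 1024)
numel, mib = summarize([(1024, 1024), (1024,)])
print(numel)          # 1049600
print(f"{mib:.2f}M")  # 4.00M
```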



In [None]:
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)
out = linear(inp).mean()
print('========= before backward =========')
reporter.report()
out.backward()
print('========= after backward =========')
reporter.report()

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                          (512, 1024)     2.00M
Tensor2                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2098177 	Used Memory: 8.00M
The allocated memory on cuda:0: 16.13M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Sto

In [None]:
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 1024).cuda()
linear2.weight = linear.weight
container = torch.nn.Sequential(
    linear, linear2
)
inp = torch.Tensor(512, 1024).cuda()
# run a forward and backward pass so that gradients are allocated
out = container(inp).mean()
out.backward()

# verbose shows how storage is shared across multiple Tensors
reporter = MemReporter(container)
reporter.report(verbose=True)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter0.grad                                 (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
Parameter1.grad                                      (1024,)     4.00K
Tensor2                                          (512, 1024)     2.00M
Tensor3                                                 (1,)   512.00B
Parameter0.grad(->Parameter0.grad)              (1024, 1024)     0.00B
Parameter1.grad(->Parameter1.grad)                   (1024,)     0.00B
0.weight                                        (1024, 1024)     4.00M
0.weight.grad                                   (1024, 1024)     4.00M
0.bias                                               (1024,)     4.00K
0.bias.grad                                       

In [None]:
import torch
from pytorch_memlab import MemReporter

lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(verbose=True)
inp = torch.Tensor(10, 10, 1024).cuda()
out, _ = lstm(inp)
out.mean().backward()
reporter.report(verbose=True)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter0.grad                                 (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
Parameter1.grad                                      (1024,)     4.00K
Parameter2                                           (1024,)     4.00K
Parameter2.grad                                      (1024,)     4.00K
Tensor3                                          (512, 1024)     2.00M
Tensor4                                                 (1,)   512.00B
Parameter0.grad(->Parameter0.grad)              (1024, 1024)     0.00B
Parameter1.grad(->Parameter1.grad)                   (1024,)     0.00B
Parameter2.grad(->Parameter2.grad)                   (1024,)     0.00B
weight_ih_l0                                    (4

In [None]:
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)
out = linear(inp * (inp + 2)).mean()
reporter.report()

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                          (512, 1024)     2.00M
Tensor2                                          (512, 1024)     2.00M
Tensor3                                                 (1,)   512.00B
Tensor4                                         (1024, 1024)     4.00M
Tensor5                                              (1024,)     4.00K
Tensor6                                              (1024,)     4.00K
Tensor7                                       (10, 10, 1024)   400.00K
Tensor8                                       (10, 10, 1024)   400.00K
Tensor9                                         (4

# **Courtesy**

Sometimes other people would like to preempt your running task, but you do not want to save a checkpoint and load it again later. All they actually need is GPU resources (CPU resources and CPU memory are typically spare in GPU clusters), so you can move your whole workspace from GPU to CPU and halt your task until a restart signal is triggered, instead of saving and loading checkpoints and bootstrapping from scratch.

In [None]:
from pytorch_memlab import Courtesy

iamcourtesy = Courtesy()
#for i in range(num_iteration):
#    if something_happens:
#        iamcourtesy.yield_memory()
#        wait_for_restart_signal()
#        iamcourtesy.restore()
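
The control flow in the commented loop can be tried without a GPU. The sketch below is a toy simulation: `ToyCourtesy` is a hypothetical stand-in that only tracks where each named tensor lives. It reuses the two method names from the cell above but says nothing about how pytorch_memlab implements them.

```python
class ToyCourtesy:
    """Hypothetical stand-in that only records the device of each tensor."""

    def __init__(self, placement):
        self.placement = placement  # name -> "cuda" or "cpu"
        self._yielded = []

    def yield_memory(self):
        # Move everything that is on the GPU to the CPU and remember it.
        self._yielded = [n for n, dev in self.placement.items() if dev == "cuda"]
        for name in self._yielded:
            self.placement[name] = "cpu"

    def restore(self):
        # Move the remembered tensors back to the GPU.
        for name in self._yielded:
            self.placement[name] = "cuda"
        self._yielded = []


placement = {"weight": "cuda", "bias": "cuda", "dataset_index": "cpu"}
courtesy = ToyCourtesy(placement)
history = []

for step in range(4):
    if step == 2:  # pretend a preemption request arrives here
        courtesy.yield_memory()
        history.append(dict(placement))  # the GPU is free while we wait
        courtesy.restore()
    history.append(dict(placement))

print(history[2])   # snapshot taken while the GPU was yielded
print(history[-1])  # snapshot after restore
```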

In [None]:
iamcourtesy

<pytorch_memlab.courtesy.Courtesy at 0x7fbb7638fe50>

# **Demo**

In [None]:
import torch
from pytorch_memlab import LineProfiler, MemReporter, profile
from transformers import BertForTokenClassification, BertTokenizerFast

In [None]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

In [None]:
model = BertForTokenClassification.from_pretrained(
                'bert-base-cased',
                num_labels=10
).cuda()

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [None]:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# **GPT-3 (GPT-Neo checkpoint)**

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("minhtoan/gpt3-small-finetune-cnndaily-news")

model = AutoModelForCausalLM.from_pretrained("minhtoan/gpt3-small-finetune-cnndaily-news").cuda()

In [None]:
model

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(2048, 768)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_fe

# Memory Reporter

We can inspect the memory used by the model tensors.

In [None]:
reporter = MemReporter(model)


In [None]:
#BERT
reporter.report(device=device, verbose=True)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Tensor0                                         (1024, 1024)     4.00M
Tensor1                                              (1024,)     4.00K
Tensor2                                              (1024,)     4.00K
Tensor3                                         (4096, 1024)    32.03M
Tensor4(->Tensor3)                              (4096, 1024)     0.00B
Tensor5(->Tensor3)                                   (4096,)     0.00B
Tensor6(->Tensor3)                                   (4096,)     0.00B
Parameter7                                           (4096,)    32.03M
Parameter7.grad(->Tensor3)                           (4096,)     0.00B
Parameter8(->Parameter7)                             (4096,)     0.00B
Parameter8.grad(->Tensor3)                           (4096,)     0.00B
Parameter9(->Parameter7)                        (4


In [None]:
#GPT3
reporter.report(device=device, verbose = True)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Tensor0                                   (1, 1, 2048, 2048)     4.00M
Tensor1                                                 (1,)   512.00B
Tensor2                                   (1, 1, 2048, 2048)     4.00M
Tensor3                                                 (1,)   512.00B
Tensor4                                   (1, 1, 2048, 2048)     4.00M
Tensor5                                                 (1,)   512.00B
Tensor6                                   (1, 1, 2048, 2048)     4.00M
Tensor7                                                 (1,)   512.00B
Tensor8                                   (1, 1, 2048, 2048)     4.00M
Tensor9                                                 (1,)   512.00B
Tensor10                                  (1, 1, 2048, 2048)     4.00M
Tensor11                                          


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

In [None]:
#FLAN T5
reporter = MemReporter(model)

reporter.report(device=device, verbose=True)

In [None]:
data = tokenizer(['This is a sentence'], return_tensors='pt').to(device)
labels = torch.Tensor([1] * len(data.input_ids[0])).to(dtype=torch.long).cuda()

In [None]:
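# return_dict=False makes the model return a plain tuple; unpacking the default
# ModelOutput would iterate over its keys instead of its values
loss, logits = model(data.input_ids, token_type_ids=None, attention_mask=data.attention_mask, labels=labels, return_dict=False)[:2]
#loss.backward()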

Once loss.backward() is called (it is commented out in the cell above), the gradients are computed and appear in the memory inspection.

In [None]:
reporter = MemReporter(model)
reporter.report(device=device)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Tensor0                                         (1024, 1024)     4.00M
Tensor1                                              (1024,)     4.00K
Tensor2                                              (1024,)     4.00K
Tensor3                                         (4096, 1024)    32.03M
Tensor4                                         (4096, 1024)     0.00B
Tensor5                                              (4096,)     0.00B
Tensor6                                              (4096,)     0.00B
Parameter7                                           (4096,)    32.03M
Parameter7.grad                                      (4096,)     0.00B
Parameter8                                           (4096,)     0.00B
Parameter8.grad                                      (4096,)     0.00B
Parameter9                                      (4

## Shared parameters

Some tensors share the same underlying memory. With verbose=True, reused memory is marked with '->'.

In [None]:
%reset -f

In [None]:
import torch
from pytorch_memlab import LineProfiler, MemReporter
device = torch.device('cuda:0')

In [None]:
# use verbose=True to see reused memory
lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(device=device, verbose=True)

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
-------------------------------------------------------------------------------
Total Tensors: 8396800 	Used Memory: 32.03M
The allocated memory on cuda:0: 48.28M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
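
The single 32.03M entry can be reproduced from the LSTM shapes. The sketch below assumes float32 parameters and assumes that all four tensors are views into one flat buffer, as the `->` markers in the report indicate. The variable names are illustrative.

```python
from math import prod

# Parameter shapes of torch.nn.LSTM(1024, 1024), as listed in the report
shapes = {
    "weight_ih_l0": (4096, 1024),
    "weight_hh_l0": (4096, 1024),
    "bias_ih_l0": (4096,),
    "bias_hh_l0": (4096,),
}

total_elements = sum(prod(shape) for shape in shapes.values())
total_mib = total_elements * 4 / 2**20  # 4 bytes per float32 element

print(total_elements)       # 8396800
print(f"{total_mib:.2f}M")  # 32.03M
```

The whole buffer is attributed to the first tensor that references it, which is why the other three rows show 0.00B.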


## Leaking memory

Sometimes the used memory and the allocated memory are not equal. This looks like a memory leak: the gap is memory that has been allocated but does not belong to any tensor the reporter can enumerate (the report itself attributes it to alignment or invisible gradient buffer tensors). You can see that the gap exists, but unfortunately you cannot inspect it. In the example below, *input_tensor + 2* is a temporary result that is stored but not shown in the memory inspection.

(If you run this notebook with torch==1.10.2 + transformers==4.17.0 + pytorch-memlab==0.2.4, the leak is gone: you will see Tensor2 accounting for the temporary result.)

In [None]:
%reset -f

In [None]:
import torch
from pytorch_memlab import LineProfiler, MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
input_tensor = torch.Tensor(512, 1024).cuda()
reporter = MemReporter(linear)
reporter.report()

out = linear(input_tensor * (input_tensor + 2)).mean()
reporter.report()

Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (4096, 1024)    32.03M
Parameter1                                      (4096, 1024)     0.00B
Parameter2                                           (4096,)     0.00B
Parameter3                                           (4096,)     0.00B
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor4                                          (512, 1024)     2.00M
-------------------------------------------------------------------------------
Total Tensors: 9970688 	Used Memory: 38.04M
The allocated memory on cuda:0: 54.29M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
Element typ

# Line Profiler

Line profiler can show memory usage line by line.

In [None]:
%reset -f

In [None]:
import torch
from pytorch_memlab import LineProfiler, MemReporter, profile
from transformers import BertForTokenClassification, BertTokenizerFast, BertModel

### A simple case

In [None]:
def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
prof.display()


active_bytes (all, peak),reserved_bytes (all, peak),line,code
58.29M,72.00M,4,def outer():
58.32M,72.00M,5,"linear = torch.nn.Linear(100, 100).cuda()"
58.36M,72.00M,6,"linear2 = torch.nn.Linear(100, 100).cuda()"
58.40M,72.00M,7,inner()

active_bytes (all, peak),reserved_bytes (all, peak),line,code
58.36M,72.00M,1,def inner():
58.40M,72.00M,2,"torch.nn.Linear(100, 100).cuda()"


### Trying to profile BERT

In [None]:
def initialize_model():
    model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=10).cuda()
    return model

In [None]:
def get_data():
    device = torch.device('cuda:0')
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
    data = tokenizer(['This is a sentence'], return_tensors='pt').to(device)
    labels = torch.Tensor([1] * len(data.input_ids[0])).to(dtype=torch.long).cuda()
    return data, labels

In [None]:
def run_model():
    model = initialize_model()
    data, labels = get_data()
    loss, logits = model(data.input_ids, token_type_ids=None, attention_mask=data.attention_mask, labels=labels)
    return loss

In [None]:
with LineProfiler(run_model, initialize_model, get_data) as prof:
    run_model()
prof.display()

active_bytes (all, peak),reserved_bytes (all, peak),line,code
58.29M,72.00M,1,def run_model():
469.49M,518.00M,2,model = initialize_model()
469.49M,518.00M,3,"data, labels = get_data()"
473.95M,522.00M,4,"loss, logits = model(data.input_ids, token_type_ids=None, attention_mask=data.attention_mask, labels=labels)"
469.49M,522.00M,5,return loss

active_bytes (all, peak),reserved_bytes (all, peak),line,code
58.29M,72.00M,1,def initialize_model():
469.49M,518.00M,2,"model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=10).cuda()"
469.49M,518.00M,3,return model

active_bytes (all, peak),reserved_bytes (all, peak),line,code
469.49M,518.00M,1,def get_data():
469.49M,518.00M,2,device = torch.device('cuda:0')
469.49M,518.00M,3,tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
469.49M,518.00M,4,"data = tokenizer(['This is a sentence'], return_tensors='pt').to(device)"
469.49M,518.00M,5,labels = torch.Tensor([1] * len(data.input_ids[0])).to(dtype=torch.long).cuda()
469.49M,518.00M,6,"return data, labels"
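
One way to read these tables is to look at how much the peak changes from one line to the next. The sketch below parses the active_bytes column of the `run_model` table above. `parse_size` and the hard-coded rows are illustrative and not part of pytorch_memlab.

```python
UNITS = {"B": 1, "K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text):
    """Convert a string such as '469.49M' or '0.00B' to a number of bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


# (peak active_bytes, source line) copied from the run_model table
rows = [
    ("58.29M", "def run_model():"),
    ("469.49M", "model = initialize_model()"),
    ("469.49M", "data, labels = get_data()"),
    ("473.95M", "loss, logits = model(...)"),
    ("469.49M", "return loss"),
]

sizes = [parse_size(size) for size, _ in rows]
# Change in MiB between each line and the one before it
deltas = [(after - before) / 2**20 for before, after in zip(sizes, sizes[1:])]

for (_, code), delta in zip(rows[1:], deltas):
    print(f"{delta:+8.2f}M  {code}")
```

Read this way, loading the model accounts for roughly 411M of the peak and the forward pass adds about 4.5M on top.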


This is not very informative. Let's look at the BERT forward function instead...

### BERT forward function

In [None]:
%reset -f

In [None]:
import torch
from pytorch_memlab import LineProfiler, MemReporter, profile
from transformers import BertForTokenClassification, BertTokenizerFast

class ProfiledBertForTokenClassification(BertForTokenClassification):
    def __init__(self, config):
        super().__init__(config)

    def forward(self, *args, **kwargs):
        with LineProfiler(super().forward) as prof:
            result = super().forward(*args, **kwargs)
        # jupyter display stops working here, so I had to print stats
        print(prof.display())
        return result

model = ProfiledBertForTokenClassification.from_pretrained('bert-base-cased', num_labels=10).cuda()
device = torch.device('cuda:0')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
data = tokenizer(['This is a sentence'], return_tensors='pt').to(device)
labels = torch.Tensor([1] * len(data.input_ids[0])).to(dtype=torch.long).cuda()
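# return_dict=False yields a plain tuple, so the unpacking below gives tensors rather than key names
loss, logits = model(data.input_ids, token_type_ids=None, attention_mask=data.attention_mask, labels=labels, return_dict=False)[:2]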

Some weights of the model checkpoint at bert-base-cased were not used when initializing ProfiledBertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing ProfiledBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ProfiledBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ProfiledBertForTokenClassification were not initialized from the m

## BertForTokenClassification.forward

active_bytes reserved_bytes line code                                                                                                                    
         all            all                                                                                                                              
        peak           peak                                                                                                                              
     469.49M        518.00M 1731     @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))                 
                            1732     @add_code_sample_docstrings(                                                                                        
                            1733         checkpoint=_CHECKPOINT_FOR_TOKEN_CLASSIFICATION,                                                                
                            1734     


In [None]:
!pip install torch_tb_profiler

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch_tb_profiler
  Downloading torch_tb_profiler-0.4.1-py3-none-any.whl (1.1 MB)
Installing collected packages: torch_tb_profiler
Successfully installed torch_tb_profiler-0.4.1


In [None]:
from torch.profiler import tensorboard_trace_handler

In [None]:
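# NOTE: model, trainloader, criterion and optimizer must already be defined;
# without them this cell fails with a NameError.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=2,
        warmup=2,
        active=6,
        repeat=1),
    # tensorboard_trace_handler must be called with an output directory;
    # './log' is a placeholder name
    on_trace_ready=tensorboard_trace_handler('./log'),
    with_stack=True) as profiler:
    for step, data in enumerate(trainloader, 0):
        print("step:{}".format(step))
        inputs, labels = data[0].to(device=device), data[1].to(device=device)

        outputs = model(inputs)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # tell the profiler that one training step has finished
        profiler.step()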

NameError: ignored

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-no

In [None]:
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments, TrainerCallback
import torch
import numpy as np
import time


raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch", num_train_epochs=1, fp16=True)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

start = time.perf_counter()

class ProfCallback(TrainerCallback):
    def __init__(self, prof):
        self.prof = prof

    def on_step_end(self, args, state, control, **kwargs):
        self.prof.step()

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU,
                                        torch.profiler.ProfilerActivity.CUDA], 
                            schedule=torch.profiler.schedule(skip_first=3, wait=1, warmup=1, active=2, repeat=2),
                            on_trace_ready=torch.profiler.tensorboard_trace_handler('hf-training-trainer'),
                            profile_memory=True,
                            with_stack=True,
                            record_shapes=True) as prof:
    
    trainer.add_callback(ProfCallback(prof=prof))
    trainer.train()

print(f'training time, {(time.perf_counter() - start):.1f} s')

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.424045,0.803922,0.861592


training time, 71.5 s
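
The `schedule` arguments decide which training steps are actually traced. The sketch below re-implements the documented semantics in plain Python: skip `skip_first` steps, then cycle through `wait`, `warmup` and `active` phases for `repeat` cycles. It is an illustration, not the library's code, and `phase` is a hypothetical helper.

```python
def phase(step, *, skip_first=0, wait=0, warmup=0, active=1, repeat=0):
    """Return the phase that `step` falls into under a wait/warmup/active schedule."""
    if step < skip_first:
        return "skip"
    step -= skip_first
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "done"  # all requested cycles have finished
    position = step % cycle
    if position < wait:
        return "wait"
    if position < wait + warmup:
        return "warmup"
    return "active"


# Values used with the Trainer above
settings = dict(skip_first=3, wait=1, warmup=1, active=2, repeat=2)
phases = [phase(step, **settings) for step in range(12)]
for step, name in enumerate(phases):
    print(step, name)

recorded = [step for step, name in enumerate(phases) if name == "active"]
print(recorded)  # [5, 6, 9, 10]
```

With these settings only steps 5, 6, 9 and 10 are recorded; every other step of the training run is skipped, waited out, or used for warmup.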
