[Feature Request] torch_geometric support #18
Comments
Hi @plutonium-239, thanks for the suggestion. I'll look into it over the next two weeks.
I'm thinking of adding a more customizable mid-level API, e.g., a background daemon gathering the process status (both on the host and on the GPUs), and letting users log the useful items themselves.
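To make the "background daemon" idea concrete, here is a minimal stdlib-only sketch of that pattern: a daemon thread periodically samples a user-supplied metric function, and the caller aggregates mean/min/max afterwards. All names here (`MetricSampler`, `sample_fn`) are illustrative, not nvitop API.

```python
import threading
import time


class MetricSampler:
    """Background daemon that periodically samples a metric (illustrative sketch)."""

    def __init__(self, sample_fn, interval=0.1):
        self.sample_fn = sample_fn  # callable returning a float
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(self.sample_fn())
            self._stop.wait(self.interval)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    def collect(self):
        s = self.samples
        return {'mean': sum(s) / len(s), 'min': min(s), 'max': max(s)}


with MetricSampler(lambda: 42.0, interval=0.01) as sampler:
    time.sleep(0.05)  # "do something" while the daemon samples in the background
stats = sampler.collect()
# stats == {'mean': 42.0, 'min': 42.0, 'max': 42.0}
```

The real collector would replace `sample_fn` with host/GPU status queries and keep per-key sample lists, but the threading skeleton is the same.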
Hi @plutonium-239, I have looked into the source code of `torch_geometric`. For flexibility, I implemented a new metric collector in PR #21, which gives the user full control over the code logic of DL training. For example:

```python
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '3,2,1,0'

>>> from nvitop import ResourceMetricCollector, Device, CudaDevice

>>> collector = ResourceMetricCollector()                          # log all devices and child processes of the current process on the GPUs
>>> collector = ResourceMetricCollector(root_pids={1})             # log all devices and all GPU processes
>>> collector = ResourceMetricCollector(devices=CudaDevice.all())  # use the CUDA ordinal

>>> with collector(tag='<tag>'):
...     # Do something
...     collector.collect()  # -> Dict[str, float]
# key -> '<tag>/<scope>/<metric (unit)>/<mean/min/max>'
{
    '<tag>/host/cpu_percent (%)/mean': 8.967849777683456,
    '<tag>/host/cpu_percent (%)/min': 6.1,
    '<tag>/host/cpu_percent (%)/max': 28.1,
    ...,
    '<tag>/host/memory_percent (%)/mean': 21.5,
    '<tag>/host/swap_percent (%)/mean': 0.3,
    '<tag>/host/memory_used (GiB)/mean': 91.0136418208109,
    '<tag>/host/load_average (%) (1 min)/mean': 10.251427386878328,
    '<tag>/host/load_average (%) (5 min)/mean': 10.072539414569503,
    '<tag>/host/load_average (%) (15 min)/mean': 11.91126970422139,
    ...,
    '<tag>/cuda:0 (gpu:3)/memory_used (MiB)/mean': 3.875,
    '<tag>/cuda:0 (gpu:3)/memory_free (MiB)/mean': 11015.562499999998,
    '<tag>/cuda:0 (gpu:3)/memory_total (MiB)/mean': 11019.437500000002,
    '<tag>/cuda:0 (gpu:3)/memory_percent (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/gpu_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/memory_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/fan_speed (%)/mean': 22.0,
    '<tag>/cuda:0 (gpu:3)/temperature (C)/mean': 25.0,
    '<tag>/cuda:0 (gpu:3)/power_usage (W)/mean': 19.11166264116916,
    ...,
    '<tag>/cuda:1 (gpu:2)/memory_used (MiB)/mean': 8878.875,
    ...,
    '<tag>/cuda:2 (gpu:1)/memory_used (MiB)/mean': 8182.875,
    ...,
    '<tag>/cuda:3 (gpu:0)/memory_used (MiB)/mean': 9286.875,
    ...,
    '<tag>/pid:12345/host/cpu_percent (%)/mean': 151.34342772112265,
    '<tag>/pid:12345/host/host_memory (MiB)/mean': 44749.72373447514,
    '<tag>/pid:12345/host/host_memory_percent (%)/mean': 8.675082352111717,
    '<tag>/pid:12345/host/running_time (min)': 336.23803206741576,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory (MiB)/mean': 8861.0,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_percent (%)/mean': 80.4,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_utilization (%)/mean': 6.711118172407917,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_sm_utilization (%)/mean': 48.23283397736476,
    ...,
    '<tag>/duration (s)': 7.247399162035435,
    '<tag>/timestamp': 1655909466.9981883
}
```

The results can be easily logged into TensorBoard or saved to a CSV file:

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from nvitop import CudaDevice, ResourceMetricCollector
from nvitop.callbacks.tensorboard import add_scalar_dict

# Build networks and prepare datasets
...

# Logger and status collector
writer = SummaryWriter()
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the child processes of the current process
                                    interval=1.0)              # snapshot interval for the background daemon thread

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                metrics = train(net, batch)
                global_step += 1
                add_scalar_dict(writer, 'train', metrics, global_step=global_step)
                add_scalar_dict(writer, 'resources',  # tag='resources/train/batch/...'
                                collector.collect(),
                                global_step=global_step)
        add_scalar_dict(writer, 'resources',  # tag='resources/train/...'
                        collector.collect(),
                        global_step=epoch)

    with collector(tag='validate'):
        metrics = validate(net, validation_dataset)
        add_scalar_dict(writer, 'validate', metrics, global_step=epoch)
        add_scalar_dict(writer, 'resources',  # tag='resources/validate/...'
                        collector.collect(),
                        global_step=epoch)
```
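Since every key follows the flat pattern `'<tag>/<scope>/<metric (unit)>/<mean/min/max>'`, a collected dict maps directly onto one CSV row per `collect()` call. Here is a minimal stdlib-only sketch of that idea; the `metrics` dict below is a truncated, illustrative example, not real collector output.

```python
import csv
import io

# Illustrative metrics dict in the collector's key format (values made up)
metrics = {
    'train/host/cpu_percent (%)/mean': 8.97,
    'train/host/cpu_percent (%)/max': 28.1,
    'train/cuda:0 (gpu:3)/gpu_utilization (%)/mean': 0.0,
    'train/duration (s)': 7.25,
}

buf = io.StringIO()  # stand-in for an open CSV file
writer = csv.DictWriter(buf, fieldnames=sorted(metrics))
writer.writeheader()
writer.writerow(metrics)  # one row per collect() call
print(buf.getvalue())
```

With a real file handle in place of `io.StringIO`, repeated `writerow(collector.collect())` calls would append one snapshot per row.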
Since the new feature is built on top of the `collector` branch, you can try it out with:

```bash
pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector#egg=nvitop
```

Any feedback is welcome.
This is awesome!
Closed as resolved by PR #21.
First of all, thank you for the excellent nvitop. I want to know if you have plans to add an integration with PyTorch Geometric (pyg)? It is a really great library for GNNs. I don't know if it's helpful at all, but it also has some profiling functions in the `torch_geometric.profile` module. Since PyTorch Lightning doesn't give you granular control over your models (sometimes required in research), I haven't seen anyone use it. On the flip side, PyTorch Geometric is probably the most popular library for GNNs.
Hope you consider this!