
[Feature Request] torch_geometric support #18

Closed
plutonium-239 opened this issue May 27, 2022 · 5 comments · Fixed by #21
Assignees
Labels
api (Something related to the core APIs), enhancement (New feature or request)

Comments

@plutonium-239

First of all, thank you for the excellent nvitop.

I want to know if you have plans to add an integration with PyTorch Geometric (pyg). It is a really great library for GNNs. I don't know if it's helpful at all, but it also has some profiling functions in the torch_geometric.profile module.
Since PyTorch Lightning doesn't give you granular control over your models (sometimes required in research), I haven't seen anyone use it. On the flip side, PyTorch Geometric is probably the most popular library for GNNs.

Hope you consider this!

@XuehaiPan
Owner

I want to know if you have plans to add an integration with PyTorch Geometric (pyg). It is a really great library for GNNs. I don't know if it's helpful at all, but it also has some profiling functions in the torch_geometric.profile module.

Hi @plutonium-239, thanks for the suggestion. I'll look into it over the next two weeks.

Since PyTorch Lightning doesn't give you granular control over your models (sometimes required in research)

nvitop.core provides low-level APIs that users can integrate into their training/testing code, while nvitop.callbacks offers high-level APIs for frameworks.

I'm thinking of adding a more customizable mid-level API, e.g., a background daemon that gathers process status (on both the host and the GPUs) and lets users log the useful items to tensorboard.SummaryWriter, to CSV, or to stdout.
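The daemon idea above can be sketched with a plain background thread. This is a minimal, hypothetical sketch, not nvitop's actual implementation: `collect` and `on_record` are placeholders for whatever the mid-level API would expose (e.g., a collector's snapshot method and a logging callback).

```python
import threading
import time

def start_metrics_daemon(collect, on_record, interval=1.0):
    """Periodically call ``collect()`` and pass the result to ``on_record``.

    Both arguments are hypothetical placeholders: ``collect`` stands in for
    a status-gathering function, ``on_record`` for a logger (SummaryWriter,
    CSV writer, or print). Returns an Event; set it to stop the daemon.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            on_record(collect())  # gather one snapshot and hand it off
            stop.wait(interval)   # sleep, but wake immediately on stop

    threading.Thread(target=loop, daemon=True).start()
    return stop

# Usage sketch with a stubbed collect function:
records = []
stop = start_metrics_daemon(lambda: {'cpu_percent (%)': 12.3},
                            records.append, interval=0.05)
time.sleep(0.2)
stop.set()
```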

@XuehaiPan XuehaiPan changed the title torch_geometric support [Feature Request] torch_geometric support Jun 15, 2022
@XuehaiPan XuehaiPan added the enhancement New feature or request label Jun 15, 2022
@XuehaiPan XuehaiPan self-assigned this Jun 15, 2022
@XuehaiPan
Owner

XuehaiPan commented Jun 22, 2022

Hi @plutonium-239, I have looked into the source code of PyTorch Geometric (pyg). It looks like pyg supports PyTorch Lightning, so the existing callback in nvitop is also usable.

For flexibility, I implemented a new metric collector in PR #21, which gives the user full control over the code logic of DL training.

For example:

>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '3,2,1,0'

>>> from nvitop import ResourceMetricCollector, Device, CudaDevice

>>> collector = ResourceMetricCollector()                          # log all devices and children processes on the GPUs of the current process
>>> collector = ResourceMetricCollector(root_pids={1})             # log all devices and all GPU processes
>>> collector = ResourceMetricCollector(devices=CudaDevice.all())  # use the CUDA ordinal

>>> with collector(tag='<tag>'):
...     # do something
...     collector.collect()  # -> Dict[str, float]
# key -> '<tag>/<scope>/<metric (unit)>/<mean/min/max>'
{
    '<tag>/host/cpu_percent (%)/mean': 8.967849777683456,
    '<tag>/host/cpu_percent (%)/min': 6.1,
    '<tag>/host/cpu_percent (%)/max': 28.1,
    ...,
    '<tag>/host/memory_percent (%)/mean': 21.5,
    '<tag>/host/swap_percent (%)/mean': 0.3,
    '<tag>/host/memory_used (GiB)/mean': 91.0136418208109,
    '<tag>/host/load_average (%) (1 min)/mean': 10.251427386878328,
    '<tag>/host/load_average (%) (5 min)/mean': 10.072539414569503,
    '<tag>/host/load_average (%) (15 min)/mean': 11.91126970422139,
    ...,
    '<tag>/cuda:0 (gpu:3)/memory_used (MiB)/mean': 3.875,
    '<tag>/cuda:0 (gpu:3)/memory_free (MiB)/mean': 11015.562499999998,
    '<tag>/cuda:0 (gpu:3)/memory_total (MiB)/mean': 11019.437500000002,
    '<tag>/cuda:0 (gpu:3)/memory_percent (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/gpu_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/memory_utilization (%)/mean': 0.0,
    '<tag>/cuda:0 (gpu:3)/fan_speed (%)/mean': 22.0,
    '<tag>/cuda:0 (gpu:3)/temperature (C)/mean': 25.0,
    '<tag>/cuda:0 (gpu:3)/power_usage (W)/mean': 19.11166264116916,
    ...,
    '<tag>/cuda:1 (gpu:2)/memory_used (MiB)/mean': 8878.875,
    ...,
    '<tag>/cuda:2 (gpu:1)/memory_used (MiB)/mean': 8182.875,
    ...,
    '<tag>/cuda:3 (gpu:0)/memory_used (MiB)/mean': 9286.875,
    ...,
    '<tag>/pid:12345/host/cpu_percent (%)/mean': 151.34342772112265,
    '<tag>/pid:12345/host/host_memory (MiB)/mean': 44749.72373447514,
    '<tag>/pid:12345/host/host_memory_percent (%)/mean': 8.675082352111717,
    '<tag>/pid:12345/host/running_time (min)': 336.23803206741576,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory (MiB)/mean': 8861.0,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_percent (%)/mean': 80.4,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_memory_utilization (%)/mean': 6.711118172407917,
    '<tag>/pid:12345/cuda:1 (gpu:4)/gpu_sm_utilization (%)/mean': 48.23283397736476,
    ...,
    '<tag>/duration (s)': 7.247399162035435,
    '<tag>/timestamp': 1655909466.9981883
}
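The key format shown in the comment above ('&lt;tag&gt;/&lt;scope&gt;/&lt;metric (unit)&gt;/&lt;mean/min/max&gt;') can be split back into its parts with a few lines of plain Python. This parser is a hypothetical helper for post-processing the collected dict, not part of nvitop:

```python
def parse_metric_key(key):
    """Split a collector key of the form
    '<tag>/<scope>/<metric (unit)>/<mean/min/max>' into
    (tag, scope, metric, aggregator).

    Keys without an aggregator suffix (e.g. '<tag>/duration (s)')
    get aggregator None and whatever scope segments remain.
    """
    parts = key.split('/')
    tag = parts[0]
    if parts[-1] in ('mean', 'min', 'max'):
        return tag, '/'.join(parts[1:-2]), parts[-2], parts[-1]
    return tag, '/'.join(parts[1:-1]), parts[-1], None

# e.g. ('train', 'cuda:0 (gpu:3)', 'memory_used (MiB)', 'mean')
print(parse_metric_key('train/cuda:0 (gpu:3)/memory_used (MiB)/mean'))
```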

The results can be easily logged to TensorBoard or to a CSV file.

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

from nvitop import CudaDevice, ResourceMetricCollector
from nvitop.callbacks.tensorboard import add_scalar_dict

# Build networks and prepare datasets
...

# Logger and status collector
writer = SummaryWriter()
collector = ResourceMetricCollector(devices=CudaDevice.all(),  # log all visible CUDA devices and use the CUDA ordinal
                                    root_pids={os.getpid()},   # only log the children processes of the current process
                                    interval=1.0)              # snapshot interval for background daemon thread

# Start training
global_step = 0
for epoch in range(num_epoch):
    with collector(tag='train'):
        for batch in train_dataset:
            with collector(tag='batch'):
                metrics = train(net, batch)
                global_step += 1
                add_scalar_dict(writer, 'train', metrics, global_step=global_step)
                add_scalar_dict(writer, 'resources',      # tag='resources/train/batch/...'
                                collector.collect(),
                                global_step=global_step)

        add_scalar_dict(writer, 'resources',              # tag='resources/train/...'
                        collector.collect(),
                        global_step=epoch)

    with collector(tag='validate'):
        metrics = validate(net, validation_dataset)
        add_scalar_dict(writer, 'validate', metrics, global_step=epoch)
        add_scalar_dict(writer, 'resources',              # tag='resources/validate/...'
                        collector.collect(),
                        global_step=epoch)
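The CSV side of the comment above can be sketched with the standard library alone. `append_metrics_to_csv` is a hypothetical helper (not part of nvitop) that appends one flat metrics dict, such as the one returned by collector.collect(), as a row of a CSV file:

```python
import csv

def append_metrics_to_csv(path, metrics):
    """Append one flat metrics dict as a CSV row, writing the header
    (the sorted metric keys) only when the file is new or empty."""
    try:
        with open(path, newline='') as f:
            has_header = bool(f.readline())
    except FileNotFoundError:
        has_header = False
    with open(path, 'a', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=sorted(metrics))
        if not has_header:
            writer.writeheader()
        writer.writerow(metrics)
```

This assumes every call passes the same set of keys; with a background collector the dict layout is stable per tag, so one file per tag keeps the columns aligned.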

@XuehaiPan
Owner

The new feature is built on top of the mig-support branch, which has not been released yet. To install:

pip3 install git+https://github.com/XuehaiPan/nvitop.git@collector#egg=nvitop

Any feedback is welcome.

@plutonium-239
Author

This is awesome!
Thanks so much!

@XuehaiPan
Owner

Closed as resolved by PR #21.

@XuehaiPan XuehaiPan added the api Something related to the core APIs label Sep 7, 2022
This issue was closed.