
[Bug] OOM while inferencing on ogbn-papers100M for link prediction #77

Closed
isratnisa opened this issue Apr 10, 2023 · 1 comment · Fixed by #86
Assignees: isratnisa
Labels: bug (Something isn't working), v0.1

Comments

@isratnisa
Contributor

isratnisa commented Apr 10, 2023

Attempting inference on ogbn-papers100M for link prediction causes an OOM (out-of-memory) error, as shown in the attached screenshot, and the system becomes extremely unresponsive. The OOM happens in the evaluation call (`val_mrr, test_mrr = self.evaluator.evaluate(None, test_scores, 0)`). When the evaluation call is omitted, the system saves the node embeddings and relation embeddings and exits the program without any issue.
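Since the failure is in the evaluation call that receives all test scores at once, one likely culprit is materializing the full score matrix before ranking. A minimal sketch of computing MRR in fixed-size chunks instead, so only one chunk of negative scores is processed at a time (`chunked_mrr` is a hypothetical helper for illustration, not GraphStorm's actual evaluator API):

```python
import numpy as np

def chunked_mrr(pos_scores, neg_scores, chunk_size=10_000):
    """Accumulate reciprocal ranks chunk by chunk instead of
    ranking all test edges in one shot.

    pos_scores: shape (N,), score of each true test edge
    neg_scores: shape (N, K), scores of K negative candidates per edge
    """
    total = 0.0
    n = len(pos_scores)
    for start in range(0, n, chunk_size):
        pos = pos_scores[start:start + chunk_size]   # (c,)
        neg = neg_scores[start:start + chunk_size]   # (c, K)
        # rank = 1 + number of negatives scoring at least as high
        ranks = 1 + (neg >= pos[:, None]).sum(axis=1)
        total += (1.0 / ranks).sum()
    return total / n

# Toy example: 3 test edges, 4 negatives each
pos = np.array([0.9, 0.2, 0.5])
neg = np.array([[0.1, 0.3, 0.8, 0.2],
                [0.4, 0.6, 0.1, 0.3],
                [0.5, 0.2, 0.7, 0.4]])
print(chunked_mrr(pos, neg, chunk_size=2))  # ~0.5278
```

The result is independent of `chunk_size`, so the chunk size can be tuned purely for memory headroom.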

Screenshot 2023-04-10 at 11 47 20 AM

Experiment setup:

Dataset: ogbn-papers100M partitioned into 3
Instance: g4dn.metal
Command to run inference:

```shell
python3 -u  ~/dgl/tools/launch.py \
        --workspace /graph-storm/inference_scripts/lp_infer \
        --num_trainers 1 \
        --num_servers 1 \
        --num_samplers 0 \
        --part_config /data/ogbn-papers100M-3p/ogbn-papers100M.json \
        --ip_config  /data/ip_list_p3_metal.txt \
        --ssh_port 2222 \
        "python3 lp_infer_gnn.py --cf  /data/ogbn_papers100M_infer_p3.yaml  --use-node-embeddings false --num-gpus 4 --part-config /data/ogbn-papers100M-3p/ogbn-papers100M.json  --restore-model-path /data/papers100M-lp-p3-model/epoch-0  --feat-name feat --no-validation false"
```

Reproduced with the following environment:

  • DGL 1.0.2 + GSF github/gitlab version
  • DGL 1.0.0 + GSF github/gitlab version

Smaller datasets such as ogbn-mag work fine on a similar setup.

@isratnisa isratnisa self-assigned this Apr 10, 2023
@isratnisa
Contributor Author

@classicsong

@isratnisa isratnisa changed the title OOM while inferencing link prediction on ogbn-papers100M OOM while inferencing on ogbn-papers100M for link prediction Apr 10, 2023
@isratnisa isratnisa added bug Something isn't working v0.1 labels Apr 10, 2023
@isratnisa isratnisa changed the title OOM while inferencing on ogbn-papers100M for link prediction [BugFix] OOM while inferencing on ogbn-papers100M for link prediction Apr 10, 2023
@isratnisa isratnisa changed the title [BugFix] OOM while inferencing on ogbn-papers100M for link prediction [Bug] OOM while inferencing on ogbn-papers100M for link prediction Apr 10, 2023
isratnisa added a commit that referenced this issue Apr 13, 2023
Resolves issue [#77](#77)

Attempting distributed inference on ogbn-papers100M (3 partitions) for link prediction causes an OOM (out-of-memory) issue; this [issue](#77) has the details.

The following screenshot was collected while running inference on 3 g4dn.metal instances. The run hits OOM, and the screenshot shows 94% of memory consumed:
<img width="1324" alt="Screenshot 2023-04-10 at 11 47 20 AM"
src="https://user-images.githubusercontent.com/18449426/231774872-c7aff289-b274-44f6-9603-402411d0a666.png">

With the proposed fix, the OOM issue no longer occurs, and memory consumption stays around 14%:



<img width="1006" alt="Screenshot 2023-04-13 at 9 30 09 AM"
src="https://user-images.githubusercontent.com/18449426/231774802-be5245fc-a38c-46ee-b228-5bef50e6d71b.png">

---------

Co-authored-by: Israt Nisa <nisisrat@amazon.com>