unexpected performance when using COSMA with GPU (single node) #94

Closed
rohany opened this issue Jun 21, 2021 · 5 comments

rohany commented Jun 21, 2021

I'm testing COSMA with GPU support on a single node with GPUs, and I'm not seeing the performance I would expect.

1 GPU:
COSMA TIMES [ms] = 1562 1657 2133 2390 6865
2 GPU:
COSMA TIMES [ms] = 1544 2710 3374 3626 6060
4 GPU:
COSMA TIMES [ms] = 805 832 1456 3142 6419

I expected to see:

  • Some difference in runtime going from 1 -> 2 GPUs
  • Somewhat stable performance? The difference between the min and max is quite high.
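Both points can be put into numbers from the timings above. The following is a minimal sketch that only uses the values reported in this comment; the choice of min, median, and max/min ratio as summary statistics is an assumption made for illustration, not something the miniapp reports.

```python
from statistics import median

# Times in ms copied from the three runs reported above.
times_ms = {
    1: [1562, 1657, 2133, 2390, 6865],
    2: [1544, 2710, 3374, 3626, 6060],
    4: [805, 832, 1456, 3142, 6419],
}

def summarize(samples):
    """Return (min, median, max/min ratio) for one list of timings."""
    return min(samples), median(samples), max(samples) / min(samples)

summary = {gpus: summarize(samples) for gpus, samples in times_ms.items()}

for gpus, (best, mid, spread) in summary.items():
    print(f"{gpus} GPU: min={best} ms, median={mid} ms, max/min={spread:.2f}")
```

With these numbers, the best time barely changes from 1 to 2 GPUs (1562 ms vs 1544 ms), and the slowest repetition is roughly 4 to 8 times slower than the fastest one in each configuration.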

I'm on the current master and am running the miniapp with the command below (jsrun's -n and -r set how many ranks run on a node):

OMP_NUM_THREADS=6 COSMA_OVERLAP_COMM_AND_COMP=ON jsrun -n 4 -c 6 -g 1 -b none -r 4 ./miniapp/cosma_miniapp -m 16384 -n 16384 -k 16384 -r 5

I build COSMA with:

cmake -DCOSMA_BLAS=CUDA -DCMAKE_INSTALL_PREFIX=../ ..

kabicm (Collaborator) commented Jun 21, 2021

Great that it works now!

Can you also check without overlapping communication and computation? And can you try some larger matrix sizes, e.g. 32k or so? Basically, the 16k case can be run on a rank with a single GPU.

I will check this test case on our system and then we will see.
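To get a feel for how small the 16k case is compared to 32k, the matrix footprint and operation count can be estimated directly. This is a rough sketch that assumes double-precision elements (8 bytes each), which the thread does not state, and counts only the three matrices themselves, not any communication or GPU tile buffers. The helper names are hypothetical.

```python
def gemm_footprint_gib(m, n, k, bytes_per_element=8):
    """Memory for A (m x k), B (k x n) and C (m x n), in GiB.

    bytes_per_element=8 assumes double precision (an assumption).
    """
    elements = m * k + k * n + m * n
    return elements * bytes_per_element / 2**30

def gemm_flops(m, n, k):
    """Operation count for C = A * B, counted as 2*m*n*k."""
    return 2 * m * n * k

for size in (16384, 32768):
    gib = gemm_footprint_gib(size, size, size)
    flops = gemm_flops(size, size, size)
    print(f"{size}: {gib:.0f} GiB for A, B, C; {flops:.3e} operations")
```

Under these assumptions, doubling the dimension from 16k to 32k quadruples the memory (6 GiB to 24 GiB) and increases the work eightfold, so there is far more computation available to spread across several GPUs.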

rohany (Author) commented Jun 21, 2021

I see slightly better performance without overlap:
1 GPU:
COSMA TIMES [ms] = 1996 2453 3166 4126 4613
2 GPU:
COSMA TIMES [ms] = 1934 2348 2505 3602 5486
4 GPU:
COSMA TIMES [ms] = 1041 1370 1530 1584 1905

At matrix size 30000 without overlap I see:
1 GPU:
COSMA TIMES [ms] = 8987 9811 9840 13445 13833
2 GPU:
COSMA TIMES [ms] = 7181 7182 7227 7839 9772
4 GPU:
COSMA TIMES [ms] = 4282 4327 6609 8345 23970
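For the size-30000 runs, the strong-scaling speedup and parallel efficiency can be computed from the fastest repetition in each configuration. This is a sketch only: using the minimum time as the representative value is an assumption made here, and the function name is hypothetical.

```python
# Fastest repetition (ms) per GPU count, size 30000, no overlap (from above).
best_ms = {1: 8987, 2: 7181, 4: 4282}

def strong_scaling(best_times):
    """Map GPU count -> (speedup over 1 GPU, parallel efficiency)."""
    base = best_times[1]
    return {p: (base / t, base / t / p) for p, t in best_times.items()}

scaling = strong_scaling(best_ms)

for gpus, (speedup, efficiency) in scaling.items():
    print(f"{gpus} GPU: speedup={speedup:.2f}, efficiency={efficiency:.0%}")
```

By this measure, 4 GPUs give a speedup of about 2.1 over 1 GPU, i.e. a little over 50% parallel efficiency at a fixed problem size.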

rohany (Author) commented Sep 28, 2021

I'm going to close this since things are working as expected for me now.

rohany closed this as completed Sep 28, 2021

kabicm (Collaborator) commented Sep 28, 2021

Hi Rohan,

What was the main problem? Was this also related to the limited memory that we discussed yesterday?

Also, did you check adding just -s "sm2" or just -s "sm2,sn2" instead of splitting all three dimensions beforehand?

Thanks for your feedback!

rohany (Author) commented Sep 28, 2021

The main problem at the time, IIRC, was that I was strong scaling instead of weak scaling, and doing so on relatively small problem sizes.
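For reference, a weak-scaling experiment grows the problem with the number of GPUs instead of keeping it fixed. The sketch below shows one way to pick square matrix sizes; the function name and both scaling rules are assumptions for illustration, not the sizes actually used in this thread.

```python
def weak_scaling_size(base_size, gpus, exponent=1 / 3):
    """Square matrix dimension for a weak-scaling run on `gpus` GPUs.

    exponent=1/3 keeps the work per GPU fixed (work grows as n^3);
    exponent=1/2 keeps the memory per GPU fixed (storage grows as n^2).
    Both rules are illustrative assumptions.
    """
    return round(base_size * gpus ** exponent)

for gpus in (1, 2, 4, 8):
    by_work = weak_scaling_size(16384, gpus)
    by_memory = weak_scaling_size(16384, gpus, exponent=1 / 2)
    print(f"{gpus} GPU: n={by_work} (fixed work), n={by_memory} (fixed memory)")
```

Starting from 16384 on one GPU, the fixed-work rule reaches 32768 at 8 GPUs, while the fixed-memory rule reaches it at 4 GPUs.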
