CPU resources, htop shows all 12 cores being used for example_1.py #72

Closed
DanielTakeshi opened this issue Nov 11, 2019 · 11 comments
@DanielTakeshi
Contributor

DanielTakeshi commented Nov 11, 2019

Thanks for the great library. I am running some tests to benchmark performance. Since I hope to run these across many CPUs, I want to understand how many CPUs a given script will consume. I installed the repository as of today and ran:

python examples/example_1.py

I'm running this on a machine with a single GPU, and an i7-8700k CPU with 12 logical cores. I assume I'm not using the GPU in the above command.

In a separate tab, my htop is regularly showing something like this:

[Screenshot: htop showing all 12 cores busy]

All 12 CPUs on my machine are being used.

When I run:

python examples/example_1.py --cuda_idx 0

I don't generally see all 12 CPUs being used, but I may see something like 6 CPUs in use:

[Screenshot: htop showing about 6 cores busy]

Just wondering, the documentation of example_1.py says this:

Runs one instance of the Atari environment and optimizes using DQN algorithm.
Can use a GPU for the agent (applies to both sample and train). No parallelism
employed, so everything happens in one python process; can be easier to debug.

However, I assumed "one Python process = one core". Perhaps this is not the right way to think about it. Is there a way to roughly estimate how many CPUs (or "cores"; I use the terms interchangeably) will be used for a given training run?

@codelast

codelast commented Nov 12, 2019

I ran example_1.py on an Ubuntu machine with no GPU card installed and saw 8 "python examples/example_1.py" processes.
This also confuses me; it seems that example_1 should have only one Python process when running.

@astooke
Owner

astooke commented Nov 18, 2019

Hi! Sorry it took a while to reply.

My first guess is that we're seeing some multithreading from inside PyTorch (such as for BLAS), although I'm not sure why that would be so busy if you're using the GPU. Try putting torch.set_num_threads(1) at the top of example_1.py, before any other torch code runs, and see if that makes it run with only one core busy.
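For reference, a minimal sketch of that change (only the thread cap is new; the rest of example_1.py stays as-is):

import torch

# Cap PyTorch's intra-op thread pool before any tensors are created;
# BLAS/OpenMP worker threads are spawned lazily, so this must run first.
torch.set_num_threads(1)

# ... the rest of example_1.py (rlpyt imports, build_and_train, etc.)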

The Atari env is single-threaded, and example_1.py only runs one Python process, so the rest of the code should really only use 1 core.

@DanielTakeshi
Contributor Author

DanielTakeshi commented Nov 19, 2019

Thanks @astooke

I did another test where I took example_5 and made one change: replacing MinibatchRlEval with MinibatchRl so that there's no evaluation environment. I then ran it with 8 parallel envs, as you can see below. htop regularly shows at least 24 CPUs being used (on a 48-CPU machine), and nothing but my script is running on it.

[Screenshot: htop showing 24+ cores busy on a 48-CPU machine]

Let me now do some testing with your suggestion of changing the num threads ...

Update 1: I was going to do some experimentation but saw @codelast posted some new findings below.

Update 2: actually, I might have gotten confused. In example_5.py, batch_B is hard-coded at 32, so n_parallel does not actually change it. I assumed n_parallel was the number of parallel environments, but I guess it's better viewed as the amount of resources provided to a given program? I will continue investigating, but do you have any quick suggestions/comments on the distinction between n_parallel and batch_B?

Update 3: I figured out my prior question. batch_B is the number of parallel environments, and n_parallel is the number of workers that run those environments. If batch_B=32 and n_parallel=2, then [16,16] envs are allocated to the two workers. If n_parallel=3, it's [11,11,10] envs across the three workers, and we get a warning saying that performance may suffer due to the unequal environment distribution. A standalone sketch of this split is below.
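The sketch mimics the allocation described above (illustrative only, not rlpyt's actual code):

def split_envs(batch_B, n_parallel):
    # Divide batch_B environments across n_parallel workers as evenly
    # as possible; any remainder goes to the earlier workers.
    base, extra = divmod(batch_B, n_parallel)
    return [base + (1 if i < extra else 0) for i in range(n_parallel)]

print(split_envs(32, 2))  # [16, 16]
print(split_envs(32, 3))  # [11, 11, 10] -> uneven split triggers the warning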

@codelast

@astooke I tried putting torch.set_num_threads(1) at the very beginning of the main function in example_1.py. That did reduce the number of processes, to 3 (still not one), and I found that the %CPU usage of the 3 processes varies greatly (via htop): the highest is about 95%, the 2nd is about 20%, and the lowest is only about 0.6%.
To dig into these processes and find out what they are doing, I used py-spy to monitor them. The results:

  • 1st process (highest %CPU): [py-spy output]

  • 2nd process (medium %CPU): [py-spy output]

  • 3rd process (lowest %CPU): [py-spy output]

They seem to be doing similar jobs, but process 1 spends a lot of time in "forward", process 2 spends a lot of time in "backward", and there is no obvious pattern for process 3.
This still confuses me.
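For anyone who wants to reproduce this kind of check, py-spy attaches to a running process by PID (the PID below is a placeholder):

py-spy top --pid 12345    # live, top-like view of where the process spends time
py-spy dump --pid 12345   # one-shot dump of the current Python stacks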

@astooke
Owner

astooke commented Nov 20, 2019

Well, it's interesting to see the separate processes! I don't know what's going on, but it looks possible that this is still something happening inside PyTorch, so within rlpyt you can still develop as if it were a serial program.

@tbwxmu

tbwxmu commented Nov 21, 2019

Hi all,
I think this is a small bug in MinibatchRlBase.startup.
One solution is to add a master_cpus arg to build_and_train when you want to specify the CPUs, roughly as sketched below.
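In code, the idea is roughly this (a sketch only; the affinity keys follow rlpyt's examples, and the specific master_cpus list is a hypothetical choice):

n_parallel = 2
affinity = dict(
    cuda_idx=None,                         # no GPU
    workers_cpus=list(range(n_parallel)),  # sampler workers on CPUs 0..n-1
    master_cpus=[0, 1],                    # hypothetical: pin the master process
)
# ... then pass affinity=affinity through build_and_train to the runner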

I also find that the cuda_idx arg has to be passed whether or not we want to use the GPU. I think we should fix this as well.

@astooke
Owner

astooke commented Dec 11, 2019

I've also noticed that multi-threading in PyTorch sometimes doesn't fully obey the CPU affinity set with psutil. But it seems that using taskset to call the script does enforce affinities, e.g. taskset -c 0,5 python example.py to use CPUs 0 and 5.
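For comparison, the in-process version of that affinity looks like the following (set via psutil as mentioned above; PyTorch's thread pools may still not respect it fully):

import psutil

# Pin the current process to CPUs 0 and 5; the in-Python
# equivalent in intent to `taskset -c 0,5`.
psutil.Process().cpu_affinity([0, 5])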

Good call on not needing the cuda_idx arg; that might just be a quirk of the example. In MinibatchRlBase.startup() it uses affinity.get("cuda_idx", None), which defaults to the CPU if no argument is provided. :)

@DanielTakeshi
Contributor Author

DanielTakeshi commented Feb 18, 2020

Just wondering, @astooke, was there any update on understanding how to control CPU resources? I'm asking because my code has been using more CPUs than specified, which has caused some machines and/or scripts to crash or hang.

@astooke
Owner

astooke commented Feb 19, 2020

I haven't tried PyTorch versions newer than 1.2, but my experience is still that using taskset when launching the individual training run is the surest way to limit the program to the selected CPUs. The built-in experiment launcher does this, for example, based on the CPUs listed in the affinity:

call_list += ["taskset", "-c", cpus] # PyTorch obeys better than just psutil.
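So a launched run effectively ends up as a command like this (a hypothetical expansion; the CPU list and script path are placeholders):

taskset -c 0,1,2,3 python examples/example_1.py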

Hope that helps! And if that doesn't work, please let us know; that would be a surprising problem.

@DanielTakeshi
Contributor Author

I just ran some tests today, and indeed using taskset helped to limit my CPU usage, according to htop.

@DanielTakeshi
Contributor Author

I'll close this for now and then in experiments I will just put taskset -c xxx before everything.
