CPU resources, htop shows all 12 cores being used for example_1.py #72

Closed
DanielTakeshi opened this issue Nov 11, 2019 · 11 comments
@DanielTakeshi
Contributor

DanielTakeshi commented Nov 11, 2019

Thanks for the great library. I am running some tests to benchmark performance. Since I hope to run these across many CPUs, I want to understand how many CPUs a given script will consume. I installed the repository as of today and ran:

python examples/example_1.py

I'm running this on a machine with a single GPU, and an i7-8700k CPU with 12 logical cores. I assume I'm not using the GPU in the above command.

In a separate tab, my htop is regularly showing something like this:

[Screenshot: htop showing all 12 cores busy]

All 12 CPUs on my machine are being used.

When I run:

python examples/example_1.py --cuda_idx 0

I don't generally see all 12 CPUs being used, but I may see something like 6 CPUs in use:

[Screenshot: htop showing about 6 cores busy]

Just wondering, the documentation of example_1.py says this:

Runs one instance of the Atari environment and optimizes using DQN algorithm.
Can use a GPU for the agent (applies to both sample and train). No parallelism
employed, so everything happens in one python process; can be easier to debug.

However, I assumed "one Python process = one core". Perhaps this is not the right way to think about it. Is there a way to roughly estimate how many CPUs (or "cores"; I use the terms interchangeably) will be used for a given training run?

@codelast

codelast commented Nov 12, 2019

I ran example_1.py on an Ubuntu machine with no GPU card installed and saw 8 "python examples/example_1.py" processes.
This also confuses me; it seems that example_1 should have only one Python process when running.

@astooke
Owner

astooke commented Nov 18, 2019

Hi! Sorry it took a while to reply.

My first guess is that we're seeing some multithreading from inside PyTorch (such as for BLAS), although I'm not sure why that would be so busy if you're using the GPU. Try putting torch.set_num_threads(1) at the top of example_1.py, before any other torch code runs, and see if that makes it run with only one core busy.
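For reference, a minimal sketch of that change (only the thread cap is new; the rest of example_1.py stays as-is):

import torch

# Cap PyTorch's intra-op thread pool before any tensors are created;
# BLAS/OpenMP worker threads are spawned lazily, so this must run first.
torch.set_num_threads(1)

# ... the rest of example_1.py (rlpyt imports, build_and_train, etc.)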

The Atari env is single-threaded, and example_1.py only runs one Python process, so the rest of the code should really only use 1 core.

@DanielTakeshi
Contributor Author

DanielTakeshi commented Nov 19, 2019

Thanks @astooke

I did another test where I took example_5 and made one change: replacing MinibatchRlEval with MinibatchRl so that there's no evaluation environment. I then ran it with 8 parallel envs, as you can see below. htop regularly shows at least 24 CPUs being used (on a 48-CPU machine), and nothing but my script is running on it.

[Screenshot: htop showing 24+ cores busy on a 48-CPU machine]

Let me now do some testing with your suggestion of changing the num threads ...

Update 1: I was going to do some experimentation but saw @codelast posted some new findings below.

Update 2: actually, I might have gotten confused. In example_5.py, batch_B is hard-coded at 32, so n_parallel does not actually change it. I assumed n_parallel was the number of parallel environments, but I guess it's better viewed as the amount of resources provided to a given program? I will continue investigating, but do you have any quick suggestions/comments on the distinction between n_parallel and batch_B?

Update 3: I figured out my prior question. batch_B is the number of parallel environments, and n_parallel is the number of workers that run those environments. If batch_B=32 and n_parallel=2, then [16,16] envs are allocated to the two workers. If n_parallel=3, it's [11,11,10] envs across the three workers, and we get a warning saying that performance may suffer due to the unequal environment distribution. A standalone sketch of this split is below.
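The sketch mimics the allocation described above (illustrative only, not rlpyt's actual code):

def split_envs(batch_B, n_parallel):
    # Divide batch_B environments across n_parallel workers as evenly
    # as possible; any remainder goes to the earlier workers.
    base, extra = divmod(batch_B, n_parallel)
    return [base + (1 if i < extra else 0) for i in range(n_parallel)]

print(split_envs(32, 2))  # [16, 16]
print(split_envs(32, 3))  # [11, 11, 10] -> uneven split triggers the warning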

@codelast

@astooke I tried putting torch.set_num_threads(1) at the very beginning of the main function in example_1.py. That did reduce the number of processes, to 3 (still not one), and I found that the %CPU usage of the 3 processes varies greatly (via htop): the highest is about 95%, the 2nd is about 20%, and the lowest is only about 0.6%.
To dig into these processes and find out what they are doing, I used py-spy to monitor them. The results:

  • 1st process (highest %CPU): [py-spy output]

  • 2nd process (medium %CPU): [py-spy output]

  • 3rd process (lowest %CPU): [py-spy output]

They seem to be doing similar jobs, but process 1 spends a lot of time in "forward", process 2 spends a lot of time in "backward", and there is no obvious pattern for process 3.
This still confuses me.
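For anyone who wants to reproduce this kind of check, py-spy attaches to a running process by PID (the PID below is a placeholder):

py-spy top --pid 12345    # live, top-like view of where the process spends time
py-spy dump --pid 12345   # one-shot dump of the current Python stacks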

@astooke
Owner

astooke commented Nov 20, 2019

Well, it's interesting to see the separate processes! I don't know what's going on, but it looks possible that this is still something happening inside PyTorch, so within rlpyt you can still develop as if it were a serial program.

@tbwxmu

tbwxmu commented Nov 21, 2019

Hi all,
I think this is a small bug in MinibatchRlBase.startup.
One solution is to add a master_cpus arg to build_and_train when you want to specify the CPUs, roughly as sketched below.
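In code, the idea is roughly this (a sketch only; the affinity keys follow rlpyt's examples, and the specific master_cpus list is a hypothetical choice):

n_parallel = 2
affinity = dict(
    cuda_idx=None,                         # no GPU
    workers_cpus=list(range(n_parallel)),  # sampler workers on CPUs 0..n-1
    master_cpus=[0, 1],                    # hypothetical: pin the master process
)
# ... then pass affinity=affinity through build_and_train to the runner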

I also find that the cuda_idx arg has to be passed whether or not we want to use the GPU. I think we should fix this as well.

@astooke
Owner

astooke commented Dec 11, 2019

I've also noticed that multi-threading in PyTorch sometimes doesn't fully obey the CPU affinity set with psutil. But it seems that using taskset to call the script does enforce affinities, e.g. taskset -c 0,5 python example.py to use CPUs 0 and 5.
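For comparison, the in-process version of that affinity looks like the following (set via psutil as mentioned above; PyTorch's thread pools may still not respect it fully):

import psutil

# Pin the current process to CPUs 0 and 5; the in-Python
# equivalent in intent to `taskset -c 0,5`.
psutil.Process().cpu_affinity([0, 5])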

Good call on not needing the cuda_idx arg; that might just be a quirk of the example. In MinibatchRlBase.startup() it uses affinity.get("cuda_idx", None), which defaults to the CPU if no argument is provided. :)

@DanielTakeshi
Contributor Author

DanielTakeshi commented Feb 18, 2020

Just wondering, @astooke, was there any update on understanding how to control CPU resources? I'm asking because my code has been using more CPUs than specified, which has caused some machines and/or scripts to crash or hang.

@astooke
Owner

astooke commented Feb 19, 2020

I haven't tried PyTorch versions newer than 1.2, but my experience is still that using taskset when launching the individual training run is the surest way to limit the program to the selected CPUs. The built-in experiment launcher does this, for example, based on the CPUs listed in the affinity:

call_list += ["taskset", "-c", cpus] # PyTorch obeys better than just psutil.
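So a launched run effectively ends up as a command like this (a hypothetical expansion; the CPU list and script path are placeholders):

taskset -c 0,1,2,3 python examples/example_1.py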

Hope that helps! And if that doesn't work, please let us know; that would be a surprising problem.

@DanielTakeshi
Contributor Author

I just ran some tests today, and indeed using taskset helped to limit my CPU usage, according to htop.

@DanielTakeshi
Contributor Author

I'll close this for now and then in experiments I will just put taskset -c xxx before everything.
