This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Low CPU usage of MXNet in subprocesses #13593

Open
YutingZhang opened this issue Dec 9, 2018 · 18 comments

Comments

@YutingZhang
Contributor

YutingZhang commented Dec 9, 2018

MXNet shows low CPU usage when CPU operations run in multiprocess scenarios. Specifically, when MXNet computation runs in a subprocess, MXNet uses only 1 or 2 CPUs. The behavior differs across MXNet variants (see below) and across machines ...

This issue is critical because it slows down the multiprocess object-detection data-loading in gluoncv very significantly, making Faster-RCNN training in gluoncv unusable.

This was tested on the 20181207 version; other versions (e.g., 1.3.1) show similar problems.

Code to reproduce the issue

Filename: mxnet_cpu_test.py

import argparse
import sys
from concurrent import futures
import time
import numpy as np
mx=None


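def run(need_import):
    # Declare the global first: on Python 3.6+, a `global` statement placed
    # after the name is bound in the same function is a SyntaxError.
    global mx
    if need_import:
        import mxnet as mx  # late import, performed inside the worker
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)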

def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()

def main(args):

    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)

        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
       import mxnet as mx
    main(args)

Detailed experiments:

  • Run in the main process:
    python3 mxnet_cpu_test.py --num-workers=0
    image
    Working fine for all mxnet variants (GPU or CPU-only).

  • Run in two subprocesses:
    -- mxnet-cu90 on p3.16x:
    python3 mxnet_cpu_test.py --num-workers=2
    image
    It uses only 2 CPUs per subprocess.
    -- mxnet-mkl on p3.16x:
    python3 mxnet_cpu_test.py --num-workers=2
    image
    Same here. It uses only 2 CPUs per subprocess.
    -- mxnet-mkl on CPU-only machine c5.18x:
    python3 mxnet_cpu_test.py --num-workers=2
    image
    Even worse. It uses only 1.5 CPUs per subprocess.
    -- However, for vanilla CPU-version mxnet on c5.18x:
    python3 mxnet_cpu_test.py --num-workers=2
    image
    It is working better. At least, it uses 5 CPUs per subprocess.
    -- Weirdly, still vanilla CPU-version mxnet but on GPU machine p3.16x:
    python3 mxnet_cpu_test.py --num-workers=2
    image
    It works worse here, i.e., only 2 CPUs per subprocess.

  • This problem seems related to how MXNet manages threads in each subprocess. If I do not import mxnet in the main process and instead import mxnet in each subprocess:
    python3 mxnet_cpu_test.py --num-workers=2 --late-import
    image
    Then everything is working fine.

@YutingZhang YutingZhang changed the title from "Low CPU usage of MXNet" to "Low CPU usage of MXNet in subprocesses" on Dec 9, 2018
@pengzhao-intel
Contributor

@TaoLv to help look at this issue

@lanking520
Member

@YutingZhang Thanks for reporting this issue! @anirudh2290 @apeforest @azai91 @samskalicky please take a look.

@TaoLv
Member

TaoLv commented Dec 10, 2018

Hi @YutingZhang, please try:

  1. Set OMP_NUM_THREADS manually. For this test case, I tried OMP_NUM_THREADS=#core/#worker;
  2. Remove the two SetEnv calls from https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L61-L62.

Please let me know if it works for you. Thanks.
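
The first suggestion can be sketched in Python as below. This is only an illustration, not part of MXNet: the helper names `threads_per_worker` and `set_omp_threads` are hypothetical, and `os.cpu_count()` reports logical CPUs, which may be more than the physical core count meant by #core. The variable is set before the numerical library is imported because OpenMP runtimes typically read it at initialization; whether a given MXNet build honors it inside forked workers is the subject of the rest of this thread.

```python
import os


def threads_per_worker(num_workers, total_cores=None):
    """Split the available cores evenly across workers (at least 1 each)."""
    if total_cores is None:
        # Logical CPU count; an assumption standing in for "#core".
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // max(1, num_workers))


def set_omp_threads(num_workers, total_cores=None):
    """Export OMP_NUM_THREADS = #core / #worker for this process and its children."""
    n = threads_per_worker(num_workers, total_cores)
    # Call this before importing the numerical library.
    os.environ["OMP_NUM_THREADS"] = str(n)
    return n
```

For example, `set_omp_threads(2, total_cores=72)` would export `OMP_NUM_THREADS=36` before `import mxnet` is executed.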

@samskalicky
Contributor

Related issue: #12255

@zhreshold
Member

The limitation of 1 thread per worker is deliberately set to avoid thread contention.

Per offline discussion, I think a good solution is to use an environment variable to control how many threads each worker can use (the limit defaults to 1 now).

@anirudh2290
Member

@zhreshold this would also require a rebuild with a modified initialize.cc; otherwise the environment variable would get overwritten.

@zhreshold
Member

@anirudh2290 Yes, I mean a PR is required to address this issue.

@YutingZhang
Contributor Author

Thanks everyone for discussing and solving the issue!

@YutingZhang
Contributor Author

@zhreshold I tried the latest version of mxnet with export MXNET_MP_WORKER_NTHREADS=20. However, the example code I posted still shows the same CPU usage. Any ideas?

@YutingZhang YutingZhang reopened this Dec 19, 2018
@zhreshold
Member

@YutingZhang MXNET_MP_WORKER_NTHREADS only controls how many MXNet operators run in parallel; for some transformations, it may not be able to parallelize as many ops as possible. Due to an OpenMP bug, OpenMP is disabled in the worker, so unfortunately this is the expected behavior.

You might want to enable OpenCV multithreading in each worker, since the OpenCV calls may be the most time-consuming part of the worker process.
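
A minimal sketch of that suggestion, assuming OpenCV's Python bindings (`cv2`) are installed in the worker environment. `cv2.setNumThreads` and `cv2.getNumThreads` are OpenCV calls; the wrapper name `set_opencv_threads` is hypothetical, and whether this actually helps depends on which OpenCV functions dominate the worker's time.

```python
def set_opencv_threads(num_threads):
    """Ask OpenCV to use `num_threads` threads in the current process.

    Returns the thread count OpenCV reports afterwards, or None if cv2
    is not installed. Passing 0 makes OpenCV run sequentially.
    """
    try:
        import cv2  # imported here so each worker configures its own copy
    except ImportError:
        return None
    cv2.setNumThreads(num_threads)
    # The reported value depends on the parallel backend OpenCV was built with.
    return cv2.getNumThreads()
```

One option is to call `set_opencv_threads(4)` at the top of the function submitted to each worker, so the setting is applied inside the subprocess rather than only in the parent.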

@YutingZhang
Contributor Author

YutingZhang commented Jan 2, 2019

@pengzhao-intel @TaoLv @anirudh2290 @zhreshold Thank you all for your help, and happy new year! This problem seems more complicated than it first looked (it may have been multiple problems from the beginning). @zhreshold's fix solved the problem in most cases.
However, I found that calling asnumpy in each worker causes the processes to interfere with each other. This does not seem to be a problem for the GPU version of MXNet running on a GPU machine; it seems to happen only on a CPU-only machine (I tested on c5.18xlarge with mxnet-mkl).

Code (one-line difference):

import argparse
import sys
from concurrent import futures
import time
import numpy as np
mx=None


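def run(need_import):
    # Declare the global first: on Python 3.6+, a `global` statement placed
    # after the name is bound in the same function is a SyntaxError.
    global mx
    if need_import:
        import mxnet as mx  # late import, performed inside the worker
    A = mx.nd.random.uniform(low=0, high=1, shape=(5000, 5000))
    while True:
        A = mx.nd.dot(A, A)
        A.asnumpy()    # ******** only difference ***********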

def parse_args():
    parser = argparse.ArgumentParser("benchmark mxnet cpu")
    parser.add_argument('--num-workers', '-j', dest='num_workers', type=int, default=0)
    parser.add_argument('--late-import', action='store_true')
    return parser.parse_args()

def main(args):

    if args.num_workers == 0:
        print("Main process")
        try:
            run(need_import=args.late_import)
        except KeyboardInterrupt:
            pass
    else:
        print("Subprocesses")
        ex = futures.ProcessPoolExecutor(args.num_workers)

        for _ in range(args.num_workers):
            ex.submit(run, need_import=args.late_import)
        while True:
            try:
                time.sleep(10000)
            except KeyboardInterrupt:
                ex.shutdown(wait=False)
                break
    print("Stopped")


if __name__ == "__main__":
    args = parse_args()
    if not args.late_import:
       import mxnet as mx
    main(args)

Launch 10 workers (python3 mxnet_cpu_test.py --num-workers=10). MXNET_MP_WORKER_NTHREADS does not affect the results.
image

But running it only in the main process is fine:
image
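
A rough way to put a number on per-process CPU usage without a system monitor is to compare CPU time with wall-clock time. The sketch below uses only the Python standard library; the helper name `cpu_utilization` is hypothetical. A ratio near 1.0 means about one core was busy, and a ratio near 4.0 means about four. Because MXNet operators are asynchronous, the measured function should block on its result (for example by calling asnumpy, as in the script above), or the measurement will end before the work does.

```python
import time


def cpu_utilization(fn, *args, **kwargs):
    """Run fn and return (result, cores_used).

    cores_used is process CPU time divided by wall-clock time; the CPU
    time covers all threads of the current process.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    cores_used = cpu_used / wall_used if wall_used > 0 else 0.0
    return result, cores_used
```

Calling this inside each worker around one iteration of the loop, and printing the ratio together with `os.getpid()`, would give per-subprocess figures comparable to those quoted in this thread.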

By the way, I found another issue with mxnet (CPU non-MKL version): when MXNet runs in a subprocess, it interferes with many non-mxnet functions (e.g., cv2.cvtColor), and the subprocess gets stuck in those functions. This did not happen with mxnet==1.3.1; it started happening in one of the nightly builds. We should probably create a new ticket for this.

@pengzhao-intel
Contributor

@YutingZhang thanks for the case, we will look into the issue.

@ZhennanQin
Contributor

ZhennanQin commented Jan 8, 2019

@YutingZhang If you just want to utilize 100% CPU for each process, please try export KMP_AFFINITY=granularity=fine,noduplicates; it works in my environment.

If you want to enable OpenMP multi-threading to utilize >100% CPU for each process, you need to make the change below to MXNet:
ZhennanQin@48fe761

Then you can use export OMP_NUM_THREADS=4 to specify 4x cpu usage for each process.

If you don't want to change MXNet and just want to increase the efficiency of the MKL dot, you can try export MKL_NUM_THREADS=4. This only works with the MKL library.
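
The same settings can also be applied from a launcher script instead of the shell. This is a sketch using only the standard library; the helper name `worker_env` and its parameters are hypothetical, and the variable names and values are the ones quoted in this comment. Whether each variable takes effect still depends on the MXNet build, as described above.

```python
import os
import subprocess
import sys


def worker_env(omp_threads=None, mkl_threads=None, kmp_affinity=None, base=None):
    """Return a copy of the environment with the threading variables set."""
    env = dict(os.environ if base is None else base)
    if omp_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_threads)
    if mkl_threads is not None:
        env["MKL_NUM_THREADS"] = str(mkl_threads)
    if kmp_affinity is not None:
        env["KMP_AFFINITY"] = kmp_affinity
    return env


def launch(cmd, **settings):
    """Run cmd in a child process that sees the adjusted environment."""
    return subprocess.run(cmd, env=worker_env(**settings), check=True)
```

For instance, `launch([sys.executable, "mxnet_cpu_test.py", "--num-workers=2"], mkl_threads=4)` corresponds to running the test script after export MKL_NUM_THREADS=4, without changing the parent shell's environment.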

@pengzhao-intel
Contributor

@zhreshold do you know the background on why the thread number is fixed to 1 in the worker process, as shown in the commit below?
ZhennanQin/incubator-mxnet@48fe761

@pengzhao-intel
Contributor

Got some info from @YutingZhang: #13449, #12380. Thanks a lot.

@pengzhao-intel
Contributor

@anirudh2290

@zhreshold
Member

@pengzhao-intel The thread limit is set to 1 according to this comment: #13606 (comment)

If you have a better understanding of the problem, please let me know.

@zhreshold
Member

@YutingZhang
I just tested the master version; the environment variable OMP_NUM_THREADS can now effectively control the number of OMP threads each worker is allowed to use.

For example, OMP_NUM_THREADS=32 python3 mxnet_cpu_test.py --num-workers=2 gives

image
