significant slow-down of tensorflow on non-AVX machine(s) #33442

Open · 2 tasks
slava77 opened this issue Apr 15, 2021 · 31 comments

@slava77
Contributor

slava77 commented Apr 15, 2021

Originally from https://mattermost.web.cern.ch/cms-o-and-c/pl/zrtbufg8zbb9jgspeuxef183rc

I learned that TF inference is much slower on an older AMD compared to Intel.

Intel Broadwell: https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/133

AMD Opteron 6128 https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/29

Both are running the same inputs in a slightly older release where I had input data and where igprof was still working fine.

One example call to mkldnn_sgemm shows a very large difference between the two cases, about a factor of 1000 less on Intel (look at % total):
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/2651

vs
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/30

[From @makortel ] Some slowdown was observed e.g. in https://mathematica.stackexchange.com/questions/64645/mkl-on-intel-vs-amd

I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp
Here mkldnn_sgemm calls extended_sgemm, which in turn makes a choice between gemm_driver [igprof cost 0.02%] and ref_gemm<float> [igprof cost 30%].

If that's correct, then my analysis is that mkldnn_sgemm is common to both cases and it's really just the implementation selected behind it (based on the SSE4.1 flag) that differs.
Then the difference in speed is a factor of close to 1000, which does not look reasonable. A better understanding of what we actually compile here would help to confirm. (It may be straightforward to modify the code and confirm more directly that ref_gemm is really that slow; see the sketch below.)
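
As a rough standalone way to gauge that last point (this is not the actual oneDNN code, and the 512^3 size below is made up for illustration), one can time a plain scalar triple-loop SGEMM, which is roughly what a reference gemm amounts to without blocking or vectorization, and compare the measured GFLOP/s with what the fast node reaches:

// Minimal standalone sketch (not the oneDNN ref_gemm): time a plain scalar
// triple-loop SGEMM, roughly what a reference gemm does without blocking or
// vectorization.  Build with e.g.: g++ -O2 -o naive_sgemm naive_sgemm.cpp
#include <chrono>
#include <cstdio>
#include <vector>

static void naive_sgemm(int M, int N, int K, float alpha, const float* A,
                        const float* B, float beta, float* C) {
    // C (MxN) = alpha * A (MxK) * B (KxN) + beta * C, row-major, no tiling
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
}

int main() {
    const int M = 512, N = 512, K = 512;  // made-up size, not what our models use
    std::vector<float> A(M * K, 1.f), B(K * N, 1.f), C(M * N, 0.f);
    const auto t0 = std::chrono::steady_clock::now();
    naive_sgemm(M, N, K, 1.f, A.data(), B.data(), 0.f, C.data());
    const auto t1 = std::chrono::steady_clock::now();
    const double dt = std::chrono::duration<double>(t1 - t0).count();
    std::printf("naive sgemm: %.3f s, %.2f GFLOP/s\n", dt, 2.0 * M * N * K / dt / 1e9);
}

Naively I would expect a scalar loop like this to lose more like a factor of 10-100 to an AVX-jitted kernel, so measuring it would help tell whether ref_gemm alone explains the factor of ~1000.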

Goals towards resolving the issue:

  • understand which oneDNN is used to compile our tensorflow
  • see if there is a faster solution for the older arch (pre-SSE4.1?)
@cmsbuild
Contributor

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core, reconstruction

@cmsbuild
Contributor

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

  • understand which oneDNN is used to compile our tensorflow

@smuzaffar @mrodozov can you comment?

@makortel
Contributor

FYI @riga @mialiu149

@slava77 slava77 changed the title significant slow-down of tensorflow on AMD (pre-SSE4.1) significant slow-down of tensorflow on AMD (non-AVX) Apr 15, 2021
@slava77
Contributor Author

slava77 commented Apr 15, 2021

I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp

In the head of 11_3_X we are using TF 2.4.1 and, based on https://github.com/cms-externals/tensorflow/blob/cms/v2.4.1/tensorflow/workspace.bzl, it has mkl_dnn pinned to v0.21.3.

So, to correct the initial assumption about SSE4.1: the dispatch is actually on AVX,
https://github.com/oneapi-src/oneDNN/blob/v0.21.3/src/cpu/gemm/gemm.cpp#L123-L136
and the logic there is

    if (mayiuse(avx512_mic)) {
        return jit_avx512_common_gemm_f32(transa, transb,
...
    } else if (mayiuse(avx)) {
...
        return gemm_driver(transa, transb, bias ? "C" : NULL, M, N, K, alpha,
...
    } else {
        return ref_gemm<float>(transa, transb,
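
Independently of oneDNN, a quick way to check which of these branches a given node would take is to query the CPU feature flags directly. Here is a minimal sketch using GCC's __builtin_cpu_supports (this is not oneDNN's own mayiuse(); the avx512_mic case actually covers several AVX-512 subsets, so avx512f is only an approximation here):

// Standalone CPU-feature check (assumes GCC/clang with __builtin_cpu_supports);
// not oneDNN's mayiuse(), just the raw ISA flags that drive the dispatch above:
// no AVX means falling through to ref_gemm<float>.
#include <cstdio>

int main() {
    __builtin_cpu_init();  // safe to call explicitly before the feature queries
    std::printf("sse4.1 : %d\n", __builtin_cpu_supports("sse4.1") ? 1 : 0);
    std::printf("avx    : %d\n", __builtin_cpu_supports("avx") ? 1 : 0);
    std::printf("avx2   : %d\n", __builtin_cpu_supports("avx2") ? 1 : 0);
    std::printf("avx512f: %d\n", __builtin_cpu_supports("avx512f") ? 1 : 0);
}

On the Opteron 6128 all four should come out 0 (it has no SSE4.1 and no AVX, if I recall correctly), consistent with it ending up in the ref_gemm branch.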

@mrodozov
Contributor

Right, so TF is not using oneDNN from our direct deps, since we don't have it anywhere else.

A better understanding of what we actually compile here would help to confirm.

Let me find the latest TF build logs to check.

@mrodozov
Contributor

Let the bot build it and we can check what Bazel is doing.

@slava77
Contributor Author

slava77 commented Apr 15, 2021

Let the bot build it and we can check what Bazel is doing.

BTW, do we have a debug build for our externals and CMSSW?

@mrodozov
Contributor

mrodozov commented Apr 15, 2021

We have 'a' debug build:
scram list DBG -> CMSSW_11_3_DBG_X_2021-04-08-2300
I'm not sure what is in the DBG build, but I'll assume only CMSSW was built with debug flags.
A debug build for all externals would require an additional cmsdist branch with debug flags for all externals in every spec.
From time to time we change the flags to debug for ROOT when there is something to be fixed.
Although I don't remember if Shazhad has invented something to put debug flags for all externals ... somehow

@makortel
Contributor

I'm not sure what is in the DBG build, but I'll assume only CMSSW was built with debug flags.

The DBG build is -g -O0 -DEDM_ML_DEBUG (mostly to help ensure everything would build with those options).

@mrodozov
Contributor

After reading this:
https://docs.bazel.build/versions/master/external.html#external-packages
it says the external packages downloaded by Bazel end up under bazel-tensorflow-2.4.1/external/, so mkl_dnn is in bazel-tensorflow-2.4.1/external/mkl_dnn,
and inside there is the BUILD.bazel, which is this:


exports_files(["LICENSE"])

load(
    "@org_tensorflow//third_party:common.bzl",
    "template_rule",
)

config_setting(
    name = "clang_linux_x86_64",
    values = {
        "cpu": "k8",
        "define": "using_clang=true",
    },
)

template_rule(
    name = "mkldnn_config_h",
    src = "include/mkldnn_config.h.in",
    out = "include/mkldnn_config.h",
    substitutions = {
        "#cmakedefine MKLDNN_CPU_BACKEND MKLDNN_BACKEND_${MKLDNN_CPU_BACKEND}": "#define MKLDNN_CPU_BACKEND MKLDNN_BACKEND_NATIVE",
        "#cmakedefine MKLDNN_GPU_BACKEND MKLDNN_BACKEND_${MKLDNN_GPU_BACKEND}": "#define MKLDNN_GPU_BACKEND MKLDNN_BACKEND_NONE",
    },
)

# Create the file mkldnn_version.h with MKL-DNN version numbers.
# Currently, the version numbers are hard coded here. If MKL-DNN is upgraded then
# the version numbers have to be updated manually. The version numbers can be
# obtained from the PROJECT_VERSION settings in CMakeLists.txt. The variable is
# set to "version_major.version_minor.version_patch". The git hash version can
# be set to NA.
# TODO(agramesh1) Automatically get the version numbers from CMakeLists.txt.
# TODO(bhavanis): MKL-DNN minor version needs to be updated for MKL-DNN v1.x.
# The current version numbers will work only if MKL-DNN v0.21 is used.

template_rule(
    name = "mkldnn_version_h",
    src = "include/mkldnn_version.h.in",
    out = "include/mkldnn_version.h",
    substitutions = {
        "@MKLDNN_VERSION_MAJOR@": "0",
        "@MKLDNN_VERSION_MINOR@": "21",
        "@MKLDNN_VERSION_PATCH@": "3",
        "@MKLDNN_VERSION_HASH@": "N/A",
    },
)

cc_library(
    name = "mkldnn_single_threaded",
    srcs = glob([
        "src/common/*.cpp",
        "src/common/*.hpp",
        "src/cpu/*.cpp",
        "src/cpu/*.hpp",
        "src/cpu/**/*.cpp",
        "src/cpu/**/*.hpp",
        "src/cpu/xbyak/*.h",
    ]) + [":mkldnn_version_h"],
    hdrs = glob(["include/*"]),
    copts = [
        "-fexceptions",
        "-DMKLDNN_THR=MKLDNN_THR_SEQ",  # Disables threading.
    ],
    includes = [
        "include",
        "src",
        "src/common",
        "src/cpu",
        "src/cpu/gemm",
        "src/cpu/xbyak",
    ],
    visibility = ["//visibility:public"],
)

Also, before that I checked the cache in the build directory and only one of the two tar files' hashes was there:
again the 0.21.3 version.
I'm also building TF without the 0.23.1 version to confirm the BUILD file will change.
What we can also do is build an IB with the old mkl_dnn version and rerun the tests.
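
If we want a cross-check from the built library rather than from the build logs, something along these lines might work. This is only a sketch under two assumptions I have not verified: that the v0.21.x C API exposes mkldnn_version() (which is what the generated mkldnn_version.h above appears to feed), and that the symbol is actually reachable from our TF shared libraries (it may be hidden by TF's linker version script, in which case one would link against a standalone build of the same tarball instead):

// Sketch only: print the mkl-dnn version compiled into the library.
// Assumes the mkldnn_version() C API of mkl-dnn v0.21.x and that the symbol
// is visible at link time -- both assumptions to be checked against our
// actual TF build (otherwise link against a standalone mkl-dnn, -lmkldnn).
#include <cstdio>
#include "mkldnn.h"

int main() {
    const mkldnn_version_t* v = mkldnn_version();
    std::printf("mkl-dnn %d.%d.%d (hash %s)\n", v->major, v->minor, v->patch, v->hash);
}

With the BUILD file above this should print 0.21.3 with hash N/A.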

@slava77 slava77 changed the title significant slow-down of tensorflow on AMD (non-AVX) significant slow-down of tensorflow on non-AVX machine(s) Apr 28, 2021
@slava77
Contributor Author

slava77 commented Apr 28, 2021

If I understand correctly, the same problem will be present on ARM and Power.
I updated the issue title, but did not check this explicitly.

@mrodozov
Contributor

mrodozov commented May 4, 2021

The library itself will be the same version. The problem might not be the same, though, as the gemm implementations on ARM and PPC employ different SIMD gimmicks than on x86. Could be worse :D

@slava77
Contributor Author

slava77 commented May 4, 2021

@gartung @smuzaffar @mrodozov
is there a way to trigger a profiling job for ARM and/or PPC? (I'd be interested at least in the 11634.21 wf)

@slava77
Contributor Author

slava77 commented May 4, 2021

is there a way to trigger a profiling job for ARM and/or PPC? (I'd be interested at least in the 11634.21 wf)

making a piechart/timing measurement should be enough

@mrodozov
Contributor

mrodozov commented May 4, 2021

enable profiling

@smuzaffar
Contributor

@slava77 , currently no. The profiling job is only run if we have profiling enabled for the IB; currently we only run profiling for the production arch, so the bot is not going to run profiling for PRs.

@gartung
Member

gartung commented May 4, 2021

You would need to run the run-pr-profiling job specifically on the ARM and/or PPC node.

@smuzaffar
Contributor

Of course one can manually run run-pr-profiling. @gartung, does PR profiling need anything from IB profiling?

@gartung
Member

gartung commented May 4, 2021

The Jenkins profiling jobs are set to run on nodes matching the profiling label, i.e. vocms011.

@slava77
Contributor Author

slava77 commented May 4, 2021

Of course one can manually run run-pr-profiling. @gartung, does PR profiling need anything from IB profiling?

I was mostly interested in a manual request to run.

@smuzaffar
Contributor

Let me start a job for the last 12.0.X IB.

@slava77
Contributor Author

slava77 commented May 5, 2021

Let me start a job for the last 12.0.X IB.

I see https://cmssdt.cern.ch/circles/web/piechart.php?local=false&dataset=CMSSW_12_0_X_2021-05-04-2300%2Fslc7_ppc64le_gcc9%2F11634.21%2Fstep4_PAT.resources&resource=time_thread&colours=default&groups=packages&threshold=0

but now I realized that I asked for the wrong workflow number; I was supposed to ask for 11834.21 (it has pileup and matches what we run in the IBs).

@smuzaffar
please start a job with 11834.21.
Thank you.

@smuzaffar
Contributor

@slava77 , ok restarted for both aarch64 and ppc64le.

@smuzaffar
Contributor

@slava77
Contributor Author

slava77 commented May 6, 2021

@slava77 , profiling is now available for ppc64le https://cmssdt.cern.ch/circles/web/piechart.php?local=false&dataset=CMSSW_12_0_X_2021-05-04-2300%2Fslc7_ppc64le_gcc9%2F11834.21%2Fstep4_PAT_PU.resources&resource=time_thread&colours=default&groups=packages&threshold=0

For aarch64, it is still running. Last time it timed out after 12 hours.

The fraction of DeepTauId is pretty similar between slc7_ppc64le_gcc9 and slc7_amd64_gcc900.
So, I guess the problem that I observed initially is not as simple as being a generic non-AVX case.

Regarding running on aarch64: I still do not see the aarch64 output in the regular place for the piecharts.
If there is a way to read the step2.root (or even step3.root for this specific case) produced in a reference job, I expect that step3/4 will complete without a timeout.

@jpata
Contributor

jpata commented May 17, 2022

I suppose this is still an issue. Do we have a way to produce timing charts for other arches?
Actually for my education, are the other arches actually used in production anywhere, or are they "nice-to-have R&D"?

@makortel
Contributor

Actually for my education, are the other arches actually used in production anywhere, or are they "nice-to-have R&D"?

PPC is on its way for production (physics validation and operational testing are going on at Marconi100), and tests on ARM HPC(s) should be starting in the near future.

@gartung
Member

gartung commented May 17, 2022

The pull request profiling script is set up to use the production arch of the release it will be merged into.

@gartung
Member

gartung commented May 17, 2022

You can manually trigger the profiling for a pull request and specify an arch other than the one automatically scheduled.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-profiling/
