significant slow-down of tensorflow on non-AVX machine(s) #33442

Open · 2 tasks
slava77 opened this issue Apr 15, 2021 · 31 comments

@slava77
Contributor

slava77 commented Apr 15, 2021

Originally from https://mattermost.web.cern.ch/cms-o-and-c/pl/zrtbufg8zbb9jgspeuxef183rc

I learned that TF inference is much slower on an older AMD compared to Intel.

Intel Broadwell: https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/133

AMD Opteron 6128 https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/29

Both are running the same inputs in a slightly older release where I had input data and where igprof was still working fine.

One example call to mkldnn_sgemm shows a very large difference between the two cases, about a factor of 1000 less on Intel (look at % total):
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.int34/2651

vs
https://slava77sk.web.cern.ch/slava77sk/reco/cgi-bin/igprof-navigator/sw-112X/CMSSW_11_2_0_pre7-orig-gcc820.TTbar_14UP21+DIGIPRMX.AVE_50_BX_25ns.1000.pp.wn36/30

[From @makortel ] Some slowdown was observed e.g. in https://mathematica.stackexchange.com/questions/64645/mkl-on-intel-vs-amd

I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp
Here mkldnn_sgemm calls extended_sgemm, which in turn makes a choice between gemm_driver [igprof cost 0.02%] and ref_gemm<float> [igprof cost 30%].

If that's correct, then my analysis is that mkldnn_sgemm is common to both cases and it's really just the implementation selected behind it (based on the SSE4.1 flag) that differs.
Then the difference in speed is a factor of close to 1000, which does not look reasonable. A better understanding of what we actually compile here would help to confirm. (It may be straightforward to modify the code and confirm more directly that ref_gemm is really that slow; see the sketch below.)
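
As a rough standalone way to gauge that last point (this is not the actual oneDNN code, and the 512^3 size below is made up for illustration), one can time a plain scalar triple-loop SGEMM, which is roughly what a reference gemm amounts to without blocking or vectorization, and compare the measured GFLOP/s with what the fast node reaches:

// Minimal standalone sketch (not the oneDNN ref_gemm): time a plain scalar
// triple-loop SGEMM, roughly what a reference gemm does without blocking or
// vectorization.  Build with e.g.: g++ -O2 -o naive_sgemm naive_sgemm.cpp
#include <chrono>
#include <cstdio>
#include <vector>

static void naive_sgemm(int M, int N, int K, float alpha, const float* A,
                        const float* B, float beta, float* C) {
    // C (MxN) = alpha * A (MxK) * B (KxN) + beta * C, row-major, no tiling
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
}

int main() {
    const int M = 512, N = 512, K = 512;  // made-up size, not what our models use
    std::vector<float> A(M * K, 1.f), B(K * N, 1.f), C(M * N, 0.f);
    const auto t0 = std::chrono::steady_clock::now();
    naive_sgemm(M, N, K, 1.f, A.data(), B.data(), 0.f, C.data());
    const auto t1 = std::chrono::steady_clock::now();
    const double dt = std::chrono::duration<double>(t1 - t0).count();
    std::printf("naive sgemm: %.3f s, %.2f GFLOP/s\n", dt, 2.0 * M * N * K / dt / 1e9);
}

Naively I would expect a scalar loop like this to lose more like a factor of 10-100 to an AVX-jitted kernel, so measuring it would help tell whether ref_gemm alone explains the factor of ~1000.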

Goals towards resolving the issue:

  • understand which oneDNN is used to compile our tensorflow
  • see if there is a faster solution for the older arch (pre-SSE4.1?)
@cmsbuild
Contributor

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core, reconstruction

@cmsbuild
Contributor

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

  • understand which oneDNN is used to compile our tensorflow

@smuzaffar @mrodozov can you comment?

@makortel
Contributor

FYI @riga @mialiu149

@slava77 slava77 changed the title significant slow-down of tensorflow on AMD (pre-SSE4.1) significant slow-down of tensorflow on AMD (non-AVX) Apr 15, 2021
@slava77
Contributor Author

slava77 commented Apr 15, 2021

I have a suspicion that we are using https://github.com/oneapi-src/oneDNN/blob/v1.0.4/src/cpu/gemm/gemm.cpp

In the head of 11_3_X we are using TF 2.4.1 and, based on https://github.com/cms-externals/tensorflow/blob/cms/v2.4.1/tensorflow/workspace.bzl, it has mkl_dnn pinned to v0.21.3.

So, to correct the initial assumption about SSE4.1: the dispatch is actually on AVX,
https://github.com/oneapi-src/oneDNN/blob/v0.21.3/src/cpu/gemm/gemm.cpp#L123-L136
and the logic there is

    if (mayiuse(avx512_mic)) {
        return jit_avx512_common_gemm_f32(transa, transb,
...
    } else if (mayiuse(avx)) {
...
        return gemm_driver(transa, transb, bias ? "C" : NULL, M, N, K, alpha,
...
    } else {
        return ref_gemm<float>(transa, transb,
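
Independently of oneDNN, a quick way to check which of these branches a given node would take is to query the CPU feature flags directly. Here is a minimal sketch using GCC's __builtin_cpu_supports (this is not oneDNN's own mayiuse(); the avx512_mic case actually covers several AVX-512 subsets, so avx512f is only an approximation here):

// Standalone CPU-feature check (assumes GCC/clang with __builtin_cpu_supports);
// not oneDNN's mayiuse(), just the raw ISA flags that drive the dispatch above:
// no AVX means falling through to ref_gemm<float>.
#include <cstdio>

int main() {
    __builtin_cpu_init();  // safe to call explicitly before the feature queries
    std::printf("sse4.1 : %d\n", __builtin_cpu_supports("sse4.1") ? 1 : 0);
    std::printf("avx    : %d\n", __builtin_cpu_supports("avx") ? 1 : 0);
    std::printf("avx2   : %d\n", __builtin_cpu_supports("avx2") ? 1 : 0);
    std::printf("avx512f: %d\n", __builtin_cpu_supports("avx512f") ? 1 : 0);
}

On the Opteron 6128 all four should come out 0 (it has no SSE4.1 and no AVX, if I recall correctly), consistent with it ending up in the ref_gemm branch.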

@mrodozov
Contributor

Right, so TF is not using oneDNN from our direct deps, since we don't have it anywhere else.

A better understanding of what we actually compile here would help to confirm.

Let me find the latest TF build logs to check.

@mrodozov
Contributor

Let the bot build it and we can check what Bazel is doing.

@slava77
Contributor Author

slava77 commented Apr 15, 2021

Let the bot build it and we can check what Bazel is doing.

BTW, do we have a debug build for our externals and CMSSW?

@mrodozov
Contributor

mrodozov commented Apr 15, 2021

We have 'a' debug build:
scram list DBG -> CMSSW_11_3_DBG_X_2021-04-08-2300
I'm not sure what is in the DBG build, but I'll assume only CMSSW was built with debug flags.
A debug build for all externals would require an additional cmsdist branch with debug flags for all externals in every spec.
From time to time we change the flags to debug for ROOT when there is something to be fixed.
Although I don't remember if Shazhad has invented something to put debug flags for all externals ... somehow

@makortel
Contributor

I'm not sure what is in the DBG build, but I'll assume only CMSSW was built with debug flags.

The DBG build is -g -O0 -DEDM_ML_DEBUG (mostly to help ensure everything would build with those options).

@mrodozov
Contributor

After reading this:
https://docs.bazel.build/versions/master/external.html#external-packages
it says the external packages downloaded by Bazel end up under bazel-tensorflow-2.4.1/external/, so mkl_dnn is in bazel-tensorflow-2.4.1/external/mkl_dnn,
and inside there is the BUILD.bazel, which is this:


exports_files(["LICENSE"])

load(
    "@org_tensorflow//third_party:common.bzl",
    "template_rule",
)

config_setting(
    name = "clang_linux_x86_64",
    values = {
        "cpu": "k8",
        "define": "using_clang=true",
    },
)

template_rule(
    name = "mkldnn_config_h",
    src = "include/mkldnn_config.h.in",
    out = "include/mkldnn_config.h",
    substitutions = {
        "#cmakedefine MKLDNN_CPU_BACKEND MKLDNN_BACKEND_${MKLDNN_CPU_BACKEND}": "#define MKLDNN_CPU_BACKEND MKLDNN_BACKEND_NATIVE",
        "#cmakedefine MKLDNN_GPU_BACKEND MKLDNN_BACKEND_${MKLDNN_GPU_BACKEND}": "#define MKLDNN_GPU_BACKEND MKLDNN_BACKEND_NONE",
    },
)

# Create the file mkldnn_version.h with MKL-DNN version numbers.
# Currently, the version numbers are hard coded here. If MKL-DNN is upgraded then
# the version numbers have to be updated manually. The version numbers can be
# obtained from the PROJECT_VERSION settings in CMakeLists.txt. The variable is
# set to "version_major.version_minor.version_patch". The git hash version can
# be set to NA.
# TODO(agramesh1) Automatically get the version numbers from CMakeLists.txt.
# TODO(bhavanis): MKL-DNN minor version needs to be updated for MKL-DNN v1.x.
# The current version numbers will work only if MKL-DNN v0.21 is used.

template_rule(
    name = "mkldnn_version_h",
    src = "include/mkldnn_version.h.in",
    out = "include/mkldnn_version.h",
    substitutions = {
        "@MKLDNN_VERSION_MAJOR@": "0",
        "@MKLDNN_VERSION_MINOR@": "21",
        "@MKLDNN_VERSION_PATCH@": "3",
        "@MKLDNN_VERSION_HASH@": "N/A",
    },
)

cc_library(
    name = "mkldnn_single_threaded",
    srcs = glob([
        "src/common/*.cpp",
        "src/common/*.hpp",
        "src/cpu/*.cpp",
        "src/cpu/*.hpp",
        "src/cpu/**/*.cpp",
        "src/cpu/**/*.hpp",
        "src/cpu/xbyak/*.h",
    ]) + [":mkldnn_version_h"],
    hdrs = glob(["include/*"]),
    copts = [
        "-fexceptions",
        "-DMKLDNN_THR=MKLDNN_THR_SEQ",  # Disables threading.
    ],
    includes = [
        "include",
        "src",
        "src/common",
        "src/cpu",
        "src/cpu/gemm",
        "src/cpu/xbyak",
    ],
    visibility = ["//visibility:public"],
)

Also, before that I checked the cache in the build directory and only one of the two tar files' hashes was there:
again the 0.21.3 version.
I'm also building TF without the 0.23.1 version to confirm the BUILD file will change.
What we can also do is build an IB with the old mkl_dnn version and rerun the tests.
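
If we want a cross-check from the built library rather than from the build logs, something along these lines might work. This is only a sketch under two assumptions I have not verified: that the v0.21.x C API exposes mkldnn_version() (which is what the generated mkldnn_version.h above appears to feed), and that the symbol is actually reachable from our TF shared libraries (it may be hidden by TF's linker version script, in which case one would link against a standalone build of the same tarball instead):

// Sketch only: print the mkl-dnn version compiled into the library.
// Assumes the mkldnn_version() C API of mkl-dnn v0.21.x and that the symbol
// is visible at link time -- both assumptions to be checked against our
// actual TF build (otherwise link against a standalone mkl-dnn, -lmkldnn).
#include <cstdio>
#include "mkldnn.h"

int main() {
    const mkldnn_version_t* v = mkldnn_version();
    std::printf("mkl-dnn %d.%d.%d (hash %s)\n", v->major, v->minor, v->patch, v->hash);
}

With the BUILD file above this should print 0.21.3 with hash N/A.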

@slava77 slava77 changed the title significant slow-down of tensorflow on AMD (non-AVX) significant slow-down of tensorflow on non-AVX machine(s) Apr 28, 2021
@slava77
Contributor Author

slava77 commented Apr 28, 2021

If I understand correctly, the same problem will be present on ARM and Power.
I updated the issue title, but did not check this explicitly.

@mrodozov
Contributor

mrodozov commented May 4, 2021

The library itself will be the same version. The problem might not be the same, though, as the gemm implementations on ARM and PPC employ different SIMD gimmicks than on x86. Could be worse :D

@slava77
Contributor Author

slava77 commented May 4, 2021

@gartung @smuzaffar @mrodozov
is there a way to trigger a profiling job for ARM and/or PPC? (I'd be interested at least in the 11634.21 wf)

@slava77
Contributor Author

slava77 commented May 4, 2021

is there a way to trigger a profiling job for ARM and/or PPC? (I'd be interested at least in the 11634.21 wf)

making a piechart/timing measurement should be enough

@mrodozov
Contributor

mrodozov commented May 4, 2021

enable profiling

@smuzaffar
Contributor

@slava77 , currently no. The profiling job is only run if we have profiling enabled for the IB; currently we only run profiling for the production arch, so the bot is not going to run profiling for PRs.

@gartung
Member

gartung commented May 4, 2021

You would need to run the run-pr-profiling job specifically on the ARM and/or PPC node.

@smuzaffar
Contributor

Of course one can manually run run-pr-profiling. @gartung, does PR profiling need anything from IB profiling?

@gartung
Member

gartung commented May 4, 2021

The Jenkins profiling jobs are set to run on nodes matching the profiling label, i.e. vocms011.

@slava77
Contributor Author

slava77 commented May 4, 2021

Of course one can manually run run-pr-profiling. @gartung, does PR profiling need anything from IB profiling?

I was mostly interested in a manual request to run.

@smuzaffar
Contributor

Let me start a job for the last 12.0.X IB.

@slava77
Contributor Author

slava77 commented May 5, 2021

Let me start a job for the last 12.0.X IB.

I see https://cmssdt.cern.ch/circles/web/piechart.php?local=false&dataset=CMSSW_12_0_X_2021-05-04-2300%2Fslc7_ppc64le_gcc9%2F11634.21%2Fstep4_PAT.resources&resource=time_thread&colours=default&groups=packages&threshold=0

but now I realized that I asked for the wrong workflow number; I was supposed to ask for 11834.21 (it has pileup and matches what we run in the IBs).

@smuzaffar
please start a job with 11834.21.
Thank you.

@smuzaffar
Contributor

@slava77 , ok restarted for both aarch64 and ppc64le.

@smuzaffar
Contributor

@slava77
Contributor Author

slava77 commented May 6, 2021

@slava77 , profiling is now available for ppc64le https://cmssdt.cern.ch/circles/web/piechart.php?local=false&dataset=CMSSW_12_0_X_2021-05-04-2300%2Fslc7_ppc64le_gcc9%2F11834.21%2Fstep4_PAT_PU.resources&resource=time_thread&colours=default&groups=packages&threshold=0

For aarch64, it is still running. Last time it timed out after 12 hours.

The fraction of DeepTauId is pretty similar between slc7_ppc64le_gcc9 and slc7_amd64_gcc900.
So, I guess the problem that I observed initially is not as simple as being a generic non-AVX case.

Regarding running on aarch64: I still do not see the aarch64 output in the regular place for the piecharts.
If there is a way to read the step2.root (or even step3.root for this specific case) produced in a reference job, I expect that step3/4 will complete without a timeout.

@jpata
Contributor

jpata commented May 17, 2022

I suppose this is still an issue. Do we have a way to produce timing charts for other arches?
Actually for my education, are the other arches actually used in production anywhere, or are they "nice-to-have R&D"?

@makortel
Contributor

Actually for my education, are the other arches actually used in production anywhere, or are they "nice-to-have R&D"?

PPC is on its way for production (physics validation and operational testing are going on at Marconi100), and tests on ARM HPC(s) should be starting in the near future.

@gartung
Member

gartung commented May 17, 2022

The pull request profiling script is set up to use the production arch of the release it will be merged into.

@gartung
Member

gartung commented May 17, 2022

You can manually trigger the profiling for a pull request and specify an arch other than the one automatically scheduled.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-profiling/
