
Conversation

Member

@9prady9 9prady9 commented Jun 20, 2018

Resolves #2146

@9prady9 9prady9 requested a review from umar456 June 20, 2018 10:49
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 27c7ba0 to ed7c846 Compare June 20, 2018 10:54
Contributor

@WilliamTambellini WilliamTambellini left a comment


Cool, thanks. Have you run at least a minimal benchmark to check that there is some speedup for the use case Intel is advertising?

@9prady9
Member Author

9prady9 commented Jun 20, 2018

@WilliamTambellini Yes, I am working on benchmarks. Once I have them in a presentable format, I will share them here.

@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from ed7c846 to 0859249 Compare July 11, 2018 04:29
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 0859249 to 57eb26d Compare July 21, 2018 05:37
@9prady9
Member Author

9prady9 commented Jul 21, 2018

Sorry about the delay, but we wanted to make sure the changes were indeed giving the performance improvements we expected. It turns out that with Intel TBB as the threading solution, the MKL batch call wasn't giving any performance improvement for some reason. Once we switched the threading to GNU OpenMP, we started seeing good performance improvements.
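For reference, the batched call in question is MKL's `cblas_?gemm_batch` family. The sketch below shows the semantics only, using numpy's batched `matmul` as a stand-in for MKL; numpy and the sizes chosen are assumptions for illustration, not part of this PR:

```python
import numpy as np

# Illustrative batched GEMM: numpy's batched matmul stands in for
# MKL's cblas_?gemm_batch, which this PR wires up in afcpu.
rng = np.random.default_rng(0)
batch, m, k, n = 8, 4, 5, 3
A = rng.standard_normal((batch, m, k)).astype(np.float32)
B = rng.standard_normal((batch, k, n)).astype(np.float32)

# Looped version: one GEMM per slice, as afcpu did before this change.
looped = np.stack([A[i] @ B[i] for i in range(batch)])

# Batched version: one call over all slices, so the library (not the
# caller) can parallelize across the batch dimension.
batched = A @ B

assert looped.shape == (batch, m, n)
assert np.allclose(looped, batched)
```

The point of the batched API is that the library can parallelize across the batch dimension itself, which is exactly where the thread-layer choice (TBB vs. OpenMP) shows up in the benchmarks.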

Given below is the link to the interactive chart that shows the numbers we have for batched operations.
https://docs.google.com/spreadsheets/d/e/2PACX-1vRfKVz-VV9qGae7fLC3wYHtvNROwgfZ9yI7mCDTvxXJ2wqV8ibtrp1BVxykIz9nTVlCg5ouRvfd1hFN/pubchart?oid=1109307214&format=interactive

It is by no means a comprehensive set of benchmarks, but it serves as a confirmation of the potential performance improvements and of the issue with TBB.

Update:
Raised an issue on the Intel community forums about this:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/783212

@9prady9
Member Author

9prady9 commented Jul 21, 2018

Looks like batched matmul failed on linux-e5; I will look into it.

@WilliamTambellini
Contributor

Hum, very interesting. Congrats.
@umar456 I am aware of another drawback with the TBB backend: it seems impossible to finely select the number of CPU cores used by afcpu. With OpenMP, just setting OMP_NUM_THREADS was at least working with AF 3.5. I know afcpu 3.6.1 has been built with the TBB backend, but would it be possible for afcpu 3.6.2 to switch back to OpenMP (GNU or Intel, your choice)?

@9prady9
Member Author

9prady9 commented Jul 24, 2018

We will be defaulting to GNU OpenMP on non-Windows platforms and Intel OpenMP on Windows.

@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 6 times, most recently from 48159fc to 67900f5 Compare July 30, 2018 07:42
@9prady9
Member Author

9prady9 commented Jul 30, 2018

The current changes to FindMKL seem to work on both Linux and Windows. However, OSX still can't find OpenMP; looking into it.

Update: Fixed this on OSX too. It turns out the Intel MKL installation on OSX doesn't ship mkl_gnu_thread; it has only mkl_intel_thread and mkl_tbb_thread. Therefore, I have made changes so that threading defaults to Intel OpenMP on Windows and Apple, and to GNU OpenMP on Linux.

@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 3 times, most recently from 998426f to a334db2 Compare July 31, 2018 04:54
@9prady9
Member Author

9prady9 commented Jul 31, 2018

linux-cpu-e5 (CPU details below) seems to be giving a slightly higher error value of 9.15527e-05 compared to the expected 1e-05. I have changed the expected value to 1e-04; @umar456, let me know if you don't like that.

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 61
model name	: Intel Core Processor (Broadwell, IBRS)
siblings	: 4
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat xsaveopt invpcid_single tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap

Update:
Quadro M6000 is also generating error values in the 1e-04 range, compared to the earlier expected 1e-05.
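For context on why 1e-05 is a tight tolerance for float32 results: a GEMM accumulates k products per output element, so the rounding error grows roughly with k times the float32 machine epsilon. A small sketch (numpy assumed; k = 1024 is an illustrative size, not the test's actual dimension):

```python
import numpy as np

# float32 machine epsilon is 2**-23 (~1.19e-07). A GEMM accumulates k
# products per output element, so a rough error scale of k * eps can
# already exceed 1e-05 for moderate k, making 1e-04 a safer tolerance.
eps = np.finfo(np.float32).eps
k = 1024
bound = k * eps
print(eps)    # 1.1920929e-07
print(bound)  # ~1.22e-04, above both 1e-05 and the observed 9.15527e-05
```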

@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 5 times, most recently from 10e3611 to e5e11d0 Compare August 7, 2018 10:54
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from e5e11d0 to 6b2669f Compare August 9, 2018 07:12
@9prady9
Member Author

9prady9 commented Aug 9, 2018

@umar456 I am not sure how to proceed: the blas_cpu test is failing in CI on linux-cpu-e5 with Intel OpenMP, but it doesn't fail when I run it manually from a shell (the exit code is also zero). This is the only blocker on this PR at the moment. All other jobs seem to be running fine with Intel OpenMP as the thread layer.

@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 6b2669f to febc154 Compare August 26, 2018 14:43
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 2 times, most recently from caba038 to 3d93e83 Compare October 24, 2018 04:46
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 3d93e83 to a6e9bfb Compare November 11, 2018 01:56
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 2 times, most recently from 2d304c1 to 7171bc8 Compare November 24, 2018 03:32
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 7171bc8 to 36f7120 Compare November 29, 2018 06:45
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from 36f7120 to 566fc3b Compare December 12, 2018 12:37
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch 2 times, most recently from 7ddd69f to c1fdc7f Compare December 24, 2018 18:06
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from c1fdc7f to bae596d Compare December 26, 2018 06:40
@9prady9 9prady9 dismissed umar456’s stale review December 26, 2018 06:46

Addressed feedback

Default to Intel OpenMP on Linux, OSX and Windows

Intel TBB is not giving the expected speedup as of now. Hence, switching
to OpenMP as the thread layer.

* GNU OpenMP was causing some OpenCL tests to fail on certain Debian
  configurations. Hence, choosing Intel OpenMP.
* The mkl_gnu_thread library needed for GNU OpenMP is not installed by
  OSX's Intel MKL installation, so on OSX the only option is Intel OpenMP.
* Windows OpenMP support lags behind by many versions, so we are using
  Intel OpenMP on Windows too.
@9prady9 9prady9 force-pushed the cpu_batched_matmul_mkl branch from bae596d to 8c06d90 Compare December 27, 2018 09:42
@9prady9 9prady9 requested a review from umar456 December 27, 2018 09:46
@@ -22,6 +22,9 @@ include(CheckCXXCompilerFlag)

arrayfire_set_cmake_default_variables()

#Set Intel OpenMP as default MKL thread layer
set(MKL_THREAD_LAYER "Intel OpenMP" CACHE STRING "The thread layer to choose for MKL")


This should go in the FindMKL.cmake file.
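Since `MKL_THREAD_LAYER` is declared as a `CACHE STRING`, it can also be overridden at configure time without editing any file. A hypothetical invocation (the value string follows the options discussed in this PR; run from a build directory of an ArrayFire checkout):

```shell
# Hypothetical configure-time override of the MKL thread layer
# cache variable, e.g. to switch back to GNU OpenMP.
cmake -DMKL_THREAD_LAYER="GNU OpenMP" ..
```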

@9prady9 9prady9 merged commit 5cc9967 into arrayfire:master Dec 27, 2018
@9prady9 9prady9 deleted the cpu_batched_matmul_mkl branch December 27, 2018 15:31