Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dist() works on 10k X 10k, but not on subset: 10K X 100 #131

Closed
Toniiiio opened this issue May 25, 2018 · 13 comments
Closed

dist() works on 10k X 10k, but not on subset: 10K X 100 #131

Toniiiio opened this issue May 25, 2018 · 13 comments
Assignees
Labels
Milestone

Comments

@Toniiiio
Copy link

Hi,

thanks for any infos already. A feedback would be very appreciated.

The overall goal would be to have up to 50k X 300 matrices,
it would be nice to have 10k X 100.

  1. GPU details at the end.
  2. Code used (on empty environment): (note first part which takes 10sec, will speed up to <1sec after first run).

Thanks for any infos!

>library(gpuR)
>gpuInfo()
>ndim <- 1000
> A <- gpuMatrix(rnorm(ndim^2), nrow = ndim, ncol = ndim)
> start <- Sys.time()
> B <- dist(A)
> end <- Sys.time()
> end - start
Time difference of 10.44723 secs

> AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100)
> start <- Sys.time()
> BB <- dist(AA)
Error in cpp_gpuMatrix_eucl(A@address, D@address, squareDist, 8L) : 
  ViennaCL: FATAL ERROR: CL_OUT_OF_RESOURCES 
 ViennaCL tried to launch a compute kernel, but the device does not provide enough resources. Try changing the global and local work item sizes.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!
> end <- Sys.time()
> end - start
Time difference of 0.06125021 secs

Number of platforms: 1

  • platform: NVIDIA Corporation: OpenCL 1.2 CUDA 8.0.0
    • context device index: 0
      • Tesla K80
    • context device index: 1
      • Tesla K80
    • context device index: 2
      • Tesla K80
    • context device index: 3
      • Tesla K80
    • context device index: 4
      • Tesla K80
    • context device index: 5
      • Tesla K80
    • context device index: 6
      • Tesla K80
    • context device index: 7
      • Tesla K80
        checked all devices
        completed initialization
        gpuR 2.0.0

Attaching package: ‘gpuR’

The following objects are masked from ‘package:base’:

colnames, pmax, pmin, svd

gpuInfo()
$deviceName
[1] "Tesla K80"

$deviceVendor
[1] "NVIDIA Corporation"

$numberOfCores
[1] 13

$maxWorkGroupSize
[1] 1024

$maxWorkItemDim
[1] 3

$maxWorkItemSizes
[1] 1024 1024 64

$deviceMemory
[1] 11995578368

$clockFreq
[1] 823

$localMem
[1] 49152

$maxAllocatableMem
[1] 2998894592

$available
[1] "yes"

$deviceExtensions
[1] "cl_khr_global_int32_base_atomics"
[2] "cl_khr_global_int32_extended_atomics"
[3] "cl_khr_local_int32_base_atomics"
[4] "cl_khr_local_int32_extended_atomics"
[5] "cl_khr_fp64"
[6] "cl_khr_byte_addressable_store"
[7] "cl_khr_icd"
[8] "cl_khr_gl_sharing"
[9] "cl_nv_compiler_options"
[10] "cl_nv_device_attribute_query"
[11] "cl_nv_pragma_unroll"
[12] "cl_nv_copy_opts"

$double_support
[1] TRUE

cdeterman added a commit that referenced this issue May 30, 2018
@cdeterman
Copy link
Owner

@TiMaG Could you try the latest develop version?

devtools::install_github('cdeterman/gpuR', ref = 'develop')

There have been some changes there and I have added some additional comments that will spit out during dist to help debugging this. You certainly shouldn't be encountering memory limitations with those Telsa GPUs and those matrix sizes.

@Toniiiio
Copy link
Author

Toniiiio commented May 31, 2018

@cdeterman: Thank you for the reply. It works for dim = 1000 with the develop version. But not for dim>=1500.
What i am surprised by, is that 1500X1500 matrix works but not 1500X100 matrix. Even 10000X10000 works fine.
I lack ideas why the subset fails, while the bigger martrix (code) doesnt result in errors.

Thanks a lot!
(Offtopic: To not steal too much of your time - If i may ask - are you open for consulting hours (paid of course)).

@cdeterman
Copy link
Owner

@TiMaG sorry for my delayed response. Thanks for testing again, did you have any of the comments that the current develop version should have printed? For example, there should be lines like "pulled data" and "create 'twos' matrix". I will need these to know where the problem is happening. I find it curious that the square matrices are working for you at the larger sizes but not the rectangular ones.

Regarding consulting, I am possibly open but we should take such conversations off list to my email at cdetermanjr@gmail.com

@cdeterman cdeterman self-assigned this Jun 5, 2018
@cdeterman cdeterman added the bug label Jun 5, 2018
@Toniiiio
Copy link
Author

I wasnt sure whether it would be more easy to combine it with the "consulting" and sharing screen. But to keep going i just went ahead already:

Concerning your question: No i dont see any messages like that:

> devtools::install_github('cdeterman/gpuR', ref = 'develop')
> ndim <- 1500
> AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100)
> start <- Sys.time()
> BB <- dist(AA)
Error in cpp_gpuMatrix_eucl(A@address, D@address, squareDist, 6L) : 
  ViennaCL: FATAL ERROR: CL_OUT_OF_RESOURCES 
 ViennaCL tried to launch a compute kernel, but the device does not provide enough resources. Try changing the global and local work item sizes.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!
> end <- Sys.time()
> end - start
Time difference of 0.07861066 secs

@cdeterman
Copy link
Owner

@TiMaG ah, that was because I only put the comments in the vclMatrix code. I have now added to the gpuMatrix code. Feel free to install again at your convenience. I apologize again for the delayed response, things have been quite busy for me lately. Curiously I am having trouble reproducing your error. Even on my puny Intel GPU, I am able to do the following without a problem.

library(gpuR)

ndim <- 5000
AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100, type = "double")
BB <- dist(AA)

@Toniiiio
Copy link
Author

@cdeterman No problem at all.
The output is kind of long. Please find it below. Hope it helps.

> library(gpuR)
Number of platforms: 1
- platform: NVIDIA Corporation: OpenCL 1.2 CUDA 8.0.0
  - context device index: 0
    - Tesla K80
  - context device index: 1
    - Tesla K80
  - context device index: 2
    - Tesla K80
  - context device index: 3
    - Tesla K80
  - context device index: 4
    - Tesla K80
  - context device index: 5
    - Tesla K80
  - context device index: 6
    - Tesla K80
  - context device index: 7
    - Tesla K80
checked all devices
completed initialization
gpuR 2.0.3

Attaching package: ‘gpuR’

The following objects are masked from ‘package:base’:

    colnames, pmax, pmin, svd

> ndim <- 1500
> AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100)
> AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100)
> start <- Sys.time()
> BB <- dist(AA)
pulled data
row of zeros
powers and rowsum completed
outer product complete
transpose complete
Build Status = -2 ( Err = -9999 )
Log: 

Sources: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
 __attribute__((reqd_work_group_size(128,1,1)))
__kernel void _prod_NN(unsigned int M, unsigned int N, unsigned int K, __global float* obj0_pointer,unsigned int obj0_ld,unsigned int obj0_start2,unsigned int obj0_start1,unsigned int obj0_stride2,unsigned int obj0_stride1,__global float* obj1_pointer,unsigned int obj1_ld,unsigned int obj1_start1,unsigned int obj1_start2,unsigned int obj1_stride1,unsigned int obj1_stride2,__global float* obj3_pointer,unsigned int obj3_ld,unsigned int obj3_start1,unsigned int obj3_start2,unsigned int obj3_stride1,unsigned int obj3_stride2,float obj4,float obj5)
{
    obj1_start1 /= 1;
    obj1_ld /= 1;
    obj3_start1/= 1;
    obj3_ld /= 1;
    obj0_pointer += (obj0_start1) * obj0_ld + ( obj0_start2);
    obj1_pointer += (obj1_start1) +  ( obj1_start2) * obj1_ld;
    obj3_pointer += (obj3_start1) +  ( obj3_start2) * obj3_ld;
    obj0_ld *= obj0_stride1;
    obj1_ld *= obj1_stride2;
    obj3_ld *= obj3_stride2;
    float rC[4][16] = {{(float)0}};
    float rA[2][4];
    float rB[2][16];__local float lB[1088];
    size_t gidx = get_group_id(0);
    size_t gidy = get_group_id(1);
    size_t idx = get_local_id(0);
    size_t idy = get_local_id(1);
    
    size_t idt = 128*idy + idx;
    size_t idxT = idt % 16;
    size_t idyT = idt / 16;
    
    obj1_pointer += (gidx*512+ idx);
    obj3_pointer += idxT + gidy*16*obj3_ld + idyT*obj3_ld;
    
    size_t K_size_t = K;
    for(size_t block_k=0; block_k < K_size_t; block_k+=64){
        __local float* plB = lB + idxT*17+ idyT;
        barrier(CLK_LOCAL_MEM_FENCE);
        (plB + 0)[0] = obj3_pointer[0*obj3_ld + 0];
        (plB + 272)[0] = obj3_pointer[0*obj3_ld + 16];
        (plB + 544)[0] = obj3_pointer[0*obj3_ld + 32];
        (plB + 816)[0] = obj3_pointer[0*obj3_ld + 48];
        (plB + 8)[0] = obj3_pointer[8*obj3_ld + 0];
        (plB + 280)[0] = obj3_pointer[8*obj3_ld + 16];
        (plB + 552)[0] = obj3_pointer[8*obj3_ld + 32];
        (plB + 824)[0] = obj3_pointer[8*obj3_ld + 48];
        barrier(CLK_LOCAL_MEM_FENCE);
        size_t offA = 1*idx;
        size_t offB = 1*idy;
        for(size_t k = 0; k < 64; k+=2){
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 4
            for(size_t mm = 0; mm < 4; mm++)
            {
                rA[kk][mm] = obj1_pointer[kk*obj1_ld + mm*128];
            }
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 16
            for(size_t nn = 0; nn < 16; nn++)
            {
                rB[kk][nn*1+0] = lB[offB + nn*1+0+ kk*17];
            }
            obj1_pointer += 2*obj1_ld;
            offB += 34;
            #pragma unroll 2
            for(size_t kk = 0; kk <2; ++kk)
            {
                rC[0][0]=fma(rA[kk][0],rB[kk][0],rC[0][0]);
                rC[1][0]=fma(rA[kk][1],rB[kk][0],rC[1][0]);
                rC[2][0]=fma(rA[kk][2],rB[kk][0],rC[2][0]);
                rC[3][0]=fma(rA[kk][3],rB[kk][0],rC[3][0]);
                rC[0][1]=fma(rA[kk][0],rB[kk][1],rC[0][1]);
                rC[1][1]=fma(rA[kk][1],rB[kk][1],rC[1][1]);
                rC[2][1]=fma(rA[kk][2],rB[kk][1],rC[2][1]);
                rC[3][1]=fma(rA[kk][3],rB[kk][1],rC[3][1]);
                rC[0][2]=fma(rA[kk][0],rB[kk][2],rC[0][2]);
                rC[1][2]=fma(rA[kk][1],rB[kk][2],rC[1][2]);
                rC[2][2]=fma(rA[kk][2],rB[kk][2],rC[2][2]);
                rC[3][2]=fma(rA[kk][3],rB[kk][2],rC[3][2]);
                rC[0][3]=fma(rA[kk][0],rB[kk][3],rC[0][3]);
                rC[1][3]=fma(rA[kk][1],rB[kk][3],rC[1][3]);
                rC[2][3]=fma(rA[kk][2],rB[kk][3],rC[2][3]);
                rC[3][3]=fma(rA[kk][3],rB[kk][3],rC[3][3]);
                rC[0][4]=fma(rA[kk][0],rB[kk][4],rC[0][4]);
                rC[1][4]=fma(rA[kk][1],rB[kk][4],rC[1][4]);
                rC[2][4]=fma(rA[kk][2],rB[kk][4],rC[2][4]);
                rC[3][4]=fma(rA[kk][3],rB[kk][4],rC[3][4]);
                rC[0][5]=fma(rA[kk][0],rB[kk][5],rC[0][5]);
                rC[1][5]=fma(rA[kk][1],rB[kk][5],rC[1][5]);
                rC[2][5]=fma(rA[kk][2],rB[kk][5],rC[2][5]);
                rC[3][5]=fma(rA[kk][3],rB[kk][5],rC[3][5]);
                rC[0][6]=fma(rA[kk][0],rB[kk][6],rC[0][6]);
                rC[1][6]=fma(rA[kk][1],rB[kk][6],rC[1][6]);
                rC[2][6]=fma(rA[kk][2],rB[kk][6],rC[2][6]);
                rC[3][6]=fma(rA[kk][3],rB[kk][6],rC[3][6]);
                rC[0][7]=fma(rA[kk][0],rB[kk][7],rC[0][7]);
                rC[1][7]=fma(rA[kk][1],rB[kk][7],rC[1][7]);
                rC[2][7]=fma(rA[kk][2],rB[kk][7],rC[2][7]);
                rC[3][7]=fma(rA[kk][3],rB[kk][7],rC[3][7]);
                rC[0][8]=fma(rA[kk][0],rB[kk][8],rC[0][8]);
                rC[1][8]=fma(rA[kk][1],rB[kk][8],rC[1][8]);
                rC[2][8]=fma(rA[kk][2],rB[kk][8],rC[2][8]);
                rC[3][8]=fma(rA[kk][3],rB[kk][8],rC[3][8]);
                rC[0][9]=fma(rA[kk][0],rB[kk][9],rC[0][9]);
                rC[1][9]=fma(rA[kk][1],rB[kk][9],rC[1][9]);
                rC[2][9]=fma(rA[kk][2],rB[kk][9],rC[2][9]);
                rC[3][9]=fma(rA[kk][3],rB[kk][9],rC[3][9]);
                rC[0][10]=fma(rA[kk][0],rB[kk][10],rC[0][10]);
                rC[1][10]=fma(rA[kk][1],rB[kk][10],rC[1][10]);
                rC[2][10]=fma(rA[kk][2],rB[kk][10],rC[2][10]);
                rC[3][10]=fma(rA[kk][3],rB[kk][10],rC[3][10]);
                rC[0][11]=fma(rA[kk][0],rB[kk][11],rC[0][11]);
                rC[1][11]=fma(rA[kk][1],rB[kk][11],rC[1][11]);
                rC[2][11]=fma(rA[kk][2],rB[kk][11],rC[2][11]);
                rC[3][11]=fma(rA[kk][3],rB[kk][11],rC[3][11]);
                rC[0][12]=fma(rA[kk][0],rB[kk][12],rC[0][12]);
                rC[1][12]=fma(rA[kk][1],rB[kk][12],rC[1][12]);
                rC[2][12]=fma(rA[kk][2],rB[kk][12],rC[2][12]);
                rC[3][12]=fma(rA[kk][3],rB[kk][12],rC[3][12]);
                rC[0][13]=fma(rA[kk][0],rB[kk][13],rC[0][13]);
                rC[1][13]=fma(rA[kk][1],rB[kk][13],rC[1][13]);
                rC[2][13]=fma(rA[kk][2],rB[kk][13],rC[2][13]);
                rC[3][13]=fma(rA[kk][3],rB[kk][13],rC[3][13]);
                rC[0][14]=fma(rA[kk][0],rB[kk][14],rC[0][14]);
                rC[1][14]=fma(rA[kk][1],rB[kk][14],rC[1][14]);
                rC[2][14]=fma(rA[kk][2],rB[kk][14],rC[2][14]);
                rC[3][14]=fma(rA[kk][3],rB[kk][14],rC[3][14]);
                rC[0][15]=fma(rA[kk][0],rB[kk][15],rC[0][15]);
                rC[1][15]=fma(rA[kk][1],rB[kk][15],rC[1][15]);
                rC[2][15]=fma(rA[kk][2],rB[kk][15],rC[2][15]);
                rC[3][15]=fma(rA[kk][3],rB[kk][15],rC[3][15]);
            }
        }
        obj3_pointer += 64;
    }
    obj0_pointer += gidx*512*obj0_ld;
    obj0_pointer += idx*1*obj0_ld;
    obj0_pointer += gidy*16*obj0_stride2;
    obj0_pointer += idy*1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][0]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][0]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][0]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][0]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][1]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][1]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][1]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][1]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][2]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][2]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][2]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][2]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][3]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][3]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][3]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][3]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][4]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][4]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][4]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][4]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][5]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][5]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][5]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][5]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][6]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][6]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][6]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][6]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][7]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][7]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][7]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][7]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][8]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][8]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][8]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][8]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][9]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][9]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][9]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][9]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][10]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][10]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][10]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][10]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][11]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][11]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][11]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][11]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][12]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][12]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][12]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][12]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][13]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][13]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][13]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][13]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][14]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][14]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][14]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][14]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][15]*obj4+ obj0_pointer[0*obj0_ld]*obj5;
    obj0_pointer[128*obj0_ld] = rC[1][15]*obj4+ obj0_pointer[128*obj0_ld]*obj5;
    obj0_pointer[256*obj0_ld] = rC[2][15]*obj4+ obj0_pointer[256*obj0_ld]*obj5;
    obj0_pointer[384*obj0_ld] = rC[3][15]*obj4+ obj0_pointer[384*obj0_ld]*obj5;
    obj0_pointer += 1*obj0_stride2;
}
 __attribute__((reqd_work_group_size(32,32,1)))
__kernel void _prod_TN(unsigned int M, unsigned int N, unsigned int K, __global float* obj0_pointer,unsigned int obj0_ld,unsigned int obj0_start2,unsigned int obj0_start1,unsigned int obj0_stride2,unsigned int obj0_stride1,__global float* obj1_pointer,unsigned int obj1_ld,unsigned int obj1_start1,unsigned int obj1_start2,unsigned int obj1_stride1,unsigned int obj1_stride2,__global float* obj4_pointer,unsigned int obj4_ld,unsigned int obj4_start1,unsigned int obj4_start2,unsigned int obj4_stride1,unsigned int obj4_stride2,float obj5,float obj6)
{
    obj1_start1 /= 1;
    obj1_ld /= 1;
    obj4_start1/= 1;
    obj4_ld /= 1;
    obj0_pointer += (obj0_start1) * obj0_ld + ( obj0_start2);
    obj1_pointer += (obj1_start1) +  ( obj1_start2) * obj1_ld;
    obj4_pointer += (obj4_start1) +  ( obj4_start2) * obj4_ld;
    obj0_ld *= obj0_stride1;
    obj1_ld *= obj1_stride2;
    obj4_ld *= obj4_stride2;
    float rC[8][4] = {{(float)0}};
    float rA[2][8];
    float rB[2][4];__local float lA[4112];__local float lB[2064];
    size_t gidx = get_group_id(0);
    size_t gidy = get_group_id(1);
    size_t idx = get_local_id(0);
    size_t idy = get_local_id(1);
    
    size_t idt = 32*idy + idx;
    size_t idxT = idt % 16;
    size_t idyT = idt / 16;
    
    obj1_pointer += idxT + gidx*256*obj1_ld + idyT*obj1_ld;
    obj4_pointer += idxT + gidy*128*obj4_ld + idyT*obj4_ld;
    
    size_t K_size_t = K;
    for(size_t block_k=0; block_k < K_size_t; block_k+=16){
        __local float* plA = lA + idxT*257 + idyT;
        __local float* plB = lB + idxT*129+ idyT;
        barrier(CLK_LOCAL_MEM_FENCE);
        (plA + 0)[0] = obj1_pointer[0*obj1_ld + 0];
        (plA + 64)[0] = obj1_pointer[64*obj1_ld + 0];
        (plA + 128)[0] = obj1_pointer[128*obj1_ld + 0];
        (plA + 192)[0] = obj1_pointer[192*obj1_ld + 0];
        (plB + 0)[0] = obj4_pointer[0*obj4_ld + 0];
        (plB + 64)[0] = obj4_pointer[64*obj4_ld + 0];
        barrier(CLK_LOCAL_MEM_FENCE);
        size_t offA = 1*idx;
        size_t offB = 1*idy;
        for(size_t k = 0; k < 16; k+=2){
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 8
            for(size_t mm = 0; mm < 8; mm++)
            {
                rA[kk][mm*1+0] = lA[offA + mm*32+0+ kk*257];
            }
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 4
            for(size_t nn = 0; nn < 4; nn++)
            {
                rB[kk][nn*1+0] = lB[offB + nn*32+0+ kk*129];
            }
            offA += 514;
            offB += 258;
            #pragma unroll 2
            for(size_t kk = 0; kk <2; ++kk)
            {
                rC[0][0]=fma(rA[kk][0],rB[kk][0],rC[0][0]);
                rC[1][0]=fma(rA[kk][1],rB[kk][0],rC[1][0]);
                rC[2][0]=fma(rA[kk][2],rB[kk][0],rC[2][0]);
                rC[3][0]=fma(rA[kk][3],rB[kk][0],rC[3][0]);
                rC[4][0]=fma(rA[kk][4],rB[kk][0],rC[4][0]);
                rC[5][0]=fma(rA[kk][5],rB[kk][0],rC[5][0]);
                rC[6][0]=fma(rA[kk][6],rB[kk][0],rC[6][0]);
                rC[7][0]=fma(rA[kk][7],rB[kk][0],rC[7][0]);
                rC[0][1]=fma(rA[kk][0],rB[kk][1],rC[0][1]);
                rC[1][1]=fma(rA[kk][1],rB[kk][1],rC[1][1]);
                rC[2][1]=fma(rA[kk][2],rB[kk][1],rC[2][1]);
                rC[3][1]=fma(rA[kk][3],rB[kk][1],rC[3][1]);
                rC[4][1]=fma(rA[kk][4],rB[kk][1],rC[4][1]);
                rC[5][1]=fma(rA[kk][5],rB[kk][1],rC[5][1]);
                rC[6][1]=fma(rA[kk][6],rB[kk][1],rC[6][1]);
                rC[7][1]=fma(rA[kk][7],rB[kk][1],rC[7][1]);
                rC[0][2]=fma(rA[kk][0],rB[kk][2],rC[0][2]);
                rC[1][2]=fma(rA[kk][1],rB[kk][2],rC[1][2]);
                rC[2][2]=fma(rA[kk][2],rB[kk][2],rC[2][2]);
                rC[3][2]=fma(rA[kk][3],rB[kk][2],rC[3][2]);
                rC[4][2]=fma(rA[kk][4],rB[kk][2],rC[4][2]);
                rC[5][2]=fma(rA[kk][5],rB[kk][2],rC[5][2]);
                rC[6][2]=fma(rA[kk][6],rB[kk][2],rC[6][2]);
                rC[7][2]=fma(rA[kk][7],rB[kk][2],rC[7][2]);
                rC[0][3]=fma(rA[kk][0],rB[kk][3],rC[0][3]);
                rC[1][3]=fma(rA[kk][1],rB[kk][3],rC[1][3]);
                rC[2][3]=fma(rA[kk][2],rB[kk][3],rC[2][3]);
                rC[3][3]=fma(rA[kk][3],rB[kk][3],rC[3][3]);
                rC[4][3]=fma(rA[kk][4],rB[kk][3],rC[4][3]);
                rC[5][3]=fma(rA[kk][5],rB[kk][3],rC[5][3]);
                rC[6][3]=fma(rA[kk][6],rB[kk][3],rC[6][3]);
                rC[7][3]=fma(rA[kk][7],rB[kk][3],rC[7][3]);
            }
        }
        obj1_pointer += 16;
        obj4_pointer += 16;
    }
    obj0_pointer += gidx*256*obj0_ld;
    obj0_pointer += idx*1*obj0_ld;
    obj0_pointer += gidy*128*obj0_stride2;
    obj0_pointer += idy*1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][0]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[1][0]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[64*obj0_ld] = rC[2][0]*obj5+ obj0_pointer[64*obj0_ld]*obj6;
    obj0_pointer[96*obj0_ld] = rC[3][0]*obj5+ obj0_pointer[96*obj0_ld]*obj6;
    obj0_pointer[128*obj0_ld] = rC[4][0]*obj5+ obj0_pointer[128*obj0_ld]*obj6;
    obj0_pointer[160*obj0_ld] = rC[5][0]*obj5+ obj0_pointer[160*obj0_ld]*obj6;
    obj0_pointer[192*obj0_ld] = rC[6][0]*obj5+ obj0_pointer[192*obj0_ld]*obj6;
    obj0_pointer[224*obj0_ld] = rC[7][0]*obj5+ obj0_pointer[224*obj0_ld]*obj6;
    obj0_pointer += 32*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][1]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[1][1]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[64*obj0_ld] = rC[2][1]*obj5+ obj0_pointer[64*obj0_ld]*obj6;
    obj0_pointer[96*obj0_ld] = rC[3][1]*obj5+ obj0_pointer[96*obj0_ld]*obj6;
    obj0_pointer[128*obj0_ld] = rC[4][1]*obj5+ obj0_pointer[128*obj0_ld]*obj6;
    obj0_pointer[160*obj0_ld] = rC[5][1]*obj5+ obj0_pointer[160*obj0_ld]*obj6;
    obj0_pointer[192*obj0_ld] = rC[6][1]*obj5+ obj0_pointer[192*obj0_ld]*obj6;
    obj0_pointer[224*obj0_ld] = rC[7][1]*obj5+ obj0_pointer[224*obj0_ld]*obj6;
    obj0_pointer += 32*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][2]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[1][2]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[64*obj0_ld] = rC[2][2]*obj5+ obj0_pointer[64*obj0_ld]*obj6;
    obj0_pointer[96*obj0_ld] = rC[3][2]*obj5+ obj0_pointer[96*obj0_ld]*obj6;
    obj0_pointer[128*obj0_ld] = rC[4][2]*obj5+ obj0_pointer[128*obj0_ld]*obj6;
    obj0_pointer[160*obj0_ld] = rC[5][2]*obj5+ obj0_pointer[160*obj0_ld]*obj6;
    obj0_pointer[192*obj0_ld] = rC[6][2]*obj5+ obj0_pointer[192*obj0_ld]*obj6;
    obj0_pointer[224*obj0_ld] = rC[7][2]*obj5+ obj0_pointer[224*obj0_ld]*obj6;
    obj0_pointer += 32*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][3]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[1][3]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[64*obj0_ld] = rC[2][3]*obj5+ obj0_pointer[64*obj0_ld]*obj6;
    obj0_pointer[96*obj0_ld] = rC[3][3]*obj5+ obj0_pointer[96*obj0_ld]*obj6;
    obj0_pointer[128*obj0_ld] = rC[4][3]*obj5+ obj0_pointer[128*obj0_ld]*obj6;
    obj0_pointer[160*obj0_ld] = rC[5][3]*obj5+ obj0_pointer[160*obj0_ld]*obj6;
    obj0_pointer[192*obj0_ld] = rC[6][3]*obj5+ obj0_pointer[192*obj0_ld]*obj6;
    obj0_pointer[224*obj0_ld] = rC[7][3]*obj5+ obj0_pointer[224*obj0_ld]*obj6;
    obj0_pointer += 32*obj0_stride2;
}
 __attribute__((reqd_work_group_size(8,4,1)))
__kernel void _prod_NT(unsigned int M, unsigned int N, unsigned int K, __global float* obj0_pointer,unsigned int obj0_ld,unsigned int obj0_start2,unsigned int obj0_start1,unsigned int obj0_stride2,unsigned int obj0_stride1,__global float4* obj1_pointer,unsigned int obj1_ld,unsigned int obj1_start1,unsigned int obj1_start2,unsigned int obj1_stride1,unsigned int obj1_stride2,__global float4* obj3_pointer,unsigned int obj3_ld,unsigned int obj3_start1,unsigned int obj3_start2,unsigned int obj3_stride1,unsigned int obj3_stride2,float obj5,float obj6)
{
    obj1_start1 /= 4;
    obj1_ld /= 4;
    obj3_start1/= 4;
    obj3_ld /= 4;
    obj0_pointer += (obj0_start1) * obj0_ld + ( obj0_start2);
    obj1_pointer += (obj1_start1) +  ( obj1_start2) * obj1_ld;
    obj3_pointer += (obj3_start1) +  ( obj3_start2) * obj3_ld;
    obj0_ld *= obj0_stride1;
    obj1_ld *= obj1_stride2;
    obj3_ld *= obj3_stride2;
    float rC[8][8] = {{(float)0}};
    float4 rA[2][2];
    float4 rB[2][2];
    
    size_t gidx = get_group_id(0);
    size_t gidy = get_group_id(1);
    size_t idx = get_local_id(0);
    size_t idy = get_local_id(1);
    
    obj1_pointer += (gidx*16+ idx);
    obj3_pointer += (gidy*8+ idy*2);
    
    size_t K_size_t = K;
    for(size_t block_k=0; block_k < K_size_t; block_k+=2){
        for(size_t k = 0; k < 2; k+=2){
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 2
            for(size_t mm = 0; mm < 2; mm++)
            {
                rA[kk][mm] = obj1_pointer[kk*obj1_ld + mm*8];
            }
            #pragma unroll 2
            for(size_t kk = 0; kk < 2; kk++)
            #pragma unroll 2
            for(size_t nn = 0; nn < 2; nn++)
            {
                rB[kk][nn] = obj3_pointer[kk*obj3_ld + nn];
            }
            obj1_pointer += 2*obj1_ld;
            obj3_pointer += 2*obj3_ld;
            #pragma unroll 2
            for(size_t kk = 0; kk <2; ++kk)
            {
                rC[0][0]=fma(rA[kk][0].s0,rB[kk][0].s0,rC[0][0]);
                rC[1][0]=fma(rA[kk][0].s1,rB[kk][0].s0,rC[1][0]);
                rC[2][0]=fma(rA[kk][0].s2,rB[kk][0].s0,rC[2][0]);
                rC[3][0]=fma(rA[kk][0].s3,rB[kk][0].s0,rC[3][0]);
                rC[4][0]=fma(rA[kk][1].s0,rB[kk][0].s0,rC[4][0]);
                rC[5][0]=fma(rA[kk][1].s1,rB[kk][0].s0,rC[5][0]);
                rC[6][0]=fma(rA[kk][1].s2,rB[kk][0].s0,rC[6][0]);
                rC[7][0]=fma(rA[kk][1].s3,rB[kk][0].s0,rC[7][0]);
                rC[0][1]=fma(rA[kk][0].s0,rB[kk][0].s1,rC[0][1]);
                rC[1][1]=fma(rA[kk][0].s1,rB[kk][0].s1,rC[1][1]);
                rC[2][1]=fma(rA[kk][0].s2,rB[kk][0].s1,rC[2][1]);
                rC[3][1]=fma(rA[kk][0].s3,rB[kk][0].s1,rC[3][1]);
                rC[4][1]=fma(rA[kk][1].s0,rB[kk][0].s1,rC[4][1]);
                rC[5][1]=fma(rA[kk][1].s1,rB[kk][0].s1,rC[5][1]);
                rC[6][1]=fma(rA[kk][1].s2,rB[kk][0].s1,rC[6][1]);
                rC[7][1]=fma(rA[kk][1].s3,rB[kk][0].s1,rC[7][1]);
                rC[0][2]=fma(rA[kk][0].s0,rB[kk][0].s2,rC[0][2]);
                rC[1][2]=fma(rA[kk][0].s1,rB[kk][0].s2,rC[1][2]);
                rC[2][2]=fma(rA[kk][0].s2,rB[kk][0].s2,rC[2][2]);
                rC[3][2]=fma(rA[kk][0].s3,rB[kk][0].s2,rC[3][2]);
                rC[4][2]=fma(rA[kk][1].s0,rB[kk][0].s2,rC[4][2]);
                rC[5][2]=fma(rA[kk][1].s1,rB[kk][0].s2,rC[5][2]);
                rC[6][2]=fma(rA[kk][1].s2,rB[kk][0].s2,rC[6][2]);
                rC[7][2]=fma(rA[kk][1].s3,rB[kk][0].s2,rC[7][2]);
                rC[0][3]=fma(rA[kk][0].s0,rB[kk][0].s3,rC[0][3]);
                rC[1][3]=fma(rA[kk][0].s1,rB[kk][0].s3,rC[1][3]);
                rC[2][3]=fma(rA[kk][0].s2,rB[kk][0].s3,rC[2][3]);
                rC[3][3]=fma(rA[kk][0].s3,rB[kk][0].s3,rC[3][3]);
                rC[4][3]=fma(rA[kk][1].s0,rB[kk][0].s3,rC[4][3]);
                rC[5][3]=fma(rA[kk][1].s1,rB[kk][0].s3,rC[5][3]);
                rC[6][3]=fma(rA[kk][1].s2,rB[kk][0].s3,rC[6][3]);
                rC[7][3]=fma(rA[kk][1].s3,rB[kk][0].s3,rC[7][3]);
                rC[0][4]=fma(rA[kk][0].s0,rB[kk][1].s0,rC[0][4]);
                rC[1][4]=fma(rA[kk][0].s1,rB[kk][1].s0,rC[1][4]);
                rC[2][4]=fma(rA[kk][0].s2,rB[kk][1].s0,rC[2][4]);
                rC[3][4]=fma(rA[kk][0].s3,rB[kk][1].s0,rC[3][4]);
                rC[4][4]=fma(rA[kk][1].s0,rB[kk][1].s0,rC[4][4]);
                rC[5][4]=fma(rA[kk][1].s1,rB[kk][1].s0,rC[5][4]);
                rC[6][4]=fma(rA[kk][1].s2,rB[kk][1].s0,rC[6][4]);
                rC[7][4]=fma(rA[kk][1].s3,rB[kk][1].s0,rC[7][4]);
                rC[0][5]=fma(rA[kk][0].s0,rB[kk][1].s1,rC[0][5]);
                rC[1][5]=fma(rA[kk][0].s1,rB[kk][1].s1,rC[1][5]);
                rC[2][5]=fma(rA[kk][0].s2,rB[kk][1].s1,rC[2][5]);
                rC[3][5]=fma(rA[kk][0].s3,rB[kk][1].s1,rC[3][5]);
                rC[4][5]=fma(rA[kk][1].s0,rB[kk][1].s1,rC[4][5]);
                rC[5][5]=fma(rA[kk][1].s1,rB[kk][1].s1,rC[5][5]);
                rC[6][5]=fma(rA[kk][1].s2,rB[kk][1].s1,rC[6][5]);
                rC[7][5]=fma(rA[kk][1].s3,rB[kk][1].s1,rC[7][5]);
                rC[0][6]=fma(rA[kk][0].s0,rB[kk][1].s2,rC[0][6]);
                rC[1][6]=fma(rA[kk][0].s1,rB[kk][1].s2,rC[1][6]);
                rC[2][6]=fma(rA[kk][0].s2,rB[kk][1].s2,rC[2][6]);
                rC[3][6]=fma(rA[kk][0].s3,rB[kk][1].s2,rC[3][6]);
                rC[4][6]=fma(rA[kk][1].s0,rB[kk][1].s2,rC[4][6]);
                rC[5][6]=fma(rA[kk][1].s1,rB[kk][1].s2,rC[5][6]);
                rC[6][6]=fma(rA[kk][1].s2,rB[kk][1].s2,rC[6][6]);
                rC[7][6]=fma(rA[kk][1].s3,rB[kk][1].s2,rC[7][6]);
                rC[0][7]=fma(rA[kk][0].s0,rB[kk][1].s3,rC[0][7]);
                rC[1][7]=fma(rA[kk][0].s1,rB[kk][1].s3,rC[1][7]);
                rC[2][7]=fma(rA[kk][0].s2,rB[kk][1].s3,rC[2][7]);
                rC[3][7]=fma(rA[kk][0].s3,rB[kk][1].s3,rC[3][7]);
                rC[4][7]=fma(rA[kk][1].s0,rB[kk][1].s3,rC[4][7]);
                rC[5][7]=fma(rA[kk][1].s1,rB[kk][1].s3,rC[5][7]);
                rC[6][7]=fma(rA[kk][1].s2,rB[kk][1].s3,rC[6][7]);
                rC[7][7]=fma(rA[kk][1].s3,rB[kk][1].s3,rC[7][7]);
            }
        }
    }
    obj0_pointer += gidx*64*obj0_ld;
    obj0_pointer += idx*4*obj0_ld;
    obj0_pointer += gidy*32*obj0_stride2;
    obj0_pointer += idy*8*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][0]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][0]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][0]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][0]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][0]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][0]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][0]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][0]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][1]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][1]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][1]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][1]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][1]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][1]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][1]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][1]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][2]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][2]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][2]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][2]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][2]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][2]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][2]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][2]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][3]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][3]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][3]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][3]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][3]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][3]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][3]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][3]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][4]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][4]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][4]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][4]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][4]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][4]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][4]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][4]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][5]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][5]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][5]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][5]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][5]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][5]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][5]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][5]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][6]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][6]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][6]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][6]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][6]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][6]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][6]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][6]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][7]*obj5+ obj0_pointer[0*obj0_ld]*obj6;
    obj0_pointer[1*obj0_ld] = rC[1][7]*obj5+ obj0_pointer[1*obj0_ld]*obj6;
    obj0_pointer[2*obj0_ld] = rC[2][7]*obj5+ obj0_pointer[2*obj0_ld]*obj6;
    obj0_pointer[3*obj0_ld] = rC[3][7]*obj5+ obj0_pointer[3*obj0_ld]*obj6;
    obj0_pointer[32*obj0_ld] = rC[4][7]*obj5+ obj0_pointer[32*obj0_ld]*obj6;
    obj0_pointer[33*obj0_ld] = rC[5][7]*obj5+ obj0_pointer[33*obj0_ld]*obj6;
    obj0_pointer[34*obj0_ld] = rC[6][7]*obj5+ obj0_pointer[34*obj0_ld]*obj6;
    obj0_pointer[35*obj0_ld] = rC[7][7]*obj5+ obj0_pointer[35*obj0_ld]*obj6;
    obj0_pointer += obj0_stride2;
}
 __attribute__((reqd_work_group_size(8,16,1)))
__kernel void _prod_TT(unsigned int M, unsigned int N, unsigned int K, __global float* obj0_pointer,unsigned int obj0_ld,unsigned int obj0_start2,unsigned int obj0_start1,unsigned int obj0_stride2,unsigned int obj0_stride1,__global float* obj1_pointer,unsigned int obj1_ld,unsigned int obj1_start1,unsigned int obj1_start2,unsigned int obj1_stride1,unsigned int obj1_stride2,__global float* obj4_pointer,unsigned int obj4_ld,unsigned int obj4_start1,unsigned int obj4_start2,unsigned int obj4_stride1,unsigned int obj4_stride2,float obj6,float obj7)
{
    obj1_start1 /= 1;
    obj1_ld /= 1;
    obj4_start1/= 1;
    obj4_ld /= 1;
    obj0_pointer += (obj0_start1) * obj0_ld + ( obj0_start2);
    obj1_pointer += (obj1_start1) +  ( obj1_start2) * obj1_ld;
    obj4_pointer += (obj4_start1) +  ( obj4_start2) * obj4_ld;
    obj0_ld *= obj0_stride1;
    obj1_ld *= obj1_stride2;
    obj4_ld *= obj4_stride2;
    float rC[4][4] = {{(float)0}};
    float rA[8][4];
    float rB[8][4];
    __local float lA[1056];
    size_t gidx = get_group_id(0);
    size_t gidy = get_group_id(1);
    size_t idx = get_local_id(0);
    size_t idy = get_local_id(1);
    
    size_t idt = 8*idy + idx;
    size_t idxT = idt % 8;
    size_t idyT = idt / 8;
    
    obj1_pointer += idxT + gidx*32*obj1_ld + idyT*obj1_ld;
    obj4_pointer += (gidy*64+ idy);
    
    size_t K_size_t = K;
    for(size_t block_k=0; block_k < K_size_t; block_k+=32){
        __local float* plA = lA + idxT*33 + idyT;
        barrier(CLK_LOCAL_MEM_FENCE);
        (plA + 0)[0] = obj1_pointer[0*obj1_ld + 0];
        (plA + 264)[0] = obj1_pointer[0*obj1_ld + 8];
        (plA + 528)[0] = obj1_pointer[0*obj1_ld + 16];
        (plA + 792)[0] = obj1_pointer[0*obj1_ld + 24];
        (plA + 16)[0] = obj1_pointer[16*obj1_ld + 0];
        (plA + 280)[0] = obj1_pointer[16*obj1_ld + 8];
        (plA + 544)[0] = obj1_pointer[16*obj1_ld + 16];
        (plA + 808)[0] = obj1_pointer[16*obj1_ld + 24];
        barrier(CLK_LOCAL_MEM_FENCE);
        size_t offA = 1*idx;
        size_t offB = 1*idy;
        for(size_t k = 0; k < 32; k+=8){
            #pragma unroll 8
            for(size_t kk = 0; kk < 8; kk++)
            #pragma unroll 4
            for(size_t mm = 0; mm < 4; mm++)
            {
                rA[kk][mm*1+0] = lA[offA + mm*8+0+ kk*33];
            }
            #pragma unroll 8
            for(size_t kk = 0; kk < 8; kk++)
            #pragma unroll 4
            for(size_t nn = 0; nn < 4; nn++)
            {
                rB[kk][nn] = obj4_pointer[kk*obj4_ld + nn*16];
            }
            offA += 264;
            obj4_pointer += 8*obj4_ld;
            #pragma unroll 8
            for(size_t kk = 0; kk <8; ++kk)
            {
                rC[0][0]=fma(rA[kk][0],rB[kk][0],rC[0][0]);
                rC[1][0]=fma(rA[kk][1],rB[kk][0],rC[1][0]);
                rC[2][0]=fma(rA[kk][2],rB[kk][0],rC[2][0]);
                rC[3][0]=fma(rA[kk][3],rB[kk][0],rC[3][0]);
                rC[0][1]=fma(rA[kk][0],rB[kk][1],rC[0][1]);
                rC[1][1]=fma(rA[kk][1],rB[kk][1],rC[1][1]);
                rC[2][1]=fma(rA[kk][2],rB[kk][1],rC[2][1]);
                rC[3][1]=fma(rA[kk][3],rB[kk][1],rC[3][1]);
                rC[0][2]=fma(rA[kk][0],rB[kk][2],rC[0][2]);
                rC[1][2]=fma(rA[kk][1],rB[kk][2],rC[1][2]);
                rC[2][2]=fma(rA[kk][2],rB[kk][2],rC[2][2]);
                rC[3][2]=fma(rA[kk][3],rB[kk][2],rC[3][2]);
                rC[0][3]=fma(rA[kk][0],rB[kk][3],rC[0][3]);
                rC[1][3]=fma(rA[kk][1],rB[kk][3],rC[1][3]);
                rC[2][3]=fma(rA[kk][2],rB[kk][3],rC[2][3]);
                rC[3][3]=fma(rA[kk][3],rB[kk][3],rC[3][3]);
            }
        }
        obj1_pointer += 32;
    }
    obj0_pointer += gidx*32*obj0_ld;
    obj0_pointer += idx*1*obj0_ld;
    obj0_pointer += gidy*64*obj0_stride2;
    obj0_pointer += idy*1*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][0]*obj6+ obj0_pointer[0*obj0_ld]*obj7;
    obj0_pointer[8*obj0_ld] = rC[1][0]*obj6+ obj0_pointer[8*obj0_ld]*obj7;
    obj0_pointer[16*obj0_ld] = rC[2][0]*obj6+ obj0_pointer[16*obj0_ld]*obj7;
    obj0_pointer[24*obj0_ld] = rC[3][0]*obj6+ obj0_pointer[24*obj0_ld]*obj7;
    obj0_pointer += 16*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][1]*obj6+ obj0_pointer[0*obj0_ld]*obj7;
    obj0_pointer[8*obj0_ld] = rC[1][1]*obj6+ obj0_pointer[8*obj0_ld]*obj7;
    obj0_pointer[16*obj0_ld] = rC[2][1]*obj6+ obj0_pointer[16*obj0_ld]*obj7;
    obj0_pointer[24*obj0_ld] = rC[3][1]*obj6+ obj0_pointer[24*obj0_ld]*obj7;
    obj0_pointer += 16*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][2]*obj6+ obj0_pointer[0*obj0_ld]*obj7;
    obj0_pointer[8*obj0_ld] = rC[1][2]*obj6+ obj0_pointer[8*obj0_ld]*obj7;
    obj0_pointer[16*obj0_ld] = rC[2][2]*obj6+ obj0_pointer[16*obj0_ld]*obj7;
    obj0_pointer[24*obj0_ld] = rC[3][2]*obj6+ obj0_pointer[24*obj0_ld]*obj7;
    obj0_pointer += 16*obj0_stride2;
    obj0_pointer[0*obj0_ld] = rC[0][3]*obj6+ obj0_pointer[0*obj0_ld]*obj7;
    obj0_pointer[8*obj0_ld] = rC[1][3]*obj6+ obj0_pointer[8*obj0_ld]*obj7;
    obj0_pointer[16*obj0_ld] = rC[2][3]*obj6+ obj0_pointer[16*obj0_ld]*obj7;
    obj0_pointer[24*obj0_ld] = rC[3][3]*obj6+ obj0_pointer[24*obj0_ld]*obj7;
    obj0_pointer += 16*obj0_stride2;
}

Error in cpp_gpuMatrix_eucl(A@address, D@address, squareDist, 6L) : 
  ViennaCL: FATAL ERROR: CL_INVALID_PROGRAM_EXECUTABLE.
If you think that this is a bug in ViennaCL, please report it at viennacl-support@lists.sourceforge.net and supply at least the following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!

@cdeterman
Copy link
Owner

@TiMaG I'm sorry to have let this issue just hang. Has this been a problem with any of the other gpuMatrix functions or just with dist? From my research that -9999 error is a special error from NVIDIA that pertains to an illegal read/write. I have added a few more print statements to the internal euclidean distance call. If you have the time, please reinstall the develop version and try once more. Again, sorry for leaving this issue hanging.

@cdeterman cdeterman added this to the 2.1.0 milestone Sep 28, 2018
@cdeterman
Copy link
Owner

@dselivanov have you encountered this error/bug before with any of the NVIDIA cards you have used?

@dselivanov
Copy link

dselivanov commented Oct 1, 2018 via email

@cdeterman
Copy link
Owner

@dselivanov fair enough, just thought would ask as you are the most active NVIDIA user that I am aware.

cdeterman added a commit that referenced this issue Oct 4, 2018
…to be working normally locally on Intel GPU.
@cdeterman
Copy link
Owner

I did discover a general issue whenever the dimensions exceeded 128 (the internal padded sizes). But I believe I have fixed the problem now. Hopefully this latest commit addresses the original issue provided by @TiMaG.

@Toniiiio
Copy link
Author

Toniiiio commented Oct 5, 2018

Hi @cdeterman , sry for the late response.
I installed the latest dev version of the package and it works for 15000 X 100.

ndim <- 15000
AA <- gpuMatrix(rnorm(ndim*100), nrow = ndim, ncol = 100)
BB <- dist(AA)

I also tried, 50000. It failed then, but i think that is more due to the hardware limit (i took the p2.xlarge instance on aws).

@cdeterman
Copy link
Owner

@TiMaG glad to hear it is working now. Yes, I suspect it would be a hardware limit. I will close this issue now unless another issue arises. Sorry it took so long to resolve.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants