Broadcast using shmem #1

FSM-GIT · 2016-12-28T05:21:27Z

Hi there
Once I intend to broadcast a variable in attached program called Td0 which is an unsigned int array containing 256 elements, I have faced a problem in the output (after broadcasting ). The last 8 elements of mentioned array did copy wrongly, the only solution which I found is to pass the number of elements more than 256 ( actual size) to 256+8!! I can not understand what is happened.
Kindly find the enclosed.
aesd.c.txt
aesd_epi_helper.c.txt
aesd_epi_helper.h.txt
device.c.txt
Makefile.txt

jamesaross · 2016-12-28T20:00:09Z

I am not able to replicate this with a simple test case similar to what you provided (below). The shmem_broadcast32 operations appear to work as designed. I also see two issues with your code:

Your code invokes a second DMA operation (on channel 1) before the blocking for the first DMA on that channel completes. This issue is outside of the scope of SHMEM and is part of the CORPTHR2 SDK. It may be related to your issue.
The shmem_broadcast32 routine is a 32-bit operation and your code calls for 256 elements within Td4s, which is an 8-bit array. You should divide that N=256 value by 4 (see below).

#include <shmem.h>

long pSyncA[SHMEM_BCAST_SYNC_SIZE] = { SHMEM_SYNC_VALUE };

typedef unsigned int u32;
typedef unsigned char u8; 

int main( void )
{
   shmem_init();

   const int N = 256;
   const int M = 44; 
   int id = shmem_my_pe();
   int npes = shmem_n_pes();

   u32* rk   = (u32*)shmem_malloc(M * sizeof(u32));
   u32* Td0  = (u32*)shmem_malloc(N * sizeof(u32));
   u8*  Td4s =  (u8*)shmem_malloc(N * sizeof(u8));

   for (int i = 0; i < M; i++) {
      if (id == 0) rk[i] = i + 1;
   }   

   for (int i = 0; i < N; i++) {
      if (id == 0) Td0[i] = i;
      else Td0[i] = 0;
      if (id == 0) Td4s[i] = (u8)i;
      else Td4s[i] = 0;
   }   

   shmem_barrier_all();

   shmem_broadcast32 (rk, rk, M, 0, 0, 0, npes, pSyncA);
   shmem_broadcast32 (Td4s, Td4s, N/4, 0, 0, 0, npes, pSyncA); // 32-bit broadcasting 8-bit values (divide by 4)
   shmem_broadcast32 (Td0, Td0, N, 0, 0, 0, npes, pSyncA);

   int err = 0;
   for (int i = 0; i < M; i++) if (rk[i] != (i+1)) err++;
   for (int i = 0; i < N; i++) if (Td0[i] != i) err++;
   for (int i = 0; i < N; i++) if (Td4s[i] != i) err++;
   if (err) printf("# %d: ERROR: %d values were not broadcast correctly\n", id, err);
}

Compile with:

coprcc -fhost -std=c99 -I. -I/usr/local/browndeer/coprthr2/include -I../src/ -L/usr/local/browndeer/coprthr2/lib -lcoprthr2_dev -lcoprthr_hostcall -lesyscall -le-lib -L$PATH_TO_SHMEM_LIBRARY -lshmem test.c -o test.x

Run with:

coprsh -np 16 ./test.x

FSM-GIT · 2017-01-02T07:00:25Z

Hi,
I have checked your concerns about my previous example. Below you can find another example (AES encryption) which faced the same problem. I would like to broadcast Te0 and rk variables. rk is broadcast properly. Again I have the problem with Te0 which I need increase the elements from 256 to 256+8.
Kindly find the enclosed.

Thanks

aese.c.txt
aese_epi_helper.c.txt
aese_epi_helper.h.txt
device.c.txt
Makefile.txt

jamesaross · 2017-01-03T20:35:18Z

@FSM-GIT
Could you send your your ../shmem.mk file? I will look into this again after I can compile your code.

For your information (this may not be your problem):
The Epiphany DMA engine is peculiar in that the DMA engine can be idle before the data has arrived in the core. I happen to know that the coprthr_wait(COPRTHR2_E_DMA_0) code is spinning on the DMA status register to return for completion. There are actually two completion modes when DMA'ing data

When the DMA engine is done working
When the data actually arrives in local memory

The COPRTHR-2 API, presently, doesn't provide a mechanism to identify and wait for the data to arrive on the core. There is a method to do this in software:

copy the last byte of data in the remote array via dereferencing the memory address
invert its value
write that inverted value to the last byte in the local array and keep a temporary copy
initialize the DMA operation
wait for the DMA engine to be done working
spin-wait until the last byte in the local array is different from your temporary copy (or equal to the non-inverted value)

I have seen the error I described with small arrays being incompletely copied after a DMA and immediately used.

FSM-GIT · 2017-01-04T06:14:51Z

Thank you for your prompt reply.
Please find the enclosed.

shmem.mk.txt

jamesaross · 2017-01-05T17:22:57Z

@FSM-GIT
Thanks for getting me your build details.

I was able to reproduce your error this time. There appears to be two issues but they are both easy to fix.

Issue 1:
I will push out a permanent fix to the repository later, but the immediate fix is to comment out line 33 and uncomment line 34 in ./src/shmem_memcpy.c

In short, the compiler does unsafe things when the __attribute__((hot)) function attribute is used and it really shouldn't have much of a performance impact for this routine because it's mixed with inline assembly (a recipe for problems).

You will have to rebuild the library after you've made the change.

Issue 2:
The DMA problem I described previously is a real issue with your code. The code below should fix your problem. You must follow this recipe every time you use the DMA engine or you cannot guarantee the data has arrived. Do this similarly in your aesd device code.

   if (id == 0)
   {
      volatile u32* plast = Te0 + 255;
      u32 inv_last = ~temp_Te0[255];
      *plast = inv_last;
      coprthr_memcopy_align(Te0, temp_Te0, 256 * sizeof(u32), COPRTHR2_M_DMA_0);
      coprthr_wait(COPRTHR2_E_DMA_0);
      while (*plast == inv_last);
      plast = rk + 43;
      inv_last = ~temp_rk[43];
      *plast = inv_last;
      coprthr_memcopy_align(rk, temp_rk, 44 * sizeof(u32), COPRTHR2_M_DMA_0);
      coprthr_wait(COPRTHR2_E_DMA_0);
      while (*plast == inv_last);
   }

Alternatively, you can do without the DMA copy and manually copy arrays with a loop. It's not much data.

You should be able to remove the shmem_barrier_all() call immediately following the DMA. It had the side effect of delaying the shmem_broadcast32() long enough so that the data arrived. But since this code guarantees the data has arrived, you no longer need to delay execution.

Other comments:
You may use shmem_my_pe() instead of coprthr_get_thread_id(), but I think they should both return the same value. You can also use shmem_n_pes() to get the total number of threads. There is also no need to use coprthr_tls_sbrk() and coprthr_tls_brk() when using the shmem memory routines. Also, the shmem_barrier_all() is presently faster than the coprthr_barrier() routine and using just one should reduce code size.

FSM-GIT · 2017-01-06T04:47:01Z

Thank you so much for your attention.

jamesaross closed this as completed Dec 28, 2016

jamesaross reopened this Jan 3, 2017

jamesaross closed this as completed Jan 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broadcast using shmem #1

Broadcast using shmem #1

FSM-GIT commented Dec 28, 2016 •

edited

jamesaross commented Dec 28, 2016 •

edited

FSM-GIT commented Jan 2, 2017

jamesaross commented Jan 3, 2017

FSM-GIT commented Jan 4, 2017

jamesaross commented Jan 5, 2017 •

edited

FSM-GIT commented Jan 6, 2017

Broadcast using shmem #1

Broadcast using shmem #1

Comments

FSM-GIT commented Dec 28, 2016 • edited

jamesaross commented Dec 28, 2016 • edited

FSM-GIT commented Jan 2, 2017

jamesaross commented Jan 3, 2017

FSM-GIT commented Jan 4, 2017

jamesaross commented Jan 5, 2017 • edited

FSM-GIT commented Jan 6, 2017

FSM-GIT commented Dec 28, 2016 •

edited

jamesaross commented Dec 28, 2016 •

edited

jamesaross commented Jan 5, 2017 •

edited