Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcast using shmem #1

Closed
FSM-GIT opened this issue Dec 28, 2016 · 6 comments
Closed

Broadcast using shmem #1

FSM-GIT opened this issue Dec 28, 2016 · 6 comments

Comments

@FSM-GIT
Copy link

FSM-GIT commented Dec 28, 2016

Hi there
Once I intend to broadcast a variable in attached program called Td0 which is an unsigned int array containing 256 elements, I have faced a problem in the output (after broadcasting ). The last 8 elements of mentioned array did copy wrongly, the only solution which I found is to pass the number of elements more than 256 ( actual size) to 256+8!! I can not understand what is happened.
Kindly find the enclosed.
aesd.c.txt
aesd_epi_helper.c.txt
aesd_epi_helper.h.txt
device.c.txt
Makefile.txt

@jamesaross
Copy link
Collaborator

jamesaross commented Dec 28, 2016

I am not able to replicate this with a simple test case similar to what you provided (below). The shmem_broadcast32 operations appear to work as designed. I also see two issues with your code:

  • Your code invokes a second DMA operation (on channel 1) before the blocking for the first DMA on that channel completes. This issue is outside of the scope of SHMEM and is part of the CORPTHR2 SDK. It may be related to your issue.
  • The shmem_broadcast32 routine is a 32-bit operation and your code calls for 256 elements within Td4s, which is an 8-bit array. You should divide that N=256 value by 4 (see below).
#include <shmem.h>

long pSyncA[SHMEM_BCAST_SYNC_SIZE] = { SHMEM_SYNC_VALUE };

typedef unsigned int u32;
typedef unsigned char u8; 

int main( void )
{
   shmem_init();

   const int N = 256;
   const int M = 44; 
   int id = shmem_my_pe();
   int npes = shmem_n_pes();

   u32* rk   = (u32*)shmem_malloc(M * sizeof(u32));
   u32* Td0  = (u32*)shmem_malloc(N * sizeof(u32));
   u8*  Td4s =  (u8*)shmem_malloc(N * sizeof(u8));

   for (int i = 0; i < M; i++) {
      if (id == 0) rk[i] = i + 1;
   }   

   for (int i = 0; i < N; i++) {
      if (id == 0) Td0[i] = i;
      else Td0[i] = 0;
      if (id == 0) Td4s[i] = (u8)i;
      else Td4s[i] = 0;
   }   

   shmem_barrier_all();

   shmem_broadcast32 (rk, rk, M, 0, 0, 0, npes, pSyncA);
   shmem_broadcast32 (Td4s, Td4s, N/4, 0, 0, 0, npes, pSyncA); // 32-bit broadcasting 8-bit values (divide by 4)
   shmem_broadcast32 (Td0, Td0, N, 0, 0, 0, npes, pSyncA);

   int err = 0;
   for (int i = 0; i < M; i++) if (rk[i] != (i+1)) err++;
   for (int i = 0; i < N; i++) if (Td0[i] != i) err++;
   for (int i = 0; i < N; i++) if (Td4s[i] != i) err++;
   if (err) printf("# %d: ERROR: %d values were not broadcast correctly\n", id, err);
}

Compile with:

coprcc -fhost -std=c99 -I. -I/usr/local/browndeer/coprthr2/include -I../src/ -L/usr/local/browndeer/coprthr2/lib -lcoprthr2_dev -lcoprthr_hostcall -lesyscall -le-lib -L$PATH_TO_SHMEM_LIBRARY -lshmem test.c -o test.x

Run with:

coprsh -np 16 ./test.x

@FSM-GIT
Copy link
Author

FSM-GIT commented Jan 2, 2017

Hi,
I have checked your concerns about my previous example. Below you can find another example (AES encryption) which faced the same problem. I would like to broadcast Te0 and rk variables. rk is broadcast properly. Again I have the problem with Te0 which I need increase the elements from 256 to 256+8.
Kindly find the enclosed.

Thanks

aese.c.txt
aese_epi_helper.c.txt
aese_epi_helper.h.txt
device.c.txt
Makefile.txt

@jamesaross
Copy link
Collaborator

@FSM-GIT
Could you send your your ../shmem.mk file? I will look into this again after I can compile your code.

For your information (this may not be your problem):
The Epiphany DMA engine is peculiar in that the DMA engine can be idle before the data has arrived in the core. I happen to know that the coprthr_wait(COPRTHR2_E_DMA_0) code is spinning on the DMA status register to return for completion. There are actually two completion modes when DMA'ing data

  1. When the DMA engine is done working
  2. When the data actually arrives in local memory

The COPRTHR-2 API, presently, doesn't provide a mechanism to identify and wait for the data to arrive on the core. There is a method to do this in software:

  1. copy the last byte of data in the remote array via dereferencing the memory address
  2. invert its value
  3. write that inverted value to the last byte in the local array and keep a temporary copy
  4. initialize the DMA operation
  5. wait for the DMA engine to be done working
  6. spin-wait until the last byte in the local array is different from your temporary copy (or equal to the non-inverted value)

I have seen the error I described with small arrays being incompletely copied after a DMA and immediately used.

@jamesaross jamesaross reopened this Jan 3, 2017
@FSM-GIT
Copy link
Author

FSM-GIT commented Jan 4, 2017

Thank you for your prompt reply.
Please find the enclosed.

shmem.mk.txt

@jamesaross
Copy link
Collaborator

jamesaross commented Jan 5, 2017

@FSM-GIT
Thanks for getting me your build details.

I was able to reproduce your error this time. There appears to be two issues but they are both easy to fix.

Issue 1:
I will push out a permanent fix to the repository later, but the immediate fix is to comment out line 33 and uncomment line 34 in ./src/shmem_memcpy.c

In short, the compiler does unsafe things when the __attribute__((hot)) function attribute is used and it really shouldn't have much of a performance impact for this routine because it's mixed with inline assembly (a recipe for problems).

You will have to rebuild the library after you've made the change.

Issue 2:
The DMA problem I described previously is a real issue with your code. The code below should fix your problem. You must follow this recipe every time you use the DMA engine or you cannot guarantee the data has arrived. Do this similarly in your aesd device code.

   if (id == 0)
   {
      volatile u32* plast = Te0 + 255;
      u32 inv_last = ~temp_Te0[255];
      *plast = inv_last;
      coprthr_memcopy_align(Te0, temp_Te0, 256 * sizeof(u32), COPRTHR2_M_DMA_0);
      coprthr_wait(COPRTHR2_E_DMA_0);
      while (*plast == inv_last);
      plast = rk + 43;
      inv_last = ~temp_rk[43];
      *plast = inv_last;
      coprthr_memcopy_align(rk, temp_rk, 44 * sizeof(u32), COPRTHR2_M_DMA_0);
      coprthr_wait(COPRTHR2_E_DMA_0);
      while (*plast == inv_last);
   }

Alternatively, you can do without the DMA copy and manually copy arrays with a loop. It's not much data.

You should be able to remove the shmem_barrier_all() call immediately following the DMA. It had the side effect of delaying the shmem_broadcast32() long enough so that the data arrived. But since this code guarantees the data has arrived, you no longer need to delay execution.

Other comments:
You may use shmem_my_pe() instead of coprthr_get_thread_id(), but I think they should both return the same value. You can also use shmem_n_pes() to get the total number of threads. There is also no need to use coprthr_tls_sbrk() and coprthr_tls_brk() when using the shmem memory routines. Also, the shmem_barrier_all() is presently faster than the coprthr_barrier() routine and using just one should reduce code size.

@FSM-GIT
Copy link
Author

FSM-GIT commented Jan 6, 2017

Thank you so much for your attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants