### **Intrinsic Functions**
- Specialized functions provided by the CUDA programming model. They are callable only from device code, and you do not need to include any additional headers to use them.
- They often offer a faster alternative to a standard function at the cost of some numerical accuracy. They are mostly used for mathematical operations.
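As a minimal sketch of this trade-off, the program below evaluates the sine of a few inputs with both the standard `sinf()` function and the `__sinf()` intrinsic so that the two results can be compared. The kernel name `compareSine`, the input values, and the launch configuration are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: computes sin(x) with the standard function and with the intrinsic.
__global__ void compareSine(const float *x, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // standard single-precision function
        fast[i]     = __sinf(x[i]);  // device-only intrinsic: faster, less accurate
    }
}

int main() {
    const int n = 4;
    float hX[n] = {0.1f, 1.0f, 10.0f, 100.0f};  // arbitrary sample inputs
    float hAccurate[n], hFast[n];

    float *dX, *dAccurate, *dFast;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dAccurate, n * sizeof(float));
    cudaMalloc(&dFast, n * sizeof(float));
    cudaMemcpy(dX, hX, n * sizeof(float), cudaMemcpyHostToDevice);

    compareSine<<<1, n>>>(dX, dAccurate, dFast, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hAccurate, dAccurate, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hFast, dFast, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("x = %g  sinf = %.8f  __sinf = %.8f  diff = %g\n",
               hX[i], hAccurate[i], hFast[i], hAccurate[i] - hFast[i]);

    cudaFree(dX);
    cudaFree(dAccurate);
    cudaFree(dFast);
    return 0;
}
```

The difference between the two columns is expected to grow for larger inputs. Note that compiling with `nvcc -use_fast_math` makes the compiler substitute the intrinsic for `sinf()`, in which case both columns would match.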


### **Thread Synchronization**

Threads **within a block** can cooperate by sharing data through shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, you can specify synchronization points in a kernel by calling the __syncthreads() **intrinsic** function. __syncthreads() acts as a barrier: every thread in the block must reach it before any thread is allowed to proceed.

In [None]:
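// __syncthreads() example

// Declares a shared memory array that is accessible by all threads in the same block. More on this later.
__shared__ float partialSum[SIZE];
unsigned int t = threadIdx.x;
// Each thread loads one element of X into shared memory.
partialSum[t] = X[blockIdx.x * blockDim.x + t];
// Tree reduction; assumes blockDim.x is a power of two and blockDim.x <= SIZE.
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
     __syncthreads();  // wait until the previous iteration's partial sums are written
     if (t % (2 * stride) == 0)
          partialSum[t] += partialSum[t + stride];
}
// The block's sum is now in partialSum[0].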

The __syncthreads() call in the for-loop ensures that all partial sums of the previous iteration have been written before any thread is allowed to begin the current iteration. In the first iteration, it ensures that every thread has finished loading its element into shared memory.
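The fragment above can be placed in a complete program along the following lines. This is only a sketch under stated assumptions: the kernel name `blockSum`, the value `SIZE = 256`, the all-ones test input, and the final summation of the per-block results on the host are illustrative choices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 256  // threads per block; assumed to be a power of two

// Each block reduces SIZE consecutive elements of X into blockSums[blockIdx.x].
__global__ void blockSum(const float *X, float *blockSums) {
    __shared__ float partialSum[SIZE];
    unsigned int t = threadIdx.x;
    partialSum[t] = X[blockIdx.x * blockDim.x + t];

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();  // previous iteration's writes must be visible
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
    // Thread 0 performed the last addition, so it can write the result directly.
    if (t == 0)
        blockSums[blockIdx.x] = partialSum[0];
}

int main() {
    const int numBlocks = 4;
    const int n = numBlocks * SIZE;  // input length is a multiple of SIZE
    float hX[n], hSums[numBlocks];
    for (int i = 0; i < n; ++i) hX[i] = 1.0f;  // test input: all ones

    float *dX, *dSums;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dSums, numBlocks * sizeof(float));
    cudaMemcpy(dX, hX, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, SIZE>>>(dX, dSums);

    cudaMemcpy(hSums, dSums, numBlocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Combine the per-block sums on the host.
    float total = 0.0f;
    for (int b = 0; b < numBlocks; ++b) {
        printf("block %d sum = %.1f\n", b, hSums[b]);
        total += hSums[b];
    }
    printf("total = %.1f\n", total);

    cudaFree(dX);
    cudaFree(dSums);
    return 0;
}
```

With the all-ones input, each block's sum should equal `SIZE`, which gives a quick check that the barrier placement is correct.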

#### **Thread Divergence**

In [None]:
// Consider this example

// Does this code work properly? Why?
if (condition) {
     ...
     __syncthreads();
} else {
     ...
     __syncthreads();
}

If one thread in a block executes the then-path and another executes the else-path, they wait at different barrier synchronization points, and each ends up waiting forever for threads that never arrive. So if a __syncthreads() call is executed by any thread in a block, that same call must be reached by every thread in the block. The code above can be fixed as follows:

In [None]:
if (condition) {
     ...
} else {
     ...
}
__syncthreads();
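A small runnable sketch of this pattern is shown below. Even and odd threads take different branches, all threads then meet at a single barrier, and only afterwards does each thread read a value written by its neighbour. The kernel name `evenOddThenSwap`, the block size `N`, and the values written are hypothetical choices made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 8  // hypothetical block size

__global__ void evenOddThenSwap(int *out) {
    __shared__ int buf[N];
    int t = threadIdx.x;

    // Divergent work: no barrier inside either branch.
    if (t % 2 == 0) {
        buf[t] = 10 * t;
    } else {
        buf[t] = -t;
    }

    // One barrier that every thread in the block reaches.
    __syncthreads();

    // Now it is safe to read a value written by another thread.
    out[t] = buf[(t + 1) % N];
}

int main() {
    int hOut[N];
    int *dOut;
    cudaMalloc(&dOut, N * sizeof(int));

    evenOddThenSwap<<<1, N>>>(dOut);

    cudaMemcpy(hOut, dOut, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i)
        printf("out[%d] = %d\n", i, hOut[i]);

    cudaFree(dOut);
    return 0;
}
```

Moving the barrier into both branches of this kernel would reproduce the problem described above, because even and odd threads of the same block would then wait at different __syncthreads() calls.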

### **Requirement 3**

A) Write a CUDA program that finds the sum of an input array as efficiently as possible. The program reads the array elements from an external file (around 10 million floating-point numbers) and prints their sum to the console. It **prints nothing more**. Use only 1 block for your kernel.

B) Write a CUDA program that carries out binary search on an input array. As in part A, you will read the input array from a file; the target element is passed as a command-line argument. You should use only 1 block and carry out the search efficiently. Print **ONLY** the index of the target number, or -1 if it is not found.

**Check** the samples on the e-learning course page.