# Shared Memory Programming

## Authors of homework:
+ Aleksandra Jamróz
+ Mireia Alba Kesti Izquierdo


We want to share memory between processes.
We will use two kinds of access:

* _Read Only_ - processes only read the shared data.
* _Read and Write_ - in this case, we can access the global variable from our distributed code and modify it there.

In [None]:
import multiprocessing as mp
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import time
from multiprocessing.sharedctypes import Value, Array, RawArray
from multiprocessing import Process, Lock
import ctypes

In [None]:
import importlib

# Shared Memory
Python passes arguments by object reference (often described as "call by sharing"): the function receives a reference to the same object that the caller holds, not a copy of it.

That is why, if we assign a new value to a parameter inside the invoked function (for example, an integer or a string), the change is not reflected outside the function.

For mutable objects and structures, however, a change made in place through that reference modifies the caller's object. That is why NumPy arrays reflect the changes made inside them.

We can take advantage of this behavior, but at the same time we need to be careful about how we handle the data.
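
A minimal sketch of the difference described above: reassigning a parameter inside a function is not visible to the caller, while modifying an object in place is. The function names `rebind` and `mutate` are only illustrative, and a plain list is used to keep the example dependency-free; in-place assignment into a NumPy array behaves the same way.

```python
def rebind(x):
    # Rebinding the local name: the caller's variable is not affected.
    x = x + 1
    return x


def mutate(values):
    # In-place modification through the shared reference:
    # the caller sees the change.
    values[0] = 99


counter = 10
rebind(counter)
print(counter)   # 10

data = [1, 2, 3]
mutate(data)
print(data)      # [99, 2, 3]
```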

# Race Condition
A *race condition* happens when two or more processes can access shared memory and try to read and write the same memory position without any control.

Our first parallel program follows the Multiple Instructions, Multiple Data (MIMD) model. It makes deposits to and withdrawals from a shared memory value called balance, which starts with a value of 100.

Pool execution is oriented to Single Instruction, Multiple Data (SIMD) parallel processes: we define just one function, which is applied to multiple data.

To execute a Multiple Instructions, Multiple Data (MIMD) program, we need to use the *multiprocessing.Process* class.

In this case, we construct as many multiprocessing.Process objects as we need, assigning the function that each parallel process will execute and passing the values of its arguments.

After defining the Process objects, we start them with the method start(), and each one begins to run.

To synchronize the execution, we can wait until a parallel process ends using the method join().
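
As a rough illustration of the construct / start() / join() sequence, the sketch below runs two different functions in two processes. The names `add_ten`, `double` and `run_two_tasks` are hypothetical, and each process gets its own shared value so that no race condition is involved yet. On platforms that start processes with "spawn", the target functions must live in an importable module, which is presumably why this notebook imports its worker functions from separate modules.

```python
import multiprocessing as mp


def add_ten(cell):
    # First kind of task: add 10 to its own shared value.
    cell.value = cell.value + 10


def double(cell):
    # Second, different kind of task: double its own shared value.
    cell.value = cell.value * 2


def run_two_tasks():
    a = mp.Value('i', 1)
    b = mp.Value('i', 21)
    # One Process object per task, each with its own target and arguments.
    p1 = mp.Process(target=add_ten, args=(a,))
    p2 = mp.Process(target=double, args=(b,))
    p1.start()
    p2.start()
    # Wait for both processes before reading the results.
    p1.join()
    p2.join()
    return a.value, b.value


if __name__ == "__main__":
    print(run_two_tasks())
```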

In [None]:
import freerunning as fr

The function _withdraw_ is:

<code>
def withdraw(balance):

        for _ in range(10000):
                balance.value = balance.value - 1
</code>

and the function _deposit_ is:

<code>
def deposit(balance):

        for _ in range(10000):
                balance.value = balance.value + 1
</code>

In [None]:
def perform_transactions(): 
  
    # initial balance (in shared memory) 
    balance = mp.Value('i', 100)
  
    # creating new processes 
    p1 = mp.Process(target=fr.withdraw, args=(balance,)) 
    p2 = mp.Process(target=fr.deposit, args=(balance,)) 
  
    # starting processes 
    p1.start() 
    p2.start() 
  
    # wait until processes are finished 
    p1.join() 
    p2.join() 
  
    # print final balance 
    print("Final balance = {}".format(balance.value)) 

We will run the main function "perform_transactions" 10 times.
Check two things: the final value and the time used.

In [None]:
start_time=time.time()
for _ in range(10): 
    # perform same transaction process 10 times 
    perform_transactions()
print("--- %s seconds ---" % (time.time() - start_time))

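Here, the final balances are not correct! Why?

We have two different parallel processes reading and writing the same shared variable. In this case, we say that we have a race condition.

We do not control the state of the shared variable *balance*: each update reads the current value and then writes a new one, and the other process can modify the balance between those two steps, so some updates are lost. To predict the result, we would need to know the order in which the processes execute.

In any case, check the execution time.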
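
One way to see how an update can be lost is to replay a single unlucky interleaving by hand, without any real processes. The sketch below is a deterministic simulation (the function name `lost_update_demo` is hypothetical): both "processes" read the balance before either one writes it back.

```python
def lost_update_demo(start=100):
    balance = start                      # the shared value

    # Both "processes" read the shared value before either one writes.
    read_by_withdraw = balance           # withdraw reads 100
    read_by_deposit = balance            # deposit reads 100

    balance = read_by_withdraw - 1       # withdraw writes 99
    balance = read_by_deposit + 1        # deposit writes 101, overwriting 99
    return balance


# One deposit plus one withdrawal should leave 100, but the withdrawal is lost.
print(lost_update_demo())   # 101
```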

# Shared Memory with locks and semaphores

If we need to handle concurrent access to read/write memory positions, we will need to improve our program (a lot)...

* We will need to define a process and resource manager
* We will define the shared memory structures
* We will need to associate the shared memory and the resource manager with the pool of processes

We will need several new modules and objects to handle shared memory information:
* From the multiprocessing module we import the Process and Lock objects. We will use Lock objects to protect our shared memory objects
* From multiprocessing.sharedctypes we import shared memory objects like:
    * Value: a single shared memory value, like an integer or a float
    * Array: an object to handle shared memory arrays, like matrices, with locks. It is a multiprocess-safe shared memory object
    * RawArray: like an Array, but without locks. Objects of this kind are unsafe (but fast to access)
* We import ctypes to define the data types of the shared structures. The available ctypes types are listed at: https://docs.python.org/3/library/ctypes.html#module-ctypes
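
The three shared objects listed above can be created as in the following sketch. The variable names and sizes are made up for illustration; the point is that Value and Array carry a lock, reachable through get_lock(), while RawArray does not.

```python
import ctypes
from multiprocessing.sharedctypes import Value, Array, RawArray

counter = Value(ctypes.c_int, 100)                 # one shared integer, with a lock
samples = Array(ctypes.c_double, 4)                # four shared doubles, zeroed, with a lock
pixels = RawArray(ctypes.c_ubyte, [0, 128, 255])   # three shared bytes, no lock

with samples.get_lock():                           # guard the write with the array's own lock
    samples[0] = 1.5

pixels[1] = 64                                     # raw arrays are written directly

print(counter.value, samples[:], pixels[:])
print(hasattr(samples, "get_lock"), hasattr(pixels, "get_lock"))
```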

![image.png](attachment:ac8d59c1-64f3-4838-9827-e7534f9ebd7b.png)

### Modifications in our code

We will now call new functions, which are more complex but define an access control.

We will use a Value object, called balance, and an mp.Lock object called... lock.

The external lock allows us to block access to all the shared variables used inside the locked block of code:

<code>
def withdraw2(balance, lock):

        for _ in range(10000):
            lock.acquire()
            balance.value = balance.value - 1
            lock.release()
</code>
In this case, access to the shared Value variable balance is blocked, for both reading and writing, for any other process that uses the same lock.
So, once the lock is acquired, balance.value is read and decreased by one, and then the lock is released so that other processes can access the variable.

The function deposit2 works in the same way:

<code>
def deposit2(balance, lock):

        for _ in range(10000):
            lock.acquire()
            balance.value = balance.value + 1
            lock.release()
</code>

Using this access control, we avoid the race condition.
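
The acquire()/release() pair can also be written with a `with` statement, which releases the lock even if the body raises an exception. The sketch below is an equivalent formulation, not the code in the freerunning module; the names `withdraw_with` and `deposit_with` and the `times` parameter are hypothetical.

```python
import multiprocessing as mp


def withdraw_with(balance, lock, times=10000):
    for _ in range(times):
        # "with lock" acquires the lock on entry and releases it on exit.
        with lock:
            balance.value = balance.value - 1


def deposit_with(balance, lock, times=10000):
    for _ in range(times):
        with lock:
            balance.value = balance.value + 1
```

These functions take the same arguments as withdraw2 and deposit2, so they could be passed as the target of mp.Process in the same way.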

In [None]:
def perform_transactions2(): 
  
    # initial balance (in shared memory) 
    balance = mp.Value('i', 100) 
  
    # creating a lock object 
    lock = mp.Lock() 
  
    # creating new processes 
    p1 = mp.Process(target=fr.withdraw2, args = (balance,lock)) 
    p2 = mp.Process(target=fr.deposit2, args = (balance,lock)) 
  
    # starting processes 
    p1.start() 
    p2.start() 
  
    # wait until processes are finished 
    p1.join() 
    p2.join() 
  
    # print final balance 
    print("Final balance = {}".format(balance.value)) 

In [None]:
start_time = time.time()
for _ in range(10): 
    # perform same transaction process 10 times 
    perform_transactions2() 
print("--- %s seconds ---" % (time.time() - start_time))

## Distributed dot product with shared accumulator
In this first example, we will implement the distributed dot product, first without lock access and then with lock access.
We will observe the hazards that appear if we do not use a lock, and also that a misplaced lock causes performance degradation.

In [None]:
import importlib
import myfunctions3 as my

In [None]:
importlib.reload(my)

In [None]:
NUMCORES = 4
NUMDATA = 1000000
NUM_CHUNK = 64

We generate our data: two random vectors of type float, with values in [0,1), and prepare them for our dot function.

In [None]:
data_X = np.random.random(NUMDATA)
data_Y = np.random.random(NUMDATA)

This is our reference result (the correct one)

In [None]:
print(np.dot(data_X,data_Y))

In [None]:
Vdata = list(zip(data_X,data_Y))
print("Length of data: ", len(Vdata))
print("Example element from data: ", Vdata[0])

Now, we will split the zipped data into NUM_CHUNK chunks (**warning: spoiler alert**)
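
The next cell uses np.split, which only works when the data length is an exact multiple of the number of chunks (here 1,000,000 / 64 = 15,625, so it is). As a small aside, the sketch below shows what happens otherwise, and how np.array_split handles uneven sizes; the 10-element array is just a toy example.

```python
import numpy as np

values = np.arange(10)

# np.split requires the length to be an exact multiple of the chunk count.
try:
    np.split(values, 4)
except ValueError as err:
    print("np.split failed:", err)

# np.array_split accepts any chunk count; chunk sizes differ by at most one.
chunks = np.array_split(values, 4)
print([len(c) for c in chunks])   # [3, 3, 2, 2]
```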

In [None]:
sDV = np.split(np.array(list(Vdata)), NUM_CHUNK)

This first version invokes an incorrect function.
Check the result it returns.

In [None]:
def distributed_dot_1(sdV, numprocs):
    A = Value('d') #create a shared accumulator A 
    A.value = 0    #Initialize to 0
    with mp.Pool(processes = numprocs,
                 initializer = my.dot_init,
                 initargs = [A]) as p:
        accu = p.map(my.shared_dot_1, sdV)
    return A.value

In [None]:
%%time
a = distributed_dot_1(sDV, NUMCORES)
print("Result: ", a)

In [None]:
init_time = time.time()
for c in range(10):
    a = distributed_dot_1(sDV,NUMCORES)
final_time = time.time()
print("Average time over 10 samples: {0}".format((final_time-init_time)/10))

Run it several times and compare the results with the reference result.
Why is it different every time? And where are the errors?

### Second version
Now, we will invoke the function shared_dot_2

In [None]:
def distributed_dot_2(sdV,numprocs):
    A = Value('d')                  # create a shared accumulator A 
    A.value = 0                     # Initialize to 0
    with mp.Pool(processes = numprocs, 
                 initializer = my.dot_init,
                 initargs=[A]) as p:
        accu = p.map(my.shared_dot_2, sdV)
    return A.value

In [None]:
%%time
a = distributed_dot_2(sDV, NUMCORES)
print("Result: ", a)

In [None]:
init_time = time.time()
for c in range(10):
    a = distributed_dot_2(sDV,NUMCORES)
final_time = time.time()
print("Average time over 10 samples: {0}".format((final_time-init_time)/10))

Now, the returned value is correct, but the function is not optimal. Look at the shared_dot_2 function and analyze where the lock is placed.

### Third version
Now, we will invoke the function shared_dot_3. Compare the times.

In [None]:
def distributed_dot_3(sdV, numprocs):
    A = Value('d')                  # create a shared accumulator A 
    A.value = 0                     # Initialize to 0
    with mp.Pool(processes = numprocs,
                 initializer = my.dot_init,
                 initargs=[A]) as p:
        accu = p.map(my.shared_dot_3, sdV)
    return A.value

In [None]:
%%time
a = distributed_dot_3(sDV, NUMCORES)
print("Result: ", a)

In [None]:
init_time = time.time()
for c in range(10):
    a = distributed_dot_3(sDV, NUMCORES)
final_time = time.time()
print("Average time over 10 samples: {0}".format((final_time-init_time)/10))

OK, now the average time taken by the third version is less than half of the time taken by the second version.<br/>
Analyze where get_lock() is placed, and look for an explanation.
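
The bodies of shared_dot_2 and shared_dot_3 are not reproduced in this notebook, so the following is only a guess at the kind of difference that would explain such a gap. All names here (`dot_init_sketch`, `dot_lock_per_element`, `dot_lock_per_chunk`, `shared_acc`) are hypothetical. The first variant takes the lock once per element, the second accumulates a private partial sum and takes the lock once per chunk.

```python
from multiprocessing.sharedctypes import Value

shared_acc = None   # set in every worker by the pool initializer


def dot_init_sketch(acc):
    # Pool initializer: keep the shared accumulator in a module-level global.
    global shared_acc
    shared_acc = acc


def dot_lock_per_element(pairs):
    # Lock acquired once per element: correct, but with heavy lock traffic.
    for x, y in pairs:
        with shared_acc.get_lock():
            shared_acc.value += x * y


def dot_lock_per_chunk(pairs):
    # Accumulate privately, then lock only once per chunk.
    partial = 0.0
    for x, y in pairs:
        partial += x * y
    with shared_acc.get_lock():
        shared_acc.value += partial
```

Both variants give the same result; with 64 chunks of 15,625 pairs, the second one acquires the lock 64 times instead of 1,000,000 times.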

## Second Section:

Image filtering: when we filter an image, we apply a mask over each position in the image. For example:
a smoothing filter calculates the value of the pixel at position (x,y) as the average value of its n neighboring pixels.
![image.png](attachment:image.png)

With this type of filter, we need to calculate the average of the positions (x-1,y-1)+(x,y-1)+(x+1,y-1)+(x-1,y)+(x,y)+(x+1,y)+(x-1,y+1)+(x,y+1)+(x+1,y+1) (or the weighted values, if the filter matrix does not have the same value at each position).

We need to implement a filter for images of size 1280x1080 (a matrix of 1280x1080 pixels). The values will be between 0 and 255, and the results should be integer values in that same range.
To calculate the borders, we use the nearest available value. For example: if y is the row, for the upper border (where y=0) the missing row y-1 is replaced by row y, so we use (x-1,y)+(x,y)+(x+1,y)+(x-1,y)+(x,y)+(x+1,y)+(x-1,y+1)+(x,y+1)+(x+1,y+1).
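
A minimal sketch of this border rule for a single-layer image, written in plain Python so that the index handling is explicit. The function name `smooth_pixel` and the use of floor division for the integer result are assumptions; clamping the neighbor coordinates to the image reproduces the replacement described above.

```python
def smooth_pixel(image, x, y):
    # image is a list of rows; image[y][x] is the pixel at column x, row y.
    height, width = len(image), len(image[0])
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # Clamp the neighbor coordinates to the image, so a missing
            # neighbor is replaced by the nearest available pixel.
            ny = min(max(y + dy, 0), height - 1)
            nx = min(max(x + dx, 0), width - 1)
            total += image[ny][nx]
    return total // 9          # integer average, stays within 0..255


tiny = [[0, 9, 18],
        [27, 36, 45],
        [54, 63, 72]]
print(smooth_pixel(tiny, 1, 1))   # center pixel: 324 // 9 = 36
print(smooth_pixel(tiny, 1, 0))   # upper border: 162 // 9 = 18
```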

You will have 2 matrices: Image and Filter. The first one will be a random matrix or a preloaded image (to be more generic, we will use a random matrix at the beginning). You do not know the original size in advance (that means you will have to read the shape of the image).
If the image has more than one layer, the filter must be applied to all the layers.

Things to decide before starting:
* Which information will you distribute? Pixel positions? Row positions? Column positions?
* Which information will you collect? Pixels? Columns? Rows?
* How will you collect the information?


In [None]:
image = np.array(Image.open('cat.jpg'))
print(image.dtype)

In [None]:
plt.figure()
plt.imshow(image)

Now, we calculate the number of elements (bytes in this case) in our image.

In [None]:
data_buffer_size = image.shape[0] * image.shape[1] * image.shape[2]
print(data_buffer_size)

Then, we create an instance of the shared Array object (which carries a lock, available through its get_lock() method). We need to know the C data type (to know how many bytes each element uses) and the length of the linear array.

In [None]:
shared_space = Array(ctypes.c_byte, data_buffer_size)

Because using a matrix as a linear vector can be hard to understand and visualize, we can create a new NumPy array variable and make it use the preallocated memory space.<br/>

Once we have created that NumPy variable, we can reshape it so that it looks like a 3-dimensional array (tensor).
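
The helper my.tonumpyarray is not shown in this notebook; one common way to obtain such a view, assumed here, is np.frombuffer on the underlying ctypes object. The sketch uses a tiny made-up shape and ctypes.c_ubyte (unsigned, matching the uint8 pixels), and checks that a write through the NumPy view lands in the shared buffer.

```python
import ctypes
import numpy as np
from multiprocessing.sharedctypes import Array

shape = (2, 3, 3)                                   # tiny stand-in for image.shape
size = shape[0] * shape[1] * shape[2]
buffer = Array(ctypes.c_ubyte, size)                # zero-initialized shared bytes

# np.frombuffer creates a view on the same memory; no data is copied.
view = np.frombuffer(buffer.get_obj(), dtype=np.uint8).reshape(shape)

view[1, 2, 0] = 200                                 # write through the NumPy view
flat_index = (1 * shape[1] + 2) * shape[2] + 0      # row-major position of [1, 2, 0]
print(buffer[flat_index])                           # 200
```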

In [None]:
shared_data = my.tonumpyarray(shared_space)
result_matrix = shared_data.reshape(image.shape)

In [None]:
print("Image shape: ", image.shape)
print("Result matrix shape: ", result_matrix.shape)

In [None]:
my_filter = np.ones((3,3),dtype=np.uint8)
my_filter

In this first example, we will copy the original image into the shared space, in parallel.<br/>

In this case, we do not have memory hazards (in theory), because we never access the same row from different parallel processes.

In [None]:
importlib.reload(my)

rows = range(image.shape[0])
with mp.Pool(processes = NUMCORES, initializer = my.pool_init, initargs = [shared_space, image, my_filter]) as p:
    p.map(my.parallel_shared_imagecopy, rows)

The difference from the previous session's code is that we are not returning the filtered row data and then constructing the final image. Instead, we are writing directly into shared memory.

In [None]:
print("Shared data type: ", type(shared_data))

In [None]:
plt.figure()
plt.imshow(result_matrix)

## Shared memory assignment

In the myfunctions module, we have the function edge_filter, which receives a filter and applies it to the image, writing each filtered row into the shared r/w memory.
Check the code implementation, analyze it, and modify it. Which error does it have? Check the order of the instructions. Is it efficient?

Implement several edge filters, like the ones described in https://en.wikipedia.org/wiki/Edge_detection

In [None]:
# This is the second-order differential edge filter
my_filter = np.zeros((3,3))
my_filter[0][0] = -0.25
my_filter[2][0] = -0.25
my_filter[0][2] = 0.25
my_filter[2][2] = 0.25
my_filter

In [None]:
# Filtering the image
rows = range(image.shape[0])
with mp.Pool(processes = 8, initializer = my.pool_init, initargs = [shared_space, image, my_filter]) as p:
    p.map(my.edge_filter, rows)

Time for various NUMCORES values:

| NUMCORES | Time       |
|----------|------------|
| 1        | 1 min 12 s |
| 2        | 38.1 s     |
| 4        | 20.1 s     |
| 8        | 21.8 s     |
| 16       | 23.6 s     |
| 32       | 25.7 s     |
| 64       | 31.5 s     |
| 128      | 39.4 s     |


My conclusions here are very similar to those from the last session. The optimal value for my computer was 4 or 8 processes, because that matches the number of cores in my hardware. Lower values left room for improvement, while during execution with higher values I had slight problems typing on the keyboard. The larger the number of processes, the worse the results became: the computer lost time switching between the running processes, because there were more of them than my hardware could run at once.

These values will not be the same on all machines. That is why I specified *my computer* earlier. My partner has a different computer and her results differ from mine.
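
To make the comparison more concrete, the speedup and parallel efficiency can be computed from the measured times in the table above (with 1 min 12 s taken as 72 s). This is a small sketch; the numbers are the ones reported for this one machine.

```python
# Measured wall-clock times in seconds, keyed by number of processes.
times = {1: 72.0, 2: 38.1, 4: 20.1, 8: 21.8, 16: 23.6, 32: 25.7, 64: 31.5, 128: 39.4}

baseline = times[1]
for procs, seconds in times.items():
    speedup = baseline / seconds          # how many times faster than one process
    efficiency = speedup / procs          # fraction of the ideal linear speedup
    print(f"{procs:>3} processes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The speedup peaks at 4 processes (about 3.58) and then decreases, which agrees with the conclusion above.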


In [None]:
plt.figure()
plt.imshow(result_matrix)