# DECSKS-22: Redesign of <code>remap_step</code> algorithm

# Terse summary

This notebook recasts the version from DECSKS-21, which created separate convect routines for velocity and configuration variables and pre-computed the CFL numbers/correctors c at the start of the time step rather than recomputing them each time for configuration variables (these always use the same set of correctors c for a stage s on each full time step). This notebook creates Cython modules for some of the more problematic methods: boundaryconditions.py -> boundaryconditions.pyx, flux.py -> flux.pyx, and convect_XXXXXX.remap_assignment -> remap.pyx. The corresponding changes inside the lib.convect_XXXXXX routines were needed as well, and have been made in this notebook. The version in this notebook is extended in DECSKS-18 part 3 (note that DECSKS-18 part 2 was still using numpy-based implementations, i.e. the same skeleton created in DECSKS-09, which recorded the development of DECSKS-2.0, the array-based implementation).

# Motivation

The current remapping stepthrough is the following, where as usual we denote the two postpoints colloquially as <code>k1</code> and <code>k2</code>:

    lib.remap_step -> lib.remap_assignment
    
                   -> index == 'nearest': lib.boundaryconditions.nonperiodic(f_k1)
                   
                                                  -> boundaryconditions.lower_boundary(f_k1) factors in boundary
                                                  -> boundaryconditions.upper_boundary(f_k1) factors in boundary
                                              
                                                  -> periodize postpointmesh to produce unique postpoints for
                                                     all off-grid postpoints that have already been factored in
                                                     above and hence zeroed out in density container
                                                     
                                                  <- 
                                                  
                                            (remap rule depends on sign of vx prepoints)
                                                  
                                            mask: f_neg <- f[vx < 0] assigned at postpoints by advanced indexing
                                            mask: f_pos <- f[vx >=0] assigned at postpoints by advanced indexing                      
                                            f_k1  <- consolidate data from both masked arrays

                   -> index == 'contiguous': lib.boundaryconditions.nonperiodic(f_k2)
                   
                   
                                                  -> boundaryconditions.lower_boundary(f_k2) factors in boundary
                                                  -> boundaryconditions.upper_boundary(f_k2) factors in boundary
                                              
                                                  -> periodize postpointmesh to produce unique postpoints for
                                                     all off-grid postpoints that have already been factored in
                                                     above and hence zeroed out in density container
                                                                                              
                                                  <-
                                                  
                                            (remap rule depends on sign of vx prepoints)      
                                                  
                                            mask: f_neg <- f[vx < 0] assigned at postpoints by advanced indexing
                                            mask: f_pos <- f[vx >=0] assigned at postpoints by advanced indexing                                                                    
                                            f_k2  <- consolidate data from both masked arrays

        f_remapped <- f_k1 + f_k2                                            
          

The step at the bottom of each branch, labeled as assigning new densities by <code>advanced indexing</code>, relies on numpy's built-in ability to index with arrays of indices; a further layer is that it ignores masked entries.

Essentially, we map according to:

        f_k[xpostpointmesh, vxprepointmesh] = f_old[xprepointmesh, vxprepointmesh]
        
The left-hand side takes as indices the postpoint mesh, which maps each <code>[i,j]</code> on the right-hand side to postpositions <code>[k1,j]</code> or <code>[k2,j]</code>. (Incidentally, above we have written it redundantly for clarity, but do note that <code>f_old[xprepointmesh, vxprepointmesh] = f_old</code>, as this just slices <code>[i,j]</code> in order.)

It turns out that, with this implementation relying on numpy advanced indexing, two or more prepoints cannot share a postpoint: a given postpoint gets overwritten by the next packet that maps to the same postpoint. This is due to how indices are read in and stored at the C level in numpy (row major ordering convention used in storing multidimensional arrays in equivalent 1D blocks of contiguous memory). Note, it is not possible to insert an increment operator inside this operation as it is not accessible at the python level. We have exploited the fact that previous boundary conditions have created the situation, or can be made to create the situation, where no two postpoints are shared. If this situation happens or can be created with minimal expense, then significant efficiency can be inherited by doing the remapping in "one step".
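
The overwrite behavior can be seen on a toy 1D problem. This is only a sketch: the names <code>vals</code> and <code>post</code> and the array sizes are illustrative and are not taken from DECSKS.

```python
import numpy as np

vals = np.array([1.0, 2.0, 3.0, 4.0])   # densities at four prepoints
post = np.array([0, 2, 2, 3])           # prepoints 1 and 2 share postpoint 2

# one-step remap by advanced indexing: a plain assignment, not an increment
f_assign = np.zeros(4)
f_assign[post] = vals

# explicit loop: densities arriving at a shared postpoint accumulate
f_loop = np.zeros(4)
for i in range(vals.size):
    f_loop[post[i]] += vals[i]

print(f_assign)   # postpoint 2 holds a single packet, so total density is lost
print(f_loop)     # postpoint 2 holds 2.0 + 3.0 = 5.0
```

The loop conserves the total density, whereas the one-step assignment silently drops one of the two packets that share postpoint 2.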

As we have been adding more boundary conditions, the fortune of just having unique postpoints has not been a given, and some manipulations (often esoteric, at times clever) have been needed to achieve the required effect of the boundary conditions while also insisting on unique postpoints that do not cause unintended mappings in the one-step remap.

In our latest case of a symmetry boundary condition, several tactics have been attempted to achieve the same as above, where a set of unique postpoints is decided on that in general will not conflict with the boundary conditions, so that we may reuse our algorithm. The most promising strategy had been partitioning the problem into two separate remappings (partners vs. non-partners). In the end, this was seen to be either wasteful storage-wise, or a looping implementation was unavoidable, which in the worst case scenario makes the symmetry boundary condition perhaps 100 times slower than the numpy based remap (avoiding a looping mechanism was the reason for working out how a one-step matrix mapping can be used instead).

Further, it has always been noted that the current remapping procedure is esoteric, perhaps difficult for others besides the author to follow. And, because of this demand of having unique postpoints, extending to different boundary conditions is more involved and unnatural. The efficiency that we have exploited unfortunately comes with too much baggage. <font color = "purple">The goal of this notebook is to dispense with this routine, and code an equivalent method that is natural, obvious, but still efficient.</font>

Inevitably, this will require the remap algorithm to take on the aforementioned evil looping mechanism, where we loop through each prepoint and assign it to a postpoint location incrementally. Going back to the C level will permit us to do what was stated as a problem above. That is, we can "insert" an increment operator instead of an assignment operator, so that overlapping postpoints have their densities added on top of each other, rather than having one overwrite the other.

This will be accomplished by coding the routines in Cython using typed ndarrays. We note that, if successful, it can invite reconsidering the boundary condition application above and the overall control flow. As concerns the former, if efficiency makes the cost negligible, then for example in an absorbing or charge collecting boundary condition we might assign all postpoints that hit the wall (i < 0) to the x postpoint corresponding to the wall (i = 0), which is physically appealing. In the past, we had skipped this unnecessary reassignment, and just zeroed out all such entries whose postpoints go off-grid. If we had a charge collecting surface, then we could sum all particles at i = 0, rather than as we do now where we sum all postpoints i < 0 if the left wall is charge collecting. The difference here is an "extra" step, and the benefit is straightforwardness and readability for users other than just this author, as well as easy adaptation and obvious extensions to other boundary conditions. If the cost of including an extra step like this is small, we should pursue such things. Since we are now reconsidering the overall architecture of the remap step, it is opportune to consider these things now rather than later.

## Problem setup: shared postpoints example

We wish to map densities from a 2D array <code>f_old</code> to their new locations as prescribed by their 2D map <code>(xpostpointmesh[k,:,:], vxprepointmesh)</code> to locations in the 2D array <code>f_new</code>. We begin with the case that was abandoned in DECSKS-18, as the quest to have unshared postpoints (which would avoid the demonstrated problem below) was seen to be too computationally expensive.

See output below for a presentation of the problem particulars along with descriptions. Here, we have a symmetry boundary at the left edge, <code>i = 0</code>, and must (shown below) account for entering partner particles for all those that exit. For more details, DECSKS-18 part 3 can be reviewed.

In [1]:
import numpy as np

Nx, Nvx = 5, 6
xprepointmesh = np.outer(np.arange(Nx), np.ones(Nvx))
vxprepointmesh = np.outer(np.ones(Nx), np.arange(Nvx))
vxprepointvaluemesh = np.outer(np.ones(Nx), np.linspace(-2,2,Nvx))

# type as int64, for all objects that also serve as indices later

xprepointmesh = xprepointmesh.astype(int)
vxprepointmesh = vxprepointmesh.astype(int)

print "the prepoints of each density packet are at the following locations\n"
print "(xprepointmesh, vxprepointmesh) = "
print xprepointmesh, "\n"
print vxprepointmesh

the prepoints of each density packet are at the following locations

(xprepointmesh, vxprepointmesh) = 
[[0 0 0 0 0 0]
 [1 1 1 1 1 1]
 [2 2 2 2 2 2]
 [3 3 3 3 3 3]
 [4 4 4 4 4 4]] 

[[0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [0 1 2 3 4 5]
 [0 1 2 3 4 5]]


Note, these are read as duples at respective array locations, e.g. <code>(xprepointmesh[0,0], vxprepointmesh[0,0]) = (0,0)</code> and so on.

In [2]:
dx, dt = 1., 1.
CFL = vxprepointvaluemesh * dt / dx
S = np.where(vxprepointvaluemesh < 0, np.ceil(vxprepointvaluemesh), np.floor(vxprepointvaluemesh)).astype(int)
alpha = CFL - S

xpostpointmesh = np.zeros((2,Nx,Nvx))
xpostpointmesh[0,:,:] = xprepointmesh + S
xpostpointmesh[1,:,:] = xpostpointmesh[0,:,:] + np.sign(vxprepointvaluemesh)
xpostpointmesh = xpostpointmesh.astype(int)

print "\nAccording to the chosen velocity grid (not shown here), the nearest neighbor grid points to the true exact x-postpoints are given in xpostpointmesh[0,:,:] = "
print xpostpointmesh[0,:,:]

print "\nthe particles that exit the domain (i < 0) have corresponding entering particles which reach postpoint locations -i"
print "\nhence, we have an xpostpointmesh for the nearest neighbors (we worry about the contiguous gridpoints later) with reentering particles positions at\n"

exiting_prepoints = np.where(xpostpointmesh[0,:,:] < 0)
xpostpointmesh[0,exiting_prepoints[0], exiting_prepoints[1]] = -xpostpointmesh[0,
                                                                               exiting_prepoints[0], 
                                                                               exiting_prepoints[1]]

print xpostpointmesh[0,:,:]
print "\nwe note that on the right-hand side, we have some exiting particles. We do not concern ourselves with boundary conditions right now beyond the symmetry condition"
print "\nthe specific boundary conditions on the right would dictate what happens to these postpoints. Here, we are not concerned with the right boundary so arbitrarily, we choose postpoint positions in x that are on grid for any xpostpointmesh > 4 so we can move forward with our analysis of how to take care of the left boundary"
xpostpointmesh[0,:,:] = np.where(xpostpointmesh[0,:,:] > 4, 3, xpostpointmesh[0,:,:])
print "\nthus we have the final xpostpointmesh"
print xpostpointmesh[0,:,:]


print "\nA symmetry boundary on the left requires us to remap re-entering particles which are partners of the exiting particles (i < 0) and which reach the mirror locations in (x,vx). Such particles have oppositely directed velocity vx -> -vx, in terms of indices here j -> Nvx - 1 - j, hence we have the final postpointmesh for vx as"

vxprepointmesh[exiting_prepoints[0],
               exiting_prepoints[1]] = Nvx - 1 - vxprepointmesh[exiting_prepoints[0], 
                                                                exiting_prepoints[1]] 

print "\nvxprepointmesh = "
print vxprepointmesh


According to the chosen velocity grid (not shown here), the nearest neighbor grid points to the true exact x-postpoints are given in xpostpointmesh[0,:,:] = 
[[-2 -1  0  0  1  2]
 [-1  0  1  1  2  3]
 [ 0  1  2  2  3  4]
 [ 1  2  3  3  4  5]
 [ 2  3  4  4  5  6]]

the particles that exit the domain (i < 0) have corresponding entering particles which reach postpoint locations -i

hence, we have an xpostpointmesh for the nearest neighbors (we worry about the contiguous gridpoints later) with reentering particles positions at

[[2 1 0 0 1 2]
 [1 0 1 1 2 3]
 [0 1 2 2 3 4]
 [1 2 3 3 4 5]
 [2 3 4 4 5 6]]

we note that on the right-hand side, we have some exiting particles. We do not concern ourselves with boundary conditions right now beyond the symmetry condition

the specific boundary conditions on the right would dictate what happens to these postpoints. Here, we are not concerned with the right boundary so arbitrarily, we choose postpoint positions in x that are on grid for any xpostpoi

Note that in terms of (i,j), i labels the xpostpointmesh and j labels the vxprepointmesh. (We call it the "pre"pointmesh for physical correspondence: technically vx is not advected and stays the same as before; it is x that is evolved here. True, here we did modify the prepointmesh for vx; however, the entries that changed just correspond to flipping the role of exiting particles to re-entering particles, and those velocities are still properly "pre"points.) We have the following shared postpoint mappings:

    prepoint          postpoint

    (0,0)       ->      (2,5)
    (0,5)       ->      (2,5)
    
    (0,1)       ->      (1,4)
    (0,4)       ->      (1,4)

    (2,4)       ->      (3,4)
    (4,4)       ->      (3,4)
    
    (1,5)       ->      (3,5)
    (3,5)       ->      (3,5)
    (4,5)       ->      (3,5)
    
Some of these are due to re-entering particles having postpoints that conflict with other purely on-grid particles during the advection, whereas others are artificial (a consequence of our assigning an arbitrary on-grid postpoint to any point that exceeded the right boundary, which does not change the focus here: the symmetry boundary condition).
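
The shared postpoints can also be listed programmatically. The following is a sketch that rebuilds the meshes of this example from scratch and groups the prepoints by postpoint duple; the names <code>xpost</code>, <code>vxpost</code>, and <code>shared</code> are ours for illustration and do not appear in DECSKS.

```python
from collections import defaultdict
import numpy as np

Nx, Nvx = 5, 6
i_idx, j_idx = np.indices((Nx, Nvx))     # same roles as xprepointmesh, vxprepointmesh
vx = np.linspace(-2, 2, Nvx)
S = np.where(vx < 0, np.ceil(vx), np.floor(vx)).astype(int)

xpost = i_idx + S                        # nearest-gridpoint postpoints (k1)
vxpost = j_idx.copy()

# symmetry boundary on the left: i -> -i and j -> Nvx - 1 - j for exiting packets
exiting = xpost < 0
xpost[exiting] = -xpost[exiting]
vxpost[exiting] = Nvx - 1 - vxpost[exiting]

# arbitrary on-grid choice for anything beyond the right boundary, as in the text
xpost[xpost > Nx - 1] = 3

# group prepoints by the postpoint duple they map to
groups = defaultdict(list)
for i in range(Nx):
    for j in range(Nvx):
        groups[(int(xpost[i, j]), int(vxpost[i, j]))].append((i, j))

shared = {post: pre for post, pre in groups.items() if len(pre) > 1}
for post in sorted(shared):
    print(post, "<-", shared[post])
```

For the meshes of this example, this reports four shared postpoints: (1,4), (2,5), (3,4), and (3,5).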

#### Problem statement: 

Map each <code>f_old[i,j]</code> to <code>f_new[ xpostpointmesh[0,i,j], vxprepointmesh[i,j] ]</code>. Currently, DECSKS does this by the obvious index slicing:

    f_new[ xpostpointmesh[0,:,:], vxprepointmesh ] = f_old
    
Note that it is redundant to write <code>f_old[ xprepointmesh_old, vxprepointmesh_old]</code> as by definition <code>f_old[ xprepointmesh_old, vxprepointmesh_old ] = f_old</code>

Note, we are omitting some details here (i.e. positive vs. negative values of velocity change the remap rule, also fractions are appropriated in the assignment that we are not fussing with here). We come back to this later.

The above works efficiently! However, it only works if no two duples in the mapping <code>(xpostpointmesh[0,:,:], vxprepointmesh)</code> are the same. Otherwise, as mentioned, only one of the shared mappings will be assigned to that postpoint duple; they cannot be accumulated, as there is no means to introduce an increment operator into this assignment at the python level (we would have to go to the numpy/C level to modify it).

## Demonstration of DECSKS-2.0 failure to remap correctly

First, we show that this mapping does not work for this case. DECSKS-2.0 essentially performs the following remap assignment implementation. We have neglected some details here, but this is the gist of it:

### DECSKS-2.0 remap procedure (gist)

In [3]:
def remap_assignment_DECSKSv2(f_old, map1, map2):
    """
    inputs:
    f_old -- (ndarray, ndim=2) density to be reassigned to f_new in locations given by (map1, map2)
    map1, map2 -- (ndarrays, ndim=2, dtype=int) together these two 2D arrays constitute a set of duples
                   which inform of all postpoints for each [i,j]
            
    output:
    f_new -- (ndarray, ndim=2) density containing f_old rearranged according to (map1, map2)
    """
    f_new = np.zeros_like(f_old)
    f_new[map1, map2] = f_old
    
    return f_new

In [4]:
print "Suppose we have the following (randomly generated) f_old = "
f_old = 0.25 * np.random.rand(Nx*Nvx).reshape(Nx,Nvx)
print f_old

print "\nwe try to remap according to the above postpoint maps. Here, we try the nearest gridpoints first, which are given by xpostpointmesh[0,:,:]"

Suppose we have the following (randomly generated) f_old = 
[[ 0.07410666  0.18762865  0.01967699  0.03751813  0.02081788  0.11954142]
 [ 0.16753418  0.05198838  0.20529856  0.20655263  0.06315393  0.20142827]
 [ 0.24756821  0.13373349  0.03306002  0.20355135  0.05995287  0.226658  ]
 [ 0.02382589  0.12959642  0.04575815  0.2119707   0.16695347  0.14143475]
 [ 0.06547624  0.02450841  0.03750793  0.02408325  0.10685255  0.06675414]]

we try to remap according to the above postpoint maps. Here, we try the nearest gridpoints first, which are given by xpostpointmesh[0,:,:]


In [5]:
print "\nThe above remap function (DECSKS-2.0) gives the following f_new = "
f_new1 = remap_assignment_DECSKSv2(f_old, xpostpointmesh[0,:,:], vxprepointmesh)

print f_new1


The above remap function (DECSKS-2.0) gives the following f_new = 
[[ 0.24756821  0.05198838  0.01967699  0.03751813  0.          0.        ]
 [ 0.02382589  0.13373349  0.20529856  0.20655263  0.02081788  0.16753418]
 [ 0.06547624  0.12959642  0.03306002  0.20355135  0.06315393  0.11954142]
 [ 0.          0.02450841  0.04575815  0.2119707   0.10685255  0.06675414]
 [ 0.          0.          0.03750793  0.02408325  0.16695347  0.226658  ]]


For example, we have from the above output (comparing prepoint meshes with postpoint meshes) that <code>(2,2)</code> -> <code>(2,2)</code>. Incidentally, this is a unique postpoint, no other particle is mapped to <code>(2,2)</code>. We check to see if this remapping happened.

In [6]:
f_new1[2,2] == f_old[2,2]

True

As an example of a shared postpoint, we noted that (0,0) and (0,5) should both map to <code>f_new</code> at position (2,5). This does not happen with the above attempt at remapping in one index slice:

In [7]:
f_new1[2,5] == (f_old[0,0] + f_old[0,5])

False

In fact, based on the shared postpoint mappings summarized above, we need the following to hold as well:

In [8]:
f_new1[3,4] == (f_old[2,4] + f_old[4,4]) 

False

In [9]:
f_new1[3,5] == (f_old[1,5] + f_old[3,5] + f_old[4,5]) 

False

However, none of them do. Instead, because of row major ordering and how array indexing is done, only the last of the packets sharing a postpoint is actually assigned there. That is, we have:

In [10]:
f_new1[2,5] == f_old[0,5]

True

In [11]:
f_new1[3,4] == f_old[4,4]

True

In [12]:
f_new1[3,5] == f_old[4,5]

True

The packets are not cumulatively incremented at each shared postpoint location, but <b>instead the last one overwrites the others.</b>
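
For comparison, NumPy does expose an unbuffered in-place operation, <code>np.add.at</code>, which accumulates over repeated indices. The following sketch is not what DECSKS does and has not been timed in this notebook; the function name <code>remap_assignment_add_at</code> and the small test arrays are ours for illustration. <code>ufunc.at</code> has historically been noticeably slower than plain advanced indexing, so whether it would be competitive with the routines developed below would need to be measured.

```python
import numpy as np

def remap_assignment_add_at(f_old, map1, map2):
    """Accumulate f_old[i,j] at f_new[map1[i,j], map2[i,j]] (illustrative only)."""
    f_new = np.zeros_like(f_old)
    # unbuffered in-place add: repeated (map1, map2) duples are summed
    np.add.at(f_new, (map1, map2), f_old)
    return f_new

f_old = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
# prepoints (0,0), (0,1) and (1,1) all share the postpoint (0,0); (1,0) -> (1,1)
map1 = np.array([[0, 0],
                 [1, 0]])
map2 = np.array([[0, 0],
                 [1, 0]])

f_new = remap_assignment_add_at(f_old, map1, map2)
print(f_new)   # f_new[0,0] = 1 + 2 + 4 = 7, f_new[1,1] = 3
```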

## Development of a looping routine

We develop a remapping routine by first coding a mapping function <code>remap_assignment</code> that succeeds where the above failed. That is, we have no restriction on shared postpoints, we will properly increment as much density as reaches a particular postpoint. As a first pass, we code this in an obvious way in python, which will be extremely slow. We then try to improve the performance by falling back on Cython.

### Python version: <code>remap_assignment</code>

The loop should be implemented as follows:

In [13]:
#loop through each i,j and increment the density at each postpoint duple
def remap_assignment_py(f_old, map1, map2):
    
    f_new = np.zeros_like(f_old) 
    for i in range(f_old.shape[0]):
        for j in range(f_old.shape[1]):
            f_new[map1[i,j], map2[i,j]] += f_old[i,j]
            
    return f_new

Check to see if remapping succeeded where the above failed

In [14]:
f_new2 = remap_assignment_py(f_old, xpostpointmesh[0,:,:], vxprepointmesh)

In [15]:
f_new2[2,5] == (f_old[0,0] + f_old[0,5]) 

True

In [16]:
f_new2[3,4] == (f_old[2,4] + f_old[4,4])

True

In [17]:
f_new2[3,5] == (f_old[1,5] + f_old[3,5] + f_old[4,5]) 

True

As expected, this works fine. The only issue is its speed when the dimensions become sufficiently large. Before we compare speed, we write a Cython version:

### Cython version: <code>remap_assignment</code>

In [18]:
%load_ext Cython

In [19]:
%%cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t

def remap_assignment_cy(np.ndarray[DTYPE_t, ndim=2] f_old,  np.ndarray[DTYPEINT_t, ndim=2] map1, np.ndarray[DTYPEINT_t, ndim=2] map2, int dim1, int dim2):
   
    cdef np.ndarray[DTYPE_t, ndim=2] f_new = np.zeros([f_old.shape[0], f_old.shape[1]], dtype = DTYPE)
    cdef int i
    cdef int j
    for i in range(dim1):
        for j in range(dim2):
            f_new[map1[i,j], map2[i,j]] += f_old[i,j]
    return f_new

Note that this routine is a <code>def</code> function, so Cython generates a thin Python wrapper that makes it callable from Python (as opposed to a <code>cdef</code> function, which is only available within the Cython module). 

Now we try to use this Cython function

In [20]:
f_new3 = remap_assignment_cy(f_old, xpostpointmesh[0,:,:], vxprepointmesh, f_old.shape[0], f_old.shape[1])

In [21]:
f_new3[2,5] == (f_old[0,0] + f_old[0,5])

True

In [22]:
f_new3[3,4] == (f_old[2,4] + f_old[4,4])

True

In [23]:
f_new3[3,5] == (f_old[1,5] + f_old[3,5] + f_old[4,5]) 

True

This also checks out.

## Performance comparisons

Now that we know the implementation works, we can use any postpointmesh and do not need to fuss over whether it still works when postpoints are shared. We know it will. Hence, for ease of implementation here, we just periodize the postpoint mesh. Incidentally, this will produce all unique postpoints, but we have just shown that the above loop-based algorithms do not mind whether they are shared or not.

In [24]:
# SETUP the problem

Nx, Nvx = 1024, 1024
xprepointmesh = np.outer(np.arange(Nx), np.ones(Nvx))
vxprepointmesh = np.outer(np.ones(Nx), np.arange(Nvx))
vxprepointvaluemesh = np.outer(np.ones(Nx), np.linspace(-2,2,Nvx))

# type as int64, for all objects that also serve as indices later
xprepointmesh = xprepointmesh.astype(int)
vxprepointmesh = vxprepointmesh.astype(int)

dx, dt = 1., 1.
CFL = vxprepointvaluemesh * dt / dx
S = np.where(vxprepointvaluemesh < 0, np.ceil(vxprepointvaluemesh), np.floor(vxprepointvaluemesh)).astype(int)
alpha = CFL - S

xpostpointmesh = np.zeros((2,Nx,Nvx))
xpostpointmesh[0,:,:] = xprepointmesh + S
xpostpointmesh[1,:,:] = xpostpointmesh[0,:,:] + np.sign(vxprepointvaluemesh)
xpostpointmesh = xpostpointmesh.astype(int)
xpostpointmesh = np.mod(xpostpointmesh, Nx)

f_old = 0.25 * np.random.rand(Nx*Nvx).reshape(Nx,Nvx)

#### DECSKS-2.0 routine

In [25]:
%timeit f_new = remap_assignment_DECSKSv2(f_old, xpostpointmesh[0,:,:], vxprepointmesh)

100 loops, best of 3: 8.93 ms per loop


#### Python loop

In [26]:
%timeit f_new2 = remap_assignment_py(f_old, xpostpointmesh[0,:,:], vxprepointmesh)

1 loops, best of 3: 935 ms per loop


#### Cython loop

In [27]:
%timeit f_new3 = remap_assignment_cy(f_old, xpostpointmesh[0,:,:], vxprepointmesh, f_old.shape[0], f_old.shape[1])

100 loops, best of 3: 7.56 ms per loop


<b>So we see Cython is able to turn the slow looping of python into a fast C implementation, which is competitive with numpy built-ins.</b> The speedup in going from python loops to Cython loops is ~ 119 times (note this figure was true for a certain run-through of the above timeit tests; we have likely rerun them since, so the times shown above give a slightly different ratio, but it is basically the same). This is unsurprising since numpy routines are hard-coded C routines. Shortcutting to Cython permits us to produce similar efficiency gains.

Note, we are exploiting <a href = "http://docs.cython.org/src/tutorial/numpy.html#efficient-indexing">efficient indexing</a> here by typing the ndarrays, e.g.

    np.ndarray[DTYPE_t, ndim=2] f_old
    
in function parameters and declarations like

    cdef np.ndarray[DTYPE_t, ndim=2] f_new

and we use compile-time data types (each numpy dtype has a corresponding compile-time type with the suffix <code>_t</code>).

To emphasize how big a difference this makes, we remove these declarations below and replace them with just the more generic type information (i.e. <code>np.ndarray</code> rather than <code>np.ndarray[DTYPE_t, ndim=2]</code>).

## Demonstration of Cython performance loss without ndarray typing

In [28]:
%%cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t

def remap_assignment_cy2(np.ndarray f_old,  np.ndarray map1, np.ndarray map2, int dim1, int dim2):
   
    cdef np.ndarray f_new = np.zeros([f_old.shape[0], f_old.shape[1]], dtype = DTYPE)
    cdef int i
    cdef int j
    for i in range(dim1):
        for j in range(dim2):
            f_new[map1[i,j], map2[i,j]] += f_old[i,j]
    return f_new

In [29]:
%timeit f_new4 = remap_assignment_cy2(f_old, xpostpointmesh[0,:,:], vxprepointmesh, f_old.shape[0], f_old.shape[1])

1 loops, best of 3: 797 ms per loop


This is closer to the time taken by strict python looping. <b>This clearly communicates that the bottleneck for large arrays is the array lookups and assignments</b> (without typing, the <code>[]</code> operator still uses full Python operations).

To optimize further, we can remove bounds checking on the arrays with the decorator

    cimport cython
    @cython.boundscheck(False)

## Cython <code>remap_assignment</code> function with <code>@cython.boundscheck(False)</code>

In [30]:
%%cython

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t

@cython.boundscheck(False)
def remap_assignment_cy3(np.ndarray[DTYPE_t, ndim=2] f_old,  np.ndarray[DTYPEINT_t, ndim=2] map1, np.ndarray[DTYPEINT_t, ndim=2] map2, int dim1, int dim2):
    cdef np.ndarray[DTYPE_t, ndim=2] f_new = np.zeros([f_old.shape[0], f_old.shape[1]], dtype = DTYPE)
    cdef int i
    cdef int j
    for i in range(dim1):
        for j in range(dim2):
            f_new[map1[i,j], map2[i,j]] += f_old[i,j]
    return f_new

In [31]:
%timeit f_new5 = remap_assignment_cy3(f_old, xpostpointmesh[0,:,:], vxprepointmesh, f_old.shape[0], f_old.shape[1])

100 loops, best of 3: 5.81 ms per loop


which shaves off another millisecond at the cost of another part of a Python soul (ok, too dramatic). That is, by turning this off we agree to be careful not to access data outside the bounds of our array: Cython no longer checks, so an out-of-bounds index at best crashes our program, and at worst corrupts data elsewhere in some possibly devious, undetected way. Note that negative (wraparound) indexing, a staple of python, is governed by a separate directive, <code>@cython.wraparound(False)</code>; with bounds checking off and no other change, a negative index is still wrapped once by the array length, but nothing verifies that the result lands inside the array. If wraparound is also disabled, a negative index is used as is and refers to memory before the start of the array, memory that was not allocated for this purpose and that the program may not even have permission to access, which leads to a segmentation fault.
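
Because the compiled routine no longer checks indices, a cheap validation on the Python side before the call can catch off-grid or negative postpoints. A minimal sketch follows; the helper <code>check_maps_in_bounds</code> is hypothetical and not part of DECSKS.

```python
import numpy as np

def check_maps_in_bounds(shape, map1, map2):
    """Raise ValueError if any postpoint duple (map1, map2) falls outside
    an array of the given 2D shape (hypothetical helper)."""
    for axis, m in enumerate((map1, map2)):
        if m.min() < 0 or m.max() >= shape[axis]:
            raise ValueError("postpoint index out of bounds on axis %d" % axis)

f_old = np.zeros((4, 3))
map1, map2 = np.indices(f_old.shape)               # identity map, all on-grid
check_maps_in_bounds(f_old.shape, map1, map2)      # passes silently

map1_bad = map1 - 1                                # row 0 now points at index -1
try:
    check_maps_in_bounds(f_old.shape, map1_bad, map2)
except ValueError as err:
    print(err)                                     # reports the offending axis
```

The two reductions (<code>min</code>, <code>max</code>) per map are single passes over the index arrays, so the guard is cheap next to the remap itself, and it can be dropped once the boundary routines are trusted to produce on-grid postpoints.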

<b>In general, the remap rule depends on the sign of the CFL number.</b> In DECSKS-2.0 we managed to reduce a column-by-column implementation (the velocity is the same for every row in a given column), which is an $N_{vx}$-step problem, to a two-step problem by using masked arrays. Instead of checking each column's velocity sign and applying the required remap rule, we extract all negative CFL columns and all non-negative CFL columns, apply one remap rule once to the array with negative CFL affiliation, and the other remap rule to the array with non-negative CFL affiliation. We then consolidate the data for a return. The numpy mask operations and the consolidation are all $O(N^2)$ operations with $O(N^2)$ conditional checks for the specified condition (in this case, negative vs. non-negative). If we have control over this loop at the C level, we can reduce the number of conditional checks from $O(N^2)$ to $O(N)$ by noting that the specific structure of our problem has CFL numbers that vary only by column, since $\mathcal{C}_j = v_j\Delta t / \Delta x$.

Hence, rather than doing something like the following (what the masked implementation does under the hood with equivalent C code):

    for i in range(dim1):
        for j in range(dim2):
            if CFL[i,j] >= 0:
                f_nonneg[i,j] = f[i,j]
            else:
                f_neg[i,j] = f[i,j]

we can do something like:

    for j in range(dim2):
        if CFL[0,j] >= 0: # (0,j) chosen arbitrarily as CFL[:,j] = constant
            for i in range(dim1):
                f_nonneg[i,j] = f[i,j]
        else:
            for i in range(dim1):
                f_neg[i,j] = f[i,j]

This will be done in our Cython implementation.
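
To make the saving concrete, the two loop orderings can be written out in plain python with a counter on the conditional. This is only an illustrative sketch (the function names <code>sift_elementwise</code> and <code>sift_columnwise</code> are ours, and python loops are of course slow); the point is the number of checks, not the speed.

```python
import numpy as np

def sift_elementwise(f, CFL):
    """One sign check per entry: Nx * Nvx checks."""
    f_nonneg, f_neg = np.zeros_like(f), np.zeros_like(f)
    checks = 0
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            checks += 1
            if CFL[i, j] >= 0:
                f_nonneg[i, j] = f[i, j]
            else:
                f_neg[i, j] = f[i, j]
    return f_nonneg, f_neg, checks

def sift_columnwise(f, CFL):
    """One sign check per column, valid because CFL[:, j] is constant: Nvx checks."""
    f_nonneg, f_neg = np.zeros_like(f), np.zeros_like(f)
    checks = 0
    for j in range(f.shape[1]):
        checks += 1
        if CFL[0, j] >= 0:
            for i in range(f.shape[0]):
                f_nonneg[i, j] = f[i, j]
        else:
            for i in range(f.shape[0]):
                f_neg[i, j] = f[i, j]
    return f_nonneg, f_neg, checks

Nx, Nvx = 5, 6
f = np.arange(1.0, Nx * Nvx + 1).reshape(Nx, Nvx)
CFL = np.outer(np.ones(Nx), np.linspace(-2, 2, Nvx))   # dt = dx = 1

a_nonneg, a_neg, n_elem = sift_elementwise(f, CFL)
b_nonneg, b_neg, n_col = sift_columnwise(f, CFL)
print(n_elem, n_col)   # 30 checks vs. 6 checks, with identical results
```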

We consider an implementation where we first extract all prepoint densities into two arrays pertaining to non-negative vs. negative velocity prepoint values as this determines the remap rule, but decide to explore sifting the density columns based on velocity prepoint column not by masked arrays, but by an equivalent function in Cython. We are assembling pieces that may be combined in a single module in the end, or used in conjunction at the very least.

Looking ahead, the remap rule that must be applied has the form (here, <code>ma</code> refers to masked arrays, the <code>_pos</code> and <code>_neg</code> refer to non-negative ("positive") and negative CFL affiliations):

        f_pos[ z.postpointmesh[0,:,:], vz.prepointmesh ] = f_old_ma - Uf_ma
        f_neg[ z.postpointmesh[0,:,:], vz.prepointmesh ] = f_old_ma + Uf_ma
        
        f_remapped = # consolidate f_pos and f_neg columns
       
for the nearest index <code>k1</code>, then for <code>k2</code> we have
       
        f_pos[ z.postpointmesh[1,:,:], vz.prepointmesh ] = Uf_ma
        f_neg[ z.postpointmesh[1,:,:], vz.prepointmesh ] = -Uf_ma
        
        f_remapped = # consolidate f_pos and f_neg columns
        
We could partition the update as follows instead. We do not need to mask <code>f_old</code>. Suppose we code a remapping function <code>remap</code> in Cython. Then the density can be updated without masking <code>f_old</code>, masking only <code>Uf</code>. That is, we first map

    f_k1 = remap(f_old, *args)
    
then, isolate through a to-be-coded Cython function <code>sift</code>, which allocates the so-labelled (with respect to CFL sign affiliation) entries to otherwise zero arrays <code>Uf_neg</code> and <code>Uf_pos</code>

    Uf_neg, Uf_pos = sift(Uf, vxprepointvaluemesh) 
    
then, we update:

    f_k1 -= remap(Uf_pos)
    f_k1 += remap(Uf_neg)
    
This eliminates an unnecessary "mask" operation on <code>f_old</code>.

As for the contiguous point <code>k2</code>, we have the total rule:

        f_pos[ z.postpointmesh[1,:,:], vz.prepointmesh ] = Uf_ma
        f_neg[ z.postpointmesh[1,:,:], vz.prepointmesh ] = -Uf_ma
        f_k2 = # consolidate f_pos and f_neg

We can pursue the following:

        f_k2 += remap(Uf_pos)
        f_k2 -= remap(Uf_neg)

## A Cython implemented <code>sift</code> function vs. NumPy <code>mask</code>

<font color = "purple">Below we refer to a parameter <code>f</code>, but it is meant to be general</font>

We actually use this routine only to sift <code>Uf</code> as negative vs. non-negative with respect to <code>vx.prepointmesh</code>. <b>Note, in DECSKS we should sift based on CFL, not on $v_x$; here we use $v_x$ for convenience and demonstration purposes only, since the author mistakenly did not realize the difference (described just below) and we did not feel like correcting something (vxprepointvaluemesh -> CFLnumbers) that does not change the steps</b>.

The reason it matters in practice to use CFL vs. velocity is that the CFL is ultimately what determines the form of the remap rule. Often we state, even in past writeups, that this rule depends on the sign of $v_x$; however, this is under the assumption that $\Delta t > 0$. For high order split schemes, it is well known that negative time steps are required, not just possible (e.g. O11-6, O14-6, ...). Hence, the form that we apply for the remap must depend on CFL = $v_j\text{frac}_s\Delta t / \Delta x$, where $\text{frac}_s$ is the splitting coefficient for stage $s$, and can be negative!
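
A small numerical illustration of this point follows. All values (step sizes, velocity grid, split coefficients) are hypothetical and chosen only to show that the sign of the CFL number follows the sign of $v_j$ in a stage with a positive coefficient and is opposite to it in a stage with a negative coefficient.

```python
import numpy as np

dt, dx = 0.1, 0.05                      # hypothetical step sizes
v = np.array([-2.0, -1.0, 1.0, 2.0])    # hypothetical velocity grid values v_j
frac = np.array([0.5, -0.25])           # hypothetical split coefficients frac_s

# CFL[s, j] = v_j * frac_s * dt / dx
CFL = frac[:, None] * v[None, :] * dt / dx

# True where sifting on v would pick the same branch as sifting on CFL
same_sign_as_v = np.sign(CFL) == np.sign(v)
```

Here `same_sign_as_v[0]` is all `True` while `same_sign_as_v[1]` is all `False`, so sifting on $v_x$ would select the wrong branch of the remap rule in the second stage.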

With this understanding, we proceed with the demonstration and development of a Cython routine.

In DECSKS, we perform the following verbatim, where we include both steps: first we develop the mask, then we apply it, then we apply the converse.

### DECSKS-2.0 masked array performance test

In [106]:
%%timeit

import numpy.ma as ma

mask_neg =  (vxprepointvaluemesh < 0)
f_old_ma = ma.array(f_old.copy())
f_old_ma.mask = mask_neg

# after we also reverse the procedure

mask_pos = np.logical_not(mask_neg) # pos means non-negative here
f_old_ma.mask = mask_pos

100 loops, best of 3: 14.8 ms per loop


Timing the mask application alone shows that it constitutes the majority of the processor time.

In [375]:
# Setup for timeit test of just the mask application (note the timeit test above does not cache the objects used)

import numpy.ma as ma

mask_neg = (vxprepointvaluemesh < 0)
f_old_ma = ma.array(f_old.copy())

time just the mask application...

In [374]:
%timeit f_old_ma.mask = mask_neg

100 loops, best of 3: 5.48 ms per loop


It is unsurprising that this costs the most, since the mask application must loop over every element.

Creating the masks themselves incidentally takes minimal time:

In [107]:
%timeit mask_neg = (vxprepointvaluemesh < 0)

1000 loops, best of 3: 1.05 ms per loop


Nonetheless, this is an expensive operation. Though, as was mentioned in DECSKS-09, this was by far the least expensive of several options, not considering Cython based implementations that is.
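
For orientation, the separation that the mask achieves can also be written as a plain boolean selection. The sketch below is not one of the options benchmarked in DECSKS-09 as far as this notebook states; it only checks, on small stand-in arrays (`vx_values` plays the role of `vxprepointvaluemesh`), that filling the masked entries with zero gives the same array as `np.where`.

```python
import numpy as np
import numpy.ma as ma

rng = np.random.default_rng(1)
f_old = rng.random((5, 4))
# stand-in for vxprepointvaluemesh: velocity value constant down each column
vx_values = np.tile(np.array([-1.5, -0.5, 0.5, 1.5]), (5, 1))

mask_neg = vx_values < 0

# masked-array route, as in the cells above
f_old_ma = ma.array(f_old.copy())
f_old_ma.mask = mask_neg
f_nonneg_ma = f_old_ma.filled(0.0)   # masked (negative-velocity) entries become 0

# plain boolean selection into two otherwise zero arrays
f_nonneg = np.where(mask_neg, 0.0, f_old)
f_neg = np.where(mask_neg, f_old, 0.0)
```

The two zero-padded arrays sum back to `f_old`, which is the property the `sift` routine below relies on.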

## <font color = "purple">Cython: Multidimensional ndarrays implementation</font> $(*)$

In [108]:
%%cython

import numpy as np
cimport numpy as np

ctypedef np.float64_t DTYPE_t

def sift(np.ndarray[DTYPE_t, ndim=2] In, np.ndarray[DTYPE_t, ndim=2] vx):

    cdef int i
    cdef int j
    cdef int k = 0
    
    cdef int Nx = vx.shape[0]
    cdef int Nvx  = vx.shape[1]
    cdef np.ndarray[DTYPE_t, ndim=2] out_nonneg = np.zeros((Nx, Nvx))
    cdef np.ndarray[DTYPE_t, ndim=2] out_neg = np.zeros((Nx, Nvx))
    
    for i in range(Nx):
        for j in range(Nvx):
            if vx[i,j] >= 0:
                out_nonneg[i,j] = In[i,j]
            else:
                out_neg[i,j] = In[i,j]
                
    return out_nonneg, out_neg

In [109]:
%timeit f_nonneg, f_neg = sift(f_old, vxprepointvaluemesh)

100 loops, best of 3: 6.29 ms per loop


For comparison's sake, we consider a C-based version that uses native C objects.

### Cython version with C structs

In [202]:
%%cython --annotate

import numpy as np
cimport numpy as np
from libc.math cimport sqrt 
from libc.stdlib cimport malloc, free
cimport cython
 
cdef struct Duple:
    double x
    double y
    
ctypedef np.float64_t DTYPE_t

def sift(np.ndarray[DTYPE_t, ndim=2] In, np.ndarray[DTYPE_t, ndim=2] vx):

    cdef int i
    cdef int j
    cdef int k = 0
    cdef int N = 0
    cdef int Nx = vx.shape[0]
    cdef int Nvx  = vx.shape[1]
    
    cdef Duple * prepoint
    cdef double * f
      
    for i in range(Nx):
        for j in range(Nvx):
            if vx[i,j] >= 0:
                N += 1
    
    # Allocate memory
    prepoint = <Duple *> malloc(N*sizeof(Duple))
    f = <double *> malloc(N*sizeof(double))
    
    for i in range(Nx):
        for j in range(Nvx):
            if vx[i,j] >= 0:
                f[k] = In[i,j]
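                prepoint[k].x = i   # index the k-th struct, not only the first
                prepoint[k].y = j
                k += 1

    # release the C buffers; nothing is converted back to Python here
    free(prepoint)
    free(f)

    return None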

As the <code>--annotate</code> output shows, the implementation has rare Python interaction (i.e. the annotated listing is mostly white rather than highlighted).

In [308]:
%timeit sift(f_old, vxprepointvaluemesh)

The slowest run took 12.14 times longer than the fastest. This could mean that an intermediate result is being cached 
10 loops, best of 3: 5.97 ms per loop


This is of course the fastest; however, we are not factoring in the conversion from an array of C struct <code>Duple</code> data types to a NumPy array. This can be done by generating an appropriate <code>PyObject *</code> using the Python-C API. The conversion should take appreciable time, so we anticipate (and do not care to investigate) that actual use of the above code, with its conversions from <code>PyObject</code> to native C types and back again, would approach the performance of the typed-ndarray Cython version. Hence, we see that Cython permits accessing C-like speeds (as experience has told us before), though we acknowledge that the typed version does not use a 2D C array, but a Python array with static typing for fast indexing.
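
To make concrete what the compact C representation holds, here is a NumPy analogue of the data layout only. It is not the Python-C API conversion mentioned above, and the names `duple_dtype` and `sift_compact` are hypothetical: a structured dtype mirrors the two `double` fields of `Duple`, and the non-negative entries are collected in the same row-major order as the C loops.

```python
import numpy as np

# hypothetical structured dtype mirroring the C struct Duple (two doubles)
duple_dtype = np.dtype([("x", np.float64), ("y", np.float64)])

def sift_compact(In, vx):
    """Collect values and (i, j) prepoints where vx >= 0, in row-major order."""
    i_idx, j_idx = np.nonzero(vx >= 0)
    prepoint = np.empty(i_idx.size, dtype=duple_dtype)
    prepoint["x"] = i_idx
    prepoint["y"] = j_idx
    f = In[i_idx, j_idx]
    return prepoint, f

In = np.arange(6, dtype=float).reshape(2, 3)
vx = np.array([[-1.0, 0.0, 1.0],
               [-1.0, 0.0, 1.0]])
prepoint, f = sift_compact(In, vx)
```

Unlike the zero-padded arrays returned by the typed-ndarray `sift`, this layout stores only the selected entries, so a consumer must carry the `(x, y)` prepoints alongside the values.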

# <code>remap_step</code> and <code>boundaryconditions</code> redesign

Thus, we consider how to combine these routines to reinvent the <code>remap_step</code> orchestration and what changes happen to the boundary conditions. Note that we keep a thin Python wrapper around these C functions; this permits calling them from our Python interface, at the cost of some small overhead.

The entire <code>remap_step</code> routine is as follows (docstring omitted):

In [None]:
def remap_step(
        sim_params,
        f_old,
        Uf,
        n,
        z,
        vz,
        charge
        ):
    
    
    # remap to nearest neighbor cell center
    f_k1 = remap_assignment(
        sim_params,
        f_old,
        Uf,
        z,
        vz,
        charge,
        index = 'nearest'    # remaps to nearest neighbor index
        )

    # remap to contiguous cell center
    f_k2 = remap_assignment(
        sim_params,
        f_old,
        Uf,
        z,
        vz,
        charge,
        index = 'contiguous'    # remaps to contiguous neighbor of above
        )

    f_remapped = f_k1 + f_k2

    return f_remapped


Two mapping calls are invariably required at some level of the abstraction: <code>k1</code> and <code>k2</code> are the nearest gridpoint to the true postpoint and the contiguous gridpoint along the same direction of travel, respectively. The two have different remap rules, including different rules for the flux <code>Uf</code> (described above).

#### <code>remap_assignment</code> (DECSKS-2.0, docstring and longer comments omitted)

In [None]:
def remap_assignment(
        sim_params,
        f_old,
        Uf,
        z,
        vz,
        charge,
        index = 'nearest'
        ):

    
    mask_neg =  (z.CFL.frac < 0)
    mask_pos = np.logical_not(mask_neg)

    if index == 'nearest':

        # copy arrays so changes can be made without affecting the template f_old, Uf
        f_BCs_applied = f_old.copy() # BCs will be applied to f_old
        Uf_BCs_applied = Uf.copy()

        # APPLY BOUNDARY CONDITIONS
        # e.g. for absorbing boundaries, we zero out all prepoint density entries
        # which exit the domain
        f_BCs_applied, Uf_BCs_applied, z = \
          eval(sim_params['boundarycondition_function_handle'][z.str])(
              f_BCs_applied, Uf_BCs_applied, z, vz, sim_params, charge, k = 0)

        f_old_ma = ma.array(f_BCs_applied)
        Uf_ma = ma.array(Uf_BCs_applied)
        f_pos, f_neg = ma.zeros(f_BCs_applied.shape), ma.zeros(f_BCs_applied.shape)

        # mask out negative values
        f_old_ma.mask = mask_neg
        Uf_ma.mask = mask_neg

        f_pos[ z.postpointmesh[0,:,:], vz.prepointmesh ] = f_old_ma - Uf_ma

        # mask out all positive values
        f_old_ma.mask = mask_pos
        Uf_ma.mask = mask_pos

        f_neg[ z.postpointmesh[0,:,:], vz.prepointmesh ] = f_old_ma + Uf_ma

    elif index == 'contiguous':
        # APPLY BOUNDARY CONDITIONS
        # e.g. for absorbing boundaries, we zero out all prepoint density entries
        # which exit the domain
        f_old, Uf, z = \
          eval(sim_params['boundarycondition_function_handle'][z.str])(
              f_old, Uf, z, vz, sim_params, charge, k = 1)

        # initialize masked array for flux Uf; mask out negative values
        f_pos, f_neg = ma.zeros(f_old.shape), ma.zeros(f_old.shape)
        Uf_ma = ma.array(Uf)
        Uf_ma.mask = mask_neg

        f_pos[ z.postpointmesh[1,:,:], vz.prepointmesh ] = Uf_ma

        # mask out all positive values
        Uf_ma.mask = mask_pos

        f_neg[ z.postpointmesh[1,:,:], vz.prepointmesh ] = -Uf_ma


    # wherever the CFL is negative, assign f_neg, else assign f_pos
    f_new = np.where(mask_neg, f_neg.data, f_pos.data)
    return f_new


The majority of the above is just bookkeeping (masked arrays and copies). The boundary conditions module is called above through the magic <code>eval</code> function with a function handle (str) that is assembled from the user's inputs in <code>etc/params.dat</code> before the simulation starts and stored in the dictionary <code>sim_params</code>. It takes the following form: we start at either <code>periodic</code> or <code>nonperiodic</code> and are redirected according to strings in <code>sim_params</code> that communicate which boundary condition is specified (hence which subroutine to follow). For example, $v_x$ often has periodic BCs: we enter at the top method, periodize the mesh, then exit. Note the long comment about how to evade shared postpoints; do not dwell long on it, as it is about to be deprecated given the improvements just after this section. It is provided for completeness.
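
The string-keyed redirection described above can be illustrated without <code>eval</code> by a dictionary lookup. This is a sketch of the dispatch idea only, not how DECSKS does it: the registry `BOUNDARY_HANDLERS`, the helper `apply_boundaries`, and the simplified handler signatures are all hypothetical, and the handlers follow the DECSKS-2.0 absorbing rules (`<= 0` and `>= N`).

```python
import numpy as np

def absorbing_lower_boundary(f, Uf, post, N):
    exits = post <= 0
    return np.where(exits, 0.0, f), np.where(exits, 0.0, Uf)

def absorbing_upper_boundary(f, Uf, post, N):
    exits = post >= N
    return np.where(exits, 0.0, f), np.where(exits, 0.0, Uf)

# hypothetical registry: (BC name, side) -> handler, playing the role of
# eval(sim_params['BC'][z.str]['lower'] + '_lower_boundary')
BOUNDARY_HANDLERS = {
    ("absorbing", "lower"): absorbing_lower_boundary,
    ("absorbing", "upper"): absorbing_upper_boundary,
}

def apply_boundaries(f, Uf, post, N, bc):
    """Apply the lower boundary, then the upper, as in nonperiodic."""
    for side in ("lower", "upper"):
        handler = BOUNDARY_HANDLERS[(bc[side], side)]
        f, Uf = handler(f, Uf, post, N)
    return f, Uf

bc = {"lower": "absorbing", "upper": "absorbing"}   # stand-in for sim_params['BC'][z.str]
f = np.ones((4, 2))
Uf = 0.5 * np.ones((4, 2))
post = np.array([[-1, 0], [1, 2], [2, 3], [3, 4]])
f_bc, Uf_bc = apply_boundaries(f, Uf, post, 4, bc)
```

An unknown boundary name then fails with a `KeyError` at lookup time rather than inside an evaluated string.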

#### lib.boundaryconditions: DECSKS-2.0

In [None]:
import numpy as np

def periodic(f_old,
             Uf,
             z,
             vz,
             sim_params, # used in boundaryconditions.nonperiodic
             charge,
             k = None
             ):
    """Applies periodic boundary conditions to
    postpointmesh

    inputs:
    f_old -- (ndarray, ndim=2) density array
    z -- (instance) phase space variable being evolved

        z.postpointmesh -- (ndarray, ndim=3),
                           shape = (2, x.N, vx.N)

    outputs:
    f_old -- (ndarray, ndim=2) Array with both periodic
             BCs being enforced
    z    -- (instance) phase space variable being evolved with
             updated attribute z.postpointmesh

    f_old, Uf returned for symmetry with nonperiodic routine below
    """
    z.postpointmesh = np.mod(z.postpointmesh, z.N)

    return f_old, Uf, z

def periodize_postpointmesh(zpostpointmesh, zN):
    """Applies periodic boundary conditions to
    postpointmesh[k,:,:]

    inputs:
    z.postpointmesh -- (ndarray, ndim=2), shape = (x.N, vx.N)

    outputs:
    z.postpointmesh -- (ndarray, ndim=2), shape = (x.N, vx.N)
                        periodic BCs applied

    """
    zpostpointmesh = np.mod(zpostpointmesh, zN)

    return zpostpointmesh

def nonperiodic(f_old,
                Uf,
                z,
                vz,
                sim_params,
                charge,
                k = 0
                ):
    """orchestrates applying nonperiodic boundary conditions
    to the array w with total active grid points Nw. Nonperiodic
    boundary conditions require handling both left and right
    boundaries

    inputs:
    f_old -- (ndarray, ndim=2) density array
    z -- (instance) phase space variable being evolved

    outputs:
    f_old -- (ndarray, ndim=2) density with both left and right
             nonperiodic BCs enforced
    Uf -- (ndarray, ndim=2) high order fluxes with both left and right
             nonperiodic BCs enforced


    z returned (no changes) for symmetry with periodic routine above
    """
    # lower boundary
    f_old, Uf = eval(sim_params['BC'][z.str]['lower'] +
                           '_lower_boundary')(f_old, Uf,
                                              z.postpointmesh[k,:,:], z, vz,
                                              sim_params, charge)

    # upper boundary
    f_old, Uf = eval(sim_params['BC'][z.str]['upper'] +
                           '_upper_boundary')(f_old, Uf,
                                              z.postpointmesh[k,:,:], z, vz,
                                              sim_params, charge)

    # since the relevant entries of f_old and Uf that exit the domain
    # are zeroed out, in order to have a clean addition as before
    # we map their postpoints to their corresponding periodic BC locations
    # so that there are no two shared postpoints by construction

    # map z.postpointmesh to periodic locations
    z.postpointmesh[k,:,:] = periodize_postpointmesh(z.postpointmesh[k,:,:], z.N)
    # note that if there were shared postpoints, then lib.convect.remap_assignment
    # would not allocate the correct densities since it is a matrix sum
    # rather than a (slow) loop (where we could have used +=). For example,
    # should more than one prepoint share a common postpoint, only one
    # cell's density would be allocated to the postpoint; the rest would be
    # overwritten, not incrementally summed

    # note the other function, periodic, periodizes the full z.postpointmesh,
    # shape = (2, z.N, vz.N). Here, we only periodize the postpointmesh for the
    # index k in use ('nearest' or 'contiguous'), so that passing through
    # 'nearest' does not tarnish the postpointmesh for 'contiguous', which would
    # otherwise be unphysically periodized and evade most boundaries.

    return f_old, Uf, z

def symmetric_lower_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):
    f_entering = np.where(zpostpointmesh < 0, f_old, 0) # = f_exiting
    Uf_entering = np.where(zpostpointmesh < 0, -Uf, Uf)

def absorbing_lower_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):

    f_old = np.where(zpostpointmesh <= 0, 0, f_old)
    Uf = np.where(zpostpointmesh <= 0, 0, Uf)

    return f_old, Uf

def absorbing_upper_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):

    f_old = np.where(zpostpointmesh >= z.N, 0, f_old)
    Uf = np.where(zpostpointmesh >= z.N, 0, Uf)

    return f_old, Uf

def charge_collection_lower_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):

    # this discriminates vx vs. x, as the boundary condition function handle
    # sim_params['BC'][z.str]['lower' or 'upper'] for z.str = 'vx' is never set to 'charge_collection'
    # in lib.read
    f_absorbed = np.where(zpostpointmesh <= 0, f_old, 0)
    sigma_n = np.sum(vz.prepointmesh * f_absorbed * vz.width)

    # passed by reference, original value is modified, no need for explicit return
    sim_params['sigma'][z.str]['lower'] = \
      sim_params['sigma'][z.str]['lower'] + charge*sigma_n

    f_old, Uf = absorbing_lower_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge)

    return f_old, Uf

def charge_collection_upper_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):

    # this discriminates vx vs. x, as the boundary condition function handle
    # sim_params['BC'][z.str]['lower' or 'upper'] for z.str = 'vx' is never set to 'charge_collection'
    # in lib.read
    f_absorbed = np.where(zpostpointmesh >= z.N, f_old, 0)
    sigma_n = np.sum(vz.prepointmesh * f_absorbed * vz.width)

    # passed by reference, no need for explicit return
    sim_params['sigma'][z.str]['upper'] = \
      sim_params['sigma'][z.str]['upper'] + charge*sigma_n

    f_old, Uf = absorbing_upper_boundary(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge)

    return f_old, Uf

To develop a better implementation, we do still use the same skeleton above, but clean up the "extras" and remove masks that work efficiently, but not as efficiently as equivalent, more natural, Cython routines.
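
The shared-postpoint problem noted in the comments of <code>nonperiodic</code> above is easy to reproduce in NumPy. The following minimal sketch uses a toy column of three prepoints, two of which share a postpoint, and contrasts index assignment with an accumulating scatter (`np.add.at`).

```python
import numpy as np

f = np.array([[1.0], [2.0], [3.0]])   # three prepoints in one velocity column
post = np.array([[2], [2], [0]])      # prepoints 0 and 1 share postpoint 2
cols = np.zeros_like(post)

# index assignment: only one of the colliding values survives
f_assign = np.zeros_like(f)
f_assign[post, cols] = f

# accumulating scatter: colliding values are summed
f_accum = np.zeros_like(f)
np.add.at(f_accum, (post, cols), f)
```

With assignment, density is lost (the total of `f_assign` is smaller than that of `f`), and NumPy makes no promise about which colliding value is kept. With accumulation, the total is preserved. This is why the DECSKS-2.0 routine had to guarantee unique postpoints before assigning.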

The design is presented above, but in words it is the following:

    call boundaryconditions.orchestrator (orchestrator = periodic or nonperiodic)
    
    if periodic:               -> periodize mesh
                               -> call remap function (f, Uf)
                           
    else: # is nonperiodic, hence boundary conditions at each edge need to be factored in

        apply left boundary    -> updates z.postpointmesh[0,:,:] and z.postpointmesh[1,:,:]
        
                                   e.g. if absorber wall, all such [i,j] are zeroed out and
                                           their postpoints are reset to be that of the wall itself
                                           
                                           e.g. if charge collecting, all such [i,j] are zeroed out and
                                           their postpoints [k1,j] are reset to be that of the wall itself
                                           AND all such entries that were zeroed out are added to the total
                                           charge at the wall in an object sigma
                                
        apply right boundary   -> updates z.postpointmesh[0,:,:] and z.postpointmesh[1,:,:]
        
                                   e.g. if absorber wall, all such [i,j] are zeroed out and
                                           their postpoints are reset to be that of the wall itself
                                           
                                           e.g. if charge collecting, all such [i,j] are zeroed out and
                                           their postpoints [k1,j] are reset to be that of the wall itself
                                           AND all such entries that were zeroed out are added to the total
                                           charge at the wall in an object sigma
                               -> call remap function (f, Uf)
                           
We minimize function calls (hence reduce overhead costs), and this is also easier to follow than the current stepthrough.
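
The absorber branch of this control flow can be sketched in NumPy as follows. It is an illustration under assumptions, not the Cython module itself: the helper name `absorb_and_clamp` is hypothetical, it works on copies rather than in place, and it treats the wall indices as `0` and `N - 1`.

```python
import numpy as np

def absorb_and_clamp(f, Uf, post, N):
    """Zero out density and flux leaving the domain; pin their postpoints to the wall."""
    f, Uf, post = f.copy(), Uf.copy(), post.copy()
    exits = (post <= 0) | (post >= N - 1)
    f[exits] = 0.0
    Uf[exits] = 0.0
    np.clip(post, 0, N - 1, out=post)   # every postpoint is now a valid index
    return f, Uf, post

N = 5
f = np.ones((N, 1))
Uf = 0.1 * np.ones((N, 1))
post = np.array([[-2], [0], [2], [4], [6]])
f_bc, Uf_bc, post_bc = absorb_and_clamp(f, Uf, post, N)
```

Because the absorbed entries are zero, it no longer matters that several of them share the wall postpoint, so no periodization of the postpoint mesh is needed before the remap.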

We roughly consider the case of symmetry boundaries. Suppose a symmetry boundary is at the left edge. Then the above control flow is the same and the following comments are made:

        apply left symmetry boundary    -> updates z.postpointmesh[0,:,:] and z.postpointmesh[1,:,:]
        
                                           (some postpoints from partners may get pushed beyond the RIGHT wall)
                                                                      
        apply right boundary            -> updates z.postpointmesh[0,:,:] and z.postpointmesh[1,:,:]
        
                                           (those particles that were pushed outside the RIGHT wall
                                           are treated the same as all such others that are pushed beyond
                                           the right-wall that are non-partners but happened to have
                                           right-going velocity)  
                                    
                                           Note: since we apply the left boundary first, then the right, if we 
                                           tried to make a right boundary a symmetry boundary, then we would require
                                           applying the left symmetry boundary again to account for any partners
                                           that hit the LEFT wall. Thus, since the choice of boundary condition
                                           is up to the user, we should insist the user just chooses the left
                                           boundary as the symmetry boundary since there is zero need to have
                                           it be the right boundary instead of the left.
                                           
                                       -> call remap function (f, Uf)
                                       
                  

Wishfully, we may have stated goals (above) of reusing <code>f</code> and <code>Uf</code>, but after thinking it through it seems too prohibitive. Copying the array and mangling it to achieve the boundary condition is the most natural way of proceeding, also it provides the highest level of abstraction. The other route would require us to ultimately code remap assignments for not every boundary condition, but for every possible pair of boundary conditions, which seems to be terrible practice.

We modify the remapping routine and the passing to boundary conditions as follows. The orchestrator <code>remap_step</code> is also made to be a set of known instructions rather than relying on keyword arguments for the control flow, which are unneeded given that the procedure is the same each time.

### <code>remap_step</code> DECSKS-2.1 in lib.convect_configuration

Note the identical implementation is involved in <code>lib.convect_velocity</code>, with the exception that accessing the object <code>z.CFL.numbers</code> does not require a stage (and the parameter <code>s</code> is not even passed to this function). There, <code>z.CFL.numbers</code> is a 2D array, since we cannot precompute the CFL numbers for $v_x$ at the start of the simulation like we can for $x$: the accelerations $a_x$ are not tied to a grid and can take on any value, whereas the velocities $v_x$ which govern the CFL numbers for $x$ are grid values that never change.

In [None]:
def remap_step(
        sim_params,
        f_template,
        Uf_template,
        s, n,
        z,
        vz,
        charge
        ):
    """Orchestrates remapping of all advected moving cells to the grid,

    First, we map the appropriate proportion to the nearest grid point (k1) to
    the exact (non-integral in general) postpoint.

    Then, we map the remaining fraction to the contiguous grid point (k2)
    defined to be in the same direction of travel as the advection

    inputs:
    sim_params -- (dict) simulation parameters
    f_template = f_old --
              (ndarray, dim=2) density from previous time step, full grid.
              We call this a template because it is used as a template
              in using it for achieving the boundary conditions.

              (1) we copy a template and send it to the boundary conditions
                  when preparing to remap density to the k1 postpoints.
                  To achieve the BCs, we modify the density and fluxes
                  (e.g. zeroing out any absorbed at walls), then remap
                  to a container f_k1

              (2) we repeat the same as above, but this time in preparation
                  for mapping at the contiguous grid point k2. If we used
                  the same density as was returned in (1), it already has
                  been mangled to achieve boundary conditions. Thus, we
                  need a fresh template. Since there are only two postpoints
                  (k1 and k2), we do not need a copy of the original template
                  but can instead just modify the template as its utility
                  ends here.

    Uf_template = Uf_old -- (ndarray, ndim=2), the flux values for the
                            full grid. See the description for f_template
                            on why this is named as Uf_template rather
                            than Uf_old.

    s -- (int) current splitting stage; used to access the pre-computed
            correctors c, which depend on the stage. Note, for physical
            velocity (e.g. in lib.convect_velocity) we must compute the
            correctors in each pass as these are not tied to any grid
            hence are different from one full time step to the next
    n  -- (int) current time step
    z -- (instance) phase space variable
    vz -- (instance) generalized velocity variable for phase space variable z
    charge -- (int) -1 or +1, indicates charge species

    outputs:
    f_remapped -- (ndarray, dim=2) density with all densities reallocated
                 according to their postpoint maps:

      postpoint map = (z.postpointmesh[k1 or k2, :,:], vz.prepointmesh)

    here, we emphasize that it is z that is being advected, hence
    these have distinct postpoints. While we may modify the values
    of the generalized velocity in special circumstances (e.g.
    symmetry boundary condition), the velocities of the prepoints
    are interpreted physically as being the same before (prepoints)
    and after (postpoints) since we are evolving the system with
    splitting routine (one variable evolved at a time)
    """

    f_copy = np.copy(f_template)
    Uf_copy = np.copy(Uf_template)

    # Prepare for remapping to k1

    # apply boundary conditions for density packets reaching [k1, j]
    # here, we pass k = 0 as a parameter, as it refers to z.postpointmesh[k,:,:]
    # and k1 corresponds to the storage k = 0
    f_copy, Uf_copy = \
      eval(sim_params['boundarycondition_function_handle'][z.str])(
          f_copy, Uf_copy, z, vz, sim_params, charge, k = 0)

    Uf_nonneg, Uf_neg = DECSKS.lib.remap.sift(Uf_copy, z.CFL.numbers[s,:,:])

    # remap all [i,j] to postpoints [k1, j], we assign per the piecewise rule:
    #
    #        f_k1[k1,j] =  f_copy[i,j] - Uf_copy[i,j] if CFL >= 0
    #
    #                      f_copy[i,j] + Uf_copy[i,j] if CFL < 0
    #
    # we accomplish the above through the following set of operations in order to minimize the computational cost
    f_k1 = np.zeros_like(f_template)
    f_k1  = DECSKS.lib.remap.assignment(f_copy, z.postpointmesh[0,:,:], vz.prepointmesh, f_k1.shape[0], f_k1.shape[1])
    f_k1 += DECSKS.lib.remap.assignment(Uf_neg, z.postpointmesh[0,:,:], vz.prepointmesh, f_k1.shape[0], f_k1.shape[1])
    f_k1 -= DECSKS.lib.remap.assignment(Uf_nonneg, z.postpointmesh[0,:,:], vz.prepointmesh, f_k1.shape[0], f_k1.shape[1])


    # Prepare for remapping to k2
    # We do not need the information in f_template, and Uf_template hereafter, so we may modify these directly

    # apply boundary conditions for density packets reaching [k2, j]
    # here, we pass k = 1 as a parameter, as it refers to z.postpointmesh[k,:,:]
    # and k2 corresponds to the storage k = 1
    
    f_template, Uf_template = \
      eval(sim_params['boundarycondition_function_handle'][z.str])(
          f_template, Uf_template, z, vz, sim_params, charge, k = 1)

    Uf_nonneg, Uf_neg = DECSKS.lib.remap.sift(Uf_template, z.CFL.numbers[s,:,:])

    # remap all [i,j] to postpoints [k2, j], we assign per the piecewise rule:
    #
    #        f_k2[k2,j] =  +Uf_template[i,j] if CFL >= 0
    #
    #                      -Uf_template[i,j] if CFL < 0
    #
    # we accomplish the above through the following set of operations in order to minimize the computational cost
    
    f_k2 = np.zeros_like(f_template)
    f_k2 -= DECSKS.lib.remap.assignment(Uf_neg, z.postpointmesh[1,:,:], vz.prepointmesh, f_k2.shape[0], f_k2.shape[1])
    f_k2 += DECSKS.lib.remap.assignment(Uf_nonneg, z.postpointmesh[1,:,:], vz.prepointmesh, f_k2.shape[0], f_k2.shape[1])

    f_remapped = f_k1 + f_k2

    return f_remapped

Note the zero Python interaction inside each loop of the Cython routines. Making local copies to typed ndarrays inside a Cython function, so that the bottleneck is executed in pure C, permits C-level speed.
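
The source of <code>remap.pyx</code> is not reproduced in this part of the notebook, so as a reading aid here is a loop-based Python reference with the same call signature as <code>DECSKS.lib.remap.assignment</code> above. It is a sketch under the assumption that the routine accumulates each prepoint value at its postpoint with <code>+=</code>; the name `assignment_reference` is hypothetical.

```python
import numpy as np

def assignment_reference(f, zpostpointmesh, vzprepointmesh, dim1, dim2):
    """Loop-based reference: add f[i, j] to the cell at its postpoint."""
    f_new = np.zeros((dim1, dim2))
    for i in range(dim1):
        for j in range(dim2):
            f_new[zpostpointmesh[i, j], vzprepointmesh[i, j]] += f[i, j]
    return f_new

rng = np.random.default_rng(2)
f = rng.random((4, 3))
rows, cols = np.meshgrid(np.arange(4), np.arange(3), indexing="ij")
post = (rows + 1) % 4            # toy periodic shift by one cell
f_new = assignment_reference(f, post, cols, 4, 3)
```

For this toy periodic shift the result equals `np.roll(f, 1, axis=0)`. A typed Cython version would carry the same double loop with `cdef int` indices and typed ndarray arguments.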

#### <code>remap_assignment</code> (<font color = "green">Removed</font> from DECSKS-2.1)

### Cython module: <code>boundaryconditions.pyx</code>

In [211]:
%%cython --annotate

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t

# PYTHON METHODS

def periodic(f_old,
             Uf,
             z,
             vz,
             sim_params,
             charge,
             k = 0
             ):
    """Applies periodic boundary conditions to
    postpointmesh

    inputs:
    f_old -- (ndarray, ndim=2) density array
    z -- (instance) phase space variable being evolved

        z.postpointmesh -- (ndarray, ndim=3),
                           shape = (2, x.N, vx.N)

    outputs:
    f_old -- (ndarray, ndim=2) Array with both periodic
             BCs being enforced
    z    -- (instance) phase space variable being evolved; its attribute
             z.postpointmesh[k,:,:] is updated in place (z is not returned)

    f_old, Uf returned for symmetry with nonperiodic routine below
    """
    z.postpointmesh[k,:,:] = np.mod(z.postpointmesh[k,:,:], z.N)

    return f_old, Uf

def nonperiodic(f_old,
                Uf,
                z,
                vz,
                sim_params,
                charge,
                k = 0
                ):
    """orchestrates applying nonperiodic boundary conditions
    to the array w with total active grid points Nw. Nonperiodic
    boundary conditions require handling both left and right
    boundaries

    inputs:
    f_old -- (ndarray, ndim=2) density array
    z -- (instance) phase space variable being evolved

    outputs:
    f_old -- (ndarray, ndim=2) density with both left and right
             nonperiodic BCs enforced
    Uf -- (ndarray, ndim=2) high order fluxes with both left and right
             nonperiodic BCs enforced


    z is not returned; the boundary routines update z.postpointmesh[k,:,:] in place
    """
    # lower boundary
    f_old, Uf = eval(sim_params['BC'][z.str]['lower'] +
                           '_lower_boundary')(f_old,
                                              Uf,
                                              z.postpointmesh[k,:,:],
                                              vz.prepointmesh,
                                              z.N, vz.N, k, charge,
                                              sim_params,
                                              z, vz)

    # upper boundary
    f_old, Uf = eval(sim_params['BC'][z.str]['upper'] +
                           '_upper_boundary')(f_old,
                                              Uf,
                                              z.postpointmesh[k,:,:],
                                              vz.prepointmesh,
                                              z.N, vz.N, k, charge,
                                              sim_params,
                                              z, vz)


    return f_old, Uf

# CYTHON METHODS
@cython.boundscheck(False)
def absorbing_lower_boundary(np.ndarray[DTYPE_t, ndim=2] f_old,
                              np.ndarray[DTYPE_t, ndim=2] Uf_old,
                              np.ndarray[DTYPEINT_t, ndim=2] zpostpointmesh,
                              np.ndarray[DTYPEINT_t, ndim=2] vzprepointmesh,
                              int Nz, int Nvz, int k, int charge,
                              sim_params,
                              z, vz):

    # vars here are typed as C data types to minimize python interaction
    cdef int i, j
    for i in range(Nz):
        for j in range(Nvz):
            if zpostpointmesh[i,j] <= 0:
                f_old[i,j] = 0
                Uf_old[i,j] = 0
                zpostpointmesh[i,j] = 0 # set postpoint at the absorber

    # permanently copy to instance attribute
    z.postpointmesh[k,:,:] = zpostpointmesh

    return f_old, Uf_old

@cython.boundscheck(False)
def absorbing_upper_boundary(np.ndarray[DTYPE_t, ndim=2] f_old,
                              np.ndarray[DTYPE_t, ndim=2] Uf_old,
                              np.ndarray[DTYPEINT_t, ndim=2] zpostpointmesh,
                              np.ndarray[DTYPEINT_t, ndim=2] vzprepointmesh,
                              int Nz, int Nvz, int k, int charge,
                              sim_params,
                              z, vz):

    # vars here are typed as C data types to minimize python interaction
    cdef int i, j
    for i in range(Nz):
        for j in range(Nvz):
            if zpostpointmesh[i,j] >= Nz - 1:
                f_old[i,j] = 0
                Uf_old[i,j] = 0

                zpostpointmesh[i,j] = Nz - 1 # set postpoint at the absorber

    # permanently copy to instance attribute
    z.postpointmesh[k,:,:] = zpostpointmesh

    return f_old, Uf_old

@cython.boundscheck(False)
def charge_collection_lower_boundary(
        np.ndarray[DTYPE_t, ndim=2] f_old,
        np.ndarray[DTYPE_t, ndim=2] Uf_old,
        np.ndarray[DTYPEINT_t, ndim=2] zpostpointmesh,
        np.ndarray[DTYPEINT_t, ndim=2] vzprepointmesh,
        int Nz, int Nvz, int k, int charge,
        sim_params,
        z, vz
        ):
    # vars here are typed as C data types to minimize python interaction

    # for any such density packet whose postpoint predicts exceeding the lower boundary of
    # the domain, zero out (absorb) the density, and add contribution to total charge
    # density sigma at that boundary
    cdef int i, j
    cdef DTYPE_t sigma_n = 0 # at current time step n
    cdef DTYPE_t vzwidth = vz.width
    for i in range(Nz):
        for j in range(Nvz):
            if zpostpointmesh[i,j] <= 0:
                sigma_n += vzprepointmesh[i,j] * f_old[i,j]
                f_old[i,j] = 0
                Uf_old[i,j] = 0
                zpostpointmesh[i,j] = 0 # set postpoint at the absorber

    sigma_n *= charge
    sigma_n *= vzwidth

    # update cumulative charge density
    sim_params['sigma'][z.str]['lower'] += sigma_n
    z.postpointmesh[k,:,:] = zpostpointmesh

    return f_old, Uf_old

@cython.boundscheck(False)
def charge_collection_upper_boundary(
        np.ndarray[DTYPE_t, ndim=2] f_old,
        np.ndarray[DTYPE_t, ndim=2] Uf_old,
        np.ndarray[DTYPEINT_t, ndim=2] zpostpointmesh,
        np.ndarray[DTYPEINT_t, ndim=2] vzprepointmesh,
        int Nz, int Nvz, int k, int charge,
        sim_params,
        z, vz
        ):
    # vars here are typed as C data types to minimize python interaction

    # for any such density packet whose postpoint predicts exceeding the upper boundary of
    # the domain, zero out (absorb) the density, and add contribution to total charge
    # density sigma at that boundary
    cdef int i, j
    cdef DTYPE_t sigma_n = 0 # at current time step n
    cdef DTYPE_t vzwidth = vz.width
    for i in range(Nz):
        for j in range(Nvz):
            if zpostpointmesh[i,j] >=  Nz - 1:
                sigma_n += vzprepointmesh[i,j] * f_old[i,j]
                f_old[i,j] = 0
                Uf_old[i,j] = 0
                zpostpointmesh[i,j] = Nz - 1  # set postpoint at the absorber

    sigma_n *= charge
    sigma_n *= vzwidth

    # update cumulative charge density
    sim_params['sigma'][z.str]['upper'] += sigma_n
    z.postpointmesh[k,:,:] = zpostpointmesh

    return f_old, Uf_old
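
As a cross-check on the charge-collection loops above, the following pure-NumPy sketch reproduces the lower-boundary logic on a toy grid. The helper name <code>charge_collection_lower_reference</code> and the toy arrays are hypothetical and not part of DECSKS. Like the Cython routine, it weights each absorbed density by its integer velocity prepoint and scales the sum by <code>charge</code> and the velocity cell width.

```python
import numpy as np

def charge_collection_lower_reference(f_old, Uf_old, zpostpointmesh,
                                      vzprepointmesh, charge, vzwidth):
    """Pure-NumPy stand-in for the Cython loop (hypothetical helper)."""
    f_new = f_old.copy()
    Uf_new = Uf_old.copy()
    post = zpostpointmesh.copy()

    # packets whose postpoint lies at or beyond the lower absorber
    absorbed = post <= 0

    # charge collected this time step, mirroring the loop's accumulation
    sigma_n = charge * vzwidth * np.sum(vzprepointmesh[absorbed] * f_old[absorbed])

    f_new[absorbed] = 0
    Uf_new[absorbed] = 0
    post[absorbed] = 0  # set postpoint at the absorber
    return f_new, Uf_new, post, sigma_n

# toy 3 x 2 grid (assumed values, for illustration only)
f_old = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Uf_old = 0.25 * f_old
zpost = np.array([[1, -1], [2, 0], [3, 1]], dtype=np.int64)
vzpre = np.array([[0, 1], [0, 1], [0, 1]], dtype=np.int64)

f_new, Uf_new, post, sigma_n = charge_collection_lower_reference(
    f_old, Uf_old, zpost, vzpre, charge=-1, vzwidth=0.5)

print(sigma_n)  # -3.0: two packets with prepoint index 1 and densities 2 and 4
```

Unlike the Cython version, this sketch copies its inputs instead of mutating them, which makes it convenient for comparing against the compiled routine in a test.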


### Cython module <code>remap.pyx</code>

In [214]:
%%cython --annotate

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t


def assignment(np.ndarray[DTYPE_t, ndim=2] f_old,  np.ndarray[DTYPEINT_t, ndim=2] map1, np.ndarray[DTYPEINT_t, ndim=2] map2, int dim1, int dim2):

    cdef np.ndarray[DTYPE_t, ndim=2] f_new = np.zeros([dim1, dim2], dtype = DTYPE)
    cdef int i
    cdef int j

    for i in range(dim1):
        for j in range(dim2):
            f_new[map1[i,j], map2[i,j]] += f_old[i,j]
    return f_new

def sift(np.ndarray[DTYPE_t, ndim=2] f, np.ndarray[DTYPE_t, ndim=2] CFL):
    """
    inputs:
    f -- (ndarray, ndim=2, dtype = float64) density-like object
    CFL -- (ndarray, ndim=2, dtype = float64) CFL numbers; the sign of
           CFL[0,j] gives the direction of advection for column j

    outputs:
    f_nonneg, f_neg -- (ndarrays, ndim=2, dtype = float64) containers holding
                    density-like entries from f according to

                    f_nonneg -- each f[i,j] such that CFL[0,j] >= 0
                    f_neg -- each f[i,j] such that CFL[0,j] < 0

    """
    cdef int i, j
    cdef int dim1 = f.shape[0]
    cdef int dim2 = f.shape[1]

    cdef np.ndarray[DTYPE_t, ndim=2] f_nonneg = np.zeros((dim1, dim2))
    cdef np.ndarray[DTYPE_t, ndim=2] f_neg = np.zeros((dim1, dim2))

    for j in range(dim2):
        # CFL[0,j] = CFL[1,j] = ... = const, we choose to examine (0,j) arbitrarily
        if CFL[0,j] >= 0: 
            for i in range(dim1):
                f_nonneg[i,j] = f[i,j]
        else:
            for i in range(dim1):
                f_neg[i,j] = f[i,j]
    return f_nonneg, f_neg
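
The explicit double loop in <code>assignment</code> accumulates every contribution, so it stays correct when several prepoints share a postpoint. The sketch below illustrates this with a NumPy stand-in built on <code>np.add.at</code>, and contrasts it with plain fancy-index assignment, which does not accumulate repeated indices. The helper name <code>assignment_reference</code> and the toy maps are assumptions made for illustration.

```python
import numpy as np

def assignment_reference(f_old, map1, map2):
    """NumPy stand-in for the Cython `assignment` loop (hypothetical helper)."""
    f_new = np.zeros_like(f_old)
    # np.add.at is unbuffered: repeated (map1, map2) pairs are all summed,
    # which matches f_new[map1[i,j], map2[i,j]] += f_old[i,j] in the loop
    np.add.at(f_new, (map1, map2), f_old)
    return f_new

f_old = np.array([[1.0, 2.0], [3.0, 4.0]])

# prepoints (0,0) and (1,0) both map to postpoint (0,0): a shared postpoint
map1 = np.array([[0, 1], [0, 0]], dtype=np.int64)
map2 = np.array([[0, 1], [0, 1]], dtype=np.int64)

f_loop = assignment_reference(f_old, map1, map2)
print(f_loop)  # [[4. 4.] [0. 2.]], total density 10 is conserved

# buffered fancy-index update: one contribution at the shared postpoint is lost
f_fancy = np.zeros_like(f_old)
f_fancy[map1, map2] += f_old
print(f_fancy.sum() < f_old.sum())  # True
```

This is the reason an array-based remap needs unique postpoints, while a loop-based (or <code>np.add.at</code>-based) remap does not.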




In [138]:
%%cython --a

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
DTYPEINT = np.int64

ctypedef np.float64_t DTYPE_t
ctypedef np.int64_t DTYPEINT_t


@cython.boundscheck(False)
def absorbing_lower_boundary(np.ndarray[DTYPE_t, ndim=2] f_old, np.ndarray[DTYPE_t, ndim=2] Uf,
                             np.ndarray[DTYPEINT_t, ndim=2] zpostpointmesh,
                             z, vz, sim_params, charge):


    cdef int i
    cdef int j
    cdef int Nz = z.N
    cdef int Nvz = vz.N

    for i in range(Nz):
        for j in range(Nvz):
            if zpostpointmesh[i,j] <= 0:
                f_old[i,j] = 0
                Uf[i,j] = 0
                zpostpointmesh[i,j] = 0 # set postpoint at the absorber

    return f_old, Uf, zpostpointmesh

#### Performance from DECSKS-2.0 and Cython boundary conditions

Setup

In [204]:
class var:
    def __init__(self, Nz, mapping, name = 'x'):
        self.N = Nz
        
        if name == 'x':
            self.postpointmesh = mapping
        if name == 'vx':
            self.prepointmesh = mapping
        

In [205]:
x = var(Nx, xpostpointmesh, name = 'x')
vx = var(Nvx, vxprepointmesh, name = 'vx')
Uf = .25*f_old # arbitrary
charge = -1
sim_params = {}
f_old2 = np.copy(f_old)
Uf2 = np.copy(Uf)


We currently use <code>np.where</code> calls to apply the absorbing boundary condition. For a performance comparison, that routine is timed below.

#### DECSKS-2.0 absorbing boundary condition

In [208]:
%%timeit 

def absorbing_lower_boundary_DESCKSv2(f_old, Uf, zpostpointmesh, z, vz, sim_params, charge):

    f_old = np.where(zpostpointmesh <= 0, 0, f_old)
    Uf = np.where(zpostpointmesh <= 0, 0, Uf)
    return f_old, Uf

absorbing_lower_boundary_DESCKSv2(f_old2, Uf2, x.postpointmesh[0,:,:], x, vx, sim_params, charge)
# we also finalize (periodize) the postpointmesh map after this function call

np.mod(x.postpointmesh[0,:,:], x.N)

The slowest run took 54.42 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 21 ms per loop


Note that the timing includes a periodization of the mesh after the boundary condition is applied. DECSKS-2.0 needs this step so that no postpoints are shared; it is done in the main orchestrator <code>lib.boundaryconditions.nonperiodic</code>.
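
A one-dimensional sketch of this absorb-then-periodize step is given below. The grid size, the uniform shift, and the densities are assumed toy values. Because off-grid packets are zeroed first, wrapping their postpoints with <code>np.mod</code> only serves to keep the index map unique, which is what array-based assignment requires.

```python
import numpy as np

Nz = 5
prepoints = np.arange(Nz)    # [0, 1, 2, 3, 4]
postpoints = prepoints - 2   # uniform shift: [-2, -1, 0, 1, 2]

f = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# absorbing lower boundary: zero any packet that reaches or passes index 0
f = np.where(postpoints <= 0, 0, f)

# periodize so that every postpoint is a valid and unique index
periodized = np.mod(postpoints, Nz)
print(periodized)  # [3 4 0 1 2]

# with unique postpoints, plain fancy-index assignment is safe
f_new = np.zeros(Nz)
f_new[periodized] = f
print(f_new)       # [ 0. 40. 50.  0.  0.]
```

The wrapped entries carry zero density, so they contribute nothing at their periodized locations.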

#### Cython absorbing boundary condition

In [212]:
f_old2 = np.copy(f_old)
Uf2 = np.copy(Uf)

In [213]:
%timeit absorbing_lower_boundary(f_old2, Uf2, x.postpointmesh[0,:,:], vx.prepointmesh, x.N, vx.N, 0, -1, sim_params, x, vx)

1000 loops, best of 3: 1.24 ms per loop


So we see a speedup of approximately 17 times here (21 ms versus 1.24 ms). Savings carry over to full simulations. As an example, we compared a DECSKS-2.0 simulation with charge collection on both edges of the domain (LF2 splitting, Nx = 240, Nvx = 400, Nt = 500) against DECSKS-2.1, which uses the above Cython routines: the cost dropped from approximately 2.1 seconds per time step in DECSKS-2.0 to approximately 0.6 seconds per time step in DECSKS-2.1, a factor of about 3.5.

## Conclusions

Our original goal was to obtain comparable speeds with more obvious implementations such as explicit loops, so that we could remove a very particular requirement of DECSKS-2.0: it relied on there being zero shared postpoints in a given time substep of advecting density packets. Here, we have recast the code to use more obvious, physically intuitive looping mechanisms, which not only removes this awkward requirement but incidentally is significantly more efficient than the previous version. Now that unique (unshared) postpoints are no longer required, we may go back and finish the details of handling the symmetry boundary condition, which had hitherto been plagued by the inability to produce unique postpoints efficiently. This is no longer a problem, so we may proceed by working out the details and coding them directly in DECSKS-18 part 3. <b>The DECSKS-18 part 3 notebook contains the version developed here and extends it.</b>