# Data Processing with NumPy Arrays
Most operations on __NumPy__ arrays replace explicit loops with array expressions, a practice called _vectorization_. Vectorized array operations are often one or two orders of magnitude faster than their pure Python equivalents. 

## Expressing Conditional Logic
The `numpy.where` function is a vectorized version of the ternary expression `x if condition else y`. 
Let's look at an example with three arrays: the result takes each value from one of two arrays, depending on the corresponding value in the condition array. First, here is how to implement it in pure Python.

In [1]:
import numpy as np
xarr = np.array([1, -2, 9, 10, 11])
yarr = np.array([-90, 80, -8, -3, -4])
cond = np.array([True, False, False, True, False])

In [2]:
%timeit result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]

The slowest run took 4.94 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.98 µs per loop


In [3]:
%timeit  np.where(cond, xarr, yarr)

The slowest run took 14.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 777 ns per loop


In this run, __`numpy.where`__ takes about 777 ns per loop, compared with about 1.98 µs for the pure Python list comprehension, and the gap typically widens on larger arrays. The last two arguments don't need to be arrays; either or both can be scalars.
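
As a minimal sketch of the scalar case, the snippet below passes scalars for one or both of the last two arguments. The array `values` is made up for illustration only.

```python
import numpy as np

# Hypothetical input, chosen only for illustration
values = np.array([[1.5, -0.3], [-2.0, 0.7]])

# Both branches are scalars: 2 where the value is positive, -2 elsewhere
signs = np.where(values > 0, 2, -2)

# One scalar and one array: cap positive values at 2.0, keep the rest unchanged
capped = np.where(values > 0, 2.0, values)

print(signs)
print(capped)
```

The condition, the scalars, and the array are broadcast against each other, so the result has the same shape as `values`.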

## Mathematical and statistical methods
__NumPy__ provides a set of mathematical functions that compute statistics over an entire array or over the data along one axis. Many of them are available both as array methods and as top-level functions.

In [4]:
arr = np.random.randn(5,4)
arr.mean()

-0.15163549150097266

In [5]:
np.mean(arr)

-0.15163549150097266

In [6]:
arr.mean(axis=1)

array([-0.25128266,  0.33337404, -0.31549523, -0.15318665, -0.37158696])

Some methods, such as __`cumsum`__ and __`cumprod`__, do not aggregate; instead they return an array of the intermediate results.

In [7]:
arr.cumsum()

array([-0.37949997, -2.14641548, -1.55373182, -1.00513062, -0.36900039,
       -0.40271052,  0.71099561,  0.32836556,  1.31535799,  1.29582345,
        0.11128676, -0.93361537, -0.65769574, -0.21221027, -1.63270306,
       -1.54636199, -2.45615362, -3.31862652, -3.20733213, -3.03270983])
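
Called without arguments, `cumsum` works on the flattened array, as in the output above. The sketch below, using a small made-up integer array so the results are easy to check by hand, shows how the `axis` argument changes the direction of accumulation.

```python
import numpy as np

# Small illustrative array: [[0, 1, 2], [3, 4, 5]]
grid = np.arange(6).reshape(2, 3)

flat_sum = grid.cumsum()         # running total over the flattened array
col_sum = grid.cumsum(axis=0)    # running total down each column
row_prod = grid.cumprod(axis=1)  # running product along each row

print(flat_sum)   # [ 0  1  3  6 10 15]
print(col_sum)    # [[0 1 2] [3 5 7]]
print(row_prod)   # [[ 0  0  0] [ 3 12 60]]
```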

## Methods for Boolean Arrays
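Boolean values are coerced to 1 (True) and 0 (False), so __`sum`__ can be used to count the __True__ values in an array. The __`any`__ and __`all`__ methods are also very useful with Boolean arrays: __`any`__ tests whether at least one value is True, and __`all`__ tests whether every value is True.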

In [8]:
arr = np.random.randn(100)
(arr < 0).sum()

46

In [9]:
test = np.array([True, False, True, True, False, False])

In [10]:
test.any()

True

In [11]:
test.all()

False
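
These methods also accept an `axis` argument. The following sketch, on a made-up two-dimensional Boolean array, counts True values and applies `any` and `all` per row and per column.

```python
import numpy as np

# Hypothetical Boolean array for illustration
flags = np.array([[True, False, True],
                  [False, False, False]])

count_true = flags.sum()          # total number of True values
per_row_any = flags.any(axis=1)   # does each row contain at least one True?
per_col_all = flags.all(axis=0)   # is every value in each column True?
fraction = flags.mean()           # share of True values in the array

print(count_true, per_row_any, per_col_all, fraction)
```

Taking the `mean` of a Boolean array is a common shortcut for the proportion of True values.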

## Unique and Set logic
__`np.unique`__ returns the sorted unique values in an array. 

In [12]:
example = np.array([1, 0, 3, -4, 3, 2])
np.unique(example)

array([-4,  0,  1,  2,  3])

Another function, __`np.in1d`__, tests whether each value of one array is a member of another array and returns a Boolean array. Newer NumPy releases recommend `np.isin` for the same task.

In [13]:
np.in1d(example, np.arange(4))

array([ True,  True,  True, False,  True,  True], dtype=bool)
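
For comparison, here is a short sketch of the same membership test written with `np.isin`, alongside a few of NumPy's other set routines. The variable `candidates` is introduced here only for readability.

```python
import numpy as np

example = np.array([1, 0, 3, -4, 3, 2])
candidates = np.arange(4)  # [0, 1, 2, 3]

# Element-wise membership test, same result as the in1d call on 1-D input
mask = np.isin(example, candidates)

# Set operations return sorted, unique values
common = np.intersect1d(example, candidates)
either = np.union1d(example, candidates)
only_in_example = np.setdiff1d(example, candidates)

print(mask)
print(common, either, only_in_example)
```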

## Linear Algebra
In NumPy, the linear algebra functions can be found in __`numpy.linalg`__.

In [14]:
from numpy.linalg import inv, qr
X = np.random.randn(5, 5)

In [15]:
inv(X)

array([[ 0.71448496, -0.32493562,  0.19468827,  0.31805817, -0.63898844],
       [ 0.25442574,  0.34923123,  0.65247942, -0.2178518 , -0.40347203],
       [-0.51427498,  0.15989739, -0.29823836,  0.56999732,  0.11021132],
       [-0.01270894,  0.46703051, -0.36882987,  0.57904664, -0.00403454],
       [-0.82340277,  0.43713491, -0.59232273, -0.65955452,  0.58128716]])

In [16]:
q, r = qr(X)

In [17]:
q

array([[-0.17632222,  0.34355213,  0.71455306, -0.00899825, -0.58327983],
       [-0.12356112, -0.5834957 ,  0.49589348, -0.54997087,  0.30965645],
       [-0.45225815, -0.64656397, -0.13786154,  0.42703364, -0.41958797],
       [-0.19815253,  0.06872435, -0.47235159, -0.71736191, -0.46721345],
       [-0.84252078,  0.34458192, -0.03717187,  0.02202823,  0.41177062]])
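
One way to sanity-check results like these is to verify the defining properties numerically. The sketch below assumes a seeded generator (`np.random.default_rng(0)`, not used in the cells above) so that it is reproducible, and compares with `np.allclose` because floating-point results are rarely exact.

```python
import numpy as np
from numpy.linalg import inv, qr

# Seeded generator so the example is reproducible (an assumption of this sketch)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))

q, r = qr(X)

reconstructs = np.allclose(q @ r, X)               # Q times R gives back X
orthonormal = np.allclose(q.T @ q, np.eye(5))      # columns of Q are orthonormal
upper_triangular = np.allclose(r, np.triu(r))      # R has zeros below the diagonal
inverse_ok = np.allclose(inv(X) @ X, np.eye(5))    # inv(X) times X is the identity

print(reconstructs, orthonormal, upper_triangular, inverse_ok)
```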

## Reshaping Arrays
As has been shown in the previous examples, a NumPy array can usually be converted to another shape without copying any data, using the __`reshape`__ method.

In [18]:
test = np.arange(9)
test

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [19]:
test.reshape((3,3))

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

If one of the shape dimensions is -1, its value is inferred from the size of the data.

In [20]:
example = np.arange(18).reshape((9,-1))
example

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15],
       [16, 17]])

There are two ways to convert a higher-dimensional array to one dimension: __`ravel`__ does not copy the underlying data when it can avoid doing so, while __`flatten`__ always returns a copy of the data.

In [21]:
arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [22]:
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [23]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
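
The two outputs look identical, so the difference only shows up in memory. As a rough sketch, `np.shares_memory` can be used to check whether a result is a view, and writing into the result shows the practical consequence.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))

raveled = arr.ravel()      # a view here, because arr is C-contiguous
flattened = arr.flatten()  # always a fresh copy

ravel_is_view = np.shares_memory(arr, raveled)
flatten_is_view = np.shares_memory(arr, flattened)

raveled[0] = 100    # also changes arr[0, 0]
flattened[1] = -1   # leaves arr untouched

# On a transposed (non C-contiguous) array, ravel has to copy
transposed_is_view = np.shares_memory(arr, arr.T.ravel())

print(ravel_is_view, flatten_is_view, transposed_is_view)
print(arr[0, :2])
```

Because `ravel` may or may not return a view depending on the layout, `flatten` is the safer choice when the result will be modified independently of the original.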

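There are two main memory orders: 
* C / _row-major order_: traverse higher dimensions first (the last axis varies fastest, so each row is stored contiguously)
* Fortran / _column-major order_: traverse higher dimensions last (the first axis varies fastest, so each column is stored contiguously)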
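
The difference between the two orders can be seen by passing the `order` argument to `ravel` and `reshape`. This is a small sketch on a made-up 2 x 3 array.

```python
import numpy as np

# [[0, 1, 2],
#  [3, 4, 5]]
arr = np.arange(6).reshape((2, 3))

c_order = arr.ravel(order="C")  # row by row
f_order = arr.ravel(order="F")  # column by column

# reshape accepts the same argument: fill the new shape column by column
f_reshaped = np.arange(6).reshape((2, 3), order="F")

print(c_order)     # [0 1 2 3 4 5]
print(f_order)     # [0 3 1 4 2 5]
print(f_reshaped)  # [[0 2 4] [1 3 5]]
```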

### CPU Cache effects
Memory layout can affect performance:

In [24]:
x = np.zeros((10000, ))
y = np.zeros((10000* 78, ))[::78]
x.shape, y.shape

((10000,), (10000,))

In [38]:
%timeit x.sum()

The slowest run took 22.92 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.44 µs per loop


In [39]:
%timeit y.sum()

The slowest run took 19.94 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 16.4 µs per loop


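Because the CPU pulls data from __main memory into its cache in blocks__, if many of the array items operated on consecutively fit in a single block, then __fewer transfers are needed and the operation is faster__. Here `y` takes every 78th element of a larger array, so its items lie far apart in memory and `y.sum()` is slower even though `x` and `y` have the same shape.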
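
The layout difference between the two arrays can be inspected through the `strides` attribute, which gives the number of bytes to step between consecutive elements. The sketch below assumes the default `float64` dtype (8 bytes per item); `y_packed` is a name introduced here for illustration.

```python
import numpy as np

x = np.zeros((10000,))
y = np.zeros((10000 * 78,))[::78]

x_stride = x.strides  # (8,): neighbours are adjacent in memory
y_stride = y.strides  # (624,): neighbours are 78 * 8 bytes apart

x_contiguous = x.flags["C_CONTIGUOUS"]
y_contiguous = y.flags["C_CONTIGUOUS"]

# Copying into a contiguous buffer restores the compact layout
y_packed = np.ascontiguousarray(y)

print(x_stride, y_stride, x_contiguous, y_contiguous, y_packed.strides)
```

Whether such a copy pays off depends on how often the data is reused, since the copy itself has a cost.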

## Concatenating and Splitting Arrays
__`numpy.concatenate`__ takes a sequence of arrays and joins them together in order along the given axis. 

In [25]:
t1 = np.arange(8).reshape(4,2)
t2 = np.random.randn(4,2)
np.concatenate([t1,t2], axis = 0)

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [ 6.        ,  7.        ],
       [ 1.63071283,  0.90284541],
       [-1.4417776 ,  0.48401095],
       [-1.69559607, -0.1112482 ],
       [ 0.74022337, -0.10461158]])

In [26]:
np.concatenate([t1, t2], axis = 1)

array([[ 0.        ,  1.        ,  1.63071283,  0.90284541],
       [ 2.        ,  3.        , -1.4417776 ,  0.48401095],
       [ 4.        ,  5.        , -1.69559607, -0.1112482 ],
       [ 6.        ,  7.        ,  0.74022337, -0.10461158]])

In [27]:
np.vstack((t1, t2))

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [ 6.        ,  7.        ],
       [ 1.63071283,  0.90284541],
       [-1.4417776 ,  0.48401095],
       [-1.69559607, -0.1112482 ],
       [ 0.74022337, -0.10461158]])

In [28]:
np.hstack((t1, t2))

array([[ 0.        ,  1.        ,  1.63071283,  0.90284541],
       [ 2.        ,  3.        , -1.4417776 ,  0.48401095],
       [ 4.        ,  5.        , -1.69559607, -0.1112482 ],
       [ 6.        ,  7.        ,  0.74022337, -0.10461158]])

In [29]:
first, second, third = np.split(t1, [1,3])

In [30]:
first

array([[0, 1]])
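
In the call above, the list `[1, 3]` gives the row indices at which to cut, so the three pieces are rows `0:1`, `1:3`, and `3:`. A minimal sketch that shows all three pieces, plus a split into equal parts along the other axis:

```python
import numpy as np

t1 = np.arange(8).reshape(4, 2)

# Cut before row 1 and before row 3
first, second, third = np.split(t1, [1, 3])

# An integer instead of a list asks for that many equal parts
left, right = np.split(t1, 2, axis=1)

print(first)    # [[0 1]]
print(second)   # [[2 3] [4 5]]
print(third)    # [[6 7]]
print(left.shape, right.shape)
```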

## Repeating elements

In [31]:
arr = np.arange(3)
arr

array([0, 1, 2])

In [32]:
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [33]:
arr.repeat([1,2,3])

array([0, 1, 1, 2, 2, 2])

In [34]:
arr2 = np.arange(15).reshape(3, 5)

In [35]:
arr2.repeat([2, 3, 4, 1, 7], axis =1)

array([[ 0,  0,  1,  1,  1,  2,  2,  2,  2,  3,  4,  4,  4,  4,  4,  4,  4],
       [ 5,  5,  6,  6,  6,  7,  7,  7,  7,  8,  9,  9,  9,  9,  9,  9,  9],
       [10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 14, 14, 14, 14, 14, 14, 14]])

In [36]:
np.tile(arr, 2)

array([0, 1, 2, 0, 1, 2])

In [37]:
np.tile(arr, (2, 3))

array([[0, 1, 2, 0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2, 0, 1, 2]])