Die-hard C++ or Fortran users among physicists often say that python is too slow. 

True, python is an interpreted language and it is slow.

Even python advocates like me realize it, but we think that the (lack of) speed of python is not really an issue, e.g. because:

* The time spent computing is balanced by a much smaller development time;
* Profiling is easy, which means that one can find the parts of the code that are slow, optimize them, and even write them in faster languages so that they can be compiled and used from python;

And most importantly: 

* Some python tools like numpy can be nearly as fast as plain C for operations on large arrays. 

In this tutorial, you will understand what numpy is and why it's fast, and learn just what's needed about numpy for the usual machine learning operations. 



## Installation

Numpy is the core of scientific python, so it's installed as a dependency of most scientific python packages. For example, you will get it if you install scikit-learn, matplotlib, or Keras. Numpy is also installed by default on the usual hosted jupyter notebook platforms, such as Google Colab or FloydHub. 

If you don't have it, you can install it with [Anaconda](https://thedatafrog.com/en/install-anaconda-data-science-python/), by doing: 

```
conda install numpy
```

Then, traditionally, numpy is imported in the following way:

In [1]:
import numpy as np

Please keep importing it as `np`: it will make your code clearer to you and to other people.

## The numpy array: Why is it fast? 

The main purpose of numpy is to provide a very efficient data structure called the numpy array, and the tools to manipulate such arrays. 

Why is the numpy array so fast? 

Because, under the hood, arrays are processed with compiled code, optimized for the CPU. In particular, numpy operations can process several elements at once, as they use [SIMD](https://en.wikipedia.org/wiki/SIMD) (Single Instruction, Multiple Data) instructions where the CPU provides them. 

To see how fast numpy is, we can time it. 

Let us create a large list with one million integers, and a numpy array from this list: 

In [4]:
lst = list(range(1000000))
arr = np.array(lst)
arr

array([     0,      1,      2, ..., 999997, 999998, 999999])

Now let's compute the square of all integers, and see how much time it takes. 

We start with the list: 

In [6]:
%timeit squares = [x**2 for x in lst]

229 ms ± 6.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


And we do the same for the array:

In [7]:
%timeit squares = arr**2

624 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


As you can see, this is more than 300 times faster. 

We can in principle loop over the numpy array like this:

In [8]:
%timeit squares = [x**2 for x in arr]

243 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


But then, we completely lose the benefits of numpy! Indeed, when we do `arr**2`, the whole array is squared in a single call to numpy's compiled code. When we loop, we process the elements one by one with basic python. So: 

**Never ever loop over a numpy array! You'll be tempted to do so, but there should be no exception!**
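
In practice, avoiding the loop usually means rewriting it as one or two operations on the whole array. Here is a minimal sketch with a small array (the variable names are mine, chosen for illustration) that computes a sum of squares both ways and checks that the results agree:

```python
import numpy as np

arr = np.arange(10)

# loop version: elements are processed one by one in python
total_loop = 0
for value in arr:
    total_loop += value**2

# vectorized version: the square and the sum are both done by numpy
total_vec = np.sum(arr**2)

print(total_loop, total_vec)
```

Both versions give the same number, but only the second one keeps the work inside numpy.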

## Numpy array data types

We have seen that numpy arrays are processed by compiled code with SIMD. 

For this to work, the elements in a numpy array must be: 

* of a basic type, e.g. integers or floats. The full range of possibilities is given on [this page](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html);
* of the same type, so that they have the same size, e.g. 64-bit floats, or 16-bit integers.

In contrast, python lists can contain heterogeneous objects of any type. 

Here are a few ways to create numpy arrays with different types:

In [10]:
from sys import getsizeof as sizeof

# numpy guesses that it should use integers
x = np.array([0, 1, 2])
print(x.dtype)
# or floats: 
x = np.array([0., 1., 2.])
print(x.dtype)
# here we specify a python compatible type,
# interpreted by numpy as int64
x = np.array([0., 1., 2.], dtype=int)
print(x.dtype)
# here we specify that we want 8 bits integers
x = np.array([0, 1, 2], dtype=np.int8)
print(x.dtype)

int64
float64
int64
int8


This makes it easy to estimate the size of a numpy array in memory, to see if you're going to blow up your computer before you actually do. 
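
As a sketch of how this estimate works, the size in bytes is the number of elements times the size of one element. The array below is a made-up example (10,000 rows of 100 values in 32-bit floats), and the result is cross-checked with the `nbytes` attribute:

```python
import numpy as np

# hypothetical dataset: 10,000 rows of 100 measurements, as 32-bit floats
x = np.zeros((10_000, 100), dtype=np.float32)

# number of elements times number of bytes per element
expected_bytes = x.size * x.itemsize

print(x.size, x.itemsize)         # 1000000 elements of 4 bytes each
print(expected_bytes, x.nbytes)   # both are 4000000 bytes, i.e. 4 MB
```

The same arithmetic can be done on paper before creating the array, which is the whole point.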

---

*Exercise*

For example, let's consider a sample of 1000 images, each with 200x200 pixels, and 3 color channels per pixel. 
The color level ranges from 0 to 255, and can thus be encoded as an 8-bit unsigned integer. 

Assuming you store the data of all images in a single numpy array, what would be its size in memory in GB? 

--- 

## Numpy element-wise operations

I call element-wise operations all the operations that affect the array elements, but preserve the array shape.

All the usual operators are implemented in numpy, for arrays. For example: 

In [56]:
x = np.array(range(5))
x**2

array([ 0,  1,  4,  9, 16])

Note that these operators, in numpy, are element-wise:

In [57]:
x+1

array([1, 2, 3, 4, 5])

The element-wise equivalents of the functions in the python `math` package are available directly from the `numpy` package, with the same name, e.g.:

In [60]:
x = np.array([0, 1, 2])
np.exp(x)

array([1.        , 2.71828183, 7.3890561 ])
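
To see why the numpy versions are needed, here is a small sketch comparing `math.exp` and `np.exp` on the same array. The flag `math_works` is just a name I introduce for the illustration:

```python
import math
import numpy as np

x = np.array([0., 1., 2.])

# math.exp expects a single number, so it refuses an array of 3 elements
try:
    math.exp(x)
    math_works = True
except TypeError as err:
    math_works = False
    print("math.exp failed:", err)

# np.exp is applied to every element and returns an array of the same shape
y = np.exp(x)

# cross-check against math.exp applied element by element
expected = [math.exp(value) for value in x]
print(np.allclose(y, expected))
```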

Finally, binary operators can be applied to two arrays of the same shape, element by element: 

In [58]:
x = np.array([0, 1, 2])
y = np.array([1, 2, 3])
x+y

array([1, 3, 5])

In [59]:
x*y

array([0, 2, 6])

## Numpy array shape

So far, we have only seen arrays with a single dimension. But often, more dimensions are used. 

Multidimensional arrays can be created from a list of lists, e.g.:

In [62]:
x = np.array([[0, 1], [2, 3], [4,5]])
x

array([[0, 1],
       [2, 3],
       [4, 5]])

The shape attribute gives us the length of each dimension: 

In [41]:
x.shape

(3, 2)

In this case, we have 3 rows of 2 numbers. The first dimension is the outermost dimension, and the second one the innermost dimension. 

Let's take the example of a 2x2 pixel image, with 3 color channels (red, green, blue) in each pixel:

In [12]:
x = np.array(
    [
        [ [1,2,3], [4,5,6], ],
        [ [7,8,9], [10,11,12]]
    ]
)
print(x)
print(x.shape)

[[[ 1  2  3]
  [ 4  5  6]]

 [[ 7  8  9]
  [10 11 12]]]
(2, 2, 3)


To visualize numpy arrays more easily, I often think of the innermost dimension separately. For example, here, we have a 2x2 pixel array, with a sub-array of size 3 in each pixel.

And as a final example, let's consider a "column vector": 

In [14]:
x = np.array([
    [0],
    [1],
    [2], 
    [3]
])
print(x)
print(x.shape)

[[0]
 [1]
 [2]
 [3]]
(4, 1)


As you can see, the "column vector" has two dimensions, which may be counterintuitive. There is a single number (a scalar) on the innermost dimension.


Please note that in numpy, a dimension can also be called an "axis". 
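
The word "axis" shows up mostly as an argument of numpy functions that collapse one dimension. As a short sketch, here is how `sum` behaves on the 3x2 array used earlier, depending on the axis:

```python
import numpy as np

x = np.array([[0, 1], [2, 3], [4, 5]])   # shape (3, 2)

# sum along axis 0 (the outermost one): the 3 rows are added together
col_sums = x.sum(axis=0)
print(col_sums, col_sums.shape)   # [6 9] (2,)

# sum along axis 1 (the innermost one): one sum per row
row_sums = x.sum(axis=1)
print(row_sums, row_sums.shape)   # [1 5 9] (3,)

# without axis, all elements are summed
print(x.sum())                    # 15
```

The axis given to the function is the one that disappears from the shape.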

Very often, numpy arrays of a given shape are built by initializing all elements to a fixed number, or a random number. For example:

In [15]:
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [16]:
np.ones(3)

array([1., 1., 1.])

In [17]:
np.ones_like(x)

array([[1],
       [1],
       [1],
       [1]])

In [52]:
np.random.rand(2,2)

array([[0.78675656, 0.84853769],
       [0.26910344, 0.95098797]])

Many more [random sampling tools](https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html) are available. 
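
Two more variations can be useful, sketched below under the assumption that you have numpy 1.17 or later: `np.full` fills an array with any fixed number, and `np.random.default_rng` creates a random generator that can be seeded so that results are reproducible. The seed value 42 is arbitrary.

```python
import numpy as np

# array of a given shape filled with a fixed number
filled = np.full((2, 3), 7.5)
print(filled)

# seeded random generator: the same seed gives the same numbers
rng = np.random.default_rng(seed=42)
uniform = rng.random((2, 2))                            # uniform in [0, 1)
normal = rng.normal(loc=0.0, scale=1.0, size=(2, 3))    # gaussian

# a second generator with the same seed reproduces the first draw
again = np.random.default_rng(seed=42).random((2, 2))
print(np.array_equal(uniform, again))
```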

## Numpy array indexing

### Basic indexing

Here is a 1D array: 

In [20]:
x = np.arange(10) + 1
x

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Elements can be accessed directly, using their index in the array (the index starts at 0):

In [21]:
x[1]

2

And, as usual in python sequences, negative indices count from the end: 

In [24]:
x[-2]

9

The array can be modified in place: 

In [26]:
print(id(x))
x[1] = 0
print(id(x))
print(x)

4748105008
4748105008
[ 1  0  3  4  5  6  7  8  9 10]


For multidimensional arrays, basic indexing is done by specifying a comma separated list of indices: 

In [27]:
x = np.zeros((2,3))
x[0,1] = 1
x

array([[0., 1., 0.],
       [0., 0., 0.]])
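
If fewer indices than dimensions are given, numpy returns the corresponding sub-array instead of a single element. A small sketch on a 2x3 array:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

# a single index selects along the first dimension: here, the second row
row = x[1]
print(row)          # [3 4 5]

# one index per dimension selects a single element
print(x[1, 2])      # 5
# chained indexing gives the same element, in two steps
print(x[1][2])      # 5
# negative indices work on every dimension
print(x[-1, -1])    # 5
```

The comma separated form `x[1, 2]` is the usual one, as it does the selection in one step.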

### Selection with boolean indexing

Indexing can be used to select array elements according to a mask. 

Let's create our 1D array again:

In [28]:
x = np.arange(10) + 1
x

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

To create a mask, we evaluate a boolean expression for each element in the array. For example, to find all even numbers: 

In [29]:
x%2 == 0

array([False,  True, False,  True, False,  True, False,  True, False,
        True])

What does this expression mean? 

Since `x` is a numpy array, `x%2` is a numpy element-wise operation that evaluates `%2` on all elements of the array, and returns a new array with the results: 

In [30]:
xmod = x%2
xmod

array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

Then, we select even numbers by asking the modulo to be equal to zero. Again, the `==` operator is applied to a numpy array, so it is an element-wise operation:  

In [32]:
mask = (xmod == 0)
mask

array([False,  True, False,  True, False,  True, False,  True, False,
        True])

With this mask, we select even numbers and return a new array: 

In [33]:
x[mask]

array([ 2,  4,  6,  8, 10])

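Note that the new array is a copy of the selected elements, not a view on the original array: boolean indexing always copies the data. **Only basic indexing and slicing return views.**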
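
To check whether a selection shares its data with the original array, one option is `np.shares_memory`. The sketch below applies it to a boolean selection, and also shows that assigning through a mask, as in `x[mask] = 0`, does modify the original array in place:

```python
import numpy as np

x = np.arange(10) + 1
mask = (x % 2 == 0)

selected = x[mask]
print(np.shares_memory(x, selected))   # False: the selection has its own data

# modifying the selection leaves the original array untouched
selected[0] = -1
print(x[1])                            # still 2

# assigning through the mask writes into the original array
x[mask] = 0
print(x)                               # [1 0 3 0 5 0 7 0 9 0]
```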

It is possible to invert the mask: 

In [37]:
x[~mask]

array([1, 3, 5, 7, 9])

And of course it's possible to build masks on the fly, which is what is typically done: 

In [38]:
x[x%2==0]

array([ 2,  4,  6,  8, 10])

**In python data science, boolean indexing is used extensively to select data by applying thresholds on chosen variables.** 
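
Here is a sketch of what such a selection can look like. The two arrays are made-up measurements for five people, and the thresholds are arbitrary. Masks are combined with `&` (and), `|` (or) and `~` (not):

```python
import numpy as np

# hypothetical data: one entry per person in both arrays
height = np.array([1.62, 1.80, 1.75, 1.91, 1.58])   # in m
weight = np.array([55., 82., 68., 95., 49.])        # in kg

tall = height > 1.70
heavy = weight > 80.

# the mask built on one variable can select values of another one
print(weight[tall])            # weights of the tall people: 82, 68, 95

# combined thresholds
print(height[tall & heavy])    # heights 1.80 and 1.91
print(height[tall & ~heavy])   # height 1.75
```

When the masks are built on the fly, each condition needs its own parentheses, as in `height[(height > 1.70) & (weight > 80.)]`, because `&` binds more tightly than `>`.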

### Slicing

In basic python, a slice is defined by three numbers written as `start:stop:step`. It selects a sub-sequence of elements in a sequence (which, in python, is 1D by nature): 

In [60]:
lst = list(range(1, 10))
print(lst)
lst[1::2]

[1, 2, 3, 4, 5, 6, 7, 8, 9]


[2, 4, 6, 8]

We selected elements: 

* starting at index 1 (value 2);
* stopping after the end of the list, as `stop` is not specified. At this stage, the last element, 9, is included;
* in steps of 2. So 9 does not appear. 
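
Behind the colon notation, python builds a `slice` object, in which a missing field is `None`. The following sketch makes this explicit for the slice used above (the name `every_other` is mine):

```python
lst = list(range(1, 10))

# slice(start, stop, step): None means "not specified"
every_other = slice(1, None, 2)

print(lst[every_other])                 # [2, 4, 6, 8]
print(lst[every_other] == lst[1::2])    # True

# indices() shows the actual start, stop, step for a sequence of this length
print(every_other.indices(len(lst)))    # (1, 9, 2)
```

Here the unspecified `stop` is resolved to 9, the length of the list, which is one past the last valid index.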

---

*Exercise:*

Play with the slice definition in the cell above. Try to: 

* select all odd numbers with a slice 
* select all numbers greater than or equal to 5
* select all even numbers between 2 and 6

---

In the example above, we decided not to specify stop. This is possible for all fields: 

In [72]:
print( lst[:5:] )
print( lst[::2] )
print( lst[3::] )
print( lst[::] )

[1, 2, 3, 4, 5]
[1, 3, 5, 7, 9]
[4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9]


The notation for these can and should be simplified to: 

In [75]:
print( lst[:5] )
print( lst[::2] )
print( lst[3:] )
print( lst[:] )

[1, 2, 3, 4, 5]
[1, 3, 5, 7, 9]
[4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9]


---

*Exercise:*

Could you simplify these expressions further by removing more colons? What would happen if you did? Test your hypotheses in the above cell.

---

Numpy slicing is a simple generalization of python slicing to multiple dimensions. To test it, we create a 2D matrix with 4 rows and 5 columns. For this, we use the reshape method that will be discussed in the next section: 

In [76]:
x = np.arange(20).reshape(4,5)
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

We can now use the slice notation on any field of the multidimensional index. Here are a few examples: 

* select the second column: 

In [77]:
x[:, 1]

array([ 1,  6, 11, 16])

* select the first two columns: 

In [141]:
x[:, :2]

array([[ 0,  1],
       [ 5,  6],
       [10, 11],
       [15, 16]])

* select columns in steps of 2: 

In [79]:
x[:, ::2]

array([[ 0,  2,  4],
       [ 5,  7,  9],
       [10, 12, 14],
       [15, 17, 19]])

* reverse column order, by specifying a -1 step on the last dimension:

In [81]:
x[:, ::-1]

array([[ 4,  3,  2,  1,  0],
       [ 9,  8,  7,  6,  5],
       [14, 13, 12, 11, 10],
       [19, 18, 17, 16, 15]])
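
Slices can be used on several dimensions at once, and the result is a view on the original array, so writing into it changes the original. A short sketch on the same 4x5 matrix:

```python
import numpy as np

x = np.arange(20).reshape(4, 5)

# rows 1 and 2, columns 2 to the end
block = x[1:3, 2:]
print(block)
# [[ 7  8  9]
#  [12 13 14]]

# the slice shares its data with x
print(np.shares_memory(x, block))   # True

# so modifying the slice modifies x as well
block[0, 0] = -1
print(x[1, 2])                      # -1

# use copy() to get an independent array
independent = x[1:3, 2:].copy()
print(np.shares_memory(x, independent))   # False
```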

## Numpy array reshaping 

Typical use cases for reshaping in machine learning: 

* convert a 1D array of labels to a column vector
* convert a column vector back to a 1D array
* create batches 
* convert an image to greyscale 

In [123]:
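# assumption: x is a small 2x3 array here, not the 4x5 matrix of the slicing section
x = np.arange(6).reshape(2, 3)
y = x.flatten()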
y

array([0, 1, 2, 3, 4, 5])

In [73]:
x.ravel()

array([0, 1, 2, 3, 4, 5])

In [74]:
x.reshape(-1)

array([0, 1, 2, 3, 4, 5])

In [83]:
np.c_[y]

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5]])
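
The first two use cases listed at the top of this section can be sketched with `reshape`, where `-1` asks numpy to work out the length of that dimension. The `labels` array is a made-up example. The last lines check which of `flatten`, `ravel` and `reshape(-1)` copy the data for this particular array:

```python
import numpy as np

labels = np.array([0, 1, 2, 3, 4, 5])    # shape (6,)

# 1D array to column vector: as many rows as needed, 1 column
column = labels.reshape(-1, 1)
print(column.shape)                      # (6, 1)

# equivalent notation, adding a new axis of length 1
same = labels[:, np.newaxis]
print(np.array_equal(column, same))      # True

# column vector back to 1D
back = column.ravel()
print(back.shape)                        # (6,)

x = np.arange(6).reshape(2, 3)
print(np.shares_memory(x, x.flatten()))    # False: flatten always copies
print(np.shares_memory(x, x.ravel()))      # True here: ravel avoids the copy when it can
print(np.shares_memory(x, x.reshape(-1)))  # True here, for the same reason
```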

In [106]:
np.r_[1:10:2]

array([1, 3, 5, 7, 9])

In [100]:
a = np.arange(6).reshape(2,3)
b = np.arange(6,12).reshape(2,3)
print(a)
print(b)

[[0 1 2]
 [3 4 5]]
[[ 6  7  8]
 [ 9 10 11]]


In [102]:
np.r_[a,b]

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [121]:
np.r_['0,2,0', a, b]

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [122]:
np.c_[a, b]

array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11]])

In [120]:
np.r_['-1,2,0', a, b]

array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11]])
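
The same results can be obtained with functions that take the axis explicitly, which some people find easier to read than `np.r_` and `np.c_`. A sketch with `np.concatenate`, `np.vstack` and `np.hstack`, checked against the results above:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

# stack along the first axis: more rows
rows = np.concatenate([a, b], axis=0)
print(rows.shape)                                 # (4, 3)
print(np.array_equal(rows, np.r_[a, b]))          # True
print(np.array_equal(rows, np.vstack([a, b])))    # True

# stack along the last axis: more columns
cols = np.concatenate([a, b], axis=1)
print(cols.shape)                                 # (2, 6)
print(np.array_equal(cols, np.c_[a, b]))          # True
print(np.array_equal(cols, np.hstack([a, b])))    # True
```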

## Practical examples

### Image preparation 

* rescale color levels
* turn to greyscale 
* standardize 

### Label management


### Batch creation 


### Standardization




Image of 2x2 pixels with 3 colors: 

In [147]:
x = np.arange(12).reshape(2, 2, 3)
x

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])

Select first color:

In [149]:
x[..., 0]

array([[0, 3],
       [6, 9]])

Turn to greyscale by averaging the three color levels:

In [154]:
np.mean(x, -1)

array([[ 1.,  4.],
       [ 7., 10.]])
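
The three image preparation steps listed at the top of this section can be chained as in the sketch below. I assume here that the image is stored as 8-bit unsigned integers, and that standardization is done over all pixels of a single image. In a real project, the mean and standard deviation are often computed over the whole training sample instead.

```python
import numpy as np

# same toy image as above, stored as 8-bit unsigned integers (assumption)
x = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# rescale color levels from [0, 255] to [0, 1]; the result is a float array
scaled = x / 255.
print(scaled.dtype, scaled.min(), scaled.max())

# turn to greyscale by averaging over the last axis (the color channels)
grey = scaled.mean(axis=-1)
print(grey.shape)     # (2, 2)

# standardize: subtract the mean and divide by the standard deviation
standardized = (grey - grey.mean()) / grey.std()
print(standardized.mean(), standardized.std())   # close to 0 and 1
```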