# Introduction to `numpy`

This Notebook provides an overview of the capabilities of the `numpy` module. It covers Sect. II of [Modules_in__python.ipynb](Modules_in__python.ipynb). 

## Table of Contents

- [II. Numpy](#II)
    * [II.1 Array Definition and construction](#II.1)
    * [II.2 Array copies and views](#II.2)
    * [II.3 Shape manipulation](#II.3)
    * [II.4 What makes numpy Arrays useful structures ?](#II.4)
        - [II.4.1 ufunc](#II.4.1)
        - [II.4.2 Aggregation](#II.4.2)
        - [II.4.3 Broadcasting](#II.4.3)
        - [II.4.4 Slicing, masking, fancy indexing](#II.4.4)
    * [II.5 Reading arrays from a file and string formatting](#II.5)
    * [II.6 Summary](#II.6)
    * [II.7 References](#VI)

## II. `numpy`:  <a class="anchor" id="II"></a>

`numpy` can be seen as the implementation of mathematical functions and operations for the Python language. It also introduces one key object: the `array`. 

### II.1 `array` definition and construction:  <a class="anchor" id="II.1"></a>

- A `numpy` array is an object of the type `np.ndarray` (although this type specifier is rarely used directly). Instead one can create arrays in several ways: 

``` python
import numpy as np
np.array([1,2,3,4])   # creates an array from a python list
np.array([[0, 1, 2], [3, 4, 5]])   # Creates a 2D array from a python list
np.arange(5) # similar to the built-in range() function.
np.linspace(1, 10, 10) # creates an array of 10 elements from 1 to 10
np.zeros(10)  # creates an array of 10 elements filled with 0
np.ones(10)   # creates an array of 10 elements filled with 1
np.zeros((2, 5))  # multidimensional array with 2 rows and 5 columns
```
- 2-D arrays of `shape=(r, c)` are arrays with `r` *rows* and `c` *columns*. 

In [None]:
# Let's try the above commands and visualise the output. 
import numpy as np
#a = np.arange(0,5,0.7)
a = np.zeros((3, 4))
print a
a[2, 3] = 1
print a 

- numpy also has tools to create arrays filled with random elements:

``` python
np.random.random(size=4)  # uniform between 0 and 1
np.random.normal(size=4)  # elements are std-normal distributed

```

In [None]:
# Try out the above commands 

- You can explicitly specify which **data-type** you want:

``` python 
c = np.array([1, 2, 3], dtype=float)
c.dtype
    Out: dtype('float64')
```

In [None]:
# Try out the above commands 

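Functions such as `np.zeros()`, `np.ones()` and `np.linspace()` return floating-point arrays by default, while `np.array()` infers the data type from its input. Other possible data types are: 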

* **COMPLEX** numbers: 
``` python
d = np.array([1+2j, 3+4j, 5+6*1j])
d.dtype
    Out: dtype('complex128')
```

* **BOOL**:
``` python
e = np.array([True, False, False, True])
e.dtype
    Out: dtype('bool')
```

In [None]:
# Try out the above commands 

* **String**:
``` python
f = np.array(['abc', 'eddafg', 'hjk'])
f.dtype
    Out: dtype('S6')   # <--- string of 6 characters (the length of the longest element); Python 3 reports dtype('<U6')
```

* **Other data types**:  `int32`, `int64`, `uint32`, `uint64`
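
How numpy picks a data type when none is specified can be checked with a short sketch. The variable names below are illustrative only, and the exact integer type inferred for `ints` depends on the platform (it is typically `int64`).

```python
import numpy as np

ints = np.array([1, 2, 3])                   # all integers: an integer dtype is inferred
floats = np.array([1, 2.5, 3])               # one float is enough to upcast every element
zeros = np.zeros(3)                          # np.zeros / np.ones default to float64
small = np.array([1, 2, 3], dtype=np.int32)  # explicit request

print(ints.dtype)    # platform-dependent integer type
print(floats.dtype)  # float64
print(zeros.dtype)   # float64
print(small.dtype)   # int32
```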

Note that `type(f)` tells you that `f` is a numpy array, while `f.dtype` gives you the *type of the elements* contained in `f`. `dtype` is an attribute of `np.ndarray` objects. If you try to access the attribute `dtype` of a list, you will get an error message (an `AttributeError`). 

In [None]:
# Difference between type/dtype; application to List/arrays.
f = np.array(['abc', 'eddafg', 'hjk'])
print type(f)
print f.dtype
L = ['abc', 'eddafg', 'hjk']
print type(L)
print L.dtype

- Last but not least, `numpy` is also the package that allows you to calculate many common mathematical functions (see also [`ufunc`](#II.4.1)): `np.log10()` (base 10 log), `np.log()` (natural log), `np.exp()`, `np.sin()`, `np.cos()`, etc. See the list of `numpy` mathematical functions [here](https://docs.scipy.org/doc/numpy/reference/routines.math.html).

In [None]:
# create an array of floats and calculate its log / sin / ... 


**Exercise:**   
For the array:
``` python
a = np.array([[1,2,3,4], [4,5,6,7], [2,3,4,5] ])
```
- What is the output of `a.ndim`, `a.shape`, `len(a)` ?     
- How do the above commands relate to the rows, columns, dimensions ?       
- How do you access the 2nd item of the first row ?   

*Note:* 
Try to do the same with the following array:
``` python
b = np.array([[1,2,3], []])
```

**Exercise:** Elementwise operations

In the code cell below, try simple arithmetic elementwise operations: 
- add even elements with odd elements using 2 different techniques
- Time the two solutions using `%timeit`.
- Generate an array from a list made of strings and floats. What is the final array type ?
- Generate 2 arrays such that their elements are as follows:    
   `[2**0, 2**1, 2**2, 2**3, 2**4]`    
   `a_i = 2^(3*i) - i `    


### II.2 `array` copies and views:   <a class="anchor" id="II.2"></a>

A slicing operation creates a **view** on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use `np.may_share_memory()` to check if two arrays share the same memory block.
To avoid this behaviour and create a brand new array from a slice *without any link to the original*, use the method `copy()`: `c = a[0:2].copy()` creates a **new array** that is a **copy** of the first two elements of `a`. 

**When modifying the view, the original array is modified as well**. Try the cells below to understand how memory allocation works. 

In [None]:
import numpy as np
a = np.arange(10)
a

In [None]:
b = a[::2]
b

In [None]:
np.may_share_memory(a, b)

In [None]:
b[0] = 12
b

In [None]:
a   # (!)

In [None]:
a = np.arange(10)
c = a[::2].copy()  # force a copy
c[0] = 12
a

In [None]:
np.may_share_memory(a, c)

### II.3 Array shape manipulation <a class="anchor" id="II.3"></a>

- **II.3.1 Flattening**:    
The method `ravel()` flattens the array into a single-row array (each row of the array is merged with the previous one). 

In [None]:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
a.ravel()

In [None]:
a.T   # Transpose the array

In [None]:
a.T.ravel()

**Note**: `a.T` is an attribute of the array `a` that returns a transposed view of it, while `np.transpose(a)` (or the method `a.transpose()`) is a function call that returns the same transposed view. Accessing the attribute `a.T` avoids the overhead of a function call and therefore tends to be slightly quicker, as you can test using the `%timeit` magic command. For N-dimensional arrays, `transpose()` allows a bit more than just transposing (see II.3.4 below).

In [None]:
%timeit a.transpose()
%timeit a.T

- **II.3.2 Reshaping**:   
The method `reshape(newshape)` allows one to reorganise the elements of an array, to create a "new" array (see below) that has a different shape. The total number of items of the array has to be the same ! 

In [None]:
a.shape

In [None]:
b = a.ravel()
b = b.reshape((2, 3))
b

In [None]:
# Alternatively 
a.reshape((2, -1))    # unspecified (-1) value is inferred

**WARNING:** Reshaping may return a **view** or a **copy** !

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6]])
b=a.ravel()
b=b.reshape((2,3))
# Let's modify b and show a to see if we have a view or a copy ... 
b[0, 0] = 99
a


In [None]:
# let's now create an array with np.zeros and reshape it 
a = np.zeros((3, 2))
b = a.T.reshape(3*2)
b[0] = 9
a


To understand this you need to learn more about the memory layout of a numpy array. This is beyond the scope of this class. 
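
Without going into the details of memory layout, whether a given reshape produced a view or a copy can be checked directly. The sketch below uses `np.shares_memory()` (a stricter companion of `np.may_share_memory()`); the variable names `v` and `c` are illustrative.

```python
import numpy as np

a = np.zeros((3, 2))

# Reshaping a C-contiguous array only needs new shape metadata: a view is returned.
v = a.reshape(6)
print(np.shares_memory(a, v))   # True

# a.T is a view with swapped strides; flattening it in row-major order
# cannot be expressed with strides alone, so numpy has to copy the data.
c = a.T.reshape(6)
print(np.shares_memory(a, c))   # False

v[0] = 9   # changes a[0, 0] because v is a view
c[1] = 5   # leaves a untouched because c is a copy
print(a)
```

When in doubt, calling `.copy()` explicitly removes the ambiguity at the cost of extra memory.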

- **II.3.3 Adding a dimension**:

Indexing with the `np.newaxis` object allows us to add an axis to an array. You can also use the `reshape` method.  

In [None]:
z = np.array([1, 2, 3])
z

In [None]:
z[:, np.newaxis]

In [None]:
z[:, np.newaxis].shape

In [None]:
z[np.newaxis, :]

In [None]:
z[np.newaxis, :].shape

In [None]:
# An alternative is to reshape your array
y = np.array([1, 2, 3])

# When one shape dimension is -1, the value is inferred from the length of the array and remaining dimensions.
y = y.reshape((-1,1))   
y.shape

In [None]:
y = np.array([1, 2, 3])
y = y.reshape((1,-1))
y.shape

- **II.3.4. Dimension shuffling**:

In [None]:
a = np.arange(4*3*2).reshape(4, 3, 2)
a.shape

In [None]:
a[0, 2, 1]

In [None]:
b = a.transpose(1, 2, 0)
b.shape

In [None]:
b[2, 1, 0]

In [None]:
# Check that shuffling dimensions creates a view of the array

- **II.3.5. Resizing**: 

The size of an array can be changed in place with `ndarray.resize`:

In [None]:
a = np.arange(4)
a.resize((8,))   # you give as argument the new shape of the array
a


However, the array must not be referenced anywhere else; otherwise `resize` raises a `ValueError`:

In [None]:
b = a
a.resize((4,))   
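
As a sketch of what happens in that situation, the failure can be caught explicitly. The function `np.resize()` is shown as one possible alternative; note that it behaves differently from the method, since it returns a new array and fills the extra slots by repeating the input.

```python
import numpy as np

a = np.arange(4)
b = a                      # second reference to the same array

try:
    a.resize((8,))         # in-place resize is refused while b refers to a
    resized_in_place = True
except ValueError as err:
    resized_in_place = False
    print("resize failed: %s" % err)

# np.resize (the function) returns a new array and repeats the input
# to fill the extra slots, instead of padding with zeros.
c = np.resize(a, (8,))
print(c)                   # [0 1 2 3 0 1 2 3]
```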

**Exercises:**

- Use flatten as an alternative to ravel. What is the difference? (Hint: check which one returns a view and which a copy)
- Experiment with transpose for dimension shuffling.


- **II.3.6. Meshgrid**: 

A very useful function that returns coordinate matrices from coordinate vectors. This is extremely useful when you want to evaluate a function on a grid (i.e. $z = f(x, y)$) ... which is something very common in observational astronomy ! This is also useful when you want to do contour plots (to e.g. interpolate over a regular grid). 

The way to proceed is to define your `x` and `y` vectors (corresponding to the (x,y) coordinates on a grid) and pass them to `meshgrid()`:
``` python
import numpy as np
x_vec, y_vec = np.linspace(0, 5, 6), np.linspace(0, 5, 3)
X, Y = np.meshgrid(x_vec, y_vec)

# Now you can evaluate the function z = (x**2 + y**2)
Z = X**2 + Y**2
```

`X` and `Y` created with `meshgrid()` are now arrays of shape (3, 6) (3 rows and 6 columns) containing respectively the x coordinate (for `X`) and the y coordinate (for `Y`) of each grid element. This can be generalised to larger dimensions !

So, the array `Z` of shape (3,6) corresponds to points with the following coordinates:

['(0.0,0.0)', '(1.0,0.0)', '(2.0,0.0)', '(3.0,0.0)', '(4.0,0.0)', '(5.0,0.0)']   
['(0.0,2.5)', '(1.0,2.5)', '(2.0,2.5)', '(3.0,2.5)', '(4.0,2.5)', '(5.0,2.5)']   
['(0.0,5.0)', '(1.0,5.0)', '(2.0,5.0)', '(3.0,5.0)', '(4.0,5.0)', '(5.0,5.0)']   

**Note:**
This function supports both indexing conventions through the `indexing` keyword argument.  Giving the string 'ij' returns a meshgrid with matrix indexing, while 'xy' (the default) returns a meshgrid with Cartesian indexing. In the 2-D case with inputs of length M and N, the outputs are of shape (N, M) for 'xy' indexing and (M, N) for 'ij' indexing.  In the 3-D case with inputs of length M, N and P, outputs are of shape (N, M, P) for 'xy' indexing and (M, N, P) for 'ij' indexing. In other words, in the 2-D case 'ij' indexing yields the transpose of the arrays obtained with 'xy' indexing. See `help(np.meshgrid)` for more details. 
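
The effect of the `indexing` keyword can be verified with a short sketch that reuses the vectors defined above (the names `X_xy`, `X_ij`, etc. are illustrative):

```python
import numpy as np

x_vec = np.linspace(0, 5, 6)   # M = 6 values along x
y_vec = np.linspace(0, 5, 3)   # N = 3 values along y

X_xy, Y_xy = np.meshgrid(x_vec, y_vec)                  # default: indexing='xy'
X_ij, Y_ij = np.meshgrid(x_vec, y_vec, indexing='ij')

print(X_xy.shape)   # (3, 6), i.e. (N, M)
print(X_ij.shape)   # (6, 3), i.e. (M, N)

# In 2-D the two conventions are transposes of each other.
print(np.array_equal(X_ij, X_xy.T))   # True
print(np.array_equal(Y_ij, Y_xy.T))   # True
```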

In [None]:
# Experiment with "meshgrid()" following the code above. 
import numpy as np
x_vec, y_vec = np.linspace(0, 5, 6), np.linspace(0, 5, 3)
X, Y = np.meshgrid(x_vec,y_vec)
def gauss2D(X, Y):
    return np.exp( -0.5*(X**2 + Y**2) )

gauss2D(X, Y)
# Try to write a command that prints at the screen the coordinates of the grid elements (as above) (TIP: you do not need meshgrid)

**Exercise**: 

We will use meshgrid [later](Modules_in__python_matplotlib.ipynb#meshgrid), after we have learned how to visualise results with `python`. 

### II.4 What makes `numpy` arrays useful structures ?  <a class="anchor" id="II.4"></a>

Python is fast *for coding and developing* but slow when it comes to *execution*, especially the execution of `for` loops.    
One reason for this low speed is dynamic typing: in a loop such as `for a in range(10): a + b`, the interpreter has to check the `type` of `a` and of `b` at *every iteration* before executing the operation. 

`numpy` helps speed up code through 4 strategies:
1. `ufunc`
2. aggregation
3. broadcasting
4. slicing, masking, fancy indexing

#### II.4.1 `ufunc`: operates elementwise on objects. <a class="anchor" id="II.4.1"></a>

`ufunc`s (universal functions) are compiled functions included in `numpy`. They include: 

- all arithmetic operations: +, -, /, *, `**`, 
- mathematical functions: sin, exp, cos, log10, ... 
- comparison operators: <, >, ==, ...
- etc ... 

**Example:**
``` python
import numpy as np
# Basic python
a = [1,2,3,4,5]
b = [val + 5 for val in a]    # add 5 to each element of the list
# In numpy
a = np.array(a)
b = a + 5                     # add 5 to each element of the array.
```

In [None]:
# implement the above example for a list of 1000 elements 
# use %timeit before calculating b to see improvement in speed


#### II.4.2 *aggregation*:   <a class="anchor" id="II.4.2"></a>

Functions which summarize values of an array such as `min`, `max`, `sum`, `mean`, ... 

**Example:**

``` python
# python version of an aggregation
from random import random
c = [ random() for i in range(10000) ]
%timeit min(c)
#same in numpy:
c = np.array(c)
%timeit c.min()  
```
This also works on multidimensional arrays: 

``` python 
M = np.random.randint(0, 10, (10,4))
M.sum(axis=0)
M.sum(axis=1)
```

Available aggregations: 
`np.min()`, `np.max()`, `np.prod()`, `np.mean()`, `np.std()`, `np.median()`, `np.any()`, `np.all()`, `np.nanmin()` (and nan versions of the above aggregations), `np.argmin()`, `np.argmax()`, `np.percentile()`, ...
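
As a small illustration of the `axis` argument and of the nan-aware variants, here is a sketch on a deterministic array (chosen instead of random numbers so that the results can be checked by hand):

```python
import numpy as np

M = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

print(M.sum())         # 66, sum over all elements
print(M.sum(axis=0))   # [12 15 18 21], one value per column (rows are collapsed)
print(M.sum(axis=1))   # [ 6 22 38], one value per row (columns are collapsed)

# NaN-aware versions ignore missing values instead of propagating them.
d = np.array([1.0, np.nan, 3.0])
print(d.mean())        # nan
print(np.nanmean(d))   # 2.0
```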


In [None]:
import numpy as np
from random import random
c = [ random() for i in range(1000) ]
%timeit min(c)
#same in numpy:
c = np.array(c)
%timeit c.min() 

#### II.4.3 *Broadcasting*:   <a class="anchor" id="II.4.3"></a>

Set of rules by which `ufuncs` operate on arrays of different sizes and/or dimensions. 

The term [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) describes how `numpy` treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
Application to three cases: 

![From astroML book](Figures/fig_broadcast_visual_1.png)



The rules / how this works:

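* If the arrays have a different number of dimensions, left-pad the shape of the smaller one with 1s 
* If the sizes along a dimension do not match, stretch the array whose size is 1 along that dimension to match the other
* If the sizes along a dimension do not match and neither is 1, raise an error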

This broadcasting strategy allows one to avoid doing `for` loops for some operations. 
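
The rules above can be sketched on three small cases (a scalar with a 1-D array, a 2-D with a 1-D array, and a column with a row), followed by a case that fails. The array names are illustrative.

```python
import numpy as np

a = np.arange(3)                     # shape (3,)
col = np.arange(3)[:, np.newaxis]    # shape (3, 1)
M = np.ones((3, 3))                  # shape (3, 3)

print((a + 5).shape)     # (3,): the scalar is stretched along a
print((M + a).shape)     # (3, 3): (3,) -> (1, 3) -> (3, 3)
print((col + a).shape)   # (3, 3): (3, 1) and (1, 3) are both stretched
print(col + a)

# Shapes (3,) and (4,) disagree and neither is 1: numpy raises a ValueError.
try:
    a + np.arange(4)
    failed = False
except ValueError:
    failed = True
print(failed)   # True
```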


#### II.4.4 Slicing, masking and fancy indexing:    <a class="anchor" id="II.4.4"></a>
	 
- **Mask**: a mask is a boolean array that can be used to "mask" some indices of an array: 

``` python
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]
    Out: array([6, 10])
    
mask = (c < 4) | (c > 8)
c[mask]
    Out: array([1, 3, 9, 10, 2])
```
 

In [None]:
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]

- **Fancy indexing**: passing a list/array of indices to get elements of a numpy array (this only works for arrays, not for lists !). This avoids looping over the indices. 

``` python
ind = [1, 3, 4]
c[ind]  
   Out: array([3, 9, 10])
```

In [None]:
ind = [1, 3, 4]
print c
print c[ind]

- **Multi-dimensional** array: 

We can apply mask and fancy indexing in multidimension.   
Remember that first index is row, and second is column.   
Remember how slicing works: `a[start:end:step]`   : 
- Omitting `start` begins at the first element, and omitting `end` goes up to the end of the sequence (for a positive step). 
- Omitting the second colon implies step=1.  
- With a negative step you count backward.
- `start`/`end` can be either positive or negative indices (negative indices count from the end). 
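
A few concrete slices of a 1-D array illustrate these rules; this is only a sketch, and the comments give the result of each expression:

```python
import numpy as np

a = np.arange(10)     # [0 1 2 3 4 5 6 7 8 9]

print(a[2:7])     # [2 3 4 5 6]   end index is excluded, step defaults to 1
print(a[:4])      # [0 1 2 3]     omitted start: begin at the first element
print(a[6:])      # [6 7 8 9]     omitted end: go up to the last element
print(a[::3])     # [0 3 6 9]     every third element
print(a[-3:])     # [7 8 9]       negative index counts from the end
print(a[::-1])    # [9 8 7 6 5 4 3 2 1 0]   negative step walks backwards
print(a[8:2:-2])  # [8 6 4]       start must be larger than end when step < 0
```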

In [None]:
a = np.arange(10)
print a
a[a>3]

``` python
M = np.arange(12).reshape((3,4))
    Out: 
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])

M[0,1] # gives value at row 0 and column 1. 
M[:, 1]  # combines slices and indices -> all rows of column 1
M[M-3 < 2]  # masking also works on n-dimensional arrays
M[[1,0], :2] # fancy indexing and slicing -> first 2 elements of rows 1 and 0
M[M.sum(axis=1) > 2, 2:] # mixing masking and slicing -> last 2 columns of the rows whose sum exceeds 2
```

An illustration of indexing in numpy arrays:
![Illustration of `np` indexing](Figures/numpy_indexing.png)

**Exercise**:
- Try the different flavours of slicing, using start, end and step: starting from a linspace, try to obtain odd numbers counting backwards, and even numbers counting forwards.

- Reproduce the slices in the diagram above. You may use the following expression to create the array:    
`np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]`

In [None]:
# Implement the exercise above
np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]

### II.5 Reading arrays from a file and string formatting:    <a class="anchor" id="II.5"></a>

Reading tables saved in a formatted text file can be done with `numpy.loadtxt('myfile.txt')`, while saving your array is done with `numpy.savetxt('myfile.txt', myarray)`.   
Clever loading of text/csv files: `numpy.genfromtxt()`/`numpy.recfromcsv()`. Those commands can fill missing values in a table, read column names, exclude some columns, and guess data-type using `dtype = None`.   
Fast and efficient, but numpy-specific, binary format: `numpy.save()`/`numpy.load()`.
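
The round trip through these functions can be sketched as follows. The file names and the scratch directory are hypothetical and only serve to make the example self-contained.

```python
import os
import tempfile
import numpy as np

data = np.array([[1.0, 2.5], [3.0, 4.5], [5.0, 6.5]])

tmpdir = tempfile.mkdtemp()                     # scratch directory for the demo
txt_path = os.path.join(tmpdir, 'table.txt')    # hypothetical file names
npy_path = os.path.join(tmpdir, 'table.npy')

# Text format: human readable, one row per line.
np.savetxt(txt_path, data, fmt='%.3f', header='x y')
loaded = np.loadtxt(txt_path)                   # lines starting with '#' are skipped
print(np.allclose(loaded, data))                # True

# Binary .npy format: compact and exact, but numpy-specific.
np.save(npy_path, data)
print(np.array_equal(np.load(npy_path), data))  # True

# genfromtxt copes with missing entries: the empty field becomes nan.
csv_path = os.path.join(tmpdir, 'gaps.csv')
with open(csv_path, 'w') as f:
    f.write('1,2,3\n4,,6\n')
gaps = np.genfromtxt(csv_path, delimiter=',')
print(gaps)
```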

There is another flexible way to read from/write to a file, which is through the file object returned by `open()`. For this, three operations are generally needed (open, read or write, close): 
``` python
f = open('myfile.txt', 'r')  # 'r' for read mode, 'w' for write mode, 'a' for append mode
f.read()  # reads the whole file as a single string; other methods allow more flexible reading
f.close() 
```
If you do `f.read()` twice, the second call returns an empty string: after the first call the file object "points" to the end of the file, and there is nothing left to read. The methods that access the file object go sequentially through its content. With `read()` you take the content as a single string (which could be a problem memory-wise if the file is large !).    

There are several ways to read a file line by line instead. One is by using a `for` loop:
``` python
f = open('myfile.txt', 'r')
for line in f:
    print repr(line)
```

In [None]:
f = open('data.txt', 'r')
for line in f:
    print repr(line)   # repr(object) return the canonical string representation of the object

In [None]:
f = open('data.txt', 'r')
#for line in f.readlines():
#    print repr(line)
a = f.readlines()
a[10].replace('.', ',')

Each line is being returned as a string. Notice the \n at the end of each line - this is a line return character, which indicates the end of a line.

Alternatively, you could also do:
``` python
f = open('myfile.txt', 'r')
for line in f.readlines():
    print repr(line)
```
BUT `f.readlines()` actually reads in the whole file and splits it into a **list** of lines (while `for line in f` reads one line at a time), so for large files this can be memory intensive. The `for line in f` loop is therefore preferred.     
 
Once a line is read, it is possible to apply string methods, as on normal string:    
- Remove `\n`: `line.strip()`
- Split the string into list of strings: `line.split()`
- Replace a specific character by another: `line.replace(',', '.')`  # replace comma by a dot.
- Access a specific element of a split list and convert it to float: `float(line.split()[2])`
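
Put together, these string methods allow a simple table to be parsed by hand. The sketch below writes a small hypothetical file first so that it is self-contained, and it uses the `with` statement (not introduced above), which closes the file automatically.

```python
import os
import tempfile

# Write a small hypothetical table so that the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('# name x y\n')
    f.write('star1 1,5 2,25\n')
    f.write('star2 3,0 4,75\n')

rows = []
with open(path, 'r') as f:          # the file is closed at the end of the block
    for line in f:                  # one line at a time: memory friendly
        line = line.strip()         # remove the trailing '\n'
        if line.startswith('#'):    # skip the header line
            continue
        fields = line.replace(',', '.').split()   # decimal comma -> dot, then split
        rows.append([float(fields[1]), float(fields[2])])

print(rows)   # [[1.5, 2.25], [3.0, 4.75]]
```

Passing `rows` to `np.array()` then gives a 2-D numpy array of floats.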

To write a file, you basically follow the same procedure: 
``` python
f = open('myfile.txt', 'w')
f.writelines(mylist_of_lines)   # mylist_of_lines contains the lines you want to write; make sure each of them ends with '\n'

# Equivalently, join the lines into a single string and write it in one call:
# f.write(''.join(mylist_of_lines))
f.close()
```

**Exercise:**

Read the file `data.txt` and display some columns you care about for that file using:
- the file object
- Try to do the same using `numpy.loadtxt()`  
- Try to do the same using `numpy.genfromtxt()`.   
Bonus:      
- Try to build a numpy array with the data in data.txt as read using f = open('data.txt'). 
- Modify 1 column of the file (replace it with 0) and write the results in `data_new.txt`

**Note:**

Those methods/functions for reading ASCII files are not always optimal for reading tables containing both strings and floats. Other packages, such as `pandas` and `astropy`, offer more flexible functions to read a large variety of table formats.    

#### Formatting Strings

It often happens that you do not need all the decimals of a number, or that you would like to display it in scientific notation. For that purpose, use the `%` operator to specify how a variable is formatted when it is printed on screen or written to a file. The variable does not appear inside the string: it follows the string, preceded by `%`. Within the string, a format specifier such as `%f` (float) or `%e` (scientific notation) marks where the value is inserted. For example, `'%.2f' % variable` formats `variable` as a float with 2 digits after the dot. To format several variables, put one format specifier per variable in the string and pass the variables as a tuple after the `%` operator; specifiers and variables are matched in order.   

Example:
``` python
print '%i is the square of %i' %(4.000, 2)
    Out: 4 is the square of 2
```

In [None]:
a, b = 5.00000, 2.32425
print '%.i is an integer and %.3f is a float' %(a,b)
print '%i' %a, 'in scientific notation: %.3e' %b 

Here are some commonly used formatting characters:
- `%s`: String (or any object with a string representation, like numbers)
- `%d` or `%i`: Integers
- `%.<number_of_digits>f`: Floating point numbers with fixed number of digits to the right of the dot. 
- `%.<number_of_digits>e`: scientific notation with fixed number of digits to the right of the dot.

You may find more about string formatting in the [Python 2 documentation](https://docs.python.org/2/library/stdtypes.html#string-formatting).  
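
The formatting characters listed above can be compared side by side on a single number. This is a minimal sketch with illustrative variable names; the field-width form `%10.1f` is an extra not listed above.

```python
value = 1234.56789

s_str = '%s' % value      # generic string representation
s_int = '%d' % value      # integer part only (truncated, not rounded)
s_fix = '%.2f' % value    # fixed-point, 2 digits after the dot
s_sci = '%.2e' % value    # scientific notation, 2 digits after the dot
s_pad = '%10.1f' % value  # right-aligned in a field of 10 characters

print(s_str)         # 1234.56789
print(s_int)         # 1234
print(s_fix)         # 1234.57
print(s_sci)         # 1.23e+03
print(repr(s_pad))   # '    1234.6'

# Several values: one specifier per value, values passed as a tuple.
s_multi = '%s has mass %.1e kg' % ('Earth', 5.972e24)
print(s_multi)       # Earth has mass 6.0e+24 kg
```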

**Note**: There is another very useful way in Python to save "full objects" and to access and use them later with all their characteristics. This can be done by importing the `pickle` [module](https://docs.python.org/2/library/pickle.html), or even better (faster) [cPickle](http://docs.python.org/library/pickle.html#module-cPickle). To write an object into a pickle file, open the file in binary write mode (`pkl_file = open(filename, 'wb')`), call `pickle.dump(obj, pkl_file, protocol=-1)`, and close the file (`pkl_file.close()`). To read an object saved in a pickle file, follow the same procedure (opening the file in mode `'rb'`) but use `obj = pickle.load(pkl_file)` instead of `pickle.dump()`. The `pandas` module also allows you to read/write pickle objects: see `pandas.read_pickle()` and `pandas.to_pickle()`.
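
A minimal sketch of this dump/load round trip is given below. The dictionary and the file name are hypothetical, and the plain `pickle` module is used so that the example does not depend on `cPickle` being available.

```python
import os
import pickle
import tempfile

obj = {'name': 'M31', 'flux': [1.5, 2.5, 3.5]}   # any picklable Python object

pkl_path = os.path.join(tempfile.mkdtemp(), 'obj.pkl')   # hypothetical file name

pkl_file = open(pkl_path, 'wb')          # binary mode is required for binary protocols
pickle.dump(obj, pkl_file, protocol=-1)  # -1 selects the highest protocol available
pkl_file.close()

pkl_file = open(pkl_path, 'rb')
restored = pickle.load(pkl_file)
pkl_file.close()

print(restored == obj)   # True
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.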

In [None]:
# Create three float variables a, b, c and give them some value (e.g. a=2.3, b=3, c=-5). 
# Print the sentence: `a=2.30, b=3 and c=-5.00e+00` using the formatting described above.
a, b, c = 2.3, 3, -5
print "a = %.2f , b= %i and c= %.2e" %(a,b,c)

In [None]:
# Create a 1-D array of 5 floats and print their values as floats with 2 digits after the dot. TIP: use list comprehension
a = np.linspace(0,1,5)
print ['%.2f' %i for i in a]

### II.6 Summary:   <a class="anchor" id="II.6"></a>

What do you need to know to get started?

- Know how to create arrays : `np.array`, `np.arange`, `np.ones`, `np.zeros`.

- Know the shape of the array with `array.shape`, then use *slicing* to obtain different views of the array: `array[start:end:step]` (and variations around that syntax). Adjust the shape of the array using reshape or flatten it with ravel.

- Obtain a subset of the elements of an array and/or modify their values with masks (`a[a < 0] = 0`).

- Know miscellaneous operations on arrays, such as finding the mean or max (aggregations: `array.max()`, `array.mean()`). No need to retain everything, but have the reflex to search in the documentation (online docs, `help()`, `lookfor()`)!!

- Master the *indexing* with arrays of integers, as well as *broadcasting*. Know more NumPy functions to handle various array operations.

- Be able to read/write data from/to a file, and format numbers on screen (or when writing them into files): `open()`, `close()`, `np.savetxt()/np.loadtxt()`, use of the `%` operator. 


### II.7 References and supplementary material: <a class="anchor" id="VI"></a>

- Excellent video introducing numpy (and that inspired part of the numpy section of this notebook) by J. VanderPlas: https://www.youtube.com/watch?v=EEUXKG97YRw

- Numpy quick-start: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
