# Python modules & introduction to `numpy`, `scipy`, `matplotlib`

## I.  What is a module ?

We have seen the main structures/objects (e.g. `list`, `dictionary`) during the last lecture, as well as the methods attached to them (e.g. `list.append(val)`). We have also seen how to build functions (reusable blocks of code that let you automate some sequences of operations). 

If you quit from the Python interpreter and enter it again, the definitions you have made (functions and variables) are lost. Therefore, if you want to write a somewhat longer program, you are better off using a text editor to prepare the input for the interpreter and running it with that file as input instead. This is known as creating a script. As your program gets longer, you may want to split it into several files for easier maintenance. You may also want to use a handy function that you’ve written in several programs without copying its definition into each program.

To support this, Python has a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a module; definitions from a module can be imported into other modules or into the main module (the collection of variables that you have access to in a script executed at the top level and in calculator mode).

A module is a file containing Python definitions and statements. The file name is the module name with the suffix `.py` appended. When you want to make use of a module in a program or in an interactive window, you simply import it. For example, suppose that the file `myawesomemodule.py` contains:
``` python
def talk():
    print('What do you want master ?')
```


``` python 
import myawesomemodule
myawesomemodule.talk()   # call the function talk() you have defined in myawesomemodule.py
#   prints: What do you want master ?

# you can also do this the following way
import myawesomemodule as ms
ms.talk()

# or, if you want to only call the function talk:
from myawesomemodule import talk
talk()

```

**Exercise:**

Create a module `area` that calculates areas of simple geometric figures.   
Let's start with a square, and a circle.


In [10]:
# Import the module `area` you have created and calculate the area of a square of side s=2 and of a circle of radius r=2
import os
os.chdir('../Lecture_2')
!pwd
import myawesomemodule, area
myawesomemodule.talk()
print(area.square(2))
print(area.circle(2))

/mnt/hgfs/work_nb/Desktop-OSX/Ulg-Admin-Rbt/COURS-ENSEIGNEMENT-theses-master/Cours-Methodes-numeriques-programmation/Lectures/Lecture_2
What do you want master ?
4
12.5664


In Python, `.pyc` files are often generated with the same name as the module you are importing. Imagine you have a module `sayhello.py`. When you import it for the first time, a file `sayhello.pyc` is created that contains an already "byte-compiled" version of `sayhello.py`, to speed up later imports. The modification time of the version of `sayhello.py` used to create `sayhello.pyc` is recorded in `sayhello.pyc`, and the `.pyc` file is ignored if these don't match. A program does not run faster when read from a `.pyc` file instead of a `.py` file, but **it is loaded** more quickly.  

If you have imported your module in an `IPython` session and then modify it with an editor, your changes are not picked up automatically: importing it again does nothing, because the module is already loaded in memory. You need to [reload()](https://docs.python.org/2/library/functions.html#reload) your module to apply your changes in the current session (in Python 3, this function is `importlib.reload()`). 
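
The reloading behaviour described above can be sketched as follows. This is a minimal illustration that assumes Python 3 (where `reload` lives in `importlib`); the module name `demo_reload_mod` and its content are hypothetical and are written to a temporary directory only for the demonstration.

``` python
import importlib
import os
import shutil
import sys
import tempfile

# Write a tiny throwaway module (hypothetical name) into a temporary directory.
tmpdir = tempfile.mkdtemp()
module_path = os.path.join(tmpdir, 'demo_reload_mod.py')
with open(module_path, 'w') as f:
    f.write("def talk():\n    return 'first version'\n")

sys.path.insert(0, tmpdir)          # make the directory importable
import demo_reload_mod
before = demo_reload_mod.talk()     # 'first version'

# "Edit" the module on disk, as you would do in a text editor.
with open(module_path, 'w') as f:
    f.write("def talk():\n    return 'second version'\n")

import demo_reload_mod              # no effect: the module is already loaded
still_old = demo_reload_mod.talk()  # still 'first version'

importlib.reload(demo_reload_mod)   # re-executes the module source
after = demo_reload_mod.talk()      # 'second version'

# Clean up the demonstration files.
sys.path.remove(tmpdir)
shutil.rmtree(tmpdir)
```

In Python 2 the equivalent call is simply the built-in `reload(demo_reload_mod)`.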

**Notes:**
When a module named `spam` is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named `spam.py` in a list of directories given by the variable `sys.path` (see below). `sys.path` is initialized from these locations:
- the directory containing the input script (or the current directory).
- `PYTHONPATH` (a list of directory names, with the same syntax as the shell variable PATH).
- the installation-dependent default.

After initialization, Python programs can modify `sys.path`. The directory containing the script being run is placed at the beginning of the search path, ahead of the standard library path. This means that scripts in that directory will be loaded instead of modules of the same name in the library directory. This is an error unless the replacement is intended.
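
As a small sketch of how a program can inspect and modify `sys.path`: the directory name `my_python_modules` below is a hypothetical example, not something defined in this lecture.

``` python
import os
import sys

# The first entry is usually the directory of the running script
# (or an empty string in an interactive session).
print(sys.path[0])

# Hypothetical directory holding your own modules; adjust it to your setup.
my_modules_dir = os.path.join(os.path.expanduser('~'), 'my_python_modules')

# Appending (instead of inserting at the front) keeps the standard library
# ahead of your own files, so they cannot silently shadow a library module.
if my_modules_dir not in sys.path:
    sys.path.append(my_modules_dir)

print(my_modules_dir in sys.path)   # True
```

The change only lasts for the current session; setting `PYTHONPATH` is the persistent alternative.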

You can obviously create your own modules, but one of the great assets of Python is the availability of a **huge** library of third-party modules, often designed by scientists for scientists (including astronomers !). Some of these modules introduce new structures/objects that you may have to learn to master by yourself. There are some modules that **YOU NEED TO KNOW** to work efficiently on data-science related projects: `numpy`, `scipy` and `matplotlib` ... We will use those modules extensively when studying numerical methods for data analysis. 


## II. `numpy`:

`numpy` can be seen as the implementation of mathematical functions and operations for the Python language. It also introduces one key object: the `array`. 

### `array`: 

A `numpy` array is an object of the type `np.ndarray` (although this type specifier is rarely used directly). Instead one can create arrays in several ways: 
``` python
import numpy as np
np.array([1,2,3,4])   # creates an array from a python list
np.array([[0, 1, 2], [3, 4, 5]])   # Creates a 2D array from a python list
np.arange(5) # similar to the built-in range() function
np.linspace(1, 10, 10) # creates an array of 10 evenly spaced elements from 1 to 10
np.zeros(10)  # creates an array of 10 elements filled with 0
np.ones(10)   # creates an array of 10 elements filled with 1
np.zeros((2, 5))  # multidimensional array of 2 rows and 5 columns

```
2-D arrays of `shape=(r, c)` are arrays with `r` *rows* and `c` *columns*. 

`numpy` also has tools to create arrays filled with random elements:
``` python
np.random.random(size=4)  # uniform between 0 and 1
np.random.normal(size=4)  # elements are std-normal distributed

```

In [None]:
# Use this cell to experiment with several ways of creating an array as shown above.



In [None]:
a = np.array([[1,2,3,4], [4,5,6,7]])
# What is the output of a.ndim, a.shape, len(a) and how does it relate to the rows, columns, dimensions?

**Exercise:** Elementwise operations

In the code cell below, try simple arithmetic elementwise operations: 
- add even elements with odd elements
- Time them against their pure python counterparts using %timeit.
- Generate:
   `[2**0, 2**1, 2**2, 2**3, 2**4]`
   `a_j = 2^(3*j) - j `

Python is fast *for coding and developing*, but it is slow when it comes to *execution*, especially the execution of `for` loops.    
One reason for this low speed is that Python is dynamically typed: in a loop such as `for a in range(10): a + b`, the interpreter has to check the `type` of `a` and of `b` at *each iteration* before executing the operation. 

`numpy` helps speed up code through 4 strategies:
1. `ufunc`
2. aggregation
3. broadcasting
4. slicing, masking, fancy indexing

### II.1 `ufunc`: operates elementwise on objects. 

Those `ufunc` are included (compiled) in `numpy`. They include: 

- All arithmetic operations: `+`, `-`, `/`, `*`, `**`, 
- Mathematical functions: sin, exp, cos, log10, ... 
- Comparison operators: `<`, `>`, `==`, ...
- etc ... 

**Example:**
``` python
import numpy as np
# Basic python
a = [1,2,3,4,5]
b = [val + 5 for val in a]   # add 5 to each element of the list
# In numpy
a = np.array(a)
b = a + 5                     # add 5 to each element of the array.
```
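
Beyond adding a constant, the same elementwise behaviour applies to the other families of `ufunc` listed above. The following is a minimal sketch; the variable names are chosen only for illustration.

``` python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

doubled = a * 2        # arithmetic operator, applied to each element
squares = a ** 2       # power, applied to each element
roots = np.sqrt(a)     # mathematical function, applied to each element
is_big = a > 3         # comparison: returns a boolean array

# The pure Python counterpart of `a + 5` needs an explicit loop.
b_list = [val + 5 for val in [1, 2, 3, 4, 5]]
b_array = a + 5

print(doubled)   # [ 2  4  6  8 10]
print(is_big)    # [False False False  True  True]
```

In all these cases the loop over the elements happens inside compiled `numpy` code rather than in the Python interpreter.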

In [None]:
# implement the above example for a list of 1000 elements 
# use %timeit before calculating b to see improvement in speed

### II.2. *aggregation*: 

functions which summarize values of an array such as `min`, `max`, `sum`, `mean`, ... 

**Example:**

``` python
# python version of an aggregation
from random import random
c = [ random() for i in range(10000) ]
%timeit min(c)
#same in numpy:
c = np.array(c)
%timeit c.min()  
```
This also works on multidimensional arrays: 

``` python 
M = np.random.randint(0, 10, (10,4))
M.sum(axis=0)
M.sum(axis=1)
```
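
Since `M` above is random, the effect of the `axis` argument is easier to see on a small deterministic array. A minimal sketch (the 2x3 array is just an illustrative choice):

``` python
import numpy as np

M = np.arange(6).reshape((2, 3))
# M is:
# [[0, 1, 2],
#  [3, 4, 5]]

col_sums = M.sum(axis=0)   # collapses the rows: one value per column
row_sums = M.sum(axis=1)   # collapses the columns: one value per row
total = M.sum()            # no axis: aggregates over the whole array

print(col_sums)   # [3 5 7]
print(row_sums)   # [ 3 12]
print(total)      # 15
```

The axis you pass is the one that disappears from the shape of the result.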

Available aggregations: 
`np.min()`, `np.max()`, `np.prod()`, `np.mean()`, `np.std()`, `np.median()`, `np.any()`, `np.all()`, `np.nanmin()` (and nan versions of the above aggregations), `np.argmin()`, `np.argmax()`, `np.percentile()`, ...


In [17]:
import numpy as np
from random import random
c = [ random() for i in range(1000) ]
%timeit min(c)
#same in numpy:
c = np.array(c)
%timeit c.min() 

The slowest run took 41.76 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.1 µs per loop
The slowest run took 15.60 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.86 µs per loop


### II.3 *Broadcasting*: 

Set of rules by which `ufuncs` operate on arrays of different sizes and/or dimensions. 

The term [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.
Application to three cases: 

![From astroML book](../Figures/fig_broadcast_visual_1.png)



The rules / how this works:

* If the arrays have different numbers of dimensions, left-pad the smaller shape with 1s 
* If the sizes along a dimension do not match, broadcast (stretch) the array whose size is 1 along that dimension
* If the sizes do not match and neither of them is 1, raise an error

This broadcasting strategy allows one to avoid doing `for` loops for some operations. 
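
The rules above can be sketched on three typical cases, plus one that fails. The specific arrays are an assumption chosen for illustration and may differ from those shown in the figure.

``` python
import numpy as np

a = np.arange(3)                  # shape (3,): [0, 1, 2]

# Case 1: array and scalar. The scalar is stretched to shape (3,).
case1 = a + 5

# Case 2: shapes (3, 3) and (3,). The shape (3,) is left-padded to (1, 3),
# then stretched along the first dimension to (3, 3).
case2 = np.ones((3, 3)) + a       # every row is [1., 2., 3.]

# Case 3: shapes (3, 1) and (3,). Both arrays are stretched to (3, 3).
col = a.reshape((3, 1))
case3 = col + a                   # case3[i, j] == i + j

# Incompatible shapes (3,) and (4,): sizes differ and neither is 1.
error_message = None
try:
    a + np.arange(4)
except ValueError as err:
    error_message = str(err)

print(case1)         # [5 6 7]
print(case3.shape)   # (3, 3)
```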


### II.4 Slicing, masking and fancy indexing:  
	 
- **Mask**: a mask is a boolean array that can be used to "mask" some indices of an array: 
``` python
mask = np.array([False, False, True, False, True, False])
c = np.array([1, 3, 6, 9, 10, 2])
c[mask]
#   Out: array([ 6, 10])

mask = (c < 4) | (c > 8)
c[mask]
#   Out: array([ 1,  3,  9, 10,  2])
```
 
- **Fancy indexing**: passing a list/array of indices to get elements of a numpy array (this only works for arrays !). This avoids looping over the indices. 
``` python
ind = [1, 3, 4]
c[ind]  
#   Out: array([ 3,  9, 10])
```
- **Multi-dimensional** array: first index is row, and second is column
``` python
M = np.arange(12).reshape((3,4))
# M is:
#   array([[ 0,  1,  2,  3],
#          [ 4,  5,  6,  7],
#          [ 8,  9, 10, 11]])

M[0,1]       # gives the value at row 0 and column 1
M[:, 1]      # combines slices and indices -> all rows of column 1
M[M-3 < 2]   # masking also works on n-dimensional arrays
M[[1,0], :2] # fancy indexing and slicing -> first 2 columns of rows 1 and 0
M[M.sum(axis=1) > 2, 2:] # mixing masking and slicing (a slice `4:` would be empty, as M has only 4 columns)
```
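
To make the results of such indexing expressions concrete, here is a minimal sketch that stores each result in a variable. The threshold `10` in the row mask is an illustrative choice, picked so that one row is excluded.

``` python
import numpy as np

M = np.arange(12).reshape((3, 4))

single = M[0, 1]                # one element: row 0, column 1
column = M[:, 1]                # all rows of column 1
masked = M[M - 3 < 2]           # elements smaller than 5, returned as a 1-D array
fancy = M[[1, 0], :2]           # rows 1 then 0, first two columns

row_mask = M.sum(axis=1) > 10   # row sums are 6, 22, 38
mixed = M[row_mask, 2:]         # last two columns of the selected rows

print(column)   # [1 5 9]
print(fancy)    # [[4 5]
                #  [0 1]]
print(mixed)    # [[ 6  7]
                #  [10 11]]
```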
An illustration of indexing in numpy arrays:
![Illustration of `np` indexing](../Figures/numpy_indexing.png)

**Exercise**:
Try the different flavours of slicing, using start, end and step: starting from a linspace, try to obtain odd numbers counting backwards, and even numbers counting forwards.

Reproduce the slices in the diagram above. You may use the following expression to create the array: `np.arange(6) + np.arange(0, 51, 10)[:, np.newaxis]`

In [None]:
# Implement the exercise above

## The standard library
   
This section gives an overview of very useful modules and functions you may need at some point to manage your files, directory structures, platform-related file naming conventions, ... 

### `os`: operating system functionality

> “A portable way of using operating system dependent functionality.”

#### Directory and file manipulation:

- Current directory:   `os.getcwd()`    

- List a directory:  `os.listdir(os.curdir)`

- Make a directory:   `os.mkdir('junkdir')`

- Rename the directory:  `os.rename('junkdir', 'foodir')`

- Delete a file:  `os.remove('junk.txt')`


In [14]:
# Experiment with the `os` module and check the output
import os
print(os.getcwd())
print(os.listdir(os.curdir))
fp = open('junk.txt', 'w')    # first create an empty file
fp.close()
print('junk.txt' in os.listdir(os.curdir))   # True: the file exists
os.remove('junk.txt')
print('junk.txt' in os.listdir(os.curdir))   # False: the file has been deleted
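
The directory operations listed above can also be tried without touching your working directory. The sketch below does everything inside a scratch directory created with `tempfile.mkdtemp()`; the use of `tempfile` and `os.rmdir()` is an addition for the sake of a self-contained example.

``` python
import os
import tempfile

workdir = tempfile.mkdtemp()                 # scratch directory for the demo
junkdir = os.path.join(workdir, 'junkdir')
foodir = os.path.join(workdir, 'foodir')

os.mkdir(junkdir)                            # make a directory
os.rename(junkdir, foodir)                   # rename it
after_rename = os.listdir(workdir)           # ['foodir']

junk_file = os.path.join(foodir, 'junk.txt')
with open(junk_file, 'w') as fp:             # create a small file
    fp.write('temporary content\n')
exists_before = os.path.exists(junk_file)    # True
os.remove(junk_file)                         # delete the file
exists_after = os.path.exists(junk_file)     # False

os.rmdir(foodir)                             # remove the (now empty) directories
os.rmdir(workdir)
```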

### `os.path`: path manipulations

`os.path` provides common operations on pathnames:

- Get the absolute path name for a file in a directory: `a = os.path.abspath('junk.txt')`  
``` python
>>> a
    '/Users/cburns/src/scipy2009/scipy_2009_tutorial/source/junk.txt'
```
- Split Path name and file name:  `os.path.split(a)`   

- Get the path part of `a`:  `os.path.dirname(a)`     

- Filename part of `a`:  `os.path.basename(a)` (here `'junk.txt'`)    

- Split file name into name and extension: `os.path.splitext(os.path.basename(a))`   

- Check existence of a file in a path: `os.path.exists('junk.txt')`   

- Check that a filename corresponds to a file: `os.path.isfile('junk.txt')`   

- Check for a directory name: `os.path.isdir('junk.txt')`    

- Pathname corresponding to home of the user: `os.path.expanduser('~')`     

- Create a string by merging pathnames/strings: `os.path.join(os.path.expanduser('~'), 'local', 'bin')`    
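
The path operations above can be chained as in the following sketch. The path is hypothetical (a `project` directory in the user's home) and does not need to exist, since these functions only manipulate strings, apart from the existence checks.

``` python
import os

# Hypothetical path built in a platform-independent way.
a = os.path.join(os.path.expanduser('~'), 'project', 'junk.txt')

directory, filename = os.path.split(a)        # (path part, file name)
same_directory = os.path.dirname(a)           # path part only
same_filename = os.path.basename(a)           # file name only
name, extension = os.path.splitext(filename)  # ('junk', '.txt')

print(filename)             # junk.txt
print(extension)            # .txt
print(os.path.isfile(a))    # True only if the file really exists
```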


### `subprocess`: running an external command

This is also very useful to call an externally compiled program.
- Call a simple command, wait for it to finish, and get the return code:

```  python
import subprocess
subprocess.call('chmod +x filename', shell=True)
```

- Communicate with the process (try for example with some_program.f):

``` python
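>>> import signal
>>> import subprocess
>>> p1 = subprocess.Popen('./some_program',stdout=subprocess.PIPE)
>>> p1.stdout.readline()                 # read one line of the program's output
>>> p1.send_signal(signal.SIGSTOP)       # pause the process
>>> p1.send_signal(signal.SIGCONT)       # resume it
>>> p1.send_signal(signal.SIGKILL)       # kill it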
```

**Notes:**

How to communicate with a program during execution:

Suppose we want to run a program, and check its output while it’s running. For this, we need to read the program’s standard output while it is running, wait for the next line to appear, and end the loop when the output stream is closed. This can be done with:
``` python 
def line_at_a_time(fileobj):
    while True:
        line = fileobj.readline()
        if not line:
            return
        yield line
```

Now, we can run the program and check the output. Suppose `my_program` prints ERROR to the screen when it encounters an error, and we want to kill the program whenever that occurs:
``` python 
>>> import signal
>>> import subprocess
>>> p1 = subprocess.Popen('./my_program', stdout=subprocess.PIPE,
...                       universal_newlines=True)   # read the output as text lines
>>> for line in line_at_a_time(p1.stdout):
...     if "ERROR" in line:
...         p1.send_signal(signal.SIGKILL)
```
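
Since `./my_program` is not provided here, the same monitoring pattern can be tried with a stand-in. In the sketch below, the child process is a short Python script (a hypothetical replacement for the external program) that prints two lines and would then keep running for 30 seconds if it were not killed.

``` python
import subprocess
import sys

def line_at_a_time(fileobj):
    """Yield lines from fileobj until the stream is closed."""
    while True:
        line = fileobj.readline()
        if not line:
            return
        yield line

# Stand-in for ./my_program (hypothetical): prints an ERROR line, then hangs.
child_code = (
    "import sys, time\n"
    "print('step 1 done'); sys.stdout.flush()\n"
    "print('ERROR: step 2 failed'); sys.stdout.flush()\n"
    "time.sleep(30)\n"
)
p1 = subprocess.Popen([sys.executable, '-c', child_code],
                      stdout=subprocess.PIPE,
                      universal_newlines=True)   # lines arrive as text

seen = []
for line in line_at_a_time(p1.stdout):
    seen.append(line.strip())
    if "ERROR" in line:
        p1.kill()      # on POSIX systems this sends SIGKILL
        break
p1.wait()              # collect the exit status of the killed process
p1.stdout.close()
```

The child flushes its output after each line; without that, the lines could stay in the child's buffer and the parent would not see them in time.
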
Similarly, you can use subprocess.PIPE to send data to stdin.
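
A minimal sketch of sending data to the standard input of a process: the child here is again a short Python script (a hypothetical stand-in for an external program) that reads its input and prints it in upper case.

``` python
import subprocess
import sys

# Stand-in for an external program (hypothetical): upper-cases its input.
child_code = "import sys; print(sys.stdin.read().upper())"

p = subprocess.Popen([sys.executable, '-c', child_code],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     universal_newlines=True)   # exchange text, not bytes

# communicate() writes to stdin, closes it, and reads stdout until the end.
out, _ = p.communicate('hello from the parent\n')

print(out.strip())   # HELLO FROM THE PARENT
```

Using `communicate()` rather than writing to `p.stdin` directly avoids deadlocks when the amount of data exchanged is large.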


### Environment variables:

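Get environment variables: 

- All defined environment variables:  `os.environ.keys()`   
- Get the value of a given environment variable (e.g. the path stored in `PYTHONPATH`):    
`os.environ['PYTHONPATH']` (raises a `KeyError` if the variable is not defined)     
OR    
`os.getenv('PYTHONPATH')` (returns `None` if the variable is not defined)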
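
The difference between the two ways of reading an environment variable can be sketched as follows. The variable name `MY_UNDEFINED_VARIABLE_FOR_DEMO` is hypothetical and assumed not to be defined on your machine.

``` python
import os

name = 'MY_UNDEFINED_VARIABLE_FOR_DEMO'   # hypothetical, assumed undefined

value = os.getenv(name)                   # None when the variable is missing
fallback = os.getenv(name, 'not set')     # a default value can be supplied

try:
    os.environ[name]                      # dictionary-style access ...
    raised = False
except KeyError:
    raised = True                         # ... raises KeyError when missing

# Variables can also be set for the current process (and its children).
os.environ[name] = '/some/hypothetical/path'
value_after = os.getenv(name)
del os.environ[name]                      # clean up

print(value)      # None
print(fallback)   # not set
```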


### `sys`: system-specific information

This is particularly useful if you want to make a quick fix to import some Python code located in a specific directory, or to figure out which Python is used when you have multiple Python installations on the machine (this kind of problem can now be avoided more easily if you install Python via conda). 

System-specific information related to the Python interpreter.

- Which version of python are you running and where is it installed: 
        * Platform: `sys.platform`
        * Version of python: `sys.version`
        * Location of python used: `sys.prefix`

- List of command line arguments passed to a Python script: `sys.argv`

- The list of strings that specifies the search path for modules, initialized in part from `PYTHONPATH`: `sys.path` 
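
A short sketch that gathers the `sys` attributes listed above; the values printed depend on your installation, so no output is shown.

``` python
import sys

print(sys.platform)   # name of the platform Python is running on
print(sys.version)    # version string of the interpreter
print(sys.prefix)     # directory where this Python is installed

# Arguments passed after the script name on the command line
# (an empty list when there are none).
arguments = sys.argv[1:]

# Copy of the module search path, one directory per entry.
search_path = list(sys.path)
print(len(search_path))
```

Comparing `sys.prefix` between two sessions is a quick way to check whether they use the same Python installation.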

## V. References / To know more:

**Appendix A** of the book *Statistics, Data Mining, and Machine Learning in Astronomy* by Z. Ivezic et al., in the Princeton Series in Modern Astronomy.  

Other useful references to know more about the topics covered in this class: 

- The python tutorial (Chap. 6): https://docs.python.org/2/tutorial/modules.html
- Standard python library: http://www.ster.kuleuven.be/~pieterd/python/html/pure_python/standard_library.html
    
Excellent introductory video on numpy (which inspired part of the numpy section of this notebook) by J. VanderPlas: https://www.youtube.com/watch?v=EEUXKG97YRw

Scipy lecture notes (from which parts of the numpy, scipy and matplotlib tutorials are inspired): http://www.scipy-lectures.org/index.html  (Creative Commons 4.0)

Numpy quick-start: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html