# Numerical data in `numpy`

[`numpy`](https://docs.scipy.org/doc/numpy/) is a very powerful library for working with numerical data in Python. It introduces the __Array__ data structure, which can contain multi-dimensional numerical data. The `numpy` library is __not__ part of the Python standard library. However, it comes bundled with [Anaconda](01_anaconda.ipynb), so you should already have it installed. The usual way to import `numpy` is as follows:

    import numpy as np
    
This gives us access to all the `numpy` functions through the prefix `np`. The `np` alias is a widely used convention, and you should follow it in your own code.

In [1]:
import numpy as np
np.set_printoptions(precision=3, linewidth=75, edgeitems=1)
np.__version__

'1.13.1'

A simple way to instantiate an array is by giving it a list or another iterable.

In [2]:
data_in_list = [1, 2, 5, 3]
data_as_np = np.array(data_in_list)
data_as_np

array([1, 2, 5, 3])
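As a small sketch of what "a list or another iterable" can mean in practice, the example below builds arrays from a nested list, a `range`, and a list mixing integers and floats. The variable names (`nested`, `from_range`, `mixed`) are made up for this illustration.

```python
import numpy as np

# A nested list becomes a 2-dimensional array (2 rows, 3 columns)
nested = np.array([[1, 2, 3], [4, 5, 6]])

# Other iterables of numbers work as well, for instance a range
from_range = np.array(range(4))

# Mixing integers and floats gives an array of floats
mixed = np.array([1, 2.5, 3])

print(nested.shape)    # (2, 3)
print(from_range)      # [0 1 2 3]
print(mixed.dtype)     # float64
```

All elements of an array share one data type, which is why the integers in `mixed` are converted to floats.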

## Reading data with `numpy`

The `numpy` library comes with a few functions that can read numerical data from text files. The `np.loadtxt`-function is a fast reader with some basic functionality for skipping header lines, converting values to floats, etc. For more sophisticated files the `np.genfromtxt`-function can be used. It also supports handling missing values, but is slower.

Note that these functions take a filename as input, so you do not need to use the built-in `open`-function to open the file beforehand. Assume we have a text file containing data of the following form:

In [3]:
!cat data/numpy_simple.txt

     0.25     -89.75    0.022015    0.040741   -0.021861   -0.029058   -0.025845   -0.055643
   180.25     -89.75    0.025000    0.041008    0.021540    0.028705    0.025741    0.055648
     0.25       0.25   -0.045473   -0.013363   -0.029142   -0.034187   -0.026165   -0.070089
   180.25       0.25    0.028523    0.039553   -0.002982   -0.035883    0.024974    0.071998
     0.25      89.75   -0.010242   -0.028406    0.042345        -999   -0.025851   -0.070016
   180.25      89.75   -0.013632   -0.028131   -0.042647   -0.029206    0.025918    0.070023


These data can be loaded with a very simple `np.loadtxt`-command as follows:

In [4]:
data = np.loadtxt('data/numpy_simple.txt')
data

array([[  2.500e-01,  -8.975e+01,   2.201e-02,   4.074e-02,  -2.186e-02,
         -2.906e-02,  -2.584e-02,  -5.564e-02],
       [  1.802e+02,  -8.975e+01,   2.500e-02,   4.101e-02,   2.154e-02,
          2.871e-02,   2.574e-02,   5.565e-02],
       [  2.500e-01,   2.500e-01,  -4.547e-02,  -1.336e-02,  -2.914e-02,
         -3.419e-02,  -2.617e-02,  -7.009e-02],
       [  1.802e+02,   2.500e-01,   2.852e-02,   3.955e-02,  -2.982e-03,
         -3.588e-02,   2.497e-02,   7.200e-02],
       [  2.500e-01,   8.975e+01,  -1.024e-02,  -2.841e-02,   4.235e-02,
         -9.990e+02,  -2.585e-02,  -7.002e-02],
       [  1.802e+02,   8.975e+01,  -1.363e-02,  -2.813e-02,  -4.265e-02,
         -2.921e-02,   2.592e-02,   7.002e-02]])

Arrays in `numpy` have a shape specifying how big the dataset is. In this case we have 6 rows and 8 columns.

In [5]:
data.shape

(6, 8)
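The shape is an ordinary tuple, so it can be unpacked into separate variables. The sketch below uses a stand-in array of zeros with the same shape as `data` (the name `stand_in` is hypothetical) and also shows the related attributes `ndim` and `size`.

```python
import numpy as np

# Stand-in for the dataset: an array of zeros with 6 rows and 8 columns
stand_in = np.zeros((6, 8))

# The shape tuple can be unpacked directly
n_rows, n_cols = stand_in.shape

print(n_rows, n_cols)    # 6 8
print(stand_in.ndim)     # 2  (number of dimensions)
print(stand_in.size)     # 48 (total number of elements)
```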

## Indexing and vectorization

Similarly to lists and other sequences in Python, `numpy` arrays can be indexed.

In [6]:
data[0]    # First row (= row 0)

array([  2.500e-01,  -8.975e+01,   2.201e-02,   4.074e-02,  -2.186e-02,
        -2.906e-02,  -2.584e-02,  -5.564e-02])

In [7]:
data[2:5]    # Rows 2, 3 and 4 (3rd, 4th and 5th)

array([[  2.500e-01,   2.500e-01,  -4.547e-02,  -1.336e-02,  -2.914e-02,
         -3.419e-02,  -2.617e-02,  -7.009e-02],
       [  1.802e+02,   2.500e-01,   2.852e-02,   3.955e-02,  -2.982e-03,
         -3.588e-02,   2.497e-02,   7.200e-02],
       [  2.500e-01,   8.975e+01,  -1.024e-02,  -2.841e-02,   4.235e-02,
         -9.990e+02,  -2.585e-02,  -7.002e-02]])

However, with multi-dimensional arrays (in this case 2-dimensional), we can specify each dimension in the index separated by commas.

In [8]:
data[4, 5]    # Element in row 4, column 5 (5th row, 6th column)

-999.0

In [9]:
data[:, :2]    # All rows, first two columns

array([[   0.25,  -89.75],
       [ 180.25,  -89.75],
       [   0.25,    0.25],
       [ 180.25,    0.25],
       [   0.25,   89.75],
       [ 180.25,   89.75]])
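The same indexing patterns are perhaps easier to follow on a small array where each value reveals its position. The sketch below uses a made-up 3 x 4 array called `grid`; it is not part of the dataset above.

```python
import numpy as np

# A 3 x 4 array holding the numbers 0..11, filled row by row
grid = np.arange(12).reshape(3, 4)

element = grid[1, 2]        # Row 1, column 2
first_column = grid[:, 0]   # All rows, column 0
last_row = grid[-1]         # Negative indices count from the end
block = grid[:2, 1:3]       # Rows 0-1, columns 1-2

print(element)              # 6
print(first_column)         # [0 4 8]
print(last_row)             # [ 8  9 10 11]
print(block.shape)          # (2, 2)
```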

With `numpy`-arrays most operations are vectorized. That means that we do not need to explicitly loop over the elements.

In [10]:
data[:, 4] + data[:, 6]    # Add columns 4 and 6 together

array([-0.048,  0.047, -0.055,  0.022,  0.016, -0.017])

In [11]:
np.exp(data[:, -1])    # Exponentiate the last (-1) column

array([ 0.946,  1.057,  0.932,  1.075,  0.932,  1.073])
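To make the idea of vectorization concrete, the following sketch compares an explicit Python loop with the equivalent vectorized expression. The arrays `a` and `b` are made up for the example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Explicit loop: add the elements pair by pair
looped = np.array([x + y for x, y in zip(a, b)])

# Vectorized: the addition is applied to all elements at once
vectorized = a + b

print(np.array_equal(looped, vectorized))    # True

# Scalars are combined with every element of the array
scaled = 2 * a + 1
print(scaled)                                # [3. 5. 7.]
```

Both versions give the same result, but the vectorized one is shorter and, for large arrays, typically much faster.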

`numpy` also comes with summary functions like `sum`, `mean`, `std`, `var`, `median`, etc. that can operate on a whole array or along a given dimension (`axis`) of the data.

In [12]:
np.mean(data)    # Calculate the mean of the whole array

-9.5223768750000009

In [13]:
np.mean(data, axis=0)    # Calculate the mean of each column (along the rows, the 0th axis)

array([  9.025e+01,   8.333e-02,   1.032e-03,   8.567e-03,  -5.458e-03,
        -1.665e+02,  -2.047e-04,   3.202e-04])
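The `axis` argument is easy to mix up, so here is a minimal sketch on a made-up 2 x 3 array (`table`) where the results can be checked by hand.

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

total_mean = np.mean(table)          # Mean of all six elements
col_means = np.mean(table, axis=0)   # One value per column
row_means = np.mean(table, axis=1)   # One value per row
row_sums = np.sum(table, axis=1)     # Other summary functions work the same way

print(total_mean)    # 3.5
print(col_means)     # [2.5 3.5 4.5]
print(row_means)     # [2. 5.]
print(row_sums)      # [ 6. 15.]
```

One way to remember it: the axis you specify is the one that disappears from the result. With `axis=0` the rows are collapsed, leaving one value per column.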

## More advanced reading of data

Most datafiles are not as clean as the simple datafile we have been working with above. Let us instead try to load the following file.

In [14]:
!cat data/numpy_header.txt

Simplified dataset based on Ocean Pole Load Tide Deformation Parameters
from Self-Consistent Equilibrium Model of Ocean Pole Tide (Desai, 2002)
Number_longitude_Grid_Points =         2
First_longitude_degrees      =      0.25
Last_longitude_degrees       =    180.25
Longitude_step_degrees       =     180.0
Number_latitude_grid_points  =         3
First_latitude_degrees       =    -89.75
Last_latitude_degrees        =     89.75
Latitude_step_degrees        =      90.0
Longitude   Latitude   u_r^R       u_r^I       u_n^R       u_n^I       u_e^R       u_e^I    
(degrees)  (degrees)  (        )  (        )  (        )  (        )  (        )  (        )
---------  ---------  ----------  ----------  ----------  ----------  ----------  ----------
     0.25     -89.75    0.022015    0.040741   -0.021861   -0.029058   -0.025845   -0.055643
   180.25     -89.75    0.025000    0.041008    0.021540    0.028705    0.025741    0.055648
     0.25       0.25   -0.045473   -0.013363   -

A naive use of `np.loadtxt` will fail because `numpy` tries to interpret the header as data.

In [15]:
data2 = np.loadtxt('data/numpy_header.txt')

ValueError: could not convert string to float: b'Simplified'

Instead we must give `np.loadtxt` some more information. To get some help about a function and which parameters it takes, you can write a question mark after its name,

    np.loadtxt?
    
or press `<shift>` and `<tab>` inside the parentheses. Pressing `<shift>` and `<tab>` twice will give even more information.

In this case, we notice that there is an argument called `skiprows` that can be used to ignore the header.

In [16]:
data2 = np.loadtxt('data/numpy_header.txt', skiprows=13)
np.allclose(data, data2)    # Test if data and data2 contain the same elements (within a tolerance)

True
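If you want to experiment with `skiprows` without a data file at hand, `np.loadtxt` also accepts file-like objects. The sketch below wraps a made-up text with a three-line header in `io.StringIO`; the header and the numbers are invented for the example.

```python
import io

import numpy as np

# Made-up file content: three header lines followed by two rows of data
text = """Made-up header line
x      y
---    ---
1.0    2.0
3.0    4.0
"""

# skiprows=3 ignores the three header lines before parsing the numbers
loaded = np.loadtxt(io.StringIO(text), skiprows=3)

print(loaded.shape)    # (2, 2)
```

Changing `skiprows` to 2 should reproduce the same kind of `ValueError` as above, since the line of dashes cannot be converted to floats.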

Looking more closely at the data, we also notice that there is one data point with the value -999, which probably designates a missing value. We can convert this to a `nan`-value as we are reading the data, so that it is handled properly. However, to do so we need to use the more sophisticated `np.genfromtxt`-function.

In [17]:
data3 = np.genfromtxt('data/numpy_header.txt', skip_header=13,
                      missing_values='-999', usemask=True).filled(np.nan)
data3

array([[  2.500e-01,  -8.975e+01,   2.201e-02,   4.074e-02,  -2.186e-02,
         -2.906e-02,  -2.584e-02,  -5.564e-02],
       [  1.802e+02,  -8.975e+01,   2.500e-02,   4.101e-02,   2.154e-02,
          2.871e-02,   2.574e-02,   5.565e-02],
       [  2.500e-01,   2.500e-01,  -4.547e-02,  -1.336e-02,  -2.914e-02,
         -3.419e-02,  -2.617e-02,  -7.009e-02],
       [  1.802e+02,   2.500e-01,   2.852e-02,   3.955e-02,  -2.982e-03,
         -3.588e-02,   2.497e-02,   7.200e-02],
       [  2.500e-01,   8.975e+01,  -1.024e-02,  -2.841e-02,   4.235e-02,
                nan,  -2.585e-02,  -7.002e-02],
       [  1.802e+02,   8.975e+01,  -1.363e-02,  -2.813e-02,  -4.265e-02,
         -2.921e-02,   2.592e-02,   7.002e-02]])

Handling the missing data is actually a two-step process. First, `np.genfromtxt` creates a masked array, where the mask denotes which data are missing. Second, `.filled(np.nan)` replaces the masked values with `nan`. The first step can be seen if we look directly at the output from `np.genfromtxt` (note the single `True`-value in the mask).

In [18]:
np.genfromtxt('data/numpy_simple.txt', missing_values='-999', usemask=True)

masked_array(data =
 [[0.25 -89.75 0.022015 0.040741 -0.021861 -0.029058 -0.025845 -0.055643]
 [180.25 -89.75 0.025 0.041008 0.02154 0.028705 0.025741 0.055648]
 [0.25 0.25 -0.045473 -0.013363 -0.029142 -0.034187 -0.026165 -0.070089]
 [180.25 0.25 0.028523 0.039553 -0.002982 -0.035883 0.024974 0.071998]
 [0.25 89.75 -0.010242 -0.028406 0.042345 -- -0.025851 -0.070016]
 [180.25 89.75 -0.013632 -0.028131 -0.042647 -0.029206 0.025918 0.070023]],
             mask =
 [[False False False False False False False False]
 [False False False False False False False False]
 [False False False False False False False False]
 [False False False False False False False False]
 [False False False False False  True False False]
 [False False False False False False False False]],
       fill_value = 1e+20)
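The two steps can also be tried on data that are already in memory. The sketch below copies the values of column 5 from the file into an array and uses `np.ma.masked_values` as a stand-in for the masking that `np.genfromtxt` does while reading. The variable names are made up for the example.

```python
import numpy as np

# Column 5 of the dataset, including the -999 placeholder
column = np.array([-0.029058, 0.028705, -0.034187, -0.035883, -999.0, -0.029206])

# Step 1: mask every value equal to the placeholder
masked = np.ma.masked_values(column, -999.0)

# Step 2: replace the masked values with nan
filled = masked.filled(np.nan)

print(masked.mask.sum())     # 1 (only the fifth entry is masked)
print(np.isnan(filled[4]))   # True

# Summaries of the masked array skip the missing value,
# and np.nanmean does the same for the nan-filled array
print(np.isclose(masked.mean(), np.nanmean(filled)))    # True
```

With the placeholder out of the way, the mean of this column is about -0.02 instead of the misleading value of about -166 computed earlier with `np.mean(data, axis=0)`.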

See the [`numpy` docs](https://docs.scipy.org/doc/numpy-dev/user/basics.io.genfromtxt.html) for more information about reading data.