# Numerical data in `numpy`

[`numpy`](https://docs.scipy.org/doc/numpy/) is a widely used library for working with numerical data in Python. It introduces the __Array__ data structure, which holds multi-dimensional numerical data. The `numpy` library is __not__ part of the Python standard library. However, it comes bundled with [Anaconda](01_anaconda.ipynb), so you should already have it installed. The usual way to import `numpy` is as follows:

    import numpy as np
    
This gives us access to all `numpy` functions through the prefix `np`. The `np` alias is a widely followed convention, and you should use it in your own code.

In [None]:
import numpy as np
np.set_printoptions(precision=3, linewidth=60, edgeitems=1)
np.__version__

## Reading data with `numpy`

The `numpy` library comes with a few functions that can read numerical data from text files. The `np.loadtxt`-function is a fast reader with some basic functionality for skipping header lines, converting values to floats, etc. For more sophisticated files the `np.genfromtxt`-function can be used. It also supports handling missing values, but is slower.

Note that these functions take a filename as input, so you do not need to open the file with the built-in `open`-function beforehand. Assume we have a text file containing data of the following form:

In [None]:
!cat data/numpy_simple.txt

These data can be loaded with a very simple `np.loadtxt`-command as follows:

In [None]:
data = np.loadtxt('data/numpy_simple.txt')
data

Arrays in `numpy` have a shape, a tuple giving the size of the array along each dimension. In this case we have 6 rows and 8 columns.

In [None]:
data.shape
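As a minimal sketch of what the shape means, the example below builds a small array by hand instead of loading it from a file. The array `example` and its values are made up for illustration and are not the contents of `data/numpy_simple.txt`.

```python
import numpy as np

# Hypothetical stand-in for a loaded dataset: 2 rows, 3 columns
example = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])

# shape is a plain tuple, so it can be unpacked directly
n_rows, n_cols = example.shape

print(example.shape)   # (2, 3)
print(example.ndim)    # 2  (number of dimensions)
print(example.size)    # 6  (total number of elements)
```

Unpacking the shape like this is convenient when later code needs the number of rows or columns by name.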

## Indexing and vectorization

Similarly to lists and other sequences in Python, `numpy` arrays can be indexed.

In [None]:
data[0]    # First row (= row 0)

In [None]:
data[2:5]    # Rows 2, 3 and 4 (3rd, 4th and 5th)

However, with multi-dimensional arrays (in this case 2-dimensional), we can specify each dimension in the index separated by commas.

In [None]:
data[4, 5]    # Element in row 4, column 5 (5th row, 6th column)

In [None]:
data[:, :2]    # All rows, first two columns
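The same indexing rules can be tried on a small array whose contents are easy to predict. This is only a sketch: the array `grid` is a hypothetical example, unrelated to the data file used above.

```python
import numpy as np

# Hypothetical 3x4 array holding the integers 0..11, row by row
grid = np.arange(12).reshape(3, 4)

row = grid[1]                  # second row: [4 5 6 7]
element = grid[2, 3]           # row 2, column 3: 11
first_two_cols = grid[:, :2]   # all rows, columns 0 and 1
last_col = grid[:, -1]         # all rows, last column: [3 7 11]

print(row)
print(element)
print(first_two_cols.shape)    # (3, 2)
print(last_col)
```

Note that `grid[2, 3]` and `grid[2][3]` return the same element, but the comma form does the lookup in a single indexing step.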

With `numpy`-arrays most operations are vectorized. That means that we do not need to explicitly loop over the elements.

In [None]:
data[:, 4] + data[:, 6]    # Add columns 4 and 6 together

In [None]:
np.exp(data[:, -1])    # Exponentiate the last (-1) column
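To make the idea of vectorization concrete, the following sketch computes the same sum twice, once with an explicit loop and once with a single vectorized expression. The arrays `a` and `b` are hypothetical stand-ins for two columns of a dataset.

```python
import numpy as np

# Hypothetical stand-ins for two columns of a dataset
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

# Explicit loop: one element at a time
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] + b[i]

# Vectorized: the addition is applied to all elements at once
vectorized = a + b

print(vectorized)                          # [11. 22. 33.]
print(np.array_equal(looped, vectorized))  # True
```

The vectorized form is shorter, and for large arrays it is typically much faster because the loop runs in compiled code rather than in Python.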

## More advanced reading of data

Most datafiles are not as clean as the simple datafile we have been working with above. Let us instead try to load the following file.

In [None]:
!cat data/numpy_header.txt

A naive use of `np.loadtxt` will fail because `numpy` tries to interpret the header as data.

In [None]:
data2 = np.loadtxt('data/numpy_header.txt')

Instead, we must give `np.loadtxt` some more information. To get help on a function and the parameters it takes, you can write a question mark after its name,

    np.loadtxt?
    
or press `<shift>` and `<tab>` inside the parentheses. Pressing `<shift>` and `<tab>` twice will give even more information.

In this case, we notice that there is an argument called `skiprows` that can be used to ignore the header.

In [None]:
data2 = np.loadtxt('data/numpy_header.txt', skiprows=13)
np.allclose(data, data2)    # Test if data and data2 contain the same elements (within a tolerance)
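The effect of `skiprows` can also be tried without a file on disk, because `np.loadtxt` accepts file-like objects as well as filenames. The sketch below uses `io.StringIO` with a made-up two-line header; the text is hypothetical and much shorter than the header in `data/numpy_header.txt`.

```python
import io
import numpy as np

# Hypothetical file contents: two header lines followed by two data rows
text = """Station: hypothetical
Units: arbitrary
1.0 2.0 3.0
4.0 5.0 6.0
"""

# Skip the two header lines so only the numeric rows are parsed
table = np.loadtxt(io.StringIO(text), skiprows=2)

print(table.shape)   # (2, 3)
print(table[1, 2])   # 6.0
```

Header lines that start with `#` are treated as comments by default and would not need `skiprows` at all.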

Looking more closely at the data, we also notice that there is one datapoint with the value of -999 that probably designates a missing data point. We can convert this to a `nan`-value to handle it properly as we are reading the data. However, to do so we need to use the more sophisticated `np.genfromtxt`-function.

In [None]:
data3 = np.genfromtxt('data/numpy_header.txt', skip_header=13,
                      missing_values='-999', usemask=True).filled(np.nan)
data3

Handling the missing data is actually a two-step process. First we create a masked array, where the mask denotes which data are missing; then we call `.filled(np.nan)` to replace the masked entries with `nan`. The first step can be seen if we look directly at the output from `np.genfromtxt` (note the single `True`-value in the mask).

In [None]:
np.genfromtxt('data/numpy_simple.txt', missing_values='-999', usemask=True)
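The two steps can be followed on a tiny in-memory example. This is a sketch under stated assumptions: the text below is hypothetical data with a single `-999` entry, read through `io.StringIO` rather than from one of the data files.

```python
import io
import numpy as np

# Hypothetical data: one value (-999) marks a missing measurement
text = """1.0 2.0 3.0
4.0 -999 6.0
"""

# Step 1: read into a masked array; the mask flags the missing entry
masked = np.genfromtxt(io.StringIO(text), missing_values='-999', usemask=True)
print(masked.mask)

# Step 2: replace the masked entry with nan to get an ordinary array
filled = masked.filled(np.nan)
print(np.isnan(filled).sum())     # 1

# nan-aware functions then ignore the missing value
print(np.nanmean(filled[:, 1]))   # 2.0
```

Once the missing value is a `nan`, functions such as `np.nanmean` skip it, whereas a plain `np.mean` over the same column would return `nan`.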

See the [`numpy` docs](https://docs.scipy.org/doc/numpy-dev/user/basics.io.genfromtxt.html) for more information about reading data.