
# Analyzing Patient Data
## Objectives
* Explain what a library is and what libraries are used for.
* Import a Python library and use the functions it contains.
* Read tabular data from a file into a program.
* Select individual values and subsections from data.
* Perform operations on arrays of data.
***

NumPy documentation: https://numpy.org/doc/stable/reference/index.html

# Loading data into Python

To load data, we need to access (*import*, in Python terminology) a library named [`numpy`](https://numpy.org/doc/stable/):

In [1]:
import numpy

Importing a library makes its functionality available for us to use.

Once we’ve imported the library, we can ask the library to read our data file for us:

In [2]:
numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

array([[0., 0., 1., ..., 3., 0., 0.],
       [0., 1., 2., ..., 1., 0., 1.],
       [0., 1., 1., ..., 2., 1., 1.],
       ...,
       [0., 1., 1., ..., 1., 1., 1.],
       [0., 0., 0., ..., 0., 2., 0.],
       [0., 0., 1., ..., 1., 1., 0.]])

Our call to `numpy.loadtxt` read the file, but the result was not stored anywhere: the array was returned and simply displayed on the screen.
To keep the data in memory, we need to assign the returned array to a variable, just as we would assign any other value to a variable:

In [3]:
data = numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

If we want to have a look at the data, we can print the variable's value:

In [4]:
print(data)

[[0. 0. 1. ... 3. 0. 0.]
 [0. 1. 2. ... 1. 0. 1.]
 [0. 1. 1. ... 2. 1. 1.]
 ...
 [0. 1. 1. ... 1. 1. 1.]
 [0. 0. 0. ... 0. 2. 0.]
 [0. 0. 1. ... 1. 1. 0.]]


We can now manipulate it. Let's look at its type:

In [5]:
print(type(data))

<class 'numpy.ndarray'>


We can also find out the type of the data contained in the array:

In [6]:
print(data.dtype)

float64


Or we can see the array's shape, given as (number of rows, number of columns):

In [7]:
print(data.shape)

(60, 40)
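
If the data file is not at hand, the same steps can be tried on a small in-memory table. This is only a sketch: the two-line CSV text and the name `toy` are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io
import numpy

# Hypothetical stand-in for a CSV file: 2 rows (patients), 3 columns (days).
csv_text = io.StringIO('0,1,2\n3,4,5\n')

# loadtxt accepts a file-like object as well as a file name.
toy = numpy.loadtxt(csv_text, delimiter=',')

print(type(toy))   # the container type
print(toy.dtype)   # the type of the values stored in the array
print(toy.shape)   # (number of rows, number of columns)
```

Even though the text contains whole numbers, `numpy.loadtxt` converts them to floating-point values by default, which matches the `float64` reported for `data` above.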


To print one element from the array, we must provide an index in square brackets (`[]`) after the variable name.
The inflammation data has two dimensions, so we will need to use two indices to refer to one specific value:

In [8]:
print('middle value in data:', data[30, 20])

middle value in data: 13.0


The expression `data[30, 20]` accesses the element at row `30` and column `20`.

What would be the indices of the first element?

In [9]:
print('first value in data:', data[0, 0])

first value in data: 0.0


Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second.
![image](../images/python-zero-index.svg)
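
The index range described above can be checked on a small array. This is a minimal sketch on a made-up 3×4 array named `toy`, not the inflammation data.

```python
import numpy

# Hypothetical 3x4 array used only for illustration.
toy = numpy.arange(12).reshape(3, 4)
n_rows, n_cols = toy.shape

first = toy[0, 0]                    # top-left element
last = toy[n_rows - 1, n_cols - 1]   # bottom-right element
print('first value:', first)
print('last value:', last)

# Index n_rows is one past the end, so NumPy raises an IndexError.
try:
    toy[n_rows, 0]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print('toy[n_rows, 0] is out of bounds:', out_of_bounds)
```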

# Slicing data
Above we selected a single element, but we can also select whole sections, or slices, of an array. For instance, we can select the first ten days (columns) for the first four patients (rows) with:

In [10]:
print(data[0:4, 0:10])

[[0. 0. 1. 3. 1. 2. 4. 7. 8. 3.]
 [0. 1. 2. 1. 2. 1. 3. 2. 2. 6.]
 [0. 1. 1. 3. 3. 2. 6. 2. 5. 9.]
 [0. 0. 2. 0. 4. 2. 2. 1. 6. 7.]]


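A slice is not an inclusive range: `from:to` includes the index `from` but stops just before the index `to` (the half-open interval `[from, to)`). For example, with the elements below, the slice `0:2` selects the elements `1` and `2` (indices `0` and `1`):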

|elements| 1  | 2  | 3  | 4  |
|--------|----|----|----|----|
|indexes | 0  | 1  | 2  | 3  |
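
The four elements from the table can be used to check this rule. The sketch below assumes a made-up one-dimensional array named `elements`.

```python
import numpy

# The elements 1, 2, 3, 4 sit at indices 0, 1, 2, 3.
elements = numpy.array([1, 2, 3, 4])

first_two = elements[0:2]   # indices 0 and 1; index 2 is excluded
middle = elements[1:3]      # indices 1 and 2; index 3 is excluded
print(first_two)
print(middle)

# A slice low:high therefore contains high - low elements.
print(len(middle))
```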

We don’t have to start slices at 0:

In [11]:
print(data[5:10, 0:10])

[[0. 0. 1. 2. 2. 4. 2. 1. 6. 4.]
 [0. 0. 2. 2. 4. 2. 2. 5. 5. 8.]
 [0. 0. 1. 2. 3. 1. 2. 3. 5. 3.]
 [0. 0. 0. 3. 1. 5. 6. 5. 5. 8.]
 [0. 1. 1. 2. 1. 3. 5. 3. 5. 8.]]


We can also omit the upper or lower boundary of the slice:

In [12]:
small = data[:3, 36:]
print('small is:')
print(small)

small is:
[[2. 3. 0. 0.]
 [1. 1. 0. 1.]
 [2. 2. 1. 1.]]
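
An omitted bound defaults to the start or the end of the axis. As a rough illustration on a made-up 3×4 array (the names `toy`, `implicit`, and `explicit` are hypothetical), the two selections below are identical:

```python
import numpy

toy = numpy.arange(12).reshape(3, 4)  # hypothetical 3x4 array

implicit = toy[:2, 2:]     # rows from the start up to 2, columns from 2 to the end
explicit = toy[0:2, 2:4]   # the same selection with every bound written out
print(implicit)
print(numpy.array_equal(implicit, explicit))

# Omitting both bounds on an axis selects everything along that axis.
everything = toy[:, :]
print(everything.shape)
```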


# Analyzing data
The NumPy library contains several useful functions for working with arrays. For example, we can calculate `data`'s [mean](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.mean.html) value:

In [13]:
print(numpy.mean(data))

6.14875


We could also calculate the [maximum](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html), [minimum](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.min.html), and [standard deviation](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.std.html) of the data:

In [14]:
print('maximum inflammation:', numpy.max(data))
print('minimum inflammation:', numpy.min(data))
print('standard deviation:', numpy.std(data))

maximum inflammation: 20.0
minimum inflammation: 0.0
standard deviation: 4.613833197118566
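
To see what `numpy.std` computes with its default settings, the result can be compared with the formula written out by hand: the square root of the mean squared deviation from the mean. The sketch below uses a made-up array named `values`, chosen so the arithmetic is easy to follow.

```python
import numpy

# Hypothetical values whose mean is exactly 5.
values = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_value = numpy.mean(values)
# Population standard deviation: sqrt of the mean squared deviation.
manual_std = numpy.sqrt(numpy.mean((values - mean_value) ** 2))

print('mean:', mean_value)
print('manual standard deviation:', manual_std)
print('numpy.std:', numpy.std(values))
```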


At any given point, we can learn more about a function by using some help magic 🪄 or tab completion.
For instance, we could look at the documentation for the `std` function:

In [15]:
help(numpy.std)

Help on function std in module numpy:

std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)
    Compute the standard deviation along the specified axis.
    
    Returns the standard deviation, a measure of the spread of a distribution,
    of the array elements. The standard deviation is computed for the
    flattened array by default, otherwise over the specified axis.
    
    Parameters
    ----------
    a : array_like
        Calculate the standard deviation of these values.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the standard deviation is computed. The
        default is to compute the standard deviation of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a standard deviation is performed over
        multiple axes, instead of a single axis or all the axes as before.
    dtype : dtype, optional
        Type to use in computing the standard deviation. 

When analyzing data we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation. 
How would we assign the data of the first patient to a variable `patient_0`?

In [16]:
patient_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)

Then calculate the maximum inflammation for this patient:

In [17]:
print(numpy.max(patient_0))

18.0


We don’t actually need to store the row in a variable of its own. We can call the function directly:

In [18]:
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))

maximum inflammation for patient 2: 19.0


Operations can be applied along rows or along columns; each of these directions is called an _axis_.
![image](../images/python-operations-across-axes.png)

By specifying an _axis_, we can get the maximum inflammation for each patient over all days (left) or the average for each day (right).

For instance, calculate the average for each day. To which graphic does this correspond?

In [19]:
print(numpy.mean(data, axis=0))

[ 0.          0.45        1.11666667  1.75        2.43333333  3.15
  3.8         3.88333333  5.23333333  5.51666667  5.95        5.9
  8.35        7.73333333  8.36666667  9.5         9.58333333 10.63333333
 11.56666667 12.35       13.25       11.96666667 11.03333333 10.16666667
 10.          8.66666667  9.15        7.25        7.33333333  6.58333333
  6.06666667  5.95        5.11666667  3.6         3.3         3.56666667
  2.48333333  1.5         1.13333333  0.56666667]


When we look at the shape of the resulting array, we get:

In [20]:
print(numpy.mean(data, axis=0).shape)

(40,)


The shape `(40,)` means the result is a one-dimensional array of 40 elements (not a two-dimensional `N×1` array). This is the average inflammation per day across all patients.
And what if we want to calculate the average inflammation per patient across all days?

In [21]:
print(numpy.mean(data, axis=1))

[5.45  5.425 6.1   5.9   5.55  6.225 5.975 6.65  6.625 6.525 6.775 5.8
 6.225 5.75  5.225 6.3   6.55  5.7   5.85  6.55  5.775 5.825 6.175 6.1
 5.8   6.425 6.05  6.025 6.175 6.55  6.175 6.35  6.725 6.125 7.075 5.725
 5.925 6.15  6.075 5.75  5.975 5.725 6.3   5.9   6.75  5.925 7.225 6.15
 5.95  6.275 5.7   6.1   6.825 5.975 6.725 5.7   6.25  6.4   7.05  5.9  ]
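
The effect of each axis is easier to verify by hand on a small array. The sketch below assumes a made-up table of 2 patients (rows) by 3 days (columns) named `toy`.

```python
import numpy

# Hypothetical data: 2 patients (rows) x 3 days (columns).
toy = numpy.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])

# axis=0 collapses the rows: one average per day (column).
per_day = numpy.mean(toy, axis=0)
# axis=1 collapses the columns: one average per patient (row).
per_patient = numpy.mean(toy, axis=1)

print('per day:', per_day, per_day.shape)
print('per patient:', per_patient, per_patient.shape)
```

In both cases the axis named in the call is the one that disappears from the result's shape.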


# Exercises
#### 1. Slicing
Use the `data` variable defined previously.

Print the first inflammation value of the first patient:

In [22]:
print(data[0,0])

0.0


Print the fifth inflammation value of the second patient:

In [23]:
print(data[1,4])

2.0


Print the first four inflammation values of the first three patients:

In [24]:
data[0:3, 0:4]

array([[0., 0., 1., 3.],
       [0., 1., 2., 1.],
       [0., 1., 1., 3.]])

***
# Key points

* Import a library into a program using `import libraryname`.
* Use the `numpy` library to work with arrays in Python.
* The expression `array.shape` gives the shape of an array.
* Use `array[x, y]` to select a single element from a 2D numpy array.
* Array indices start at `0`, not `1`.
* Use `low:high` to specify a slice that includes the indices from `low` to `high-1`.
* Use `# some kind of explanation` to add comments to programs.
* Use `numpy.mean(array)`, `numpy.max(array)`, and `numpy.min(array)` to calculate simple statistics.
* Use `numpy.mean(array, axis=0)` or `numpy.mean(array, axis=1)` to calculate statistics across the specified axis.