In [1]:
%%html
<style>
img.seventy {
  max-width:70%;
  max-height:70%;
}
img.seventyfive {
  max-width:75%;
  max-height:75%;
}
img.eighty {
  max-width:80%;
  max-height:80%;
}
img.eightyfive {
  max-width:85%;
  max-height:85%;
}
img.ninety {
  max-width:90%;
  max-height:90%;
}
</style>

# NumPy

### Questions:

- How can I process tabular data files in Python?

### Objectives

- Explain what a library is and what libraries are used for.
- Import a Python library and use the functions it contains.
- Read tabular data from a file into a program.
- Select individual values and subsections from data.
- Perform operations on arrays of data.

We can do a lot with Python's built-in functions, data types and structures, and their methods. However, some tasks require more specialised tools.

We can find these tools inside of Python libraries.


We will first learn how to add libraries to our programs and how to use the functions they contain, using the NumPy library to look at some inflammation data.

First we must import the library using the `import` statement:

```python
import numpy
```

This adds the library to our program.

In [1]:
# Import numpy


Now we will read in our data file, `inflammation-01.csv`, using NumPy's `loadtxt()` function.

Because the `loadtxt()` function is part of the library, we must use the **dot notation** to access it.

<!---
numpy.loadtxt(fname='swc-python/data/inflammation-01.csv', delimiter=',')
-->

In [None]:
# Load the dataset


```python
numpy.loadtxt(fname='swc-python/data/inflammation-01.csv', delimiter=',')
```

This function call has two parameters, `fname` and `delimiter`. Their arguments are the name of the file to read and the character used to separate values in different columns, respectively.

This will display the file's contents. You may notice that it does not display all of them. This is because the array is very large, so the middle rows and columns are replaced with ellipses (`...`).
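To see how `fname` and `delimiter` work without needing the data file, here is a minimal sketch that reads two lines of made-up comma-separated text. The `io.StringIO` wrapper and the names `csv_text` and `small_table` are assumptions for illustration; they are not part of the lesson's dataset.

```python
import io
import numpy

# Made-up CSV text standing in for a file on disk (hypothetical data)
csv_text = "0,1,2\n3,4,5\n"

# loadtxt accepts a file-like object as well as a filename
small_table = numpy.loadtxt(fname=io.StringIO(csv_text), delimiter=',')

print(small_table)
print(small_table.shape)
```

Changing the `delimiter` argument to match a different separator (for example `';'`) is all that is needed to read text that is not comma-separated.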

To work with the data later, we need to assign it to a variable, so now we'll run:

<!---
data = numpy.loadtxt(fname='swc-python/data/inflammation-01.csv', delimiter=',')

print(data)

type(data)
-->

In [2]:
# Save the data and view it


```python
data = numpy.loadtxt(fname='swc-python/data/inflammation-01.csv', delimiter=',')

print(data)

type(data)
```

The output of `type()` tells us that data is `<class 'numpy.ndarray'>`.

This is a data structure that is specific to the NumPy library, called an **N-dimensional array**.

Each row in this data type corresponds to an individual observation, in this case a patient.

Each column is a different data field, in this case an inflammation measurement for a different day.

NumPy arrays contain one or more elements of the same type.

`numpy.ndarray` is the type of `data` itself. To see the type of the information inside of `data`, we can check one of its **attributes**.


Attributes are basic information about an object, like descriptors. They are accessed in the same way as an object's methods, but do not require parentheses.

<!---
print(data.dtype)

print(data.shape)
-->

In [None]:
# View the dtype and shape attributes


```python
print(data.dtype)

print(data.shape)
```


So we can see that our `data` contains 60 rows and 40 columns of `float64`, which are floating-point numbers, or ones with decimal values.
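The same two attributes can be checked on any array. As a small sketch, the made-up array `mixed` below (a hypothetical name, not part of the inflammation data) is built from one integer and several floating-point numbers, which shows that NumPy stores every element as a single common type.

```python
import numpy

# Made-up values: one integer mixed in with floating-point numbers
mixed = numpy.array([[1, 2.5, 3.0], [4.0, 5.0, 6.0]])

# Attributes are accessed without parentheses
print(mixed.dtype)   # every element is stored as the same type
print(mixed.shape)   # (rows, columns)
```

Here the integer `1` is stored as a floating-point number so that it matches the rest of the array.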

### Indexing in arrays

In order to access a single value inside of an array, we use an indexing syntax similar to that used with lists.

Now, because our data is 2-dimensional, the index will have two parts: one for the rows, and the other for the columns.

If our data had more dimensions, the indexing could have more parts.
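As an illustration of indexing with more parts, the sketch below builds a made-up 3-dimensional array. The name `cube` and the use of `numpy.arange()` with `reshape()` are assumptions chosen only to create some example values.

```python
import numpy

# Hypothetical 3-dimensional array: 2 blocks, each with 3 rows and 4 columns
cube = numpy.arange(24).reshape(2, 3, 4)

print(cube.shape)
# One index part per dimension: block, row, column
print(cube[1, 2, 3])
```

The index `[1, 2, 3]` picks the last block, last row and last column, which holds the final value, `23`.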

<!---
# Access the value in row 1, column 1
print('first value in data:', data[0, 0])

# Access the value in row 31, column 21
print('middle value in data:', data[30, 20])
-->

<img src="intro_python_images/numpy_array_indices.png" />

In [None]:
# Accessing individual cells in data


```python
# Access the value in row 1, column 1
print('first value in data:', data[0, 0])

# Access the value in row 31, column 21
print('middle value in data:', data[30, 20])
```

Remember that Python indices start with `0`.

`data[0, 0]` is the value found in the upper-left corner of the array, because in Python, array rows are numbered from top-to-bottom, and array columns are numbered from left-to-right.

We can also select sections of data, like a subset of rows and columns:

<!---
print(data[2:7, 0:10])
-->


In [None]:
# Select sections of data


```python
print(data[2:7, 0:10])
```

If you remember the list slicing that we did, this works the same way.

We can specify a start index and an end index (which is not included in the selection) like: `start:end`, only now we can do it once for rows, and a second time for columns.

We can also omit one or both of the start/end indices.

Omitting the start will mean the selection starts at the first row or column.

Omitting the end will mean the selection goes through the last row or column.

Omitting both means the selection will include all of the rows or columns.
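These rules are easier to check on an array small enough to read in full. The following is a minimal sketch using a made-up 3x4 array; the name `grid` is hypothetical and the values are simply the numbers 0 to 11.

```python
import numpy

# Made-up 3x4 array holding the numbers 0 to 11
grid = numpy.arange(12).reshape(3, 4)
print(grid)

print(grid[:2, 2:])      # omitted row start, omitted column end
print(grid[1:, :])       # rows from index 1 onwards, every column
print(grid[:, :].shape)  # omitting everything selects the whole array
```

Comparing each printed selection against the full `grid` shows which rows and columns every slice keeps.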

You can also access attributes of a subsection of an array.

<!---
small = data[ :3, 36: ]
print('small is:')
print(small)

print(data[ : , : ].shape)
-->

In [4]:
# Accessing attributes of an array subsection


```python
small = data[ :3, 36: ]
print('small is:')
print(small)

print(data[ : , : ].shape)
```

### Analysing data

Now that we can access different portions of the array, let's do something with the data it contains.

We'll start by taking the mean (average) inflammation over all patients and all days:

<!---
print(numpy.mean(data))
-->


In [5]:
# Take the mean inflammation for all patients over all days


```python
print(numpy.mean(data))
```

This function, `numpy.mean()`, takes an array as its argument.

We can get some other descriptive information about our dataset.

<!---
maxval = numpy.max(data)
minval = numpy.min(data)
stdval = numpy.std(data)
-->

In [None]:
# Other descriptive statistics can also be calculated


```python
maxval = numpy.max(data)
minval = numpy.min(data)
stdval = numpy.std(data)
```
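The calls above store their results without displaying them. As a sketch of how the values could be inspected, the example below applies the same three functions to a made-up 2x2 array; `values`, `small_max`, `small_min` and `small_std` are hypothetical names chosen so that they do not overwrite the lesson's variables.

```python
import numpy

# Made-up 2x2 array (not the inflammation data)
values = numpy.array([[1.0, 2.0], [3.0, 4.0]])

small_max = numpy.max(values)
small_min = numpy.min(values)
small_std = numpy.std(values)

print('maximum:', small_max)
print('minimum:', small_min)
print('standard deviation:', small_std)
```

Like `numpy.mean()`, each of these functions reduces the whole array to a single number when called with only the array as its argument.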

### Calculating values for a single row or column

In our data, if we want to look at average inflammation for a single patient, or the average inflammation on day 7, we can create a temporary array of just the data of interest.

<!---
patient_0 = data[0, :]   # only the first row, but all columns
print('maximum inflammation for patient 0:', numpy.max(patient_0))
-->

In [None]:
# Calculate maximum inflammation for patient_0


```python
patient_0 = data[0, :]   # only the first row, but all columns
print('maximum inflammation for patient 0:', numpy.max(patient_0))
```

Here, we have stored the information for patient_0 in its own variable before calculating the information we want, but we don't have to do this.

<!---
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
-->

In [None]:
# Calculate maximum inflammation for patient 2 without creating a temp array


```python
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))
```
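Selecting a single day works the same way, except that the slice goes in the row position and the fixed index in the column position. The sketch below uses a made-up array named `readings` (a hypothetical stand-in for `data`) so that the result is easy to check by hand.

```python
import numpy

# Made-up readings: 3 patients (rows) by 4 days (columns)
readings = numpy.array([[0.0, 1.0, 2.0, 3.0],
                        [0.0, 2.0, 4.0, 6.0],
                        [0.0, 3.0, 6.0, 9.0]])

# All rows, but only the column at index 2
day_2 = readings[:, 2]
print('readings on day 2:', day_2)
print('average inflammation on day 2:', numpy.mean(day_2))
```

The selected column holds the values 2, 4 and 6, so the average is 4.0.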


### Calculating a value for every row or column

If we want the maximum inflammation for each patient over all days, or the average over all patients for each day, we can perform the operation across an **axis**.

An axis is one of the array's dimensions. For a 2-dimensional array like `data`, axis 0 runs vertically down the rows and axis 1 runs horizontally across the columns.

The descriptive array functions we have been using have a keyword parameter that can be used to specify an axis on which to operate.

To get the average inflammation for all patients for each day and check its shape, we can run:

<!---
print(numpy.mean(data, axis=0))
print(numpy.mean(data, axis=0).shape)
-->


In [6]:
# Calculate average inflammation for all patients for each day
# And check its shape


```python
print(numpy.mean(data, axis=0))
print(numpy.mean(data, axis=0).shape)
```

The shape is `(40,)`, which means it is a vector containing 40 elements (you may remember there were 40 columns in `data`).

<img class="eighty" src="intro_python_images/numpy_avg_infl_day.png" />

We can change the `axis` to get the average inflammation per patient across all days.

<!---
print(numpy.mean(data, axis=1))
print(numpy.mean(data, axis=1).shape)
-->

In [None]:
# Calculate average inflammation per patient across all days
# And view its shape


```python
print(numpy.mean(data, axis=1))
print(numpy.mean(data, axis=1).shape)
```

The shape is `(60,)`, which means it is a vector containing 60 elements (you may remember there were 60 rows in `data`).

<img src="intro_python_images/numpy_avg_infl_patient.png" />
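The effect of the `axis` parameter can be verified by hand on a small array. This is a minimal sketch with made-up values; the names `table`, `column_means` and `row_means` are hypothetical.

```python
import numpy

# Made-up table: 2 rows by 3 columns
table = numpy.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])

column_means = numpy.mean(table, axis=0)  # one value per column
row_means = numpy.mean(table, axis=1)     # one value per row

print(column_means, column_means.shape)
print(row_means, row_means.shape)
```

With `axis=0` the rows are collapsed, leaving 3 values (one per column). With `axis=1` the columns are collapsed, leaving 2 values (one per row).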

## Summary

- Libraries allow you to do specialised things in Python.
- To import a library, use the `import` statement.
- NumPy provides the n-dimensional array data structure.

### NumPy arrays
- NumPy arrays hold tabular data that is all of the same type.
- There are many ways to select portions of a NumPy array.
- NumPy also provides many methods to go along with the NumPy array data structure.

## Exercises

### 1. Thin Slices

For a string `element`, the expression `element[3:3]` produces an empty string, i.e., a string that contains no characters. If `data` holds our array of patient data, what does `data[3:3, 4:4]` produce? What about `data[3:3, :]`?

In [None]:
# Do Exercise 1 here


### 2. Stacking Arrays
Arrays can be concatenated and stacked on top of one another, using NumPy’s `vstack()` and `hstack()` functions for vertical and horizontal stacking, respectively.

```python
import numpy

A = numpy.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print('A = ')
print(A)

B = numpy.hstack([A, A])
print('B = ')
print(B)

C = numpy.vstack([A, A])
print('C = ')
print(C)
```

```
A =
[[1 2 3]
 [4 5 6]
 [7 8 9]]
B =
[[1 2 3 1 2 3]
 [4 5 6 4 5 6]
 [7 8 9 7 8 9]]
C =
[[1 2 3]
 [4 5 6]
 [7 8 9]
 [1 2 3]
 [4 5 6]
 [7 8 9]]
```

Write some additional code that slices the first and last columns of `A`, and stacks them into a 3x2 array. Make sure to print the results to verify your solution.

In [None]:
# Do Exercise 2 here


### 3. Change In Inflammation

#### Part 1

The patient data is longitudinal in the sense that each row represents a series of observations relating to one individual. This means that the change in inflammation over time is a meaningful concept. Let’s find out how to calculate changes in the data contained in an array with NumPy.

The `numpy.diff()` function takes an array and returns the differences between successive values. Let’s use it to examine the day-to-day changes across the first week for patient 3 from our inflammation dataset.

```python
patient3_week1 = data[3, :7]
print(patient3_week1)
```
```python
 [0. 0. 2. 0. 4. 2. 2.]
```

Calling `numpy.diff(patient3_week1)` would do the following calculations:

```python
[ 0 - 0, 2 - 0, 0 - 2, 4 - 0, 2 - 4, 2 - 2 ]
```

and return the 6 difference values in a new array.

```python
numpy.diff(patient3_week1)
array([ 0.,  2., -2.,  4., -2.,  0.])
```

Note that the array of differences is shorter by one element (length 6).


When calling `numpy.diff()` with a multi-dimensional array, an `axis` argument may be passed to the function to specify which axis to process. When applying `numpy.diff()` to our 2D inflammation array `data`, which axis would we specify?

In [None]:
# Do Exercise 3.1 here


#### Part 2

If the shape of an individual data file is `(60, 40)` (60 rows and 40 columns), what would the shape of the array be after you run the `numpy.diff()` function and why?

In [None]:
# Do Exercise 3.2 here


#### Part 3

How would you find the largest change in inflammation for each patient? Does it matter if the change in inflammation is an increase or a decrease?

In [None]:
# Do Exercise 3.3 here


### 4. Rescaling an Array

Write a function `rescale()` that takes an array as input and returns a corresponding array of values scaled to lie in the range `0.0` to `1.0`. (Hint: If `L` and `H` are the lowest and highest values in the original array, then the replacement for a value `v` should be `(v-L) / (H-L)`.)

In [None]:
# Do Exercise 4 here


### 5. Testing and Documenting Your Function

Run the commands `help(numpy.arange)` and `help(numpy.linspace)` to see how to use these functions to generate regularly-spaced values, then use those values to test your `rescale()` function. Once you’ve successfully tested your function, add a docstring that explains what it does.

In [None]:
# Do Exercise 5 here


### 6. Defining Defaults

Rewrite the `rescale()` function so that it scales data to lie between `0.0` and `1.0` by default, but will allow the caller to specify lower and upper bounds if they want. Compare your implementation to your neighbor’s: do the two functions always behave the same way?

In [None]:
# Do Exercise 6 here
