# Filtering data

Manually selecting data from an array by giving indices or ranges works if you want contiguous chunks of data, but sometimes you want to be able to grab arbitrary data from an array.

Let's explore the options for more advanced indexing of arrays.

In [1]:
import numpy as np

We'll use the following array in our examples. To make it easier to understand what's going on, the value of the digit after the decimal place is set to match the index (i.e. the value at index <u>3</u> is 10.<u>3</u>):

In [2]:
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

## Select by index

We've already seen that you can access a single element by giving its index:

In [3]:
a[5]

1.5

and that we can also pass a slice (just like in a Python list):

In [4]:
a[5:7]

array([1.5, 7.6])

To start to get more advanced than what Python lists support, we can also provide a list of element indices that we want. So, if we want elements `5` and `6` then we can pass in a list of those indices:

In [5]:
wanted_elements = [5, 6]
a[wanted_elements]

array([1.5, 7.6])

So this gave us back the item at index `5` and the item at index `6`.

The indices we pass in do not have to be contiguous; they can come from anywhere in the array:

In [6]:
wanted_elements = [4, 7]
a[wanted_elements]

array([5.4, 3.7])

They can be in any order:

In [7]:
wanted_elements = [7, 4]
a[wanted_elements]

array([3.7, 5.4])

And we can even repeat them and the matching value is repeated in the output:

In [8]:
wanted_elements = [4, 4, 7, 4]
a[wanted_elements]

array([5.4, 5.4, 3.7, 5.4])

In all these cases we have made a separate variable to hold the indices and then passed in that variable. We can do this in one step too:

In [9]:
a[[4, 4, 7, 4]]

array([5.4, 5.4, 3.7, 5.4])

The outer `[]` are the "index the array `a`" syntax, and the inner `[]` are the "make a list of these numbers" syntax.
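
The index list does not have to be a plain Python list. As a minimal sketch (the names `idx`, `picked` and `last_two` are only for illustration), an integer NumPy array works in the same way, and negative indices count back from the end just as they do with Python lists:

```python
import numpy as np

a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

# An integer NumPy array can be used as the index, just like a list
idx = np.array([4, 4, 7, 4])
picked = a[idx]
print(picked)

# Negative indices count back from the end of the array
last_two = a[[-1, -2]]
print(last_two)
```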

### Exercise

Using this indexing syntax, extract the following results:

1. `[6.  8.2 5.4 7.6 9.8]`
2. `[ 6.   2.1  8.2 10.3]`
3. `[9.8 1.5 3.7 5.4]`



In [None]:
a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

In [None]:
a[[0, 2, 4, 6, 8]]

In [None]:
a[[0, 1, 2, 3]]

In [None]:
a[[8, 5, 7, 4]]

## Filter by boolean

Let's say that we want all the values in the array that are larger than 4.0. We could do this by manually finding all those indices which match and constructing a list of them to use as a selector:

In [10]:
large_indices = [0, 2, 3, 4, 6, 8]
a[large_indices]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

Or, we can use a list of boolean values, where we set those elements we want to extract to `True` and those we want to be rid of to `False`:

In [11]:
mask = [True, False, True, True, True, False, True, False, True]
a[mask]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

These lists of `True` and `False` are referred to as *boolean arrays*, or *masks*.

With a larger array it would be tedious to create this list by hand. Luckily NumPy provides us with a way of constructing these automatically. If we want a boolean array matching the values which are greater than 4, we can use the same sort of syntax we used for multiplication, but use `>` instead:

In [12]:
a > 4

array([ True, False,  True,  True,  True, False,  True, False,  True])

Or, diagrammatically (using ■ for `True` and □ for `False`):

<div class="operation">
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td>&nbsp;2.1</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td>&nbsp;1.5</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td>&nbsp;3.7</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
<div>&gt;</div>
<div>4</div>
<span style="font-size: 200%;">⇨</span>
<div>
    <table class="array"><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr></table>
</div>
</div>

This mask can be saved to a variable and passed in as an index:

In [13]:
mask = a > 4
a[mask]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

Or, in one line:

In [14]:
a[a > 4]

array([ 6. ,  8.2, 10.3,  5.4,  7.6,  9.8])

which can be read as "select from `a` the elements where `a` is greater than 4":

<div class="operation">
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td>&nbsp;2.1</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td>&nbsp;1.5</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td>&nbsp;3.7</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
<div>[</div>
<div>
    <table class="array"><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr></table>
</div>
<div>]</div>
<span style="font-size: 200%;">⇨</span>
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td class="empty">&nbsp;</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td class="empty">&nbsp;</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td class="empty">&nbsp;</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
<span style="font-size: 200%;">⇨</span>
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
</div>
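
Because a mask is an ordinary NumPy array, it can be inspected before it is used as an index. The following is a small sketch rather than part of the original walkthrough; `n_selected` is an illustrative name, and `np.count_nonzero` is used here to count the `True` entries:

```python
import numpy as np

a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

mask = a > 4

# The mask is itself a NumPy array of booleans with the same shape as `a`
print(mask.dtype, mask.shape)

# Counting the True entries tells us how many values will be selected
n_selected = np.count_nonzero(mask)
print(n_selected, len(a[mask]))
```

The two printed counts should agree, since each `True` in the mask selects exactly one element.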

### Exercise

Extract all the values in `a` which are:
1. Less than 5
2. Greater than or equal to 8.2
3. Equal to 6.0


In [1]:
import numpy as np

a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

In [2]:
a[a < 5]

array([2.1, 1.5, 3.7])

In [3]:
a[a >= 8.2]

array([ 8.2, 10.3,  9.8])

In [4]:
a[a == 6.0]

array([6.])

## Setting from filters

Just like at the beginning of the course when we set values in an array with:
```python
a[4] = 99.4
```

We can also use the return value of any filter to set values. For example, if we want to set all values greater than 4 to 0, we can do:

In [15]:
a[a > 4] = 0
a

array([0. , 2.1, 0. , 0. , 0. , 1.5, 0. , 3.7, 0. ])

<div class="operation">
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td>&nbsp;2.1</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td>&nbsp;1.5</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td>&nbsp;3.7</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
<div>[</div>
<div>
    <table class="array"><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr><tr><td>□</td></tr><tr><td>■</td></tr></table>
</div>
<div>]</div>
<div>=</div>
<div>0</div>
<span style="font-size: 200%;">⇨</span>
<div>
    <table class="array"><tr><td>&nbsp;6.0</td></tr><tr><td class="empty">&nbsp;2.1</td></tr><tr><td>&nbsp;8.2</td></tr><tr><td>10.3</td></tr><tr><td>&nbsp;5.4</td></tr><tr><td class="empty">&nbsp;1.5</td></tr><tr><td>&nbsp;7.6</td></tr><tr><td class="empty">&nbsp;3.7</td></tr><tr><td>&nbsp;9.8</td></tr></table>
</div>
<div>=</div>
<div>0</div>
<span style="font-size: 200%;">⇨</span>
<div>
    <table class="array"><tr><td>0.0</td></tr><tr><td>2.1</td></tr><tr><td>0.0</td></tr><tr><td>0.0</td></tr><tr><td>0.0</td></tr><tr><td>1.5</td></tr><tr><td>0.0</td></tr><tr><td>3.7</td></tr><tr><td>0.0</td></tr></table>
</div>
</div>
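
Setting through a filter changes the array in place. If the original values are still needed, one option is to work on a copy. This is a minimal sketch; the name `clipped` is hypothetical:

```python
import numpy as np

a = np.array([6.0, 2.1, 8.2, 10.3, 5.4, 1.5, 7.6, 3.7, 9.8])

# Work on a copy so that `a` keeps its original values
clipped = a.copy()
clipped[clipped > 4] = 0

print(clipped)  # values above 4 replaced with 0
print(a)        # unchanged
```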

## Filtering multi-dimensional data

The same boolean filtering works on arrays with more than one dimension. Let's try it on a small two-dimensional grid:

In [16]:
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [17]:
grid[grid > 4]

array([5, 6, 7, 8, 9])

<div class="operation">
<div>
<table class="array">
<tbody>
  <tr>
    <td>1</td><td>2</td><td>3</td>
  </tr>
  <tr>
    <td>4</td><td>5</td><td>6</td>
  </tr>
  <tr>
    <td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
<div>[</div>
<div>
<table class="array">
<tbody>
  <tr>
    <td>□</td><td>□</td><td>□</td>
  </tr>
  <tr>
    <td>□</td><td>■</td><td>■</td>
  </tr>
  <tr>
    <td>■</td><td>■</td><td>■</td>
  </tr>
</tbody>
</table>
</div>
<div>]</div>
<span style="font-size: 200%;">⇨</span>
<div>
<table class="array">
<tbody>
  <tr>
    <td>5</td><td>6</td><td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
</div>

In this case, it has lost the information about *where* the numbers have come from. The dimensionality has been lost.

NumPy doesn't have much choice here: if it were to return the result with the same shape as the original array `grid`, what should it put in the spaces that we've filtered out?

```
[[? ? ?]
 [? 5 6]
 [7 8 9]]
```

You might think that it could fill those with `0` or `-1`, but either could very easily cause a problem in the code that follows. NumPy doesn't take a stance on this, as doing so would be dangerous.
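
To see the loss of dimensionality concretely, we can compare the shapes before and after filtering. As an aside that is not part of the original walkthrough, `np.argwhere` is one way to recover the positions of the selected values; the names `filtered` and `positions` are illustrative:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

filtered = grid[grid > 4]
print(grid.shape)      # (3, 3)
print(filtered.shape)  # (5,)

# np.argwhere gives the (row, column) position of each True entry in the mask
positions = np.argwhere(grid > 4)
print(positions)
```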

In your code, you know what you're doing with your data, so it's ok for you to decide on a case-by-case basis. If you decide that you want to keep the original shape, but replace any filtered-out values with a `0` then you can use NumPy's [`where`](https://numpy.org/doc/stable/reference/generated/numpy.where.html) function. It takes three arguments:
1. the boolean array selecting the values,
2. an array or value to use in the spots you're keeping, and
3. an array or value to use in the spots you're filtering out.

So, in the case where we want to replace any values less than or equal to 4 with `0`, we can use:

In [18]:
np.where(grid > 4, grid, 0)

array([[0, 0, 0],
       [0, 5, 6],
       [7, 8, 9]])

<div class="operation">
<div>np.where(</div>
<div>
<table class="array">
<tbody>
  <tr>
    <td>□</td><td>□</td><td>□</td>
  </tr>
  <tr>
    <td>□</td><td>■</td><td>■</td>
  </tr>
  <tr>
    <td>■</td><td>■</td><td>■</td>
  </tr>
</tbody>
</table>
</div>
<div>,</div>
<div>
<table class="array">
<tbody>
  <tr>
    <td>1</td><td>2</td><td>3</td>
  </tr>
  <tr>
    <td>4</td><td>5</td><td>6</td>
  </tr>
  <tr>
    <td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
<div>,</div>
<div>0</div>
<div>)</div>
<span style="font-size: 200%;">⇨</span>
<div>
<table class="array">
<tbody>
  <tr>
    <td>0</td><td>0</td><td>0</td>
  </tr>
  <tr>
    <td>0</td><td>5</td><td>6</td>
  </tr>
  <tr>
    <td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
</div>

Note that this has not affected the original array:

In [19]:
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
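
The second and third arguments to `np.where` can both be arrays, not just single values. As an illustrative sketch (negating the filtered-out values is an arbitrary choice made only for this example), each output element is taken from one array or the other depending on the mask:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Keep values greater than 4; elsewhere use the negated value from `grid`
result = np.where(grid > 4, grid, -grid)
print(result)
```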

One final way to reduce your data while keeping the dimensionality is to use [masked arrays](https://numpy.org/doc/stable/reference/maskedarray.html). This is useful in situations where you have *missing data*. The advantage of masked arrays is that operations like averaging are not affected by the cells that are masked out. The downside is that for much larger arrays they will use more memory and can be slower to operate on.

In [20]:
masked_grid = np.ma.masked_array(grid, grid <= 4)
print(masked_grid)

[[-- -- --]
 [-- 5 6]
 [7 8 9]]


<div class="operation">
<div>np.ma.masked_array(</div>
<div>
<table class="array">
<tbody>
  <tr>
    <td>1</td><td>2</td><td>3</td>
  </tr>
  <tr>
    <td>4</td><td>5</td><td>6</td>
  </tr>
  <tr>
    <td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
<div>,</div>
<div>
<table class="array">
<tbody>
  <tr>
    <td>■</td><td>■</td><td>■</td>
  </tr>
  <tr>
    <td>■</td><td>□</td><td>□</td>
  </tr>
  <tr>
    <td>□</td><td>□</td><td>□</td>
  </tr>
</tbody>
</table>
</div>
<div>)</div>
<span style="font-size: 200%;">⇨</span>
<div>
<table class="array">
<tbody>
  <tr>
    <td class="empty">&nbsp;</td><td class="empty">&nbsp;</td><td class="empty">&nbsp;</td>
  </tr>
  <tr>
    <td class="empty">&nbsp;</td><td>5</td><td>6</td>
  </tr>
  <tr>
    <td>7</td><td>8</td><td>9</td>
  </tr>
</tbody>
</table>
</div>
</div>

In [21]:
np.mean(masked_grid)

7.0
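
To make the difference between the three approaches concrete, the sketch below computes the mean of the same filtered `grid` in each way. The variable names are illustrative only:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Filling with 0 keeps the shape, but the zeros are counted in the mean
zero_filled_mean = np.mean(np.where(grid > 4, grid, 0))

# Masked-out cells are ignored, so only 5, 6, 7, 8 and 9 are averaged
masked_mean = np.mean(np.ma.masked_array(grid, grid <= 4))

# Boolean indexing drops the unwanted values altogether
indexed_mean = np.mean(grid[grid > 4])

print(zero_filled_mean, masked_mean, indexed_mean)
```

The zero-filled mean is 35/9 (about 3.9) because all nine cells are counted, while the other two both give 7.0.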

### Exercise

The `rain` data set represents the prediction of metres of rainfall in an area and is based on the data from [ECMWF](https://apps.ecmwf.int/codes/grib/param-db/?id=228). It is two-dimensional with axes of latitude and longitude.

```python
with np.load("weather_data.npz") as weather:
    rain = weather["rain"]
    uk_mask = weather["uk"]
    irl_mask = weather["ireland"]
    spain_mask = weather["spain"]
```

- Calculate the mean of the entire 2D `rain` data set.
- Look at the `uk_mask` array, including its `dtype` and `shape`
- Filter the `rain` data set to contain only those values from within the UK.
  - Does `[]` indexing, `np.where` or `masked_arrays` make the most sense for this task?
- Calculate the mean (and maximum and minimum if you like) of the data
- Do the same with Ireland and Spain and compare the numbers



In [1]:
import numpy as np

with np.load("weather_data.npz") as weather:
    rain = weather["rain"]
    uk_mask = weather["uk"]
    irl_mask = weather["ireland"]
    spain_mask = weather["spain"]

In [2]:
np.mean(rain)

0.003113156467013889

This is very small since it's in metres, so let's convert it into mm:

In [3]:
rain *= 1000

In [4]:
np.mean(rain)

3.113156467013889

In [5]:
uk_mask.dtype

dtype('bool')

In [6]:
uk_mask.shape

(75, 75)

This is the same as the `rain` array:

In [7]:
rain.shape

(75, 75)

Which filtering method should we use?
- Using `[]` will lose the original shape, but since we're averaging the whole thing that doesn't matter
- Using `np.where` will be tricky, because if we fill the masked-out areas with `0` then it will skew the mean
- Using `masked_array` would work fine

For questions like this, using `[]` is often simplest:

In [8]:
uk_rain = rain[uk_mask]

In [9]:
np.mean(uk_rain)

7.5836526862097005

In [10]:
np.mean(rain[irl_mask])

4.611112633529975

In [11]:
np.mean(rain[spain_mask])

0.8478509374411709

The UK has the heaviest rain, followed by Ireland, followed by Spain.
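
The exercise also suggests looking at the minimum and maximum. Since the same steps are repeated for each country, they could be wrapped in a small helper. This is only a sketch: `summarise` is a hypothetical function, and the arrays below are made-up stand-ins, not the real `weather_data.npz` contents:

```python
import numpy as np

# Made-up stand-ins for `rain` and a country mask, NOT the real weather data
rain = np.array([[0.5, 1.0, 2.0],
                 [3.0, 4.0, 5.0],
                 [6.0, 7.0, 8.0]])
region_mask = np.array([[False, True, True],
                        [False, True, False],
                        [False, False, False]])

def summarise(data, mask):
    """Return (mean, minimum, maximum) of the values selected by `mask`."""
    selected = data[mask]
    return np.mean(selected), np.min(selected), np.max(selected)

mean, lowest, highest = summarise(rain, region_mask)
print(mean, lowest, highest)
```

With the real data, the same helper could be called as `summarise(rain, uk_mask)`, `summarise(rain, irl_mask)` and `summarise(rain, spain_mask)`.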