<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

<!--NAVIGATION-->
< [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) | [Contents](Index.ipynb) | [Fancy Indexing](02.07-Fancy-Indexing.ipynb) >

<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


# Comparisons, Masks, and Boolean Logic

This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

In [1]:
import numpy as np

In [2]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]:
mask = arr < 5
mask

array([ True,  True,  True,  True,  True, False, False, False, False,
       False])

In [12]:
np.nonzero(arr < 5)

(array([0, 1, 2, 3, 4]),)

In [11]:
(arr < 5).sum()

5

In [8]:
mask.all()

False

In [7]:
mask.any()

True

In [13]:
age = 28

In [14]:
18 <= age < 90

True

In [15]:
(18 <= age) and (age < 90)

True

In [18]:
1 < arr

array([False, False,  True,  True,  True,  True,  True,  True,  True,
        True])

In [19]:
arr < 5

array([ True,  True,  True,  True,  True, False, False, False, False,
       False])

In [26]:
def super_long_computation():
    while True:
        pass

In [30]:
3 & 8

0

In [28]:
("hello") or (super_long_computation())

'hello'

In [21]:
# (1 < arr) and (arr < 5)  # Failing

In [22]:
(1 < arr) & (arr < 5)

array([False, False,  True,  True,  True, False, False, False, False,
       False])

In [10]:
if mask.all():
    print("All the elements are lower than five")
elif mask.any():
    print("At least one element is lower than five")
else:
    print("None of the elements is lower than five")

At least one element is lower than five


## Example: Counting Rainy Days

Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city.
For example, here we'll load the daily rainfall statistics for the city of Seattle in 2014, using Pandas (which is covered in more detail in [Chapter 3](03.00-Introduction-to-Pandas.ipynb)):

In [31]:
import pandas as pd

In [145]:
# df is created by the pd.read_csv cell further down (In [151]); run that cell first
type(df["PRCP"].head())

pandas.core.series.Series

In [147]:
col = df["PRCP"]
col.head()

DATE
20140101     0
20140102    41
20140103    15
20140104     0
20140105     0
Name: PRCP, dtype: int64

In [150]:
col.index

Int64Index([20140101, 20140102, 20140103, 20140104, 20140105, 20140106,
            20140107, 20140108, 20140109, 20140110,
            ...
            20141222, 20141223, 20141224, 20141225, 20141226, 20141227,
            20141228, 20141229, 20141230, 20141231],
           dtype='int64', name='DATE', length=365)

In [148]:
col.dtype

dtype('int64')

In [149]:
col.values

array([  0,  41,  15,   0,   0,   3, 122,  97,  58,  43, 213,  15,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   5,   0,   0,   0,   0,
         0,  89, 216,   0,  23,  20,   0,   0,   0,   0,   0,   0,  51,
         5, 183, 170,  46,  18,  94, 117, 264, 145, 152,  10,  30,  28,
        25,  61, 130,   3,   0,   0,   0,   5, 191, 107, 165, 467,  30,
         0, 323,  43, 188,   0,   0,   5,  69,  81, 277,   3,   0,   5,
         0,   0,   0,   0,   0,  41,  36,   3, 221, 140,   0,   0,   0,
         0,  25,   0,  46,   0,   0,  46,   0,   0,   0,   0,   0,   0,
         5, 109, 185,   0, 137,   0,  51, 142,  89, 124,   0,  33,  69,
         0,   0,   0,   0,   0, 333, 160,  51,   0,   0, 137,  20,   5,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  38,
         0,  56,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,  18,  64,   0,   5,  36,  13,   0,
         8,   3,   0,   0,   0,   0,   0,   0,  18,  23,   0,   

In [154]:
df.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10',
               ...
               '2014-12-22', '2014-12-23', '2014-12-24', '2014-12-25',
               '2014-12-26', '2014-12-27', '2014-12-28', '2014-12-29',
               '2014-12-30', '2014-12-31'],
              dtype='datetime64[ns]', name='DATE', length=365, freq=None)

In [151]:
df = pd.read_csv("data/Seattle2014.csv", index_col="DATE",
                 parse_dates=["DATE"])
df.head()

Unnamed: 0_level_0,STATION,STATION_NAME,PRCP,SNWD,SNOW,TMAX,TMIN,AWND,WDF2,WDF5,WSF2,WSF5,WT01,WT05,WT02,WT03
DATE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
2014-01-01,GHCND:USW00024233,SEATTLE TACOMA INTERNATIONAL AIRPORT WA US,0,0,0,72,33,12,340,310,36,40,-9999,-9999,-9999,-9999
2014-01-02,GHCND:USW00024233,SEATTLE TACOMA INTERNATIONAL AIRPORT WA US,41,0,0,106,61,32,190,200,94,116,-9999,-9999,-9999,-9999
2014-01-03,GHCND:USW00024233,SEATTLE TACOMA INTERNATIONAL AIRPORT WA US,15,0,0,89,28,26,30,50,63,72,1,-9999,-9999,-9999
2014-01-04,GHCND:USW00024233,SEATTLE TACOMA INTERNATIONAL AIRPORT WA US,0,0,0,78,6,27,40,40,45,58,1,-9999,-9999,-9999
2014-01-05,GHCND:USW00024233,SEATTLE TACOMA INTERNATIONAL AIRPORT WA US,0,0,0,83,-5,37,10,10,67,76,-9999,-9999,-9999,-9999


In [46]:
#pd.read_csv("data/Seattle2014.csv").loc[:, "STATION_NAME":"PRCP"]

In [51]:
rain = pd.read_csv("data/Seattle2014.csv")["PRCP"].values
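rain = rain / 254  # from 1/10 mm to inches; creates a new float64 array
# rain /= 254  # NOT the same: in-place division fails because floats cannot be stored in the int64 array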
rain

array([0.        , 0.16141732, 0.05905512, 0.        , 0.        ,
       0.01181102, 0.48031496, 0.38188976, 0.22834646, 0.16929134,
       0.83858268, 0.05905512, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.01968504, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.3503937 , 0.8503937 , 0.        ,
       0.09055118, 0.07874016, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.2007874 , 0.01968504,
       0.72047244, 0.66929134, 0.18110236, 0.07086614, 0.37007874,
       0.46062992, 1.03937008, 0.57086614, 0.5984252 , 0.03937008,
       0.11811024, 0.11023622, 0.0984252 , 0.24015748, 0.51181102,
       0.01181102, 0.        , 0.        , 0.        , 0.01968504,
       0.7519685 , 0.42125984, 0.6496063 , 1.83858268, 0.11811024,
       0.        , 1.27165354, 0.16929134, 0.74015748, 0.        ,
       0.        , 0.01968504, 0.27165354, 0.31889764, 1.09055

In [56]:
np.median(rain)

0.0

In [52]:
rain.mean()

0.1329737892352497

In [63]:
%timeit sum(rain > 0)

810 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [61]:
%timeit np.sum(rain > 0)

6.27 µs ± 307 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [62]:
%timeit (rain > 0).sum()

5.2 µs ± 511 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [66]:
np.count_nonzero(rain)

150

In [107]:
arr = np.random.randn(10)
arr

array([-0.21527354, -1.32475848,  0.14524345, -0.53353842,  0.26545778,
       -0.50458435,  3.06672074,  0.59312799, -0.35556486,  1.14988032])

In [108]:
a2d = arr.reshape(5, 2)
a2d

array([[-0.21527354, -1.32475848],
       [ 0.14524345, -0.53353842],
       [ 0.26545778, -0.50458435],
       [ 3.06672074,  0.59312799],
       [-0.35556486,  1.14988032]])

In [109]:
subset = a2d[2:5, 0].copy()
subset

array([ 0.26545778,  3.06672074, -0.35556486])

In [110]:
subset[0] = 99
subset

array([99.        ,  3.06672074, -0.35556486])

In [111]:
a2d

array([[-0.21527354, -1.32475848],
       [ 0.14524345, -0.53353842],
       [ 0.26545778, -0.50458435],
       [ 3.06672074,  0.59312799],
       [-0.35556486,  1.14988032]])

In [90]:
# a2d[a2d[:, 0] < 0, 0] = np.array([0, 1])  # Fails

In [91]:
a2d[a2d[:, 0] < 0, 1] = 99

In [92]:
a2d

array([[-1.45175624e+00,  9.90000000e+01],
       [-2.88553564e-03,  9.90000000e+01],
       [ 8.41397583e-01,  6.96313861e-03],
       [-1.00337620e-01,  9.90000000e+01],
       [ 1.19863719e+00,  1.79976103e-01]])

In [71]:
arr[arr < 0]

array([-1.45175624, -0.00288554, -1.78320579, -0.10033762])

In [69]:
arr[np.array([
    True, False, False, False, False, False, False, False, False, True
])]

array([-1.45175624,  0.1799761 ])

In [115]:
rain[rain > 0]

array([0.16141732, 0.05905512, 0.01181102, 0.48031496, 0.38188976,
       0.22834646, 0.16929134, 0.83858268, 0.05905512, 0.01968504,
       0.3503937 , 0.8503937 , 0.09055118, 0.07874016, 0.2007874 ,
       0.01968504, 0.72047244, 0.66929134, 0.18110236, 0.07086614,
       0.37007874, 0.46062992, 1.03937008, 0.57086614, 0.5984252 ,
       0.03937008, 0.11811024, 0.11023622, 0.0984252 , 0.24015748,
       0.51181102, 0.01181102, 0.01968504, 0.7519685 , 0.42125984,
       0.6496063 , 1.83858268, 0.11811024, 1.27165354, 0.16929134,
       0.74015748, 0.01968504, 0.27165354, 0.31889764, 1.09055118,
       0.01181102, 0.01968504, 0.16141732, 0.14173228, 0.01181102,
       0.87007874, 0.5511811 , 0.0984252 , 0.18110236, 0.18110236,
       0.01968504, 0.42913386, 0.72834646, 0.53937008, 0.2007874 ,
       0.55905512, 0.3503937 , 0.48818898, 0.12992126, 0.27165354,
       1.31102362, 0.62992126, 0.2007874 , 0.53937008, 0.07874016,
       0.01968504, 0.1496063 , 0.22047244, 0.07086614, 0.25196

In [114]:
np.median(rain[rain > 0])

0.19488188976377951

In [118]:
arr

array([-0.21527354, -1.32475848,  0.14524345, -0.53353842,  0.26545778,
       -0.50458435,  3.06672074,  0.59312799, -0.35556486,  1.14988032])

In [124]:
a2d

array([[-0.21527354, -1.32475848],
       [ 0.14524345, -0.53353842],
       [ 0.26545778, -0.50458435],
       [ 3.06672074,  0.59312799],
       [-0.35556486,  1.14988032]])

In [129]:
a2d[[0, 3]] 

array([[-0.21527354, -1.32475848],
       [ 3.06672074,  0.59312799]])

In [133]:
a2d[[0, 3], [1, 1]] 

array([-1.32475848,  0.59312799])

In [119]:
subset = arr[[0, 3, 7]]
subset

array([-0.21527354, -0.53353842,  0.59312799])

In [120]:
subset[0] = 99
subset

array([99.        , -0.53353842,  0.59312799])

In [121]:
arr

array([-0.21527354, -1.32475848,  0.14524345, -0.53353842,  0.26545778,
       -0.50458435,  3.06672074,  0.59312799, -0.35556486,  1.14988032])

In [None]:
# use pandas to extract rainfall inches as a NumPy array

The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.

As a first quick visualization, let's look at the histogram of rainy days, which was generated using Matplotlib (we will explore this tool more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)):

This histogram gives us a general idea of what the data looks like: despite its reputation, the vast majority of days in Seattle saw near zero measured rainfall in 2014.
But this doesn't do a good job of conveying some information we'd like to see: for example, how many rainy days were there in the year? What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain?

### Digging into the data

One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of time writing code and time computing the result.
We saw in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) that NumPy's ufuncs can be used in place of loops to do fast element-wise arithmetic operations on arrays; in the same way, we can use other ufuncs to do element-wise *comparisons* over arrays, and we can then manipulate the results to answer the questions we have.
We'll leave the data aside for right now, and discuss some general tools in NumPy to use *masking* to quickly answer these types of questions.

## Comparison Operators as ufuncs

In [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) we introduced ufuncs, and focused in particular on arithmetic operators. We saw that using ``+``, ``-``, ``*``, ``/``, and others on arrays leads to element-wise operations.
NumPy also implements comparison operators such as ``<`` (less than) and ``>`` (greater than) as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
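As a minimal sketch of the six operators, assuming a small one-dimensional array `x` whose values are chosen here purely for illustration:

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration
x = np.array([1, 2, 3, 4, 5])

less = x < 3            # array([ True,  True, False, False, False])
greater = x > 3         # array([False, False, False,  True,  True])
less_equal = x <= 3     # array([ True,  True,  True, False, False])
greater_equal = x >= 3  # array([False, False,  True,  True,  True])
not_equal = x != 3      # array([ True,  True, False,  True,  True])
equal = x == 3          # array([False, False,  True, False, False])
```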

It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:
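For instance, the following sketch compares two arrays computed from the same illustrative `x` (the values are an assumption, not data from this notebook):

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration
x = np.array([1, 2, 3, 4, 5])

# Element-wise comparison of two computed arrays:
# 2 * x  -> [2, 4, 6, 8, 10]
# x ** 2 -> [1, 4, 9, 16, 25]
same = (2 * x) == (x ** 2)  # array([False,  True, False, False, False])
```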

As in the case of arithmetic operators, the comparison operators are implemented as ufuncs in NumPy; for example, when you write ``x < 3``, internally NumPy uses ``np.less(x, 3)``.
A summary of the comparison operators and their equivalent ufunc is shown here:

| Operator	    | Equivalent ufunc    || Operator	   | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``==``         |``np.equal``         ||``!=``         |``np.not_equal``     |
|``<``          |``np.less``          ||``<=``         |``np.less_equal``    |
|``>``          |``np.greater``       ||``>=``         |``np.greater_equal`` |
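A quick way to convince yourself of the operator/ufunc correspondence is to compare the two spellings directly. The array below is an illustrative assumption:

```python
import numpy as np

# Hypothetical sample array, chosen only for illustration
x = np.array([1, 2, 3, 4, 5])

via_operator = x < 3        # operator form
via_ufunc = np.less(x, 3)   # explicit ufunc form

# Both spellings produce the same Boolean array
agree = np.array_equal(via_operator, via_ufunc)  # True
```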

Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape.
Here is a two-dimensional example:
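The sketch below uses a small 3×4 integer array as a stand-in for `x`. Its values are an assumption, written out literally so the example is reproducible:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 6
# array([[ True,  True,  True,  True],
#        [False, False,  True,  True],
#        [ True,  True, False, False]])
```

The mask has the same shape as `x`, with one Boolean per element.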

In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.

## Working with Boolean Arrays

Given a Boolean array, there are a host of useful operations you can do.
We'll work with ``x``, the two-dimensional array we created earlier.

### Counting entries

To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:

In [None]:
# how many values less than 6?
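One way to fill in this cell, sketched with a literal stand-in for `x` (the values are an assumption, chosen to be consistent with the count quoted in the text):

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# how many values less than 6?
count = np.count_nonzero(x < 6)  # 8
```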

We see that there are eight array entries that are less than 6.
Another way to get at this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``:
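A short sketch of the `np.sum` variant, again on an assumed stand-in for `x`:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# True counts as 1 and False as 0, so the sum is the number of True entries
total = np.sum(x < 6)  # 8
```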

The benefit of ``np.sum`` is that, as with other NumPy aggregation functions, the summation can be done along rows or columns as well:

In [None]:
# how many values less than 6 in each row?
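A possible way to fill in this cell, using an assumed stand-in for `x`. Passing `axis=1` sums across the columns of each row:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# how many values less than 6 in each row?
per_row = np.sum(x < 6, axis=1)  # array([4, 2, 2])
```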

This counts the number of values less than 6 in each row of the matrix.

If we're interested in quickly checking whether any or all the values are true, we can use (you guessed it) ``np.any`` or ``np.all``:

In [None]:
# are there any values greater than 8?

In [None]:
# are there any values less than zero?

In [None]:
# are all values less than 10?

In [None]:
# are all values equal to 6?
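The four questions in the cells above could be answered along these lines. The array is an assumed stand-in for `x`, and the result names are hypothetical:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

any_above_8 = np.any(x > 8)    # True  (the array contains a 9)
any_negative = np.any(x < 0)   # False
all_below_10 = np.all(x < 10)  # True
all_equal_6 = np.all(x == 6)   # False
```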

``np.all`` and ``np.any`` can be used along particular axes as well. For example:

In [None]:
# are all values in each row less than 8?
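A sketch of the per-row check, with an assumed stand-in for `x` whose values are consistent with the row-by-row result described in the text:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# are all values in each row less than 8?
rows_below_8 = np.all(x < 8, axis=1)  # array([ True, False,  True])
```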

Here all the elements in the first and third rows are less than 8, while this is not the case for the second row.

Finally, a quick warning: as mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb), Python has built-in ``sum()``, ``any()``, and ``all()`` functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using ``np.sum()``, ``np.any()``, and ``np.all()`` for these examples!
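To make the warning concrete, here is a small sketch on an assumed two-dimensional mask. The built-in `sum` silently adds the rows together, and the built-in `any` raises an error because it tries to evaluate the truth of a whole row:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])
mask = x < 6

builtin_total = sum(mask)    # iterates over rows: array([2, 2, 2, 2])
numpy_total = np.sum(mask)   # counts every True entry: 8

try:
    any(mask)                # bool() of a whole row is ambiguous
    builtin_any_error = None
except ValueError as err:
    builtin_any_error = str(err)
```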

### Boolean operators

We've already seen how we might count, say, all days with rain less than one inch, or all days with rain greater than half an inch.
But what if we want to know about all days with rain less than one inch *and* greater than half an inch?
This is accomplished through Python's *bitwise logic operators*, ``&``, ``|``, ``^``, and ``~``.
As with the standard arithmetic operators, NumPy overloads these as ufuncs which work element-wise on (usually Boolean) arrays.

For example, we can address this sort of compound question as follows:

So we see that there are 29 days with rainfall between 0.5 and 1.0 inches.
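The shape of such a compound expression can be sketched on a small made-up array. The `inches` values below are hypothetical and are not the Seattle data, so the count differs from the one quoted above:

```python
import numpy as np

# Made-up daily rainfall in inches (NOT the Seattle data)
inches = np.array([0.0, 0.1, 0.6, 0.75, 1.2, 0.0, 0.9, 0.5])

# Each comparison is parenthesized, then combined element-wise with &
between = (inches > 0.5) & (inches < 1)
count = np.sum(between)  # 3 (the values 0.6, 0.75, and 0.9)
```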

Note that the parentheses here are important. Because of operator precedence rules, with the parentheses removed this expression would be evaluated as follows, which results in an error:

``` python
inches > (0.5 & inches) < 1
```

Using the equivalence of *A AND B* and *NOT (NOT A OR NOT B)* (which you may remember if you've taken an introductory logic course), we can compute the same result in a different manner:
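A sketch of that equivalence on a small made-up array (the `inches` values are hypothetical, not the Seattle data):

```python
import numpy as np

# Made-up daily rainfall in inches (NOT the Seattle data)
inches = np.array([0.0, 0.1, 0.6, 0.75, 1.2, 0.0, 0.9, 0.5])

# A AND B
direct = np.sum((inches > 0.5) & (inches < 1))

# NOT (NOT A OR NOT B): the negation of "> 0.5" is "<= 0.5", and of "< 1" is ">= 1"
de_morgan = np.sum(~((inches <= 0.5) | (inches >= 1)))

# Both formulations count the same days (3 for this array)
```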

Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.

The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:

| Operator	    | Equivalent ufunc    || Operator	    | Equivalent ufunc    |
|---------------|---------------------||---------------|---------------------|
|``&``          |``np.bitwise_and``   ||&#124;         |``np.bitwise_or``    |
|``^``          |``np.bitwise_xor``   ||``~``          |``np.bitwise_not``   |
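The correspondence in the table can be checked directly. The two small Boolean arrays here are illustrative assumptions covering all four input combinations:

```python
import numpy as np

# Hypothetical Boolean arrays covering every combination of inputs
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

and_result = np.bitwise_and(a, b)  # same as a & b -> [ True, False, False, False]
or_result = np.bitwise_or(a, b)    # same as a | b -> [ True,  True,  True, False]
xor_result = np.bitwise_xor(a, b)  # same as a ^ b -> [False,  True,  True, False]
not_result = np.bitwise_not(a)     # same as ~a    -> [False, False,  True,  True]
```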

Using these tools, we might start to answer the types of questions we have about our weather data.
Here are some examples of results we can compute when combining masking with aggregations:
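The kinds of summaries meant here might look like the following sketch. The `inches` array is made up for illustration and is not the Seattle data, and the result names are hypothetical:

```python
import numpy as np

# Made-up daily rainfall in inches (NOT the Seattle data)
inches = np.array([0.0, 0.1, 0.6, 0.75, 1.2, 0.0, 0.9, 0.5])

days_without_rain = np.sum(inches == 0)                # 2
days_with_rain = np.sum(inches != 0)                   # 6
days_over_half_inch = np.sum(inches > 0.5)             # 4
light_rain_days = np.sum((inches > 0) & (inches < 0.2))  # 1

print("Number of days without rain:   ", days_without_rain)
print("Number of days with rain:      ", days_with_rain)
print("Days with more than 0.5 inches:", days_over_half_inch)
print("Rainy days with < 0.2 inches:  ", light_rain_days)
```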

## Boolean Arrays as Masks

In the preceding section we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our ``x`` array from before, suppose we want an array of all values in the array that are less than, say, 5:

We can obtain a Boolean array for this condition easily, as we've already seen:

Now to *select* these values from the array, we can simply index on this Boolean array; this is known as a *masking* operation:
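A minimal sketch of both steps, building the mask and then indexing with it, using an assumed stand-in for `x`:

```python
import numpy as np

# Stand-in 3x4 array; the values are assumed for illustration
x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

mask = x < 5        # Boolean array with the same shape as x
selected = x[mask]  # array([0, 3, 3, 3, 2, 4])
```

The selected values come out in row-major order, and the result is a new array rather than a view of `x`.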

What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``.

We are then free to operate on these values as we wish.
For example, we can compute some relevant statistics on our Seattle rain data:

In [None]:
# construct a mask of all rainy days
# construct a mask of all summer days (June 21st is the 172nd day)
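One way these masks could be built and used is sketched below. Because the example must be self-contained, `inches` is a synthetic stand-in (rain on every third day) rather than the Seattle data, and the 90-day window after day 172 used for "summer" is an assumption:

```python
import numpy as np

# Synthetic stand-in for 365 daily rainfall values in inches:
# 0.4 inches on every third day, dry otherwise (NOT the Seattle data)
inches = np.where(np.arange(365) % 3 == 0, 0.4, 0.0)

# construct a mask of all rainy days
rainy = inches > 0

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

median_rainy = np.median(inches[rainy])
median_summer = np.median(inches[summer])
max_summer = np.max(inches[summer])
median_non_summer_rainy = np.median(inches[rainy & ~summer])
```

With the real data loaded into `inches`, the same four expressions give the median precipitation on rainy days, the median and maximum on summer days, and the median on rainy days outside summer.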

By combining Boolean operations, masking operations, and aggregates, we can very quickly answer these sorts of questions for our dataset.

## Aside: Using the Keywords and/or Versus the Operators &/|

One common point of confusion is the difference between the keywords ``and`` and ``or`` on one hand, and the operators ``&`` and ``|`` on the other hand.
When would you use one versus the other?

The difference is this: ``and`` and ``or`` gauge the truth or falsehood of the *entire object*, while ``&`` and ``|`` refer to the *bits within each object*.

When you use ``and`` or ``or``, it's equivalent to asking Python to treat the object as a single Boolean entity.
In Python, all nonzero integers will evaluate as True. Thus:
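A brief sketch with two arbitrary integers, chosen only for illustration:

```python
# Truthiness of whole integers
truthy = bool(42)          # True: nonzero
falsy = bool(0)            # False: zero

# `and` / `or` treat each operand as a single Boolean entity
both = bool(42 and 0)      # False
either = bool(42 or 0)     # True
```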

When you use ``&`` and ``|`` on integers, the expression operates on the bits of the element, applying the *and* or the *or* to the individual bits making up the number:
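The binary representations make this visible. The integers 42 and 59 are arbitrary choices for illustration:

```python
a_bits = bin(42)         # '0b101010'
b_bits = bin(59)         # '0b111011'

and_bits = bin(42 & 59)  # '0b101010': a bit is set only where both inputs have it set
or_bits = bin(42 | 59)   # '0b111011': a bit is set where either input has it set
```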

Notice that the corresponding bits of the binary representation are compared in order to yield the result.

When you have an array of Boolean values in NumPy, this can be thought of as a string of bits where ``1 = True`` and ``0 = False``, and ``&`` and ``|`` operate on it much as they do on the integer bits above:
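For example, with two small Boolean arrays `A` and `B` (hypothetical values, built from 0/1 lists for readability):

```python
import numpy as np

# Hypothetical Boolean arrays, written as 0/1 and converted to bool
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)

either = A | B  # array([ True,  True,  True, False,  True,  True])
both = A & B    # array([ True, False,  True, False,  True, False])
```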

Using ``or`` on these arrays will try to evaluate the truth or falsehood of the entire array object, which is not a well-defined value:
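The failure can be reproduced as follows. The arrays are hypothetical, and the `try`/`except` is only there so the sketch runs to completion:

```python
import numpy as np

# Hypothetical Boolean arrays
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)

try:
    A or B  # asks for the truth value of the whole array A
    message = None
except ValueError as err:
    message = str(err)  # explains that the truth value is ambiguous
```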

Similarly, when doing a Boolean expression on a given array, you should use ``|`` or ``&`` rather than ``or`` or ``and``:
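A short sketch, assuming a simple range of integers for `x`:

```python
import numpy as np

x = np.arange(10)  # illustrative array: 0 through 9

# Element-wise "greater than 4 AND less than 8"
in_range = (x > 4) & (x < 8)
# array([False, False, False, False, False,  True,  True,  True, False, False])
```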

Trying to evaluate the truth or falsehood of the entire array will give the same ``ValueError`` we saw previously:
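The keyword version of a range check, on an assumed `x`, fails as soon as Python asks for the truth value of the first comparison:

```python
import numpy as np

x = np.arange(10)  # illustrative array: 0 through 9

try:
    (x > 4) and (x < 8)  # `and` needs bool(x > 4), which is ambiguous
    message = None
except ValueError as err:
    message = str(err)
```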

So remember this: ``and`` and ``or`` perform a single Boolean evaluation on an entire object, while ``&`` and ``|`` perform multiple Boolean evaluations on the content (the individual bits or bytes) of an object.
For Boolean NumPy arrays, the latter is nearly always the desired operation.

