# Boolean operations, Boolean masks, and Boolean combinations

In this exercise set, we use the forest covertype data. (Note: loading it will take a little while because it downloads a large dataset from the web.) This is a multiclass dataset covering 7 different forest covertypes (stored in the `target` attribute). There are 581,012 forest plots with 54 attributes each (stored in a 581012 x 54 array). The first ten attributes are numerical; the last 44 are Boolean (true/false) attributes. Each Boolean attribute represents a qualitative attribute (a wilderness area or a soil type) which is either present or absent. We will refer to all the attributes by their column index. Thus the first attribute is attribute 0 and the last (a Boolean attribute) is attribute 53.

In [1]:
#from sklearn.datasets import load_wine
#wdata = load_wine()
from sklearn.datasets import fetch_covtype
data = fetch_covtype()
print(data.data.shape)  # data.data is the 581012 x 54 array
print(data.target.shape)     # data.target contains the class for each instance

(581012, 54)
(581012,)


[The Boolean arrays and masks notebook](https://github.com/gawron/python-for-social-science/blob/master/numpy/02_06_Boolean_Arrays_and_Masks.ipynb) discusses combining Boolean arrays with Boolean operators `&` (conceptually 'and') and `|` (conceptually 'or') and  `~` (conceptually 'not').  Study the examples there, especially the examples used on the Seattle rainfall data.  Pay special attention to the use of parentheses, because using them correctly matters in solving the following problems.
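The role of parentheses can be seen on a small made-up array (the values below are hypothetical, not drawn from the covertype data). Because `&` and `|` bind more tightly than comparison operators such as `>` and `==`, each comparison needs its own parentheses. A minimal sketch:

```python
import numpy as np

# Toy data (hypothetical values, not from the covertype dataset).
x = np.array([1, 5, 8, 3, 9])
y = np.array([0, 1, 1, 0, 0])

# Each comparison is wrapped in parentheses before combining.
both = (x > 4) & (y == 1)          # 'and'
either = (x > 4) | (y == 1)        # 'or'
neither = ~((x > 4) | (y == 1))    # 'not'

print(both)
print(either)
print(neither)

# Without parentheses, `x > 4 & y == 1` is parsed as the chained
# comparison `x > (4 & y) == 1`, which raises a ValueError for arrays.
try:
    x > 4 & y == 1
    raised = False
except ValueError:
    raised = True
print(raised)
```

Leaving out the parentheses does not silently give a wrong answer here; it fails, which is usually the first sign that a pair of parentheses is missing.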

Each of the following problems concerns finding a particular set of rows in the covertypes dataset. For each set do two things: 
 
a. Construct a single Python expression which counts the number of rows in the set.
b. Construct a single Python expression which returns all the rows of the covertype dataset that are in the set.  (Note: not just the Boolean array, but the rows you get when you use the Boolean array as a mask).
 
 1.  The rows in the covertype dataset which have a value greater than 300 for attribute 1.
 2.  The rows which have attribute 12 (Note:  this is one of the qualitative Boolean attributes).
 3.  The entire covertypes dataset excluding rows which belong either to covertype 3 or covertype 5.  (Note: you will have to use two arrays `data.data` and `data.target`).
 4. The rows which have attribute 12 but do not have either covertype 3 or covertype 5.
 5. The rows which are not in covertype 3 or 5, have a value greater than 300 for attribute 1, and have attribute 12.

The `sklearn` description of the covertype dataset is printed below. For a fuller understanding of the attributes in the dataset you can look at [the original UCI data set description](https://archive.ics.uci.edu/ml/datasets/Covertype), but that is not necessary to solve the problems.

The `data` object loaded above is a wrapper with several attributes; this exercise uses three of them: `data`, `target`, and `DESCR`.

In [3]:
dir(data)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [4]:
print(data.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`

Example:  `data.data`  is a `numpy` array containing the attributes for the 581,012 samples.  


In [14]:
print(type(data.data))
print(data.data.shape)

<class 'numpy.ndarray'>
(581012, 54)



To compute the Boolean array that identifies the rows in which attribute 0 is greater than 3000, you do:

In [8]:
WW = data.data[:,0] > 3000
print(WW)
print(WW.shape, data.data.shape)

[False False False ... False False False]
(581012,) (581012, 54)


In [3]:
WW.sum()

286266

Note the Boolean array has the same number of rows as the entire dataset.  To find the rows that satisfy this condition, you use this Boolean array as a 
**mask**. That is, you do:

In [3]:
print(data.data[WW,:])
data.data[WW,:].shape

[[3008.   45.   14. ...    0.    0.    0.]
 [3073.  173.   12. ...    0.    0.    0.]
 [3067.  164.   11. ...    0.    0.    0.]
 ...
 [3125.  127.    5. ...    0.    0.    0.]
 [3126.  120.    4. ...    0.    0.    0.]
 [3124.  115.    5. ...    0.    0.    0.]]


(286266, 54)

Note the new array contains just the rows that satisfy condition `WW`, so it is smaller.
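The relationship between the count and the masked rows can be checked on a toy array. This is only a sketch with hypothetical values; the names `X`, `mask`, and `selected` are illustrative and do not come from the notebook.

```python
import numpy as np

# Toy 4 x 3 array (hypothetical values).
X = np.array([[10, 1, 0],
              [40, 0, 1],
              [25, 1, 1],
              [50, 0, 0]])

mask = X[:, 0] > 20      # one Boolean per row of X
selected = X[mask, :]    # keep only the rows where mask is True

print(mask.sum())        # number of True entries in the mask
print(selected.shape)    # (number of True entries, number of columns)
```

The number of rows in the masked array always equals the sum of the mask, which gives a quick sanity check for each part a/part b pair.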

The covertypes (the dominant tree species) for each forest plot are in `data.target`.  The `target` attribute is often the attribute used to store the classes in an `sklearn` classification dataset. There are seven classes.

In [5]:
print(data.target.shape)
print(set(data.target))

(581012,)
{1, 2, 3, 4, 5, 6, 7}
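Beyond listing the classes, it can help to know how many rows fall in each one, since that gives an independent check on counts that exclude particular classes. A possible sketch using `np.unique` with `return_counts=True`, on hypothetical labels rather than the real `data.target`:

```python
import numpy as np

# Toy class labels (hypothetical, not the real data.target).
target = np.array([1, 2, 2, 3, 5, 2, 1])

classes, counts = np.unique(target, return_counts=True)
for c, n in zip(classes, counts):
    print(c, n)

# Rows outside classes 3 and 5 = total rows minus the sizes of those classes.
excluded = counts[(classes == 3) | (classes == 5)].sum()
remaining = target.size - excluded
print(remaining)
```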


Place your answer to questions 1a and 1b in the next cell.

In [12]:
#The rows in the covertype dataset which have a value greater than 300 for attribute 1.
BC1 = data.data[:,1] > 300
# a.
print(BC1.sum())
# b.
data.data[BC1,:]

102696


array([[2.686e+03, 3.540e+02, 1.200e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.699e+03, 3.470e+02, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.570e+03, 3.460e+02, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.619e+03, 3.360e+02, 1.300e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.630e+03, 3.170e+02, 1.200e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.633e+03, 3.090e+02, 9.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])

Place your answer to questions 2a and 2b in the next cell.

In [13]:
#The rows which have attribute 12 (Note: this is a qualitative soil attribute).
BC2 = (data.data[:,12] == 1.0)
BC2.sum()

253364

In [14]:
data.data[BC2]

array([[2621.,  162.,   13., ...,    0.,    0.,    0.],
       [2664.,  112.,    5., ...,    0.,    0.,    0.],
       [2633.,   68.,    8., ...,    0.,    0.,    0.],
       ...,
       [2386.,  159.,   17., ...,    0.,    0.,    0.],
       [2384.,  170.,   15., ...,    0.,    0.,    0.],
       [2383.,  165.,   13., ...,    0.,    0.,    0.]])

Alternative answer:

Recall that `BC2` is a Boolean condition, that is, an array consisting entirely of 
Booleans (`True` or `False`).  

In [15]:
BC2[1610:1640]

array([False, False, False, False, False, False, False, False, False,
       False,  True, False,  True,  True,  True, False, False,  True,
        True, False, False, False, False, False, False, False, False,
       False,  True,  True])

Instead of using `== 1.0` to create an array of Booleans,
we can convert the ones and zeros in column 12 to integers and let them
stand in for Boolean values.  Column 12 converted to ints
behaves like a Boolean array when summed or combined with `&`.

In [16]:
BC2_alt = data.data[:,12].astype(int)
BC2_alt[1610:1640]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1])

First consequence:  Summing the values gives us the same result -- of type `int` -- as summing the
Booleans did, as you might expect.

In [17]:
BC2_alt.sum()

253364

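We can also use `BC2_alt` to index `data.data`, but the result is not the masked array from before. An integer array is read as a list of row positions, so every 0 selects row 0 and every 1 selects row 1, giving 581,012 rows rather than 253,364. To mask with it, convert it first: `data.data[BC2_alt.astype(bool)]`.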

In [18]:
data.data[BC2_alt]

array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])

Notice we can't do that without first converting the column to integers.

In [19]:
data.data[data.data[:,12] ]

IndexError: arrays used as indices must be of integer (or boolean) type
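The difference between indexing with an integer array and masking with a Boolean array is easy to miss on a large dataset, so here is a small sketch on a toy array (hypothetical values; the names `flags`, `by_position`, and `by_mask` are illustrative):

```python
import numpy as np

# Toy array: column 1 holds 0.0/1.0 flags stored as floats.
X = np.array([[10., 0.],
              [20., 1.],
              [30., 1.],
              [40., 0.]])

flags = X[:, 1].astype(int)        # array([0, 1, 1, 0])

# Integer indexing: each value is a row position, so this picks
# rows 0, 1, 1, 0 and returns as many rows as there are flags.
by_position = X[flags]

# Boolean masking: True keeps a row, False drops it.
by_mask = X[flags.astype(bool)]

print(by_position.shape)
print(by_mask.shape)
```

Only the Boolean version returns the rows in which the flag is set; the integer version returns repeated copies of the first two rows.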

#### Questions 3a and 3b

Place your answer to questions 3a and 3b in the next cell.

In [20]:
# The entire covertypes dataset excluding rows which belong either to covertype 3
# or covertype 5. (Note: you will have to use two arrays data.data and data.target).

BC3 = ~((data.target==3) | (data.target==5))
# a.
BC3.sum()

535765

Now we apply the Boolean array constraining `data.target` to `data.data`.  This is okay
because `data.target` and `data.data` are synched: `data.target[i]` tells us something about
`data.data[i]`.

In [21]:
data.data[BC3]

array([[2804.,  139.,    9., ...,    0.,    0.,    0.],
       [2785.,  155.,   18., ...,    0.,    0.,    0.],
       [2579.,  132.,    6., ...,    0.,    0.,    0.],
       ...,
       [2579.,   48.,   21., ...,    0.,    0.,    0.],
       [2571.,   55.,   21., ...,    0.,    0.,    0.],
       [2638.,  147.,   13., ...,    0.,    0.,    0.]])

Alternatively:

In [33]:
# a != b  is how you say a does not equal b
BC3a = (data.target!=3) & (data.target!=5)
# a.
BC3a.sum()

535765
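The two formulations agree because negating an "or" is the same as taking the "and" of the negations. A third option, offered here only as a suggestion, is `np.isin`, which scales better when more classes are excluded. The sketch below compares all three on hypothetical labels:

```python
import numpy as np

# Toy class labels (hypothetical).
target = np.array([1, 3, 2, 5, 7, 3])

not_or = ~((target == 3) | (target == 5))    # negate the 'or'
and_not = (target != 3) & (target != 5)      # 'and' of the negations
not_isin = ~np.isin(target, [3, 5])          # membership test, then negate

print(not_or)
print(np.array_equal(not_or, and_not))
print(np.array_equal(not_or, not_isin))
```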

Place your answer to questions 4a and 4b in the next cell.

In [22]:
# The rows which have attribute 12 but do not have either covertype 3 or covertype 5.
BC4 = (BC2 & BC3)

BC4.sum()

233352

Alternative answer.

This completes the demonstration that, simply by converting the ones and zeros
in column 12 to ints, we can use column 12 like a Boolean array (called `BC2_alt` in
the cell below).

We demonstrate that `BC2_alt` can successfully enter into Boolean
combinations with other Boolean arrays.

In [23]:
#BC2 = (data.data[:,12] == 1.0)
BC2_alt = data.data[:,12].astype(int)
BC4_alt =  (BC2_alt & BC3)

In [24]:
BC4_alt.sum()

233352

In [25]:
data.data[BC4]

array([[2635.,   45.,    8., ...,    0.,    0.,    0.],
       [2650.,   26.,   19., ...,    0.,    0.,    0.],
       [2644.,    9.,   32., ...,    0.,    0.,    0.],
       ...,
       [2579.,   48.,   21., ...,    0.,    0.,    0.],
       [2571.,   55.,   21., ...,    0.,    0.,    0.],
       [2638.,  147.,   13., ...,    0.,    0.,    0.]])

Note that `int`s can be combined with bitwise operators like `&`, while `float`s can't.

Ints can!

In [31]:
1&0

0

Floats can't!

In [29]:
1.0 & 0.0

TypeError: unsupported operand type(s) for &: 'float' and 'float'

And a very similar `TypeError` carries over to a NumPy array containing floats:

In [30]:
 data.data[:,12] & BC3

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

So the `.astype(int)` conversion used to create `BC2_alt` was necessary in order to combine the binary
column with other Boolean arrays.
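Converting to `int` is one way around the error; converting to `bool` is another, and it has the advantage that the combined result is itself a Boolean array. A short sketch on hypothetical data (the names `col` and `other` are illustrative):

```python
import numpy as np

col = np.array([1.0, 0.0, 1.0, 1.0])            # float flags, like column 12
other = np.array([True, True, False, True])     # some other Boolean condition

# Floats cannot be combined with `&`.
try:
    col & other
    failed = False
except TypeError:
    failed = True
print(failed)

combined_int = col.astype(int) & other     # works; the result holds 0/1 integers
combined_bool = col.astype(bool) & other   # works; the result holds Booleans

print(combined_int)
print(combined_bool)
```

Both results give the same count when summed, but only `combined_bool` can be used directly as a mask.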

Place your answer to questions 5a and 5b in the next cell.

In [31]:
# The rows which have a value greater than 300 for attribute 1,  BC1
#     which  have attribute 12.                         BC2  --- |
#                                                                |- BC4
# and are not in covertype 3 or 5,                      BC3  --- |             

BC5 = BC1 & BC4

# Answer to 5a.
BC5.sum()

43815

Answer to 5b:

In [41]:
data.data[BC5]

array([[2780.,  346.,   13., ...,    0.,    0.,    0.],
       [2725.,  353.,   19., ...,    0.,    0.,    0.],
       [2747.,  329.,   19., ...,    0.,    0.,    0.],
       ...,
       [2619.,  336.,   13., ...,    0.,    0.,    0.],
       [2630.,  317.,   12., ...,    0.,    0.,    0.],
       [2633.,  309.,    9., ...,    0.,    0.,    0.]])

Or spelling it all out:

In [29]:
# All three conditions (BC1, BC2, BC3) combined in one expression
BC5a = BC1 & BC2 & BC3

In [30]:
BC5a.sum()

43815

In [32]:
data.data[BC5a]

array([[2780.,  346.,   13., ...,    0.,    0.,    0.],
       [2725.,  353.,   19., ...,    0.,    0.,    0.],
       [2747.,  329.,   19., ...,    0.,    0.,    0.],
       ...,
       [2619.,  336.,   13., ...,    0.,    0.,    0.],
       [2630.,  317.,   12., ...,    0.,    0.,    0.],
       [2633.,  309.,    9., ...,    0.,    0.,    0.]])
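Spelling it out fully would mean writing each condition inline rather than through the named arrays. The sketch below does that on a toy array; the data are hypothetical, and columns 1 and 2 of `X` stand in for attributes 1 and 12 of the real dataset.

```python
import numpy as np

# Toy feature matrix and class labels (hypothetical values).
X = np.array([[2700., 350., 1.],
              [2600., 120., 1.],
              [2800., 310., 0.],
              [2500., 330., 1.],
              [2900., 305., 1.]])
target = np.array([1, 2, 2, 3, 7])

# One expression: value > 300, flag present, and class not 3 or 5.
full = (X[:, 1] > 300) & (X[:, 2] == 1.0) & ~((target == 3) | (target == 5))

count = full.sum()   # part a: how many rows are in the set
rows = X[full]       # part b: the rows themselves

print(count)
print(rows)
```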