# Boolean operations, Boolean masks, and Boolean combinations

The purpose of this assignment is to give you more practice using Boolean constraints on `numpy` arrays. It also introduces one new wrinkle in applying Boolean constraints to arrays: Boolean combinations of constraints.

To start off, we load the forest covertype data. (Note: this will take a little while because it downloads a large dataset from the web.) This is a multiclass dataset that has data for 7 different forest covertypes (stored in the `target` attribute). There are 581,012 forest plots with 54 attributes each (stored in a 581012 x 54 array). The first ten attributes are numerical; the last 44 are Boolean (true/false) attributes. Each of the Boolean attributes represents a qualitative attribute (a wilderness area or a soil type) which is either present or absent. We will refer to all the attributes by their column index. Thus the first attribute is attribute 0 and the last (a Boolean attribute) is attribute 53.

In [1]:
from sklearn.datasets import fetch_covtype
data = fetch_covtype()  # downloads the dataset on first use
print(data.data.shape)  # data.data is the 581012 x 54 array
# data.target contains the covertype for each instance; this is what we're trying to predict
print(data.target.shape)

(581012, 54)
(581012,)


[The Boolean arrays and masks notebook](https://github.com/gawron/python-for-social-science/blob/master/numpy/02_06_Boolean_Arrays_and_Masks.ipynb) discusses combining Boolean arrays with the Boolean operators `&` (conceptually 'and'), `|` (conceptually 'or'), and `~` (conceptually 'not'). Study the examples there, especially the examples used on the Seattle rainfall data. Pay special attention to the use of parentheses, because using them correctly matters in solving the following problems.
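As a quick reminder of why the parentheses matter, here is a minimal sketch on a made-up toy array (not the covertype data). The names `x`, `both`, `either`, `neither`, and `raised` are illustrative only.

```python
import numpy as np

# Hypothetical toy data, not the covertype dataset.
x = np.array([1, 2, 3, 4, 5, 6])

# Parenthesize each comparison before combining with & or |.
both = (x > 2) & (x < 5)        # elements strictly between 2 and 5
either = (x < 2) | (x > 5)      # elements below 2 or above 5
neither = ~((x < 2) | (x > 5))  # negation of the previous mask

print(both)
print(either)
print(neither)

# Without parentheses, & binds more tightly than the comparisons, so
# x > 2 & x < 5 is read as the chained comparison x > (2 & x) < 5,
# which raises ValueError for arrays with more than one element.
try:
    x > 2 & x < 5
    raised = False
except ValueError:
    raised = True
print(raised)
```

The same pattern, one parenthesized comparison per condition, carries over directly to the columns of `data.data` and to `data.target`.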

Each of the following problems concerns finding a particular set of rows in the covertypes dataset. For each set do two things: 
 
a. Construct a single Python expression which counts the number of rows in the set.
b. Construct a single Python expression which returns all the rows of the covertype dataset that are in the set.  (Note: not just the Boolean array, but the rows you get when you use the Boolean array as a mask.)
 
 1.  The rows in the covertype dataset which do not have either covertype 1 or covertype 2.
 2.  The rows which have attribute 10 (Note:  this is a qualitative wilderness area attribute).
 3.  The rows with elevation above 200.
 4.  The rows of covertype 1 or 2 with elevation above 200 which have attribute 10.
 5.  The rows of neither covertype 1 nor covertype 2 with elevation above 200 which have attribute 10.

The `sklearn` description of the covertype dataset is printed below. For a fuller understanding of the attributes in the dataset you can look at [the original UCI data set description](https://archive.ics.uci.edu/ml/datasets/Covertype), but that is not necessary to solve the problems.

The `data` object loaded above is a wrapper (a `Bunch`) whose attributes are listed below.

In [3]:
dir(data)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [4]:
print(data.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`

Example:  `data.data`  is a `numpy` array containing the attributes for the 581,012 samples.  


In [2]:
print(type(data.data))
print(data.data.shape)

<class 'numpy.ndarray'>
(581012, 54)



To compute the Boolean array that identifies the rows in which attribute 0 is greater than 3000, you do:

In [8]:
WW = data.data[:,0] > 3000
print(WW)
print(WW.shape, data.data.shape)

[False False False ... False False False]
(581012,) (581012, 54)


In [3]:
WW.sum()

286266

Note the Boolean array has the same number of rows as the entire dataset.  To find the rows that satisfy this condition, you use this Boolean array as a 
**mask**. That is, you do:

In [3]:
print(data.data[WW,:])
data.data[WW,:].shape

[[3008.   45.   14. ...    0.    0.    0.]
 [3073.  173.   12. ...    0.    0.    0.]
 [3067.  164.   11. ...    0.    0.    0.]
 ...
 [3125.  127.    5. ...    0.    0.    0.]
 [3126.  120.    4. ...    0.    0.    0.]
 [3124.  115.    5. ...    0.    0.    0.]]


(286266, 54)

Note the new array contains just the rows that satisfy condition `WW`, so it is smaller.
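The relationship between the count and the masked array can be checked on a small scale. The sketch below uses a made-up 4 x 3 array named `toy` as a stand-in for `data.data`; the values are hypothetical.

```python
import numpy as np

# Hypothetical 4 x 3 array standing in for data.data.
toy = np.array([[3100., 10., 0.],
                [2500., 20., 1.],
                [3300., 30., 1.],
                [2900., 40., 0.]])

mask = toy[:, 0] > 3000   # one Boolean per row
selected = toy[mask, :]   # keep only the rows where mask is True

print(mask.sum())         # number of True entries in the mask
print(selected.shape)     # (number of selected rows, number of columns)
```

The number of `True` values in the mask always equals the number of rows in the masked array, which is why `.sum()` answers the "count" half of each question below.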

The covertypes (or dominant tree species) for each forest plot are in `data.target`.  The `target` attribute is often the attribute used to store the classes in an `sklearn` classification dataset. There are seven classes.

In [5]:
print(data.target.shape)
print(set(data.target))

(581012,)
{1, 2, 3, 4, 5, 6, 7}


Finally, the fact that attribute 12 is qualitative means the only values in that column are 0 and 1.
Each plot either has or doesn't have attribute 12.

In [3]:
set(data.data[:,12])

{0.0, 1.0}

In [5]:
[bool(x) for x in set(data.data[:,12])]

[False, True]
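An alternative way to inspect a qualitative column is `np.unique`, which can also report how often each value occurs. This is only a sketch on a made-up column named `col`, not the actual attribute 12.

```python
import numpy as np

# Hypothetical 0/1 column standing in for data.data[:, 12].
col = np.array([0., 1., 1., 0., 1.])

# Sorted distinct values and how many times each one appears.
values, counts = np.unique(col, return_counts=True)
print(values)
print(counts)

# True when every entry is either 0 or 1.
is_binary = np.isin(col, [0, 1]).all()
print(is_binary)
```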

#### Questions 1a and 1b

Place your answer to questions 1a and 1b in the next cell.  Scroll down to see a valid answer.

In [8]:
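# 1a: rows whose covertype is neither 1 nor 2
# (equivalent here, since the covertypes run from 1 to 7: data.target > 2)
BC1 = ~((data.target == 1) | (data.target == 2))
print(BC1.sum())
# 1b: use the Boolean array as a mask to select those rows
data.data[BC1]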

85871


array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.595e+03, 4.500e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.386e+03, 1.590e+02, 1.700e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.384e+03, 1.700e+02, 1.500e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.383e+03, 1.650e+02, 1.300e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])
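The condition in question 1 can be written in more than one equivalent way. The sketch below compares three formulations on a made-up label array named `target`; `np.isin` is a standard `numpy` function, and the other names are illustrative.

```python
import numpy as np

# Hypothetical labels standing in for data.target.
target = np.array([1, 2, 3, 1, 7, 2])

# "Not (1 or 2)", as in the answer above.
not_either = ~((target == 1) | (target == 2))

# De Morgan's law: "not 1 and not 2" is the same condition.
de_morgan = (target != 1) & (target != 2)

# np.isin tests membership in a list of values in one call.
via_isin = ~np.isin(target, [1, 2])

print(np.array_equal(not_either, de_morgan))
print(np.array_equal(not_either, via_isin))
print(not_either.sum())
```

All three produce the same Boolean array, so the choice is mostly a matter of readability.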

#### Questions 2a and 2b

Place your answer to questions 2a and 2b in the next cell.

In [15]:
# 2a: rows that have attribute 10 (the column holds 0.0 or 1.0)
BC2 = (data.data[:,10] == 1.0)
print(BC2.sum())
# 2b
data.data[BC2]

260796


260796

In [18]:
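# Pitfall: an integer array is treated as a list of row indices, not as a mask,
# so this returns copies of rows 0 and 1 rather than the rows that have attribute 10.
# For a mask, use data.data[:,10].astype(bool) instead.
data.data[data.data[:,10].astype(int)]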

array([[2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])

#### Alternative answer to 2a and 2b:

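Recall that `BC2` is a Boolean condition, that is, an array consisting entirely of
Booleans (`True` or `False`).  You can also build the same mask with
the `.astype(bool)` method. Note that `.astype(int)` does not work as a mask: as the cell above shows, indexing with an integer array selects rows by position (here rows 0 and 1, repeated) instead of filtering.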
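A small sketch on made-up data shows how much the dtype of the indexing array matters. The array `toy` and the names `as_mask` and `as_index` are hypothetical.

```python
import numpy as np

# Hypothetical 4 x 2 array; column 1 plays the role of a 0/1 attribute.
toy = np.array([[10., 0.],
                [20., 1.],
                [30., 1.],
                [40., 0.]])
flag = toy[:, 1]

# Boolean array: acts as a mask and keeps rows where the flag is 1.
as_mask = toy[flag.astype(bool)]

# Integer array [0, 1, 1, 0]: acts as a list of row positions,
# so it returns rows 0, 1, 1, 0 in that order.
as_index = toy[flag.astype(int)]

print(as_mask)
print(as_index)
```

The mask version returns two rows; the integer version returns four rows, all copies of the first two.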

#### Questions 3a and 3b

Place your answer to questions 3a and 3b in the next cell.

In [21]:
# 3a: rows with elevation (attribute 0) above 200
BC3 = (data.data[:,0] > 200)
print(BC3.sum())
# 3b
data.data[BC3]

581012


array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.804e+03, 1.390e+02, 9.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [2.386e+03, 1.590e+02, 1.700e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.384e+03, 1.700e+02, 1.500e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.383e+03, 1.650e+02, 1.300e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])
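The count printed above equals the total number of rows, so every plot satisfies this condition. A mask that is true everywhere can be detected directly with `.all()`; the sketch below uses a few made-up elevation values in an array named `elev`.

```python
import numpy as np

# Hypothetical elevations standing in for data.data[:, 0].
elev = np.array([2596., 2590., 2804., 2386.])

mask = elev > 200

print(mask.all())               # True when every row satisfies the condition
print(mask.sum() == mask.size)  # the same check, expressed with counts
print(elev.min())               # the smallest value explains why
```

When a mask is true for every row, combining it with other masks using `&` does not change them.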

#### Questions 4a and 4b

Place your answer to questions 4a and 4b in the next cell.

In [22]:
# 4a: rows of covertype 1 or 2 with elevation above 200 which have attribute 10.
# BC1: not of covertype 1 or 2, so ~BC1 means "of covertype 1 or 2"
# BC2: has attribute 10
# BC3: elevation above 200
BC4 = ~BC1 & BC2 & BC3
print(BC4.sum())
# 4b
data.data[BC4]

251914


array([[2804.,  139.,    9., ...,    0.,    0.,    0.],
       [2785.,  155.,   18., ...,    0.,    0.,    0.],
       [2579.,  132.,    6., ...,    0.,    0.,    0.],
       ...,
       [3261.,   78.,   12., ...,    0.,    0.,    0.],
       [3254.,   72.,   10., ...,    0.,    0.,    0.],
       [3250.,   66.,    9., ...,    0.,    0.,    0.]])

#### Questions 5a and 5b

Place your answer to questions 5a and 5b in the next cell.

In [23]:
# 5a: rows of neither covertype 1 nor 2 with elevation above 200 which have attribute 10
BC5 = BC1 & BC2 & BC3
print(BC5.sum())
# 5b
data.data[BC5]

8882


array([[2.596e+03, 5.100e+01, 3.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.590e+03, 5.600e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.595e+03, 4.500e+01, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [3.236e+03, 1.310e+02, 2.000e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.234e+03, 1.510e+02, 1.900e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.228e+03, 1.400e+02, 1.900e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])
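One way to sanity-check answers 4 and 5 is to notice that they split the rows satisfying `BC2 & BC3` into two non-overlapping groups. In the outputs above, 251914 + 8882 = 260796, which matches `BC2.sum()` because `BC3` is true for every row. The sketch below reproduces that check on made-up arrays; all names and values are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for data.target, elevation, and attribute 10.
target = np.array([1, 2, 3, 1, 7, 2, 5])
elev = np.array([2500., 150., 3000., 2800., 2600., 2900., 100.])
attr = np.array([1., 1., 1., 0., 1., 1., 1.])

c1 = ~np.isin(target, [1, 2])  # neither covertype 1 nor 2
c2 = attr == 1.0               # has the attribute
c3 = elev > 200                # elevation above 200

c4 = ~c1 & c2 & c3             # covertype 1 or 2, plus both other conditions
c5 = c1 & c2 & c3              # neither covertype, plus both other conditions

# The two sets never overlap, and together they cover c2 & c3.
print((c4 & c5).any())
print(c4.sum() + c5.sum() == (c2 & c3).sum())
```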