# 8. Boolean Indexing - DataFrames

### Objectives

+ Boolean Indexing or Boolean Selection is the selection of a subset of a Series/DataFrame based on the values themselves and not the row/column labels or integer location
+ Boolean means **`True`** or **`False`**
+ To do boolean selection, you first create a sequence of True/False values and pass it to a DataFrame/Series indexer. Each row of data is kept or discarded
+ The indexing operators are overloaded: they change functionality depending on what is passed to them
+ Typically, you will first create a boolean Series with one of the 6 comparison operators
+ You will pass this boolean series to one of the indexers to make your selection
+ Use the **`isin`** method to test for multiple equalities in the same column
+ You can create complex criteria with the and (**`&`**), or (**`|`**), and not (**`~`**) logical operators
+ When you have multiple conditions in a single line, you must wrap each expression in parentheses
+ If you have complex criteria, think about assigning each set of criteria into its own variable (i.e. don't do everything in one line)
+ If you are only selecting rows, then you will almost always use just the brackets
+ If you are simultaneously doing boolean selection on the rows and selecting column labels then you will use **`.loc`**
+ You will almost never use .iloc to do boolean selection
+ Boolean selection works the same for Series as it does for DataFrames


# Boolean Indexing
Boolean indexing, also referred to as **Boolean Selection**, is the process of selecting subsets of rows from DataFrames (or Series) based on the actual values and NOT by their labels or integer locations.

# Examples of Boolean Indexing

Before diving into Pandas, let's see some examples of actual questions (in plain English) that boolean indexing can help us answer from the bikes dataset.

+ Find all male riders
+ Find all rides with duration longer than 2 hours
+ Find all rides that took place between March and June of 2015.
+ Find all the rides that had a duration longer than 2 hours by females with temperature higher than 90 degrees

The term **query** is used to refer to these sorts of questions.

### All queries have criteria
Each of the above queries has strict logical criteria that must be checked one row at a time.

### Keep or Discard entire row of data
If you were to manually answer the above queries, you would need to scan each row and determine whether the row as a whole meets the criterion or not. If the row meets the criteria, then it is kept and if not, then it is discarded.

### Each row will have a True or False value associated with it
When you perform boolean indexing, each row of the DataFrame (or value of a Series) will have a True or False value associated with it depending on whether or not it meets the criterion. True/False values are known as boolean. The documentation refers to the entire procedure as boolean indexing.

Since we are using the booleans to select data, it is sometimes referred to as boolean selection. We are using booleans to select subsets of data.

### Beginning with a small DataFrame
We will perform our first boolean indexing on a dataset of 5 rows. Let's assign the head of the bikes dataset to its own variable.

In [None]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes_head = bikes.head()
bikes_head

### Create a criteria with a list
We will manually create a list of 5 boolean values.

In [None]:
criteria = [False, True, False, False, True]

### Pass this list into just the brackets
The above list has a True in both the second and fifth rows. These will be the rows that are kept during boolean indexing. To formally do boolean indexing, we place the list inside the brackets.

In [None]:
bikes_head[criteria]

## Wait a second… Isn’t `[ ]` just for column selection?

The primary purpose of *just the brackets* for a DataFrame is to select one or more columns by using either a string or a list of strings. Now, all of a sudden, this example is showing that entire rows are selected with boolean values. This is what makes Pandas, unfortunately, a confusing library to use.

## Operator Overloading
*Just the brackets* is **overloaded**. This means that, depending on the inputs, Pandas will do something completely different. Here are the rules for the different objects you pass to the brackets.

* string — return a column as a Series
* list of strings — return all those columns as a DataFrame
* a slice — select rows (can do both label and integer location — confusing!)
* a sequence of booleans — select all rows where True

In summary, the brackets primarily select columns, but if you pass them a sequence of booleans, they select all rows where the value is True.
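
The rules for the brackets can be sketched on a small, made-up DataFrame. The column names and values below are hypothetical and are not taken from the bikes dataset; the point is only to show how the type of the input changes what the brackets return.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the overloaded brackets
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'value': [10, 20, 30, 40]})

col = df['name']                        # string -> one column as a Series
cols = df[['name', 'value']]            # list of strings -> DataFrame
first_two = df[:2]                      # slice -> rows
kept = df[[True, False, False, True]]   # booleans -> rows where True

print(type(col).__name__)               # Series
print(type(cols).__name__)              # DataFrame
print(first_two.index.tolist())         # [0, 1]
print(kept.index.tolist())              # [0, 3]
```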

## Using booleans in a Series and not a list
Instead of using a list to contain our booleans, we can store them in a Series. This produces the same output. Below, we use the Series constructor to create a Series object.

In [None]:
s = pd.Series([False, True, False, False, True])
s

### Use the boolean Series to do the boolean selection
Placing the Series directly in the brackets will again select only the rows which have True values in the Series.

In [None]:
bikes_head[s]

# Practical Boolean Selection
We will almost never create boolean lists/Series manually like we did above but instead use the actual data to create them.

## Creating Boolean Series from Column Data
By far the most common way to create a boolean Series will be from the values of one particular column. We will test a condition using one of the six comparison operators:

* `<`
* `<=`
* `>`
* `>=`
* `==`
* `!=`
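
As a quick illustration, each of the six operators can be applied to a tiny Series. The values below are invented for the example; every comparison returns a boolean Series with the same index as the original.

```python
import pandas as pd

# Hypothetical durations in seconds (not from the bikes dataset)
duration = pd.Series([500, 1000, 1500])

# Each comparison is applied to every value and returns a boolean Series
results = pd.DataFrame({
    '<':  duration < 1000,
    '<=': duration <= 1000,
    '>':  duration > 1000,
    '>=': duration >= 1000,
    '==': duration == 1000,
    '!=': duration != 1000,
})
print(results)
```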


## Create a Boolean Series
Let's create a boolean Series by determining which rows have a trip duration of over 1000 seconds.

In [None]:
criteria = bikes['tripduration'] > 1000
criteria.head(10)

### Manually verify correctness
Let's output the head of the trip duration Series to manually verify that integer locations 2 and 8 are indeed the ones greater than 1000.

In [None]:
bikes['tripduration'].head(10)

## Complete our boolean indexing
We created our boolean Series, **`criteria`**, using the greater than comparison operator on the **`tripduration`** column. We can now pass this result into the brackets to filter the entire DataFrame. Verify that all **`tripduration`** values are greater than 1000.

In [None]:
bikes[criteria].head(10)

### How many rows have a trip duration greater than 1000?
To answer this question, let's assign the result of the boolean selection to a variable and then retrieve the **`shape`** of the DataFrame.

In [None]:
bikes.shape

In [None]:
bikes_duration_1000 = bikes[criteria]
bikes_duration_1000.shape

About 20% of the rides are longer than 1000 seconds.
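
Because True counts as 1 and False as 0 in arithmetic, the count and the fraction of matching rows can also be read directly off the boolean Series. The following is a minimal sketch on made-up values rather than the bikes data.

```python
import pandas as pd

# Hypothetical durations in seconds (not from the bikes dataset)
duration = pd.Series([300, 1200, 800, 2500, 950])
criteria = duration > 1000

n_matching = int(criteria.sum())     # number of True values
fraction = float(criteria.mean())    # share of True values

print(n_matching)   # 2
print(fraction)     # 0.4
```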

# Boolean selection in one line
Often, you will see boolean selection happen in a single line of code instead of the multiple lines we used above. Put the expression with comparison operator directly inside the brackets.

In [None]:
bikes[bikes['tripduration'] > 1000].head()

If that is confusing for you, then I recommend storing your boolean Series to a variable like we did with **`criteria`** above.

## Single condition expression
Our first example tested a single condition (whether the trip duration was greater than 1,000 seconds). Let’s test a different single condition and look for all the rides that happened when the weather was cloudy.

We use the == operator to test for equality and assign this result to the variable criteria. Again, we pass this variable to the brackets which completes our selection.

In [None]:
criteria = bikes['events'] == 'cloudy'
bikes[criteria].head()

## Multiple condition expression
So far, our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators and, or and not.

## Use `&`, `|` , `~`
Although Python provides the keywords `and`, `or`, and `not`, they do not operate element by element and will raise an error when used to combine boolean Series in Pandas.

You must use the following operators with pandas:

* **`&`** for and
* **`|`** for or
* **`~`** for not
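
A short sketch of the difference, using two hand-made boolean Series (the values are arbitrary): the keyword `and` raises a `ValueError`, while the three operators work element by element.

```python
import pandas as pd

a = pd.Series([True, False, True])
b = pd.Series([True, True, False])

message = None
try:
    a and b                 # Python keyword: asks for a single True/False
except ValueError as err:
    message = str(err)
    print('ValueError:', message)

both = a & b                # element-wise and
either = a | b              # element-wise or
negated = ~a                # element-wise not

print(both.tolist())        # [True, False, False]
print(either.tolist())      # [True, True, True]
print(negated.tolist())     # [False, True, False]
```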

## Our first multiple condition expression
Let’s find all the rides that were longer than 1,000 seconds and happened when it was cloudy. We assign each condition to separate variables and then apply the **and** operator to them.

In [None]:
criteria_1 = bikes['tripduration'] > 1000
criteria_2 = bikes['events'] == 'cloudy'
criteria_all = criteria_1 & criteria_2

In [None]:
bikes[criteria_all].head()

## Multiple conditions in one line
It is possible to combine the entire expression into a single line. Many pandas users like doing this, others hate it. Regardless, it is a good idea to know how to do so as you will definitely encounter it.

## Use parentheses to separate conditions
You must encapsulate each condition in a set of parentheses in order to make this work.

Each condition will be separated like this:

```

(bikes['tripduration'] > 1000) & (bikes['events'] == 'cloudy')

```
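
The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python. Here is a minimal sketch of what goes wrong without them, using a hypothetical two-column DataFrame with numeric values:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'a': [1, 5, 9], 'b': [2, 4, 6]})

# With parentheses: each comparison runs first, then the results are combined
good = (df['a'] > 2) & (df['b'] < 6)
print(good.tolist())        # [False, True, False]

# Without parentheses, Python evaluates 2 & df['b'] first, and the
# remaining chained comparison raises an error
try:
    bad = df['a'] > 2 & df['b'] < 6
    failed = False
except (ValueError, TypeError):
    failed = True

print(failed)               # True
```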

## Same results
We can then drop this expression inside of just the indexing operator to get the same results:

In [None]:
bikes[(bikes['tripduration'] > 1000) & (bikes['events'] == 'cloudy')].head()

## Using an or condition
Let's find all the rides that were done by females or had trip durations longer than 1,000 seconds.

For the or condition, we use the pipe character **`|`**

In [None]:
bikes[(bikes['tripduration'] > 1000) | (bikes['gender'] == 'Female')].head()

## Reversing a condition with the not operator
The tilde character, **`~`**, represents the not operator and reverses a condition.  For instance, if we wanted all the rides with trip duration less than or equal to 1000, we could do it like this (notice the parentheses around the criteria):

In [None]:
bikes[~(bikes['tripduration'] > 1000)].head()

Of course, reversing single conditions is pretty pointless as we can simply use the less than or equal to operator instead like this:

In [None]:
bikes[bikes['tripduration'] <= 1000].head()

### Reverse a more complex condition
Typically, we will save the not operator for reversing more complex conditions. Let's reverse the condition for selecting rides by females or those with duration over 1,000 seconds.

Notice that there are parentheses around the entire expression. Logically, this should return only rides by riders who are not female and whose duration is 1,000 seconds or less.

In [None]:
criteria = ~((bikes['tripduration'] > 1000) | (bikes['gender'] == 'Female'))
bikes[criteria].head()
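
One way to check the logic of a reversed **or** is to compare it with the equivalent form in which each condition is reversed and the two are joined with **and**. The sketch below uses a hypothetical four-row DataFrame that borrows the column names `tripduration` and `gender`; the values are invented.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the bikes dataset
df = pd.DataFrame({'tripduration': [500, 1500, 800, 2000],
                   'gender': ['Male', 'Male', 'Female', 'Female']})

long_ride = df['tripduration'] > 1000
female = df['gender'] == 'Female'

reversed_or = ~(long_ride | female)     # not (long or female)
both_negated = ~long_ride & ~female     # (not long) and (not female)

print(reversed_or.tolist())             # [True, False, False, False]
print(reversed_or.equals(both_negated)) # True
```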

## Even more complex conditions
It is possible to build extremely complex conditions to select rows of your DataFrame that meet very specific criteria. For instance, we can select male riders with trip duration between 1,000 and 2,000 seconds along with female riders with trip duration between 5,000 and 10,000 seconds.

With multiple conditions, it's probably best to break out the logic into multiple steps:

In [None]:
criteria_1 = (bikes['gender'] == 'Male') & (bikes['tripduration'] >= 1000) & (bikes['tripduration'] <= 2000)
criteria_2 = (bikes['gender'] == 'Female') & (bikes['tripduration'] >= 5000) & (bikes['tripduration'] <= 10000)
criteria_all = criteria_1 | criteria_2

In [None]:
bikes[criteria_all].head(10)

## Lots of equality conditions in a single column - use `isin`
Occasionally, we will want to test equality in a single column with multiple values. This is most common in string columns. For instance, let’s say we wanted to find all the rides where the events were either rain, snow, tstorms or sleet.

One way to do this would be with four conditions joined by the or operator.

In [None]:
criteria = ((bikes['events'] == 'rain') | 
            (bikes['events'] == 'snow') | 
            (bikes['events'] == 'tstorms') | 
            (bikes['events'] == 'sleet'))
bikes[criteria].head()

Instead we can call the **`isin`** method and pass a list of all the acceptable values:

In [None]:
criteria = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
bikes[criteria].head()
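
To confirm that the two approaches agree, here is a small sketch on a made-up Series of event labels (the values are hypothetical, not read from the bikes file):

```python
import pandas as pd

# Hypothetical event labels
events = pd.Series(['rain', 'cloudy', 'snow', 'clear', 'sleet'])

with_or = ((events == 'rain') |
           (events == 'snow') |
           (events == 'tstorms') |
           (events == 'sleet'))
with_isin = events.isin(['rain', 'snow', 'tstorms', 'sleet'])

print(with_isin.tolist())          # [True, False, True, False, True]
print(with_or.equals(with_isin))   # True
```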

## Combining isin with other criteria
You can use the resulting boolean Series from the isin method in the same way you would from the logical operators. For instance, if we wanted to find all the rides that had one of the same events and had a duration greater than 2,000 seconds, we would do the following:

In [None]:
criteria_1 = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
criteria_2 = bikes['tripduration'] > 2000
criteria_all = criteria_1 & criteria_2
bikes[criteria_all].head()

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as `actor1`. How many of these movies has he starred in?</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Select all movies from the 1970s.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Select all movies from the 1970s that had IMDB scores greater than 8</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Select movies that were rated either R, PG-13, or PG.</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Select movies that are either rated PG-13 or were made after 2010.</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Find all the movies that have at least one of the three actors with more than 10,000 Facebook likes.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Reverse the condition from problem 6. Use one line of code. In words, what have you selected?</span>

In [None]:
# your code here