# Boolean Indexing - DataFrames

### Objectives

+ Boolean Indexing or Boolean Selection is the selection of a subset of a Series/DataFrame based on the values themselves and not the row/column labels or integer location
+ Boolean means **`True`** or **`False`**
+ To do boolean selection, you first create a sequence of True/False values and pass it to a DataFrame/Series indexer; each row of data is then kept or discarded
+ The indexing operators are overloaded: they change functionality depending on what is passed to them
+ Typically, you will first create a boolean Series with one of the 6 comparison operators
+ You will pass this boolean series to one of the indexers to make your selection
+ Use the **`isin`** method to test for multiple equalities in the same column
+ You can create complex criteria with the and (**`&`**), or (**`|`**), and not (**`~`**) logical operators
+ When you have multiple conditions in a single line, you must wrap each expression in parentheses
+ If you have complex criteria, think about storing each set of criteria into its own variable (i.e. don’t do everything in one line)
+ If you are only selecting rows, then you will almost always use just the brackets
+ If you are simultaneously doing boolean selection on the rows and selecting column labels then you will use **`.loc`**
+ You will almost never use **`.iloc`** to do boolean selection
+ Boolean selection works the same for Series as it does for DataFrames


# Boolean Indexing
Boolean indexing, also referred to as **Boolean Selection**, is the process of selecting subsets of rows from DataFrames (or Series) based on the actual values and NOT by their labels or integer locations.

# Examples of Boolean Indexing

Before diving into Pandas, let's see some examples of actual questions (in plain English) that boolean indexing can help us answer from the bikes dataset.

+ Find all male riders
+ Find all rides with duration longer than 2 hours
+ Find all rides that took place between March and June of 2015
+ Find all the rides that had a duration longer than 2 hours by females with temperature higher than 90 degrees

The term **query** is used to refer to these sorts of questions.


### All queries have criteria
Each of the above queries has strict logical criteria that must be checked one row at a time.

### Keep or Discard entire row of data
If you were to manually answer the above queries, you would need to scan each row and determine whether the row as a whole meets the criteria or not. If the row meets the criteria, it is kept; if not, it is discarded.

### Each row will have a True or False value associated with it
When you perform boolean indexing, each row of the DataFrame (or value of a Series) will have a True or False value associated with it depending on whether or not it meets the criteria. True/False values are known as booleans. The documentation refers to the entire procedure as boolean indexing.

Since we are using the booleans to select subsets of data, it is sometimes referred to as boolean selection.

### Beginning with a small DataFrame
We will perform our first boolean indexing on a dataset of 5 rows. Let's assign the head of the bikes dataset to its own variable.

In [6]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes_head = bikes.head()
bikes_head

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


### Create a criteria with a list
We will manually create a list of 5 boolean values.

In [7]:
criteria = [False, True, False, False, True]

### Pass this list into just the brackets
The above list has a True in both the second and fifth positions. These will be the rows that are kept during boolean indexing. To formally do boolean indexing, we place the list inside the brackets.

In [9]:
bikes_head[criteria]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


## Wait a second… Isn’t `[ ]` just for column selection?

The primary purpose of *just the brackets* for a DataFrame is to select one or more columns by using either a string or a list of strings. Now, all of a sudden, this example is showing that entire rows are selected with boolean values. This is what makes Pandas, unfortunately, a confusing library to use.

## Operator Overloading
*Just the brackets* is **overloaded**. This means that, depending on the inputs, Pandas will do something completely different. Here are the rules for the different objects you pass to the brackets.

* string — return a column as a Series
* list of strings — return all those columns as a DataFrame
* a slice — select rows (can do both label and integer location — confusing!)
* a sequence of booleans — select all rows where True

In summary, just the brackets primarily select columns, but if you pass them a sequence of booleans, they select all rows where the value is True.
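
The four rules above can be tried out on a tiny DataFrame. This is only a minimal sketch: the data below is made up for illustration and is not the bikes dataset.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'tripduration': [993, 623, 1040, 667],
})

col = df['gender']                      # string: one column, returned as a Series
cols = df[['gender', 'tripduration']]   # list of strings: those columns as a DataFrame
first_two = df[0:2]                     # slice: selects rows (here the first two)
kept = df[[True, False, True, False]]   # booleans: keeps the rows where the value is True

print(type(col).__name__, type(cols).__name__)
print(first_two)
print(kept)
```

The same pair of brackets returns a Series, a DataFrame of columns, or a DataFrame of rows, depending only on the type of object placed inside.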

## Using booleans in a Series and not a list
Instead of using a list to contain our booleans, we can store them in a Series. This produces the same output. Below, we use the Series constructor to create a Series object.

In [10]:
s = pd.Series([False, True, False, False, True])
s

0    False
1     True
2    False
3    False
4     True
dtype: bool

### Use the boolean Series to do the boolean selection
Placing the Series directly in the brackets will again select only the rows which have True values in the Series.

In [12]:
bikes_head[s]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


# Practical Boolean Selection
We will almost never create boolean lists/Series manually like we did above but instead use the actual data to create them.

## Creating Boolean Series from Column Data
By far the most common way to create a boolean Series will be from the values of one particular column. We will test a condition using one of the six comparison operators:

* `<`
* `<=`
* `>`
* `>=`
* `==`
* `!=`
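
As a quick illustration, each of the six operators can be applied to a Series and each returns a boolean Series of the same length. The durations below are hypothetical values chosen so that one of them sits exactly on the boundary.

```python
import pandas as pd

# Hypothetical trip durations in seconds (not taken from the bikes file)
duration = pd.Series([993, 623, 1040, 1000])

less = duration < 1000
less_equal = duration <= 1000
greater = duration > 1000
greater_equal = duration >= 1000
equal = duration == 1000
not_equal = duration != 1000

# Every result is a Series of booleans with the same index as `duration`
print(greater.tolist())        # [False, False, True, False]
print(greater_equal.tolist())  # [False, False, True, True]
```

Note how the value 1000 is excluded by `>` but included by `>=`.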


## Create a Boolean Series
Let's create a boolean Series by determining which rows have a trip duration of over 1000 seconds.

In [21]:
criteria = bikes['tripduration'] > 1000
criteria.head(10)

0    False
1    False
2     True
3    False
4    False
5    False
6    False
7    False
8     True
9    False
Name: tripduration, dtype: bool

### Manually verify correctness
Let's output the head of the trip duration Series to manually verify that integer locations 2 and 8 are indeed the ones greater than 1000.

In [23]:
bikes['tripduration'].head(10)

0     993
1     623
2    1040
3     667
4     130
5     660
6     565
7     505
8    1300
9     922
Name: tripduration, dtype: int64

## Complete our boolean indexing
We created our boolean Series, **`criteria`**, using the greater than comparison operator on the **`tripduration`** column. We can now pass this result into the brackets to filter the entire DataFrame. Verify that all **`tripduration`** values are greater than 1000.

In [25]:
bikes[criteria].head(10)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy
11,24673,Subscriber,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.0,mostlycloudy
12,26214,Subscriber,Male,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,41.880422,-87.642746,19.0,Jefferson St & Monroe St,41.880422,-87.642746,19.0,79.0,10.0,0.0,-9999.0,partlycloudy
13,30404,Subscriber,Male,2013-07-06 09:43:00,2013-07-06 10:06:00,1365,May St & Randolph St,41.88397,-87.655688,15.0,Millennium Park,41.881032,-87.624084,35.0,78.1,10.0,5.8,-9999.0,partlycloudy
18,40924,Subscriber,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,41.878114,-87.639971,35.0,Millennium Park,41.881032,-87.624084,35.0,79.0,10.0,13.8,0.0,cloudy
26,51130,Subscriber,Male,2013-07-12 01:07:00,2013-07-12 01:24:00,1043,State St & Harrison St,41.873958,-87.627739,19.0,Racine Ave & 18th St,41.858181,-87.656487,15.0,64.9,10.0,0.0,-9999.0,clear
34,54257,Subscriber,Male,2013-07-12 18:13:00,2013-07-12 18:40:00,1616,Clinton St & Madison St,41.881582,-87.641277,23.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,78.1,10.0,10.4,-9999.0,partlycloudy
40,61401,Subscriber,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,41.867173,-87.625955,19.0,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,87.1,10.0,8.1,-9999.0,partlycloudy


### How many rows have a trip duration greater than 1000?
To answer this question, let's assign the result of the boolean selection to a variable and then retrieve the **`shape`** of the DataFrame.

In [26]:
bikes.shape

(50089, 19)

In [27]:
bikes_duration_1000 = bikes[criteria]
bikes_duration_1000.shape

(10178, 19)

About 20% of the rides are longer than 1000 seconds.
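
The count and the percentage can also be read directly from the boolean Series, without building the filtered DataFrame first. This is a sketch on hypothetical durations: since `True` counts as 1 and `False` as 0, `sum` gives the number of matching rows and `mean` gives their share.

```python
import pandas as pd

# Hypothetical durations; with the real data this would be bikes['tripduration']
duration = pd.Series([993, 623, 1040, 667, 1300])
criteria = duration > 1000

n_true = criteria.sum()    # number of rows that meet the condition
share = criteria.mean()    # fraction of rows that meet the condition

print(n_true, share)
```

On the bikes data, the equivalent ratio is 10178 / 50089, which is where the figure of about 20% comes from.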

# Boolean selection in one line
Often, you will see boolean selection happen in a single line of code instead of the multiple lines we used above. Put the expression with the comparison operator directly inside the brackets.

In [29]:
bikes[bikes['tripduration'] > 1000].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy
11,24673,Subscriber,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.0,mostlycloudy
12,26214,Subscriber,Male,2013-07-05 10:02:00,2013-07-05 10:40:00,2263,Jefferson St & Monroe St,41.880422,-87.642746,19.0,Jefferson St & Monroe St,41.880422,-87.642746,19.0,79.0,10.0,0.0,-9999.0,partlycloudy


If that is confusing for you, then I recommend storing your boolean Series to a variable like we did with **`criteria`** above.

## Single condition expression
Our first example tested a single condition (whether the trip duration was greater than 1,000). Let’s test a different single condition and look for all the rides that happened when the weather was cloudy.

We use the `==` operator to test for equality and assign this result to the variable **`criteria`**. Again, we pass this variable to the brackets, which completes our selection.

In [30]:
criteria = bikes['events'] == 'cloudy'
bikes[criteria].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
6,18880,Subscriber,Male,2013-07-02 17:47:00,2013-07-02 17:56:00,565,Clark St & Randolph St,41.884576,-87.63189,31.0,Ravenswood Ave & Irving Park Rd,41.95469,-87.67393,19.0,66.0,10.0,15.0,-9999.0,cloudy
7,19689,Subscriber,Male,2013-07-03 09:07:00,2013-07-03 09:16:00,505,State St & Van Buren St,41.877181,-87.627844,27.0,Franklin St & Jackson Blvd,41.877708,-87.635321,27.0,64.0,7.0,5.8,-9999.0,cloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
18,40924,Subscriber,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,41.878114,-87.639971,35.0,Millennium Park,41.881032,-87.624084,35.0,79.0,10.0,13.8,0.0,cloudy
19,40879,Subscriber,Male,2013-07-09 13:14:00,2013-07-09 13:20:00,384,Aberdeen St & Madison St,41.881487,-87.654752,19.0,Canal St & Jackson Blvd,41.878114,-87.639971,35.0,79.0,10.0,13.8,0.0,cloudy


## Multiple condition expression
So far, our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators **and**, **or**, and **not**.

## Use `&`, `|` , `~`
Although Python uses the keywords `and`, `or`, and `not`, these will not work when testing multiple conditions with Pandas.

You must use the following operators with pandas:

* **`&`** for and
* **`|`** for or
* **`~`** for not
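
The following sketch, on made-up data, shows what happens in practice. The keyword `and` asks Python for a single True/False value for the whole Series, which pandas refuses to provide and raises a `ValueError`. The operators `&`, `|`, and `~` work element by element instead.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    'tripduration': [993, 1300, 1040],
    'events': ['cloudy', 'cloudy', 'clear'],
})
long_ride = df['tripduration'] > 1000
cloudy = df['events'] == 'cloudy'

try:
    long_ride and cloudy          # Python keyword: fails on a Series
    and_failed = False
except ValueError as err:
    and_failed = True
    print('and failed:', err)

both = long_ride & cloudy         # element-wise and
either = long_ride | cloudy       # element-wise or
not_long = ~long_ride             # element-wise not

print(both.tolist(), either.tolist(), not_long.tolist())
```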

## Our first multiple condition expression
Let’s find all the rides that were longer than 1,000 seconds and happened when it was cloudy. We assign each condition to separate variables and then apply the **and** operator to them.

In [32]:
criteria_1 = bikes['tripduration'] > 1000
criteria_2 = bikes['events'] == 'cloudy'
criteria_all = criteria_1 & criteria_2

In [33]:
bikes[criteria_all].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
18,40924,Subscriber,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,41.878114,-87.639971,35.0,Millennium Park,41.881032,-87.624084,35.0,79.0,10.0,13.8,0.0,cloudy
80,90932,Subscriber,Female,2013-07-22 07:59:00,2013-07-22 08:19:00,1224,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,Dearborn St & Adams St,41.879356,-87.629791,19.0,73.4,10.0,0.0,-9999.0,cloudy
109,110836,Subscriber,Male,2013-07-26 16:39:00,2013-07-26 17:04:00,1468,Indiana Ave & Roosevelt Rd,41.867888,-87.623041,19.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,69.8,10.0,13.8,0.0,cloudy
110,111278,Subscriber,Male,2013-07-26 17:57:00,2013-07-26 18:27:00,1825,Sheffield Ave & Fullerton Ave,41.925602,-87.653708,15.0,Sheffield Ave & Willow St,41.913688,-87.652855,15.0,69.1,10.0,9.2,-9999.0,cloudy


## Multiple conditions in one line
It is possible to combine the entire expression into a single line. Many pandas users like doing this, others hate it. Regardless, it is a good idea to know how to do so as you will definitely encounter it.

## Use parentheses to separate conditions
You must encapsulate each condition in a set of parentheses in order to make this work.

Each condition will be separated like this:

```
(bikes['tripduration'] > 1000) & (bikes['events'] == 'cloudy')
```
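
The parentheses are needed because of Python's operator precedence: `&` binds more tightly than comparison operators such as `>` and `==`. The sketch below uses a hypothetical numeric Series to show the difference. Without parentheses, Python groups the expression as `duration > (1000 & duration) < 2000`, a chained comparison that ends up asking for the truth value of a whole Series.

```python
import pandas as pd

# Hypothetical durations for illustration
duration = pd.Series([993, 1300, 1040, 2500])

# With parentheses: each comparison is evaluated first, then combined
ok = (duration > 1000) & (duration < 2000)
print(ok.tolist())

# Without parentheses: `1000 & duration` is evaluated first, and the
# resulting chained comparison raises a ValueError
try:
    duration > 1000 & duration < 2000
    failed = False
except ValueError as err:
    failed = True
    print('without parentheses:', err)
```

The exact error can differ with other column types (for example, strings), but the cause is the same grouping problem.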

## Same results
We can then place this expression inside just the brackets to get the same results:

In [36]:
bikes[(bikes['tripduration'] > 1000) & (bikes['events'] == 'cloudy')].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
18,40924,Subscriber,Male,2013-07-09 13:12:00,2013-07-09 14:42:00,5396,Canal St & Jackson Blvd,41.878114,-87.639971,35.0,Millennium Park,41.881032,-87.624084,35.0,79.0,10.0,13.8,0.0,cloudy
80,90932,Subscriber,Female,2013-07-22 07:59:00,2013-07-22 08:19:00,1224,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,Dearborn St & Adams St,41.879356,-87.629791,19.0,73.4,10.0,0.0,-9999.0,cloudy
109,110836,Subscriber,Male,2013-07-26 16:39:00,2013-07-26 17:04:00,1468,Indiana Ave & Roosevelt Rd,41.867888,-87.623041,19.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,69.8,10.0,13.8,0.0,cloudy
110,111278,Subscriber,Male,2013-07-26 17:57:00,2013-07-26 18:27:00,1825,Sheffield Ave & Fullerton Ave,41.925602,-87.653708,15.0,Sheffield Ave & Willow St,41.913688,-87.652855,15.0,69.1,10.0,9.2,-9999.0,cloudy


## Using an or condition
Let's find all the rides that were done by females or had trip durations longer than 1,000 seconds.

For the or condition, we use the pipe character **`|`**.

In [37]:
bikes[(bikes['tripduration'] > 1000) | (bikes['gender'] == 'Female')].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
9,23558,Subscriber,Female,2013-07-04 15:00:00,2013-07-04 15:16:00,922,Lakeview Ave & Fullerton Pkwy,41.925858,-87.638973,19.0,Racine Ave & Congress Pkwy,41.87464,-87.65703,19.0,81.0,10.0,12.7,-9999.0,mostlycloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy
11,24673,Subscriber,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.0,mostlycloudy


## Reversing a condition with the not operator
The tilde character, **`~`**, represents the not operator and reverses a condition. For instance, if we wanted all the rides with trip duration less than or equal to 1,000, we could do it like this (notice the parentheses around the criteria):

In [40]:
bikes[~(bikes['tripduration'] > 1000)].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
5,13595,Subscriber,Male,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,41.854016,-87.695445,15.0,Clark St & Wrightwood Ave,41.929546,-87.643118,15.0,73.0,10.0,17.3,-9999.0,mostlycloudy


Of course, reversing single conditions is pretty pointless as we can simply use the less than or equal to operator instead like this:

In [43]:
bikes[bikes['tripduration'] <= 1000].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
5,13595,Subscriber,Male,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,41.854016,-87.695445,15.0,Clark St & Wrightwood Ave,41.929546,-87.643118,15.0,73.0,10.0,17.3,-9999.0,mostlycloudy


### Reverse a more complex condition
Typically, we will save the not operator for reversing more complex conditions. Let's reverse the condition for selecting rides by females or those with duration over 1,000 seconds.

Notice that there are parentheses around the entire expression. Logically, this returns only rides that were not by female riders and had a duration of 1,000 seconds or less (all of the rows shown below are male riders).

In [69]:
criteria = ~((bikes['tripduration'] > 1000) | (bikes['gender'] == 'Female'))
bikes[criteria].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy
5,13595,Subscriber,Male,2013-07-01 12:37:00,2013-07-01 12:48:00,660,California Ave & 21st St,41.854016,-87.695445,15.0,Clark St & Wrightwood Ave,41.929546,-87.643118,15.0,73.0,10.0,17.3,-9999.0,mostlycloudy
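
Reversing an or condition is the same as reversing each part and joining them with and (De Morgan's law). The sketch below checks this on hypothetical data; the column names mirror the bikes dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'tripduration': [993, 623, 1040, 6274],
})
long_ride = df['tripduration'] > 1000
female = df['gender'] == 'Female'

negated_or = ~(long_ride | female)       # not (long or female)
each_negated = ~long_ride & ~female      # (not long) and (not female)

same = negated_or.equals(each_negated)
print(negated_or.tolist(), same)
```

Only the first row (a non-female rider with a duration of 1,000 or less) survives in both versions.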


## Even more complex conditions
It is possible to build extremely complex conditions to select rows of your DataFrame that meet very specific criteria. For instance, we can select male riders with trip duration between 1,000 and 2,000 seconds along with female riders with trip duration between 5,000 and 10,000 seconds.

With multiple conditions, it's probably best to break out the logic into multiple steps:

In [76]:
criteria_1 = (bikes['gender'] == 'Male') & (bikes['tripduration'] >= 1000) & (bikes['tripduration'] <= 2000)
criteria_2 = (bikes['gender'] == 'Female') & (bikes['tripduration'] >= 5000) & (bikes['tripduration'] <= 10000)
criteria_all = criteria_1 | criteria_2

In [79]:
bikes[criteria_all].head(10)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
8,21028,Subscriber,Male,2013-07-03 15:21:00,2013-07-03 15:42:00,1300,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wood St & Division St,41.90332,-87.67273,15.0,71.1,8.0,0.0,-9999.0,cloudy
10,24383,Subscriber,Male,2013-07-04 17:17:00,2013-07-04 17:42:00,1523,Morgan St & 18th St,41.858086,-87.651073,15.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,79.0,10.0,9.2,-9999.0,mostlycloudy
11,24673,Subscriber,Male,2013-07-04 18:13:00,2013-07-04 18:42:00,1697,Ashland Ave & Armitage Ave,41.917859,-87.668919,15.0,Lincoln Ave & Armitage Ave,41.918273,-87.638116,19.0,79.0,10.0,10.4,-9999.0,mostlycloudy
13,30404,Subscriber,Male,2013-07-06 09:43:00,2013-07-06 10:06:00,1365,May St & Randolph St,41.88397,-87.655688,15.0,Millennium Park,41.881032,-87.624084,35.0,78.1,10.0,5.8,-9999.0,partlycloudy
26,51130,Subscriber,Male,2013-07-12 01:07:00,2013-07-12 01:24:00,1043,State St & Harrison St,41.873958,-87.627739,19.0,Racine Ave & 18th St,41.858181,-87.656487,15.0,64.9,10.0,0.0,-9999.0,clear
34,54257,Subscriber,Male,2013-07-12 18:13:00,2013-07-12 18:40:00,1616,Clinton St & Madison St,41.881582,-87.641277,23.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,78.1,10.0,10.4,-9999.0,partlycloudy
40,61401,Subscriber,Female,2013-07-14 14:08:00,2013-07-14 15:53:00,6274,Wabash Ave & Roosevelt Rd,41.867173,-87.625955,19.0,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,87.1,10.0,8.1,-9999.0,partlycloudy
41,64257,Subscriber,Male,2013-07-15 06:26:00,2013-07-15 06:44:00,1125,Racine Ave & Fullerton Ave,41.925563,-87.658404,19.0,State St & Kinzie St,41.88918,-87.6277,15.0,73.9,10.0,0.0,-9999.0,partlycloudy
47,67013,Subscriber,Male,2013-07-15 19:10:00,2013-07-15 19:34:00,1463,Lake Shore Dr & Ohio St,41.89257,-87.614492,19.0,Lake Shore Dr & Ohio St,41.89257,-87.614492,19.0,80.1,10.0,6.9,-9999.0,mostlycloudy


## Lots of equality conditions in a single column - use `isin`
Occasionally, we will want to test equality in a single column with multiple values. This is most common in string columns. For instance, let’s say we wanted to find all the rides where the events were either rain, snow, tstorms or sleet.

One way to do this would be with four or conditions.

In [89]:
criteria = ((bikes['events'] == 'rain') | 
            (bikes['events'] == 'snow') | 
            (bikes['events'] == 'tstorms') | 
            (bikes['events'] == 'sleet'))
bikes[criteria].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
45,66336,Subscriber,Male,2013-07-15 16:43:00,2013-07-15 16:55:00,727,Greenwood Ave & 47th St,41.809835,-87.599383,15.0,State St & Harrison St,41.873958,-87.627739,19.0,82.9,10.0,5.8,0.0,rain
78,89180,Subscriber,Male,2013-07-21 16:35:00,2013-07-21 17:06:00,1809,Michigan Ave & Pearson St,41.89766,-87.62351,23.0,Millennium Park,41.881032,-87.624084,35.0,82.4,10.0,11.5,0.0,tstorms
79,89228,Subscriber,Male,2013-07-21 16:47:00,2013-07-21 17:03:00,999,Carpenter St & Huron St,41.894556,-87.653449,19.0,Carpenter St & Huron St,41.894556,-87.653449,19.0,82.4,10.0,11.5,0.0,tstorms
86,95044,Subscriber,Female,2013-07-23 00:16:00,2013-07-23 00:26:00,563,Wabash Ave & Roosevelt Rd,41.867173,-87.625955,19.0,Daley Center Plaza,41.884337,-87.630183,47.0,78.8,10.0,17.3,0.0,tstorms
112,111568,Subscriber,Male,2013-07-26 19:10:00,2013-07-26 19:33:00,1395,Larrabee St & Kingsbury St,41.897764,-87.642884,27.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,66.9,8.0,12.7,0.0,rain


Instead we can call the **`isin`** method and pass a list of all the acceptable values:

In [92]:
criteria = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
bikes[criteria].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
45,66336,Subscriber,Male,2013-07-15 16:43:00,2013-07-15 16:55:00,727,Greenwood Ave & 47th St,41.809835,-87.599383,15.0,State St & Harrison St,41.873958,-87.627739,19.0,82.9,10.0,5.8,0.0,rain
78,89180,Subscriber,Male,2013-07-21 16:35:00,2013-07-21 17:06:00,1809,Michigan Ave & Pearson St,41.89766,-87.62351,23.0,Millennium Park,41.881032,-87.624084,35.0,82.4,10.0,11.5,0.0,tstorms
79,89228,Subscriber,Male,2013-07-21 16:47:00,2013-07-21 17:03:00,999,Carpenter St & Huron St,41.894556,-87.653449,19.0,Carpenter St & Huron St,41.894556,-87.653449,19.0,82.4,10.0,11.5,0.0,tstorms
86,95044,Subscriber,Female,2013-07-23 00:16:00,2013-07-23 00:26:00,563,Wabash Ave & Roosevelt Rd,41.867173,-87.625955,19.0,Daley Center Plaza,41.884337,-87.630183,47.0,78.8,10.0,17.3,0.0,tstorms
112,111568,Subscriber,Male,2013-07-26 19:10:00,2013-07-26 19:33:00,1395,Larrabee St & Kingsbury St,41.897764,-87.642884,27.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,66.9,8.0,12.7,0.0,rain
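
To confirm that the two approaches are interchangeable, the sketch below compares them on a small hypothetical Series of events. It also shows that the result of `isin` can be reversed with `~` to keep everything that is not in the list.

```python
import pandas as pd

# Hypothetical events for illustration
events = pd.Series(['rain', 'cloudy', 'tstorms', 'clear', 'snow'])
wet = ['rain', 'snow', 'tstorms', 'sleet']

with_isin = events.isin(wet)
with_or = ((events == 'rain') |
           (events == 'snow') |
           (events == 'tstorms') |
           (events == 'sleet'))

print(with_isin.equals(with_or))   # both approaches give the same booleans

# Reverse the condition: keep only events that are NOT in the list
not_wet = events[~with_isin]
print(not_wet.tolist())
```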


## Combining isin with other criteria
You can use the resulting boolean Series from the **`isin`** method in the same way you would from the logical operators. For instance, if we wanted to find all the rides that had the same events and had a duration greater than 2,000, we would do the following:

In [97]:
criteria_1 = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
criteria_2 = bikes['tripduration'] > 2000
criteria_all = criteria_1 & criteria_2
bikes[criteria_all].head()

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
2344,1266453,Subscriber,Female,2014-03-19 07:23:00,2014-03-19 08:00:00,2181,Seeley Ave & Roscoe St,41.943403,-87.679618,11.0,Franklin St & Lake St,41.885837,-87.6355,23.0,43.0,3.0,6.9,0.07,rain
7697,3557596,Subscriber,Male,2014-09-12 14:20:00,2014-09-12 14:57:00,2213,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,California Ave & Division St,41.903029,-87.697474,15.0,52.0,2.0,12.7,0.0,rain
8357,3801419,Subscriber,Male,2014-09-30 08:21:00,2014-09-30 08:58:00,2246,Damen Ave & Melrose Ave,41.9406,-87.6785,11.0,Wood St & Taylor St,41.869154,-87.671045,15.0,46.9,3.0,11.5,0.0,rain
8506,3846762,Subscriber,Male,2014-10-04 12:33:00,2014-10-04 14:06:00,5568,Halsted St & Diversey Pkwy,41.933341,-87.648747,15.0,Halsted St & Wrightwood Ave,41.929143,-87.649077,15.0,42.1,8.0,17.3,0.02,rain
11267,4822906,Subscriber,Male,2015-04-10 17:25:00,2015-04-10 18:00:00,2074,Stetson Ave & South Water St,41.886835,-87.62232,19.0,Lake Shore Dr & Wellington Ave,41.936669,-87.636794,15.0,46.9,10.0,17.3,0.0,rain
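
The objectives also mention two cases that the examples above do not show: using **`.loc`** to select rows by boolean and columns by label at the same time, and boolean selection on a Series. Here is a minimal sketch of both, using a hypothetical three-row DataFrame rather than the bikes data.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'tripduration': [993, 1300, 1040],
    'events': ['cloudy', 'rain', 'clear'],
})
criteria = df['tripduration'] > 1000

# Rows where criteria is True, and only the two listed columns
subset = df.loc[criteria, ['gender', 'events']]
print(subset)

# The same idea on a Series: keep the values where the condition is True
duration = df['tripduration']
long_durations = duration[duration > 1000]
print(long_durations.tolist())
```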


# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as `actor1`. How many of these movies has he starred in?</span>

In [109]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>

In [110]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Select all movies from the 1970s.</span>

In [111]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Select all movies from the 1970s that had IMDB scores greater than 8.</span>

In [112]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Select movies that were rated either R, PG-13, or PG.</span>

In [113]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Select movies that are either rated PG-13 or were made after 2010.</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Find all the movies that have at least one of the three actors with more than 10,000 Facebook likes.</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Reverse the condition from problem 6. Use one line of code. In words, what have you selected?</span>

In [None]:
# your code here