# Boolean Indexing

### Objectives
After this lesson you should be able to...
+ Use the indexing operator **`[]`** alone for boolean indexing, an exception to the rule of using **`.loc`** or **`.iloc`**
+ Create 'criteria' with the comparison operators
+ Build complex criteria from multiple boolean comparisons
+ Combine multiple comparisons with **`&`**, **`|`** and **`~`**, because the keywords **`and`**, **`or`** and **`not`** do NOT work in pandas

### Prepare for this lesson by...
[ALWAYS READ THE DOCUMENTATION BEFORE A LESSON!](http://pandas.pydata.org/pandas-docs/stable/)
+ Read [Boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)
+ Read [Indexing with isin](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-with-isin)

### Introduction
Previously we selected subsets of data based on integer location or the label of the index. Another common way to extract values from a Series is to choose them based on certain criteria. **Boolean indexing** is done by passing a boolean array or Series (one that contains only True/False values) to the **`[ ]`** operator. Boolean indexing is also referred to as **boolean selection**.

In [1]:
import pandas as pd
import numpy as np

### Setting index on read
Read in the movie dataset with the **`read_csv`** function but this time use the **`index_col`** parameter to set the index on read to **`movie_title`**. Then select the first 5 values of the **`actor_1_facebook_likes`** Series.

In [2]:
movie = pd.read_csv('../data/movie.csv', index_col='movie_title')
fb_likes = movie['actor_1_facebook_likes'].head()
fb_likes

movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

### It's OK to use [ ] with boolean indexing
Previously we cautioned you to use **`.iloc`** or **`.loc`** when accessing elements of a Series. Boolean indexing is an exception to this rule. Here we create a list of booleans and place it directly inside the bracket operator. This selects the elements whose corresponding boolean is True. The example below keeps the first and third elements.

In [3]:
keep = [True, False, True, False, False]
fb_likes[keep]

movie_title
Avatar      1000.0
Spectre    11000.0
Name: actor_1_facebook_likes, dtype: float64
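
As a side note, **`.loc`** also accepts a boolean list of the same length, so the shortcut above is a convenience rather than the only option. The sketch below rebuilds the five values by hand so it runs without the CSV file; the names `likes`, `with_brackets` and `with_loc` are illustrative and not part of the lesson.

```python
import pandas as pd

# Hand-built copy of the five values shown above, so no CSV file is needed
likes = pd.Series(
    [1000.0, 40000.0, 11000.0, 27000.0, 131.0],
    index=['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
           'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens'],
    name='actor_1_facebook_likes',
)

keep = [True, False, True, False, False]

with_brackets = likes[keep]   # the shortcut used in this lesson
with_loc = likes.loc[keep]    # the explicit indexer takes the same boolean list

print(with_brackets)
print(with_brackets.equals(with_loc))
```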

### Creating boolean Series
You can compare each element with another value using the comparison operators `<`, `>`, `<=`, `>=`, `==` and `!=`. The result is a Series of booleans with the same index labels as the original. This result will soon be used inside the indexing operator, as was done above.

In [4]:
fb_likes > 10000

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: actor_1_facebook_likes, dtype: bool
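
The other comparison operators behave the same way. A minimal sketch, again on a hand-built copy of the five values (the variable names are illustrative), shows that each comparison returns a boolean Series that shares the original index:

```python
import pandas as pd

# Hand-built copy of the first five actor_1_facebook_likes values
likes = pd.Series(
    [1000.0, 40000.0, 11000.0, 27000.0, 131.0],
    index=['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
           'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens'],
    name='actor_1_facebook_likes',
)

exactly_1000 = likes == 1000      # True only where the value equals 1000
not_1000 = likes != 1000          # the element-wise opposite of the line above
at_most_11000 = likes <= 11000    # the boundary value 11000 is included

print(at_most_11000)
print(at_most_11000.index.equals(likes.index))  # index labels are preserved
```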

### Create 'criteria' for boolean selection
We can save this boolean Series to the variable **`criteria`** and then pass it to the indexing operator. The result will be a Series with only the movies whose actor 1 has more than 10,000 Facebook likes.

In [5]:
criteria = fb_likes > 10000

In [6]:
# pass the criteria to the original Series to keep only the values above 10000
fb_likes[criteria]

movie_title
Pirates of the Caribbean: At World's End    40000.0
Spectre                                     11000.0
The Dark Knight Rises                       27000.0
Name: actor_1_facebook_likes, dtype: float64

### No intermediate variable
It's possible to pass the boolean expression directly into the indexing operator without first saving it to a variable.

In [7]:
fb_likes[fb_likes > 10000]

movie_title
Pirates of the Caribbean: At World's End    40000.0
Spectre                                     11000.0
The Dark Knight Rises                       27000.0
Name: actor_1_facebook_likes, dtype: float64

## Boolean Indexing with String Columns
Boolean indexing also works on columns that hold strings rather than numbers. Let's select the column **`actor_1_name`** and do some boolean indexing. We will work with the whole column and not just the first 5 values.

In [8]:
actor1 = movie['actor_1_name']
actor1.head()

movie_title
Avatar                                            CCH Pounder
Pirates of the Caribbean: At World's End          Johnny Depp
Spectre                                       Christoph Waltz
The Dark Knight Rises                               Tom Hardy
Star Wars: Episode VII - The Force Awakens        Doug Walker
Name: actor_1_name, dtype: object

### Find all Johnny Depp movies

In [9]:
actor1[actor1 == 'Johnny Depp']

movie_title
Pirates of the Caribbean: At World's End                  Johnny Depp
Pirates of the Caribbean: Dead Man's Chest                Johnny Depp
The Lone Ranger                                           Johnny Depp
Pirates of the Caribbean: On Stranger Tides               Johnny Depp
Alice in Wonderland                                       Johnny Depp
Alice Through the Looking Glass                           Johnny Depp
Charlie and the Chocolate Factory                         Johnny Depp
Dark Shadows                                              Johnny Depp
Rango                                                     Johnny Depp
Pirates of the Caribbean: The Curse of the Black Pearl    Johnny Depp
Public Enemies                                            Johnny Depp
The Tourist                                               Johnny Depp
Transcendence                                             Johnny Depp
Mortdecai                                                 Johnny Depp
Black Ma

### Using `isin` method to check for multiple equalities
Series have the method **`isin`** (read 'is in') that checks whether each element is a member of a given list. It returns a boolean Series the same length as the original. Let's search for the actors Johnny Depp, Matt Damon and Tom Hanks.

In [10]:
criteria = actor1.isin(['Johnny Depp', 'Matt Damon', 'Tom Hanks'])
criteria.head(5)

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End       True
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
Name: actor_1_name, dtype: bool

In [11]:
actor1[criteria].head(15)

movie_title
Pirates of the Caribbean: At World's End                  Johnny Depp
Pirates of the Caribbean: Dead Man's Chest                Johnny Depp
The Lone Ranger                                           Johnny Depp
Pirates of the Caribbean: On Stranger Tides               Johnny Depp
Alice in Wonderland                                       Johnny Depp
Toy Story 3                                                 Tom Hanks
The Polar Express                                           Tom Hanks
Alice Through the Looking Glass                           Johnny Depp
Charlie and the Chocolate Factory                         Johnny Depp
Angels & Demons                                             Tom Hanks
Dark Shadows                                              Johnny Depp
Rango                                                     Johnny Depp
The Bourne Ultimatum                                       Matt Damon
Pirates of the Caribbean: The Curse of the Black Pearl    Johnny Depp
The Da V
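
One way to see what **`isin`** does is to compare it with chained equality tests. The sketch below uses a hand-built Series of the first five actor names (the names `actors`, `with_isin` and `with_or` are illustrative). It relies on the element-wise operators **`|`** and **`~`**, which are covered in the next section.

```python
import pandas as pd

# Hand-built copy of the first five actor_1_name values
actors = pd.Series(
    ['CCH Pounder', 'Johnny Depp', 'Christoph Waltz', 'Tom Hardy', 'Doug Walker'],
    index=['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
           'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens'],
    name='actor_1_name',
)

# isin tests membership in the whole list at once ...
with_isin = actors.isin(['Johnny Depp', 'Tom Hardy'])

# ... which matches one equality test per name joined with | (element-wise or)
with_or = (actors == 'Johnny Depp') | (actors == 'Tom Hardy')

print(with_isin.equals(with_or))

# ~ flips the booleans, selecting everyone NOT in the list
print(actors[~with_isin])
```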

## Multiple boolean expressions
Any number of boolean conditions can be strung together to retrieve certain values, just as they can in Python. The keywords **`and`**, **`or`** and **`not`** do not work element-wise on pandas and NumPy objects. Instead, `and` is replaced with **`&`**, `or` with **`|`** and `not` with **`~`**. Also, each comparison expression must be wrapped in parentheses, because these operators bind more tightly than the comparison operators.
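
A minimal sketch of the difference, on a hand-built Series of durations taken from the outputs in this lesson (the names `minutes`, `raised` and `both` are illustrative): the keyword `and` raises a `ValueError`, while `&` with parenthesized comparisons works element by element.

```python
import pandas as pd

# Hand-built sample of durations that appear in this lesson
minutes = pd.Series(
    [178.0, 169.0, 148.0, 164.0, 100.0, 183.0],
    index=['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
           'The Dark Knight Rises', 'Tangled', 'Batman v Superman: Dawn of Justice'],
    name='duration',
)

# `and` needs a single True/False for each whole Series, which pandas refuses to guess
try:
    (minutes >= 120) and (minutes <= 180)
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# `&` combines the two boolean Series element by element.
# The parentheses are needed because & binds more tightly than >= and <=.
both = (minutes >= 120) & (minutes <= 180)
print(minutes[both])
```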

Let's select the **`duration`** column and run through examples of multiple boolean expressions.

In [12]:
duration = movie['duration']
duration.head()

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

#### Find all movies that lasted between two and three hours.
Use the **and** operator **&**

In [13]:
criteria = (duration >= 120) & (duration <= 180)
duration[criteria].head()

movie_title
Avatar                                      178.0
Pirates of the Caribbean: At World's End    169.0
Spectre                                     148.0
The Dark Knight Rises                       164.0
John Carter                                 132.0
Name: duration, dtype: float64
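
The Series method **`between`** expresses the same kind of range check in a single call and, by default, includes both endpoints. A minimal sketch on hand-built values from this lesson (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Hand-built sample of durations that appear in this lesson, including a missing one
runtimes = pd.Series(
    [178.0, 100.0, 183.0, np.nan, 132.0],
    index=['Avatar', 'Tangled', 'Batman v Superman: Dawn of Justice',
           'Star Wars: Episode VII - The Force Awakens', 'John Carter'],
    name='duration',
)

via_ops = (runtimes >= 120) & (runtimes <= 180)
via_between = runtimes.between(120, 180)   # both endpoints included by default

print(via_between.equals(via_ops))

# The endpoints themselves are kept: 132 and 178 are both selected here
print(runtimes[runtimes.between(132, 178)])
```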

#### Find all movies that lasted either less than two hours or more than three hours
Use the **or** operator **|**

In [14]:
criteria = (duration < 120) | (duration > 180)
duration[criteria].head()

movie_title
Tangled                                100.0
Batman v Superman: Dawn of Justice     183.0
Quantum of Solace                      106.0
Men in Black 3                         106.0
The Hobbit: The Desolation of Smaug    186.0
Name: duration, dtype: float64

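#### Reverse the previous criteria with the not operator
Negating the "less than two hours or more than three hours" criteria should give back the movies that lasted between two and three hours. Use the **not** operator **~**.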

In [15]:
criteria = ~((duration < 120) | (duration > 180))
duration[criteria].head()

movie_title
Avatar                                        178.0
Pirates of the Caribbean: At World's End      169.0
Spectre                                       148.0
The Dark Knight Rises                         164.0
Star Wars: Episode VII - The Force Awakens      NaN
Name: duration, dtype: float64

#### Why is there a missing value in the above Series?
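Both `<` and `>` evaluate to False when compared against a missing value (of the comparison operators, only `!=` returns True), so the combined criteria is False for that row. Reversing it with the not operator makes it True, and hence the missing value gets selected.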
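
The effect can be reproduced on a small hand-built Series (the names `lengths`, `outside`, `negated`, `inside` and `keep_known` are illustrative). The sketch also shows one possible way to drop the missing row, by combining the negated criteria with **`notna`**:

```python
import numpy as np
import pandas as pd

# Hand-built sample with one missing duration, as in the output above
lengths = pd.Series(
    [178.0, 100.0, np.nan],
    index=['Avatar', 'Tangled', 'Star Wars: Episode VII - The Force Awakens'],
    name='duration',
)

outside = (lengths < 120) | (lengths > 180)    # False for the NaN row
negated = ~outside                             # flips that False to True
inside = (lengths >= 120) & (lengths <= 180)   # False for the NaN row

print(negated)
print(inside)

# Requiring a non-missing value makes the negated version match `inside`
keep_known = ~outside & lengths.notna()
print(keep_known.equals(inside))
```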

# Your Turn

### Problem 1
<span  style="color:green; font-size:16px">Select the column **`movie_facebook_likes`** as a Series, save it to a variable with the same name and output its first 10 values.</span>

In [16]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies with 0 movie facebook likes.</span>

In [17]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies that don't have 0 movie facebook likes.</span>

In [18]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Use boolean indexing to select all movies with more than 50,000 movie facebook likes but fewer than 100,000.</span>

In [19]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Sort the values of the output from problem 4 and output the head and tail to validate the boolean indexing.</span>

In [20]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 1000 or greater than 100,000.</span>

In [21]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 1000 but greater than 0 or greater than 100,000.</span>

In [22]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Use boolean indexing to select movies with facebook likes less than 1000 but greater than 0 or greater than 100,000 but less than 120,000. Think about breaking up the boolean expression into two different criteria before combining them together.</span>

In [23]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Reverse the boolean selection from problem 8.</span>

In [24]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Use the **`between`** method to replicate problem 4.</span>

In [25]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">How many movies have more than 100,000 facebook likes?</span>

In [26]:
# your code here