---
# Filtering Rows
---

## Calculating Boolean statistics
We create a Boolean array by applying a condition to a column of data and then
calculate summary statistics from it.

Read in the movie dataset, set the index to the movie title, and inspect a random
sample of rows from the duration column.

In [1]:
import numpy as np
import pandas as pd

In [2]:
movie = pd.read_csv('movie.csv', index_col='movie_title')
movie[['duration']].sample(n=8, random_state=42)

movie_title,duration
The Book Thief,131.0
The Beyond,82.0
Clear and Present Danger,141.0
The Ballad of Cable Hogue,121.0
Bobby Jones: Stroke of Genius,128.0
The Jungle Book,106.0
Malibu's Most Wanted,86.0
The Brain That Sings,62.0


Determine whether the duration of each movie is longer than two hours by applying the
greater-than comparison to the duration column, here through the `.gt` method:

In [7]:
movie_2_hours = movie[['duration']].gt(120)
movie_2_hours

movie_title,duration
Avatar,True
Pirates of the Caribbean: At World's End,True
Spectre,True
The Dark Knight Rises,True
Star Wars: Episode VII - The Force Awakens,False
...,...
Signed Sealed Delivered,False
The Following,False
A Plague So Pleasant,False
Shanghai Calling,False


We can now use this Boolean DataFrame to determine the number of movies that are longer than
two hours.

In [8]:
movie_2_hours.sum()

duration    1039
dtype: int64

To find the percentage of movies in the dataset longer than two hours, use the `.mean` method and multiply the result by 100.

In [9]:
movie_2_hours.mean().mul(100)

duration    21.135069
dtype: float64
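
The counting and averaging above work because `True` is treated as 1 and `False` as 0. A minimal sketch of the same idea on a small made-up Series (the values below are illustrative and are not taken from `movie.csv`):

```python
import pandas as pd

# Made-up durations in minutes (illustrative only, not from movie.csv)
durations = pd.Series([95, 130, 121, 88, 150], name="duration")

# Comparing a Series to a scalar yields a Boolean Series
is_long = durations.gt(120)

# True counts as 1 and False as 0, so .sum() counts the matches
n_long = is_long.sum()

# ... and .mean() gives the fraction of matches
pct_long = is_long.mean() * 100

print(n_long, pct_long)
```

Here three of the five made-up values exceed 120, so the count is 3 and the percentage is 60.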

Unfortunately, the percentage above is misleading. The duration column has a few missing values, and the Boolean condition evaluates to False for each of them, so those rows are counted as movies that are not longer than two hours. We need to drop the missing values first, then evaluate the condition and take the mean.

In [13]:
movie['duration'].isna().sum()

15

In [15]:
movie[['duration']].dropna().gt(120).mean().mul(100)

duration    21.199755
dtype: float64
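
The effect of missing values on the percentage can be seen on a tiny example. This is a sketch with made-up numbers, assuming only that missing durations are stored as `NaN`:

```python
import numpy as np
import pandas as pd

# Made-up durations with one missing value (illustrative only)
durations = pd.Series([130.0, 90.0, np.nan, 150.0], name="duration")

# NaN > 120 evaluates to False, so the missing row is counted as "not long"
naive_pct = durations.gt(120).mean() * 100           # 2 of 4 rows

# Dropping missing values first shrinks the denominator to the known rows
clean_pct = durations.dropna().gt(120).mean() * 100  # 2 of 3 rows

print(naive_pct, clean_pct)
```

The numerator is the same in both cases; only the denominator changes, which is why the corrected percentage is slightly higher.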

Use the `.describe` method to output summary statistics on the Boolean array.

In [16]:
movie_2_hours.describe()

,duration
count,4916
unique,2
top,False
freq,3877


In [17]:
movie_2_hours.value_counts(normalize=True)

duration
False       0.788649
True        0.211351
dtype: float64

It is possible to compare two columns from the same DataFrame to produce a Boolean Series.
For instance, we could determine the percentage of movies in which actor 1 has more Facebook likes than actor 2. To do this, we select both columns and drop any rows that have a missing value in either column. Then we make the comparison and calculate the mean:

In [18]:
fb_likes = ["actor_1_facebook_likes", "actor_2_facebook_likes"]
actors = movie[fb_likes].dropna()
actors.head()

movie_title,actor_1_facebook_likes,actor_2_facebook_likes
Avatar,1000.0,936.0
Pirates of the Caribbean: At World's End,40000.0,5000.0
Spectre,11000.0,393.0
The Dark Knight Rises,27000.0,23000.0
Star Wars: Episode VII - The Force Awakens,131.0,12.0


In [19]:
(
    actors['actor_1_facebook_likes'] > actors['actor_2_facebook_likes']
).mean()

0.9777687130328371
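
The same column-to-column comparison can be sketched on a small made-up DataFrame. The column names match the dataset, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Made-up like counts (illustrative only, not from movie.csv)
likes = pd.DataFrame({
    "actor_1_facebook_likes": [1000.0, 400.0, np.nan, 50.0],
    "actor_2_facebook_likes": [900.0, 400.0, 10.0, 20.0],
})

# Drop rows where either column is missing before comparing
complete = likes.dropna()

# Element-wise comparison of two columns yields a Boolean Series
actor_1_ahead = (
    complete["actor_1_facebook_likes"] > complete["actor_2_facebook_likes"]
)

# Fraction of complete rows where actor 1 has strictly more likes
share = actor_1_ahead.mean()
print(share)
```

Note that a tie (400 versus 400) counts as False because the comparison is strict.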

In [20]:
actors['actor_1_facebook_likes'].mean(), actors['actor_2_facebook_likes'].mean()

(6502.41444013869, 1621.9235162145626)

In [21]:
actors['actor_1_facebook_likes'].mean() / actors['actor_2_facebook_likes'].mean()


4.0090758751157365

## Constructing multiple Boolean conditions
In Python, Boolean expressions use the **built-in** logical operators `and`, `or`, and `not`. These keywords do not work with Boolean indexing in pandas and are respectively replaced with `&`, `|`, and `~`. Additionally, when combining expressions, each expression must be wrapped in parentheses, or an error will be raised (due to operator precedence).
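
A minimal sketch of these rules on a made-up three-row DataFrame (the names `df_toy`, `year`, and `score` are hypothetical and chosen only for this illustration):

```python
import pandas as pd

# Made-up data (illustrative only)
df_toy = pd.DataFrame({"year": [1995, 2005, 2012], "score": [6.0, 9.0, 8.5]})

# `and` asks for the truth value of a whole Series, which is ambiguous
try:
    (df_toy["year"] > 2000) and (df_toy["score"] > 8)
    raised = False
except ValueError:
    raised = True

# `&` works element-wise; parentheses keep each comparison intact
both = (df_toy["year"] > 2000) & (df_toy["score"] > 8)

# `&` binds more tightly than `|`, so an "or" group needs its own parentheses
ungrouped = (
    (df_toy["year"] < 2000) | (df_toy["year"] > 2009) & (df_toy["score"] > 8)
)
grouped = (
    ((df_toy["year"] < 2000) | (df_toy["year"] > 2009)) & (df_toy["score"] > 8)
)

print(raised, both.tolist(), ungrouped.tolist(), grouped.tolist())
```

In the ungrouped version the 1995 row passes even though its score is 6.0, because the expression is read as `A | (B & C)`. The grouped version applies the score condition to both year ranges.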

As an example, we will find all the movies that have an *imdb_score* greater
than **8**, a *content_rating* of **PG-13**, and a *title_year* either before **2000** or after **2009**.

In [24]:
df = movie[['title_year', 'content_rating', 'imdb_score']]
df.head(2)

movie_title,title_year,content_rating,imdb_score
Avatar,2009.0,PG-13,7.9
Pirates of the Caribbean: At World's End,2007.0,PG-13,7.1


In [27]:
mask = (
    (df.title_year < 2000) | (df.title_year > 2009) & 
    (df.content_rating == 'PG-13') & 
    (df.imdb_score > 8)
)
mask

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Length: 4916, dtype: bool

In [26]:
df[mask]

movie_title,title_year,content_rating,imdb_score
The Dark Knight Rises,2012.0,PG-13,8.5
The Avengers,2012.0,PG-13,8.1
Titanic,1997.0,PG-13,7.7
Captain America: Civil War,2016.0,PG-13,8.2
Wild Wild West,1999.0,PG-13,4.8
...,...,...,...
Slacker,1991.0,R,7.1
Pink Flamingos,1972.0,NC-17,6.1
The Cure,1997.0,,7.4
Bang,1995.0,,6.4

Note that this output includes rows such as Titanic (*imdb_score* of 7.7) and Slacker (*content_rating* of R) that do not meet all three criteria. Because `&` binds more tightly than `|`, the mask as written is evaluated as `(title_year < 2000) | ((title_year > 2009) & (content_rating == 'PG-13') & (imdb_score > 8))`, so every movie released before 2000 passes. To apply the rating and score conditions to both year ranges, wrap the two year conditions in their own pair of parentheses: `((df.title_year < 2000) | (df.title_year > 2009))`.


## Filtering with Boolean arrays
Both Series and DataFrame objects can be filtered with Boolean arrays. You can pass the array directly to the indexing operator (`[]`) of the object or to its `.loc` attribute.

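This example constructs two complex filters for different rows of movies. The first is meant to select movies with an *imdb_score* greater than **8**, a *content_rating* of **PG-13**, and a *title_year* either before **2000** or after **2009**. The second selects those with an *imdb_score*
less than **5**, a *content_rating* of **R**, and a *title_year* between **2000** and **2010**. Finally, we will combine these filters. Note that `mask1` as written has no parentheses around the two year conditions; because `&` binds more tightly than `|`, it also selects every movie released before 2000, which is why titles such as Titanic (7.7) appear in the results.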

In [28]:
mask1 = (
    (df.title_year < 2000) | (df.title_year > 2009) & 
    (df.content_rating == 'PG-13') & 
    (df.imdb_score > 8)
)
mask1

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Length: 4916, dtype: bool

In [29]:
mask2 = (
    (df.imdb_score < 5) &
    (df.content_rating == 'R') &
    (df.title_year >= 2000) &
    (df.title_year <= 2010)   
)
mask2

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Length: 4916, dtype: bool

Combine the two sets of criteria using the pandas `|` (or) operator. This yields a Boolean
array of all movies that are members of either set.

In [30]:
mask = (mask1 | mask2)
mask

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Length: 4916, dtype: bool
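
Combining two masks with `|` can be sketched on a small made-up DataFrame. The column names follow the dataset, while the rows and the names `toy`, `high_pg13`, and `low_r` are invented for illustration:

```python
import pandas as pd

# Made-up rows (illustrative only, not from movie.csv)
toy = pd.DataFrame(
    {
        "imdb_score": [8.5, 4.0, 6.5, 3.9],
        "content_rating": ["PG-13", "R", "R", "PG"],
    },
    index=["A", "B", "C", "D"],
)

# Each mask is a Boolean Series aligned on the same index
high_pg13 = (toy["imdb_score"] > 8) & (toy["content_rating"] == "PG-13")
low_r = (toy["imdb_score"] < 5) & (toy["content_rating"] == "R")

# A row is kept if it satisfies either set of criteria
either = high_pg13 | low_r
selected = toy[either].index.tolist()
print(selected)
```

Row `A` satisfies the first mask and row `B` the second, so both are kept; rows `C` and `D` satisfy neither.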

Pass the Boolean array to the index operator to filter the data.

In [31]:
movie[mask].head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
The Avengers,Color,Joss Whedon,703.0,173.0,0.0,19000.0,Robert Downey Jr.,26000.0,623279547.0,Action|Adventure|Sci-Fi,Chris Hemsworth,995415,87697,Scarlett Johansson,3.0,alien invasion|assassin|battle|iron man|soldier,http://www.imdb.com/title/tt0848228/?ref_=fn_t...,1722.0,English,USA,PG-13,220000000.0,2012.0,21000.0,8.1,1.85,123000
Titanic,Color,James Cameron,315.0,194.0,0.0,794.0,Kate Winslet,29000.0,658672302.0,Drama|Romance,Leonardo DiCaprio,793059,45223,Gloria Stuart,0.0,artist|love|ship|titanic|wet,http://www.imdb.com/title/tt0120338/?ref_=fn_t...,2528.0,English,USA,PG-13,200000000.0,1997.0,14000.0,7.7,2.35,26000
Captain America: Civil War,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,407197282.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,272670,64798,Chris Evans,0.0,based on comic book|knife|marvel cinematic uni...,http://www.imdb.com/title/tt3498820/?ref_=fn_t...,1022.0,English,USA,PG-13,250000000.0,2016.0,19000.0,8.2,2.35,72000
Wild Wild West,Color,Barry Sonnenfeld,85.0,106.0,188.0,582.0,Salma Hayek,10000.0,113745408.0,Action|Comedy|Sci-Fi|Western,Will Smith,129601,15870,Bai Ling,2.0,buddy movie|general|inventor|steampunk|utah,http://www.imdb.com/title/tt0120891/?ref_=fn_t...,648.0,English,USA,PG-13,170000000.0,1999.0,4000.0,4.8,1.85,0


We can also filter by passing the Boolean array to the `.loc` attribute.

In [32]:
movie.loc[mask]

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
The Avengers,Color,Joss Whedon,703.0,173.0,0.0,19000.0,Robert Downey Jr.,26000.0,623279547.0,Action|Adventure|Sci-Fi,Chris Hemsworth,995415,87697,Scarlett Johansson,3.0,alien invasion|assassin|battle|iron man|soldier,http://www.imdb.com/title/tt0848228/?ref_=fn_t...,1722.0,English,USA,PG-13,220000000.0,2012.0,21000.0,8.1,1.85,123000
Titanic,Color,James Cameron,315.0,194.0,0.0,794.0,Kate Winslet,29000.0,658672302.0,Drama|Romance,Leonardo DiCaprio,793059,45223,Gloria Stuart,0.0,artist|love|ship|titanic|wet,http://www.imdb.com/title/tt0120338/?ref_=fn_t...,2528.0,English,USA,PG-13,200000000.0,1997.0,14000.0,7.7,2.35,26000
Captain America: Civil War,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,407197282.0,Action|Adventure|Sci-Fi,Robert Downey Jr.,272670,64798,Chris Evans,0.0,based on comic book|knife|marvel cinematic uni...,http://www.imdb.com/title/tt3498820/?ref_=fn_t...,1022.0,English,USA,PG-13,250000000.0,2016.0,19000.0,8.2,2.35,72000
Wild Wild West,Color,Barry Sonnenfeld,85.0,106.0,188.0,582.0,Salma Hayek,10000.0,113745408.0,Action|Comedy|Sci-Fi|Western,Will Smith,129601,15870,Bai Ling,2.0,buddy movie|general|inventor|steampunk|utah,http://www.imdb.com/title/tt0120891/?ref_=fn_t...,648.0,English,USA,PG-13,170000000.0,1999.0,4000.0,4.8,1.85,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Slacker,Black and White,Richard Linklater,61.0,100.0,0.0,0.0,Richard Linklater,5.0,1227508.0,Comedy|Drama,Tommy Pallotta,15103,5,Jean Caffeine,0.0,austin texas|moon|pap smear|texas|twenty somet...,http://www.imdb.com/title/tt0102943/?ref_=fn_t...,80.0,English,USA,R,23000.0,1991.0,0.0,7.1,1.37,2000
Pink Flamingos,Color,John Waters,73.0,108.0,0.0,105.0,Mink Stole,462.0,180483.0,Comedy|Crime|Horror,Divine,16792,760,Edith Massey,2.0,absurd humor|egg|gross out humor|lesbian|sex,http://www.imdb.com/title/tt0069089/?ref_=fn_t...,183.0,English,USA,NC-17,10000.0,1972.0,143.0,6.1,1.37,0
The Cure,Color,Kiyoshi Kurosawa,78.0,111.0,62.0,6.0,Anna Nakagawa,89.0,94596.0,Crime|Horror|Mystery|Thriller,Kôji Yakusho,6318,115,Denden,0.0,breasts|interrogation|investigation|murder|wat...,http://www.imdb.com/title/tt0123948/?ref_=fn_t...,50.0,Japanese,Japan,,1000000.0,1997.0,13.0,7.4,1.85,817
Bang,Color,Ash Baron-Cohen,10.0,98.0,3.0,152.0,Stanley B. Herman,789.0,,Crime|Drama,Peter Greene,438,1186,James Noble,1.0,corruption|homeless|homeless man|motorcycle|ur...,http://www.imdb.com/title/tt0109266/?ref_=fn_t...,14.0,English,USA,,,1995.0,194.0,6.4,,20


In [33]:
df[mask]

movie_title,title_year,content_rating,imdb_score
The Dark Knight Rises,2012.0,PG-13,8.5
The Avengers,2012.0,PG-13,8.1
Titanic,1997.0,PG-13,7.7
Captain America: Civil War,2016.0,PG-13,8.2
Wild Wild West,1999.0,PG-13,4.8
...,...,...,...
Slacker,1991.0,R,7.1
Pink Flamingos,1972.0,NC-17,6.1
The Cure,1997.0,,7.4
Bang,1995.0,,6.4
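
As a small sketch on made-up data, indexing with `[]` and with `.loc` returns the same rows for a Boolean Series. The names `toy` and `keep` are hypothetical:

```python
import pandas as pd

# Made-up rows (illustrative only)
toy = pd.DataFrame(
    {"title_year": [2012, 1997, 2005], "imdb_score": [8.5, 7.7, 4.2]},
    index=["A", "B", "C"],
)
keep = toy["imdb_score"] > 7

# Both forms select the rows where the mask is True
via_brackets = toy[keep]
via_loc = toy.loc[keep]
same = via_brackets.equals(via_loc)

# .loc can also take a column selection alongside the row mask
scores_only = toy.loc[keep, "imdb_score"]
print(same, scores_only.tolist())
```

One practical reason to prefer `.loc` is that it accepts a column selection in the same call.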


The `.iloc` attribute does not accept a Boolean Series. If you pass one to it, an exception is raised. However, it does accept a Boolean NumPy array, so you can filter with `.iloc` after calling the `.to_numpy()` method on the Series.

In [35]:
df.iloc[mask.to_numpy()]

movie_title,title_year,content_rating,imdb_score
The Dark Knight Rises,2012.0,PG-13,8.5
The Avengers,2012.0,PG-13,8.1
Titanic,1997.0,PG-13,7.7
Captain America: Civil War,2016.0,PG-13,8.2
Wild Wild West,1999.0,PG-13,4.8
...,...,...,...
Slacker,1991.0,R,7.1
Pink Flamingos,1972.0,NC-17,6.1
The Cure,1997.0,,7.4
Bang,1995.0,,6.4
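
The difference between passing a Boolean Series and a Boolean NumPy array to `.iloc` can be sketched on made-up data. The exact exception type may vary with the index type and pandas version, so the sketch below catches both `ValueError` and `NotImplementedError` as an assumption:

```python
import pandas as pd

# Made-up rows (illustrative only)
toy = pd.DataFrame({"imdb_score": [8.5, 7.7, 4.2]}, index=["A", "B", "C"])
keep = toy["imdb_score"] > 7

# Passing the Boolean Series itself to .iloc is rejected
try:
    toy.iloc[keep]
    series_accepted = True
except (ValueError, NotImplementedError):
    series_accepted = False

# Converting the mask to a NumPy array drops the index, and .iloc accepts it
filtered = toy.iloc[keep.to_numpy()]
print(series_accepted, filtered.index.tolist())
```

Because the NumPy array carries no index labels, `.iloc` applies it purely by position, so the mask must be in the same row order as the DataFrame.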
