---
# Filtering Rows
---

## Calculating Boolean statistics
We create a Boolean array by applying a condition to a column of data and then
calculate summary statistics from it.

Read in the movie dataset, set the index to the movie title, and inspect a random
sample of rows from the duration column:

In [1]:
import numpy as np
import pandas as pd

In [2]:
movie = pd.read_csv('movie.csv', index_col='movie_title')
movie[['duration']].sample(n=8, random_state=42)

movie_title,duration
The Book Thief,131.0
The Beyond,82.0
Clear and Present Danger,141.0
The Ballad of Cable Hogue,121.0
Bobby Jones: Stroke of Genius,128.0
The Jungle Book,106.0
Malibu's Most Wanted,86.0
The Brain That Sings,62.0


Determine whether the duration of each movie is longer than two hours by using the
greater than comparison operator with the duration column:

In [7]:
movie_2_hours = movie[['duration']].gt(120)
movie_2_hours

movie_title,duration
Avatar,True
Pirates of the Caribbean: At World's End,True
Spectre,True
The Dark Knight Rises,True
Star Wars: Episode VII - The Force Awakens,False
...,...
Signed Sealed Delivered,False
The Following,False
A Plague So Pleasant,False
Shanghai Calling,False


We can now use this Boolean DataFrame to determine the number of movies that are
longer than two hours:

In [8]:
movie_2_hours.sum()

duration    1039
dtype: int64

To find the percentage of movies in the dataset longer than two hours, use the `.mean` method:

In [9]:
movie_2_hours.mean().mul(100)

duration    21.135069
dtype: float64
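
Both results rely on pandas treating `True` as 1 and `False` as 0 in arithmetic. The following is a minimal sketch of that behaviour on a small made-up Series; the values and the variable names `durations`, `long_movies`, `n_long`, and `pct_long` are illustrative assumptions and are not taken from `movie.csv`.

```python
import pandas as pd

# Hypothetical durations in minutes (not from movie.csv)
durations = pd.Series([95, 130, 121, 88, 150], name="duration")

# Boolean Series: True where the duration exceeds 120 minutes
long_movies = durations.gt(120)

# True counts as 1, so the sum is the number of True values
n_long = long_movies.sum()

# The mean is the fraction of True values; multiply by 100 for a percentage
pct_long = long_movies.mean() * 100

print(n_long, pct_long)
```

Three of the five made-up values exceed 120, so the sum counts three rows and the mean reports the corresponding share.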

Unfortunately, the percentage above is misleading. The duration column has a few missing values, and the Boolean condition returns False for each of them, so those movies are counted as not longer than two hours. We need to drop the missing values first, then evaluate the condition and take the mean:

In [13]:
movie['duration'].isna().sum()

15

In [15]:
movie[['duration']].dropna().gt(120).mean().mul(100)

duration    21.199755
dtype: float64
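
The shift comes entirely from the denominator: the 15 missing rows no longer count as `False`. Below is a small sketch of the same effect on toy data, assuming a hypothetical four-row Series with one missing value (none of these numbers come from the movie dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical durations with one missing value
durations = pd.Series([95, 130, np.nan, 150], name="duration")

# NaN > 120 evaluates to False, so the missing row is counted as "not long"
naive_pct = durations.gt(120).mean() * 100

# Dropping missing values first shrinks the denominator from 4 rows to 3
clean_pct = durations.dropna().gt(120).mean() * 100

print(naive_pct, clean_pct)
```

The naive figure divides two long movies by four rows, while the cleaned figure divides the same two movies by the three rows that actually have a duration.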

Use the `.describe` method to output summary statistics on the Boolean array:

In [16]:
movie_2_hours.describe()

,duration
count,4916
unique,2
top,False
freq,3877


In [17]:
movie_2_hours.value_counts(normalize=True)

duration
False       0.788649
True        0.211351
dtype: float64
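
For Boolean data, `.describe` reports the count, the number of unique values, the most common value (`top`), and how often it occurs (`freq`), while `.value_counts(normalize=True)` returns the relative frequency of each value. A brief sketch on a made-up Boolean Series follows; the data and names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical Boolean flags, e.g. "is this movie longer than two hours?"
long_movies = pd.Series([False, True, True, False, True], name="duration")

# Boolean data is summarized like categorical data: count, unique, top, freq
summary = long_movies.describe()

# Relative frequency of each value; the shares sum to 1
shares = long_movies.value_counts(normalize=True)

print(summary)
print(shares)
```

Dividing `freq` by `count` gives the share of the most common value, which is the same number that appears first in the normalized value counts.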

It is possible to compare two columns from the same DataFrame to produce a Boolean Series.
For instance, we could determine the percentage of movies in which actor 1 has more Facebook likes than actor 2. To do this, we select both columns and drop any rows that have a missing value in either column. Then we make the comparison and calculate the mean:

In [18]:
fb_likes = ["actor_1_facebook_likes", "actor_2_facebook_likes"]
actors = movie[fb_likes].dropna()
actors.head()

movie_title,actor_1_facebook_likes,actor_2_facebook_likes
Avatar,1000.0,936.0
Pirates of the Caribbean: At World's End,40000.0,5000.0
Spectre,11000.0,393.0
The Dark Knight Rises,27000.0,23000.0
Star Wars: Episode VII - The Force Awakens,131.0,12.0


In [19]:
(
    actors['actor_1_facebook_likes'] > actors['actor_2_facebook_likes']
).mean()

0.9777687130328371
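
The same pattern can be tried on a toy DataFrame. This sketch assumes four hypothetical rows, one of which has a missing value; the column names match the dataset, but the numbers are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts; the second row is missing a value for actor 2
likes = pd.DataFrame(
    {
        "actor_1_facebook_likes": [1000.0, 500.0, 300.0, 50.0],
        "actor_2_facebook_likes": [900.0, np.nan, 400.0, 20.0],
    }
)

# Keep only rows where both columns are present
complete = likes.dropna()

# Element-wise comparison of two columns yields a Boolean Series
actor_1_higher = complete["actor_1_facebook_likes"] > complete["actor_2_facebook_likes"]

# Fraction of complete rows where actor 1 has more likes
share = actor_1_higher.mean()
print(share)
```

Dropping incomplete rows before comparing keeps a missing value from being silently counted as a case where actor 1 does not have more likes.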

In [20]:
actors['actor_1_facebook_likes'].mean(), actors['actor_2_facebook_likes'].mean()

(6502.41444013869, 1621.9235162145626)

In [21]:
actors['actor_1_facebook_likes'].mean() / actors['actor_2_facebook_likes'].mean()


4.0090758751157365