# Activity: Selective Subsets

## Introduction

In this activity you will practice selecting subsets of data from a DataFrame using Pandas.
This activity will cover the following topics:
- Creating masks
- Negating masks
- Masks with slicing
- Null value masks


In [1]:
import pandas as pd

#### Question 1

Create a `DataFrame` called `df` from the given CSV file `movie_data.csv`, and then create a mask called `before_millennium` to select all movies that were released before 2000.
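
As a reminder of what a mask is, the sketch below builds one on a small made-up table. The `toy` DataFrame and the `before_2000` and `older` names are hypothetical and are not taken from `movie_data.csv`; only the column name `Year Released` mirrors the activity.

```python
import pandas as pd

# Hypothetical toy data, not read from movie_data.csv
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year Released": [1995, 2003, 1988],
})

# Comparing a column with a scalar yields a Boolean Series: this is the mask
before_2000 = toy["Year Released"] < 2000

# Indexing the DataFrame with the mask keeps only the rows marked True
older = toy[before_2000]

print(before_2000.tolist())     # [True, False, True]
print(older["Title"].tolist())  # ['A', 'C']
```

Note that the mask itself is the Boolean Series, while `toy[before_2000]` is the filtered DataFrame. Keeping the two in separate variables makes the mask reusable later.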


In [2]:
# Your code here
df = pd.read_csv("movie_data.csv")

# Boolean mask: True for every movie released before 2000
before_millennium = df["Year Released"] < 2000

df

Unnamed: 0,Title,Year Released,Rating,Box Office ($M)
0,The Shawshank Redemption,1994,9.3,58.3
1,The Godfather,1972,9.2,246.1
2,The Dark Knight,2008,9.0,1005.0
3,Pulp Fiction,1994,8.9,213.9
4,Schindler's List,1993,8.9,321.3
...,...,...,...,...
70,Andhadhun,2018,8.3,48.0
71,Gully Boy,2019,8.1,62.0
72,Dil Chahta Hai,2001,8.1,13.0
73,Dil To Pagal Hai,1997,7.1,11.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            75 non-null     object 
 1   Year Released    75 non-null     int64  
 2   Rating           74 non-null     float64
 3   Box Office ($M)  75 non-null     float64
dtypes: float64(2), int64(1), object(1)
memory usage: 2.5+ KB


In [4]:
df.isnull().sum()

Title              0
Year Released      0
Rating             1
Box Office ($M)    0
dtype: int64

In [5]:
# Question 1 Grading Checks

assert isinstance(df, pd.DataFrame), 'Did you create a DataFrame called df?'


#### Question 2

Using the `before_millennium` mask from Question 1, assign the titles of every movie that was released in 2000 or later to a `Series` called `newer_titles`.
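
One way to reuse an existing mask for the opposite selection is the `~` operator, which flips every `True` to `False` and vice versa. The following is a minimal sketch on hypothetical data; the `toy`, `is_old`, `not_old`, and `newer` names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, not read from movie_data.csv
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year Released": [1995, 2003, 1988],
})

is_old = toy["Year Released"] < 2000

# ~ negates a Boolean Series element by element
not_old = ~is_old

# .loc takes the mask for rows and a label for the column,
# so the result is a Series of titles
newer = toy.loc[not_old, "Title"]

print(not_old.tolist())  # [False, True, False]
print(newer.tolist())    # ['B']
```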


In [6]:
# Your code here

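# Mask from Question 1: True for every movie released before 2000
before_millennium = df["Year Released"] < 2000

# Negate the mask with ~ to keep the movies released in 2000 or later
newer_movies = df[~before_millennium]
newer_movies.head()

Unnamed: 0,Title,Year Released,Rating,Box Office ($M)
2,The Dark Knight,2008,9.0,1005.0
7,Inception,2010,8.8,829.9
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8,871.5
11,Gladiator,2000,8.5,457.6
14,The Lord of the Rings: The Two Towers,2002,8.7,947.1


In [7]:
# Selecting a single column already returns a Series
newer_titles = newer_movies["Title"]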

newer_titles

2                                       The Dark Knight
7                                             Inception
9     The Lord of the Rings: The Fellowship of the Ring
11                                            Gladiator
14                The Lord of the Rings: The Two Towers
15                                         The Departed
19        The Lord of the Rings: The Return of the King
20                                          The Pianist
22                                               Avatar
23                                 Inglourious Basterds
24                                         The Revenant
25                                     Django Unchained
26                                   The Social Network
27                                         Interstellar
28                             The Grand Budapest Hotel
29                                              Dunkirk
30                                     A Beautiful Mind
33                                   The Shape o

In [8]:
# Question 2 Grading Checks

assert isinstance(newer_titles, pd.Series), 'Did you create a Series called newer_titles?'


#### Question 3

Create a mask to select movies with a `Rating` of `8.9` or higher and a `Box Office ($M)` value higher than `1000.0`. Assign the resulting `Series` to a variable called `popular_movies`.
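
Two conditions can be combined into one mask with `&`. Because `&` binds more tightly than comparison operators in Python, each comparison needs its own parentheses. A small sketch, assuming a hypothetical `toy` table with made-up `Rating` and `Gross` columns:

```python
import pandas as pd

# Hypothetical toy data, not read from movie_data.csv
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Rating": [9.1, 8.0, 9.0],
    "Gross": [1200.0, 1500.0, 300.0],
})

# Each comparison is wrapped in parentheses before being combined with &
both = (toy["Rating"] >= 8.9) & (toy["Gross"] > 1000.0)

# Rows must satisfy both conditions to be kept
titles = toy.loc[both, "Title"]

print(both.tolist())    # [True, False, False]
print(titles.tolist())  # ['A']
```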


In [9]:
# Your code here

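# Combine both conditions with &; each comparison needs its own parentheses
mask = (df["Rating"] >= 8.9) & (df["Box Office ($M)"] > 1000.0)
df[mask]

Unnamed: 0,Title,Year Released,Rating,Box Office ($M)
2,The Dark Knight,2008,9.0,1005.0
19,The Lord of the Rings: The Return of the King,2003,8.9,1142.0


In [10]:
# Apply the mask to the rows and keep only the Title column
popular_movies = df.loc[mask, "Title"]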
popular_movies

2                                   The Dark Knight
19    The Lord of the Rings: The Return of the King
Name: Title, dtype: object

In [11]:
# Question 3 Grading Checks

assert isinstance(popular_movies, pd.Series), 'Did you create a Series called popular_movies?'


#### Question 4

Create a mask to select movies with a null value for `Box Office ($M)` or `Rating`. Assign the resulting `Series` to a variable called `missing_info`.
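
Null values cannot be found with `==`, since `NaN` is not equal to itself; `isnull()` returns a Boolean mask instead. The sketch below, on hypothetical data, combines two null masks with `|` and shows an equivalent form using `any(axis=1)`. All names here (`toy`, `Gross`, `either_missing`, `same_mask`) are illustrative.

```python
import pandas as pd

# Hypothetical toy data with one missing value in each numeric column
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Rating": [8.0, float("nan"), 7.5],
    "Gross": [100.0, 50.0, float("nan")],
})

# True where either column is missing
either_missing = toy["Rating"].isnull() | toy["Gross"].isnull()

# Equivalent: check both columns at once, then ask whether any is null per row
same_mask = toy[["Rating", "Gross"]].isnull().any(axis=1)

missing_titles = toy.loc[either_missing, "Title"]

print(either_missing.tolist())  # [False, True, True]
print(missing_titles.tolist())  # ['B', 'C']
```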


In [12]:
df.Rating.value_counts(dropna=False)

8.1    11
8.0     8
8.6     6
8.5     5
8.8     4
7.7     3
8.3     3
8.9     3
7.8     3
8.4     3
8.2     3
7.3     2
7.1     2
8.7     2
7.9     2
6.0     2
7.0     2
7.6     2
9.0     2
7.4     1
6.3     1
7.2     1
9.2     1
9.3     1
NaN     1
6.7     1
Name: Rating, dtype: int64

In [13]:
# Your code here

# True where either column holds a null value
mask2 = df["Box Office ($M)"].isnull() | df["Rating"].isnull()

df[mask2]

Unnamed: 0,Title,Year Released,Rating,Box Office ($M)
34,12 Years a Slave,2013,,187.7


In [14]:
# Apply the mask to the rows and keep only the Title column
missing_info = df.loc[mask2, "Title"]
missing_info

34    12 Years a Slave
Name: Title, dtype: object

In [15]:
# Question 4 Grading Checks

assert isinstance(missing_info, pd.Series), 'Did you create a Series called missing_info?'


#### Question 5

Create a mask to select movies with a `Year Released` on or after 1990 and before 2000. Assign the resulting `Series` to a variable called `nineties_movies`.
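
A range condition like this one can be written as two comparisons joined with `&`. For integer years, `Series.between`, which includes both endpoints by default, is a possible shorthand. The sketch below assumes a hypothetical `toy` table with whole-number years.

```python
import pandas as pd

# Hypothetical toy data, not read from movie_data.csv
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year Released": [1989, 1990, 1999, 2000],
})

years = toy["Year Released"]

# Explicit form: lower bound inclusive, upper bound exclusive
nineties = (years >= 1990) & (years < 2000)

# Shorthand for integer years: between() includes both endpoints by default
nineties_alt = years.between(1990, 1999)

print(nineties.tolist())                      # [False, True, True, False]
print(toy.loc[nineties, "Title"].tolist())    # ['B', 'C']
```

The two forms agree only because the years are integers; with fractional values, the explicit `< 2000` comparison would be the safer choice.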


In [16]:
# Your code here

# Mask for 1990 <= Year Released < 2000
nineties_mask = (df["Year Released"] >= 1990) & (df["Year Released"] < 2000)

# Apply the mask to the rows and keep only the Title column
nineties_movies = df.loc[nineties_mask, "Title"]

nineties_movies

0        The Shawshank Redemption
3                    Pulp Fiction
4                Schindler's List
5                      Fight Club
6                    Forrest Gump
8                      The Matrix
10       The Silence of the Lambs
12                 The Green Mile
16             The Usual Suspects
17            Saving Private Ryan
18                          Se7en
21                        Titanic
31        The Godfather: Part III
32               The Big Lebowski
41                  Jurassic Park
45                  The Lion King
60    Dilwale Dulhania Le Jayenge
61             Kuch Kuch Hota Hai
64                       Baazigar
73               Dil To Pagal Hai
Name: Title, dtype: object

In [17]:
# Question 5 Grading Checks

assert isinstance(nineties_movies, pd.Series), 'Did you create a Series called nineties_movies?'


#### Question 6

Create a mask to select movies with a `Year Released` before 1980 or after 2010. Assign the resulting `Series` to a variable called `other_movies`.
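
An "outside the range" condition joins two comparisons with `|`. It can also be read as the negation of an "inside the range" mask, which ties back to the `~` operator used earlier. A minimal sketch on hypothetical data, with illustrative names throughout:

```python
import pandas as pd

# Hypothetical toy data, not read from movie_data.csv
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Year Released": [1975, 1980, 2005, 2010, 2015],
})

years = toy["Year Released"]

# Either condition being True is enough to keep the row
outside = (years < 1980) | (years > 2010)

# Equivalent: negate the mask for years inside 1980..2010 (both ends included)
outside_alt = ~years.between(1980, 2010)

print(outside.tolist())                    # [True, False, False, False, True]
print(toy.loc[outside, "Title"].tolist())  # ['A', 'E']
```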


In [18]:
# Your code here

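# Mask for movies released before 1980 or after 2010
other_mask = (df["Year Released"] < 1980) | (df["Year Released"] > 2010)

# Apply the mask to the rows and keep only the Title column
other_movies = df.loc[other_mask, "Title"]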

other_movies

1                   The Godfather
13         The Godfather: Part II
24                   The Revenant
25               Django Unchained
27                   Interstellar
28       The Grand Budapest Hotel
29                        Dunkirk
33             The Shape of Water
34               12 Years a Slave
35                        Gravity
36                     La La Land
37              Blade Runner 2049
38                   The Irishman
39                       Parasite
40               The Great Gatsby
47                   The Avengers
48                    The Martian
49                  Black Panther
51                             PK
52                         Dangal
53       Baahubali: The Beginning
54    Baahubali 2: The Conclusion
55              Bajrangi Bhaijaan
56                         Sultan
59                Chennai Express
63                      Padmaavat
68                         Sholay
69                Mera Naam Joker
70                      Andhadhun
71            

In [19]:
# Question 6 Grading Checks

assert isinstance(other_movies, pd.Series), 'Did you create a Series called other_movies?'
