# 3. Selecting Subsets of Data - Series

### Overview
This notebook will teach you how to select a subset of data from a Series.

# Using Dot Notation to Select a Column as a Series
The last notebook showed how to use *just the brackets* to select a single column as a Series. Another common way to do this is dot notation: write the name of your DataFrame, then a dot, then the column name.

Let's read in the movie dataset, set the title as the index, and then select the year with dot notation.

In [1]:
import pandas as pd
movie = pd.read_csv('data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


In [2]:
movie.year.head()

title
Avatar                                        2009.0
Pirates of the Caribbean: At World's End      2007.0
Spectre                                       2015.0
The Dark Knight Rises                         2012.0
Star Wars: Episode VII - The Force Awakens       NaN
Name: year, dtype: float64

### I don't recommend doing this
Although this is valid Pandas syntax, I don't recommend using this notation for the following reasons:
* You cannot select columns whose names contain spaces
* You cannot select columns that share a name with a DataFrame attribute or method, such as **`count`**
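
Both limitations are easy to see on a tiny example. This is only a sketch: the DataFrame `df` and its column names `count` and `box office` are made up for illustration and are not part of the movie dataset.

```python
import pandas as pd

# Hypothetical toy DataFrame (not the movie dataset)
df = pd.DataFrame({'count': [10, 20], 'box office': [1.5, 2.5]})

# Dot notation finds the DataFrame.count method, not the 'count' column
print(callable(df.count))

# A name with a space cannot be written after a dot at all,
# but the brackets handle both columns without trouble
print(df['count'].tolist())
print(df['box office'].tolist())
```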

Using *just the brackets* **always** works, so I recommend doing the following instead:

In [3]:
movie['year'].head()

title
Avatar                                        2009.0
Pirates of the Caribbean: At World's End      2007.0
Spectre                                       2015.0
The Dark Knight Rises                         2012.0
Star Wars: Episode VII - The Force Awakens       NaN
Name: year, dtype: float64

### Why even know about this?
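Tab completion: pressing Tab after the trailing dot lists the available Series attributes and methods. Try it with both notations below. Running either cell as written raises a `SyntaxError` because the expression is incomplete.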

In [4]:
# place your cursor after the dot and press tab
movie.year.

SyntaxError: invalid syntax (<ipython-input-4-f40c9669a7d1>, line 2)

In [5]:
# place your cursor after the dot and press tab
movie['year'].

SyntaxError: invalid syntax (<ipython-input-5-251dafd8f166>, line 2)

# Selecting Subsets of Data From a Series
Selecting subsets of data from a Series is very similar to selecting from a DataFrame. Since there are no columns in a Series, there isn't a need to use *just the brackets*. Instead, you can do all of your subset selection with **`.loc`** and **`.iloc`**.

Let's select the column for IMDB scores as a Series and output the head.

In [6]:
imdb = movie['imdb_score']
imdb.head()

title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

### Selection with a scalar, a list, and a slice
Just like with a DataFrame, both **`.loc`** and **`.iloc`** accept either a single scalar, a list, or a slice.

Let's select the movie IMDB score for 'Forrest Gump':

In [7]:
imdb.loc['Forrest Gump']

8.8000000000000007

Select both 'Forrest Gump' and 'Avatar'. Notice that a Series is returned.

In [8]:
imdb.loc[['Forrest Gump', 'Avatar']]

title
Forrest Gump    8.8
Avatar          7.9
Name: imdb_score, dtype: float64

Select every 100th movie from 'Avatar' to 'Forrest Gump':

In [9]:
imdb.loc['Avatar':'Forrest Gump':100]

title
Avatar                                   7.9
The Fast and the Furious                 6.7
Harry Potter and the Sorcerer's Stone    7.5
Epic                                     6.7
102 Dalmatians                           4.8
Pompeii                                  5.6
Wall Street: Money Never Sleeps          6.3
Hop                                      5.5
Beyond Borders                           6.5
Name: imdb_score, dtype: float64
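
One detail about slices is worth a quick check: a label slice with `.loc` includes the stop label, while an integer slice with `.iloc` excludes the stop position, just like a Python list. The sketch below uses a made-up Series `s` with short labels rather than the movie data.

```python
import pandas as pd

# Hypothetical toy Series with string labels
s = pd.Series([7.9, 7.1, 6.8, 8.5], index=['a', 'b', 'c', 'd'], name='score')

# .loc slices by label and keeps the stop label 'c'
by_label = s.loc['a':'c']

# .iloc slices by position and drops the stop position 2
by_pos = s.iloc[0:2]

print(by_label.index.tolist())
print(by_pos.index.tolist())
```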

### Repeat with `.iloc`
Select a single score

In [10]:
imdb.iloc[10]

6.9000000000000004

Select multiple scores with a list

In [11]:
imdb.iloc[[10, 20, 30]]

title
Batman v Superman: Dawn of Justice           6.9
The Hobbit: The Battle of the Five Armies    7.5
Skyfall                                      7.8
Name: imdb_score, dtype: float64

Select multiple scores with a slice

In [None]:
imdb.iloc[3000:3050:10]
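
Being explicit about `.loc` versus `.iloc` matters most when the index itself holds integers, because a label and a position can then look identical. Here is a small sketch with a made-up Series `scores` whose integer labels run in reverse order.

```python
import pandas as pd

# Hypothetical toy Series: the labels are the integers 2, 1, 0
scores = pd.Series([10, 20, 30], index=[2, 1, 0])

# .loc looks up the LABEL 0, which is the last row
by_label = scores.loc[0]

# .iloc looks up the POSITION 0, which is the first row
by_position = scores.iloc[0]

print(by_label, by_position)
```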

# Pandas has the power of lists and dictionaries
DataFrames and Series can make selections by integer position, like a list, and by label, like a dictionary.

In [12]:
a_list = [4, 4, 9]
a_list[1]

4

In [13]:
d = {'a': 1, 'z': 26}
d['a']

1
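
To tie the two cells above back to pandas, here is a minimal sketch that builds a Series from the same dictionary and then selects from it both ways. The variable name `s` is just for illustration.

```python
import pandas as pd

# Build a Series from the dictionary used above; the keys become the index
s = pd.Series({'a': 1, 'z': 26})

# Select by label, like a dictionary
from_label = s.loc['a']

# Select by integer position, like a list
from_position = s.iloc[1]

print(from_label, from_position)
```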

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select `actor1` as a Series. Who is the `actor1` for 'My Big Fat Greek Wedding'?</span>

In [20]:
movie = pd.read_csv('data/movie.csv', index_col='title')
actor = movie['actor1']
actor.loc['My Big Fat Greek Wedding']


'Nia Vardalos'

### Problem 2
<span  style="color:green; font-size:16px">Find `actor1` for your two favorite movies.</span>

In [25]:
actor.loc[['Spectre', 'Avatar']]

title
Spectre    Christoph Waltz
Avatar         CCH Pounder
Name: actor1, dtype: object

### Problem 3
<span  style="color:green; font-size:16px">Select the last 10 values from `actor1` in two different ways.</span>

In [27]:
actor.tail(10)

title
Primer                       Shane Carruth
Cavite                         Ian Gamazon
El Mariachi                Carlos Gallardo
The Mongol King             Richard Jewell
Newlyweds                      Kerry Bishé
Signed Sealed Delivered        Eric Mabius
The Following                  Natalie Zea
A Plague So Pleasant           Eva Boehnke
Shanghai Calling                 Alan Ruck
My Date with Drew              John August
Name: actor1, dtype: object
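
The cell above shows one way, `tail`. A second way is a negative slice with `.iloc`. The sketch below checks that the two agree on a made-up Series `s`; with the real data you would use `actor` in its place.

```python
import pandas as pd

# Hypothetical toy Series standing in for actor1
s = pd.Series(range(25))

# Way 1: the tail method
last_with_tail = s.tail(10)

# Way 2: a negative slice with .iloc
last_with_iloc = s.iloc[-10:]

# Both selections hold the same 10 values with the same index
print(last_with_tail.equals(last_with_iloc))
```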