# 7. Selecting Subsets of Data - Series


### Objectives

+ Learn how to select subsets from a Series using `.loc` and `.iloc`


### Overview
This notebook will teach you how to select a subset of data from a Series.

# Using Dot Notation to Select a Column as a Series
Previously we learned how to use *just the brackets* to select a single column as a Series. Another common way to do this uses dot notation. Place the column name following a dot after the name of your DataFrame.

Let's read in the movie dataset, set the index as the title and then select the year with dot notation.

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


In [None]:
movie.year.head()

### I don't recommend doing this
Although this is valid Pandas syntax, I don't recommend using this notation for the following reasons:
* You cannot select columns with spaces in them
* You cannot select columns that have the same name as a Pandas method such as **`count`**
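
Both limitations can be seen on a small made-up DataFrame. The column names `imdb score` and `count` below are hypothetical and are not columns of the movie dataset, so treat this as a sketch of the behavior rather than part of the lesson data.

```python
import pandas as pd

# Hypothetical data: 'imdb score' contains a space, 'count' shadows a DataFrame method
df = pd.DataFrame({'year': [2009, 2007],
                   'imdb score': [7.9, 7.1],
                   'count': [3, 5]})

year = df.year               # works: 'year' is a valid Python identifier
# df.imdb score              # would be a SyntaxError because of the space
score = df['imdb score']     # the brackets handle the space

method = df.count            # the DataFrame.count method, not the column
count_col = df['count']      # the brackets return the column as a Series

print(callable(method))           # True
print(type(count_col).__name__)   # Series
```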

Using *just the brackets* **always** works so I recommend doing the following instead:

In [None]:
movie['year'].head()

### Why even know about this?
Pandas code is written differently by different people, and you will definitely see this syntax around, so it's important to be aware of it.

It also has the advantage of providing tab-completion help when chaining a method to the end. Place your cursor at the end of the following two lines and press tab. Only the one that selects via dot notation will show the available methods. This helps me remember what methods are possible, so sometimes I will use this to find the method I need and then change the syntax back to the brackets.

In [None]:
# place your cursor after the dot and press tab
movie.year.

In [None]:
# place your cursor after the dot and press tab
movie['year'].
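
If tab completion is not available in your environment, one possible alternative is to list the public attribute names with Python's built-in `dir`. This is only a sketch, and the short `year` Series below is a stand-in for `movie['year']`, not the real data.

```python
import pandas as pd

# Stand-in Series; with the movie data this would be movie['year']
year = pd.Series([2009.0, 2007.0, 2015.0], name='year')

# Public attribute and method names, regardless of how the Series was selected
public_names = [name for name in dir(year) if not name.startswith('_')]
print(public_names[:10])
```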

# Selecting Subsets of Data From a Series
Selecting subsets of data from a Series is very similar to selecting them from a DataFrame. Since there are no columns in a Series, there isn't a need to use *just the brackets*. Instead, you can do all of your subset selection with **`.loc`** and **`.iloc`**.

Let's select the column for IMDB scores as a Series and output the head.

In [None]:
imdb = movie['imdb_score']
imdb.head()

### Selection with a scalar, a list, and a slice
Just like with a DataFrame, both **`.loc`** and **`.iloc`** accept either a single scalar, a list, or a slice.

Let's select the movie IMDB score for 'Forrest Gump':

In [None]:
imdb.loc['Forrest Gump']

Select both 'Forrest Gump' and 'Avatar'. Notice that a Series is returned.

In [None]:
locs = ['Forrest Gump', 'Avatar']
imdb.loc[locs]

Select every 100th movie from 'Avatar' to 'Forrest Gump':

In [None]:
imdb.loc['Avatar':'Forrest Gump':100]

### Repeat with `.iloc`
Select a single score

In [None]:
imdb.iloc[10]

Select multiple scores with a list

In [None]:
ilocs = [10, 20, 30]
imdb.iloc[ilocs]

Select multiple scores with a slice

In [None]:
imdb.iloc[3000:3050:10]

### Trouble with *just the brackets*
You can use just the brackets to make the same selections as above. See the following examples:

In [None]:
imdb['Forrest Gump']

In [None]:
imdb['Avatar':'Forrest Gump':100]

In [None]:
ilocs = [10, 20, 30]
imdb[ilocs]

In [None]:
imdb[3000:3050:10]

# Can you spot the problem?
The major issue is that using *just the brackets* is **ambiguous** and not **explicit**. We don't know if we are selecting by label or by integer location. With **`.loc`** and **`.iloc`**, it is clear exactly what our intentions are. I suggest using **`.loc`** and **`.iloc`** for clarity.
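
The ambiguity is easiest to see when the index labels are themselves integers. The Series below is a made-up example (not from the movie data) whose labels run in the opposite order to their positions.

```python
import pandas as pd

# Hypothetical Series: integer labels in the reverse order of their positions
s = pd.Series(['first', 'second', 'third'], index=[2, 1, 0])

by_brackets = s[0]        # just the brackets: is 0 a label or a position?
by_label = s.loc[0]       # explicit: the value labeled 0
by_position = s.iloc[0]   # explicit: the value at integer location 0

print(by_brackets, by_label, by_position)  # third third first
```

Here the brackets happen to select by label, but nothing in the code says so. `.loc` and `.iloc` state the intent directly.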

# Comparison to Python Lists and Dictionaries
It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries.

Python lists allow for selection of data only through **integer location**. You can use a single integer or slice notation to make the selection but NOT a list of integers.

Let’s see examples of subset selection of lists using integers:

In [None]:
a_list = [10, 5, 3, 89, 20, 44, 37]

In [None]:
a_list[4]

In [None]:
a_list[-3:]
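
To confirm that a list of integers is not accepted, the sketch below reuses `a_list` and catches the error. The list comprehension at the end is one common workaround, not the only one.

```python
a_list = [10, 5, 3, 89, 20, 44, 37]
positions = [1, 3]

try:
    a_list[positions]      # a list of integers is not a valid list index
except TypeError as err:
    error_message = str(err)
    print(error_message)

# One common workaround: a list comprehension over the wanted positions
picked = [a_list[i] for i in positions]
print(picked)  # [5, 89]
```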

# Selection by label with Python dictionaries
All values in each dictionary are labeled by a key. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [None]:
d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}
d['a']

In [None]:
d['A']
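
The sketch below shows what happens when a list or a slice of labels is used with the same dictionary. The exact exception raised for the slice depends on the Python version, so both possibilities are caught.

```python
d = {'a': 1, 'b': 2, 't': 20, 'z': 26, 'A': 27}

try:
    d[['a', 't']]          # a list is unhashable, so it cannot be used as a key
except TypeError as err:
    list_error = type(err).__name__

try:
    d['a':'t']             # a slice of labels is not supported either
except (TypeError, KeyError) as err:
    slice_error = type(err).__name__   # which one depends on the Python version

# Workaround: build a new dictionary from the wanted keys
subset = {key: d[key] for key in ['a', 't']}
print(subset)  # {'a': 1, 't': 20}
```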

# Pandas has the power of lists and dictionaries
DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
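
As a small illustration, here is a Series built from the same keys and values as the dictionary `d` above. It supports the list-style and dictionary-style selections, plus lists and slices of labels.

```python
import pandas as pd

# Same keys and values as the dictionary d
s = pd.Series({'a': 1, 'b': 2, 't': 20, 'z': 26, 'A': 27})

one_label = s.loc['a']           # like d['a']
one_position = s.iloc[4]         # like a_list[4]
last_three = s.iloc[-3:]         # like a_list[-3:]
label_list = s.loc[['a', 'z']]   # a list of labels, which a dictionary cannot do
label_slice = s.loc['b':'z']     # a slice of labels, with the stop label included

print(label_slice.tolist())  # [2, 20, 26]
```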

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the bikes dataset. We will be using it for the rest of the questions. Select the wind speed column as a Series and assign it to a variable and output the head. What kind of index does this Series have?</span>

In [1]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(5)
wind = bikes['wind_speed']
wind.head(3) 

0    12.7
1     6.9
2    16.1
Name: wind_speed, dtype: float64

In [2]:
wind.index  # inspect the index; in this case it is a RangeIndex

RangeIndex(start=0, stop=50089, step=1)

### Problem 2
<span  style="color:green; font-size:16px">From the wind speed Series, select the integer locations 4 through, but not including, 10.</span>

In [3]:
wind.iloc[4:10]

4    17.3
5    17.3
6    15.0
7     5.8
8     0.0
9    12.7
Name: wind_speed, dtype: float64

### Problem 3
<span  style="color:green; font-size:16px">Copy and paste your answer to problem 2 below but use `.loc` instead. Do you get the same result? Why not?</span>

In [7]:
wind.loc[4:10]
# Not the same: .loc slices by label and includes the stop label (10),
# while .iloc slices by integer location and excludes the stop position.

4     17.3
5     17.3
6     15.0
7      5.8
8      0.0
9     12.7
10     9.2
Name: wind_speed, dtype: float64
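
The difference is easier to see when labels and positions do not coincide. This is a sketch with a made-up Series labeled 100 through 119, not the bikes data.

```python
import pandas as pd

# Hypothetical Series: position 0 carries label 100, position 1 carries label 101, ...
s = pd.Series(range(20), index=range(100, 120))

by_position = s.iloc[4:10]   # positions 4 to 9, stop excluded: 6 values
by_label = s.loc[104:110]    # labels 104 to 110, stop included: 7 values

print(len(by_position), len(by_label))  # 6 7
```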

### Problem 4
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select `actor1` as a Series. Who is the `actor1` for 'My Big Fat Greek Wedding'?</span>

In [20]:
import pandas as pd
movies = pd.read_csv('../data/movie.csv', index_col='title')
movies.head(3)
actor1 = movies['actor1']
actor1.loc['My Big Fat Greek Wedding']

'Nia Vardalos'

### Problem 5
<span  style="color:green; font-size:16px">Find `actor1` for your favorite two movies.</span>

In [22]:
actor1.loc[['Avatar','Titanic']]

title
Avatar           CCH Pounder
Titanic    Leonardo DiCaprio
Name: actor1, dtype: object

### Problem 6
<span  style="color:green; font-size:16px">Select the last 10 values from `actor1` using two different ways.</span>

In [25]:
actor1.head(10)  # the first 10 values, shown only for comparison with the last 10 below

title
Avatar                                            CCH Pounder
Pirates of the Caribbean: At World's End          Johnny Depp
Spectre                                       Christoph Waltz
The Dark Knight Rises                               Tom Hardy
Star Wars: Episode VII - The Force Awakens        Doug Walker
John Carter                                      Daryl Sabara
Spider-Man 3                                     J.K. Simmons
Tangled                                          Brad Garrett
Avengers: Age of Ultron                       Chris Hemsworth
Harry Potter and the Half-Blood Prince           Alan Rickman
Name: actor1, dtype: object

In [26]:
actor1.iloc[-10:]

title
Primer                       Shane Carruth
Cavite                         Ian Gamazon
El Mariachi                Carlos Gallardo
The Mongol King             Richard Jewell
Newlyweds                      Kerry Bishé
Signed Sealed Delivered        Eric Mabius
The Following                  Natalie Zea
A Plague So Pleasant           Eva Boehnke
Shanghai Calling                 Alan Ruck
My Date with Drew              John August
Name: actor1, dtype: object

In [29]:
actor1.tail(10)

title
Primer                       Shane Carruth
Cavite                         Ian Gamazon
El Mariachi                Carlos Gallardo
The Mongol King             Richard Jewell
Newlyweds                      Kerry Bishé
Signed Sealed Delivered        Eric Mabius
The Following                  Natalie Zea
A Plague So Pleasant           Eva Boehnke
Shanghai Calling                 Alan Ruck
My Date with Drew              John August
Name: actor1, dtype: object
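
One way to check that the two approaches agree is `Series.equals`. The sketch below uses a made-up stand-in for `actor1` so that it runs without the movie file.

```python
import pandas as pd

# Stand-in for actor1: 25 made-up values
s = pd.Series([f'actor {i}' for i in range(25)])

# Compare the slice-based and method-based selections of the last 10 values
same = s.iloc[-10:].equals(s.tail(10))
print(same)  # True
```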

## During Class: Selecting Subsets of Data - Series

* `.iloc` slices exclude the stop position: `[0:5]` includes position 0 and excludes position 5

In [3]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [4]:
height = df['height']
height

Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

In [6]:
df.height
# Not ideal, but you are going to see this being used.
# You can select a column with dot notation followed by the column name.
# It will not work if the column name contains a space or matches the name of a built-in method.

Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

In [7]:
rows = ['Niko', 'Dean']
height.loc[rows]

Niko     70
Dean    180
Name: height, dtype: int64

In [10]:
height.loc['Niko']

70

In [11]:
height.loc['Niko':'Dean']

Niko         70
Aaron       120
Penelope     80
Dean        180
Name: height, dtype: int64

In [12]:
height.iloc[4]

180

In [14]:
height.iloc[[3,1,-2]]

Penelope      80
Niko          70
Christina    172
Name: height, dtype: int64

In [15]:
height.iloc[4:]

Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64