# 2. Selecting Subsets of Series Data

### Objectives
After this lesson you should be able to...
+ Make an index meaningful by using **`set_index`**
+ Access Series items by integer position with **`.iloc`**
+ Access Series items by index label with **`.loc`**
+ Always use **`.iloc`** or **`.loc`** for accessing Series elements
+ Create a simple Series with the constructor **`pd.Series`**
+ Know that indexes automatically align when two Series objects are added together (or combined with any other operation)
+ Know why using the brackets **`[]`** to access elements is not good practice
+ Get the dimensions of Series and DataFrames with the **`len`** function and **`size`** and **`shape`** attributes

### Prepare for this lesson by...
[ALWAYS READ THE DOCUMENTATION BEFORE A LESSON!](http://pandas.pydata.org/pandas-docs/stable/)
+ Read [Intro to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html) - **just the Series section**
+ Read [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - **up to but not including Selection By Callable**

### Overview
The primary purpose of this notebook is to show you how to select subsets of a Series by either **integer location** or **index label**. These are two distinct methods that allow for powerful data selection.

## Python list Selection
There is only one way to select data from a list and that is by passing an integer (or a slice) to the indexing operator. This would be considered **integer location**.

## Python dictionary selection
Python dictionaries are made of key-value pairs. You select one value at a time by passing the key to the indexing operator of a dictionary. This would be considered **index label**. Labels are usually strings but they can be integers.

In [1]:
# selection by integer location
my_list = ['integer', 'location', 'selection']
my_list[1]

'location'

In [2]:
# selection by index label
d = {'index':'label'}
d['index']

'label'

## Making a meaningful index
Before we start selecting data, let's read in the movie dataset and replace the simple integer index with the title of the movie. Any column may be used for the index. Most of the time, you will want an index that uniquely identifies each row.

We will use the **`set_index`** DataFrame method to make the column **`movie_title`** the new index. This removes it from the values of the DataFrame.

In [3]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv')
movie.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


### Setting the index
The **`set_index`** method accepts the string of the column to use as its new index. Notice how the index now contains the movie title and is in bold font.

In [4]:
movie = movie.set_index('movie_title')
movie.head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
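
A minimal sketch of the same idea on a tiny, made-up DataFrame (the `toy` data is a hypothetical stand-in for the movie dataset, not part of it). It shows that **`set_index`** returns a new DataFrame and moves the chosen column out of the values and into the index, which is why the lesson assigns the result back to **`movie`**.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    'movie_title': ['Avatar', 'Spectre', 'Tangled'],
    'imdb_score': [7.9, 6.8, 7.8],
})

# set_index returns a new DataFrame; the original is left unchanged
toy_indexed = toy.set_index('movie_title')

print(toy.columns.tolist())          # movie_title is still a column here
print(toy_indexed.columns.tolist())  # movie_title has moved to the index
print(toy_indexed.index.name)
```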


## Select the imdb_score column as a Series
We will select the **`imdb_score`** column again. Notice how the index of movie titles remains after selection.

In [5]:
imdb_score = movie['imdb_score']
imdb_score.head()

movie_title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

## Selection by Integer Location with **`.iloc`**
To select by integer location, you must use **`.iloc`**, which pandas calls an **indexer**. The locations of the elements of a Series begin at 0 and end at n-1, where n is the number of elements. You can pass **`.iloc`** a single integer, a slice, or a list of integers. See the examples below.

### Single Integers
A single integer passed to the **`.iloc`** indexer will return the value as a scalar and not a Series. You can use negative values as well.

In [6]:
imdb_score.iloc[3]

8.5

In [7]:
# get last score
imdb_score.iloc[-1]

6.5999999999999996

### Slice Notation
You may use slice notation **`start:stop:step`** to select elements as well. Notice that a Series will always be returned with the index remaining.

In [8]:
imdb_score.iloc[5:10]

movie_title
John Carter                               6.6
Spider-Man 3                              6.2
Tangled                                   7.8
Avengers: Age of Ultron                   7.5
Harry Potter and the Half-Blood Prince    7.5
Name: imdb_score, dtype: float64

In [9]:
imdb_score.iloc[200:400:50]

movie_title
Harry Potter and the Sorcerer's Stone    7.5
The Patriot                              7.1
Epic                                     6.7
Unstoppable                              6.8
Name: imdb_score, dtype: float64

In [10]:
# last five elements
imdb_score.iloc[-5:]

movie_title
Signed Sealed Delivered    7.7
The Following              7.5
A Plague So Pleasant       6.3
Shanghai Calling           6.3
My Date with Drew          6.6
Name: imdb_score, dtype: float64

### List of integers
A list of integer locations will select specific Series elements.

In [11]:
imdb_score.iloc[[5,17,-4]]

movie_title
John Carter      6.6
The Avengers     8.1
The Following    7.5
Name: imdb_score, dtype: float64

In [12]:
# a one item list returns a Series and not a scalar
imdb_score.iloc[[6]]

movie_title
Spider-Man 3    6.2
Name: imdb_score, dtype: float64
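
The three forms of **`.iloc`** selection can be compared side by side on a small Series that does not need the movie file. This is only a sketch; the `scores` Series and its shortened titles are made up for illustration.

```python
import pandas as pd

# Hypothetical five-element Series with shortened movie titles as labels
scores = pd.Series(
    [7.9, 7.1, 6.8, 8.5, 7.1],
    index=['Avatar', 'Pirates', 'Spectre', 'Dark Knight', 'Star Wars'],
)

single = scores.iloc[3]        # one integer returns a scalar
sliced = scores.iloc[1:3]      # a slice returns a Series; the stop position is excluded
picked = scores.iloc[[0, -1]]  # a list returns a Series, in the order requested
one_item = scores.iloc[[3]]    # a one-item list still returns a Series

print(type(single), type(one_item))
print(sliced)
print(picked)
```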

## Selection by Index Label with `.loc`
The **`.loc`** indexer selects data by accepting the label (or labels) of the index. In this example, the label is the movie title. Just like **`.iloc`**, **`.loc`** can accept a single label, a slice or a list. A KeyError will be raised if you try to access a label not in the index.

### Single Labels
You probably don't know which labels are in your index. You can print them out by running **`print(imdb_score.index)`** so you know the possibilities.

In [13]:
# some possible values for labels
imdb_score.index

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='movie_title', length=4916)

In [14]:
imdb_score.loc['The Avengers']

8.0999999999999996

In [15]:
imdb_score.loc['Tangled']

7.7999999999999998

#### KeyError
The movie **Father of the Bride** is not in this dataset and results in a KeyError, the same type of error you would get for a dictionary.

In [16]:
imdb_score.loc['Father of the Bride']

KeyError: 'the label [Father of the Bride] is not in the [index]'
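
If you are not sure whether a label exists, there are a few ways to avoid an unhandled KeyError. The sketch below uses a small made-up `scores` Series rather than the movie data and shows a membership test on the index, a `try`/`except` block, and the **`get`** method, which returns a default value instead of raising.

```python
import pandas as pd

# Hypothetical three-element Series
scores = pd.Series([7.9, 6.8, 7.8], index=['Avatar', 'Spectre', 'Tangled'])

label = 'Father of the Bride'

# Option 1: test membership in the index before selecting
found = label in scores.index

# Option 2: attempt the selection and catch the KeyError
try:
    value = scores.loc[label]
except KeyError:
    value = None

# Option 3: Series.get returns a default instead of raising
fallback = scores.get(label, default=None)

print(found, value, fallback)
```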

### Slicing with labels
It is possible to slice from one label to another. pandas always **includes** the end label when slicing with labels.

In [17]:
imdb_score.loc['The Dark Knight Rises':'Tangled']

movie_title
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
John Carter                                   6.6
Spider-Man 3                                  6.2
Tangled                                       7.8
Name: imdb_score, dtype: float64

In [18]:
imdb_score.loc[:'Spectre']

movie_title
Avatar                                      7.9
Pirates of the Caribbean: At World's End    7.1
Spectre                                     6.8
Name: imdb_score, dtype: float64

In [19]:
# Tangled appears later than Spectre. Returns an empty Series
imdb_score.loc['Tangled':'Spectre']

Series([], Name: imdb_score, dtype: float64)

In [20]:
# slice with negative step
imdb_score.loc['Tangled':'Spectre':-1]

movie_title
Tangled                                       7.8
Spider-Man 3                                  6.2
John Carter                                   6.6
Star Wars: Episode VII - The Force Awakens    7.1
The Dark Knight Rises                         8.5
Spectre                                       6.8
Name: imdb_score, dtype: float64
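
The difference between the two kinds of slices is easy to miss, so here is a small sketch on made-up data: a label slice with **`.loc`** includes its end label, while a position slice with **`.iloc`** excludes its stop position, just like a Python list.

```python
import pandas as pd

# Hypothetical four-element Series
scores = pd.Series(
    [7.9, 7.1, 6.8, 8.5],
    index=['Avatar', 'Pirates', 'Spectre', 'Dark Knight'],
)

by_label = scores.loc['Avatar':'Spectre']  # end label 'Spectre' is included
by_position = scores.iloc[0:2]             # stop position 2 is excluded

print(len(by_label), len(by_position))
```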

### Lists of labels

In [21]:
imdb_score.loc[['Avatar', 'Spider-Man 3']]

movie_title
Avatar          7.9
Spider-Man 3    6.2
Name: imdb_score, dtype: float64

In [22]:
# a single item list returns a series
imdb_score.loc[['Avatar']]

movie_title
Avatar    7.9
Name: imdb_score, dtype: float64

In [23]:
# older pandas versions return NaN for a list label that is not in the index;
# pandas 1.0 and later raise a KeyError instead
imdb_score.loc[['Avatar', 'Father of the Bride']]

movie_title
Avatar                 7.9
Father of the Bride    NaN
Name: imdb_score, dtype: float64
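
In recent pandas versions, a list containing a missing label raises a KeyError rather than returning a missing value. Two alternatives are sketched below on a made-up two-element Series: **`reindex`** keeps every requested label and fills the missing ones with NaN, while intersecting the requested labels with the index keeps only those that exist.

```python
import pandas as pd

# Hypothetical two-element Series
scores = pd.Series([7.9, 6.2], index=['Avatar', 'Spider-Man 3'], name='imdb_score')
wanted = ['Avatar', 'Father of the Bride']

# reindex keeps every requested label; labels not in the index get NaN
with_missing = scores.reindex(wanted)

# intersection keeps only the requested labels that are actually present
present_only = scores.loc[scores.index.intersection(wanted)]

print(with_missing)
print(present_only)
```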

### Deprecation of the `.ix` indexer
In the early days of pandas, **`.ix`** was a popular way to access elements because it accepted both positional and label arguments. That flexibility made it ambiguous and confusing. **`.ix`** was deprecated in pandas version 0.20 and removed entirely in version 1.0. Unfortunately, many historical (and some newer) questions on Stack Overflow still use **`.ix`**. Do not let this fool you. Use **`.loc`** and **`.iloc`** instead.

# Why all the fuss over this index
Indexes in Series and DataFrames play a huge (and perhaps surprising) role in pandas. Let's begin with one of these surprising examples. We will manually create a couple of Series that look fairly similar and then add them together.

To manually create a Series, you must use the **`pd.Series`** constructor. There are a variety of ways to use it, but the simplest is to use the parameters **`index`** and **`data`**. Pass each of them a list, making sure both lists are the same length. There will be another notebook on creating these manually.

In [24]:
s1 = pd.Series(index=['a','a','b','b','b'], data=[1,2,3,4,5])
s2 = pd.Series(index=['a','a','a','b','b'], data=[1,2,3,4,5])

In [25]:
s1

a    1
a    2
b    3
b    4
b    5
dtype: int64

In [26]:
s2

a    1
a    2
a    3
b    4
b    5
dtype: int64

## Let's see what happens when we add the Series together
Something quite unexpected happens when we add the two Series together.

In [27]:
s1 + s2

a     2
a     3
a     4
a     3
a     4
a     5
b     7
b     8
b     8
b     9
b     9
b    10
dtype: int64

### What happened?
This example is fundamental to understanding many pandas operations. Both Series **automatically aligned** on their index (not by integer position). Each **`a`** label in **`s1`** aligned with each **`a`** label in **`s2`**, and likewise for **`b`**. A cartesian product happens between matching labels: 2 × 3 pairs for **`a`** and 3 × 2 pairs for **`b`**. The resulting Series now has 6 **`a`** labels and 6 **`b`** labels.
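
One way to check the cartesian product is to count the labels. The sketch below rebuilds the same two Series and compares the label counts of the result with the product of the label counts of the inputs. It relies on **`value_counts`**, which has not been introduced in this lesson.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'b', 'b', 'b'], data=[1, 2, 3, 4, 5])
s2 = pd.Series(index=['a', 'a', 'a', 'b', 'b'], data=[1, 2, 3, 4, 5])

# How many times each label occurs in each index
counts1 = s1.index.value_counts()
counts2 = s2.index.value_counts()

# The counts have unique labels, so this multiplication aligns one-to-one
expected = counts1 * counts2

# Label counts in the actual result of the addition
actual = (s1 + s2).index.value_counts()

print(expected.sort_index())
print(actual.sort_index())
```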

### When labels can't align
If an index label appears in one index but not the other, a missing value will result. We add the following two Series together. **`s1`** is missing label **`c`**. The result still contains label **`c`**, but it now has a missing value, and the whole result becomes float because integers cannot hold missing values.

In [28]:
s1 = pd.Series(index=['a','a','b','b','b'], data=[1,2,3,4,5])
s2 = pd.Series(index=['a','a','a','b','b','c'], data=[1,2,3,4,5,6])

s1 + s2

a     2.0
a     3.0
a     4.0
a     3.0
a     4.0
a     5.0
b     7.0
b     8.0
b     8.0
b     9.0
b     9.0
b    10.0
c     NaN
dtype: float64

### Multiple labels not aligning
When adding together two Series that are missing multiple labels, each label will be present in the resulting Series but will be missing its value.

In [29]:
# The only index label in common was 'd'. The rest are missing
s1 = pd.Series(index=['a','b','c','d'], data=[1,2,3,4])
s2 = pd.Series(index=['d','e','f','g'], data=[1,2,3,4])
s1 + s2

a    NaN
b    NaN
c    NaN
d    5.0
e    NaN
f    NaN
g    NaN
dtype: float64
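
If missing values are not what you want, the **`add`** method accepts a **`fill_value`** that stands in for a label missing from one side. This is a sketch of one option, not something the lesson requires; it reuses the two Series from the example above.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'b', 'c', 'd'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['d', 'e', 'f', 'g'], data=[1, 2, 3, 4])

# Labels present in only one Series are treated as 0 on the other side
filled = s1.add(s2, fill_value=0)

print(filled)
```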

### More surprises with indexes
As seen above, pandas does not enforce uniqueness on the index, so every element of your index could even have the same label. Also, two Series do not have to be the same length, since each index label aligns with the same label in the other Series.

In the following example we can think of **`s2`** as being used to add 1 to every **`a`** value, 2 to every **`b`** value and 3 to every **`c`**.

In [30]:
s1 = pd.Series(index=['a','a','a','b','b','c'], data=[10,15,20,25,30,35])
s2 = pd.Series(index=['a','b','c'], data=[1,2,3])

In [31]:
s1

a    10
a    15
a    20
b    25
b    30
c    35
dtype: int64

In [32]:
s2

a    1
b    2
c    3
dtype: int64

In [33]:
s1 + s2

a    11
a    16
a    21
b    27
b    32
c    38
dtype: int64

### Getting a closer look
To see what is happening with index alignment, we can convert each Series to a one-column DataFrame and join the two (this is explained in later notebooks, so don't worry about the details now). Index alignment behaves the same way as a SQL outer join.

In [34]:
s1 = pd.Series(index=['a','a','b','b','b'], data=[1,2,3,4,5])
s2 = pd.Series(index=['a','a','a','b','b','c'], data=[1,2,3,4,5,6])

In [35]:
f1 = s1.to_frame(name='s1')
f2 = s2.to_frame(name='s2')

df = f1.join(f2, how='outer')
df['sum'] = df['s1'] + df['s2']
df

Unnamed: 0,s1,s2,sum
a,1.0,1,2.0
a,1.0,2,3.0
a,1.0,3,4.0
a,2.0,1,3.0
a,2.0,2,4.0
a,2.0,3,5.0
b,3.0,4,7.0
b,3.0,5,8.0
b,4.0,4,8.0
b,4.0,5,9.0
b,5.0,4,9.0
b,5.0,5,10.0
c,NaN,6,NaN


## Caveat to Cartesian Product
A cartesian product does not happen when the indexes of the two Series are identical, that is, when they have the same labels in the same order and are the same length. The first example has Series with identical indexes.

The second example has Series whose indexes have the same labels but in a different order, so the cartesian product happens again.

In [36]:
s1 = pd.Series(index=['a','a','b','b'], data=[1,2,3,4])
s2 = pd.Series(index=['a','a','b','b'], data=[1,2,3,4])

In [37]:
s1 + s2

a    2
a    4
b    6
b    8
dtype: int64

In [38]:
s1 = pd.Series(index=['a','a','b','b'], data=[1,2,3,4])
s2 = pd.Series(index=['b','b','a','a'], data=[1,2,3,4])

s1 + s2

a    4
a    5
a    5
a    6
b    4
b    5
b    5
b    6
dtype: int64
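
You can test whether two indexes are identical in this sense with the **`equals`** method of the index. The sketch below rebuilds the Series from the two examples above under new, made-up variable names (`same` and `reordered`) and relates the check to the length of the sum.

```python
import pandas as pd

s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
same = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
reordered = pd.Series(index=['b', 'b', 'a', 'a'], data=[1, 2, 3, 4])

# equals is True only for the same labels in the same order
identical = s1.index.equals(same.index)
shuffled = s1.index.equals(reordered.index)

print(identical, len(s1 + same))      # identical indexes: element-by-element addition
print(shuffled, len(s1 + reordered))  # different order: cartesian product per label
```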

## Get the length of your Series or DataFrame
There are multiple ways to get the number of elements in your Series. The **`len`** built-in function returns the number of elements.

In [39]:
len(imdb_score)

4916

The **`len`** function also returns the number of rows (not columns) of a DataFrame.

In [40]:
len(movie)

4916

### size attribute
The size attribute also returns the number of elements of your Series. For DataFrames it returns the total number of elements which is the number of rows times the number of columns.

In [41]:
imdb_score.size

4916

In [42]:
movie.size

132732

### shape attribute
The shape attribute returns the number of Series elements as a one-item tuple. For DataFrames it returns the number of rows and columns as a two-item tuple.

In [43]:
imdb_score.shape

(4916,)

In [44]:
movie.shape

(4916, 27)
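
The three measurements are easiest to compare on a DataFrame small enough to count by eye. The `df` below is a made-up example with 3 rows and 2 columns.

```python
import pandas as pd

# Hypothetical DataFrame: 3 rows, 2 columns
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
col = df['x']  # a single column is a Series

print(len(df), df.size, df.shape)     # rows, rows * columns, (rows, columns)
print(len(col), col.size, col.shape)  # elements, elements, (elements,)
```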

## The indexing operator alone - avoid!
It is possible to select data from a Series with the indexing operator alone but it is ambiguous as it allows selection by both integer location and label. Because it is ambiguous it is strongly suggested you always use the **`.iloc`** and **`.loc`** indexers except when doing boolean indexing (in an upcoming notebook).

In [45]:
imdb_score[:5]

movie_title
Avatar                                        7.9
Pirates of the Caribbean: At World's End      7.1
Spectre                                       6.8
The Dark Knight Rises                         8.5
Star Wars: Episode VII - The Force Awakens    7.1
Name: imdb_score, dtype: float64

In [46]:
imdb_score['Avatar']

7.9000000000000004
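
The ambiguity is clearest when the index labels are themselves integers. In the sketch below (a made-up Series whose labels are deliberately out of order), the bare brackets treat the integer as a label, which may not be what a reader expects, whereas **`.loc`** and **`.iloc`** state the intent explicitly.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not match the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_brackets = s[0]       # with an integer index, brackets look up the label 0
by_label = s.loc[0]      # explicitly the label 0
by_position = s.iloc[0]  # explicitly the first element

print(by_brackets, by_label, by_position)
```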

# Your Turn!

### Problem 1
<span  style="color:green; font-size:16px">Create a 3 element pandas Series using the Series constructor with characters as the index and numbers as the values. Output the Series.</span>

In [47]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Another way to create a series is to pass a dictionary to the pandas series constructor. The keys of the dictionary become the Series index and the dictionary values become the Series values. Create a dictionary with at least 3 elements and use it to create a series. Output the Series.</span>

In [48]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Use the **`read_csv`** function to read in the movie dataset and set the index to the title of the movie. Output the first 10 rows.</span>

In [49]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Select the director column into a variable with the same name.</span>

In [50]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Output to the screen the first 10 values in the **`director`** Series. Remember to only use **`.loc`** and **`.iloc`** when accessing Series elements.</span>

In [51]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Output **`director`** elements at location 40, 50 and 99</span>

In [52]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Output the last ten values of the **`director`** Series.</span>

In [53]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Select the directors from the movies **The Fast and the Furious** and **Batman Begins**</span>

In [54]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Think of a movie you have seen and try to select its director</span>

In [55]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">If two Series are added with no indices in common, what will be the outcome? Check your answer by coding this situation.</span>

In [56]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">What if the two Series from problem 10 were subtracted, multiplied or divided instead?</span>

In [57]:
# your code here

### Problem 12
<span  style="color:green; font-size:16px">Create two Series that have 3 elements each and, when added together, yield a Series that has 4 total elements, none of which are missing.</span>

In [58]:
# your code here