# Selecting Subsets of Data from DataFrames with `loc`

## Overview

### Objectives

+ `loc` can select rows, columns, or rows and columns simultaneously
+ `loc` selects primarily by **label**

## Subset selection with `loc`
The `loc` indexer selects data in a different manner than *just the brackets*. We must learn its set of rules.

### Simultaneous row and column subset selection with `loc`
The `loc` indexer can select rows and columns simultaneously. You cannot do this with *just the brackets*. This is done by separating the row and column selections with a **comma**. The selection will look something like this:

```
df.loc[rows, cols]
```

### `loc` primarily selects data by LABEL

Very importantly, `loc` primarily selects data by the **LABEL** of the rows and columns. Provide `loc` with the label of the rows and/or columns you would like to select.

### Select two rows and three columns with `loc`
If we wanted to select the rows `Dean` and `Cornelia` along with the columns `age`, `state`, and `score` we would do this:

In [10]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [15]:
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
df.loc[rows, cols]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


In [16]:
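rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']
# Just the brackets cannot select rows and columns together, so `rows` is unused here.
# `df[:]` keeps every row, then `[cols]` selects the three columns.
df[:][cols]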

Unnamed: 0,age,state,score
Jane,30,NY,4.6
Niko,2,TX,8.3
Aaron,12,FL,9.0
Penelope,4,AL,3.3
Dean,32,AK,1.8
Christina,33,TX,9.5
Cornelia,69,TX,2.2


### The possible types of selections for `loc`
Row or column selections can be any of the following:

* A single label
* A list of labels
* A slice with labels
* A boolean Series or array (covered in a later chapter)

We can use any of these for either row or column selections with `loc`.
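
The four selection types listed above can be sketched in one place. This is a minimal illustration, not part of the original notebook: it rebuilds the sample data inline (an assumption, so that it runs without `../data/sample_data.csv`), and the variable names `single`, `listed`, `sliced`, and `masked` are made up for the example. The boolean selection is only a preview of the later chapter.

```python
import pandas as pd

# Rebuild the sample data shown earlier so the example runs without the CSV file
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# A single label for the rows and a single label for the columns
single = df.loc["Dean", "age"]

# A list of labels for the rows and for the columns
listed = df.loc[["Dean", "Cornelia"], ["age", "score"]]

# A slice with labels for the rows and for the columns
sliced = df.loc["Jane":"Aaron", "state":"food"]

# A boolean Series for the rows, a single label for the columns
mask = df["age"] > 30
masked = df.loc[mask, "state"]

print(single)
print(listed)
print(sliced)
print(masked)
```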

### Select two rows and a single column:
Let's use a list for the rows and a string for the column. The row selection is `['Dean', 'Aaron']` and the column selection is `food`. Note how this returns a Series since we are selecting exactly a single column.

In [17]:
rows = ['Dean', 'Aaron']
cols = 'food'
df.loc[rows, cols]

Dean     Cheese
Aaron     Mango
Name: food, dtype: object

## Use slice notation to select a range of rows
We have seen slice notation when working with Python lists. This same notation is allowed with DataFrames. Let's choose all of the rows from `Jane` to `Penelope` with slice notation along with the columns `state` and `color`.

In [19]:
cols = ['state', 'color']
df.loc['Jane':'Penelope', cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


### Slice notation only works within the brackets attached to the object
Python only allows slice notation within the brackets that are attached to an object. If we try to assign slice notation to a variable outside of those brackets, we get a syntax error, as shown below.

In [23]:
rows = 'Jane':'Penelope'

SyntaxError: invalid syntax (<ipython-input-23-479e8c54e367>, line 1)

In [24]:
rows = ['Jane':'Penelope']

SyntaxError: invalid syntax (<ipython-input-24-479e8c54e367>, line 1)

### Use the `slice` function to separate out the selection in a different line
There is a built-in `slice` function that you can use to assign your selection to a variable. It takes the same three values **start**, **stop**, and **step**, but this time as function parameters.

In [25]:
rows = slice('Jane', 'Penelope')
cols = ['state', 'color']
df.loc[rows, cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


### Slice both the rows and columns

In [27]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


Use `None` to denote an empty part of the slice.

In [28]:
rows = slice(None, 'Dean')
cols = slice('height', None)
df.loc[rows, cols]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8
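
To see how slice notation and the `slice` function relate, here is a small sketch that uses a plain Python list instead of a DataFrame. The list values and variable names are made up for illustration.

```python
a_list = [10, 20, 30, 40, 50, 60]

# Slice notation inside brackets and an equivalent slice object
every_other = slice(1, 5, 2)
with_notation = a_list[1:5:2]
with_object = a_list[every_other]

# The three parts are stored as attributes on the slice object
parts = (every_other.start, every_other.stop, every_other.step)

# An omitted part is written as None, so a_list[:3] matches slice(None, 3)
first_three = a_list[slice(None, 3)]

print(with_notation == with_object)
print(parts)
print(first_three)
```

The same slice objects can be passed to `loc`, which is what the cells above do with label values instead of integers.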


### Slices with `loc` are inclusive of the stop value
Notice that the stop value is included in the returned DataFrame. When slicing Python lists, the last element is **excluded**.
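
The difference can be checked side by side. This sketch uses a small made-up DataFrame named `ages` holding the first four rows of the sample data, so that it runs on its own.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope"]
ages = pd.DataFrame({"age": [30, 2, 12, 4]}, index=names)

# Python list: the element at the stop position is excluded
list_part = names[0:2]

# loc: the row with the stop label is included
loc_part = ages.loc["Jane":"Aaron"]

print(list_part)             # ['Jane', 'Niko']
print(list(loc_part.index))  # ['Jane', 'Niko', 'Aaron']
```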

### Use slice notation or the slice function?
The `slice` function is rarely used in practice, so you will probably want to use slice notation. That said, the `slice` function does help separate the row and column selections into their own lines of code.

### Selecting all of the rows and some of the columns
It is possible to select all of the rows by using a single colon. Here, we select all of the rows and two of the columns.

In [29]:
cols = ['food', 'color']
df.loc[:, cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


Equivalently, we could use the slice function like this:

In [30]:
rows = slice(None)
cols = ['food', 'color']
df.loc[rows, cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


### The above is not necessary! Use *just the brackets*
You would rarely see `loc` used to select all of the rows and a couple of columns like that, because this is exactly what *just the brackets* are built for.

In [31]:
cols = ['food', 'color']
df[cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red
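
One way to confirm that the three approaches return the same data is to compare them with the DataFrame `equals` method. This is a sketch on a three-row copy of the sample data built inline. The variable names are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "color": ["blue", "green", "red"],
        "food": ["Steak", "Lamb", "Mango"],
    },
    index=["Jane", "Niko", "Aaron"],
)

cols = ["food", "color"]
with_loc = df.loc[:, cols]                # colon selects all rows
with_slice = df.loc[slice(None), cols]    # slice(None) is the same as the colon
with_brackets = df[cols]                  # just the brackets

print(with_loc.equals(with_brackets))
print(with_slice.equals(with_brackets))
```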


### A single colon is slice notation for select all
That single colon might be intimidating but it is technically slice notation that selects all items. See the following example with a list:

In [32]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]

[1, 2, 3, 4, 5, 6]

### Use a single colon to select all the columns
It is possible to use a single colon to represent a slice of all the rows or all of the columns. Below, a colon is used as slice notation for all of the columns.

In [33]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### The above can be shortened
By default, pandas will select all of the columns if you only provide a row selection. Providing the colon is not necessary and the following will do the same:

In [34]:
rows = ['Penelope', 'Cornelia']
df.loc[rows]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


One reason to use the colon, though it is not syntactically necessary, is to reinforce the idea that `loc` may be used for simultaneous row and column selection, and that the first object passed to `loc` always selects rows and the second always selects columns.

### Use slice notation to select a range of rows with all of the columns
Similarly, we can slice from `Niko` through `Dean` while selecting all of the columns. We do not provide a specific column selection, so by default pandas returns all of the columns.

In [35]:
df.loc['Niko':'Dean']

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


Again, you could have written the above as `df.loc['Niko':'Dean', :]` to reinforce the fact that `loc` first selects rows and then columns.
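
A quick check that the trailing colon is optional, for both a list of rows and a slice of rows. This sketch rebuilds only two columns of the sample data inline, and the variable names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "age": [30, 2, 12, 4, 32, 33, 69],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

rows = ["Penelope", "Cornelia"]
list_short = df.loc[rows]          # column selection omitted
list_explicit = df.loc[rows, :]    # colon selects all columns

slice_short = df.loc["Niko":"Dean"]
slice_explicit = df.loc["Niko":"Dean", :]

print(list_short.equals(list_explicit))
print(slice_short.equals(slice_explicit))
```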

### Other slicing examples
You can slice in a variety of ways such as selecting every other row by setting the step size to 2:

In [36]:
df.loc['Niko':'Christina':2]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


Omitting the start value to include all rows until the stop value:

In [37]:
df.loc[:'Penelope']

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3


Omitting the stop value to keep all rows after the start value:

In [38]:
df.loc['Aaron':]

Unnamed: 0,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


## Select a single row and a single column
If the row and column selections are both a single label, then a scalar value and NOT a DataFrame or Series is returned.

In [39]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]

'NY'

### Select a single row as a Series with `loc`
The `loc` indexer will return a single row as a Series when given a single row label. Let's select the row for Niko. Notice that the column names have now become index labels.

In [40]:
df.loc['Niko']

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object
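
The kind of object that `loc` returns depends on whether each selection is a single label or a list. The sketch below prints the type for a few combinations on a three-row copy of the sample data built inline. The names `scalar`, `row`, `col`, and `frame` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "food": ["Steak", "Lamb", "Mango"],
        "age": [30, 2, 12],
    },
    index=["Jane", "Niko", "Aaron"],
)

scalar = df.loc["Jane", "state"]              # single label, single label
row = df.loc["Niko"]                          # single row label only
col = df.loc[["Jane", "Aaron"], "food"]       # list of rows, single column label
frame = df.loc[["Niko"], ["state", "age"]]    # lists for both, even with one row

for name, obj in [("scalar", scalar), ("row", row), ("col", col), ("frame", frame)]:
    print(name, type(obj).__name__)
```

Note that wrapping a single label in a list, as in `frame`, keeps the result a DataFrame rather than reducing it to a Series.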

### Is this confusing?
Think about why this output may be confusing.

## Summary of `loc`
* Primarily uses labels
* Can select rows and columns simultaneously
* Selection can be a single label, a list of labels, a slice of labels, or a boolean Series/array
* Put a comma between row and column selections

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title column as the index. Select all columns for the movie 'The Dark Knight Rises'.</span>

In [68]:
df1 = pd.read_csv("../data/movie.csv")
df1 = df1.set_index('title')
rows = 'The Dark Knight Rises'
# Passing only a row label to loc returns every column for that row
df1.loc[rows]

year                                                            2012
color                                                          Color
content_rating                                                 PG-13
duration                                                         164
director_name                                      Christopher Nolan
director_fb                                                    22000
actor1                                                     Tom Hardy
actor1_fb                                                      27000
actor2                                                Christian Bale
actor2_fb                                                      23000
actor3                                          Joseph Gordon-Levitt
actor3_fb                                                      23000
gross                                                    4.48131e+08
genres                                               Action|Thriller
num_reviews                       

In [72]:
# Alternative: set the index while reading the file
df1 = pd.read_csv("../data/movie.csv", index_col='title')
rows = 'The Dark Knight Rises'
df1.loc[rows, :]

year                                                            2012
color                                                          Color
content_rating                                                 PG-13
duration                                                         164
director_name                                      Christopher Nolan
director_fb                                                    22000
actor1                                                     Tom Hardy
actor1_fb                                                      27000
actor2                                                Christian Bale
actor2_fb                                                      23000
actor3                                          Joseph Gordon-Levitt
actor3_fb                                                      23000
gross                                                    4.48131e+08
genres                                               Action|Thriller
num_reviews                       

### Exercise 2
<span  style="color:green; font-size:16px">Select all columns for the movies 'Tangled' and 'Avatar'.</span>

In [70]:
rows = ['Tangled','Avatar']
df1.loc[rows, :]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,...,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


### Exercise 3
<span  style="color:green; font-size:16px">In what year were 'Tangled' and 'Avatar' made, and what were their IMDB scores?</span>

In [71]:
rows = ['Tangled','Avatar']
columns = ['year','imdb_score']
df1.loc[rows, columns]

title,year,imdb_score
Tangled,2010.0,7.8
Avatar,2009.0,7.9


### Exercise 4
<span  style="color:green; font-size:16px">Can you tell what the data type of the `year` column is by just looking at its values?</span>

In [None]:
# Turn this into a markdown cell and write your answer here

### Exercise 5
<span  style="color:green; font-size:16px">Use a single method to output the data type and number of non-missing values of `year`. Is it missing any?</span>

### Exercise 6
<span  style="color:green; font-size:16px">Select every 100th movie between 'Tangled' and 'Forrest Gump'. Why doesn't 'Forrest Gump' appear in the results?</span>