# 4. Selecting Subsets of Data from DataFrames with just the brackets

### Objectives

+ Know that the three indexers `[ ]`, `.loc`, and `.iloc` are used to select subsets of data
+ Understand that the primary purpose of *just the brackets* is to select columns of a DataFrame

### Resources

+ Read [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - **up to but not including Selection By Callable**

# Selecting Subsets of Data
One of the most common tasks during a data analysis is to select subsets of the dataset. In Pandas, this means selecting particular rows and/or columns from our DataFrame (or Series).

## Examples of Selections of Subsets of Data
The following images show different types of subset selection that are possible. We will first highlight the values we want and then show the corresponding DataFrame after the completed selection.

### Selection of columns

![][2]

Resulting DataFrame:

![][3]

### Selection of rows

![][4]

Resulting DataFrame:

![][5]

### Selection of rows and columns

![][6]

Resulting DataFrame:

![][7]

[1]: images/sample_df.png
[2]: images/just_cols.png
[3]: images/just_cols2.png
[4]: images/just_rows.png
[5]: images/just_rows2.png
[6]: images/rows_cols.png
[7]: images/rows_cols2.png

# Pandas dual references: by label and by integer location
As previously mentioned, the index of each DataFrame provides a label to reference each individual row. Similarly, the columns provide a label to reference each column.

What hasn't been mentioned is that each row and column may be referenced by an integer as well. I call this **integer location**. The integer location begins at 0 and ends at n-1, where n is the number of rows (or columns). Take a look above at our sample DataFrame one more time.

The rows with labels **`Aaron`** and **`Dean`** can also be referenced by their respective integer locations 2 and 4. Similarly, the columns **`color`**, **`age`**, and **`height`** can be referenced by their integer locations 1, 3, and 4.

The documentation refers to integer location as **position**. I don't particularly like this terminology as it's not as explicit as integer location. The key term here is INTEGER.
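
To make the two kinds of reference concrete, here is a minimal sketch that looks up the integer location of a few labels. It assumes a small in-memory copy of the sample DataFrame (values copied from the sample data used in this notebook) rather than the CSV file, and it uses the `get_loc` method of the index and columns, which is not otherwise covered here.

```python
import pandas as pd

# In-memory stand-in for the sample data (assumption: built by hand, not read from disk)
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
)

# Label -> integer location
row_locs = [df.index.get_loc(label) for label in ["Aaron", "Dean"]]
col_locs = [df.columns.get_loc(label) for label in ["color", "age", "height"]]
print(row_locs)  # [2, 4]
print(col_locs)  # [1, 3, 4]

# Integer location -> label
print(df.index[2], df.columns[4])  # Aaron height
```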

# What's the difference between indexing and selecting subsets of data?
The documentation uses the term **indexing** frequently. This term is simply a one-word way of saying **subset selection**. I prefer the term subset selection as, again, it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation (for selecting subsets of lists or strings, for example).

# The three indexers `[ ]`, `.loc`, `.iloc`
Pandas provides three **indexers** to select subsets of data. An indexer is one of `[ ]`, `.loc`, or `.iloc`; it is the part of the expression that makes the subset selection.

We will go in-depth on how to make selections with each of these indexers. Each indexer has different rules for how it works. All our selections will look similar to the following, except they will have something placed within the brackets.

```
>>> df[]
>>> df.loc[]
>>> df.iloc[]
```
### Terminology
When the brackets are placed directly after the DataFrame, the term **just the brackets** will be used to differentiate from the brackets after **`.loc`** and **`.iloc`**.
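
As a quick preview only, the sketch below places one value inside each of the three indexers. It assumes a tiny hand-built DataFrame with three of the sample rows; the variable names `by_brackets`, `by_loc`, and `by_iloc` are made up for illustration. Just the brackets look up a column label, while `.loc` and `.iloc` given a single value look up a row, by label and by integer location respectively.

```python
import pandas as pd

# Tiny hand-built DataFrame (assumption: a trimmed copy of the sample data)
df = pd.DataFrame(
    {"color": ["blue", "green", "red"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

by_brackets = df["color"]  # just the brackets: a column, selected by label
by_loc = df.loc["Aaron"]   # .loc: a row, selected by label
by_iloc = df.iloc[2]       # .iloc: a row, selected by integer location

print(by_brackets.tolist())    # ['blue', 'green', 'red']
print(by_loc.equals(by_iloc))  # True: both refer to the row for Aaron
```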

# Begin with *just the brackets*
As we saw in the last notebook, just the brackets are used to select a single column as a Series. We place the column name inside the brackets to return the Series.

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)

In [2]:
df['color']

Jane          blue
Niko         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object

## Select Multiple Columns with a List
You can select multiple columns by placing them in a list inside of just the brackets. Notice that a DataFrame and NOT a Series is returned:

In [17]:
df[['color', 'age', 'score']]

| | color | age | score |
|---|---|---|---|
| Jane | blue | 30 | 4.6 |
| Niko | green | 2 | 8.3 |
| Aaron | red | 12 | 9.0 |
| Penelope | white | 4 | 3.3 |
| Dean | gray | 32 | 1.8 |
| Christina | black | 33 | 9.5 |
| Cornelia | red | 69 | 2.2 |


### You must use an inner set of brackets
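You might be tempted to do the following, which will NOT work. You must pass the column names as a **list** - remember that a list is defined by a set of brackets. Without the inner brackets, the comma-separated names form a single tuple, and Pandas looks for one column with that tuple as its label.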

In [18]:
# NO! An exception is raised

df['color', 'age', 'score']

KeyError: ('color', 'age', 'score')
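
The sketch below shows why the exception mentions a tuple. It assumes a small hand-built DataFrame instead of the CSV file, and the names `key`, `raised`, and `selected` are illustrative only.

```python
import pandas as pd

# Small hand-built DataFrame (assumption: two rows of the sample data)
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

# Commas without brackets build a tuple, not a list
key = "color", "age", "score"
print(type(key))

try:
    df["color", "age", "score"]  # equivalent to df[key]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# With the inner brackets, a list is passed and three columns come back
selected = df[["color", "age", "score"]]
print(list(selected.columns))  # ['color', 'age', 'score']
```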

# Use two lines of code to select multiple columns
To help ease the process of making subset selections, I recommend using intermediate variables. In this instance, we can assign the columns we would like to select to a list and then pass this list to the brackets.

In [19]:
cols = ['color', 'age', 'score']
df[cols]

| | color | age | score |
|---|---|---|---|
| Jane | blue | 30 | 4.6 |
| Niko | green | 2 | 8.3 |
| Aaron | red | 12 | 9.0 |
| Penelope | white | 4 | 3.3 |
| Dean | gray | 32 | 1.8 |
| Christina | black | 33 | 9.5 |
| Cornelia | red | 69 | 2.2 |


### Column order does not matter
You can create new DataFrames in any column order you wish. The order of the names in the list determines the order of the columns in the result, and it need not match the original column order.

In [20]:
cols = ['height', 'age']
df[cols]

| | height | age |
|---|---|---|
| Jane | 165 | 30 |
| Niko | 70 | 2 |
| Aaron | 120 | 12 |
| Penelope | 80 | 4 |
| Dean | 180 | 32 |
| Christina | 172 | 33 |
| Cornelia | 150 | 69 |


# Exercises
For the following exercises, make sure to use the movie dataset with **`title`** set as the index. It's good practice to shorten your output with the **`head`** method.
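
One possible way to set **`title`** as the index while reading the data is the `index_col` parameter of `read_csv`. The sketch below is an assumption-laden stand-in: it reads a few rows typed into a string (copied from the movie data displayed in this notebook) instead of `../data/movie.csv`, and the names `csv_text`, `movie`, and `first_two` are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for ../data/movie.csv (assumption: only four columns and four rows)
csv_text = """title,year,director_name,director_fb
Avatar,2009.0,James Cameron,0.0
Pirates of the Caribbean: At World's End,2007.0,Gore Verbinski,563.0
Spectre,2015.0,Sam Mendes,0.0
The Dark Knight Rises,2012.0,Christopher Nolan,22000.0
"""

# index_col="title" moves the title column into the index
movie = pd.read_csv(io.StringIO(csv_text), index_col="title")

# head shortens the output to the first rows
first_two = movie.head(2)
print(first_two)
```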

### Problem 1
<span  style="color:green; font-size:16px">Select the column with the director's name as a Series</span>

In [22]:
movie_df = pd.read_csv('../data/movie.csv')

In [25]:
movie_df['director_name'].head(3)

0     James Cameron
1    Gore Verbinski
2        Sam Mendes
Name: director_name, dtype: object

### Problem 2
<span  style="color:green; font-size:16px">Select the column with the director's name and number of Facebook likes.</span>

In [26]:
movie_df.head()

| | title | year | color | content_rating | duration | director_name | director_fb | actor1 | actor1_fb | actor2 | ... | actor3_fb | gross | genres | num_reviews | num_voted_users | plot_keywords | language | country | budget | imdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 2009.0 | Color | PG-13 | 178.0 | James Cameron | 0.0 | CCH Pounder | 1000.0 | Joel David Moore | ... | 855.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | 723.0 | 886204 | avatar\|future\|marine\|native\|paraplegic | English | USA | 237000000.0 | 7.9 |
| 1 | Pirates of the Caribbean: At World's End | 2007.0 | Color | PG-13 | 169.0 | Gore Verbinski | 563.0 | Johnny Depp | 40000.0 | Orlando Bloom | ... | 1000.0 | 309404152.0 | Action\|Adventure\|Fantasy | 302.0 | 471220 | goddess\|marriage ceremony\|marriage proposal\|pi... | English | USA | 300000000.0 | 7.1 |
| 2 | Spectre | 2015.0 | Color | PG-13 | 148.0 | Sam Mendes | 0.0 | Christoph Waltz | 11000.0 | Rory Kinnear | ... | 161.0 | 200074175.0 | Action\|Adventure\|Thriller | 602.0 | 275868 | bomb\|espionage\|sequel\|spy\|terrorist | English | UK | 245000000.0 | 6.8 |
| 3 | The Dark Knight Rises | 2012.0 | Color | PG-13 | 164.0 | Christopher Nolan | 22000.0 | Tom Hardy | 27000.0 | Christian Bale | ... | 23000.0 | 448130642.0 | Action\|Thriller | 813.0 | 1144337 | deception\|imprisonment\|lawlessness\|police offi... | English | USA | 250000000.0 | 8.5 |
| 4 | Star Wars: Episode VII - The Force Awakens | NaN | NaN | NaN | NaN | Doug Walker | 131.0 | Doug Walker | 131.0 | Rob Walker | ... | NaN | NaN | Documentary | NaN | 8 | NaN | NaN | NaN | NaN | 7.1 |


In [29]:
movie_df[['director_name', 'director_fb']].head(3)

| | director_name | director_fb |
|---|---|---|
| 0 | James Cameron | 0.0 |
| 1 | Gore Verbinski | 563.0 |
| 2 | Sam Mendes | 0.0 |


### Problem 3
<span  style="color:green; font-size:16px">Select a single column as a DataFrame and not a Series</span>

In [31]:
movie_df[['director_name']].head(3)

| | director_name |
|---|---|
| 0 | James Cameron |
| 1 | Gore Verbinski |
| 2 | Sam Mendes |
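
A quick way to confirm which kind of object a selection returns is to check its type and shape. This is a minimal sketch on a hand-built three-row DataFrame (an assumption standing in for the movie data); `as_series` and `as_frame` are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for the first three rows of the movie data
df = pd.DataFrame(
    {
        "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
        "director_fb": [0.0, 563.0, 0.0],
    }
)

as_series = df["director_name"]   # a string inside just the brackets
as_frame = df[["director_name"]]  # a one-item list inside just the brackets

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

The Series is one-dimensional, while the one-column DataFrame keeps two dimensions, which matters when later code expects a particular type.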


### Problem 4
<span  style="color:green; font-size:16px">Look in the data folder and read in another dataset. Select some columns from it.</span>

In [34]:
import pandas as pd
flights_df = pd.read_csv('../data/flights.csv')
flights_df.head(3)

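| | year | month | day | day_of_week | airline | flight_number | tail_number | origin_airport | destination_airport | scheduled_departure | ... | arrival_time | arrival_delay | diverted | cancelled | cancellation_reason | air_system_delay | security_delay | airline_delay | late_aircraft_delay | weather_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | 1 | 1 | 4 | WN | 1908 | N8324A | LAX | SLC | 1625 | ... | 2010.0 | 65.0 | 0 | 0 | NaN | 31.0 | 0.0 | 0.0 | 34.0 | 0.0 |
| 1 | 2015 | 1 | 1 | 4 | UA | 581 | N448UA | DEN | IAD | 823 | ... | 1320.0 | -13.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2015 | 1 | 1 | 4 | MQ | 2851 | N645MQ | DFW | VPS | 1305 | ... | 1528.0 | 35.0 | 0 | 0 | NaN | 0.0 | 0.0 | 35.0 | 0.0 | 0.0 |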


In [41]:
flights_df[['day_of_week', 'airline']].head(3)

| | day_of_week | airline |
|---|---|---|
| 0 | 4 | WN |
| 1 | 4 | UA |
| 2 | 4 | MQ |


## During class

In [4]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

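| | state | color | food | age | height | score |
|---|---|---|---|---|---|---|
| Jane | NY | blue | Steak | 30 | 165 | 4.6 |
| Niko | TX | green | Lamb | 2 | 70 | 8.3 |
| Aaron | FL | red | Mango | 12 | 120 | 9.0 |
| Penelope | AL | white | Apple | 4 | 80 | 3.3 |
| Dean | AK | gray | Cheese | 32 | 180 | 1.8 |
| Christina | TX | black | Melon | 33 | 172 | 9.5 |
| Cornelia | TX | red | Beans | 69 | 150 | 2.2 |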


In [5]:
df.shape

(7, 6)

In [8]:
df['height']  # this returns a Series

Jane         165
Niko          70
Aaron        120
Penelope      80
Dean         180
Christina    172
Cornelia     150
Name: height, dtype: int64

In [10]:
df[4]  # this doesn't work: just the brackets look for a column labeled 4, which doesn't exist, so a KeyError is raised

KeyError: 4
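
If you only know a column's integer location, one workaround that stays with just the brackets is to look up the label first through `df.columns`. The sketch below assumes a hand-built two-row copy of the sample data; `raised`, `label`, and `height` are illustrative names.

```python
import pandas as pd

# Hand-built stand-in with the same six columns as the sample data
df = pd.DataFrame(
    {
        "state": ["NY", "TX"],
        "color": ["blue", "green"],
        "food": ["Steak", "Lamb"],
        "age": [30, 2],
        "height": [165, 70],
        "score": [4.6, 8.3],
    },
    index=["Jane", "Niko"],
)

try:
    df[4]  # no column is labeled 4
    raised = False
except KeyError:
    raised = True

# Translate the integer location into a label, then select by label
label = df.columns[4]
height = df[label]
print(label)            # height
print(height.tolist())  # [165, 70]
```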

In [11]:
df

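| | state | color | food | age | height | score |
|---|---|---|---|---|---|---|
| Jane | NY | blue | Steak | 30 | 165 | 4.6 |
| Niko | TX | green | Lamb | 2 | 70 | 8.3 |
| Aaron | FL | red | Mango | 12 | 120 | 9.0 |
| Penelope | AL | white | Apple | 4 | 80 | 3.3 |
| Dean | AK | gray | Cheese | 32 | 180 | 1.8 |
| Christina | TX | black | Melon | 33 | 172 | 9.5 |
| Cornelia | TX | red | Beans | 69 | 150 | 2.2 |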


In [13]:
cols = ['age', 'color']
df[cols]

| | age | color |
|---|---|---|
| Jane | 30 | blue |
| Niko | 2 | green |
| Aaron | 12 | red |
| Penelope | 4 | white |
| Dean | 32 | gray |
| Christina | 33 | black |
| Cornelia | 69 | red |


In [14]:
df[['age', 'color']]

| | age | color |
|---|---|---|
| Jane | 30 | blue |
| Niko | 2 | green |
| Aaron | 12 | red |
| Penelope | 4 | white |
| Dean | 32 | gray |
| Christina | 33 | black |
| Cornelia | 69 | red |


In [15]:
df[['age']]  # this returns a DataFrame with one column

| | age |
|---|---|
| Jane | 30 |
| Niko | 2 |
| Aaron | 12 |
| Penelope | 4 |
| Dean | 32 |
| Christina | 33 |
| Cornelia | 69 |


In [16]:
df['age']  # this returns a Series

Jane         30
Niko          2
Aaron        12
Penelope      4
Dean         32
Christina    33
Cornelia     69
Name: age, dtype: int64