# 4. Selecting Subsets of Data from DataFrames with just the brackets

### Objectives

+ Know that the three indexers `[ ]`, `.loc`, and `.iloc` are used to select subsets of data
+ Know that the primary purpose of *just the brackets* is to select columns of a DataFrame

### Resources

+ Read [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - **up to but not including Selection By Callable**

# Selecting Subsets of Data
One of the most common tasks during a data analysis is to select subsets of the dataset. In Pandas, this means selecting particular rows and/or columns from our DataFrame (or Series).

## Examples of Selections of Subsets of Data
The following images show different types of subset selection that are possible. We will first highlight the values we want and then show the corresponding DataFrame after the completed selection.

### Selection of columns

![][2]

Resulting DataFrame:

![][3]

### Selection of rows

![][4]

Resulting DataFrame:

![][5]

### Selection of rows and columns

![][6]

Resulting DataFrame:

![][7]

[1]: images/sample_df.png
[2]: images/just_cols.png
[3]: images/just_cols2.png
[4]: images/just_rows.png
[5]: images/just_rows2.png
[6]: images/rows_cols.png
[7]: images/rows_cols2.png

# Pandas dual references: by label and by integer location
As previously mentioned, the index of each DataFrame provides a label to reference each individual row. Similarly, the columns provide a label to reference each column.

What hasn't been mentioned is that each row and column may be referenced by an integer as well. I call this **integer location**. The integer location begins at 0 and ends at n-1 for each row and column. Take a look above at our sample DataFrame one more time.

The rows with labels **`Aaron`** and **`Dean`** can also be referenced by their respective integer locations 2 and 4. Similarly, the columns **`color`**, **`age`**, and **`height`** can be referenced by their integer locations 1, 3, and 4.

The documentation refers to integer location as **position**. I don't particularly like this terminology as it's not as explicit as integer location. The key term here is INTEGER.
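
The two kinds of reference can be checked side by side. The sketch below is only an illustration: it builds a small hand-made stand-in for the sample DataFrame (the values are invented, and the real data lives in the CSV file read later in this notebook) and uses `Index.get_loc` to translate labels into integer locations.

```python
import pandas as pd

# Hypothetical stand-in for the sample DataFrame; every value here is made up.
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK"],
        "color": ["blue", "green", "red", "white", "gray"],
        "food": ["Steak", "Lamb", "Mango", "Apple", "Cheese"],
        "age": [30, 2, 12, 4, 32],
        "height": [165, 70, 120, 80, 180],
    },
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)

# Index.get_loc turns a label into its integer location (counting from 0).
row_locs = [df.index.get_loc(name) for name in ["Aaron", "Dean"]]
col_locs = [df.columns.get_loc(name) for name in ["color", "age", "height"]]

print(row_locs)  # [2, 4]
print(col_locs)  # [1, 3, 4]

# The last valid integer location is n - 1 along each axis.
last_row_loc = len(df.index) - 1
last_col_loc = len(df.columns) - 1
```

Every row and column therefore has both a label and an integer location, and either can be used to refer to it.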

# What's the difference between indexing and selecting subsets of data?
The documentation uses the term **indexing** frequently. This term is essentially a one-word way of saying **subset selection**. I prefer the term subset selection as, again, it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation (for selecting subsets of lists or strings, for example).

# The three indexers `[ ]`, `.loc`, `.iloc`
Pandas provides three **indexers** to select subsets of data. An indexer is one of `[ ]`, `.loc`, or `.iloc`; it is what performs the subset selection.

We will go in-depth on how to make selections with each of these indexers. Each indexer has different rules for how it works. All our selections will look similar to the following, except they will have something placed within the brackets.

```
>>> df[]
>>> df.loc[]
>>> df.iloc[]
```

### Terminology
When the brackets are placed directly after the DataFrame, the term **just the brackets** will be used to differentiate from the brackets after **`.loc`** and **`.iloc`**.
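
As a rough preview of what "something placed within the brackets" can look like, here is a minimal sketch that selects the same column with each indexer. The tiny DataFrame is a made-up stand-in, and the `.loc` and `.iloc` forms are shown only for comparison, not as a full account of their rules.

```python
import pandas as pd

# Hypothetical miniature DataFrame; the values are made up for illustration.
df = pd.DataFrame(
    {"color": ["blue", "green", "red"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

by_brackets = df["color"]     # just the brackets: a column by its label
by_loc = df.loc[:, "color"]   # .loc: all rows, the column by its label
by_iloc = df.iloc[:, 0]       # .iloc: all rows, the column by integer location

# All three selections return the same Series.
print(by_brackets.equals(by_loc) and by_loc.equals(by_iloc))  # True
```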

# Begin with *just the brackets*
As we saw in the last notebook, just the brackets are used to select a single column as a Series. We place the column name inside the brackets to return the Series.

In [None]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv')

In [None]:
df['color']

## Select Multiple Columns with a List
You can select multiple columns by placing them in a list inside of just the brackets. Notice that a DataFrame and NOT a Series is returned:

In [None]:
df[['color', 'age', 'score']]
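
One way to confirm which type comes back is to inspect the results directly. This is a small sketch on a made-up two-row DataFrame, not the sample data itself.

```python
import pandas as pd

# Hypothetical two-row DataFrame; the values are made up for illustration.
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "score": [4.6, 8.3]}
)

single = df["color"]                       # a string inside the brackets
multiple = df[["color", "age", "score"]]   # a list inside the brackets

print(type(single).__name__)    # Series
print(type(multiple).__name__)  # DataFrame
print(multiple.shape)           # (2, 3)
```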

### You must use an inner set of brackets
You might be tempted to do the following, which will NOT work. You must pass the column names as a **list** - remember that a list is defined by a set of brackets.

In [None]:
# NO! A KeyError is raised

df['color', 'age', 'score']
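
The failure can be observed safely by catching the exception. In this sketch (again on a made-up DataFrame), Python packs the comma-separated names into a single tuple, and pandas then looks for one column labeled by that tuple.

```python
import pandas as pd

# Hypothetical two-row DataFrame; the values are made up for illustration.
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "score": [4.6, 8.3]}
)

try:
    # Without the inner brackets, the key is the tuple ('color', 'age', 'score').
    df["color", "age", "score"]
    raised = False
except KeyError:
    # No single column carries that tuple as its label.
    raised = True

print(raised)  # True
```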

# Use two lines of code to select multiple columns
To make subset selection easier, I recommend using intermediate variables. In this instance, we can assign the columns we would like to select to a list and then pass this list to the brackets.

In [None]:
cols = ['color', 'age', 'score']
df[cols]

### Column order does not matter
You can create new DataFrames in any column order you wish - it need not match the original column order.

In [None]:
cols = ['height', 'age']
df[cols]
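
To see that the result follows the order of the list rather than the order of the original columns, compare the two sets of column labels. The DataFrame below is a made-up stand-in used only for this check.

```python
import pandas as pd

# Hypothetical DataFrame; the values are made up for illustration.
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "height": [165, 70]}
)

cols = ["height", "age"]
subset = df[cols]

print(list(df.columns))      # ['color', 'age', 'height']
print(list(subset.columns))  # ['height', 'age']
```

The original DataFrame keeps its own column order; only the new DataFrame uses the order given in `cols`.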

# Exercises
For the following exercises, make sure to use the movie dataset with **`title`** set as the index. It's good practice to shorten your output with the **`head`** method.

### Problem 1
<span  style="color:green; font-size:16px">Select the column with the director's name as a Series</span>

### Problem 2
<span  style="color:green; font-size:16px">Select the columns with the director's name and number of Facebook likes.</span>

### Problem 3
<span  style="color:green; font-size:16px">Select a single column as a DataFrame and not a Series</span>

### Problem 4
<span  style="color:green; font-size:16px">Look in the data folder and read in another dataset. Select some columns from it.</span>