# 2. Selecting Subsets of Data - DataFrames

### Objectives

+ Read in a dataset with `read_csv` and set the index with the `index_col` parameter
+ Extract the components of a DataFrame and verify their type
+ Know that the three indexers `[ ]`, `.loc`, and `.iloc` are used to select subsets of data
+ The primary purpose of *just the brackets* is to select columns of a DataFrame
+ `.loc` selects only by **label**
+ `.iloc` selects only by **integer location**
+ Both `.loc` and `.iloc` can select rows, columns, or rows and columns simultaneously
+ Set a meaningful index with the `set_index` method


### Prepare for this lesson by...

+ Read [Indexing and Selecting](http://pandas.pydata.org/pandas-docs/stable/indexing.html) - **up to but not including Selection By Callable**

### Overview
This notebook focuses on selecting subsets of data from a DataFrame.

# Extracting the components of a DataFrame - The Index, Columns, and Data
The DataFrame consists of three components - the index, columns, and data. It is possible to extract each component and assign it to its own variable.

Let's read in a small dataset to show how this is done. Notice that when we read in the data, we choose the first column to be the index with the **`index_col`** parameter. More on this later.

In [9]:
import pandas as pd

In [12]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
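
To see what `index_col` changes without needing the file on disk, here is a minimal sketch that reads a three-row excerpt of the sample data from a string. The `csv_text` variable and the empty first header field are assumptions about how the file is laid out, not something taken from the actual file.

```python
import io
import pandas as pd

# Hypothetical inline excerpt of sample_data.csv; the first header field is
# assumed to be empty because that column holds the row labels.
csv_text = """,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
"""

# With index_col=0 the first column becomes the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, the first column stays a regular column and
# pandas supplies default integer labels
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.index.tolist())
print(without_index.index.tolist())
print(with_index.shape, without_index.shape)
```

The only difference between the two calls is where the names end up: as row labels in the first case, as an extra data column in the second.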


### Use the attributes `index`, `columns`, and `values`
The index, columns, and data are each separate objects. Let's assign them into their own variables.

In [13]:
index = df.index
columns = df.columns
data = df.values

### View these objects
Let's output each of these objects:

In [17]:
index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object')

In [19]:
columns

Index(['state', 'color', 'food', 'age', 'height', 'score'], dtype='object')

In [20]:
data

array([['NY', 'blue', 'Steak', 30, 165, 4.6],
       ['TX', 'green', 'Lamb', 2, 70, 8.3],
       ['FL', 'red', 'Mango', 12, 120, 9.0],
       ['AL', 'white', 'Apple', 4, 80, 3.3],
       ['AK', 'gray', 'Cheese', 32, 180, 1.8],
       ['TX', 'black', 'Melon', 33, 172, 9.5],
       ['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)

### What are these objects?
The output of these objects looks correct but we don't know the exact type of each one. Let's find out:

In [21]:
type(index)

pandas.core.indexes.base.Index

In [22]:
type(columns)

pandas.core.indexes.base.Index

In [23]:
type(data)

numpy.ndarray

### Pandas `Index` Type
Pandas has a special type of object called an **`Index`**. This object is quite powerful, but for now you can just think of it as a sequence of labels for either the rows or the columns. You will not deal with this object directly much at all, so there's not much of a need to know more about it for now.

Notice that both the index and columns are of the same type.

### NumPy's `ndarray`
The data is stored as a NumPy **`ndarray`** (which stands for n-dimensional array). It is this array that is doing the bulk of the workload in Pandas.
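
The `dtype=object` shown in the array output earlier appears because the sample data mixes strings and numbers, so a single array has to fall back to a generic type. The sketch below illustrates this on a hypothetical two-row frame named `small` (not part of the lesson's data) and also confirms the component types with `isinstance`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame mixing strings and numbers, like the sample data
small = pd.DataFrame(
    {"state": ["NY", "TX"], "age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

# Both the row labels and the column labels are Index objects
print(isinstance(small.index, pd.Index), isinstance(small.columns, pd.Index))

# The data comes back as a NumPy array
print(isinstance(small.values, np.ndarray))

# Mixed column types force one generic dtype for the whole array
mixed_dtype = small.values.dtype
# Numeric-only columns can share a numeric dtype instead
numeric_dtype = small[["age", "score"]].values.dtype
print(mixed_dtype, numeric_dtype)
```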

### Operating with DataFrame as a Whole
You will rarely need to operate on these components directly. Almost always, you will work with the entire DataFrame instead.

# Extracting the components of a Series - The Index and Data
Similarly we can extract the two Series components - the index and the data.

Let's first select a single column as a Series:

In [31]:
color = df['color']

In [32]:
color.index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object')

In [33]:
color.values

array(['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
      dtype=object)

# Selecting Subsets of Data
One of the most common tasks during a data analysis is to select subsets of the dataset. In Pandas, this means selecting particular rows and/or columns from our DataFrame (or Series).

## Examples of Selections of Subsets of Data
The following will show images of different types of selections that are possible. We will first highlight the values we want and then show the corresponding DataFrame after the completed selection.

### Selection of columns

![][2]

Resulting DataFrame:

![][3]

### Selection of rows

![][4]

Resulting DataFrame:

![][5]

### Selection of rows and columns

![][6]

Resulting DataFrame:

![][7]

[1]: images/sample_df.png
[2]: images/just_cols.png
[3]: images/just_cols2.png
[4]: images/just_rows.png
[5]: images/just_rows2.png
[6]: images/rows_cols.png
[7]: images/rows_cols2.png

# Pandas dual references: by label and by integer location
As previously mentioned, the index of each DataFrame provides a label to reference each individual row. Similarly the columns provide a label to reference each column.

What hasn't been mentioned yet is that each row and column may be referenced by an integer as well. I call this integer location. The integer location begins at 0 and ends at n-1 for each row and column. Take a look above at our sample DataFrame one more time.

The rows with labels **`Aaron`** and **`Dean`** can also be referenced by their respective integer locations 2 and 4. Similarly, the columns **`color`**, **`age`**, and **`height`** can be referenced by their integer locations 1, 3, and 4.
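
One way to check these pairings is to ask the `Index` objects themselves. This is a small sketch that rebuilds the row and column labels by hand (the two `pd.Index` objects here are stand-ins for `df.index` and `df.columns`) and uses the `get_loc` method to translate a label into its integer location.

```python
import pandas as pd

# Stand-ins for df.index and df.columns from the sample DataFrame
index = pd.Index(["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"])
columns = pd.Index(["state", "color", "food", "age", "height", "score"])

# Label -> integer location
row_locs = [index.get_loc(label) for label in ["Aaron", "Dean"]]
col_locs = [columns.get_loc(label) for label in ["color", "age", "height"]]
print(row_locs)  # [2, 4]
print(col_locs)  # [1, 3, 4]

# Integer location -> label
print(index[2], columns[3])  # Aaron age
```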

The documentation refers to integer location as **position**. I don't particularly like this terminology as it's not as explicit as integer location. The key term here is INTEGER.

# What's the difference between indexing and selecting subsets of data?
The documentation uses the term **indexing** frequently. This term is essentially just a one-word phrase to say **subset selection**. I prefer the term subset selection as, again, it is more descriptive of what is actually happening. Indexing is also the term used in the official Python documentation (for selecting subsets of lists or strings for example).

# The Three Indexers `[ ]`, `.loc`, `.iloc`
Pandas provides three **indexers** to select subsets of data. An indexer is one of `[ ]`, `.loc`, or `.iloc`; it is the part of the expression that makes the subset selection.

We will go in-depth on how to make selections with each of these indexers. Each indexer has different rules for how it works. All our selections will look similar to the following, except they will have something placed within the brackets.

```
>>> df[]
>>> df.loc[]
>>> df.iloc[]
```
### Terminology
When the brackets are placed directly after the DataFrame, the term **just the brackets** will be used to differentiate from the brackets after **`.loc`** and **`.iloc`**.

# Begin with *just the brackets*
As we saw in the last notebook, just the brackets are used to select a single column as a Series. We simply place the column name inside the brackets to return the Series:

In [34]:
df['color']

Jane          blue
Niko         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object

## Select Multiple Columns with a List
You can select multiple columns by placing them in a list inside of just the brackets. Notice that a DataFrame and NOT a Series is returned:

In [35]:
df[['color', 'age', 'score']]

Unnamed: 0,color,age,score
Jane,blue,30,4.6
Niko,green,2,8.3
Aaron,red,12,9.0
Penelope,white,4,3.3
Dean,gray,32,1.8
Christina,black,33,9.5
Cornelia,red,69,2.2
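
The return type depends on whether a string or a list goes inside the brackets, not on how many columns come back. A short sketch on a hypothetical two-row stand-in for the sample DataFrame:

```python
import pandas as pd

# Hypothetical two-row stand-in for the sample DataFrame
df = pd.DataFrame({"color": ["blue", "green"], "age": [30, 2]}, index=["Jane", "Niko"])

as_series = df["color"]    # a string selects one column as a Series
as_frame = df[["color"]]   # a one-item list still returns a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```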


### You must use an inner set of brackets
You might be tempted to do the following, which will NOT work. You must pass the column names as a **list** - remember that a list is defined by a set of brackets.

In [36]:
# NO! An exception is raised

df['color', 'age', 'score']

KeyError: ('color', 'age', 'score')
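
The reason for the exception is that Python turns comma-separated values inside brackets into a single tuple, and Pandas then looks for one column whose label is that whole tuple. The sketch below reproduces this on a hypothetical two-row frame; the `key` and `raised` names are only for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for the sample DataFrame
df = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

# This is what Python builds from df['color', 'age', 'score']
key = "color", "age", "score"
print(type(key).__name__)  # tuple

try:
    df[key]          # looked up as ONE column label, which does not exist
    raised = False
except KeyError:
    raised = True
print(raised)

# Converting the tuple to a list asks for several column labels instead
subset = df[list(key)]
print(subset.columns.tolist())
```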

### Column order does not matter
You can create new DataFrames in any column order you wish - it need not match the original column order.

In [38]:
df[['height', 'age']]

Unnamed: 0,height,age
Jane,165,30
Niko,70,2
Aaron,120,12
Penelope,80,4
Dean,180,32
Christina,172,33
Cornelia,150,69


### Assign list of column names to a variable

In [39]:
cols = ['age', 'state', 'height']
df[cols]

Unnamed: 0,age,state,height
Jane,30,NY,165
Niko,2,TX,70
Aaron,12,FL,120
Penelope,4,AL,80
Dean,32,AK,180
Christina,33,TX,172
Cornelia,69,TX,150


# Subset Selection with `.loc`
The **`.loc`** indexer selects data in a different manner than *just the brackets*. It can select subsets of rows or columns as well as rows and columns simultaneously. Most importantly, it only selects data by the **LABEL** of the rows and columns.

You must provide **`.loc`** with the label of the rows and/or columns you would like to select.

## Select a single row as a Series with `.loc`
The .loc indexer will return a single row as a Series when given a single row label. Let's select the row for Niko. Notice that the column names have now become index labels.

In [40]:
df.loc['Niko']

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

## Select multiple rows as a DataFrame with `.loc`
Pass the names of the rows as a list to **`.loc`**. Let's select **`Dean`** and **`Aaron`**.

In [41]:
df.loc[['Dean', 'Aaron']]

Unnamed: 0,state,color,food,age,height,score
Dean,AK,gray,Cheese,32,180,1.8
Aaron,FL,red,Mango,12,120,9.0


## Use slice notation to select a range of rows with `.loc`
It is possible to slice the rows of a DataFrame with `.loc` by using slice notation. Slice notation uses a colon to separate start, stop, and step values. This is similar to how slicing lists works except that the last value is **included**. For instance we can select all the rows from Niko through Dean like this:

In [43]:
df.loc['Niko':'Dean']

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
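
To make the inclusive stop value concrete, this sketch puts a plain list slice next to a `.loc` label slice. The one-column `ages` frame is a hypothetical cut-down version of the sample data.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"]
# Hypothetical one-column version of the sample DataFrame
ages = pd.DataFrame({"age": [30, 2, 12, 4, 32, 33, 69]}, index=names)

# List slicing stops BEFORE the stop position (integer location 4 is 'Dean')
list_slice = names[1:4]

# .loc slicing stops AT the stop label
label_slice = ages.loc["Niko":"Dean"].index.tolist()

print(list_slice)   # ['Niko', 'Aaron', 'Penelope']
print(label_slice)  # ['Niko', 'Aaron', 'Penelope', 'Dean']
```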


## Other slicing examples
You can slice in a variety of ways such as taking every other row by setting the step size to 2:

In [45]:
df.loc['Niko':'Christina':2]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


Omitting the start value to include all rows until the stop value:

In [46]:
df.loc[:'Penelope']

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3


Omitting the stop value to keep all rows after the start value:

In [47]:
df.loc['Aaron':]

Unnamed: 0,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


# Selecting rows and columns simultaneously with `.loc`
Unlike just the brackets, it is possible to select rows and columns simultaneously with `.loc`. This is done by separating the row and column selections with a **comma**. It will look something like this:

```
df.loc[row_selection, column_selection]
```

## Select two rows and three columns
For instance, if we wanted to select the rows **`Dean`** and **`Cornelia`** along with the columns **`age`**, **`state`**, and **`score`** we would do this:

In [49]:
df.loc[['Dean', 'Cornelia'], ['age', 'state', 'score']]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


## Use any combination of selections for either rows or columns with `.loc`
Row or column selections can be any of the following as we have already seen:

* A single label
* A list of labels
* A slice with labels

We can use any of these three for either row or column selections with `.loc`.

### Select two rows and a single column:
Here we use a list for the rows and string for the column. The row selection is **`['Dean', 'Aaron']`** and the column selection is **`food`**. Note how this returns a Series since we are selecting exactly a single column.

In [51]:
df.loc[['Dean', 'Aaron'], 'food']

Dean     Cheese
Aaron     Mango
Name: food, dtype: object

### Slice the rows and a list for the columns
Continue to notice the comma separating the row and column selections.

In [52]:
df.loc['Jane':'Penelope', ['state', 'color']]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


### Select a single row and a single column
This returns a scalar value and NOT a DataFrame or Series

In [55]:
df.loc['Jane', 'state']

'NY'

### Slice both the rows and columns

In [56]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


### Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. You can then select columns as normal.

In [57]:
df.loc[:, ['food', 'color']]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


### A single colon is slice notation for select all
That single colon might be intimidating but it is technically slice notation that selects all items. See the following example with a list:

In [66]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]

[1, 2, 3, 4, 5, 6]

### Use a single colon to select all the columns

In [67]:
df.loc[['Penelope','Cornelia'], :]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### The above is not necessary!
By default, Pandas will select all the columns if you only provide a row selection. So, you should never actually use the last line of code as it is redundant.

In [68]:
df.loc[['Penelope','Cornelia']]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### Assign row and column selections to variables
You might prefer assigning the row and column selections to variables to help clean up the code

In [69]:
row_selection = ['Jane', 'Niko', 'Dean', 'Penelope', 'Christina']
column_selection = ['state', 'age', 'height', 'score']
df.loc[row_selection, column_selection]

Unnamed: 0,state,age,height,score
Jane,NY,30,165,4.6
Niko,TX,2,70,8.3
Dean,AK,32,180,1.8
Penelope,AL,4,80,3.3
Christina,TX,33,172,9.5


# Summary of `.loc`
* Only uses labels
* Can select rows and columns simultaneously
* Selection can be a single label, a list of labels or a slice of labels
* Put a comma between row and column selections

# Getting started with `.iloc`
The `.iloc` indexer is very similar to `.loc` but only uses integer locations to make its selections. The word `iloc` itself stands for integer location, which should help you remember what it does.

## Select a single row as a Series with `.iloc`
By passing a single integer to .iloc, it will select one row as a Series:

In [71]:
df.iloc[2]

state        FL
color       red
food      Mango
age          12
height      120
score         9
Name: Aaron, dtype: object

## Selecting multiple rows with `.iloc`
Use a list of integers to select multiple rows:

In [72]:
df.iloc[[3, 1, 5]]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Niko,TX,green,Lamb,2,70,8.3
Christina,TX,black,Melon,33,172,9.5


### Remember to use an inner list
The following will result in an error:
```
df.iloc[3, 1, 5]
```
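
The error arises because each comma-separated value is treated as a selection for a separate dimension, and a DataFrame only has two (rows and columns). Here is a small sketch on a hypothetical six-row numeric frame; the exact exception class is deliberately not assumed, so the code catches a generic `Exception` and reports its name. The message typically mentions too many indexers.

```python
import pandas as pd

# Hypothetical six-row frame, just large enough for locations 3, 1, and 5
df = pd.DataFrame({"x": range(6), "y": range(10, 16)})

try:
    df.iloc[3, 1, 5]       # three selections for a two-dimensional object
    error_name = None
except Exception as err:   # exact class not assumed here
    error_name = type(err).__name__
print(error_name)

# The inner list makes all three integers part of the ROW selection
rows = df.iloc[[3, 1, 5]]
print(rows.index.tolist())  # [3, 1, 5]
```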

## Use slice notation to select a range of rows with `.iloc`
Slice notation works just like it does with a list in this instance and is **exclusive** of the last element.

In [73]:
df.iloc[2:4]

Unnamed: 0,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3


### Other Slices
You can slice DataFrames exactly how you slice lists:

In [77]:
# First four elements
df.iloc[:4]

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3


In [78]:
# last two elements
df.iloc[-2:]

Unnamed: 0,state,color,food,age,height,score
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [79]:
# Select integer location 2 to the end by every third element
df.iloc[2::3]

Unnamed: 0,state,color,food,age,height,score
Aaron,FL,red,Mango,12,120,9.0
Christina,TX,black,Melon,33,172,9.5


# Selecting rows and columns simultaneously with `.iloc`
Just like with `.loc`, any combination of a single integer, lists of integers, or slices can be used to select rows and columns simultaneously. Just remember to separate the selections with a comma.

### Use a list for both rows and columns

In [80]:
# Select rows 2 and 4 along with the first and last columns
df.iloc[[2, 4], [0, -1]]

Unnamed: 0,state,score
Aaron,FL,9.0
Dean,AK,1.8


### Slice the rows and use a list for the columns

In [81]:
# Slice the rows and use a list for the columns
df.iloc[::2, [4, 2]]

Unnamed: 0,height,food
Jane,165,Steak
Aaron,120,Mango
Dean,180,Cheese
Cornelia,150,Beans


### Select a single element as a scalar

In [85]:
# Select a single element as a scalar (in this case the height of Penelope)
df.iloc[3, 4]

80

### Select a single row and slice the columns
Anytime a single row or column is selected the result will be a Series

In [86]:
df.iloc[3, 2:5]

food      Apple
age           4
height       80
Name: Penelope, dtype: object

### Select a single row (with a list) and slice the columns - return a DataFrame
This is an exception to the above. You can select a single row (or column) and return a DataFrame and not a Series if you use a list to make the selection.

Let's repeat the selection from above using a list for the row.

In [87]:
df.iloc[[3], 2:5]

Unnamed: 0,food,age,height
Penelope,Apple,4,80


# Summary of `.iloc`
`.iloc` works the same way as `.loc` but uses only **integer location** for selection, and its slices exclude the stop value. The official Pandas documentation refers to this as selection by **position** which I find a bit confusing.
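
To tie the two indexers together, the sketch below makes the same selection once by label and once by integer location and checks that the results match. The two-column `df` is a hypothetical cut-down copy of the sample data.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"]
# Hypothetical two-column version of the sample DataFrame
df = pd.DataFrame(
    {"age": [30, 2, 12, 4, 32, 33, 69], "height": [165, 70, 120, 80, 180, 172, 150]},
    index=names,
)

# By label: the stop label 'Penelope' is included
by_label = df.loc["Aaron":"Penelope", ["height"]]

# By integer location: the stop location 4 is excluded
by_location = df.iloc[2:4, [1]]

same = by_label.equals(by_location)
print(same)  # True
```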

# The Default index
When we first read in our DataFrame, we specified the first column from the CSV file to be our index. If you don't specify an index, then Pandas will create one for you as the integers from 0 to n-1.

Let's read in the movie dataset without setting an index.

In [135]:
movie = pd.read_csv('../data/movie.csv')
movie.head()

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


### Notice the integers in the index
These integers are the default index labels for each of the rows. Let's examine the underlying index object.

In [103]:
movie.index

RangeIndex(start=0, stop=4916, step=1)

In [104]:
type(movie.index)

pandas.core.indexes.range.RangeIndex

### A RangeIndex
Pandas has many different types of index objects. A **`RangeIndex`** is similar to Python's built-in **`range`** object. The integers are not actually stored in memory and are only produced when requested.
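
A brief sketch of the comparison with `range`, using a hypothetical three-row frame rather than the movie dataset: the index only records where it starts, stops, and how it steps, and the integers appear when we ask for them.

```python
import pandas as pd

# Hypothetical three-row frame created without specifying an index
frame = pd.DataFrame({"title": ["Avatar", "Spectre", "Tangled"]})

print(type(frame.index).__name__)  # RangeIndex

# Like range, it is described by start, stop, and step
print(frame.index.start, frame.index.stop, frame.index.step)  # 0 3 1

# The integers are produced on request
labels = list(frame.index)
print(labels)  # [0, 1, 2]
```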

# Setting a Meaningful Index
The **`RangeIndex`** does not provide us with much meaning as labels for the rows.

### Setting an index with `read_csv`
At the top of the notebook, we set our index to be the first column with the **`read_csv`** function by setting the **`index_col`** parameter equal to 0. You can use either integers or the column name to set the index like this.

### Re-read movie dataset with movie title as index
There's a column in the movie dataset named **`title`**. Let's re-read in the data with it as the index.

Notice that now the titles of each movie serve as the label for each row. Also notice that the word **title** appears directly above the index. This is a bit confusing - **title** is NOT a column name, but rather the **name of the index**.

In [138]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1


#### Extract new index and its type
We now have a 'plain' **`Index`** object.

In [140]:
movie.index

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='title', length=4916)

In [110]:
type(movie.index)

pandas.core.indexes.base.Index

### Choosing a good index
First, it's not necessary to choose an index for your DataFrames. You can do all your analysis with just the default **`RangeIndex`**. Setting a column to be an index can help you identify the rows, as we did with the movie titles above.

I suggest choosing columns that are both **unique** and **descriptive**. Although uniqueness is not enforced, it helps to identify one particular row.
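
To see why uniqueness helps, consider this sketch with two hypothetical four-row frames: one indexed by name (unique) and one indexed by state (with a repeated label). The `is_unique` attribute reports whether any label repeats, and a repeated label makes `.loc` return several rows instead of one.

```python
import pandas as pd

# Hypothetical frames: the same people indexed two different ways
by_name = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "TX"], "age": [30, 2, 12, 33]},
    index=["Jane", "Niko", "Aaron", "Christina"],
)
by_state = pd.DataFrame(
    {"name": ["Jane", "Niko", "Aaron", "Christina"], "age": [30, 2, 12, 33]},
    index=["NY", "TX", "FL", "TX"],
)

print(by_name.index.is_unique)   # True
print(by_state.index.is_unique)  # False

one_row = by_name.loc["Niko"]    # unique label: a single row as a Series
many_rows = by_state.loc["TX"]   # repeated label: every matching row

print(type(one_row).__name__, type(many_rows).__name__)
print(len(many_rows))  # 2
```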

## Setting the index after read
It is possible to set the index after reading the data with the **`set_index`** method. Simply pass it the name of the column you would like to use as the index. Make sure to re-assign the DataFrame after calling the method.

In [139]:
movie = pd.read_csv('../data/movie.csv')  # read in data without setting index
movie = movie.set_index('title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1
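
The re-assignment matters because `set_index` returns a new DataFrame and leaves the original untouched. A minimal sketch with a hypothetical two-row stand-in for the movie data (the `indexed` name is only for illustration):

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie dataset
movie = pd.DataFrame({"title": ["Avatar", "Spectre"], "year": [2009.0, 2015.0]})

indexed = movie.set_index("title")

# The original still has 'title' as a regular column
print(movie.columns.tolist())    # ['title', 'year']

# The returned DataFrame uses the titles as row labels
print(indexed.index.tolist())    # ['Avatar', 'Spectre']
print(indexed.index.name)        # title
print(indexed.columns.tolist())  # ['year']
```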


# Changing Display Options
Pandas gives you the ability to change how the output on your screen is displayed. For instance, the default maximum number of columns displayed for a DataFrame is 20 (and 60 for rows), meaning that if your DataFrame has more than 20 columns then only the first and last 10 columns will be shown on the screen.

The display options all come after `pd.options.display.<option_name>` where **`<option_name>`** is one of the following:

In [123]:
dir(pd.options.display)

['chop_threshold',
 'colheader_justify',
 'column_space',
 'date_dayfirst',
 'date_yearfirst',
 'encoding',
 'expand_frame_repr',
 'float_format',
 'height',
 'html',
 'large_repr',
 'latex',
 'line_width',
 'max_categories',
 'max_columns',
 'max_colwidth',
 'max_info_columns',
 'max_info_rows',
 'max_rows',
 'max_seq_items',
 'memory_usage',
 'multi_sparse',
 'notebook_repr_html',
 'pprint_nest_depth',
 'precision',
 'show_dimensions',
 'unicode',
 'width']

### Getting option values
All of the above are attributes that hold a scalar value. Let's output some of the default option values:

In [124]:
pd.options.display.max_columns

20

In [125]:
pd.options.display.max_rows

60

In [126]:
pd.options.display.max_colwidth

50

### Setting a new option value
To set a new option value, assign a number value to it like you do any other variable. Let's change the maximum number of columns to 40 so that we can see every column in the movie dataset.

In [141]:
pd.options.display.max_columns = 40

In [142]:
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,,,,Documentary,,8,,,,,7.1
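
Assigning to `pd.options.display` changes the setting for the rest of the session. If you only want a different value for a few lines, one alternative is `pd.option_context`, which restores the previous value when the `with` block ends. This is a sketch of that approach, not something the lesson itself relies on; `pd.get_option` reads the same setting by its string name.

```python
import pandas as pd

# Read the current value by its string name
before = pd.get_option("display.max_columns")

# Temporarily raise the limit inside the with block only
with pd.option_context("display.max_columns", 40):
    inside = pd.get_option("display.max_columns")

# The previous value is restored afterwards
after = pd.get_option("display.max_columns")

print(inside)           # 40
print(after == before)  # True
```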


# Exercises
For the following exercises, make sure to use the movie dataset with **`title`** set as the index. It's good practice to shorten your output with the **`head`** method.

### Problem 1
<span  style="color:green; font-size:16px">Select the column with the director's name as a Series</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Select the columns with the director's name and number of Facebook likes.</span>

In [112]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">Select all columns for the movie 'The Dark Knight Rises'.</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Select all columns for the movies 'Tangled' and 'Avatar'.</span>

In [131]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">What years were 'Tangled' and 'Avatar' made and what were their IMDB scores?</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Can you tell what the data type of the `year` column is by just looking at its values?</span>

In [132]:
# Turn this into a markdown cell and write your answer here

### Problem 7
<span  style="color:green; font-size:16px">Use a single method to output the data type and number of non-missing values of `year`. Is it missing any?</span>

In [None]:
# your code here

### Problem 8
<span  style="color:green; font-size:16px">Select every 100th movie between 'Tangled' and 'Forrest Gump'. Why doesn't 'Forrest Gump' appear in the results?</span>

In [133]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Select the rows with integer location 10, 5, and 1</span>

In [134]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Select the columns with integer location 10, 5, and 1</span>

In [None]:
# your code here

### Problem 11
<span  style="color:green; font-size:16px">Select rows with integer location 100 up to but not including 105 along with the column at integer location 5.</span>

In [None]:
# your code here