# Setting a Meaningful Index

## Overview

### Objectives

* Extract the components of a DataFrame and verify their type
* Know that a `RangeIndex` is the default DataFrame index
* Select values from the index like a list
* Understand what makes a meaningful index
* Use the `index_col` parameter of `read_csv` to set an index on read
* Use the `set_index` method to set an index after read

## Extracting the components of a DataFrame
A DataFrame consists of three components: the index, the columns, and the data. Each component can be extracted and assigned to its own variable. Let's read in a small dataset to show how this is done. Notice that when we read in the data, we choose the first column to be the index with the `index_col` parameter. More on this later.

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### The attributes `index`, `columns`, and `values`
The index, columns, and data are each separate objects. Notice that each of these objects is extracted as an attribute and NOT a method, so no parentheses are needed. Let's assign each one to its own variable.

In [6]:
index = df.index
columns = df.columns
data = df.values

### View these objects
Let's output each of these objects:

In [7]:
index

Index([u'Jane', u'Niko', u'Aaron', u'Penelope', u'Dean', u'Christina',
       u'Cornelia'],
      dtype='object')

In [8]:
columns

Index([u'state', u'color', u'food', u'age', u'height', u'score'], dtype='object')

In [9]:
data

array([['NY', 'blue', 'Steak', 30L, 165L, 4.6],
       ['TX', 'green', 'Lamb', 2L, 70L, 8.3],
       ['FL', 'red', 'Mango', 12L, 120L, 9.0],
       ['AL', 'white', 'Apple', 4L, 80L, 3.3],
       ['AK', 'gray', 'Cheese', 32L, 180L, 1.8],
       ['TX', 'black', 'Melon', 33L, 172L, 9.5],
       ['TX', 'red', 'Beans', 69L, 150L, 2.2]], dtype=object)

### What are these objects?
The output of these objects looks correct but we don't know the exact type of each one. Let's find out:

In [10]:
type(index)

pandas.core.indexes.base.Index

In [11]:
type(columns)

pandas.core.indexes.base.Index

In [12]:
type(data)

numpy.ndarray

### pandas `Index` type
pandas has a special type of object called an `Index`. This object is similar to a list or a one-dimensional array. You can think of it as a sequence of labels for either the rows or the columns. You will not deal with this object directly very often, so we will not go into further detail about it here. Notice that both the index and the columns are of the same type.

### numpy's `ndarray`
The data is returned as a numpy `ndarray` (which stands for n-dimensional array). It is this array that does the bulk of the work in pandas.
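
As a minimal sketch of the three components, the example below builds a tiny DataFrame in memory instead of reading a file. The `people` DataFrame and its values are hypothetical stand-ins for the sample data, and `isinstance` is used in place of `type` so the check does not depend on the exact class path of a given pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row DataFrame standing in for sample_data.csv
people = pd.DataFrame(
    {"state": ["NY", "TX", "FL"], "age": [30, 2, 12], "score": [4.6, 8.3, 9.0]},
    index=["Jane", "Niko", "Aaron"],
)

# All three components are attributes, so there are no parentheses
index = people.index
columns = people.columns
data = people.values

print(isinstance(index, pd.Index))    # row labels
print(isinstance(columns, pd.Index))  # column labels
print(isinstance(data, np.ndarray))   # the underlying values
print(data.shape)                     # (3, 3): three rows, three columns
```

Recent versions of pandas also provide a `to_numpy` method on the DataFrame, which returns the same kind of array.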

### Operating with the DataFrame as a whole
You will rarely need to operate on these components directly. Most of the time you will work with the entire DataFrame.

## Extracting the components of a Series
Similarly, we can extract the two components of a Series: the index and the data. Let's first select a single column as a Series:

In [13]:
color = df['color']
color

Jane          blue
Niko         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object

In [14]:
color.index

Index([u'Jane', u'Niko', u'Aaron', u'Penelope', u'Dean', u'Christina',
       u'Cornelia'],
      dtype='object')

In [15]:
color.values

array(['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
      dtype=object)
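
The same idea can be sketched with a Series built by hand. The `age` Series below is hypothetical data, chosen to be numeric so that the `values` attribute is a plain numpy array regardless of how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: one column of data plus row labels
age = pd.Series([30, 2, 12], index=["Jane", "Niko", "Aaron"], name="age")

labels = age.index   # the index component
vals = age.values    # the data component

print(list(labels))
print(vals)
print(hasattr(age, "columns"))  # a Series has no columns component
```

A Series has only two components because it represents a single column. Its `name` plays the role that the column label played in the DataFrame.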

## More on the index
The index is an important (and sometimes confusing) part of both the Series and the DataFrame. It provides us with a label for each row. In the notebook output it is always displayed in **bold**, and it is NOT a column of data. It is a separate component of our DataFrame.

### The default index
If you don't specify an index when first reading in a DataFrame, then pandas will create one for you as integers beginning at 0. An index always exists even if it just appears to be the row number. Let's read in the movie dataset without setting an index.

In [16]:
movie = pd.read_csv('../data/movie.csv')
movie.head()

,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
3,The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
4,Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,...,,,Documentary,,8,,,,,7.1


### Notice the integers in the index
These integers are the default index labels for each of the rows. Let's examine the underlying index object.

In [17]:
idx = movie.index
idx

RangeIndex(start=0, stop=4916, step=1)

In [18]:
type(idx)

pandas.core.indexes.range.RangeIndex

### A RangeIndex
pandas has many different types of index objects. A `RangeIndex` is similar to a Python `range` object. The values of a `RangeIndex` are not actually stored in memory. They are computed only when requested.
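
A short sketch of the default index, using a hypothetical three-row DataFrame built without specifying an index. It compares the resulting `RangeIndex` to a Python `range` with the same bounds.

```python
import pandas as pd

# Hypothetical data; no index is given, so pandas supplies the default
small = pd.DataFrame(
    {"title": ["Avatar", "Spectre", "Tangled"], "year": [2009, 2015, 2010]}
)

idx = small.index
print(idx)                             # RangeIndex(start=0, stop=3, step=1)
print(isinstance(idx, pd.RangeIndex))
print(idx.start, idx.stop, idx.step)   # the same three pieces a range has
print(list(idx) == list(range(3)))     # same labels as range(3)
```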

### Select a value from the index
The index is a complex object on its own and has many features (many more than a Python list). We will not cover it in depth because it is used infrequently. That said, the minimum we should know about an index is how to select values from it. We use **integer location**, just as if it were a Python list, to make selections. Let's select a single value from it.

In [19]:
idx[5]

5

### A numpy array underlies the index
To get the underlying numpy array, use the `values` attribute. This is similar to how we get the underlying data from a pandas DataFrame.

In [20]:
idx.values

array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)

If you don't assign the index to a variable, you can retrieve the array from the DataFrame by chaining the attributes together like this:

In [21]:
movie.index.values

array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)
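
The two ideas above, selecting by integer location and retrieving the underlying array, can be tried on a small index created directly. The ten-element `RangeIndex` here is an assumption made for illustration and is much shorter than the movie index.

```python
import numpy as np
import pandas as pd

# Hypothetical index with the labels 0 through 9
idx = pd.RangeIndex(start=0, stop=10, step=1)

sixth = idx[5]    # integer location 5, just like a list
last = idx[-1]    # negative locations count from the end
arr = idx.values  # the underlying numpy array

print(sixth, last)
print(isinstance(arr, np.ndarray))
print(arr)
```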

## Setting an index on read
pandas allows us to use one of the columns as the index when reading in the data.

### Setting an index when reading in the data with `read_csv`
The `read_csv` function gives us dozens of parameters that allow us to read in a wide variety of CSV files. The `index_col` parameter may be used to select a particular column as the index. We can use either the column name or its integer location.
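
Here is a minimal sketch of both forms of `index_col`. To stay self-contained it reads from an in-memory string through `io.StringIO` rather than from a file. The `csv_text` contents are hypothetical and contain only a few of the movie columns.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for movie.csv
csv_text = "title,year,imdb_score\nAvatar,2009,7.9\nSpectre,2015,6.8\n"

# Select the index column by name ...
by_name = pd.read_csv(io.StringIO(csv_text), index_col="title")

# ... or by integer location (title is the first column, location 0)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_name.index.tolist())
print(list(by_name.columns))
print(by_name.equals(by_position))
```

Both calls produce the same DataFrame here because `title` is the first column in the text.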

### Reread the movie dataset with the movie title as the index
There's a column in the movie dataset named `title`. Let's reread the data with it as the index.

In [22]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1




Notice that the title of each movie now serves as the label for its row. Also notice that the word **title** appears directly above the index. This is a bit confusing: **title** is NOT a column name, but rather the **name of the index**.
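
The difference between an index name and a column name can be checked directly. The sketch below assumes a hypothetical two-row DataFrame whose index is created with `pd.Index` and given the name `title`.

```python
import pandas as pd

# Hypothetical DataFrame whose index is named 'title'
movies = pd.DataFrame(
    {"year": [2009, 2015]},
    index=pd.Index(["Avatar", "Spectre"], name="title"),
)

print(movies.index.name)          # the name of the index
print(list(movies.columns))       # 'title' is not among the columns
print("title" in movies.columns)
```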

### Extract the new index and output its type
We again have an `Index` object.

In [23]:
idx2 = movie.index
idx2

Index([u'Avatar', u'Pirates of the Caribbean: At World's End', u'Spectre',
       u'The Dark Knight Rises', u'Star Wars: Episode VII - The Force Awakens',
       u'John Carter', u'Spider-Man 3', u'Tangled', u'Avengers: Age of Ultron',
       u'Harry Potter and the Half-Blood Prince',
       ...
       u'Primer', u'Cavite', u'El Mariachi', u'The Mongol King', u'Newlyweds',
       u'Signed Sealed Delivered', u'The Following', u'A Plague So Pleasant',
       u'Shanghai Calling', u'My Date with Drew'],
      dtype='object', name=u'title', length=4916)

In [24]:
type(idx2)

pandas.core.indexes.base.Index

## Selecting values from this index
Just like we did with our `RangeIndex`, we use the brackets operator to select a single index value.

In [25]:
idx2[105]

'Poseidon'

### Selection with slice notation
As with Python lists, you can select a range of values using slice notation, with the three components start, stop, and step separated by colons: `start:stop:step`.

In [26]:
idx2[100:120:4]

Index([u'The Fast and the Furious', u'The Sorcerer's Apprentice', u'Warcraft',
       u'Transformers', u'Hancock'],
      dtype='object', name=u'title')

### Selection with a list of integers
You can select multiple individual values with a list of integers. 

In [27]:
nums = [1000, 453, 713, 2999]
idx2[nums]

Index([u'The Life Aquatic with Steve Zissou', u'Daredevil', u'Daddy Day Care',
       u'The Ladies Man'],
      dtype='object', name=u'title')
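
The three kinds of selection shown above (a single integer, a slice, and a list of integers) can be sketched on a short, hypothetical index of six titles, where the results are easy to verify by eye.

```python
import pandas as pd

# Hypothetical index of six titles
titles = pd.Index(
    ["Avatar", "Spectre", "Tangled", "Primer", "Cavite", "Newlyweds"],
    name="title",
)

single = titles[2]           # one label, returned as a plain string
every_other = titles[0:6:2]  # start:stop:step, returned as a new Index
picked = titles[[4, 0, 3]]   # a list of integer locations, in the order given

print(single)
print(list(every_other))
print(list(picked))
```

Selecting a single value returns the label itself, while a slice or a list returns another `Index` that keeps the name `title`.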

## Choosing a good index
First, it's never necessary to choose an index for your DataFrames. You can complete all of your analysis with just the default `RangeIndex`. Setting a column as the index can help identify the rows, as we did with the movie titles above.

I suggest choosing columns that are both **unique** and **descriptive**. Although uniqueness is not enforced, it does help when needing to identify one particular row.
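
One way to check whether a column would make a unique index is the `is_unique` attribute of a Series. This is a sketch on hypothetical data, not a rule for choosing an index. Descriptiveness still has to be judged by eye.

```python
import pandas as pd

# Hypothetical data: titles are all different, colors repeat
movies = pd.DataFrame(
    {
        "title": ["Avatar", "Spectre", "Tangled"],
        "color": ["Color", "Color", "Color"],
    }
)

title_is_unique = movies["title"].is_unique
color_is_unique = movies["color"].is_unique

print(title_is_unique)  # every value appears once
print(color_is_unique)  # the same value appears several times
```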

## Setting the index after read with the `set_index` method
It is possible to set the index after reading the data with the `set_index` method. Pass it the name of the column you would like to use as the index. Below, we read in our data without setting an index and then call `set_index` on the result.

In [28]:
movie = pd.read_csv('../data/movie.csv')
movie = movie.set_index('title')
movie.head()

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,...,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,,,Doug Walker,131.0,Doug Walker,131.0,Rob Walker,12.0,...,,,Documentary,,8,,,,,7.1


### Reassigned `movie` variable

Notice above that we reassigned the variable name `movie` to the result of the `set_index` call. This is because `set_index` returns an entirely new DataFrame. It does not change the original DataFrame. We say the operation **does NOT happen in-place**.
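
The not-in-place behavior can be seen by keeping both the original and the result under separate names. The two-row `original` DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with the default index
original = pd.DataFrame({"title": ["Avatar", "Spectre"], "year": [2009, 2015]})

# set_index returns a new DataFrame; 'original' is left untouched
indexed = original.set_index("title")

print(list(original.columns))  # still has the title column
print(list(indexed.columns))   # title has moved out of the columns
print(indexed.index.name)      # and is now the name of the index
```

If the result of `set_index` is not assigned to a variable, the new DataFrame is simply discarded and the index of the original never changes.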

## Changing Display Options
pandas gives you the ability to change how the output on your screen is displayed. For instance, the default number of columns displayed for a DataFrame is 20, meaning that if your DataFrame has more than 20 columns then only the first and last 10 columns will be shown on the screen.

### Get current option value with `get_option`
You can retrieve any option with the `get_option` function. Notice that this is not a DataFrame method. It is a function that you access directly from `pd`. It is not necessary to remember the option names. They are all listed in the docstring of the `get_option` function. Below are three of the most common options to change.

In [29]:
pd.get_option('display.max_columns')

20

In [30]:
pd.get_option('display.max_rows')

60

In [31]:
pd.get_option('display.max_colwidth')

50

### Use the `set_option` function to change an option value
To set a new option value, use the `set_option` function. You can set as many options as you would like at one time. Its usage is a bit strange. Pass it the option name as a string and follow it immediately with the value you want to set it to. Continue this pattern of option name followed by new value to set as many options as you desire. Below, we set the maximum number of columns to 40 and the maximum number of rows to 8. We will now be able to view all the columns in the movie DataFrame.

In [32]:
pd.set_option('display.max_columns', 40, 'display.max_rows', 8)
movie

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8
The Dark Knight Rises,2012.0,Color,PG-13,164.0,Christopher Nolan,22000.0,Tom Hardy,27000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,23000.0,448130642.0,Action|Thriller,813.0,1144337,deception|imprisonment|lawlessness|police offi...,English,USA,250000000.0,8.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
The Following,,Color,TV-14,43.0,,,Natalie Zea,841.0,Valorie Curry,593.0,Sam Underwood,319.0,,Crime|Drama|Mystery|Thriller,43.0,73839,cult|fbi|hideout|prison escape|serial killer,English,USA,,7.5
A Plague So Pleasant,2013.0,Color,,76.0,Benjamin Roberds,0.0,Eva Boehnke,0.0,Maxwell Moody,0.0,David Chandler,0.0,,Drama|Horror|Thriller,13.0,38,,English,USA,1400.0,6.3
Shanghai Calling,2012.0,Color,PG-13,100.0,Daniel Hsia,0.0,Alan Ruck,946.0,Daniel Henney,719.0,Eliza Coupe,489.0,10443.0,Comedy|Drama|Romance,14.0,1255,,English,USA,,6.3
My Date with Drew,2004.0,Color,PG,90.0,Jon Gunn,16.0,John August,86.0,Brian Herzlinger,23.0,Jon Gunn,16.0,85222.0,Documentary,43.0,4285,actress name in title|crush|date|four word tit...,English,USA,1100.0,6.6


### All available options
See the documentation for all the [available options](http://pandas.pydata.org/pandas-docs/stable/options.html#available-options).
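
As a sketch of the option workflow described in this section, the example below reads two options, changes them with the name-followed-by-value pattern, and then restores each one individually with `reset_option`. It assumes a fresh session in which the options still hold their default values.

```python
import pandas as pd

# Remember the starting values (assumed to be the defaults)
rows_before = pd.get_option("display.max_rows")
cols_before = pd.get_option("display.max_columns")

# Option names and values alternate: name, value, name, value
pd.set_option("display.max_columns", 40, "display.max_rows", 8)
cols_after_set = pd.get_option("display.max_columns")
rows_after_set = pd.get_option("display.max_rows")
print(cols_after_set, rows_after_set)  # 40 8

# Put just these two options back to their defaults
pd.reset_option("display.max_columns")
pd.reset_option("display.max_rows")
cols_after_reset = pd.get_option("display.max_columns")
rows_after_reset = pd.get_option("display.max_rows")
print(cols_after_reset == cols_before, rows_after_reset == rows_before)
```

Resetting options one at a time leaves every other display setting as it was.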

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be something other than movie title. Are there any other good columns to use as an index?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Use `set_index` to set the index and keep the column as part of the data.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Assign the index of the movie DataFrame that has the titles in the index to its own variable. Output the last 10 movie titles.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Use an integer instead of the column name for **`index_col`** when reading in the data using **`read_csv`**. What does it do?</span>

### Exercise 5
<span  style="color:green; font-size:16px">Use `pd.reset_option('all')` to reset the options to their default values. Test that this worked. </span>