# Setting a Meaningful Index

The index of a DataFrame provides a label for each of the rows. If not explicitly provided, pandas will use the sequence of consecutive integers beginning at 0 as the index. In this chapter, we learn how to set one of the columns of the DataFrame as the new index so that it provides a more meaningful label to each row.

## Extracting the components of a DataFrame
A DataFrame consists of three components: the index, the columns, and the data. It is possible to extract each component and assign it to its own variable. Let's read in a small dataset to show how this is done. Notice that when we read in the data, we choose the first column to be the index by setting the `index_col` parameter to 0.

In [None]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

### The attributes `index`, `columns`, and `values`
The index, columns, and data are each separate objects and can be extracted from the DataFrame. Notice that each of these objects is accessed as an attribute and NOT as a method, so no parentheses follow the name. Let's assign them to their own variables.

In [None]:
index = df.index
columns = df.columns
data = df.values

### View these objects
Let's get a visual representation of these objects by outputting them to the screen.

In [None]:
index

In [None]:
columns

In [None]:
data

### Find the type of these objects
The output of these objects looks correct, but we don't know the exact type of each one. Let's find the type of each object.

In [None]:
type(index)

In [None]:
type(columns)

In [None]:
type(data)

### pandas `Index` type
pandas has a special type of object called an `Index`. This object is similar to a list or a one-dimensional array. You can think of it as a sequence of labels for either the rows or the columns. You will rarely deal with this object directly, so we will not go into further detail about it here. Notice that both the index and the columns are of the same type.

### Two-dimensional numpy array
The values are returned as a single two-dimensional numpy array.
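
The extraction steps above can be sketched on a small DataFrame built by hand. The names `sample`, `sample_index`, `sample_columns`, and `sample_data` and the values in them are made up for illustration; they are not the contents of `sample_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, built inline so the example runs without any file
sample = pd.DataFrame(
    {'color': ['blue', 'red', 'green'], 'age': [30, 25, 41]},
    index=['Ana', 'Ben', 'Cal'],
)

sample_index = sample.index      # row labels
sample_columns = sample.columns  # column labels
sample_data = sample.values      # underlying data as a numpy array

print(type(sample_index))
print(type(sample_columns))
print(type(sample_data))
print(sample_data.shape)  # (number of rows, number of columns)
```

The shape has two entries, one for rows and one for columns, which is what makes the array two-dimensional.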

### Operating with the DataFrame as a whole
You will rarely need to operate on these components directly and will instead work with the entire DataFrame. Still, it is important to understand that they are separate components and that you can access them directly if needed.

## Extracting the components of a Series
Similarly, we can extract the two Series components - the index and the data. Let's first select a single column from our DataFrame so that we have a Series.

In [None]:
color = df['color']
color

Verify that `color` is a Series and then extract its components.

In [None]:
type(color)

In [None]:
color.index

In [None]:
color.values
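
As a rough sketch of the same idea without the data file, the Series below is built by hand. The name `age` and its values are hypothetical, and a numeric Series is used so that `values` is a plain one-dimensional numpy array.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with explicit row labels
age = pd.Series([30, 25, 41], index=['Ana', 'Ben', 'Cal'], name='age')

age_index = age.index  # the labels
age_data = age.values  # the data

print(type(age))
print(age_index)
print(age_data)
print(age_data.ndim)  # a Series has one dimension of data
```

Unlike a DataFrame, a Series has no `columns` component, so only the index and the data can be extracted.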

## The default index
If you don't specify an index when first reading in a DataFrame, then pandas will create one for you as integers beginning at 0. An index always exists, even if it just appears to be the row number. Let's read in the movie dataset without explicitly setting an index.

In [None]:
movie = pd.read_csv('../data/movie.csv')
movie.head(3)

### Notice the integers in the index
These integers are the default index labels for each of the rows. Let's examine the underlying index object.

In [None]:
idx = movie.index
idx

In [None]:
type(idx)

### The RangeIndex
pandas has various types of index objects. A `RangeIndex` is the simplest: it represents a sequence of evenly spaced integers which, for the default index, begins at 0 and increases by 1. It is similar to a Python `range` object in that only the start, stop, and step are stored, rather than every value.
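
A small sketch of the default index on a hand-built DataFrame (the name `df_default` and its values are made up for illustration). It shows the three numbers that a `RangeIndex` stores and compares the index with a Python `range`.

```python
import pandas as pd

# No index is given, so pandas creates the default one
df_default = pd.DataFrame({'x': [10, 20, 30, 40]})
ri = df_default.index

print(ri)                                  # RangeIndex(start=0, stop=4, step=1)
print(ri.start, ri.stop, ri.step)          # 0 4 1
print(list(ri) == list(range(4)))          # True
```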

### Select a value from the index
The index is a complex object on its own and has many attributes and methods. The minimum we should know about an index is how to select values from it. We can select single values from an index just like we do with a Python list, by placing the integer location of the item we want within the square brackets. Here, we select the 6th item (integer location 5) from the index.

In [None]:
idx[5]

### A numpy array underlies the index
To get the underlying values into a numpy array, use the `values` attribute. This is similar to how we get the underlying data from a pandas DataFrame.

In [None]:
idx.values

If you don't assign the index to a variable, you can retrieve the array from the DataFrame by chaining the attributes together like this:

In [None]:
movie.index.values

## Setting an index on read
The `read_csv` function provides dozens of parameters that allow us to read in a wide variety of text files. The `index_col` parameter may be used to select a particular column as the index. We can either use the column name or its integer location.
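
To illustrate the two ways of specifying `index_col` without relying on a file, the sketch below reads a tiny CSV from an in-memory string. The text in `csv_text` is invented for this example and only borrows the `title` column name from the movie dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content held in a string instead of a file
csv_text = "title,year,rating\nAlpha,2001,7.1\nBeta,2005,6.4\n"

# Select the index column by name ...
by_name = pd.read_csv(io.StringIO(csv_text), index_col='title')

# ... or by integer location (0 is the first column)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_name)
print(by_name.equals(by_position))
```

Both calls produce the same DataFrame here because `title` is the first column of the text.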

### Reread the movie dataset with the movie title as the index
There's a column in the movie dataset named `title`. Let's reread the data with it as the index.

In [None]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head(3)

Notice that now the titles of each movie serve as the label for each row. Also notice that the word **title** appears directly above the index. This is a bit confusing - **title** is NOT a column name, but rather the **name** of the index.
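
One way to confirm the distinction is to check the index's `name` attribute and the column labels separately. This is a sketch on a hand-built DataFrame; `films` and its values are hypothetical stand-ins for the movie data.

```python
import pandas as pd

# Hypothetical data with a named index
films = pd.DataFrame(
    {'year': [2001, 2005]},
    index=pd.Index(['Alpha', 'Beta'], name='title'),
)

print(films.index.name)           # title
print('title' in films.columns)   # False
print(list(films.columns))        # ['year']
```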

### Extract the new index and output its type
Let's extract this new index and find its exact type.

In [None]:
idx2 = movie.index
idx2

In [None]:
type(idx2)

### Selecting values from this index
Just like we did with our `RangeIndex`, we use the brackets operator to select a single index value.

In [None]:
idx2[105]

### Selection with slice notation
As with Python lists, you can select a range of values using slice notation. Provide the start, stop, and step components separated by colons, like this: `start:stop:step`

In [None]:
idx2[100:120:4]

### Selection with a list of integers
You can select multiple individual values with a list of integers. Python lists do not support this type of selection.

In [None]:
nums = [1000, 453, 713, 2999]
idx2[nums]
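
The three kinds of selection can be compared side by side on a short index. The labels in `titles` are invented for illustration; the last few lines show that a plain Python list rejects a list of integers.

```python
import pandas as pd

# Hypothetical index of six labels
titles = pd.Index(['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta'])

single = titles[2]          # one integer location
sliced = titles[1:6:2]      # start:stop:step
picked = titles[[5, 0, 3]]  # list of integer locations, in any order

print(single)
print(list(sliced))
print(list(picked))

# A plain Python list does not accept a list of integers
try:
    ['a', 'b', 'c'][[0, 2]]
except TypeError as err:
    print('plain list:', err)
```

Selecting with a slice or a list returns another `Index`, while a single integer returns the label itself.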

## Choosing a good index
Before even considering using one of the columns as an index, know that it's not a necessity. You can complete all of your analysis with just the default `RangeIndex`. 

Setting a column to be an index can make certain analyses easier in some situations, so it is something to consider. If you do choose to set an index for your DataFrame, I suggest using columns that are both **unique** and **descriptive**. pandas does not enforce uniqueness for its index, so the same value may repeat multiple times. That said, a good index will have unique values to identify each row.
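
A quick way to check the uniqueness criterion is the `is_unique` attribute, which exists on both a Series and an `Index`. The sketch below uses a hand-built DataFrame named `candidates` with made-up values, in which one title repeats.

```python
import pandas as pd

# Hypothetical data in which the title 'Alpha' appears twice
candidates = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Alpha'],
    'year': [2001, 2005, 2010],
})

# Check a column before using it as the index
title_is_unique = candidates['title'].is_unique
print(title_is_unique)

# pandas still accepts the repeated labels as an index
dup = pd.DataFrame({'year': [2001, 2005, 2010]},
                   index=['Alpha', 'Beta', 'Alpha'])
print(dup.index.is_unique)
print(dup.index.duplicated())  # marks each repeat after the first occurrence
```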

## Setting the index with `set_index`
It is possible to set the index after reading the data with the `set_index` method. Pass it the name of the column you would like to use as the index. Below, we read in our data without setting an index and then call `set_index` on the result.

In [None]:
movie = pd.read_csv('../data/movie.csv')
movie = movie.set_index('title')
movie.head(3)

### Reassigned `movie` variable

In the last code block, we reassigned the variable name `movie` to the result of the `set_index` method. This is necessary because `set_index` returns a new DataFrame and does not modify the calling DataFrame. If we run the same commands but do not assign the result of `set_index`, the original DataFrame is left unchanged. Let's verify this by repeating the code above with a new variable name, `movie2`, and without assigning the result of the second line. Notice that `title` is still a column and has not been set as the index.

In [None]:
movie2 = pd.read_csv('../data/movie.csv')
movie2.set_index('title')
movie2.head(3)
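
The same behavior can be checked without the data file. In this sketch, `films` and `indexed` are hypothetical names and the values are made up; the point is only that the original DataFrame keeps its column and its default index.

```python
import pandas as pd

# Hypothetical data with a default RangeIndex
films = pd.DataFrame({'title': ['Alpha', 'Beta'], 'year': [2001, 2005]})

# set_index returns a new DataFrame; films itself is not modified
indexed = films.set_index('title')

print(list(films.columns))    # ['title', 'year']
print(list(indexed.columns))  # ['year']
print(type(films.index))      # still the default RangeIndex
print(indexed.index.name)     # title
```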

## Changing Display Options
pandas gives you the ability to change how the output on your screen is displayed. For instance, the default number of columns displayed for a DataFrame is 20, meaning that if your DataFrame has more than 20 columns then only the first and last 10 columns will be shown on the screen.

### Get current option value with `get_option`

You can retrieve any option with the `get_option` function. Notice that this is not a DataFrame method, but a function accessed directly from `pd`. It is not necessary to remember the option names; they are all listed in the docstring of `get_option`. The official documentation also provides descriptions for all [available options][1]. Below, we retrieve three of the most commonly changed options.

[1]: http://pandas.pydata.org/pandas-docs/stable/options.html#available-options

In [None]:
pd.get_option('display.max_columns')

In [None]:
pd.get_option('display.max_rows')

In [None]:
pd.get_option('display.max_colwidth')

### Use the `set_option` function to change an option value
To set a new option value, use the `set_option` function. Its usage is a bit unusual: pass it the option name as a string and follow it immediately with the value you want to set it to. You can set several options in one call by continuing this pattern of option name followed by new value. Below, we set the maximum number of columns to 40 and the maximum number of rows to 4. We will now be able to view all the columns in the movie DataFrame.

In [None]:
pd.set_option('display.max_columns', 40, 'display.max_rows', 4)

In [None]:
movie
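
As a sketch of the full round trip, the block below sets the same two options, reads them back with `get_option`, and then restores each one individually with `pd.reset_option`. The variable names are hypothetical, and restoring one option at a time is just one way to undo the change.

```python
import pandas as pd

# Set two options in one call: name, value, name, value
pd.set_option('display.max_columns', 40, 'display.max_rows', 4)

cols_after_set = pd.get_option('display.max_columns')
rows_after_set = pd.get_option('display.max_rows')
print(cols_after_set, rows_after_set)

# Restore each option to its default value
pd.reset_option('display.max_columns')
pd.reset_option('display.max_rows')

rows_after_reset = pd.get_option('display.max_rows')
print(rows_after_reset)
```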

## Exercises

You may wish to change the display options before completing the exercises.

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be something other than movie title. Are there any other good columns to use as an index?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Use `set_index` to set the index and keep the column as part of the data. Read the docstring to find the parameter that controls this functionality.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Read in the movie DataFrame and set the index as the title column. Assign the index to its own variable and output the last 10 movie titles.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Use an integer instead of the column name for `index_col` when reading in the data using `read_csv`. What does it do?</span>

### Exercise 5
<span  style="color:green; font-size:16px">Use `pd.reset_option('all')` to reset the options to their default values. Test that this worked. </span>