
In [None]:
# Run the following lines if you are using Google Colab
!rm -rf Pandas
!git clone https://github.com/Wuebbelt/Pandas.git

Cloning into 'Pandas'...

# Setting a Meaningful Index

The index of a DataFrame provides a label for each of the rows. If not explicitly provided, pandas uses the sequence of consecutive integers beginning at 0 as the index. In this chapter, we learn how to set one of the columns of the DataFrame as the new index so that it provides a more meaningful label for each row.
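As a minimal sketch of the default behavior described above, the cell below builds a tiny DataFrame by hand. The `people` variable and its values are made up for illustration and are not part of the chapter's datasets.

```python
import pandas as pd

# Hypothetical toy data, not the chapter's sample_data.csv
people = pd.DataFrame({"name": ["Jane", "Niko", "Aaron"], "age": [30, 2, 12]})

# No index was provided, so pandas labels the rows with consecutive integers
default_labels = list(people.index)
print(default_labels)  # [0, 1, 2]
```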

## Setting an index of a DataFrame

Instead of using the default index for your pandas DataFrame, you can use the `set_index` method to use one of the columns as the index. Let's read in a small dataset to show how this is done.

In [None]:
import pandas as pd
df = pd.read_csv('Pandas/sample_data.csv')
df

Unnamed: 0,name,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0
3,Penelope,AL,white,Apple,4,80,3.3
4,Dean,AK,gray,Cheese,32,180,1.8
5,Christina,TX,black,Melon,33,172,9.5
6,Cornelia,TX,red,Beans,69,150,2.2


### The `set_index` method

Pass the name of a column to the `set_index` method to use that column as the index. The column is no longer part of the data of the returned DataFrame.

In [None]:
df.set_index('name')

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### A new DataFrame copy is returned

By default, the `set_index` method returns an entirely new DataFrame and does not modify the original calling DataFrame. Let's verify this by outputting the original DataFrame.

In [None]:
df

Unnamed: 0,name,state,color,food,age,height,score
0,Jane,NY,blue,Steak,30,165,4.6
1,Niko,TX,green,Lamb,2,70,8.3
2,Aaron,FL,red,Mango,12,120,9.0
3,Penelope,AL,white,Apple,4,80,3.3
4,Dean,AK,gray,Cheese,32,180,1.8
5,Christina,TX,black,Melon,33,172,9.5
6,Cornelia,TX,red,Beans,69,150,2.2


### Assigning the result of `set_index` to a variable name

We must assign the result of the `set_index` method to a variable name if we want to use the new DataFrame with its new index.

In [None]:
df2 = df.set_index('name')
df2

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Number of columns decreased

The new DataFrame, `df2`, has one fewer column than the original because the `name` column was set as the index. Let's verify this:

In [None]:
df.shape

(7, 7)

In [None]:
df2.shape

(7, 6)
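The same check can be reproduced without the sample file. The following is only a sketch on a hand-built DataFrame; the `toy` and `toy2` names and their values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical three-row stand-in for the sample data
toy = pd.DataFrame({
    "name": ["Jane", "Niko", "Aaron"],
    "state": ["NY", "TX", "FL"],
    "age": [30, 2, 12],
})

# set_index returns a new DataFrame; toy itself is left untouched
toy2 = toy.set_index("name")

print(toy.shape)   # (3, 3)
print(toy2.shape)  # (3, 2)
print("name" in toy.columns, "name" in toy2.columns)  # True False
```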

## Accessing the index, columns, and data

The index, columns, and data are each separate objects that can be accessed from the DataFrame as attributes and NOT methods. Let's assign each of them to their own variable name beginning with the index and output it to the screen.

In [None]:
index = df2.index
index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [None]:
columns = df2.columns
columns

Index(['state', 'color', 'food', 'age', 'height', 'score'], dtype='object')

In [None]:
data = df2.values
data

array([['NY', 'blue', 'Steak', 30, 165, 4.6],
       ['TX', 'green', 'Lamb', 2, 70, 8.3],
       ['FL', 'red', 'Mango', 12, 120, 9.0],
       ['AL', 'white', 'Apple', 4, 80, 3.3],
       ['AK', 'gray', 'Cheese', 32, 180, 1.8],
       ['TX', 'black', 'Melon', 33, 172, 9.5],
       ['TX', 'red', 'Beans', 69, 150, 2.2]], dtype=object)

### Find the type of these objects

The output of these objects looks correct, but we don't know the exact type of each one. Let's find out the type of each object.

In [None]:
type(index)

pandas.core.indexes.base.Index

In [None]:
type(columns)

pandas.core.indexes.base.Index

In [None]:
type(data)

numpy.ndarray

### Accessing the components does not change the DataFrame

Accessing these components does nothing to our DataFrame. It merely gives us a variable to reference each of these components. Let's verify that the DataFrame remains unchanged.

In [None]:
df2

name,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### pandas `Index` type

Both the index and the columns are a special type of object named `Index`, which is similar to a list. You can think of it as a sequence of labels for either the rows or the columns. You will not deal with this object much directly, so we will not go into further detail about it here.
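To make the list comparison a little more concrete, here is a small sketch using a hand-built `Index`. The `labels` variable is made up for illustration. One difference from a list worth knowing is that an `Index` cannot be modified in place.

```python
import pandas as pd

# Hypothetical index built directly, just for illustration
labels = pd.Index(["Jane", "Niko", "Aaron"], name="name")

print(len(labels))       # 3
print("Niko" in labels)  # True
print(list(labels))      # ['Jane', 'Niko', 'Aaron']

# Unlike a list, an Index does not allow item assignment
mutation_error = None
try:
    labels[0] = "Janet"
except TypeError as err:
    mutation_error = err
    print("cannot modify:", err)
```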

### Two-dimensional numpy array

The values are returned as a single two-dimensional numpy array.
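A quick way to confirm this is to inspect the array's `ndim`, `shape`, and `dtype` attributes. The sketch below uses a made-up two-row DataFrame named `toy`; because its columns mix strings and numbers, the resulting array has the `object` dtype, just like the output shown earlier.

```python
import pandas as pd

# Hypothetical mixed-type data, for illustration only
toy = pd.DataFrame(
    {"state": ["NY", "TX"], "age": [30, 2], "score": [4.6, 8.3]},
    index=["Jane", "Niko"],
)

arr = toy.values
print(arr.ndim)   # 2
print(arr.shape)  # (2, 3)
print(arr.dtype)  # object
```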

### Operating with DataFrame and not its components

You will rarely need to operate on these components directly; instead, you will work with the entire DataFrame. Still, it is important to understand that they are separate components and that you can access them directly if needed.

## Accessing the components of a Series

Similarly, we can access the two Series components - the index and the data. Let's first select a single column from our DataFrame so that we have a Series. When we select a column from the DataFrame as a Series, the index remains the same.

In [None]:
color = df2['color']
color

name
Jane          blue
Niko         green
Aaron          red
Penelope     white
Dean          gray
Christina    black
Cornelia       red
Name: color, dtype: object

Let's access the index and the data from the `color` Series.

In [None]:
color.index

Index(['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'], dtype='object', name='name')

In [None]:
color.values

array(['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
      dtype=object)
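The claim that the index stays the same when a column is selected can be checked with the `equals` method of the index. This is a sketch on a hand-built DataFrame; `toy` and `toy_color` are hypothetical names.

```python
import pandas as pd

# Hypothetical two-row DataFrame with a named index
toy = pd.DataFrame(
    {"color": ["blue", "green"], "age": [30, 2]},
    index=pd.Index(["Jane", "Niko"], name="name"),
)

toy_color = toy["color"]

# The Series carries the same row labels as the DataFrame it came from
print(toy_color.index.equals(toy.index))  # True
print(list(toy_color.values))             # ['blue', 'green']
```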

### The default index

If you don't specify an index when first reading in a DataFrame, then pandas creates one for you as integers beginning at 0. Let's read in the movie dataset and keep the default index.

In [None]:
movie = pd.read_csv('Pandas/movie.csv')
movie.head(3)

Unnamed: 0,title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
0,Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
1,Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
2,Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8


### Integers in the index

The integers you see above in the index are the labels for each of the rows. Let's examine the underlying index object.

In [None]:
idx = movie.index
idx

RangeIndex(start=0, stop=4916, step=1)

In [None]:
type(idx)

pandas.core.indexes.range.RangeIndex

### The RangeIndex

pandas has various types of index objects. A `RangeIndex` is the simplest index and, by default, represents the sequence of consecutive integers beginning at 0. It is similar to a Python `range` object in that the values are not actually stored in memory.
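Here is a short sketch of a `RangeIndex` built by hand, to show the parallel with `range`. The variable `r` is made up for illustration; like `range`, the object only records its `start`, `stop`, and `step`.

```python
import pandas as pd

# Build a RangeIndex directly, mirroring range(0, 5, 1)
r = pd.RangeIndex(start=0, stop=5, step=1)

print(r.start, r.stop, r.step)  # 0 5 1
print(len(r))                   # 5
print(list(r))                  # [0, 1, 2, 3, 4]
```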

### A numpy array underlies the index

The index has a `values` attribute just like the DataFrame. Use it to retrieve the underlying index values as a numpy array.

In [None]:
idx.values

array([   0,    1,    2, ..., 4913, 4914, 4915])

It's not necessary to assign the index to a variable name to access its attributes and methods. You can access it beginning from the DataFrame.

In [None]:
movie.index.values

array([   0,    1,    2, ..., 4913, 4914, 4915])

## Setting an index on read

The `read_csv` function provides dozens of parameters that allow us to read in a wide variety of text files. The `index_col` parameter may be used to select a particular column as the index. We can either use the column name or its integer location.
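The two ways of specifying `index_col` can be compared on a small in-memory file. This is only a sketch: the `csv_text` string is made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "name,state,age\nJane,NY,30\nNiko,TX,2\n"

# Select the index column by name...
by_name = pd.read_csv(io.StringIO(csv_text), index_col="name")

# ...or by integer location (0 is the first column in the file)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(by_name.index))          # ['Jane', 'Niko']
print(by_name.equals(by_position))  # True
```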

### Reread the movie dataset with the movie title as the index

There's a column in the movie dataset named `title`. Let's reread the data using it as the index.

In [None]:
movie = pd.read_csv('Pandas/movie.csv', index_col='title')
movie.head(3)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,actor3,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,Wes Studi,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,Jack Davenport,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,Stephanie Sigman,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8


Notice that now the titles of each movie serve as the label for each row. Also notice that the word **title** appears directly above the index. This is a bit confusing. The word **title** is NOT a column name. Technically, it is the **name** of the index, but this isn't important at the moment.
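If you are curious, the name of the index is available through the `name` attribute of the index. The sketch below uses a hand-built two-row DataFrame called `toy` rather than the full movie dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie data
toy = pd.DataFrame(
    {"title": ["Avatar", "Spectre"], "year": [2009, 2015]}
).set_index("title")

print(toy.index.name)     # title
print(list(toy.columns))  # ['year']
```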

### Access the new index and output its type

Let's access this new index, output its values, and verify that its type is now `Index` instead of `RangeIndex`.

In [None]:
idx2 = movie.index
idx2

Index(['Avatar', 'Pirates of the Caribbean: At World's End', 'Spectre',
       'The Dark Knight Rises', 'Star Wars: Episode VII - The Force Awakens',
       'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron',
       'Harry Potter and the Half-Blood Prince',
       ...
       'Primer', 'Cavite', 'El Mariachi', 'The Mongol King', 'Newlyweds',
       'Signed Sealed Delivered', 'The Following', 'A Plague So Pleasant',
       'Shanghai Calling', 'My Date with Drew'],
      dtype='object', name='title', length=4916)

In [None]:
type(idx2)

pandas.core.indexes.base.Index

### Select a value from the index

The index is a complex object on its own and has many attributes and methods. The minimum we should know about an index is how to select values from it. We can select single values from an index just like we do with a Python list, by placing the integer location of the item we want within the square brackets. Here, we select the 4th item (integer location 3) from the index.

In [None]:
idx2[3]

'The Dark Knight Rises'

We can select this same index label without actually assigning the index to a variable first.

In [None]:
movie.index[3]

'The Dark Knight Rises'

### Selection with slice notation
As with Python lists, you can select a range of values using slice notation. Provide the start, stop, and step components of slice notation, separated by colons, within the brackets.

In [None]:
idx2[100:120:4]

Index(['The Fast and the Furious', 'The Sorcerer's Apprentice', 'Warcraft',
       'Transformers', 'Hancock'],
      dtype='object', name='title')

### Selection with a list of integers
You can select multiple individual values with a list of integers. This type of selection does not exist for Python lists.

In [None]:
nums = [1000, 453, 713, 2999]
idx2[nums]

Index(['The Life Aquatic with Steve Zissou', 'Daredevil', 'Daddy Day Care',
       'The Ladies Man'],
      dtype='object', name='title')
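The contrast with Python lists can be seen in a small sketch. The `titles` list and `positions` values are made up for illustration; the same list of integers that works on an `Index` raises a `TypeError` on a plain list.

```python
import pandas as pd

# Hypothetical short list of titles and an Index built from it
titles = ["Avatar", "Spectre", "John Carter", "Tangled"]
title_index = pd.Index(titles, name="title")
positions = [3, 0, 2]

# Slice notation works on both lists and Index objects
every_other = title_index[0:4:2]
print(list(every_other))  # ['Avatar', 'John Carter']

# A list of integers works on an Index...
picked = title_index[positions]
print(list(picked))  # ['Tangled', 'Avatar', 'John Carter']

# ...but not on a plain Python list
list_error = None
try:
    titles[positions]
except TypeError as err:
    list_error = err
    print("plain list:", err)
```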

## Choosing a good index
Before even considering using one of the columns as an index, know that it's not a necessity. You can complete all of your analysis with just the default `RangeIndex`. 

Setting a column as the index can make certain analyses easier, so it is worth considering. If you do choose to set an index for your DataFrame, I suggest using a column whose values are both **unique** and **descriptive**. pandas does not enforce uniqueness for its index, so the same value may repeat multiple times. That said, a good index has unique values that identify each row.
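One way to check a candidate index for uniqueness is the `is_unique` attribute of the index. The sketch below deliberately repeats a label in a made-up DataFrame named `toy` to show that pandas accepts it without complaint.

```python
import pandas as pd

# Hypothetical data in which the label 'Niko' appears twice
toy = pd.DataFrame(
    {"state": ["NY", "TX", "TX"], "age": [30, 2, 33]},
    index=["Jane", "Niko", "Niko"],
)

# pandas builds the DataFrame even though the labels repeat
print(toy.index.is_unique)  # False

# duplicated() flags each label that has already appeared earlier
dup_flags = list(toy.index.duplicated())
print(dup_flags)  # [False, False, True]
```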

## Exercises

You may wish to change the display options before completing the exercises.

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be something other than movie title. Are there any other good columns to use as an index?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Use `set_index` to set the index and keep the column as part of the data. Read the docstrings to find the parameter that controls this functionality.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Read in the movie DataFrame and set the index as the title column. Assign the index to its own variable and output the last 10 movie titles.</span>

### Exercise 4
<span  style="color:green; font-size:16px">Use an integer instead of the column name for `index_col` when reading in the data using `read_csv`. What does it do?</span>