Day 17: What do I need to know about the pandas index? (Part 1)

The DataFrame index is core to the functionality of pandas, yet it confuses many users. In this video, I'll explain what the index is used for and why you might want to store your data in the index. I'll also demonstrate how to set and reset the index, and show how that affects the DataFrame's shape and contents.

In [50]:
import pandas as pd

In [51]:
drinks = pd.read_csv('https://bit.ly/drinksbycountry')

In [52]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In [53]:
drinks.index # Every DataFrame has an index attribute and a columns attribute. The index holds the row labels, one label per row.

RangeIndex(start=0, stop=193, step=1)

In [8]:
drinks.columns # The column labels are also stored as an Index object, but "the index" refers only to the row labels.

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

Note: Neither the index nor the columns are considered part of the DataFrame's contents.
So shape reports 193 rows (the column header is not counted) and 6 columns (the index is not counted).

In [54]:
drinks.shape

(193, 6)
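As a minimal sketch of the same point, the snippet below builds a small hypothetical frame (the three rows are illustrative, not the full drinks dataset) and shows that shape counts only the data, not the row labels or the column header.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the layout of the drinks data.
sample = pd.DataFrame({
    'country': ['Albania', 'Brazil', 'Chile'],
    'beer_servings': [89, 245, 130],
    'continent': ['Europe', 'South America', 'South America'],
})

print(sample.index)    # default integer index: one label per row
print(sample.columns)  # column labels, also stored as an Index object
print(sample.shape)    # counts only the data: 3 rows and 3 columns
```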

In [55]:
pd.read_table('https://bit.ly/movieusers', header=None, sep='|').head() # Users who rated movies on the website. Because header=None and no names were given, the columns get default integer labels.

Unnamed: 0,0,1,2,3,4
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
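The default column labels can be reproduced without a network connection. This sketch reads three of the rows above from an in-memory string using pd.read_csv with sep='|'. The column names passed to names= are hypothetical and chosen only for illustration.

```python
import io
import pandas as pd

# Three pipe-separated rows copied from the output above.
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"

# header=None: the file has no header row, so pandas labels the columns 0..4.
users = pd.read_csv(io.StringIO(raw), sep='|', header=None)
print(users.columns.tolist())

# Hypothetical column names, supplied through the names parameter.
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
named = pd.read_csv(io.StringIO(raw), sep='|', header=None, names=user_cols)
print(named.columns.tolist())
```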


Why does the index exist?
1. For identification 
2. For selection
3. For alignment.

#### Identification


In [56]:
drinks[drinks.continent=='South America'] # The row labels are unchanged after filtering; pandas does not renumber them. So the index still identifies each row, even in a filtered DataFrame.

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
6,Argentina,193,25,221,8.3,South America
20,Bolivia,167,41,8,3.8,South America
23,Brazil,245,145,16,7.2,South America
35,Chile,130,124,172,7.6,South America
37,Colombia,159,76,3,4.2,South America
52,Ecuador,162,74,3,4.2,South America
72,Guyana,93,302,1,7.1,South America
132,Paraguay,213,117,74,7.3,South America
133,Peru,163,160,21,6.1,South America
163,Suriname,128,178,7,5.6,South America
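A small sketch of the identification idea, using a hypothetical four-row frame: the labels that survive a filter are the same labels the rows had in the original frame, so they can be used to look the rows up again.

```python
import pandas as pd

# Hypothetical sample; row labels are the default integers 0..3.
sample = pd.DataFrame({
    'country': ['Albania', 'Argentina', 'Angola', 'Brazil'],
    'continent': ['Europe', 'South America', 'Africa', 'South America'],
})

south_america = sample[sample.continent == 'South America']
print(south_america.index.tolist())  # the original labels, not 0 and 1

# The surviving label still identifies the same row in the unfiltered frame.
print(sample.loc[3, 'country'])
```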


#### Selection


In [57]:
drinks.loc[23,'beer_servings'] # With the default integer index, we need to know which index label corresponds to which row (23 is Brazil).

245

In [58]:
drinks.set_index('country', inplace=True) # The country column (a Series) becomes the index, and the index is now named 'country'.
drinks.head()

country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


In [21]:
drinks.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

In [22]:
drinks.columns

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

In [23]:
drinks.shape

(193, 5)

In [27]:
drinks.loc['Brazil', 'beer_servings'] # Now we can select by the label 'Brazil' instead of the integer 23.

245
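The two selection styles can be compared side by side. This is a minimal sketch on a hypothetical three-row frame; it calls set_index without inplace=True, so the original frame is left untouched and the indexed copy gets its own name.

```python
import pandas as pd

# Hypothetical sample with the default integer index.
sample = pd.DataFrame({
    'country': ['Albania', 'Brazil', 'Chile'],
    'beer_servings': [89, 245, 130],
})

# Without inplace=True, set_index returns a new DataFrame.
by_country = sample.set_index('country')

print(sample.loc[1, 'beer_servings'])             # select by integer label
print(by_country.loc['Brazil', 'beer_servings'])  # select by country label
print(sample.shape, by_country.shape)             # one fewer column after set_index
```

Moving country into the index removes it from the columns, which is why the column count drops by one.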

In [60]:
drinks.index.name = None
drinks.head()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


In [59]:
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
drinks.head()

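Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa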
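The index name matters when resetting, which is why the name is restored to 'country' before reset_index is called. The sketch below, on a hypothetical three-row frame, shows the difference: a named index comes back as a column with that name, while an unnamed index comes back as a column called 'index'.

```python
import pandas as pd

# Hypothetical frame whose index is already the country, named 'country'.
by_country = pd.DataFrame(
    {'beer_servings': [89, 245, 130]},
    index=pd.Index(['Albania', 'Brazil', 'Chile'], name='country'),
)

# Named index: reset_index restores a 'country' column and a 0..n-1 index.
restored = by_country.reset_index()
print(restored.columns.tolist())

# Unnamed index: the restored column falls back to the name 'index'.
unnamed = by_country.rename_axis(None)
fallback = unnamed.reset_index()
print(fallback.columns.tolist())
```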

In [61]:
drinks.describe() # A numerical summary of the numeric columns. The result is itself a DataFrame.

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


In [62]:
drinks.describe().index

Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

In [63]:
drinks.describe().columns

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol'],
      dtype='object')

In [66]:
drinks.describe().loc['25%', 'wine_servings'] # drinks.describe() returns a DataFrame, so we can use its loc accessor to pull out a single value by its index label and column label.

1.0
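Because describe() returns an ordinary DataFrame, its index and columns behave like any other. A minimal sketch on a hypothetical four-row frame; by default the non-numeric continent column is left out of the summary.

```python
import pandas as pd

# Hypothetical sample with one numeric and one text column.
sample = pd.DataFrame({
    'beer_servings': [89, 245, 130, 0],
    'continent': ['Europe', 'South America', 'South America', 'Asia'],
})

summary = sample.describe()
print(type(summary))             # a DataFrame, not a special summary object
print(summary.index.tolist())    # the statistic names are the row labels
print(summary.columns.tolist())  # only the numeric column is summarized

# Select one statistic by row label and column label.
print(summary.loc['max', 'beer_servings'])
```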