# Indexing
- Indexing is fundamental to Pandas: labels attached to rows and columns let data be looked up directly by label instead of by scanning values.
- Setting an appropriate index can make label-based retrieval noticeably faster.
- An index is a Pandas object backed by a NumPy array. It is immutable (its labels cannot be modified in place) and contains hashable objects.
- A hashable object is one that can be converted to an integer value based on its contents (similar to a key in a dictionary). Objects that compare equal always have the same hash value; objects with different values usually, but not necessarily, have different hash values.
- Pandas has two types of indexes - a row index (vertical) with labels attached to rows, and a column index with labels (column names) for every column.
- Let us now explore index objects – their data types, their properties, and how they speed up access to data.
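
To make the points above concrete, here is a minimal sketch showing the row index, the column index, and what "hashable" means for labels. The tiny DataFrame and the variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative two-row DataFrame (assumed data, for demonstration only).
df = pd.DataFrame({"Element": ["Hydrogen", "Helium"]}, index=["H", "He"])

row_index = df.index        # labels attached to rows
column_index = df.columns   # labels attached to columns

# Equal labels always produce the same hash value.
same_hash = hash("He") == hash("He")

# A list is mutable and therefore not hashable, so it cannot act as a single label.
try:
    hash(["H", "He"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(row_index.tolist(), column_index.tolist(), same_hash, list_is_hashable)
```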

## Type of an index object
- An index object has a type; some of the available types are listed here.
- Index: This is a generic index type; the column index has this type.
- RangeIndex: Default index type in Pandas (used when an index is not defined separately), implemented as a range of increasing integers. This index type helps with saving memory.
- Int64Index: An index type containing integers as labels. Unlike RangeIndex, its labels need not be equally spaced. (This type was removed in Pandas 2.0; newer versions hold integer labels in a plain Index with an integer dtype.)
- Float64Index: Contains floating-point numbers (numbers with a decimal point) as index labels. (Also removed in Pandas 2.0 in favor of a plain Index with a float dtype.)
- IntervalIndex: Contains intervals (for instance, the interval between two integers) as labels.
- CategoricalIndex: Contains labels drawn from a limited, finite set of categories.
- DatetimeIndex: Used to represent dates and times, as in time-series data.
- PeriodIndex: Represents periods like quarters, months, or years.
- TimedeltaIndex: Represents duration between two periods of time or two dates.
- MultiIndex: Hierarchical index with multiple levels.
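
As a rough illustration of the list above, the following sketch builds one small index of several of these types and prints the resulting class names. The sample labels are made up for demonstration. Int64Index and Float64Index are left out because they no longer exist in Pandas 2.0 and later.

```python
import pandas as pd

# One small example per index type; all labels are illustrative.
examples = {
    "RangeIndex": pd.RangeIndex(start=0, stop=5),
    "Index": pd.Index(["a", "b", "c"]),
    "IntervalIndex": pd.IntervalIndex.from_breaks([0, 1, 2, 3]),
    "CategoricalIndex": pd.CategoricalIndex(["low", "high", "low"]),
    "DatetimeIndex": pd.date_range("2024-01-01", periods=3, freq="D"),
    "PeriodIndex": pd.period_range("2024-01", periods=3, freq="M"),
    "TimedeltaIndex": pd.to_timedelta(["1 days", "2 days"]),
    "MultiIndex": pd.MultiIndex.from_tuples([(1896, "Athens"), (2008, "Beijing")]),
}

for expected_name, idx in examples.items():
    print(expected_name, "->", type(idx).__name__)
```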

### Creating a custom index and using columns as indexes
- When a Pandas object is created without an explicit index, a default index of type RangeIndex is created.
- An index of this type starts at 0 and increases by 1 for each row (0, 1, 2, and so on).
- We can set a customized index using either the index parameter (when the object is created) or the index attribute (afterwards).
- We can use the index parameter when we define a Series or DataFrame to give custom values to the index labels.

In [3]:
import pandas as pd
periodic_table=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron']},index=['H','He','Li','Be','B'])
periodic_table

,Element
H,Hydrogen
He,Helium
Li,Lithium
Be,Beryllium
B,Boron


In [4]:
periodic_table.index

Index(['H', 'He', 'Li', 'Be', 'B'], dtype='object')

In [5]:
periodic_table.index=["a","b","c","d","e"]    # replace the index labels through the index attribute
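
Assigning to the index attribute replaces the labels, which is different from naming the index. The sketch below, using an assumed two-row table, contrasts replacing all labels, naming the index, and relabeling selected rows with rename.

```python
import pandas as pd

# Illustrative table (assumed data).
table = pd.DataFrame({"Element": ["Hydrogen", "Helium"]}, index=["H", "He"])

table.index = ["a", "b"]     # replaces every label; the length must match the number of rows
table.index.name = "key"     # sets the name of the index itself, not its labels

# rename relabels only the rows listed in the mapping and returns a new DataFrame.
renamed = table.rename(index={"a": "H"})

print(table.index)
print(renamed.index)
```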

### set_index()
- The set_index method can be used to set an index using an existing column:
- ```DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)```

In [6]:
import pandas as pd
periodic_table1=pd.DataFrame({'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron'],"symbol":['H','He','Li','Be','B']})
periodic_table1

,Element,symbol
0,Hydrogen,H
1,Helium,He
2,Lithium,Li
3,Beryllium,Be
4,Boron,B


In [7]:
periodic_table1.set_index(["symbol"],inplace=True)

In [8]:
periodic_table1

symbol,Element
H,Hydrogen
He,Helium
Li,Lithium
Be,Beryllium
B,Boron


In [9]:
periodic_table1.reset_index(drop=True)     # drop=True discards the index instead of turning it into a column; returns a new DataFrame

,Element
0,Hydrogen
1,Helium
2,Lithium
3,Beryllium
4,Boron


In [10]:
periodic_table1.index

Index(['H', 'He', 'Li', 'Be', 'B'], dtype='object', name='symbol')

In [11]:
periodic_table1.loc[["H","He"],:]

symbol,Element
H,Hydrogen
He,Helium
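
The remaining set_index parameters can be sketched as follows. The small tables are assumed sample data; the behavior shown for drop, append, and verify_integrity follows the documented meaning of those parameters.

```python
import pandas as pd

# Illustrative table (assumed data).
table = pd.DataFrame({"Element": ["Hydrogen", "Helium", "Lithium"],
                      "symbol": ["H", "He", "Li"]})

# drop=False keeps "symbol" as a regular column in addition to using it as the index.
kept = table.set_index("symbol", drop=False)

# append=True adds "symbol" as an extra level next to the existing index.
appended = table.set_index("symbol", append=True)

# verify_integrity=True rejects duplicate labels in the new index.
dupes = pd.DataFrame({"symbol": ["H", "H"], "Element": ["Hydrogen", "Protium"]})
try:
    dupes.set_index("symbol", verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(kept.columns.tolist(), appended.index.nlevels, duplicate_rejected)
```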


### reset_index()
- The index can be turned back into a column, or discarded, using the reset_index method:
- ```DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')```
- This resets the index of the DataFrame and uses the default RangeIndex instead. If the DataFrame has a MultiIndex, the level parameter can be used to remove only one or more of its levels.

In [12]:
periodic_table1.reset_index(inplace=True)

In [13]:
periodic_table1

,symbol,Element
0,H,Hydrogen
1,He,Helium
2,Li,Lithium
3,Be,Beryllium
4,B,Boron


In [14]:
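periodic_table1.reset_index(inplace=True,col_level=1)   # resetting the default index adds it as an "index" column; col_level only matters for MultiIndex columns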

In [15]:
periodic_table1

,index,symbol,Element
0,0,H,Hydrogen
1,1,He,Helium
2,2,Li,Lithium
3,3,Be,Beryllium
4,4,B,Boron


In [16]:
del periodic_table1["index"]

In [17]:
periodic_table1

,symbol,Element
0,H,Hydrogen
1,He,Helium
2,Li,Lithium
3,Be,Beryllium
4,B,Boron
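
The level and drop parameters of reset_index are easiest to see on a MultiIndex. The following is a small sketch on assumed sample data, not the notebook's own dataset.

```python
import pandas as pd

# Illustrative medal table with a two-level index (assumed data).
medals = pd.DataFrame(
    {"Edition": [1896, 1896, 2008],
     "City": ["Athens", "Athens", "Beijing"],
     "Medal": ["Gold", "Silver", "Gold"]}
).set_index(["Edition", "City"])

only_city_reset = medals.reset_index(level="City")  # "City" becomes a column, "Edition" stays in the index
all_reset = medals.reset_index()                     # both levels become columns, default index restored
dropped = medals.reset_index(drop=True)              # both levels are discarded

print(only_city_reset.columns.tolist())
print(all_reset.columns.tolist())
print(dropped.columns.tolist())
```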


### index_col parameter
- We can also set the index when we read data from an external file into a DataFrame, using the index_col parameter, as shown in the following example.

In [18]:
import pandas as pd
olympic_data=pd.read_csv("https://raw.githubusercontent.com/svkarthik86/Advanced-python/main/olympics.csv"
                         ,skiprows=4,index_col=["Edition"])
olympic_data.head(3)

Edition,City,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze


In [20]:

olympic_data.loc[1896].head(2)

Edition,City,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver


In [21]:
olympic_data.reset_index(inplace=True)

In [22]:
olympic_data[olympic_data["Edition"]==1896].count()   # after reset_index, Edition is a regular column again, so a boolean filter is used

Edition         151
City            151
Sport           151
Discipline      151
Athlete         151
NOC             151
Gender          151
Event           151
Event_gender    151
Medal           151
dtype: int64

In [23]:
olympic_data.set_index(["Edition"])

Edition,City,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...
2008,Beijing,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
2008,Beijing,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
2008,Beijing,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
2008,Beijing,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold
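
The index_col parameter can also be tried without a network connection. This sketch reads a small in-memory CSV (made-up rows standing in for the Olympics file) and shows index_col given as a column name, a column position, and a list of columns.

```python
import io
import pandas as pd

# Made-up CSV text standing in for an external file.
csv_text = (
    "Edition,City,Medal\n"
    "1896,Athens,Gold\n"
    "1896,Athens,Silver\n"
    "2008,Beijing,Gold\n"
)

by_name = pd.read_csv(io.StringIO(csv_text), index_col="Edition")        # index by column name
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)            # index by column position
multi = pd.read_csv(io.StringIO(csv_text), index_col=["Edition", "City"])  # two columns -> MultiIndex

print(by_name.loc[1896])
print(multi.index.names)
```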


### multi-index

In [24]:
olympic_data.set_index(["Edition","City"],inplace=True)

In [54]:
olympic_data.Edition.value_counts()   # fails: Edition is now an index level, not a column

AttributeError: 'DataFrame' object has no attribute 'Edition'

In [26]:
olympic_data.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 29216 entries, (1896, 'Athens') to (2008, 'Beijing')
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Sport         29216 non-null  object
 1   Discipline    29216 non-null  object
 2   Athlete       29216 non-null  object
 3   NOC           29216 non-null  object
 4   Gender        29216 non-null  object
 5   Event         29216 non-null  object
 6   Event_gender  29216 non-null  object
 7   Medal         29216 non-null  object
dtypes: object(8)
memory usage: 1.8+ MB


In [27]:
olympic_data

Edition,City,Sport,Discipline,Athlete,NOC,Gender,Event,Event_gender,Medal
1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100m freestyle,M,Gold
1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100m freestyle,M,Silver
1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100m freestyle for sailors,M,Bronze
1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100m freestyle for sailors,M,Gold
1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100m freestyle for sailors,M,Silver
...,...,...,...,...,...,...,...,...,...
2008,Beijing,Wrestling,Wrestling Gre-R,"ENGLICH, Mirko",GER,Men,84 - 96kg,M,Silver
2008,Beijing,Wrestling,Wrestling Gre-R,"MIZGAITIS, Mindaugas",LTU,Men,96 - 120kg,M,Bronze
2008,Beijing,Wrestling,Wrestling Gre-R,"PATRIKEEV, Yuri",ARM,Men,96 - 120kg,M,Bronze
2008,Beijing,Wrestling,Wrestling Gre-R,"LOPEZ, Mijain",CUB,Men,96 - 120kg,M,Gold


In [28]:
olympic_data["Gender"].value_counts()

Men      21721
Women     7495
Name: Gender, dtype: int64
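
Once columns have moved into a MultiIndex, they are reached through the index rather than as columns. A minimal sketch on assumed sample data shows three common ways to do this: a tuple with loc, get_level_values, and xs.

```python
import pandas as pd

# Illustrative data with the same index levels as the Olympics example (assumed rows).
medals = pd.DataFrame(
    {"Edition": [1896, 1896, 2008],
     "City": ["Athens", "Athens", "Beijing"],
     "Gender": ["Men", "Men", "Women"]}
).set_index(["Edition", "City"])

# Select rows by giving one label per level as a tuple.
athens_1896 = medals.loc[(1896, "Athens")]

# Count values of an index level (the level is not available as a column).
edition_counts = medals.index.get_level_values("Edition").value_counts()

# Take a cross-section on a single level.
beijing = medals.xs("Beijing", level="City")

print(athens_1896)
print(edition_counts)
print(beijing)
```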

### Indexes and speed of data retrieval
- Indexes can substantially improve the speed of access to data, particularly for label lookups on large DataFrames.

In [29]:
periodic_table=pd.DataFrame({'Atomic Number':[1,2,3,4,5],'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron'],
                             'Symbol':['H','He','Li','Be','B']})
periodic_table

,Atomic Number,Element,Symbol
0,1,Hydrogen,H
1,2,Helium,He
2,3,Lithium,Li
3,4,Beryllium,Be
4,5,Boron,B


In [30]:
%timeit periodic_table[periodic_table["Atomic Number"]==2]

513 µs ± 67.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Searching without using an index
- Retrieve the element with atomic number 2 without using an index; the time taken can be measured with the %timeit magic function, as in the preceding cell. Without an index, Pandas compares every row against the condition (a linear search), which is relatively time-consuming.

In [31]:
periodic_table[periodic_table["Atomic Number"]==2]

,Atomic Number,Element,Symbol
1,2,Helium,He


In [32]:
periodic_table[periodic_table["Atomic Number"]>2]

,Atomic Number,Element,Symbol
2,3,Lithium,Li
3,4,Beryllium,Be
4,5,Boron,B


In [33]:
periodic_table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Atomic Number  5 non-null      int64 
 1   Element        5 non-null      object
 2   Symbol         5 non-null      object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes


### Search using an index
- Now, set the “Atomic Number” column as the index and use the loc indexer to retrieve rows by label:

In [34]:
periodic_table=pd.DataFrame({'Atomic Number':[1,2,3,4,5],'Element':['Hydrogen','Helium','Lithium','Beryllium','Boron'],
                             'Symbol':['H','He','Li','Be','B']})
periodic_table

,Atomic Number,Element,Symbol
0,1,Hydrogen,H
1,2,Helium,He
2,3,Lithium,Li
3,4,Beryllium,Be
4,5,Boron,B


In [35]:
periodic_table.set_index(["Atomic Number"],inplace=True)

In [36]:
periodic_table

Atomic Number,Element,Symbol
1,Hydrogen,H
2,Helium,He
3,Lithium,Li
4,Beryllium,Be
5,Boron,B


In [37]:
periodic_table.loc[4:5]

Atomic Number,Element,Symbol
4,Beryllium,Be
5,Boron,B
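
A five-row table is too small to show a meaningful difference, so the comparison is better run on a larger frame. The sketch below is one possible way to do that with the standard timeit module; the frame size, column names, and repetition count are arbitrary assumptions, and the measured times will vary by machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic frame with unique integer keys (assumed size and column names).
n = 100_000
frame = pd.DataFrame({"key": np.arange(n), "value": np.arange(n) * 2})
indexed = frame.set_index("key")

target = n - 1

# Boolean mask: every row is compared against the target.
mask_time = timeit.timeit(lambda: frame[frame["key"] == target], number=100)

# Label lookup through the index.
loc_time = timeit.timeit(lambda: indexed.loc[target], number=100)

print(f"boolean mask: {mask_time:.4f} s for 100 lookups")
print(f"index lookup: {loc_time:.4f} s for 100 lookups")
```

Both approaches return the same row; only the lookup path differs.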


### Immutability of an index
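- The index object is immutable: once defined, its individual labels cannot be modified in place.
- The index as a whole can still be replaced by assigning a new set of labels to the index attribute.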


In [38]:
periodic_table.index

Int64Index([1, 2, 3, 4, 5], dtype='int64', name='Atomic Number')

In [39]:
periodic_table.index[2]

3

In [40]:
periodic_table.index[2]=0

TypeError: Index does not support mutable operations
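
Immutability applies to in-place changes of individual labels, not to replacing the index as a whole. A short sketch on an assumed three-row table shows the failing assignment next to two ways that work.

```python
import pandas as pd

# Illustrative table (assumed data).
table = pd.DataFrame({"Element": ["Hydrogen", "Helium", "Lithium"]}, index=[1, 2, 3])

# Assigning to a single position of the index raises TypeError.
try:
    table.index[2] = 0
    in_place_allowed = True
except TypeError:
    in_place_allowed = False

# Replacing the whole index is allowed.
table.index = [10, 20, 30]

# rename builds a new DataFrame with selected labels changed.
relabeled = table.rename(index={10: 1})

print(in_place_allowed, table.index.tolist(), relabeled.index.tolist())
```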

## Index attributes

- Some of the attributes of an index are shown below, using the column index of periodic_table as the example:

In [41]:
periodic_table

Atomic Number,Element,Symbol
1,Hydrogen,H
2,Helium,He
3,Lithium,Li
4,Beryllium,Be
5,Boron,B


In [42]:
periodic_table.reset_index(inplace=True)

In [43]:
periodic_table

,Atomic Number,Element,Symbol
0,1,Hydrogen,H
1,2,Helium,He
2,3,Lithium,Li
3,4,Beryllium,Be
4,5,Boron,B


In [44]:
periodic_table.set_index(["Symbol"])

Symbol,Atomic Number,Element
H,1,Hydrogen
He,2,Helium
Li,3,Lithium
Be,4,Beryllium
B,5,Boron


In [45]:
column_index=periodic_table.columns

In [46]:
column_index

Index(['Atomic Number', 'Element', 'Symbol'], dtype='object')

### values

1.values attribute: Returns the index labels as an array.

In [47]:
column_index.values

array(['Atomic Number', 'Element', 'Symbol'], dtype=object)

### hasnans

2.hasnans attribute: Returns a Boolean True or False value based on the presence of null values.

In [48]:
column_index.hasnans

False

### nbytes

3.nbytes attribute: Returns the number of bytes occupied in memory.

In [52]:
column_index.nbytes

24
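
The same attributes can be compared on an index that does contain a missing value. This is a small sketch with made-up labels; the byte count of the float index assumes 8 bytes per float64 element.

```python
import numpy as np
import pandas as pd

# One index without missing labels and one with a NaN (illustrative values).
complete = pd.Index(["Atomic Number", "Element", "Symbol"])
with_gap = pd.Index([1.0, np.nan, 3.0])

print(complete.values)     # the labels as an array
print(complete.hasnans)    # no missing labels
print(with_gap.hasnans)    # a NaN label is present
print(with_gap.nbytes)     # three float64 labels at 8 bytes each
```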