# Pandas Index

**In the world of computer science there are two kinds of index: one used in databases and one used in programming. What's the difference?**

- The word *index* is mainly used in two senses.
    1. Indexing for optimized searching. This is a database concept (B-tree, B+ tree).
    2. Indexing to access an element (`for i in range(len(nums))`). This is a programming language concept.
- The pandas `Index` object follows the programming language concept. Going forward, we will call this general indexing concept the **Programming Index**.
- The Programming Index usually comes in two flavours.
    1. **Implicit**, also known as the **positional index**. This is what we see in a Python list or tuple.
    2. **Explicit**, also known as a **key**. This is what we see in a Python dictionary.
- The **Implicit** index is the sequence *0, 1, 2, 3, ..., n-1*, and an **Explicit** index is any *hashable* value. Python does not define a dedicated index class for either.
- pandas has two main data structures, `Series` and `DataFrame`, and it also relies on the **Programming Index** to access elements in both. The difference is that the pandas team created a dedicated `Index` class, which lets an index hold explicit labels while still supporting **slicing**, **masking** and **fancy indexing**.
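
The two flavours can be sketched in a few lines. This is only an illustration; the variable names and sample data are made up for the example.

```python
import pandas as pd

# Implicit (positional) index: a list is addressed by 0, 1, 2, ...
nums = [10, 20, 30]
first = nums[0]

# Explicit index (key): a dict is addressed by a hashable key
ages = {"alice": 31, "bob": 27}
bob_age = ages["bob"]

# A pandas Series carries an explicit Index of labels,
# while positional access is still available through .iloc
s = pd.Series([31, 27], index=["alice", "bob"])
by_label = s.loc["bob"]      # explicit: look up by label
by_position = s.iloc[1]      # implicit: look up by position

print(first, bob_age, by_label, by_position)
print(type(s.index))
```

Here `.loc` and `.iloc` make the choice between label and position explicit, which avoids ambiguity when the labels themselves are integers.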

**Between database indexing and programming indexing, which one does NumPy use?**

- Before learning about the pandas `Index`, keep one thing in mind: NumPy indexing.
    - NumPy arrays also use the Programming Index concept to access an element in the array.
    - The NumPy team, however, did not create a specialized Index class; arrays simply work with positions 0, 1, 2, ..., n-1.
    - Most importantly, of the two flavours of Programming Index (Implicit and Explicit), NumPy uses ONLY the Implicit index.
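
A quick way to see this, as a small sketch with made-up sample values: positional access works on a plain NumPy array, while a label-style key is rejected.

```python
import numpy as np

arr = np.array([10, 20, 30])

# Positional access works, including negative positions
first, last = arr[0], arr[-1]

# A plain ndarray has no notion of a label: a string key is rejected
try:
    arr["a"]
    label_lookup_failed = False
except IndexError as ex:
    label_lookup_failed = True
    print(ex)

print(first, last, label_lookup_failed)
```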

**How is a NumPy array used internally by the pandas Index class?**

- We need to know how NumPy indexing works because the pandas `Index` depends heavily on the NumPy array.
- Internally, the `pandas.Index` class holds a NumPy array, and whatever iterable we pass through the constructor's `data` parameter is stored in that array.
- We can see the labels we passed when creating the Index object through the `._data` or `.values` properties. Avoid `._data`: it is meant for internal use, so it might not always return the labels, whereas `.values` returns the labels we expect.
- Because the NumPy array locates elements by implicit (in other words, positional) index and not by label, the iterable passed to the `Index` constructor can contain duplicates.

In [35]:
import pandas as pd

labels = [1,1,2,3,4,5] # It has duplicates
idx = pd.Index(data=labels)
idx, idx._data, type(idx._data), idx.values, type(idx.values)

(Index([1, 1, 2, 3, 4, 5], dtype='int64'),
 array([1, 1, 2, 3, 4, 5]),
 numpy.ndarray,
 array([1, 1, 2, 3, 4, 5]),
 numpy.ndarray)
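
Since labels may repeat, it can be useful to check for duplicates before relying on label lookups. The sketch below uses only public `Index` attributes and methods (`is_unique`, `has_duplicates`, `to_numpy()`, `get_loc()`); the sample labels are the same as above.

```python
import pandas as pd

idx = pd.Index([1, 1, 2, 3, 4, 5])

print(idx.is_unique)        # False, because the label 1 appears twice
print(idx.has_duplicates)   # True
print(idx.to_numpy())       # public way to get the labels as a NumPy array

# Looking up a duplicated label cannot return a single position,
# so get_loc returns a slice or boolean mask covering every match
loc = idx.get_loc(1)
print(loc)
```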

**What kind of data can we pass to the pandas.Index constructor, and why?**

- One more thing follows from this. Because `pandas.Index` takes whatever is passed in the `data` parameter and converts it into a NumPy array, the constructor accepts any one-dimensional collection that can be converted into a NumPy array. In other words, most things that can be passed to `numpy.array()` can also be passed to `pandas.Index`, as long as they are a collection and not a single scalar.
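
As a rough illustration (the inputs are arbitrary examples), several one-dimensional collections produce the same labels, while a bare scalar is rejected:

```python
import numpy as np
import pandas as pd

# Different one-dimensional inputs that all yield the labels 1, 2, 3
from_list = pd.Index([1, 2, 3])
from_tuple = pd.Index((1, 2, 3))
from_array = pd.Index(np.array([1, 2, 3]))
from_range = pd.Index(range(1, 4))

print(list(from_list), list(from_tuple), list(from_array), list(from_range))

# A scalar is not a collection, so the constructor refuses it
try:
    pd.Index(5)
    scalar_rejected = False
except TypeError as ex:
    scalar_rejected = True
    print(ex)
```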

**How can pandas.Index support Slicing, Masking and Fancy Indexing?**

- A NumPy array supports three operations: slicing, masking and fancy indexing. For the same reason as before (*pandas.Index internally stores the given labels in a NumPy array*), `pandas.Index` supports all three.
- A few things to know about each of these operations, as they behave on a NumPy array:
    1. Slicing: when we slice, we get a view, not a new array.
    2. Masking: we index with a boolean array produced by an expression such as `arr%2==0` or `arr>2`. NumPy does not AND the mask with the original array; it returns a new array holding only the elements at the positions where the mask is `True`.
    3. Fancy Indexing: we index with a list or array of integer positions, such as `arr[[2,3,4]]`, and we get a new array holding the elements at those positions.
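
The view-versus-copy behaviour can be checked with `np.shares_memory`. This is a minimal sketch on a small sample array, where fancy indexing means indexing with integer positions and masking means indexing with a boolean array:

```python
import numpy as np

arr = np.array([0, 1, 2, 3, 4, 5])

sliced = arr[1:4]            # slicing: a view onto arr
fancy = arr[[2, 3, 4]]       # fancy indexing: a copy
mask = arr % 2 == 0          # boolean array, same length as arr
masked = arr[mask]           # masking: a copy of the True positions

print(np.shares_memory(arr, sliced))   # True
print(np.shares_memory(arr, fancy))    # False
print(np.shares_memory(arr, masked))   # False

# Writing through the view changes the original array
sliced[0] = 100
print(arr)
```

The copies `fancy` and `masked` are unaffected by the final assignment, because they no longer share memory with `arr`.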

**Slicing**

In [25]:
import numpy as np
arr = np.array([0, 1, 2, 3, 4, 5])
arr[1:4], arr[:3], arr[::2]

(array([1, 2, 3]), array([0, 1, 2]), array([0, 2, 4]))

In [30]:
import pandas as pd
idx = pd.Index([0, 1, 2, 3, 4, 5])
idx[1:4], idx[:3], idx[::2]

(Index([1, 2, 3], dtype='int64'),
 Index([0, 1, 2], dtype='int64'),
 Index([0, 2, 4], dtype='int64'))

**Fancy Indexing** (indexing with a list of integer positions)

In [26]:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
arr[[2,3,4]]

array([30, 40, 50])

In [31]:
import pandas as pd
idx = pd.Index([10, 20, 30, 40, 50])
idx[[2,3,4]]

Index([30, 40, 50], dtype='int64')

**Masking** (indexing with a boolean array)

In [34]:
import numpy as np
arr = np.array([0,1,2,3,4,5])
arr%2, arr[arr%2==0]

(array([0, 1, 0, 1, 0, 1]), array([0, 2, 4]))

In [33]:
import pandas as pd
idx = pd.Index([0, 1, 2, 3, 4, 5])
idx%2==0, idx[idx%2==0]

(array([ True, False,  True, False,  True, False]),
 Index([0, 2, 4], dtype='int64'))

**What are the subclasses of pandas.Index? And what will be the dtype of the internal NumPy array?**

- `pandas.Index` is a very generic class. In the constructor,
    - if you provide integers, the internal NumPy array dtype will be int64
    - if you provide floats, the internal NumPy array dtype will be float64

In [36]:
import pandas as pd

idx_int = pd.Index([1,2,3,4,5])
idx_float = pd.Index([1.0, 2.0, 3.0, 4.0, 5.0])
idx_int, idx_float

(Index([1, 2, 3, 4, 5], dtype='int64'),
 Index([1.0, 2.0, 3.0, 4.0, 5.0], dtype='float64'))
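
The same inference applies to less uniform input. The sketch below, using arbitrary sample values, shows integers being upcast when mixed with floats, and datetime values producing a specialised index type:

```python
import pandas as pd

mixed = pd.Index([1, 2.5, 3])   # ints and floats together
dates = pd.Index(pd.to_datetime(["2024-01-01", "2024-01-02"]))

print(mixed.dtype)              # float64: the ints are upcast
print(type(dates).__name__)     # DatetimeIndex
```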

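- `pandas.Index` also has a few child classes, such as `RangeIndex`, `DatetimeIndex`, `CategoricalIndex` and `MultiIndex`.
- The difference between `pandas.Index` and `pandas.RangeIndex` is similar to the difference between a `list` and a `range` object in Python: a `RangeIndex` stores only start, stop and step instead of every label.
- Because `RangeIndex` is memory efficient, pandas creates one by default when we don't provide any labels to `pandas.Series`.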

In [37]:
import pandas as pd
idx = pd.Index(range(5))
idx

RangeIndex(start=0, stop=5, step=1)
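
To get a feel for the saving, one can compare `memory_usage()` for a `RangeIndex` and for an `Index` built from a materialised array. This is a rough sketch; the size `n` is an arbitrary choice and the exact byte counts depend on the platform.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])             # no labels supplied
print(type(s.index).__name__)           # RangeIndex

n = 1_000_000
lazy = pd.RangeIndex(n)                 # stores only start, stop, step
materialized = pd.Index(np.arange(n))   # stores all n labels

print(lazy.memory_usage(), materialized.memory_usage())
```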

**NumPy arrays are mutable. pandas.Index is basically a wrapper around numpy.ndarray. Is pandas.Index also mutable?**

No, because the index is used to locate an element in the data structure. If it were mutable, the entire data structure could fall apart. That's why `pandas.Index` is immutable, even though it is internally a `numpy.ndarray`.

In [4]:
import pandas as pd

idx = pd.Index([0,1,2,3,4,5])

try:
    idx[3] = 30
except TypeError as ex:
    print(ex)

Index does not support mutable operations
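
When a label really has to change, the usual route is to build a new `Index` rather than modify the existing one. A small sketch using the `delete`, `insert` and `append` methods, each of which returns a new object:

```python
import pandas as pd

idx = pd.Index([0, 1, 2, 3, 4, 5])

# These methods leave idx untouched and return a new Index
replaced = idx.delete(3).insert(3, 30)   # swap the label at position 3
appended = idx.append(pd.Index([6]))     # add a label at the end

print(idx)
print(replaced)
print(appended)
```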


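**How does the pandas Index class support set operations? And why was that needed in the first place?**

- pandas indexes have set-like properties.
- We can perform union (`.union()`), intersection (`.intersection()`) and membership (`in`) operations. In pandas 2.0 and later, the `|` and `&` operators on an Index act as element-wise logical operations and no longer as set operations, so prefer the named methods.
- These operations are needed because pandas aligns data on labels: for example, adding two `Series` produces a result whose index is the union of both indexes.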
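
A short sketch of the set-style methods, followed by the label alignment that relies on them. The index values and `Series` data are made up for the example.

```python
import pandas as pd

a = pd.Index([1, 2, 3, 4])
b = pd.Index([3, 4, 5, 6])

print(a.union(b))                 # labels in either index
print(a.intersection(b))          # labels in both
print(a.difference(b))            # labels in a but not in b
print(a.symmetric_difference(b))  # labels in exactly one of them
print(3 in a, 7 in a)             # membership test

# Arithmetic between Series aligns on the union of their indexes;
# labels present on only one side end up as NaN
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
total = s1 + s2
print(total)
```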