# Indexing in Pandas

# A. Data Indexing and Selection in Series


In [None]:
# import necessary libraries
import pandas as pd
import numpy as np

## A.1 Series as a Dictionary

In [2]:
# let us create a series object
series_1 = pd.Series(data=np.random.random(10),
                     index=list(range(1, 11)))
print(f'The series is:\n{series_1}')

# just like a dict we can access the keys and the values of the series
print(f'The keys of the series are: {series_1.keys()}')
print(f'The values of the series are:{series_1.values}')

The series is:
1     0.119761
2     0.059803
3     0.899937
4     0.102466
5     0.060674
6     0.289320
7     0.683394
8     0.990552
9     0.762207
10    0.055721
dtype: float64
The keys of the series are: Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')
The values of the series are:[0.11976098 0.05980322 0.89993722 0.10246627 0.06067385 0.28931954
 0.68339441 0.9905525  0.76220744 0.0557205 ]


In [None]:
# Let us zip the key and the values together 
print(list(zip(series_1.keys(), series_1.values)))

[(1, 0.43166568261732396), (2, 0.24983271613878089), (3, 0.2065006860182862), (4, 0.6424725840012879), (5, 0.3196313979480794), (6, 0.5198276122721387), (7, 0.8683615716318782), (8, 0.5276659294341741), (9, 0.13090406696542), (10, 0.4780327433023466)]


In [15]:
# items of the series - same idea as the items() method of a dict
print(f'The items of the series (list of (key,value) tuples): {series_1.items()}')

# Note: series_1.items() returns a zip object, a lazy iterator over
# (index, value) pairs. Printing it directly shows only the object's repr,
# not the pairs themselves.

# convert the iterator to a list to see the pairs
items = list(series_1.items())

print(f'The items of the series (list of (key, value) tuples): {items}')

The items of the series (list of (key,value) tuples): <zip object at 0x000001DB7C9AEC00>
The items of the series (list of (key, value) tuples): [(1, 0.43166568261732396), (2, 0.24983271613878089), (3, 0.2065006860182862), (4, 0.6424725840012879), (5, 0.3196313979480794), (6, 0.5198276122721387), (7, 0.8683615716318782), (8, 0.5276659294341741), (9, 0.13090406696542), (10, 0.4780327433023466)]
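
Because `items()` is lazy, it is usually consumed in a loop or passed to a constructor rather than printed. The following is a minimal sketch on a small, made-up Series (the name `s` and its values are illustrative and are not the notebook's `series_1`):

```python
import pandas as pd

# Hypothetical three-element series, used only for illustration
s = pd.Series([0.5, 1.5, 2.5], index=[1, 2, 3])

# Iterate over the (index, value) pairs one at a time
pairs = []
for key, value in s.items():
    pairs.append((key, value))

# dict() consumes the same kind of iterator and builds a regular dictionary
as_dict = dict(s.items())

print(pairs)
print(as_dict)
```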


In [16]:
# like dict, you can use key to access the elements of the series 
series_1[3]

0.2065006860182862

In [17]:
# you can add elements to the series by defining a new key/index
series_1[11] = 0.9876
print(series_1)

1     0.431666
2     0.249833
3     0.206501
4     0.642473
5     0.319631
6     0.519828
7     0.868362
8     0.527666
9     0.130904
10    0.478033
11    0.987600
dtype: float64
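
The dictionary analogy also covers membership tests and safe lookups. A short sketch, assuming a small, made-up Series `s` (not the notebook's `series_1`):

```python
import pandas as pd

# Hypothetical series with an explicit integer index
s = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# `in` checks the index labels (the "keys"), not the values
has_key = 2 in s
value_is_not_a_key = 20.0 in s

# get() returns a default instead of raising when the label is missing
missing = s.get(99, -1.0)

# Assigning to a new label appends an element, as with a dict
s[4] = 40.0

print(has_key, value_is_not_a_key, missing)
print(s)
```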


## A.2 Series as a One-Dimensional Array

If a Series has an **explicit integer index** (for example, one starting from 1), a plain slice such as `series[1:3]` uses the **implicit integer positions, not the explicit index labels**, even though a single lookup such as `series[1]` uses the label.

**Solution:**
- Use `.loc[]` or `.iloc[]` to make your intent explicit:
    - Use `.loc[]` to **slice using the explicit (labeled) index**.
    - Use `.iloc[]` to slice using the **implicit (integer) index**.
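
To make the difference concrete, here is a minimal sketch on a small, made-up Series (the names `plain`, `by_label` and `by_pos` are illustrative). The comment on the plain slice reflects the behavior described above; it may vary between pandas versions, which is one more reason to prefer the explicit indexers.

```python
import pandas as pd

# Hypothetical series whose labels (1..4) differ from its positions (0..3)
s = pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4])

# Plain slice: described above as positional (assumed; may be version dependent)
plain = s[1:3]

# Explicit label-based slice: labels 1, 2 and 3, end label included
by_label = s.loc[1:3]

# Explicit position-based slice: positions 1 and 2, end position excluded
by_pos = s.iloc[1:3]

print(plain)
print(by_label)
print(by_pos)
```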

In [18]:
# slicing with explicit index 
print(f'The first three items using explicit indexing : \n{series_1.loc[1:3]}')

# slicing with implicit index
print(f'The first three items with implicit indexing : \n{series_1.iloc[:3]}')

The first three items using explicit indexing : 
1    0.431666
2    0.249833
3    0.206501
dtype: float64
The first three items with implicit indexing : 
1    0.431666
2    0.249833
3    0.206501
dtype: float64


Both slices return the same three rows here, but for different reasons. The `.loc[]` indexer **includes the end label** in the slice, following **closed interval** logic `[start, end]`. Python's default slicing (as with lists or `.iloc[]`) instead uses a **half-open interval** `[start, end)`.

Here's a breakdown:

### Key Differences:
1. **`.loc[]` (label-based slicing)**:
   - Includes the end label in the slice (`[start:end]`).
   - This is consistent with the behavior of operations on labels in pandas.
   - Example: `series.loc[1:3]` includes labels 1, 2, and 3.

2. **`.iloc[]` (position-based slicing)**:
   - Excludes the end position in the slice (`[start:end)`).
   - Example: `series.iloc[:3]` includes positions 0, 1, and 2, stopping before position 3.


### Example to Clarify:

```python
import pandas as pd

# Create a Series with an explicit index starting from 1
series_1 = pd.Series([10, 20, 30, 40], index=[1, 2, 3, 4])

print("Original Series:")
print(series_1)

# Explicit index slicing with .loc[]
print("\nSlicing with .loc[] (explicit indexing, closed interval):")
print(series_1.loc[1:3])  # Includes the end label (index 3)

# Implicit index slicing with .iloc[]
print("\nSlicing with .iloc[] (implicit indexing, half-open interval):")
print(series_1.iloc[:3])  # Excludes the end position (position 3)
```

### Output:

```plaintext
Original Series:
1    10
2    20
3    30
4    40
dtype: int64

Slicing with .loc[] (explicit indexing, closed interval):
1    10
2    20
3    30
dtype: int64

Slicing with .iloc[] (implicit indexing, half-open interval):
1    10
2    20
3    30
dtype: int64
```

### Why `.loc[]` Includes the End Label:
The `.loc[]` behavior aligns with pandas' philosophy of working with labeled data:
- When working with labels, it's often more intuitive to think in terms of "from label X to label Y" and expect both labels to be included.

This behavior helps reduce off-by-one errors when working with labeled indices.
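
The rationale is easier to see with non-numeric labels, where there is no obvious "one past the end" label to write. A minimal sketch with a made-up, string-indexed Series (all names here are illustrative):

```python
import pandas as pd

# Hypothetical series indexed by strings
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

# Label slice: both "b" and "c" are included
closed = s.loc["b":"c"]

# Position slice: position 1 only, because position 2 is excluded
half_open = s.iloc[1:2]

print(closed)
print(half_open)
```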

In [3]:
# fancy indexing - pass a list of index labels to loc
print(f'The first, third and seventh element are:\n{series_1.loc[[1,3,7]]}')

The first, third and seventh element are:
1    0.119761
3    0.899937
7    0.683394
dtype: float64


In [None]:
# fancy indexing - pass a list of integer positions to iloc (positions start at 0)
print(f'The second, fourth and eighth element are:\n{series_1.iloc[[1,3,7]]}')

The second, fourth and eighth element are:
2    0.059803
4    0.102466
8    0.990552
dtype: float64


### Boolean Masking in Series (like NumPy arrays)
- Prefer `.loc` for boolean masking. Plain `series[mask]` also works, but `.loc` makes the label-based intent explicit.

In [20]:
# boolean logic and masking in the Series object - just like the NumPy arrays 

# consider the following series
series = pd.Series(data=[12,45,67,88,34,54,76,99,7,8,10])

# find the mean of the series 
mean_series = np.mean(series)
print(f"The mean of the series is {mean_series}")

# create the boolean array 
bool_series = series > mean_series
print("Boolean Series (True when value greater than mean): ")
print(bool_series)

# use this boolean series to filter the original series
print("Using the boolean series to filter the original :")
print(series.loc[bool_series]) # use loc[] indexer for masking

The mean of the series is 45.45454545454545
Boolean Series (True when value greater than mean): 
0     False
1     False
2      True
3      True
4     False
5      True
6      True
7      True
8     False
9     False
10    False
dtype: bool
Using the boolean series to filter the original :
2    67
3    88
5    54
6    76
7    99
dtype: int64
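
Masks can also be combined with `&`, `|` and `~`, as with NumPy arrays. The sketch below reuses the same data as the cell above; the thresholds 40 and 80 are arbitrary choices for illustration.

```python
import pandas as pd

series = pd.Series(data=[12, 45, 67, 88, 34, 54, 76, 99, 7, 8, 10])

# Each comparison needs its own parentheses because & binds tighter than > and <
mask = (series > 40) & (series < 80)

between = series.loc[mask]     # values strictly between 40 and 80
outside = series.loc[~mask]    # everything else

# Plain [] with a boolean mask selects the same rows as .loc
same_as_plain = series[mask].equals(between)

print(between)
print(same_as_plain)
```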


---

# B. Data Indexing and Selection in DataFrame


In [6]:
area = pd.Series({'California': 423967, 'Texas': 695662, 
                  'Florida': 170312, 'New York': 141297, 
                  'Pennsylvania': 119280})
pop = pd.Series({'California': 39538223, 'Texas': 29145505, 
                 'Florida': 21538187, 'New York': 20201249, 
                 'Pennsylvania': 13002700})
data = pd.DataFrame({'area':area, 'pop':pop})
data

|  | area | pop |
| --- | --- | --- |
| California | 423967 | 39538223 |
| Texas | 695662 | 29145505 |
| Florida | 170312 | 21538187 |
| New York | 141297 | 20201249 |
| Pennsylvania | 119280 | 13002700 |


In [7]:
# accessing each series of the dataframe 
area_series = data['area']
print(area_series)

California      423967
Texas           695662
Florida         170312
New York        141297
Pennsylvania    119280
Name: area, dtype: int64


In [9]:
# add a new feature (column) to the dataframe
data['density'] = data['pop']/data['area']
data

|  | area | pop | density |
| --- | --- | --- | --- |
| California | 423967 | 39538223 | 93.257784 |
| Texas | 695662 | 29145505 | 41.896072 |
| Florida | 170312 | 21538187 | 126.463121 |
| New York | 141297 | 20201249 | 142.97012 |
| Pennsylvania | 119280 | 13002700 | 109.009893 |


## B.1 DataFrame as a 2D NumPy array

In [12]:
data_numpy = data.to_numpy()
print(f'The NumPy array of the dataframe is:\n{data_numpy}')

data_numpy_T = data_numpy.T # transposes the NumPy array (the dataframe itself is unchanged)
print(f'The transposed NumPy array is:\n{data_numpy_T}')

The NumPy array of the dataframe is:
[[4.23967000e+05 3.95382230e+07 9.32577842e+01]
 [6.95662000e+05 2.91455050e+07 4.18960717e+01]
 [1.70312000e+05 2.15381870e+07 1.26463121e+02]
 [1.41297000e+05 2.02012490e+07 1.42970120e+02]
 [1.19280000e+05 1.30027000e+07 1.09009893e+02]]
The transposed NumPy array is:
[[4.23967000e+05 6.95662000e+05 1.70312000e+05 1.41297000e+05
  1.19280000e+05]
 [3.95382230e+07 2.91455050e+07 2.15381870e+07 2.02012490e+07
  1.30027000e+07]
 [9.32577842e+01 4.18960717e+01 1.26463121e+02 1.42970120e+02
  1.09009893e+02]]


In [None]:
# Alternate way to create the NumPy array from the dataframe
data.values # also gives a NumPy array from the dataframe 

array([[4.23967000e+05, 3.95382230e+07, 9.32577842e+01],
       [6.95662000e+05, 2.91455050e+07, 4.18960717e+01],
       [1.70312000e+05, 2.15381870e+07, 1.26463121e+02],
       [1.41297000e+05, 2.02012490e+07, 1.42970120e+02],
       [1.19280000e+05, 1.30027000e+07, 1.09009893e+02]])

Using the `iloc` indexer, we can index a dataframe as if it were a simple 2D NumPy array.

In [35]:
# select the first and third rows of the area and pop columns
print(data.iloc[[0,2],[0,1]]) # note: iloc accepts implicit (integer) positions only

print(data.iloc[:2,:2]) # first two rows and first two columns (not the same rows as above)

print(data.iloc[:2,1]) # a single column position returns a Series object

              area       pop
California  423967  39538223
Florida     170312  21538187
              area       pop
California  423967  39538223
Texas       695662  29145505
California    39538223
Texas         29145505
Name: pop, dtype: int64
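
The NumPy analogy holds for slices, but list-based (fancy) indexing differs in one respect worth knowing. The sketch below rebuilds a three-row version of the table above (the names `df`, `arr`, `cross` and `paired` are illustrative) and compares `iloc` with direct indexing of the underlying array.

```python
import numpy as np
import pandas as pd

# Three rows of the area/pop table, rebuilt here so the example is self-contained
df = pd.DataFrame(
    {"area": [423967, 695662, 170312], "pop": [39538223, 29145505, 21538187]},
    index=["California", "Texas", "Florida"],
)
arr = df.to_numpy()

# Slices behave the same way in both
same_block = np.array_equal(df.iloc[:2, :2].to_numpy(), arr[:2, :2])

# Two lists in iloc select every listed row for every listed column (a 2x2 block)
cross = df.iloc[[0, 2], [0, 1]]

# Two lists in NumPy are paired element-wise: elements (0, 0) and (2, 1)
paired = arr[[0, 2], [0, 1]]

print(same_block)
print(cross)
print(paired)
```

To get the `iloc`-style block from the raw array, `arr[np.ix_([0, 2], [0, 1])]` is the usual NumPy idiom.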


In [45]:
# the loc indexer uses the explicit index; the label slice includes its end label ('Florida')
data.loc[:'Florida',['area','pop']]

|  | area | pop |
| --- | --- | --- |
| California | 423967 | 39538223 |
| Texas | 695662 | 29145505 |
| Florida | 170312 | 21538187 |


In [41]:
# we can use the loc[] indexer to mask
data.loc[data['density']<120, ['pop','density']]

|  | pop | density |
| --- | --- | --- |
| California | 39538223 | 93.257784 |
| Texas | 29145505 | 41.896072 |
| Pennsylvania | 13002700 | 109.009893 |


In [44]:
# we can modify any value 
data.iloc[0,1] = 39600000
print(data)

# back to the original 
data.loc['California','pop'] = 39538223
print(f'\n{data}')

                area       pop     density
California    423967  39600000   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893

                area       pop     density
California    423967  39538223   93.257784
Texas         695662  29145505   41.896072
Florida       170312  21538187  126.463121
New York      141297  20201249  142.970120
Pennsylvania  119280  13002700  109.009893


---

# C. `DataFrame.filter()`

`filter()` is used to **select rows or columns by their labels** using:

* **Exact names (`items`)**
* **Substrings (`like`)**
* **Regular expressions (`regex`)**

By default, it works on **columns** (`axis=1`).
If you set `axis=0`, it filters rows by their index labels.


## 📌 Syntax

```python
DataFrame.filter(items=None, like=None, regex=None, axis=None)
```

### Parameters:

* **`items`** → List of exact labels to keep.
* **`like`** → Substring search (keeps labels containing this text).
* **`regex`** → Pattern search with regular expressions.
* **`axis`** → `0` = rows (index), `1` = columns (default).
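
Two details of these parameters are easy to miss: `items`, `like` and `regex` cannot be combined in a single call, and `items` silently skips labels that do not exist. A minimal sketch on a made-up one-row frame (the names `df`, `raised` and `kept` are illustrative):

```python
import pandas as pd

# Hypothetical frame with the same column names used in this section
df = pd.DataFrame({"A": [1], "C_score": [2], "score_C": [3]})

# Passing more than one of items/like/regex raises a TypeError
try:
    df.filter(items=["A"], like="C")
    raised = False
except TypeError as err:
    raised = True
    print(err)

# Labels in `items` that are not present are ignored rather than raising
kept = list(df.filter(items=["A", "not_a_column"]).columns)
print(kept)
```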



### 1. Filtering columns by exact names (items)

In [21]:
df = pd.DataFrame({
    "A": [1,2,3],
    "B": [4,5,6],
    "C_score": [7,8,9],
    "score_C": [10,11,12]
})

print(df.filter(items=["A", "C_score"]))

   A  C_score
0  1        7
1  2        8
2  3        9


### 2. Filtering columns by substring (like)

In [22]:
df.filter(like="C", axis=1)

|  | C_score | score_C |
| --- | --- | --- |
| 0 | 7 | 10 |
| 1 | 8 | 11 |
| 2 | 9 | 12 |


👉 Kept only columns containing "C".

### 3. Filtering columns using regex

In [23]:
df.filter(regex="^C", axis=1)   # keep columns starting with "C"

|  | C_score |
| --- | --- |
| 0 | 7 |
| 1 | 8 |
| 2 | 9 |


In [None]:
df.filter(regex=r"(^C|C$)", axis=1) # start or end with C

|  | C_score | score_C |
| --- | --- | --- |
| 0 | 7 | 10 |
| 1 | 8 | 11 |
| 2 | 9 | 12 |


### 4. Filtering rows by index (axis=0)

In [25]:
df.index = ["row1", "row2", "row3"]
df.filter(items=["row1", "row3"], axis=0)

|  | A | B | C_score | score_C |
| --- | --- | --- | --- | --- |
| row1 | 1 | 4 | 7 | 10 |
| row3 | 3 | 6 | 9 | 12 |

#### 🚀 When to use `filter()`?
- When you want label-based selection (not position-based like iloc).
- Very handy for selecting columns by name patterns (e.g., "score", "2024", "temp_").
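
As an illustration of the name-pattern use case, the sketch below uses a made-up frame whose column names (`temp_min`, `rain_2024`, and so on) are hypothetical and chosen only to match the patterns mentioned above.

```python
import pandas as pd

# Hypothetical weather-style frame; column names are invented for this example
df = pd.DataFrame({
    "temp_min": [1.0, 2.0],
    "temp_max": [9.0, 11.0],
    "rain_2024": [3.2, 0.0],
    "rain_2023": [2.8, 1.1],
})

# Substring match: every column whose name contains "temp_"
temps = df.filter(like="temp_")

# Regex match: every column whose name ends with "2024"
y2024 = df.filter(regex=r"2024$")

print(temps)
print(y2024)
```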

## 📊 Comparison Table: `.filter()` vs `.loc[]` vs `.iloc[]`

| Feature / Method         | `.filter()`                                                | `.loc[]`                                                   | `.iloc[]`                                            |
| ------------------------ | ---------------------------------------------------------- | ---------------------------------------------------------- | ---------------------------------------------------- |
| **Selection type**       | Label-based only (names, substrings, regex)                | Label-based (rows/cols by exact names, slices, conditions) | Position-based (row/col numbers)                     |
| **Axis default**         | Works on columns (`axis=1` by default)                     | Works on both rows and columns                             | Works on both rows and columns                       |
| **Supports regex?**      | ✅ Yes (via `regex`)                                       | ❌ No                                                      | ❌ No                                                |
| **Supports substrings?** | ✅ Yes (via `like`)                                        | ❌ No                                                      | ❌ No                                                |
| **Supports conditions?** | ❌ No                                                      | ✅ Yes (e.g., `df.loc[df['A'] > 2]`)                       | ❌ No                                                |
| **Use cases**            | Filtering by column/row **names/patterns**                 | Filtering by **labels** and applying **conditions**        | Pure **integer position** based selection            |
| **Example**              | `df.filter(like="score")` → selects cols containing "score" | `df.loc[:, ["A", "C_score"]]` → selects exact cols         | `df.iloc[:, [0, 2]]` → selects first and third column |
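
The rows of the table can be checked side by side. The following sketch rebuilds the `df` from the `filter()` examples and selects the same two columns with each method, then adds the one case only `.loc[]` handles directly (a row condition). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C_score": [7, 8, 9],
    "score_C": [10, 11, 12],
})

# The same two columns, selected three ways
by_filter = df.filter(items=["A", "C_score"])   # by exact label
by_loc = df.loc[:, ["A", "C_score"]]            # by exact label
by_iloc = df.iloc[:, [0, 2]]                    # by integer position

all_equal = by_filter.equals(by_loc) and by_loc.equals(by_iloc)

# A row condition combined with a column selection, using .loc
cond = df.loc[df["A"] > 1, ["A", "C_score"]]

print(all_equal)
print(cond)
```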


---