### Indexing, Selection, and Filtering

In [1]:
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 

- Series indexing works like NumPy array indexing, except that you can use the Series's index labels in addition to integers

In [2]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [3]:
obj['b']

np.float64(1.0)

In [4]:
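obj[1]  # integer on a non-integer index falls back to position; recent pandas emits a FutureWarning, prefer obj.iloc[1]

np.float64(1.0)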

In [5]:
obj[2:4]  # an integer slice is positional (end excluded)

c    2.0
d    3.0
dtype: float64

In [6]:
obj[["b", "a", "d"]]  # a list of labels selects those labels, in the given order

b    1.0
a    0.0
d    3.0
dtype: float64

In [7]:
obj[[1, 3]]  # a list of integers is taken as individual positions here (same deprecated fallback; prefer obj.iloc[[1, 3]])

b    1.0
d    3.0
dtype: float64

In [8]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

In [9]:
# you can select data by label this way, but the preferred way to select by index label is the loc operator

# reasons - clarity, flexibility, and explicit behavior
#         - [] treats integers differently depending on the index type


### Data Selection - regular, `loc[]`, `iloc[]` (important)

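-   when you use [] on a Series without explicitly specifying .loc or .iloc, how an integer is interpreted depends on the **type of index in your data**
    - if your index contains integers: integers in [] are treated as labels (not positions)
    - if your index is the default (0, 1, 2, ...): integers are still treated as labels, but each label equals its position, so the result looks positional
    - if your index is not integer-based (e.g., strings): integers fall back to positions, a behavior that recent pandas versions deprecate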


#### Examples:
Case 1: Default Index

In [10]:
# when we don't provide a custom index, it automatically assigns the default index (0,1,2,...)

s = pd.Series([12, 34, 56])

# regular indexing
print(s[0])     # 0 is looked up as a label; with the default index it matches position 0

# iloc (explicit position-based indexing)
print(s.iloc[0])

12
12


Case 2: Custom Integer Index

In [13]:
# If you explicitly provide an index with integers, [] treats the numbers as labels, not positions

# Custom index (1, 2, 3)
s = pd.Series([10, 20, 30], index=[1, 2, 3])

# Regular indexing
print(s[1])  # (1 is treated as a label, not a position)

# iloc (position-based indexing)
print(s.iloc[1])  # (second element by position)

10
20


Case 3: String Index

In [12]:
# If the index is not an integer (e.g., strings), there’s no confusion because labels and positions don’t overlap

# String index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Regular indexing
print(s["a"])  #  (Access by label)

# iloc (position-based indexing)
print(s.iloc[0])  #  (First element by position)

10
10


### **Comparison Table**

| **Operation**              | **Custom Integer Index** | **Default Index** |
|-----------------------------|--------------------------|-------------------|
| `[]`                       | Uses labels (e.g., 1, 2) | Uses labels, which coincide with positions |
| `.loc[]`                   | Always uses labels       | Always uses labels|
| `.iloc[]`                  | Always uses positions    | Always uses positions|
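
The default-index row of the table is easy to misread, so here is a minimal sketch of the difference. It assumes a recent pandas version; `label_lookup_failed` is a name introduced only for this illustration. A negative integer is a valid position but not a label in a default index, so `[]` and `.iloc[]` disagree.

```python
import pandas as pd

s = pd.Series([12, 34, 56])  # default index: labels 0, 1, 2

print(s[0])        # label 0, which happens to sit at position 0 -> 12
print(s.iloc[-1])  # position -1 is the last element -> 56

# With [] the integer is a label lookup, and there is no label -1.
label_lookup_failed = False
try:
    s[-1]
except KeyError:
    label_lookup_failed = True

print(label_lookup_failed)  # True
```

This is why `.iloc[]` is the safer choice whenever you mean "the n-th element".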

In [14]:
# Final Example

df = pd.DataFrame({'A': [10,20,30]}, index=[1,2,3])

# Regular indexing
print(df['A'][1])  # Label-based: (index label = 1)

# Position-based indexing
print(df['A'].iloc[1])  # Position-based: (second row)

# Explicit label-based indexing
print(df['A'].loc[1])  # Label-based

10
20
10


You can also slice with labels, but unlike normal Python slicing, the endpoint is inclusive:

In [15]:
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

obj2.loc['b':'c']

b    2
c    3
dtype: int64

In [17]:
# assigning values using these methods modifies the corresponding section of the Series:

obj2.loc['b':'c'] = 5
obj2

a    1
b    5
c    5
dtype: int64
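
To make the inclusive endpoint concrete, the sketch below compares a label slice with a position slice on a small Series. The name `demo` is hypothetical and used only here, so the notebook's `obj2` is left untouched.

```python
import pandas as pd

demo = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = demo.loc["a":"b"]    # label slice: endpoint "b" is included
by_position = demo.iloc[0:1]    # position slice: endpoint 1 is excluded

print(list(by_label.index))     # ['a', 'b']
print(list(by_position.index))  # ['a']
```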

NOTE:
- always use [] with .loc and .iloc
- don’t treat them like functions with (); they’re indexers, not function calls
    - [] is Python's syntax for indexing and slicing
    - it lets you specify ranges and combine row and column selection

### Indexing into a DataFrame

In [18]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)), 
                    index=["Ohio", "Colorado", "Utah", "New York"], 
                    columns=["one", "two", "three", "four"])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [29]:
data['two']

# important: passing a single label or a list of labels to the [] operator selects columns

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [21]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [22]:
# [] indexing on a DataFrame has a few special cases
# the first: a slice or a Boolean array selects rows instead of columns

data[:2]    # row selection syntax

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
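
A short sketch of the row-slicing special case, assuming a recent pandas version. `frame` is a hypothetical fresh copy of the DataFrame above, so the notebook's `data` is not modified. It shows that a slice inside `[]` selects rows, by position for integers and by label (endpoint included) for labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

# An integer slice inside [] selects rows by position, like .iloc
first_two = frame[:2]
print(first_two.equals(frame.iloc[:2]))             # True

# A label slice inside [] selects rows by label, like .loc (endpoint included)
by_label = frame["Ohio":"Utah"]
print(by_label.equals(frame.loc["Ohio":"Utah"]))    # True
print(list(by_label.index))                         # ['Ohio', 'Colorado', 'Utah']
```

Writing `.iloc[:2]` or `.loc["Ohio":"Utah"]` explicitly says the same thing with less room for surprise.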


Indexing with a Boolean DataFrame, such as one produced by a scalar comparison

In [30]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [32]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [33]:
data['three'] > 5 # will return boolean result

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [35]:
data[data['three'] > 5]

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
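
The filter above works in two steps: the comparison builds a Boolean Series, and that Series then picks the rows. A minimal sketch, using a hypothetical fresh `frame` with the original values 0 to 15 rather than the modified `data`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

mask = frame["three"] > 5        # Boolean Series sharing frame's row index
print(mask.dtype)                # bool

selected = frame[mask]           # keeps only the rows where mask is True
print(list(selected.index))      # ['Colorado', 'Utah', 'New York']

# .loc accepts the same mask and returns the same rows
print(selected.equals(frame.loc[mask]))  # True
```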


### **Selection on DataFrame with loc and iloc**

#### Key Difference Between `.loc` and `.iloc`

| `.loc` (Label-based)           | `.iloc` (Position-based)         |
|--------------------------------|-----------------------------------|
| Access data using **row and column labels**. | Access data using **row and column positions** (like Python lists). |
| Works with **row/column names**. | Works with **integer-based indices**. |
| Includes the endpoint in slices. | Excludes the endpoint in slices. |

---


#### Summary of Features

| **Feature**                     | `.loc` (Labels)                | `.iloc` (Positions)            |
|----------------------------------|--------------------------------|---------------------------------|
| Access single value              | `df.loc[row_label, col_label]` | `df.iloc[row_pos, col_pos]`    |
| Select specific rows/columns     | List of labels: `df.loc[[...]]`| List of indices: `df.iloc[[...]]`|
| Slice rows/columns               | `df.loc[start:end]` (inclusive)| `df.iloc[start:end]` (exclusive)|
| Select all rows/columns          | Use `:`: `df.loc[:, 'col']`    | Use `:`: `df.iloc[:, 1]`       |

---

#### Key Takeaways
1. **Use `.loc`** when working with **labels** (row and column names).
2. **Use `.iloc`** when working with **positions** (integer indices).
3. Both allow you to combine row and column indexing using `[]` notation.
4. Remember:
   - `.loc` slices include the **end index**.
   - `.iloc` slices **exclude** the end index.
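
The difference matters most when the row labels are themselves integers, because the same slice then means two different things. A small sketch with an integer-labeled DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]}, index=[1, 2, 3])

# .loc: 1 and 2 are labels, and the end label is included
print(df.loc[1:2, "A"].tolist())   # [10, 20]

# .iloc: 1 and 2 are positions, and the end position is excluded
print(df.iloc[1:2, 0].tolist())    # [20]
```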

In [36]:
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


#### Selection Using .loc & .iloc

1. Select a Single Value

In [37]:
# using .loc (label-based)

print(data.loc['Utah', 'three']) # value at row 'Utah' and column 'three'

10


In [39]:
# using .iloc (position-based)

print(data.iloc[2, 3]) # value at the 3rd row (index 2) and 4th column (index 3)

11


2. Select Entire Row

In [40]:
# using .loc

print(data.loc['Colorado']) # select all columns for row 'Colorado' 

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64


In [41]:
# using .iloc

print(data.iloc[1])     # select all columns for the 2nd row (index 1)

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64


3. Select Entire Column

In [42]:
# using .loc

print(data.loc[:, 'two']) # select all rows for column 'two' 

Ohio         0
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64


In [44]:
# using .iloc

print(data.iloc[:, 1])   # select all rows for the 2nd column (index 1)

Ohio         0
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64


4. Select Multiple Rows and Columns

In [45]:
# using .loc

print(data.loc[["Ohio", "Utah"], ['one', 'three']])  # selects rows 'Ohio' and 'Utah', and columns 'one' and 'three'

      one  three
Ohio    0      0
Utah    8     10


In [46]:
# using .iloc

print(data.iloc[[0, 2], [0, 2]])  # select 1st and 3rd rows, and 1st and 3rd columns

      one  three
Ohio    0      0
Utah    8     10


5. Slicing Rows and Columns

In [47]:
# using .loc

print(data.loc['Ohio':'Utah', 'one':'three'])   # slice rows 'Ohio' to 'Utah', and columns 'one' to 'three'

          one  two  three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10


In [48]:
# using .iloc

print(data.iloc[0:3, 0:3]) # slice rows 0 to 3 (exclusive) and columns 0 to 3 (exclusive)

          one  two  three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10


6. Boolean Indexing with .loc

In [49]:
print(data.loc[data['three'] > 6]) # select rows where column 'three' has values greater than 6

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


### Notes:
1. `.loc` works with **row and column labels** (names).
2. `.iloc` works with **row and column integer positions** (index numbers).
3. `.loc` slices **include the endpoint**.
4. `.iloc` slices **exclude the endpoint**.

#### .at & .iat
1. .at
    - works with row and column labels (label-based indexing)
    - best used for accessing or updating a single value
    - equivalent to data.loc[row_label, column_label] for a single value, but faster
2. .iat
    - works with row and column integer positions (position-based indexing)
    - best used for accessing or updating a single value
    - equivalent to data.iloc[row_position, column_position] for a single value, but faster

In [51]:
# Accessing a single value 

# using .at (label based)
print(data.at['Ohio', 'three'])

# using .iat (position based)
print(data.iat[0, 2])

0
0


In [53]:
# Modifying a single value

# using .at
data.at['Colorado', 'two'] = 99
print(data)

# using .iat
data.iat[3, 1] = 77
print(data)

          one  two  three  four
Ohio        0    0      0     0
Colorado    0   99      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Ohio        0    0      0     0
Colorado    0   99      6     7
Utah        8    9     10    11
New York   12   77     14    15


#### Comparison with `.loc` and `.iloc`

| **Method**    | **Access Type**      | **Example**                | **Performance** |
|---------------|----------------------|----------------------------|-----------------|
| `.loc`        | Label-based          | `data.loc['Ohio', 'three']`| Slower          |
| `.iloc`       | Position-based       | `data.iloc[0, 2]`          | Slower          |
| `.at`         | Label-based (single) | `data.at['Ohio', 'three']` | Faster          |
| `.iat`        | Position-based (single)| `data.iat[0, 2]`         | Faster          |

---

### Key Points to Remember
- **`.at`** and **`.iat`** are used for single-value access or updates only.
- For accessing slices or multiple values, continue to use `.loc` and `.iloc`.
- Both `.at` and `.iat` improve performance for single-cell operations in large DataFrames.
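
One rough way to check the performance claim on your own machine is to time both accessors with the standard-library `timeit` module. This is only a sketch: `frame`, `loc_time`, and `at_time` are names introduced here, and the timings depend on your hardware and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

# Both accessors return the same scalar
same_value = frame.at["Utah", "three"] == frame.loc["Utah", "three"]
print(same_value)

# Time 10,000 single-cell reads with each accessor
loc_time = timeit.timeit(lambda: frame.loc["Utah", "three"], number=10_000)
at_time = timeit.timeit(lambda: frame.at["Utah", "three"], number=10_000)

print(f".loc: {loc_time:.4f}s   .at: {at_time:.4f}s")
```

The gap only matters when you read or write single cells many times, for example inside a loop.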


### Summary Table

| **Indexing Method**            | **Syntax**                                 | **Purpose**                                                                                       | **Example**                                |
|--------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------|--------------------------------------------|
| Access a Column            | `data['column_name']`                      | Selects a single column from the DataFrame.                                                       | `data['one']`                              |
| Access Multiple Columns    | `data[['col1', 'col2']]`                   | Selects multiple columns as a new DataFrame.                                                      | `data[['one', 'three']]`                   |
| Access a Row by Label      | `data.loc['row_label']`                    | Selects a row based on its label.                                                                 | `data.loc['Ohio']`                         |
| Access a Row by Position   | `data.iloc[row_index]`                     | Selects a row based on its integer position.                                                      | `data.iloc[0]`                             |
| Access a Single Value      | `data.loc['row', 'column']`                | Selects a single value based on row and column labels.                                            | `data.loc['Utah', 'three']`                |
|                                | `data.iloc[row, col]`                      | Selects a single value based on integer positions of row and column.                              | `data.iloc[2, 2]`                          |
| Access a Single Value Faster| `data.at['row', 'column']`                | Optimized for accessing a single value based on row and column labels.                            | `data.at['Utah', 'three']`                 |
|                                | `data.iat[row, col]`                       | Optimized for accessing a single value based on row and column positions.                         | `data.iat[2, 2]`                           |
| Slice Rows by Labels       | `data.loc['start':'end']`                  | Selects a range of rows by their labels (inclusive of the endpoint).                              | `data.loc['Ohio':'Utah']`                  |
| Slice Rows by Position     | `data.iloc[start:end]`                     | Selects a range of rows by their integer positions (exclusive of the endpoint).                   | `data.iloc[0:2]`                           |
| Slice Columns              | `data.loc[:, 'start':'end']`               | Selects a range of columns by their labels.                                                      | `data.loc[:, 'one':'three']`               |
|                                | `data.iloc[:, start:end]`                  | Selects a range of columns by their positions.                                                    | `data.iloc[:, 0:3]`                        |
| Boolean Indexing           | `data.loc[data['column'] > value]`         | Filters rows based on a condition applied to a column.                                            | `data.loc[data['three'] > 6]`              |
| Select Multiple Rows/Cols  | `data.loc[['row1', 'row2'], ['col1', 'col2']]` | Selects specific rows and columns based on their labels.                                           | `data.loc[['Ohio', 'Utah'], ['one', 'four']]` |
|                                | `data.iloc[[row1, row2], [col1, col2]]`    | Selects specific rows and columns based on their positions.                                       | `data.iloc[[0, 2], [0, 3]]`                |
| Entire DataFrame Slice     | `data.loc[:, :]`                           | Selects all rows and all columns (entire DataFrame).                                              | `data.loc[:, :]`                           |
| Using Boolean Conditions   | `data[data['column'] > value]`             | Filters rows directly using boolean conditions on a column (without `.loc`).                      | `data[data['three'] > 6]`                  |
| Set a Value               | `data.loc['row', 'col'] = value`           | Assigns a value to a specific cell based on row and column labels.                                | `data.loc['Utah', 'four'] = 20`            |
|                                | `data.iloc[row, col] = value`              | Assigns a value to a specific cell based on row and column positions.                             | `data.iloc[2, 3] = 20`                     |
| Set a Single Value Faster  | `data.at['row', 'column'] = value`         | Assigns a value to a single cell (optimized version of `.loc`).                                   | `data.at['Utah', 'four'] = 20`             |
|                                | `data.iat[row, col] = value`               | Assigns a value to a single cell (optimized version of `.iloc`).                                  | `data.iat[2, 3] = 20`                      |
| Reindex Rows/Columns      | `data.reindex(new_labels)`                 | Returns a new DataFrame whose rows follow the new labels; labels not in the original are filled with NaN. | `data.reindex(['Ohio', 'Utah', 'Texas'])`  |
|                                | `data.reindex(columns=new_labels)`         | Returns a new DataFrame whose columns follow the new labels; missing columns are filled with NaN. | `data.reindex(columns=['one', 'five'])`    |
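
The two `reindex` rows are the only entries in the table not demonstrated earlier, so here is a minimal sketch of what they do. `frame` is a hypothetical fresh copy of the example DataFrame, and 'Texas' and 'five' are labels that do not exist in it.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

# Row reindex: existing labels are kept in the new order, 'Texas' is added as NaN
rows = frame.reindex(["Ohio", "Utah", "Texas"])
print(rows)

# Column reindex: 'one' is kept, 'five' is added as a column of NaN
cols = frame.reindex(columns=["one", "five"])
print(cols)

# reindex returns a new object; the original is unchanged
print("Texas" in frame.index)  # False
```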
