# Level 4: Data Selection & Indexing

Selecting and filtering data efficiently is a core skill in data analysis. This section covers the main pandas tools for it: column and row selection, index manipulation, and conditional filtering.

In [1]:
import pandas as pd
import numpy as np

# Let's create a more detailed dataset for these examples
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
    'Score': [88, 92, 78, 85, 95, 62, 75]
}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
1,Bob,30,LA,92
2,Charlie,35,Chicago,78
3,David,40,NY,85
4,Eva,22,LA,95
5,Frank,45,Chicago,62
6,Grace,28,NY,75


## 4.1 Selecting Data

### Column Selection

In [2]:
# Selecting a single column (returns a Series)
df['Name']

0      Alice
1        Bob
2    Charlie
3      David
4        Eva
5      Frank
6      Grace
Name: Name, dtype: object

In [3]:
# Selecting multiple columns (returns a DataFrame)
df[['Name', 'City']]

Unnamed: 0,Name,City
0,Alice,NY
1,Bob,LA
2,Charlie,Chicago
3,David,NY
4,Eva,LA
5,Frank,Chicago
6,Grace,NY
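
The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. It uses a smaller, made-up frame than the one above, and the variable names `as_series` and `as_frame` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['NY', 'LA', 'Chicago'],
})

as_series = df['Name']    # a single label returns a Series
as_frame = df[['Name']]   # a list with one label returns a one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```

The one-column DataFrame form is useful when later code expects a two-dimensional object.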


### Row Selection: `.loc` and `.iloc`

**`.loc[]` (Label-based selection)**
- Selects rows and columns by their labels (index labels for rows, column names for columns).

In [4]:
# Selecting a single row by its index label
df.loc[0]

Name     Alice
Age         25
City        NY
Score       88
Name: 0, dtype: object

In [5]:
# Selecting multiple rows by labels
df.loc[[0, 2, 4]]

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
2,Charlie,35,Chicago,78
4,Eva,22,LA,95


In [6]:
# Selecting a range of rows
df.loc[1:3] # Note: inclusive of the end label

Unnamed: 0,Name,Age,City,Score
1,Bob,30,LA,92
2,Charlie,35,Chicago,78
3,David,40,NY,85


In [7]:
# Selecting rows and columns simultaneously
df.loc[0, 'Name'] # Row 0, Column 'Name'

'Alice'

In [8]:
df.loc[1:3, ['Name', 'Score']] # Rows 1 to 3, Columns 'Name' and 'Score'

Unnamed: 0,Name,Score
1,Bob,92
2,Charlie,78
3,David,85


**`.iloc[]` (Position-based selection)**
- Selects data based on integer positions (from 0 to length-1).

In [9]:
# Selecting the first row (at position 0)
df.iloc[0]

Name     Alice
Age         25
City        NY
Score       88
Name: 0, dtype: object

In [10]:
# Selecting rows at positions 0, 2, 4
df.iloc[[0, 2, 4]]

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
2,Charlie,35,Chicago,78
4,Eva,22,LA,95


In [11]:
# Selecting a range of rows
df.iloc[1:4] # Note: exclusive of the end position (like standard Python slicing)

Unnamed: 0,Name,Age,City,Score
1,Bob,30,LA,92
2,Charlie,35,Chicago,78
3,David,40,NY,85


In [12]:
# Selecting rows and columns by position
df.iloc[0, 0] # Row at position 0, Column at position 0

'Alice'

In [13]:
df.iloc[1:4, [0, 3]] # Rows 1 to 3, Columns 0 and 3

Unnamed: 0,Name,Score
1,Bob,92
2,Charlie,78
3,David,85
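
With the default integer index, `.loc` and `.iloc` often return the same rows, which hides the difference between them. The sketch below makes it visible by sorting a small, made-up frame so that labels and positions no longer line up. The name `older_first` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})

# Sorting keeps the original labels, so the index now reads 3, 2, 1, 0.
older_first = df.sort_values('Age', ascending=False)

by_label = older_first.loc[0, 'Name']   # the row whose label is 0
by_position = older_first.iloc[0, 0]    # the first row as displayed

print(by_label)     # Alice
print(by_position)  # David
```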


### Boolean Indexing
Boolean indexing filters rows with a Series of `True`/`False` values: rows where the value is `True` are kept, and the rest are dropped.

In [14]:
# Create a boolean Series
is_age_over_30 = df['Age'] > 30
print(is_age_over_30)

0    False
1    False
2     True
3     True
4    False
5     True
6    False
Name: Age, dtype: bool


In [15]:
# Use the boolean Series to filter the DataFrame
df[is_age_over_30]

Unnamed: 0,Name,Age,City,Score
2,Charlie,35,Chicago,78
3,David,40,NY,85
5,Frank,45,Chicago,62


In [16]:
# This is more commonly written in one line
df[df['Age'] > 30]

Unnamed: 0,Name,Age,City,Score
2,Charlie,35,Chicago,78
3,David,40,NY,85
5,Frank,45,Chicago,62
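
A boolean Series can also be passed to `.loc`, which lets you pick columns in the same step. The following is a small sketch using the same data as above; `mask` and `names_over_30` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
    'Score': [88, 92, 78, 85, 95, 62, 75],
})

mask = df['Age'] > 30

# Rows where the mask is True, restricted to the 'Name' column.
names_over_30 = df.loc[mask, 'Name']
print(names_over_30.tolist())  # ['Charlie', 'David', 'Frank']

# The same rows with two columns.
print(df.loc[mask, ['Name', 'Score']])
```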


## 4.2 Setting & Resetting Index

### `.set_index()`
You can set any column as the DataFrame's index. By default, `.set_index()` returns a new DataFrame and leaves the original unchanged.

In [17]:
df_named_index = df.set_index('Name')
df_named_index

Name,Age,City,Score
Alice,25,NY,88
Bob,30,LA,92
Charlie,35,Chicago,78
David,40,NY,85
Eva,22,LA,95
Frank,45,Chicago,62
Grace,28,NY,75


In [18]:
# Now you can use .loc with names
df_named_index.loc['Bob']

Age      30
City     LA
Score    92
Name: Bob, dtype: object
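
Label-based slicing works with a string index as well. The sketch below uses a shortened, made-up version of the data to show that the slice includes both endpoints and that the original frame still has its `Name` column.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
})
df_named_index = df.set_index('Name')

# Label slices include both endpoints, here 'Bob' and 'David'.
ages = df_named_index.loc['Bob':'David', 'Age']
print(ages.tolist())  # [30, 35, 40]

# set_index returned a new object, so 'Name' is still a column of df.
print('Name' in df.columns)  # True
```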

### `.reset_index()`
This moves the index back to a regular column and restores the default integer index.

In [19]:
df_named_index.reset_index()

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
1,Bob,30,LA,92
2,Charlie,35,Chicago,78
3,David,40,NY,85
4,Eva,22,LA,95
5,Frank,45,Chicago,62
6,Grace,28,NY,75
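
A common use of `.reset_index()` is renumbering rows after a filter. In that case the old labels are usually not worth keeping, and `drop=True` discards them instead of turning them into a column. A minimal sketch, with `filtered` and `renumbered` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
})

filtered = df[df['Age'] > 30]
print(list(filtered.index))    # [2, 3, 5]

# drop=True throws the old labels away rather than adding an 'index' column.
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```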


### MultiIndex (Basics)
You can create a hierarchical index by setting multiple columns as the index.

In [20]:
df_multi_index = df.set_index(['City', 'Name'])
df_multi_index

City,Name,Age,Score
NY,Alice,25,88
LA,Bob,30,92
Chicago,Charlie,35,78
NY,David,40,85
LA,Eva,22,95
Chicago,Frank,45,62
NY,Grace,28,75
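
Once a hierarchical index is in place, `.loc` accepts either a first-level label or a tuple with one label per level. The sketch below sorts the index first, which is an addition to the example above: selecting by tuple on an unsorted MultiIndex can trigger a performance warning.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
    'Score': [88, 92, 78, 85, 95, 62, 75],
})
df_multi_index = df.set_index(['City', 'Name']).sort_index()

# A first-level label selects every row for that city; the result is indexed by Name.
ny_rows = df_multi_index.loc['NY']
print(ny_rows.index.tolist())  # ['Alice', 'David', 'Grace']

# A tuple with one label per level selects a single row as a Series.
alice = df_multi_index.loc[('NY', 'Alice')]
print(alice['Score'])          # 88
```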


## 4.3 Filtering Data (Advanced)

### Multiple Conditions
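Use `&` for AND, `|` for OR, and `~` for NOT. Wrap each condition in parentheses `()`: `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so without parentheses the expression is grouped incorrectly.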

In [21]:
# Age > 30 AND City is 'NY'
df[(df['Age'] > 30) & (df['City'] == 'NY')]

Unnamed: 0,Name,Age,City,Score
3,David,40,NY,85
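
The same pattern extends to OR and to nested combinations. This is a small sketch using the data from above; the conditions themselves are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
    'Score': [88, 92, 78, 85, 95, 62, 75],
})

# Age > 40 OR City is 'LA'
either = df[(df['Age'] > 40) | (df['City'] == 'LA')]
print(either['Name'].tolist())    # ['Bob', 'Eva', 'Frank']

# (Age > 40 OR City is 'LA') AND Score > 90
combined = df[((df['Age'] > 40) | (df['City'] == 'LA')) & (df['Score'] > 90)]
print(combined['Name'].tolist())  # ['Bob', 'Eva']
```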


### `.isin()`
Filter rows where a column's value is in a list of allowed values.

In [22]:
allowed_cities = ['NY', 'LA']
df[df['City'].isin(allowed_cities)]

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
1,Bob,30,LA,92
3,David,40,NY,85
4,Eva,22,LA,95
6,Grace,28,NY,75
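
Pandas has no separate "not in" method; the usual approach is to negate the `.isin()` mask with `~`. A minimal sketch, where `excluded_cities` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
})

excluded_cities = ['NY', 'LA']

# ~ flips the boolean mask, keeping rows whose city is NOT in the list.
elsewhere = df[~df['City'].isin(excluded_cities)]
print(elsewhere['Name'].tolist())  # ['Charlie', 'Frank']
```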


### `.between()`
Filter for values within a range. Both endpoints are included by default.

In [23]:
df[df['Score'].between(80, 90)]

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
3,David,40,NY,85
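
Endpoint handling can be changed with the `inclusive` parameter. The sketch below assumes pandas 1.3 or later, where `inclusive` takes the strings `'both'`, `'neither'`, `'left'`, or `'right'`; older versions expect a boolean instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Score': [88, 92, 78, 85, 95, 62, 75],
})

# Default: 85 <= Score <= 92
both = df[df['Score'].between(85, 92)]
print(both['Name'].tolist())     # ['Alice', 'Bob', 'David']

# Strict on both sides: 85 < Score < 92
neither = df[df['Score'].between(85, 92, inclusive='neither')]
print(neither['Name'].tolist())  # ['Alice']

# Closed on the left only: 85 <= Score < 92
left = df[df['Score'].between(85, 92, inclusive='left')]
print(left['Name'].tolist())     # ['Alice', 'David']
```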


### `.str.contains()`
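Filter string columns based on a substring. The match is case-sensitive, and by default the pattern is interpreted as a regular expression; pass `regex=False` to match it literally.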

In [24]:
# Find names that contain a lowercase 'a' (case-sensitive, so 'Alice' is excluded)
df[df['Name'].str.contains('a')]

Unnamed: 0,Name,Age,City,Score
2,Charlie,35,Chicago,78
3,David,40,NY,85
4,Eva,22,LA,95
5,Frank,45,Chicago,62
6,Grace,28,NY,75
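
Three optional arguments cover most practical variations of `.str.contains()`: `case` for case-insensitive matching, a regular-expression pattern for anchored matches, and `na` for missing values. The following sketch is illustrative; the Series `with_missing` is made up to show the `na` behavior.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
})

# case=False matches 'a' and 'A', so 'Alice' is now included.
any_a = df[df['Name'].str.contains('a', case=False)]
print(any_a['Name'].tolist())
# ['Alice', 'Charlie', 'David', 'Eva', 'Frank', 'Grace']

# The pattern is a regular expression by default: names starting with A, B, or C.
starts_a_to_c = df[df['Name'].str.contains('^[A-C]')]
print(starts_a_to_c['Name'].tolist())  # ['Alice', 'Bob', 'Charlie']

# na=False treats missing values as non-matches, so the mask is safe for filtering.
with_missing = pd.Series(['Alice', None, 'Eva'])
mask = with_missing.str.contains('a', na=False)
print(mask.tolist())  # [False, False, True]
```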


### `.query()`
A convenient way to filter using a string expression. This can be more readable for complex conditions.

In [25]:
df.query("Age > 30 and City == 'NY'")

Unnamed: 0,Name,Age,City,Score
3,David,40,NY,85


In [26]:
# You can also use variables in a query string by prefixing them with '@'
min_score = 80
df.query("Score > @min_score")

Unnamed: 0,Name,Age,City,Score
0,Alice,25,NY,88
1,Bob,30,LA,92
3,David,40,NY,85
4,Eva,22,LA,95
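
Two more `.query()` features are worth a short sketch: the `in` operator for membership tests, and backticks for column names that are not valid Python identifiers. The renamed column `Test Score` is hypothetical and exists only to demonstrate the backtick syntax.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 40, 22, 45, 28],
    'City': ['NY', 'LA', 'Chicago', 'NY', 'LA', 'Chicago', 'NY'],
    'Score': [88, 92, 78, 85, 95, 62, 75],
})

# 'in' plays the role of .isin(); @ refers to a Python variable.
allowed_cities = ['NY', 'LA']
top_coastal = df.query("City in @allowed_cities and Score >= 85")
print(top_coastal['Name'].tolist())  # ['Alice', 'Bob', 'David', 'Eva']

# Backticks quote column names that contain spaces.
renamed = df.rename(columns={'Score': 'Test Score'})
high = renamed.query("`Test Score` > 90")
print(high['Name'].tolist())         # ['Bob', 'Eva']
```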
