# Session 15: Filtering, Selecting, and Aggregation

Now that you know how to create and explore DataFrames, it's time to learn how to slice, dice, and summarize your data. These are the core operations you'll use in almost every data analysis task.

## Learning Objectives

By the end of this session, you will be able to:
- Select columns by name and position
- Select rows using `loc` and `iloc`
- Filter data using boolean indexing and the `query` method
- Apply functions to columns with `apply` and `map`
- Sort DataFrames
- Aggregate data using built-in methods
- Group data and perform grouped aggregations

In [None]:
import pandas as pd

# Create a sample dataset we'll use throughout this session
# This represents sales data for a retail company

sales_data = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'date': ['2024-01-15', '2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17',
             '2024-01-17', '2024-01-18', '2024-01-18', '2024-01-19', '2024-01-19'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop',
                'Headphones', 'Mouse', 'Laptop', 'Monitor', 'Keyboard'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics',
                 'Accessories', 'Accessories', 'Electronics', 'Electronics', 'Accessories'],
    'quantity': [2, 5, 3, 1, 1, 4, 10, 3, 2, 5],
    'unit_price': [999.99, 29.99, 79.99, 299.99, 1099.99, 149.99, 24.99, 899.99, 349.99, 69.99],
    'region': ['North', 'South', 'North', 'East', 'West', 'South', 'North', 'East', 'West', 'South']
})

# Calculate total for each order
sales_data['total'] = sales_data['quantity'] * sales_data['unit_price']

sales_data

## 1. Selecting Columns

There are multiple ways to select columns from a DataFrame.

### Single Column Selection

In [None]:
# Using bracket notation (returns a Series)
products = sales_data['product']
type(products), products

In [None]:
# Using dot notation (works only when the name is a valid Python identifier
# and does not clash with an existing DataFrame attribute or method)
quantities = sales_data.quantity
quantities
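
The limits of dot notation are easiest to see on a small example. The sketch below uses a hypothetical frame (not the session's `sales_data`) with one column name that contains a space and another, `count`, that shares its name with a DataFrame method. Bracket notation is the safe default in both cases.

```python
import pandas as pd

# Hypothetical frame, made up for illustration only
df = pd.DataFrame({
    'unit price': [9.99, 19.99],  # contains a space: df.unit price is a syntax error
    'count': [3, 7],              # same name as the DataFrame.count method
})

# Bracket notation always returns the column
count_column = df['count']

# Dot notation finds the DataFrame.count method first, not the column
count_attribute = df.count

print(type(count_column).__name__)  # Series
print(callable(count_attribute))    # True: this is a method, not data
```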

### Multiple Column Selection

In [None]:
# Using a list of column names (returns a DataFrame)
subset = sales_data[['product', 'quantity', 'total']]
type(subset), subset

In [None]:
# Reordering columns
reordered = sales_data[['order_id', 'date', 'product', 'total']]
reordered.head()

## 2. Selecting Rows with loc and iloc

Pandas provides two main indexers for selecting rows:
- `loc`: Label-based selection (uses index labels)
- `iloc`: Integer-based selection (uses integer positions)

### iloc: Integer-based Indexing

In [None]:
# Select a single row by position
first_row = sales_data.iloc[0]
first_row

In [None]:
# Select multiple rows by position
first_three = sales_data.iloc[0:3]
first_three

In [None]:
# Select specific rows
specific_rows = sales_data.iloc[[0, 3, 7]]
specific_rows

In [None]:
# Select rows and columns by position
# iloc[rows, columns]
subset = sales_data.iloc[0:3, 0:4]
subset

In [None]:
# Last row
last_row = sales_data.iloc[-1]
last_row

### loc: Label-based Indexing

In [None]:
# Create a DataFrame with a meaningful index
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'department': ['IT', 'HR', 'IT', 'Sales'],
    'salary': [75000, 55000, 80000, 65000]
}, index=['E001', 'E002', 'E003', 'E004'])

employees

In [None]:
# Select by label
employee = employees.loc['E002']
employee

In [None]:
# Select multiple rows by label
some_employees = employees.loc[['E001', 'E003']]
some_employees

In [None]:
# Select rows and specific columns
# loc[rows, columns]
result = employees.loc[['E001', 'E003'], ['name', 'salary']]
result

In [None]:
# loc with the default integer index: the numbers are treated as labels,
# which happen to coincide with the positions here
# Note: loc includes the end point in slices!
sales_data.loc[0:2, ['product', 'total']]  # Returns rows 0, 1, AND 2

### Key Difference: loc vs iloc with Slices

- `iloc[0:3]` returns rows 0, 1, 2 (excludes 3)
- `loc[0:3]` returns rows 0, 1, 2, 3 (includes 3)

In [None]:
# Demonstration
sales_data.iloc[0:3]['order_id'].tolist(), sales_data.loc[0:3]['order_id'].tolist()
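
With the default index, labels and positions coincide, which can hide the difference between the two indexers. A minimal sketch, using a small made-up `orders` frame, shows what happens once filtering leaves gaps in the index:

```python
import pandas as pd

# Hypothetical data for illustration
orders = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'total': [1999.98, 149.95, 239.97, 299.99, 1099.99],
})

# Filtering keeps the original index labels
large = orders[orders['total'] > 1000]
labels = large.index.tolist()  # [0, 4]

# iloc counts rows: position 1 is the second remaining row
second_by_position = large['order_id'].iloc[1]

# loc looks up the label: the same row carries label 4
by_label = large['order_id'].loc[4]

# Label 1 was filtered out, so loc raises KeyError where iloc[1] succeeded
try:
    large.loc[1]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

print(labels, second_by_position, by_label, missing_label_raised)
```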

## 3. Boolean Indexing (Filtering)

One of the most powerful features of Pandas is the ability to filter data using boolean conditions.

In [None]:
# Basic filter: orders with total > 1000
high_value = sales_data[sales_data['total'] > 1000]
high_value

In [None]:
# What's happening under the hood?
# The condition creates a boolean Series
mask = sales_data['total'] > 1000
mask

In [None]:
# Filter by category
electronics = sales_data[sales_data['category'] == 'Electronics']
electronics

In [None]:
# Multiple conditions: use & (and), | (or), ~ (not)
# IMPORTANT: Each condition must be in parentheses!

# Electronics with total > 500
filtered = sales_data[(sales_data['category'] == 'Electronics') & (sales_data['total'] > 500)]
filtered
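
The parentheses matter because `&` binds more tightly than `>` and `<` in Python. The sketch below, on a made-up two-column frame, compares the parenthesized form with the unparenthesized one, which Python reads as a chained comparison and which fails with a `ValueError`.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({'quantity': [1, 5, 10], 'stock': [3, 8, 2]})

# With parentheses: each comparison is evaluated first, then combined
mask = (df['quantity'] > 2) & (df['stock'] < 5)

# Without parentheses Python evaluates 2 & df['stock'] first and reads the
# rest as a chained comparison: df['quantity'] > (2 & df['stock']) < 5.
# Chaining needs the truth value of a whole Series, which is ambiguous.
try:
    df['quantity'] > 2 & df['stock'] < 5
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__

print(mask.tolist(), error_name)
```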

In [None]:
# Using OR: Electronics OR total > 1000
filtered_or = sales_data[(sales_data['category'] == 'Electronics') | (sales_data['total'] > 1000)]
filtered_or

In [None]:
# Using NOT: NOT Electronics
not_electronics = sales_data[~(sales_data['category'] == 'Electronics')]
not_electronics

In [None]:
# Using isin() for multiple values
selected_regions = sales_data[sales_data['region'].isin(['North', 'South'])]
selected_regions

In [None]:
# String methods for filtering
# Products that contain 'o' (case=False makes the match case-insensitive, so 'O' counts too)
contains_o = sales_data[sales_data['product'].str.contains('o', case=False)]
contains_o

In [None]:
# Products that start with 'M'
starts_m = sales_data[sales_data['product'].str.startswith('M')]
starts_m
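
The session's data has no missing product names, but real data often does. In many pandas versions a missing value carries over into the mask produced by a string method, and such a mask cannot be used for filtering. A minimal sketch, assuming a made-up Series with one missing entry, uses the `na=False` argument to treat missing values as non-matches:

```python
import pandas as pd

# Hypothetical product names with one missing value
products = pd.Series(['Laptop', None, 'Mouse', 'Monitor'])

# na=False: a missing name counts as "no match", so the mask is purely boolean
mask = products.str.contains('o', case=False, na=False)
matches = products[mask].tolist()

print(mask.tolist())
print(matches)
```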

## 4. The query() Method

The `query()` method provides a more readable way to filter data using a string expression.

In [None]:
# Simple query
sales_data.query('total > 1000')

In [None]:
# Multiple conditions (inside query you can write 'and' / 'or'; & and | also work)
sales_data.query('category == "Electronics" and total > 500')

In [None]:
# Using 'in' for multiple values
sales_data.query('region in ["North", "South"]')

In [None]:
# Using variables in query with @
min_total = 500
max_total = 2000
sales_data.query('@min_total < total < @max_total')
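
Column names that are not valid Python identifiers, such as names with spaces, can still be used in `query()` by wrapping them in backticks. The frame and the `threshold` variable below are made up for illustration:

```python
import pandas as pd

# Hypothetical frame with a space in a column name
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Monitor'],
    'unit price': [999.99, 29.99, 299.99],
})

threshold = 100

# Backticks quote the column name; @ refers to the local Python variable
expensive = df.query('`unit price` > @threshold')

print(expensive['product'].tolist())
```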

## 5. Apply and Map

These methods let you apply functions to your data.

### apply() - Apply a function along an axis

In [None]:
# Apply a function to a Series (column)
def categorize_total(total):
    if total < 100:
        return 'Small'
    elif total < 500:
        return 'Medium'
    else:
        return 'Large'

sales_data['order_size'] = sales_data['total'].apply(categorize_total)
sales_data[['order_id', 'total', 'order_size']]
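
For binning numbers into labeled ranges, `pd.cut` is a common alternative to `apply` with a hand-written function. The sketch below assumes the same thresholds as `categorize_total` (100 and 500) and uses a small made-up Series. `right=False` makes each bin include its lower edge, which matches the `<` comparisons in the function.

```python
import numpy as np
import pandas as pd

# Hypothetical totals, including values exactly on the thresholds
totals = pd.Series([49.95, 100.0, 239.97, 500.0, 1999.98])

def categorize_total(total):
    if total < 100:
        return 'Small'
    elif total < 500:
        return 'Medium'
    else:
        return 'Large'

# Bins are [-inf, 100), [100, 500), [500, inf) because right=False
sizes_cut = pd.cut(
    totals,
    bins=[-np.inf, 100, 500, np.inf],
    labels=['Small', 'Medium', 'Large'],
    right=False,
)
sizes_apply = totals.apply(categorize_total)

print(sizes_cut.astype(str).tolist())
print(sizes_apply.tolist())
```

Note that `pd.cut` returns a categorical column with ordered labels, whereas `apply` here returns plain strings.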

In [None]:
# Using lambda for simple transformations (here: multiplying by 1.21 to add 21% tax)
sales_data['total_with_tax'] = sales_data['total'].apply(lambda x: x * 1.21)
sales_data[['order_id', 'total', 'total_with_tax']].head()
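
For plain arithmetic like this, `apply` is not required: operators work on a whole Series at once, which is usually shorter and faster. A minimal sketch on made-up totals, with the rate pulled into a hypothetical `TAX_RATE` constant matching the 1.21 multiplier above:

```python
import numpy as np
import pandas as pd

# Hypothetical totals for illustration
totals = pd.Series([1999.98, 149.95, 239.97])
TAX_RATE = 0.21  # assumed rate, chosen to match the 1.21 multiplier

# Element-by-element with apply
with_tax_apply = totals.apply(lambda x: x * (1 + TAX_RATE))

# Vectorized: the multiplication is applied to every element at once
with_tax_vectorized = totals * (1 + TAX_RATE)

print(np.allclose(with_tax_apply, with_tax_vectorized))  # True
```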

In [None]:
# Apply to entire DataFrame (row-wise with axis=1)
def summarize_order(row):
    return f"{row['quantity']}x {row['product']} = ${row['total']:.2f}"

sales_data['summary'] = sales_data.apply(summarize_order, axis=1)
sales_data[['order_id', 'summary']].head()

### map() - Map values using a dictionary or function

In [None]:
# Map using a dictionary
region_codes = {
    'North': 'N',
    'South': 'S',
    'East': 'E',
    'West': 'W'
}

sales_data['region_code'] = sales_data['region'].map(region_codes)
sales_data[['region', 'region_code']].head()
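
One thing to watch with dictionary mapping: values that have no key in the dictionary become `NaN`. The sketch below uses a made-up Series with a region (`'Central'`) that is deliberately missing from the dictionary, and fills the gap with a placeholder.

```python
import pandas as pd

# Hypothetical regions; 'Central' has no entry in the mapping
regions = pd.Series(['North', 'South', 'Central'])
region_codes = {'North': 'N', 'South': 'S'}

codes = regions.map(region_codes)  # the unmapped value becomes NaN
codes_filled = codes.fillna('?')   # replace NaN with a placeholder

print(codes.isna().tolist())
print(codes_filled.tolist())
```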

In [None]:
# Map using a function
sales_data['product_upper'] = sales_data['product'].map(str.upper)
sales_data[['product', 'product_upper']].head()

## 6. Sorting

Sort your data by values or index.

In [None]:
# Sort by a single column (ascending by default)
sorted_by_total = sales_data.sort_values('total')
sorted_by_total[['order_id', 'product', 'total']].head()

In [None]:
# Sort descending
sorted_desc = sales_data.sort_values('total', ascending=False)
sorted_desc[['order_id', 'product', 'total']].head()

In [None]:
# Sort by multiple columns
# First by category, then by total (descending)
multi_sort = sales_data.sort_values(['category', 'total'], ascending=[True, False])
multi_sort[['category', 'product', 'total']]

In [None]:
# Sort by index
shuffled = sales_data.sample(frac=1)  # Shuffle rows
shuffled.index.tolist(), shuffled.sort_index().index.tolist()
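
Sorting moves rows but keeps their original index labels. If you want the result renumbered from 0, `sort_values` accepts `ignore_index=True` (available in pandas 1.0 and later). A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'product': ['Mouse', 'Laptop', 'Monitor'],
    'total': [149.95, 1999.98, 299.99],
})

# Default: rows are reordered, labels travel with them
ranked = df.sort_values('total', ascending=False)

# ignore_index=True: same order, fresh 0..n-1 index
ranked_clean = df.sort_values('total', ascending=False, ignore_index=True)

print(ranked.index.tolist())
print(ranked_clean.index.tolist())
```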

## 7. Aggregation Functions

Pandas provides many built-in aggregation functions to summarize your data.

In [None]:
# Basic aggregations on a column
(sales_data['total'].sum(),
 sales_data['total'].mean(),
 sales_data['total'].median(),
 sales_data['total'].min(),
 sales_data['total'].max(),
 sales_data['total'].count())

In [None]:
# Standard deviation and variance (sample statistics: pandas uses ddof=1 by default)
sales_data['total'].std(), sales_data['total'].var()
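
pandas computes the sample standard deviation (dividing by `n - 1`) by default, while NumPy's `np.std` defaults to the population version (dividing by `n`). The sketch below, on made-up numbers chosen so the population result is exactly 2, shows how the `ddof` argument switches between the two.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean 5, sum of squared deviations 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()            # ddof=1: sqrt(32 / 7)
population_std = values.std(ddof=0)  # ddof=0: sqrt(32 / 8) = 2.0
numpy_std = np.std(values.to_numpy())  # NumPy's default is ddof=0

print(sample_std, population_std, numpy_std)
```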

In [None]:
# Quantiles/Percentiles
sales_data['total'].quantile(0.25), sales_data['total'].quantile(0.75)
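
Several of these statistics can be obtained in one call with `describe()`, which returns the count, mean, standard deviation, minimum, quartiles, and maximum of a numeric Series. A minimal sketch on made-up totals:

```python
import pandas as pd

# Hypothetical totals for illustration
totals = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

summary = totals.describe()

# The quartile rows are labeled '25%', '50%', '75%'
print(summary['count'], summary['25%'], summary['50%'], summary['75%'])
```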

In [None]:
# Value counts - frequency of unique values
sales_data['product'].value_counts()

In [None]:
# Unique values
sales_data['product'].unique(), sales_data['product'].nunique()

## 8. GroupBy: Split-Apply-Combine

GroupBy is one of the most powerful features in Pandas. It follows a "split-apply-combine" pattern:
1. **Split** the data into groups
2. **Apply** a function to each group
3. **Combine** the results
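
To make the three steps concrete, the sketch below spells them out by hand on a small made-up frame and compares the result with the one-line `groupby` call. It is an illustration of the pattern, not how pandas implements it internally.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Electronics', 'Accessories'],
    'total': [1999.98, 149.95, 299.99, 239.97],
})

pieces = {}
for category, group in df.groupby('category'):  # 1. split into groups
    pieces[category] = group['total'].sum()      # 2. apply a function to each group
manual = pd.Series(pieces, name='total')         # 3. combine the results

# The usual one-liner does all three steps
builtin = df.groupby('category')['total'].sum()

print(manual)
print(builtin)
```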

In [None]:
# Basic groupby with single aggregation
# Total sales by category
sales_data.groupby('category')['total'].sum()

In [None]:
# Average sale by region
sales_data.groupby('region')['total'].mean()

In [None]:
# Count by category
sales_data.groupby('category')['order_id'].count()
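
`count()` counts non-missing values only, so it can differ from the number of rows in a group. When you want the number of rows regardless of missing data, `size()` is the usual choice. A small sketch, assuming a made-up frame with one missing `order_id`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing order_id in category 'A'
df = pd.DataFrame({
    'category': ['A', 'A', 'B'],
    'order_id': [1001, np.nan, 1003],
})

counts = df.groupby('category')['order_id'].count()  # non-missing values per group
sizes = df.groupby('category').size()                # rows per group

print(counts.tolist())
print(sizes.tolist())
```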

In [None]:
# Multiple aggregations
sales_data.groupby('category')['total'].agg(['sum', 'mean', 'count'])

In [None]:
# Group by multiple columns
sales_data.groupby(['category', 'region'])['total'].sum()

In [None]:
# Reset index to get a regular DataFrame
grouped = sales_data.groupby(['category', 'region'])['total'].sum().reset_index()
grouped
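
An alternative to calling `reset_index()` afterwards is to pass `as_index=False` to `groupby`, which keeps the group keys as ordinary columns from the start. The sketch below, on made-up data, shows that both routes give the same table.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Electronics', 'Accessories'],
    'region': ['North', 'South', 'North', 'North'],
    'total': [1999.98, 149.95, 299.99, 239.97],
})

via_reset = df.groupby(['category', 'region'])['total'].sum().reset_index()
via_flag = df.groupby(['category', 'region'], as_index=False)['total'].sum()

print(via_reset)
print(via_flag)
```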

## 9. Multiple Aggregations with agg()

The `agg()` method gives you maximum flexibility for aggregations.

In [None]:
# Different aggregations for different columns
result = sales_data.groupby('category').agg({
    'total': ['sum', 'mean'],
    'quantity': 'sum',
    'order_id': 'count'
})
result

In [None]:
# Flatten multi-level column names, e.g. ('total', 'sum') becomes 'total_sum'
result.columns = ['_'.join(col).strip() for col in result.columns.values]
result

In [None]:
# Named aggregations (cleaner column names)
result = sales_data.groupby('category').agg(
    total_revenue=('total', 'sum'),
    avg_order_value=('total', 'mean'),
    total_units=('quantity', 'sum'),
    num_orders=('order_id', 'count')
)
result

In [None]:
# Custom aggregation functions
def range_func(x):
    return x.max() - x.min()

result = sales_data.groupby('category').agg(
    total_revenue=('total', 'sum'),
    total_range=('total', range_func),  # range of order totals, not of unit prices
    avg_quantity=('quantity', 'mean')
)
result

## 10. Practical Examples

Let's apply what we've learned to answer some business questions.

In [None]:
# Let's reset our DataFrame without the extra columns we added
sales = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'date': ['2024-01-15', '2024-01-15', '2024-01-16', '2024-01-16', '2024-01-17',
             '2024-01-17', '2024-01-18', '2024-01-18', '2024-01-19', '2024-01-19'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Laptop',
                'Headphones', 'Mouse', 'Laptop', 'Monitor', 'Keyboard'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Electronics',
                 'Accessories', 'Accessories', 'Electronics', 'Electronics', 'Accessories'],
    'quantity': [2, 5, 3, 1, 1, 4, 10, 3, 2, 5],
    'unit_price': [999.99, 29.99, 79.99, 299.99, 1099.99, 149.99, 24.99, 899.99, 349.99, 69.99],
    'region': ['North', 'South', 'North', 'East', 'West', 'South', 'North', 'East', 'West', 'South']
})
sales['total'] = sales['quantity'] * sales['unit_price']
sales

In [None]:
# Q1: What are the top 3 products by total revenue?
top_products = sales.groupby('product')['total'].sum().sort_values(ascending=False).head(3)
top_products

In [None]:
# Q2: Which region has the highest average order value?
region_avg = sales.groupby('region')['total'].mean().sort_values(ascending=False)
region_avg, region_avg.idxmax(), region_avg.max()

In [None]:
# Q3: How many orders are above $500 in each category?
high_value_counts = sales[sales['total'] > 500].groupby('category')['order_id'].count()
high_value_counts

In [None]:
# Q4: What's the revenue breakdown by category and region?
pivot = sales.groupby(['category', 'region'])['total'].sum().unstack(fill_value=0)
pivot
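
The same cross-tabulation can also be written with `pivot_table`, which some readers find easier to scan than `groupby` followed by `unstack`. A minimal sketch on made-up data, comparing the two:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'category': ['Electronics', 'Accessories', 'Electronics', 'Accessories'],
    'region': ['North', 'South', 'East', 'North'],
    'total': [1999.98, 149.95, 299.99, 239.97],
})

via_unstack = df.groupby(['category', 'region'])['total'].sum().unstack(fill_value=0)

via_pivot = df.pivot_table(
    values='total',
    index='category',
    columns='region',
    aggfunc='sum',
    fill_value=0,  # combinations with no orders show 0 instead of NaN
)

print(via_unstack)
print(via_pivot)
```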

In [None]:
# Q5: Find orders where quantity is above average
avg_quantity = sales['quantity'].mean()
above_avg = sales[sales['quantity'] > avg_quantity][['order_id', 'product', 'quantity']]
avg_quantity, above_avg

## Summary

In this session, we covered:

1. **Selecting Columns**: Single (`df['col']`) and multiple (`df[['col1', 'col2']]`)
2. **Selecting Rows**: `iloc` (position-based) and `loc` (label-based)
3. **Boolean Indexing**: Filter with conditions using `&`, `|`, `~`
4. **query()**: Readable string-based filtering
5. **apply() and map()**: Apply functions to transform data
6. **Sorting**: `sort_values()` and `sort_index()`
7. **Aggregation**: `sum()`, `mean()`, `count()`, `min()`, `max()`, etc.
8. **groupby()**: Split-apply-combine for grouped analysis
9. **agg()**: Multiple and named aggregations

### Key Points to Remember

- `loc` uses labels, `iloc` uses positions
- Wrap each boolean condition in parentheses when combining conditions with `&`, `|`, or `~`
- `groupby` is the standard tool for summarizing data per category
- Named aggregations make results more readable
- Call `reset_index()` after a `groupby` when you want the group keys back as regular columns

### Next Session

Practice time! We'll apply these skills to solve business problems.