[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mihiarc/pyfia/blob/main/notebooks/03_filtering_grouping.ipynb)

---

In [None]:
# Google Colab Setup - Run this cell first!
import sys
if 'google.colab' in sys.modules:
    print("Running in Google Colab - installing pyFIA...")
    !pip install -q pyfia matplotlib rich
    
    # Download helpers.py for Colab
    import urllib.request
    helpers_url = "https://raw.githubusercontent.com/mihiarc/pyfia/main/notebooks/helpers.py"
    urllib.request.urlretrieve(helpers_url, "helpers.py")
    print("Setup complete!")
else:
    print("Running locally - no additional setup needed.")

# Domain Filtering and Grouping

This notebook covers advanced filtering techniques for custom forest analyses using pyFIA's domain expressions.

## What You'll Learn

1. Land type shortcuts (`forest`, `timber`, `all`)
2. Tree type shortcuts (`live`, `dead`, `gs`, `all`)
3. `tree_domain` expressions (species, diameter filtering)
4. `area_domain` expressions (ownership, forest type)
5. `plot_domain` expressions (county, geography)
6. Combining multiple filters
7. Real-world analysis patterns

**Prerequisites**: Complete Notebooks 1-2

**Estimated time**: 40 minutes

---

## Setup

In [None]:
# Core imports
from pyfia import (
    FIA, 
    area, 
    volume, 
    biomass, 
    tpa,
    join_species_names,
    join_forest_type_names,
)
import polars as pl
import matplotlib.pyplot as plt

# Notebook helpers
from helpers import ensure_ri_data, display_estimate, plot_by_category

# Ensure data is available
db_path = ensure_ri_data()
print("Ready to begin!")

---

## 1. Understanding Filtering in FIA

FIA data has multiple levels of organization:

```
PLOT (sample location)
  └── CONDITION (stand characteristics)
        └── TREE (individual trees)
```

pyFIA provides three domain parameters, one for each level:

| Parameter | Filters | Example Use Case |
|-----------|---------|------------------|
| `plot_domain` | PLOT table | County, geographic area |
| `area_domain` | COND table | Ownership, forest type, age |
| `tree_domain` | TREE table | Species, diameter, height |

Plus shortcuts:
- `land_type` - Quick filter for forest/timberland
- `tree_type` - Quick filter for live/dead trees
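
To make the three levels concrete, here is a minimal sketch in plain Python. The records are made-up values rather than real FIA data, and `select_trees` is a hypothetical helper, not part of pyFIA. It only illustrates which records are in scope: a tree is kept when its plot, its condition, and the tree itself all pass their filters. Real estimators also apply expansion factors, which this toy ignores.

```python
# Toy records mimicking the PLOT -> CONDITION -> TREE hierarchy (made-up values).
plots = [
    {"PLT_CN": 1, "COUNTYCD": 7},
    {"PLT_CN": 2, "COUNTYCD": 9},
]
conds = [
    {"PLT_CN": 1, "CONDID": 1, "OWNGRPCD": 40},
    {"PLT_CN": 2, "CONDID": 1, "OWNGRPCD": 30},
]
trees = [
    {"PLT_CN": 1, "CONDID": 1, "SPCD": 316, "DIA": 12.4},
    {"PLT_CN": 1, "CONDID": 1, "SPCD": 833, "DIA": 8.1},
    {"PLT_CN": 2, "CONDID": 1, "SPCD": 316, "DIA": 15.0},
]

def select_trees(plot_ok, cond_ok, tree_ok):
    """Hypothetical helper: keep trees whose plot, condition, and tree record all pass."""
    good_plots = {p["PLT_CN"] for p in plots if plot_ok(p)}
    good_conds = {
        (c["PLT_CN"], c["CONDID"])
        for c in conds
        if c["PLT_CN"] in good_plots and cond_ok(c)
    }
    return [
        t for t in trees
        if (t["PLT_CN"], t["CONDID"]) in good_conds and tree_ok(t)
    ]

selected = select_trees(
    plot_ok=lambda p: p["COUNTYCD"] == 7,    # analogous to plot_domain="COUNTYCD == 7"
    cond_ok=lambda c: c["OWNGRPCD"] == 40,   # analogous to area_domain="OWNGRPCD == 40"
    tree_ok=lambda t: t["DIA"] >= 11.0,      # analogous to tree_domain="DIA >= 11.0"
)
print(selected)
```

Only the 12.4-inch red maple on plot 1 survives: the 8.1-inch tree fails the tree filter, and the tree on plot 2 is dropped because its plot and condition fail theirs.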

---

## 2. Land Type Shortcuts

The `land_type` parameter filters by land classification:

| Value | Description | FIA Definition |
|-------|-------------|----------------|
| `"forest"` | All forest land | COND_STATUS_CD = 1 |
| `"timber"` | Timberland only | Forest capable of 20+ cuft/ac/yr, not reserved |
| `"all"` | All sampled land | No filter |
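
The definitions in the table can be read as a nested classification. The sketch below is an assumption about how they might be expressed, not pyFIA's actual code: it treats timberland as forest land that is unreserved (`RESERVCD == 0`) with a site class of 1 through 6. The `land_types` function and the sample records are hypothetical.

```python
def land_types(cond):
    """Return the land_type shortcuts a condition record would fall under (sketch)."""
    labels = ["all"]  # every sampled condition counts toward "all"
    if cond["COND_STATUS_CD"] == 1:
        labels.append("forest")
        # Assumed timberland test: unreserved forest able to grow 20+ cuft/ac/yr,
        # which corresponds to site classes 1-6.
        if cond["RESERVCD"] == 0 and 1 <= cond["SITECLCD"] <= 6:
            labels.append("timber")
    return labels

# Made-up condition records for illustration only.
productive_forest = {"COND_STATUS_CD": 1, "RESERVCD": 0, "SITECLCD": 4}
reserved_forest = {"COND_STATUS_CD": 1, "RESERVCD": 1, "SITECLCD": 3}
nonforest = {"COND_STATUS_CD": 2, "RESERVCD": 0, "SITECLCD": None}

for name, cond in [
    ("productive forest", productive_forest),
    ("reserved forest", reserved_forest),
    ("nonforest", nonforest),
]:
    print(f"{name}: {land_types(cond)}")
```

Because each category is a subset of the one before it, timberland area can never exceed forest area, and forest area can never exceed the area of all sampled land.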

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Compare land types
    all_land = area(db, land_type="all")
    forest = area(db, land_type="forest")
    timber = area(db, land_type="timber")

print("Rhode Island Land Area Comparison:")
print(f"  All sampled land: {all_land['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Forest land:      {forest['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Timberland:       {timber['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"\n  Forest %: {forest['AREA_TOTAL'][0]/all_land['AREA_TOTAL'][0]*100:.1f}%")
print(f"  Timber % of Forest: {timber['AREA_TOTAL'][0]/forest['AREA_TOTAL'][0]*100:.1f}%")

---

## 3. Tree Type Shortcuts

The `tree_type` parameter filters by tree status:

| Value | Description | FIA Definition |
|-------|-------------|----------------|
| `"live"` | Living trees (default) | STATUSCD = 1 |
| `"dead"` | Standing dead trees | STATUSCD = 2 |
| `"gs"` | Growing stock | Live trees meeting quality standards |
| `"all"` | All trees | No status filter |
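
One way to think about these shortcuts is as pre-written domain expressions. The mapping below is a sketch based on the table, not pyFIA's internal logic. In particular, the `gs` entry assumes growing stock means live trees with `TREECLCD == 2`, and `with_tree_type` is a hypothetical helper.

```python
# Assumed equivalents based on the table above; pyFIA's internal filters may differ.
TREE_TYPE_EXPR = {
    "live": "STATUSCD == 1",
    "dead": "STATUSCD == 2",
    "gs": "STATUSCD == 1 AND TREECLCD == 2",  # assumption: live + growing-stock class
    "all": None,                              # no status filter
}

def with_tree_type(tree_type, extra=None):
    """Combine a shortcut's assumed expression with an optional extra condition."""
    parts = [p for p in (TREE_TYPE_EXPR[tree_type], extra) if p]
    return " AND ".join(parts) if parts else None

print(with_tree_type("live", "DIA >= 11.0"))
print(with_tree_type("all"))
```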

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    live_vol = volume(db, tree_type="live")
    dead_vol = volume(db, tree_type="dead")
    gs_vol = volume(db, tree_type="gs")

print("Volume by Tree Status:")
print(f"  Live trees:     {live_vol['VOLCFNET_TOTAL'][0]/1e6:>10,.1f} million cuft")
print(f"  Dead trees:     {dead_vol['VOLCFNET_TOTAL'][0]/1e6:>10,.1f} million cuft")
print(f"  Growing stock:  {gs_vol['VOLCFNET_TOTAL'][0]/1e6:>10,.1f} million cuft")
print(f"\n  Dead/Live ratio: {dead_vol['VOLCFNET_TOTAL'][0]/live_vol['VOLCFNET_TOTAL'][0]*100:.1f}%")

---

## 4. Tree Domain Expressions

Use `tree_domain` to filter individual trees based on TREE table columns.

### Common Tree Columns

| Column | Description | Example Values |
|--------|-------------|----------------|
| `SPCD` | Species code | 316 (red maple), 833 (red oak) |
| `DIA` | Diameter at breast height | 5.0 to 60+ inches |
| `HT` | Total height | feet |
| `TREECLCD` | Tree class | 2=growing stock, 3=rough cull |
| `CR` | Crown ratio | 0-100% |

### Filtering by Diameter

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # All trees (5+ inches, FIA minimum)
    all_trees = tpa(db)
    
    # Sawtimber-sized trees (11+ inches, the hardwood threshold; softwoods qualify at 9+ inches)
    sawtimber = tpa(db, tree_domain="DIA >= 11.0")
    
    # Large trees (20+ inches)
    large = tpa(db, tree_domain="DIA >= 20.0")

print("Trees by Size:")
print(f"  All trees (5\"+):      {all_trees['TPA_ACRE'][0]:>8,.0f} per acre")
print(f"  Sawtimber (11\"+):     {sawtimber['TPA_ACRE'][0]:>8,.0f} per acre")
print(f"  Large trees (20\"+):   {large['TPA_ACRE'][0]:>8,.0f} per acre")

### Filtering by Species

You can filter to specific species using their SPCD codes.
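
When the species list comes from data rather than being typed by hand, it can help to build the `IN (...)` expression programmatically. The `spcd_in` helper below is hypothetical, a small convenience sketch and not a pyFIA function.

```python
def spcd_in(codes):
    """Build a `SPCD IN (...)` domain string from an iterable of species codes."""
    unique = sorted({int(c) for c in codes})  # dedupe and coerce to integers
    if not unique:
        raise ValueError("at least one species code is required")
    return f"SPCD IN ({', '.join(str(c) for c in unique)})"

maples_and_red_oak = spcd_in([316, 318, 833])
print(maples_and_red_oak)  # SPCD IN (316, 318, 833)
```

The resulting string can be passed as `tree_domain` in the same way as a hand-written expression.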

In [None]:
# First, find the top species codes
with FIA(db_path) as db:
    db.clip_most_recent()
    species_vol = volume(db, by_species=True)
    
species_named = join_species_names(species_vol, "SPCD")
top_5 = species_named.sort("VOLCFNET_TOTAL", descending=True).head(5)
print("Top 5 species by volume:")
for row in top_5.iter_rows(named=True):
    print(f"  SPCD {row['SPCD']}: {row['SPCD_NAME']}")

In [None]:
# Now filter to specific species
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Red maple only (SPCD = 316)
    red_maple = volume(db, tree_domain="SPCD == 316")
    
    # All oaks (SPCD in 800s range)
    oaks = volume(db, tree_domain="SPCD >= 800 AND SPCD < 900")
    
    # Multiple specific species
    select_species = volume(db, tree_domain="SPCD IN (316, 833, 318)")

print("Volume for Selected Species:")
print(f"  Red maple:       {red_maple['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")
print(f"  All oaks:        {oaks['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")
print(f"  Select species:  {select_species['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")

### Combining Tree Filters

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Large oaks (diameter 16+ inches, oak species)
    large_oaks = volume(db, tree_domain="DIA >= 16.0 AND SPCD >= 800 AND SPCD < 900")
    
    # Small diameter maples (pole-sized)
    pole_maples = tpa(db, tree_domain="DIA BETWEEN 5.0 AND 10.9 AND SPCD IN (316, 318)")

print(f"Large oaks (16\"+): {large_oaks['VOLCFNET_TOTAL'][0]/1e6:,.2f} million cuft")
print(f"Pole-sized maples: {pole_maples['TPA_ACRE'][0]:,.0f} trees per acre")

---

## 5. Area Domain Expressions

Use `area_domain` to filter based on stand conditions (COND table).

### Common Condition Columns

| Column | Description | Example Values |
|--------|-------------|----------------|
| `FORTYPCD` | Forest type code | 503 (oak/hickory), 103 (eastern white pine) |
| `OWNGRPCD` | Ownership group | 10, 20, 30, 40 |
| `STDAGE` | Stand age | years |
| `STDSZCD` | Stand size class | 1, 2, 3, 5 |
| `SITECLCD` | Site productivity class | 1-7 |
| `SLOPE` | Ground slope | percent |

### Filtering by Ownership

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Private lands only
    private = area(db, area_domain="OWNGRPCD == 40")
    
    # Public lands (Federal, State, Local)
    public = area(db, area_domain="OWNGRPCD IN (10, 20, 30)")

print("Forest Area by Ownership:")
print(f"  Private:  {private['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Public:   {public['AREA_TOTAL'][0]:>12,.0f} acres")

### Filtering by Stand Age

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Young forests (< 40 years)
    young = area(db, area_domain="STDAGE < 40 AND STDAGE > 0")
    
    # Mature forests (40-79 years)
    mature = area(db, area_domain="STDAGE >= 40 AND STDAGE < 80")
    
    # Old forests (80+ years)
    old = area(db, area_domain="STDAGE >= 80")

print("Forest Area by Age Class:")
print(f"  Young (<40 yrs):   {young['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Mature (40-79):    {mature['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Old (80+):         {old['AREA_TOTAL'][0]:>12,.0f} acres")

### Filtering by Forest Type

In [None]:
# First identify major forest type codes
with FIA(db_path) as db:
    db.clip_most_recent()
    ft_area = area(db, grp_by="FORTYPCD")

ft_named = join_forest_type_names(ft_area, "FORTYPCD")
print("Top forest types:")
for row in ft_named.sort("AREA_TOTAL", descending=True).head(5).iter_rows(named=True):
    print(f"  FORTYPCD {row['FORTYPCD']}: {row['FORTYPCD_NAME']}")

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Oak types (500-519 range includes oak-hickory types)
    oak_types = area(db, area_domain="FORTYPCD >= 500 AND FORTYPCD < 520")
    
    # Pine types (100-199 range)
    pine_types = area(db, area_domain="FORTYPCD >= 100 AND FORTYPCD < 200")
    
    # Hardwood total (500-998; code 999 is the nonstocked class, so it is excluded)
    hardwood = area(db, area_domain="FORTYPCD >= 500 AND FORTYPCD < 999")

print("Forest Area by Type Group:")
print(f"  Oak types:     {oak_types['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  Pine types:    {pine_types['AREA_TOTAL'][0]:>12,.0f} acres")
print(f"  All hardwoods: {hardwood['AREA_TOTAL'][0]:>12,.0f} acres")

### Filtering by Site Productivity

Site class (`SITECLCD`) indicates timber productivity potential:

| Class | Productivity |
|-------|-------------|
| 1 | Very high (225+ cuft/ac/yr) |
| 2-3 | High |
| 4-5 | Medium |
| 6-7 | Low |
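
As a worked illustration of the table, the sketch below maps a productivity value to a site class code and then to the group label used above. The class 1 break (225) and the timberland minimum (20) appear in this notebook. The intermediate breaks are my recollection of the FIA class definitions, so treat them as assumptions and check them against the FIA database documentation. Both function names are hypothetical.

```python
# Lower bound of each site class in cuft/ac/yr.
# Breaks for classes 2-5 are assumed from the FIA class definitions; verify before relying on them.
SITE_CLASS_MIN_CUFT = {1: 225, 2: 165, 3: 120, 4: 85, 5: 50, 6: 20, 7: 0}

def site_class(cuft_per_acre_year):
    """Return the SITECLCD whose range contains the given productivity."""
    if cuft_per_acre_year < 0:
        raise ValueError("productivity must be non-negative")
    for code in sorted(SITE_CLASS_MIN_CUFT):  # check class 1 first, then 2, and so on
        if cuft_per_acre_year >= SITE_CLASS_MIN_CUFT[code]:
            return code

def productivity_group(code):
    """Group label matching the table above."""
    if code == 1:
        return "Very high"
    if code in (2, 3):
        return "High"
    if code in (4, 5):
        return "Medium"
    return "Low"

for growth in (230, 130, 60, 10):
    code = site_class(growth)
    print(f"{growth:>4} cuft/ac/yr -> class {code} ({productivity_group(code)})")
```

Note that lower codes mean higher productivity, which is why a "high productivity" filter uses `SITECLCD <= 3` rather than `>=`.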

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # High productivity sites (class 1-3)
    high_prod = volume(db, area_domain="SITECLCD <= 3 AND SITECLCD > 0", land_type="timber")
    
    # Lower-productivity sites (classes 5-7, spanning the medium and low groups in the table above)
    low_prod = volume(db, area_domain="SITECLCD >= 5", land_type="timber")

print("Volume by Site Productivity:")
print(f"  High productivity: {high_prod['VOLCFNET_ACRE'][0]:,.0f} cuft/acre")
print(f"  Low productivity:  {low_prod['VOLCFNET_ACRE'][0]:,.0f} cuft/acre")

---

## 6. Plot Domain Expressions

Use `plot_domain` to filter by plot-level attributes (geography, inventory year).

### Common Plot Columns

| Column | Description |
|--------|-------------|
| `COUNTYCD` | County FIPS code |
| `LAT` | Latitude (degrees) |
| `LON` | Longitude (degrees) |
| `ELEV` | Elevation (feet) |
| `INVYR` | Inventory year |
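
`COUNTYCD` holds only the county portion of the FIPS code. If you need to join results to other county-level data, the full five-digit FIPS string is the two-digit state code followed by the three-digit county code. The helper below is a hypothetical sketch. It uses Rhode Island's state FIPS code (44) and the county codes listed later in this section.

```python
def county_fips(statecd, countycd):
    """Combine state and county codes into a 5-digit FIPS string (2 + 3 digits)."""
    return f"{statecd:02d}{countycd:03d}"

RI_STATECD = 44  # Rhode Island's state FIPS code
ri_counties = {1: "Bristol", 3: "Kent", 5: "Newport", 7: "Providence", 9: "Washington"}

for code, name in ri_counties.items():
    print(county_fips(RI_STATECD, code), name)
```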

In [None]:
# See what counties are in Rhode Island
with FIA(db_path) as db:
    db.clip_most_recent()
    county_area = area(db, grp_by="COUNTYCD")
    
print("Rhode Island Counties (FIPS codes):")
for row in county_area.sort("COUNTYCD").iter_rows(named=True):
    print(f"  County {row['COUNTYCD']}: {row['AREA_TOTAL']:,.0f} acres of forest")

In [None]:
# Rhode Island county codes:
# 1 = Bristol, 3 = Kent, 5 = Newport, 7 = Providence, 9 = Washington

with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Providence County only
    providence = area(db, plot_domain="COUNTYCD == 7")
    
    # Multiple counties
    coastal = area(db, plot_domain="COUNTYCD IN (1, 5, 9)")

print(f"Providence County forest: {providence['AREA_TOTAL'][0]:,.0f} acres")
print(f"Coastal counties forest:  {coastal['AREA_TOTAL'][0]:,.0f} acres")

---

## 7. Combining Multiple Filters

The real power of domain expressions comes from combining them. This enables complex, targeted analyses.
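
When several conditions are assembled from separate pieces, joining them by hand gets error-prone, especially once a piece contains `OR`. The `all_of` helper below is a hypothetical sketch that wraps each piece in parentheses before joining with `AND`. It assumes the expression parser accepts parentheses, which you should confirm against your pyFIA version.

```python
def all_of(*conditions):
    """Join non-empty condition fragments with AND, parenthesizing each one."""
    parts = [f"({c.strip()})" for c in conditions if c and c.strip()]
    return " AND ".join(parts)

tree_filter = all_of("DIA >= 16.0", "SPCD >= 800 AND SPCD < 900")
area_filter = all_of("OWNGRPCD == 40", "STDAGE >= 60")

# Keyword arguments in the shape the estimator calls in this notebook accept.
filters = {
    "land_type": "timber",
    "tree_domain": tree_filter,
    "area_domain": area_filter,
}
for key, value in filters.items():
    print(f"{key}: {value}")
```

The dictionary could then be unpacked into a call such as `volume(db, **filters)`.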

### Example: Sawtimber Volume on Private Timberland

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    result = volume(
        db,
        land_type="timber",              # Timberland only
        tree_domain="DIA >= 11.0",       # Sawtimber-sized trees
        area_domain="OWNGRPCD == 40",    # Private ownership
    )

print(f"Sawtimber volume on private timberland: {result['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")
print(f"Per acre: {result['VOLCFNET_ACRE'][0]:,.0f} cuft/acre")

### Example: Oak Volume in Mature Stands by Ownership

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    result = volume(
        db,
        grp_by="OWNGRPCD",
        tree_domain="SPCD >= 800 AND SPCD < 900",  # Oak species
        area_domain="STDAGE >= 60",                # Mature stands
        land_type="timber",
    )

# Add names
ownership_names = {10: "National Forest", 20: "Other Federal", 30: "State/Local", 40: "Private"}
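result_named = result.with_columns(
    # replace_strict (Polars 1.0+) permits the integer-to-string dtype change
    pl.col("OWNGRPCD").replace_strict(ownership_names, default="Unknown", return_dtype=pl.String).alias("Ownership")
)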

display_estimate(
    result_named.select(["Ownership", "VOLCFNET_ACRE", "VOLCFNET_TOTAL", "N_TREES"]),
    title="Oak Volume in Mature Stands by Ownership"
)

### Example: High-Value Timber Analysis

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # High-value timber: large diameter, growing stock trees
    # on high-productivity sites, accessible slopes
    high_value = volume(
        db,
        by_species=True,
        land_type="timber",
        tree_type="gs",                           # Growing stock quality
        tree_domain="DIA >= 16.0",                # Large diameter
        area_domain="SITECLCD <= 3 AND SLOPE < 35",  # High productivity, accessible
    )

high_value_named = join_species_names(high_value, "SPCD")
top_species = high_value_named.sort("VOLCFNET_TOTAL", descending=True).head(10)

display_estimate(
    top_species.select(["SPCD_NAME", "VOLCFNET_TOTAL", "VOLCFNET_ACRE", "N_TREES"]),
    title="High-Value Timber by Species"
)

---

## 8. Common Analysis Patterns

Here are some frequently-used filtering patterns for forest analysis.

### Pattern 1: Softwood vs. Hardwood

In [None]:
# Species codes: Softwoods are typically < 300, Hardwoods >= 300
# (This is a simplified rule - specific ranges vary)

with FIA(db_path) as db:
    db.clip_most_recent()
    
    softwood = volume(db, tree_domain="SPCD < 300")
    hardwood = volume(db, tree_domain="SPCD >= 300")

total = softwood['VOLCFNET_TOTAL'][0] + hardwood['VOLCFNET_TOTAL'][0]
print(f"Softwood volume: {softwood['VOLCFNET_TOTAL'][0]/1e6:,.1f} M cuft ({softwood['VOLCFNET_TOTAL'][0]/total*100:.0f}%)")
print(f"Hardwood volume: {hardwood['VOLCFNET_TOTAL'][0]/1e6:,.1f} M cuft ({hardwood['VOLCFNET_TOTAL'][0]/total*100:.0f}%)")

### Pattern 2: Age Class Distribution
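
A list of age-class expressions can be written by hand or generated from a set of break points, which guarantees the classes neither overlap nor leave gaps. The `age_class_domains` function below is a hypothetical helper, shown as a sketch. It excludes ages of zero or less, on the assumption that those mark stands without a usable age.

```python
def age_class_domains(breaks, column="STDAGE"):
    """Turn ascending upper bounds into (label, expression) pairs with no gaps or overlaps."""
    classes = []
    lower = 0  # exclusive lower bound; ages <= 0 are left out
    for upper in breaks:
        label = f"{lower + 1}-{upper}"
        expr = f"{column} > {lower} AND {column} <= {upper}"
        classes.append((label, expr))
        lower = upper
    # Open-ended final class for everything above the last break
    classes.append((f"{lower + 1}+", f"{column} > {lower}"))
    return classes

generated = age_class_domains([20, 40, 60, 80, 100])
for label, expr in generated:
    print(f"{label:>7}: {expr}")
```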

In [None]:
age_classes = [
    ("1-20", "STDAGE > 0 AND STDAGE <= 20"),
    ("21-40", "STDAGE > 20 AND STDAGE <= 40"),
    ("41-60", "STDAGE > 40 AND STDAGE <= 60"),
    ("61-80", "STDAGE > 60 AND STDAGE <= 80"),
    ("81-100", "STDAGE > 80 AND STDAGE <= 100"),
    ("101+", "STDAGE > 100"),
]

results = []
with FIA(db_path) as db:
    db.clip_most_recent()
    
    for name, domain in age_classes:
        result = area(db, area_domain=domain)
        results.append({"Age Class": name, "Area": result['AREA_TOTAL'][0]})

# Display
print("Forest Area by Age Class:")
for r in results:
    print(f"  {r['Age Class']:>8}: {r['Area']:>12,.0f} acres")

In [None]:
# Visualize age distribution
fig, ax = plt.subplots(figsize=(10, 5))
labels = [r["Age Class"] for r in results]
values = [r["Area"] for r in results]

ax.bar(labels, values, color='#2E7D32', edgecolor='white')
ax.set_xlabel('Stand Age Class (years)')
ax.set_ylabel('Area (acres)')
ax.set_title('Rhode Island Forest Age Distribution')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()

### Pattern 3: Carbon Stock by Ownership and Stand Age

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Public vs Private carbon
    private_carbon = biomass(db, area_domain="OWNGRPCD == 40")
    public_carbon = biomass(db, area_domain="OWNGRPCD IN (10, 20, 30)")
    
    # By stand maturity
    young_carbon = biomass(db, area_domain="STDAGE < 40 AND STDAGE > 0")
    mature_carbon = biomass(db, area_domain="STDAGE >= 60")

print("Carbon by Ownership:")
print(f"  Private: {private_carbon['CARBON_TOTAL'][0]/1e6:,.2f} million tons ({private_carbon['CARBON_ACRE'][0]:.1f} tons/acre)")
print(f"  Public:  {public_carbon['CARBON_TOTAL'][0]/1e6:,.2f} million tons ({public_carbon['CARBON_ACRE'][0]:.1f} tons/acre)")

print("\nCarbon Density by Age:")
print(f"  Young (<40 yrs): {young_carbon['CARBON_ACRE'][0]:.1f} tons/acre")
print(f"  Mature (60+):    {mature_carbon['CARBON_ACRE'][0]:.1f} tons/acre")

---

## Exercise 1: Custom Species Analysis

**Task**: Analyze red maple (SPCD=316) vs. red oak (SPCD=833) across the state.

1. Compare total volume for each species
2. Compare volume per acre
3. Find average tree size (volume/tree) for each

**Hint**: Use `tree_domain="SPCD == 316"` for red maple

In [None]:
# Your code here


<details>
<summary><b>Click to reveal solution</b></summary>

```python
with FIA(db_path) as db:
    db.clip_most_recent()
    
    red_maple = volume(db, tree_domain="SPCD == 316")
    red_oak = volume(db, tree_domain="SPCD == 833")

print("Red Maple vs Red Oak Comparison:")
print(f"\n{'Metric':<25} {'Red Maple':>15} {'Red Oak':>15}")
print("-" * 55)
print(f"{'Total Volume (M cuft)':<25} {red_maple['VOLCFNET_TOTAL'][0]/1e6:>15,.1f} {red_oak['VOLCFNET_TOTAL'][0]/1e6:>15,.1f}")
print(f"{'Volume/Acre (cuft)':<25} {red_maple['VOLCFNET_ACRE'][0]:>15,.0f} {red_oak['VOLCFNET_ACRE'][0]:>15,.0f}")
print(f"{'Number of Trees':<25} {red_maple['N_TREES'][0]:>15,} {red_oak['N_TREES'][0]:>15,}")

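# Calculate average tree size
# Caveat: this assumes N_TREES is comparable to a population tree count. If it is
# instead the number of sampled tree records, the ratio is not a true per-tree volume.
maple_avg = red_maple['VOLCFNET_TOTAL'][0] / red_maple['N_TREES'][0] if red_maple['N_TREES'][0] > 0 else 0
oak_avg = red_oak['VOLCFNET_TOTAL'][0] / red_oak['N_TREES'][0] if red_oak['N_TREES'][0] > 0 else 0
print(f"{'Avg Volume/Tree (cuft)':<25} {maple_avg:>15,.1f} {oak_avg:>15,.1f}")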
```

</details>

---

## Exercise 2: Forest Management Opportunity Analysis

**Task**: Identify potential timber harvest opportunity by finding:
- Private timberland
- Mature stands (age 60+)
- High site productivity (SITECLCD 1-3)
- Accessible slopes (<35%)

Calculate the total volume and area meeting all these criteria.

**Hint**: Set `land_type`, then join the remaining conditions with `AND` inside a single `area_domain` expression

In [None]:
# Your code here


<details>
<summary><b>Click to reveal solution</b></summary>

```python
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # Opportunity analysis with combined filters
    opportunity = volume(
        db,
        land_type="timber",
        area_domain="OWNGRPCD == 40 AND STDAGE >= 60 AND SITECLCD <= 3 AND SITECLCD > 0 AND SLOPE < 35",
    )
    
    # Compare to all private timberland
    all_private = volume(db, land_type="timber", area_domain="OWNGRPCD == 40")

print("Timber Harvest Opportunity Analysis:")
print(f"\nOpportunity Area (meeting all criteria):")
print(f"  Area: {opportunity['AREA_TOTAL'][0]:,.0f} acres")
print(f"  Volume: {opportunity['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")
print(f"  Volume/Acre: {opportunity['VOLCFNET_ACRE'][0]:,.0f} cuft/acre")

print(f"\nAll Private Timberland:")
print(f"  Area: {all_private['AREA_TOTAL'][0]:,.0f} acres")
print(f"  Volume: {all_private['VOLCFNET_TOTAL'][0]/1e6:,.1f} million cuft")

pct = opportunity['AREA_TOTAL'][0] / all_private['AREA_TOTAL'][0] * 100
print(f"\nOpportunity represents {pct:.1f}% of private timberland")
```

</details>

---

## Summary

In this notebook, you learned:

1. **Land type shortcuts** - `forest`, `timber`, `all`
2. **Tree type shortcuts** - `live`, `dead`, `gs`, `all`
3. **`tree_domain`** - Filter by species, diameter, height, etc.
4. **`area_domain`** - Filter by ownership, forest type, age, site class
5. **`plot_domain`** - Filter by county, geography
6. **Combining filters** - Build complex queries with multiple domains
7. **Analysis patterns** - Common filtering approaches for forest analysis

### Domain Expression Syntax

| Operator | Example |
|----------|--------|
| `==`, `!=` | `SPCD == 316` |
| `<`, `>`, `<=`, `>=` | `DIA >= 11.0` |
| `AND`, `OR` | `STDAGE >= 60 AND SLOPE < 35` |
| `IN` | `OWNGRPCD IN (10, 20, 30)` |
| `BETWEEN` | `DIA BETWEEN 5.0 AND 10.9` |
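
To experiment with these operators without loading FIA data, you can evaluate the same expressions against a toy table. The sketch below uses Python's built-in `sqlite3` module, whose `WHERE` syntax happens to accept every operator in the table. This is only a stand-in for checking operator semantics: pyFIA's own expression handling may differ, the tree records are made up, and `count_matching` is a hypothetical helper.

```python
import sqlite3

# In-memory toy TREE table with made-up records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tree (SPCD INTEGER, DIA REAL)")
conn.executemany(
    "INSERT INTO tree VALUES (?, ?)",
    [(316, 6.2), (316, 12.4), (833, 17.8), (129, 9.5), (318, 10.9)],
)

def count_matching(domain):
    """Count toy trees matching a domain string used as a SQL WHERE clause."""
    # The string is interpolated as-is, so only use trusted, hand-written expressions.
    return conn.execute(f"SELECT COUNT(*) FROM tree WHERE {domain}").fetchone()[0]

examples = [
    "SPCD == 316",
    "DIA >= 11.0",
    "DIA >= 11.0 AND SPCD != 316",
    "SPCD IN (316, 318)",
    "DIA BETWEEN 5.0 AND 10.9",
]
counts = {domain: count_matching(domain) for domain in examples}
for domain, n in counts.items():
    print(f"{domain:<30} -> {n} trees")
```

Note that `BETWEEN` is inclusive at both ends, so the 10.9-inch tree is counted in the last example.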

## Next Steps

Continue to **Notebook 4: Change Analysis** to learn:
- Mortality estimation
- Growth estimation
- Removals (harvest) analysis
- Net change calculations