[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mihiarc/pyfia/blob/main/notebooks/01_getting_started.ipynb)

---

In [None]:
# Google Colab Setup - Run this cell first!
import sys
if 'google.colab' in sys.modules:
    print("Running in Google Colab - installing pyFIA from GitHub...")
    !pip install -q git+https://github.com/mihiarc/pyfia.git polars duckdb matplotlib rich
    
    # Download helpers.py for Colab
    import urllib.request
    helpers_url = "https://raw.githubusercontent.com/mihiarc/pyfia/main/notebooks/helpers.py"
    urllib.request.urlretrieve(helpers_url, "helpers.py")
    print("Setup complete! You may now run the remaining cells.")
else:
    print("Running locally - no additional setup needed.")

# Getting Started with pyFIA

Welcome to pyFIA! This notebook will introduce you to the basics of working with USDA Forest Inventory and Analysis (FIA) data using Python.

## What You'll Learn

1. What FIA data is and why it matters
2. How to install and set up pyFIA
3. Downloading data from FIA DataMart
4. Connecting to FIA databases
5. Understanding EVALIDs (evaluation identifiers)
6. Running your first forest area estimate
7. Interpreting results and standard errors

**Estimated time:** 30 minutes

---

## 1. Introduction to FIA Data

### What is FIA?

The **Forest Inventory and Analysis (FIA)** program is the nation's forest census, conducted by the USDA Forest Service. It provides:

- **Comprehensive coverage**: Every U.S. state and territory
- **Consistent methodology**: Inventories date back to the 1930s, with a nationally standardized annual design adopted around 2000
- **Annual updates**: Rotating panel design means a fraction of plots (roughly 10-20%, depending on the region) is measured each year
- **Rich detail**: Tree species, size, health, ownership, and more

### Why Use FIA Data?

FIA is the authoritative source for questions like:

- How much forest exists in a state?
- What tree species are most common?
- How much timber volume is available?
- How fast are forests growing or declining?
- How much carbon is stored in forest biomass?

### The Challenge

FIA data is powerful but complex:

- Multiple interconnected tables (PLOT, TREE, COND, etc.)
- Complex sampling design requires proper statistical estimation
- Evaluation cycles and panel structures can be confusing

**pyFIA simplifies all of this** with a clean Python API that handles the complexity for you.

---

## 2. Setup and Imports

Let's start by importing the libraries we'll need.

In [None]:
# Core pyFIA imports
from pyfia import FIA, download, area

# For visualization
import matplotlib.pyplot as plt

# Notebook helpers
from helpers import ensure_ri_data, display_estimate, plot_by_category

print("Imports successful!")

---

## 3. Downloading FIA Data

pyFIA can download data directly from the [FIA DataMart](https://apps.fs.usda.gov/fia/datamart/datamart.html), the official data distribution system.

### Our Tutorial Dataset: Rhode Island

We'll use Rhode Island because:
- It's the smallest U.S. state by area (~1,000 sq miles of land)
- Downloads quickly (~1-2 minutes)
- Contains all the FIA data structures you'll find in larger states

Let's download it now.

In [None]:
# Download Rhode Island data (or use cached version)
db_path = ensure_ri_data()
print(f"\nDatabase location: {db_path}")

### How Download Works

The `download()` function:

1. Fetches data from FIA DataMart (CSV files)
2. Converts them to an efficient DuckDB database
3. Caches locally so subsequent runs are instant

```python
# Download a single state
db_path = download("RI")

# Download multiple states (merged into one database)
db_path = download(["RI", "CT", "MA"])

# Download to a specific directory
db_path = download("RI", dir="./my_data")
```

---

## 4. Connecting to the Database

The `FIA` class is your gateway to the data. It manages database connections and provides methods for filtering and estimation.

In [None]:
# Create a database connection
db = FIA(db_path)
print(f"Connected to: {db_path}")

# Check what tables are in the database
import duckdb
con = duckdb.connect(str(db_path), read_only=True)
tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
con.close()
print(f"Available tables: {tables[:10]}...")  # Show first 10

### Key FIA Tables

| Table | Description |
|-------|-------------|
| **PLOT** | Plot locations and metadata |
| **TREE** | Individual tree measurements (species, diameter, height) |
| **COND** | Stand conditions (forest type, age, ownership) |
| **POP_\*** | Population tables for statistical estimation |

You rarely need to work with these tables directly - pyFIA's estimation functions handle the complexity.

### Using Context Managers (Recommended)

The preferred way to work with FIA databases is using Python's `with` statement. This ensures connections are properly closed.

In [None]:
# Recommended pattern: context manager
with FIA(db_path) as db:
    print("Inside context: database is open")
    # Do your analysis here
    
print("Outside context: database is closed")
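
To see why the `with` pattern guarantees cleanup, here is a minimal sketch of Python's context-manager protocol. The `ToyConnection` class is hypothetical and is not pyFIA's implementation; it only illustrates how `__enter__` and `__exit__` close a resource even when an error occurs inside the block.

```python
class ToyConnection:
    """Hypothetical stand-in for a database handle (not part of pyFIA)."""

    def __init__(self, path):
        self.path = path
        self.is_open = False

    def __enter__(self):
        # Called when the `with` block starts
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block ends, normally or via an exception
        self.is_open = False
        return False  # do not suppress exceptions


conn = ToyConnection("example.duckdb")
try:
    with conn:
        assert conn.is_open
        raise RuntimeError("simulated failure during analysis")
except RuntimeError:
    pass

print(f"Open after the block? {conn.is_open}")
```

Even though the block raised an error, the connection ends up closed. This is the main reason to prefer `with FIA(db_path) as db:` over a bare `db = FIA(db_path)`.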

---

## 5. Understanding EVALIDs

An **EVALID** (evaluation identifier) identifies a specific FIA evaluation. Each evaluation has a type that determines which estimates it supports.

**Evaluation Types** (use these with the `eval_type` parameter):

| Short Code | Full Name | Purpose |
|------------|-----------|--------|
| `"VOL"` | EXPVOL | Volume and biomass estimates |
| `"GRM"` | EXPGRM | Growth, Removals, Mortality |
| `"ALL"` | EXPALL | Full land area estimates |

In [None]:
# Find available EVALIDs in our database
with FIA(db_path) as db:
    evalids = db.find_evalid()
    print("Available EVALIDs in Rhode Island:")
    for evalid in evalids:
        print(f"  {evalid}")

### Why EVALIDs Matter

**Important**: If you don't filter by EVALID, you might accidentally include overlapping evaluations and double-count data!

pyFIA handles this automatically with `clip_most_recent()`, which selects the most recent evaluation of the appropriate type.
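
The double-counting risk can be illustrated with a toy example in plain Python. The EVALID values, plot IDs, and acreages below are made up, and picking the largest EVALID as "most recent" is a simplification for illustration, not a description of how `clip_most_recent()` works internally.

```python
# Hypothetical records: (EVALID, plot ID, acres represented by the plot).
# The two evaluations share plots 1 and 2, so they overlap.
records = [
    (441901, 1, 6000.0),
    (441901, 2, 6000.0),
    (442001, 1, 6000.0),
    (442001, 2, 6000.0),
    (442001, 3, 6000.0),
]

# Naive total: sums every row, so plots 1 and 2 are counted twice
naive_total = sum(acres for _, _, acres in records)

# Filtered total: keep only rows from a single evaluation
latest = max(evalid for evalid, _, _ in records)
filtered_total = sum(acres for evalid, _, acres in records if evalid == latest)

print(f"Without EVALID filter: {naive_total:,.0f} acres")
print(f"With EVALID {latest}:   {filtered_total:,.0f} acres")
```

The unfiltered sum is inflated because the same plots contribute once per evaluation they belong to.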

In [None]:
# Let pyFIA select the most recent evaluation automatically
with FIA(db_path) as db:
    db.clip_most_recent(eval_type="VOL")  # For volume estimates
    print(f"Selected EVALID: {db.evalid}")

---

## 6. Your First Estimate: Forest Area

Let's estimate the total forest area in Rhode Island. This is the most fundamental FIA question.

### The Simple Version

In [None]:
# Estimate total forest area in Rhode Island
with FIA(db_path) as db:
    db.clip_most_recent()
    result = area(db)
    
display_estimate(result, title="Forest Area in Rhode Island")

### Understanding the Results

Let's break down what each column means:

| Column | Description |
|--------|-------------|
| **YEAR** | Inventory year |
| **AREA** | Total forest area in acres |
| **AREA_SE** | Standard error of the estimate (uncertainty) |
| **AREA_SE_PERCENT** | Relative standard error as percentage |
| **N_PLOTS** | Number of plots used in the estimate |

In [None]:
# Extract key values
total_acres = result["AREA"][0]
se_acres = result["AREA_SE"][0]
se_percent = result["AREA_SE_PERCENT"][0]
n_plots = result["N_PLOTS"][0]

print("Rhode Island Forest Area:")
print(f"  Total: {total_acres:,.0f} acres")
print(f"  Standard Error: ±{se_acres:,.0f} acres ({se_percent:.1f}%)")
print(f"  Based on: {n_plots} sample plots")
print(f"\n  95% Confidence Interval: {total_acres - 1.96*se_acres:,.0f} to {total_acres + 1.96*se_acres:,.0f} acres")

### What is Standard Error?

The **standard error (SE)** quantifies uncertainty in the estimate. Because FIA uses sample plots (not a complete census), there's always some uncertainty.

**Rule of thumb**:
- SE% < 5%: Very precise estimate
- SE% 5-15%: Acceptable precision
- SE% > 15%: Use with caution; consider a larger area or coarser grouping

The 95% confidence interval is approximately: **estimate ± 1.96 × SE**
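
The rule of thumb and the confidence interval formula above can be sketched as a small helper. `summarize_precision` is a hypothetical function written for this illustration (it is not part of pyFIA), and the input numbers are made up rather than actual Rhode Island results.

```python
def summarize_precision(estimate, se):
    """Return SE%, an approximate 95% confidence interval, and a precision label."""
    se_percent = 100.0 * se / estimate
    half_width = 1.96 * se  # normal approximation
    if se_percent < 5:
        label = "very precise"
    elif se_percent <= 15:
        label = "acceptable"
    else:
        label = "use with caution"
    return se_percent, (estimate - half_width, estimate + half_width), label


# Illustrative numbers only
se_pct, (low, high), label = summarize_precision(estimate=100_000, se=4_000)
print(f"SE% = {se_pct:.1f}% ({label})")
print(f"95% CI: {low:,.0f} to {high:,.0f}")
```

With an estimate of 100,000 and an SE of 4,000, the SE% is 4.0% and the interval runs from 92,160 to 107,840.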

---

## 7. Grouping Results

One of pyFIA's most powerful features is the `grp_by` parameter, which lets you break down estimates by category.

In [None]:
# Forest area by ownership group
with FIA(db_path) as db:
    db.clip_most_recent()
    result = area(db, grp_by="OWNGRPCD")
    
display_estimate(result, title="Forest Area by Ownership")

### Ownership Codes

| OWNGRPCD | Description |
|----------|-------------|
| 10 | National Forest |
| 20 | Other Federal |
| 30 | State and Local Government |
| 40 | Private |

In [None]:
# Visualize ownership distribution
import polars as pl

# Create a mapping DataFrame for ownership names
ownership_df = pl.DataFrame({
    "OWNGRPCD": [10, 20, 30, 40],
    "Ownership": ["National Forest", "Other Federal", "State/Local", "Private"]
})

# Join to add names
result_named = result.join(ownership_df, on="OWNGRPCD", how="left")

fig = plot_by_category(
    result_named,
    category_col="Ownership",
    value_col="AREA",
    error_col="AREA_SE",
    title="Rhode Island Forest Area by Ownership",
    xlabel="Area (acres)"
)
plt.show()

---

## 8. Forest vs. Timberland

FIA distinguishes between:

- **Forest land**: All land meeting the FIA forest definition (at least 10% tree canopy cover, at least 1 acre in size and 120 feet wide)
- **Timberland**: Forest land capable of producing 20+ cubic feet/acre/year of wood, not legally reserved

Use the `land_type` parameter to switch between them.

In [None]:
with FIA(db_path) as db:
    db.clip_most_recent()
    
    # All forest land
    forest = area(db, land_type="forest")
    
    # Timberland only
    timber = area(db, land_type="timber")

print(f"All Forest Land: {forest['AREA'][0]:,.0f} acres")
print(f"Timberland:      {timber['AREA'][0]:,.0f} acres")
print(f"Difference:      {forest['AREA'][0] - timber['AREA'][0]:,.0f} acres (reserved or low productivity)")
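
To make the distinction concrete, here is a sketch of how the two definitions could be expressed as filters on condition records. It uses the FIADB column names `COND_STATUS_CD`, `RESERVCD`, and `SITECLCD` with made-up values; whether pyFIA applies exactly these filters internally is an assumption.

```python
# Toy condition records with FIADB-style column names (values are made up)
conditions = [
    {"COND_STATUS_CD": 1, "RESERVCD": 0, "SITECLCD": 3},     # productive, unreserved forest
    {"COND_STATUS_CD": 1, "RESERVCD": 1, "SITECLCD": 3},     # reserved forest (e.g. a park)
    {"COND_STATUS_CD": 1, "RESERVCD": 0, "SITECLCD": 7},     # low-productivity forest
    {"COND_STATUS_CD": 2, "RESERVCD": 0, "SITECLCD": None},  # nonforest land
]


def is_forest(cond):
    # COND_STATUS_CD == 1 marks accessible forest land
    return cond["COND_STATUS_CD"] == 1


def is_timberland(cond):
    # Timberland: forest, not reserved, site class 1-6 (20+ cubic feet/acre/year)
    return (
        is_forest(cond)
        and cond["RESERVCD"] == 0
        and cond["SITECLCD"] in {1, 2, 3, 4, 5, 6}
    )


n_forest = sum(is_forest(c) for c in conditions)
n_timber = sum(is_timberland(c) for c in conditions)
print(f"Forest conditions: {n_forest}, timberland conditions: {n_timber}")
```

Every timberland condition is also forest land, so timberland area can never exceed forest area.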

---

## 9. Quick Reference

### Key Functions

```python
# Download data
db_path = download("RI")

# Connect to database
with FIA(db_path) as db:
    db.clip_most_recent()  # Select most recent evaluation
    
    # Estimate area
    result = area(db)
    result = area(db, grp_by="OWNGRPCD")
    result = area(db, land_type="timber")
```

### Common Parameters

| Parameter | Description | Example |
|-----------|-------------|--------|
| `grp_by` | Group results by column(s) | `grp_by="OWNGRPCD"` |
| `land_type` | Filter land type | `"forest"`, `"timber"`, `"all"` |
| `variance` | Return variance instead of SE | `variance=True` |
| `totals` | Include total area column | `totals=True` |

---

## Exercise 1: Forest Area by Forest Type

**Task**: Estimate forest area grouped by forest type code (`FORTYPCD`).

1. Connect to the Rhode Island database
2. Select the most recent evaluation
3. Run `area()` with `grp_by="FORTYPCD"`
4. Display the top 10 forest types by area

**Hint**: Forest type codes are numeric. You'll learn how to add names in the next notebook!

In [None]:
# Your code here


<details>
<summary><b>Click to reveal solution</b></summary>

```python
with FIA(db_path) as db:
    db.clip_most_recent()
    result = area(db, grp_by="FORTYPCD")
    
# Sort by area and show top 10
top_10 = result.sort("AREA", descending=True).head(10)
display_estimate(top_10, title="Top 10 Forest Types by Area")
```

</details>

---

## Exercise 2: Timberland by Ownership

**Task**: How much timberland does each ownership category have in Rhode Island?

1. Estimate timberland area (not all forest) by ownership
2. Create a bar chart showing the distribution

**Hint**: Use `land_type="timber"` and `grp_by="OWNGRPCD"`

In [None]:
# Your code here


<details>
<summary><b>Click to reveal solution</b></summary>

```python
with FIA(db_path) as db:
    db.clip_most_recent()
    result = area(db, grp_by="OWNGRPCD", land_type="timber")

# Create a mapping DataFrame for ownership names
ownership_df = pl.DataFrame({
    "OWNGRPCD": [10, 20, 30, 40],
    "Ownership": ["National Forest", "Other Federal", "State/Local", "Private"]
})

# Join to add names
result_named = result.join(ownership_df, on="OWNGRPCD", how="left")

display_estimate(result_named, title="Timberland by Ownership")

fig = plot_by_category(
    result_named,
    category_col="Ownership",
    value_col="AREA",
    error_col="AREA_SE",
    title="Rhode Island Timberland by Ownership",
    xlabel="Area (acres)"
)
plt.show()
```

</details>

---

## Summary

In this notebook, you learned:

1. **FIA data** provides comprehensive forest inventory for all U.S. states
2. **pyFIA** simplifies working with this complex data
3. **`download()`** fetches data from FIA DataMart
4. **`FIA()`** class connects to databases
5. **EVALIDs** identify specific evaluations - use `clip_most_recent()` to avoid double-counting
6. **`area()`** estimates forest area with proper variance
7. **`grp_by`** breaks down results by category
8. **Standard error** quantifies uncertainty in estimates

## Next Steps

Continue to **Notebook 2: Core Estimators** to learn about:
- Volume estimation
- Biomass and carbon
- Trees per acre (TPA)
- Reference table lookups for species names