# Species Name Harmonization Using World Flora Online (WFO)

## Background

When working with ecological datasets from multiple sources, species names often vary due to:
- Different taxonomic authorities and naming conventions
- Synonyms and outdated nomenclature
- Spelling variations and author citations
- Subspecies, varieties, and cultivars vs. species-level names

The World Flora Online (WFO) Plant Name Portal provides a standardized reference for plant nomenclature. Matching species names against WFO before joining datasets resolves synonyms and formatting differences to a single accepted name, which improves taxonomic consistency and recovers matches that an exact string comparison would miss.
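
To see why this matters, consider a plain exact-string join on names that differ only in their author citation. The sketch below uses pandas with made-up values; it illustrates the problem and is not part of the WFO utilities used later in this notebook.

```python
import pandas as pd

# Two tiny tables that refer to the same three species,
# but format the names differently (illustrative values only).
traits = pd.DataFrame({
    "species": ["Quercus alba", "Pinus strobus L.", "Abies balsamea"],
    "height_m": [25.3, 35.2, 18.6],
})
occurrences = pd.DataFrame({
    "scientific_name": ["Quercus alba L.", "Pinus strobus", "Abies balsamea"],
    "abundance": [12, 15, 9],
})

# An exact-string inner join only pairs names that are identical
# character for character, so author citations break the match.
naive = traits.merge(
    occurrences, left_on="species", right_on="scientific_name", how="inner"
)

print(f"Species shared by both tables: 3, matched by exact join: {len(naive)}")
```

Only the name that happens to be written identically in both tables survives the join; the other two shared species are silently dropped.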

## Methodology

This notebook demonstrates how to:
1. Load two datasets with species names
2. Use the WFO matching API to standardize species names
3. Join the datasets using harmonized nomenclature
4. Evaluate the quality and success rate of the matching process
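
The four steps can be sketched end to end with a hard-coded lookup table standing in for the WFO service. Everything here is an assumption made for illustration: `ACCEPTED_NAMES` and `harmonize` are hypothetical, and the project's own `harmonize_and_join_tables` may work differently.

```python
import pandas as pd

# Hypothetical stand-in for a WFO lookup: raw name -> accepted name.
# A real workflow would obtain these from the WFO matching service.
ACCEPTED_NAMES = {
    "Quercus alba": "Quercus alba",
    "Quercus alba L.": "Quercus alba",
    "Pinus strobus": "Pinus strobus",
    "Pinus strobus L.": "Pinus strobus",
}


def harmonize(df: pd.DataFrame, name_col: str) -> pd.DataFrame:
    """Return a copy with a 'species_harmonized' column (NaN when unmatched)."""
    out = df.copy()
    out["species_harmonized"] = out[name_col].map(ACCEPTED_NAMES)
    return out


# Step 1: load two tables with species names (mock values).
traits = pd.DataFrame({
    "species": ["Quercus alba", "Pinus strobus L.", "Acer sp."],
    "height_m": [25.3, 35.2, 20.0],
})
occurrences = pd.DataFrame({
    "scientific_name": ["Quercus alba L.", "Pinus strobus"],
    "abundance": [12, 15],
})

# Step 2: standardize the names in each table.
traits_h = harmonize(traits, "species")
occ_h = harmonize(occurrences, "scientific_name")

# Step 3: join on the harmonized name, dropping unmatched rows first.
joined = traits_h.dropna(subset=["species_harmonized"]).merge(
    occ_h.dropna(subset=["species_harmonized"]),
    on="species_harmonized",
    how="inner",
)

# Step 4: evaluate how many input names received an accepted name.
match_rate = traits_h["species_harmonized"].notna().mean()
print(f"Matched {match_rate:.0%} of trait names; joined rows: {len(joined)}")
```

Dropping unmatched rows before the merge keeps missing names from being paired with each other, since pandas treats NaN keys as equal when merging.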

## Scientific Context

Taxonomic name standardization is crucial for:
- **Trait synthesis**: Combining functional trait data from multiple databases
- **Biogeographic analysis**: Linking occurrence records with trait measurements
- **Comparative ecology**: Ensuring species-level comparisons use consistent nomenclature
- **Conservation assessment**: Accurate species identification for threat status evaluation


In [None]:
import sys
from pathlib import Path

# Add project root to path (assumes this notebook lives two directories below the root)
project_root = Path.cwd().parent.parent
sys.path.append(str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Any

# Import our WFO matching utilities
from src.utils.wfo_matching import (
    WFOMatchingAPI,
    harmonize_species_names,
    harmonize_and_join_tables,
    get_match_quality_report
)
from src.utils.trait_utils import clean_species_name
from src.conf.environment import log

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Imports successful!")


## Create Example Datasets

For demonstration purposes, we'll create two mock datasets with overlapping but differently formatted species names.


In [None]:
# Create example dataset 1: Trait data with some naming variations
trait_data = pd.DataFrame({
    'species': [
        'Quercus alba',           # Standard binomial
        'Acer saccharum',         # Standard binomial
        'Pinus strobus L.',       # With author
        'Fagus grandifolia',      # Standard binomial
        'Betula papyrifera',      # Standard binomial
        'Tsuga canadensis (L.) Carrière',  # With full author citation
        'Abies balsamea',         # Standard binomial
        'Picea rubens',           # Standard binomial
        'Thuja occidentalis',     # Standard binomial
        'Ulmus americana',        # Standard binomial
    ],
    'leaf_area_cm2': [45.2, 78.3, 12.4, 56.7, 34.1, 8.9, 15.2, 6.7, 3.4, 62.5],
    'wood_density_g_cm3': [0.68, 0.63, 0.35, 0.64, 0.55, 0.40, 0.37, 0.41, 0.31, 0.46],
    'height_m': [25.3, 28.7, 35.2, 30.1, 22.4, 28.9, 18.6, 25.1, 15.2, 24.8]
})

# Create example dataset 2: Occurrence data with different naming conventions
occurrence_data = pd.DataFrame({
    'scientific_name': [
        'Quercus alba L.',        # With author
        'Acer saccharum Marshall', # With different author format
        'Pinus strobus',          # Without author
        'Fagus grandifolia Ehrh.', # With author
        'Betula papyrifera Marshall', # With author
        'Tsuga canadensis',       # Without author
        'Abies balsamea (L.) Mill.', # With author
        'Picea rubens Sarg.',     # With author
        'Thuja occidentalis L.',  # With author
        'Populus tremuloides',    # Different species not in trait data
        'Fraxinus americana',     # Different species not in trait data
    ],
    'latitude': [45.2, 44.8, 46.1, 45.7, 47.2, 45.9, 46.8, 45.4, 46.2, 44.9, 45.1],
    'longitude': [-75.3, -74.2, -76.8, -75.9, -78.1, -76.4, -77.2, -75.8, -76.7, -74.8, -75.6],
    'abundance': [12, 8, 15, 6, 23, 11, 9, 18, 7, 14, 5]
})

print("Example datasets created:")
print(f"\nTrait data ({len(trait_data)} species):")
print(trait_data.head())
print(f"\nOccurrence data ({len(occurrence_data)} species):")
print(occurrence_data.head())


## WFO-Based Species Name Harmonization

Now let's use the World Flora Online API to standardize the species names and improve the match rate.

**Note**: This example shows the workflow with mock data. In practice, you would replace this with your actual datasets.


In [None]:
# Initialize WFO API client
# Note: In production, you may want to adjust rate_limit_delay based on API limits
wfo_client = WFOMatchingAPI(rate_limit_delay=0.5)  # 0.5 second delay between requests

# Create cache directory
cache_dir = Path("wfo_cache")
cache_dir.mkdir(exist_ok=True)

# Perform harmonized join in one step
print("Harmonizing species names and joining tables...")
joined_data = harmonize_and_join_tables(
    df1=trait_data,
    df2=occurrence_data,
    species_col1='species',
    species_col2='scientific_name',
    join_type='inner',  # Only keep species present in both datasets
    api_client=wfo_client,
    cache_dir=cache_dir
)

print("\nJoin Results:")
print(f"  Original trait data: {len(trait_data)} species")
print(f"  Original occurrence data: {len(occurrence_data)} species")
print(f"  Joined data: {len(joined_data)} species")
# Success rate is measured against the smaller input table,
# which is the upper bound for an inner join on unique names
print(f"  Success rate: {100*len(joined_data)/min(len(trait_data), len(occurrence_data)):.1f}%")

print("\nJoined dataset preview:")
print(joined_data[['species_harmonized', 'leaf_area_cm2', 'latitude', 'longitude']].head())


## Usage with Your Data

To use this workflow with your actual datasets, follow these steps:

1. **Replace example data** with your actual DataFrames
2. **Specify correct column names** for species in each dataset
3. **Set up caching** to avoid repeated API calls
4. **Choose appropriate join type** based on your analysis needs
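
The effect of the join type can be checked on a small example before running it on real data. The sketch below uses `pandas.DataFrame.merge` directly with mock values; that `join_type` in `harmonize_and_join_tables` behaves like the pandas `how` argument is an assumption, not something confirmed here.

```python
import pandas as pd

# Mock tables that already share a harmonized name column.
left = pd.DataFrame({
    "species_harmonized": ["Quercus alba", "Ulmus americana"],
    "height_m": [25.3, 24.8],
})
right = pd.DataFrame({
    "species_harmonized": ["Quercus alba", "Populus tremuloides"],
    "abundance": [12, 14],
})

row_counts = {}
for how in ("inner", "left", "outer"):
    # indicator=True adds a '_merge' column recording where each row came from
    merged = left.merge(right, on="species_harmonized", how=how, indicator=True)
    row_counts[how] = len(merged)
    print(how, len(merged), merged["_merge"].astype(str).tolist())
```

An inner join keeps only species found in both tables, a left join keeps every row of the first table, and an outer join keeps everything, leaving missing values where one side has no data.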

### Example with Project Data

```python
# Load your actual data
from src.conf.conf import get_config
cfg = get_config()

# Load GBIF data
gbif_data = pd.read_parquet(
    Path(cfg.interim_dir, cfg.gbif.interim.dir, cfg.gbif.interim.matched)
)

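# Load TRY trait data
# (get_try_traits_interim_fn is a project helper; import it from wherever it is defined in your codebase)
try_data = pd.read_parquet(get_try_traits_interim_fn())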

# Harmonize and join
harmonized_data = harmonize_and_join_tables(
    df1=gbif_data,
    df2=try_data,
    species_col1='speciesname',
    species_col2='speciesname', 
    join_type='inner',
    cache_dir=Path('data/wfo_cache')
)
```

### Best Practices

- **Cache results**: Always use `cache_dir` to avoid repeated API calls
- **Rate limiting**: Adjust `rate_limit_delay` based on API usage policies
- **Quality checks**: Use `get_match_quality_report()` to assess matching success
- **Batch processing**: For large datasets, process in smaller chunks
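
As a rough illustration of how caching, rate limiting, and batching fit together, the sketch below wraps an arbitrary lookup function. The names `chunked`, `cached_lookup`, and `fake_lookup` are hypothetical and written for this example only; the on-disk JSON format is likewise an assumption and may not match what `cache_dir` stores in the project utilities.

```python
import json
import tempfile
import time
from pathlib import Path


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cached_lookup(names, lookup, cache_file, delay=0.5, chunk_size=100):
    """Resolve names through `lookup`, reusing results stored in `cache_file`."""
    cache = {}
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    # Only unique names that are not cached yet need a request.
    pending = sorted(set(names) - set(cache))
    for chunk in chunked(pending, chunk_size):
        for name in chunk:
            cache[name] = lookup(name)
            time.sleep(delay)  # pause between requests (rate limiting)
        # Persist after every chunk so an interrupted run keeps its progress.
        cache_file.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return {name: cache[name] for name in names}


# Demo with a fake lookup that records how often it is called.
calls = []


def fake_lookup(name):
    calls.append(name)
    return " ".join(name.split()[:2])  # crude stand-in: keep genus and epithet


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "names.json"
    first = cached_lookup(["Quercus alba L.", "Pinus strobus"], fake_lookup, cache_file, delay=0)
    second = cached_lookup(["Quercus alba L."], fake_lookup, cache_file, delay=0)

print(f"Lookups performed: {len(calls)}")
```

The second call is answered entirely from the cache file, so the lookup function runs only once per unique name across both calls.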
