# Chapter 8: Association, Partitioning, and Ordination

This chapter covers multivariate analysis techniques for ecological data,
focusing on ordination methods using the nuee package (Python port of R's vegan).

## Learning Objectives
- Understand different ordination methods and their applications
- Perform PCA, CA, RDA, and CCA analyses
- Interpret ordination results in ecological context
- Visualize community patterns and species-environment relationships

In [1]:
# Essential imports for ordination analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nuee
import warnings
warnings.filterwarnings('ignore')

print("Packages loaded for ordination analysis")

Packages loaded for ordination analysis


## Introduction to Ordination

**Ordination** arranges samples or species along axes that represent the main
patterns of variation in ecological data.

**Common methods**:
- **PCA**: Principal Component Analysis (linear, unconstrained)
- **CA**: Correspondence Analysis (unimodal, unconstrained)
- **RDA**: Redundancy Analysis (linear, constrained)
- **CCA**: Canonical Correspondence Analysis (unimodal, constrained)
- **NMDS**: Non-metric Multidimensional Scaling (distance-based)

**When to use each method**:
- **PCA**: Short environmental gradients, linear species responses
- **CA**: Long gradients, unimodal species responses
- **RDA**: Relate community composition to environmental variables (linear)
- **CCA**: Relate community composition to environmental variables (unimodal)
- **NMDS**: Complex, non-linear relationships

## Load Example Data

In [2]:
# Load classic ecological datasets from nuee
varespec = nuee.datasets.varespec()  # Vegetation cover (44 species) at 24 lichen-pasture sites
varechem = nuee.datasets.varechem()  # Soil chemistry and other site variables

print("Varespec dataset (species composition):")
print(f"Shape: {varespec.shape}")
print(varespec.iloc[:5, :8])  # First 5 sites, 8 species

print("\nVarechem dataset (environmental variables):")
print(f"Shape: {varechem.shape}")
print(varechem.head())

Varespec dataset (species composition):
Shape: (24, 44)
    Callvulg  Empenigr  Rhodtome  Vaccmyrt  Vaccviti  Pinusylv  Descflex  \
18      0.55     11.13      0.00      0.00     17.80      0.07       0.0   
15      0.67      0.17      0.00      0.35     12.13      0.12       0.0   
24      0.10      1.55      0.00      0.00     13.47      0.25       0.0   
27      0.00     15.13      2.42      5.92     15.97      0.00       3.7   
23      0.00     12.68      0.00      0.00     23.73      0.03       0.0   

    Betupube  
18       0.0  
15       0.0  
24       0.0  
27       0.0  
23       0.0  

Varechem dataset (environmental variables):
Shape: (24, 14)
       N     P      K     Ca     Mg     S     Al    Fe     Mn    Zn   Mo  \
18  19.8  42.1  139.9  519.4   90.0  32.3   39.0  40.9   58.1   4.5  0.3   
15  13.4  39.1  167.3  356.7   70.7  35.2   88.1  39.0   52.4   5.4  0.3   
24  20.2  67.7  207.1  973.3  209.1  58.1  138.0  35.4   32.1  16.8  0.8   
27  20.6  60.8  233.7  834.0  12

## Data Exploration and Preparation

In [3]:
# Basic statistics
print("Species data summary:")
print(f"Total species: {varespec.shape[1]}")
print(f"Total sites: {varespec.shape[0]}")
print(f"Total abundance: {varespec.sum().sum()}")
print(f"Total cover per site (mean +/- SD): {varespec.sum(axis=1).mean():.1f} +/- {varespec.sum(axis=1).std():.1f}")
print(f"Sites per species (mean +/- SD): {(varespec > 0).sum(axis=0).mean():.1f} +/- {(varespec > 0).sum(axis=0).std():.1f}")

# Check for empty sites or rare species
site_totals = varespec.sum(axis=1)
species_totals = varespec.sum(axis=0)

print(f"\nEmpty sites: {(site_totals == 0).sum()}")
print(f"Absent species: {(species_totals == 0).sum()}")
# "Rare" here means low summed cover across all sites, not few occurrences
print(f"Rare species (total cover < 3): {(species_totals < 3).sum()}")

Species data summary:
Total species: 44
Total sites: 24
Total abundance: 2417.72
Total cover per site (mean +/- SD): 100.7 +/- 15.8
Sites per species (mean +/- SD): 14.0 +/- 7.9

Empty sites: 0
Absent species: 0
Rare species (total cover < 3): 14


## Principal Component Analysis (PCA)

PCA is appropriate for short environmental gradients where species responses
are approximately linear.

In [4]:
# Perform PCA using nuee
pca_result = nuee.pca(varespec)

print("PCA Results:")
print(f"Eigenvalues: {pca_result.eigenvalues[:4]}")
print(f"Proportion explained: {pca_result.proportion_explained[:4]}")
print(f"Cumulative proportion: {pca_result.cumulative_proportion[:4]}")

# How much variance is explained by first two axes?
var_explained = pca_result.cumulative_proportion[1] * 100
print(f"\nFirst two axes explain {var_explained:.1f}% of variance")

PCA Results:
Eigenvalues: [8.898398   4.75555273 4.26430911 3.73150636]
Proportion explained: [0.20223632 0.10808074 0.09691612 0.08480696]
Cumulative proportion: [0.20223632 0.31031706 0.40723318 0.49204014]

First two axes explain 31.0% of variance
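
To make the eigenvalue output concrete, here is a minimal sketch of a PCA computed by hand with NumPy on a small made-up matrix. The data and the function name `pca_eigen` are illustrative assumptions, not nuee internals. The sketch standardizes each column, so the eigenvalues sum to the number of variables. The `nuee.pca` output above is consistent with that kind of scaling (8.898 / 0.2022 is about 44, the number of species), although this chapter does not show how nuee scales the data.

```python
import numpy as np

def pca_eigen(X):
    """Eigenvalues and proportion explained for a PCA on standardized columns."""
    X = np.asarray(X, dtype=float)
    # Center each column and scale it to unit variance (sample SD, ddof=1)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the columns
    R = (Z.T @ Z) / (X.shape[0] - 1)
    # eigvalsh returns eigenvalues in ascending order, so reverse them
    eigenvalues = np.linalg.eigvalsh(R)[::-1]
    proportion = eigenvalues / eigenvalues.sum()
    return eigenvalues, proportion

# Hypothetical toy data: 5 sites x 3 species
X = [[1.0, 2.0, 0.5],
     [2.0, 1.5, 0.7],
     [3.0, 3.5, 0.2],
     [4.0, 3.0, 0.9],
     [5.0, 5.5, 0.4]]

eigenvalues, proportion = pca_eigen(X)
print("Eigenvalues:", eigenvalues.round(3))
print("Proportion explained:", proportion.round(3))
print("Cumulative proportion:", np.cumsum(proportion).round(3))
```

Each proportion is simply an eigenvalue divided by the sum of all eigenvalues, which is why the cumulative proportion always ends at 1.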


## PCA Visualization

In [None]:
# Create PCA biplot
pca_plot = pca_result.biplot(scaling=2, n_species=10)
pca_plot

## Correspondence Analysis (CA)

CA is appropriate for longer environmental gradients where species show
unimodal responses.

In [6]:
# Perform CA using nuee (CA is CCA without constraints)
ca_result = nuee.ca(varespec)

print("CA Results:")
print(f"Eigenvalues: {ca_result.eigenvalues[:4]}")
print(f"Proportion explained: {ca_result.proportion_explained[:4]}")
print(f"Cumulative proportion: {ca_result.cumulative_proportion[:4]}")

# CA partitions inertia (chi-square based), so these proportions are not directly comparable with PCA variance
ca_var_explained = ca_result.cumulative_proportion[1] * 100
print(f"\nFirst two axes explain {ca_var_explained:.1f}% of variance")

CA Results:
Eigenvalues: [0.52493199 0.35679802 0.23443751 0.19546325]
Proportion explained: [0.25198369 0.17127415 0.1125373  0.09382844]
Cumulative proportion: [0.25198369 0.42325784 0.53579514 0.62962359]

First two axes explain 42.3% of variance
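
The quantity that CA partitions is total inertia: the Pearson chi-square statistic of the site-by-species table divided by the table's grand total. In the output above, the eigenvalues and proportions imply a total inertia of roughly 0.525 / 0.252, or about 2.08. The sketch below computes total inertia for two made-up tables; the function name `total_inertia` and the data are hypothetical, and this is not how nuee is necessarily implemented.

```python
def total_inertia(table):
    """Total inertia of a CA: Pearson chi-square divided by the grand total."""
    grand = sum(sum(row) for row in table)
    row_mass = [sum(row) / grand for row in table]
    col_mass = [sum(col) / grand for col in zip(*table)]
    if min(row_mass) == 0 or min(col_mass) == 0:
        raise ValueError("remove empty sites and absent species first")
    inertia = 0.0
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            expected = row_mass[i] * col_mass[j]  # proportion expected under independence
            observed = value / grand
            inertia += (observed - expected) ** 2 / expected
    return inertia

# Hypothetical abundance tables (rows = sites, columns = species)
no_turnover = [[4, 4], [2, 2]]      # same species proportions at both sites
full_turnover = [[10, 0], [0, 10]]  # no shared species

print(round(total_inertia(no_turnover), 6))    # 0.0
print(round(total_inertia(full_turnover), 6))  # 1.0
```

Inertia is zero when every site has the same species proportions and grows as sites share fewer species, which is why the empty-site and absent-species checks earlier in the chapter matter for CA.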


## CA Visualization

In [None]:
# Create CA biplot
ca_plot = ca_result.biplot(
    title="CA of Lichen Species Composition",
    figsize=(10, 8),
    n_species=12,
)
ca_plot

## Redundancy Analysis (RDA)

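RDA is the constrained (canonical) counterpart of PCA. It relates species
composition to environmental variables through linear combinations of the
predictors, so the constrained axes show only the variation that the
predictors can account for.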

In [8]:
# Perform RDA with environmental constraints
rda_result = nuee.rda(varespec, varechem)

print("RDA Results:")
print(f"Constrained eigenvalues: {rda_result.constrained_eigenvalues[:4]}")
print(f"Unconstrained eigenvalues: {rda_result.unconstrained_eigenvalues[:4]}")
print(f"Constrained proportion: {rda_result.constrained_proportion[:4]}")
print(f"Total proportion: {rda_result.total_proportion[:4]}")

# How much variation is explained by environmental variables?
# This is an unadjusted R2; with 14 predictors and 24 sites it is optimistic
constrained_var = rda_result.constrained_proportion.sum() * 100
print(f"\nEnvironmental variables explain {constrained_var:.1f}% of species variation")

RDA Results:
Constrained eigenvalues: [820.10421071 399.28474306 102.56167814  47.63169397]
Unconstrained eigenvalues: [186.19173249  88.46419414  38.1882774   18.40207455]
Constrained proportion: [0.44920986 0.21870714 0.05617788 0.02609013]
Total proportion: [0.44920986 0.21870714 0.05617788 0.02609013]

Environmental variables explain 80.0% of species variation


## RDA Visualization

In [None]:
rda_plot = rda_result.biplot(
    scaling=2,
    figsize=(10, 8),
    n_species=10,
    arrow_mul=1.5,
)
rda_plot

## Environmental Fitting (envfit)

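`envfit` fits each environmental variable onto an existing ordination and
reports the strength of the fit (R-squared) and a p-value for each variable.
Here the variables are fitted to the unconstrained PCA.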

In [None]:
# Fit environmental variables to ordination
envfit_result = nuee.envfit(pca_result, varechem)

vectors = envfit_result["vectors"]

print("Environmental fitting results:")
print("Variable correlations with ordination axes:")
print(pd.DataFrame(
    vectors["scores"],
    index=vectors["variables"],
    columns=[f"Axis{i+1}" for i in range(vectors["scores"].shape[1])]
).round(3))

print(f"\nR-squared values:")
for var, r2 in zip(vectors["variables"], vectors["r2"]):
    print(f"  {var}: R2 = {r2:.3f}")

print(f"\nSignificant variables (p < 0.05):")
sig_mask = vectors["pvals"] < 0.05
for i, (var, p_val) in enumerate(zip(vectors["variables"], vectors["pvals"])):
    if sig_mask[i]:
        print(f"  {var}: p = {p_val:.3f}")

Environmental fitting results:
Variable correlations with ordination axes:
          Axis1  Axis2  Axis3  Axis4  Axis5  Axis6  Axis7  Axis8  Axis9  \
N        -0.216 -0.408 -0.102  0.514  0.062  0.065  0.113  0.032 -0.090   
P         0.005 -0.224  0.205 -0.667 -0.017  0.187  0.144  0.074  0.241   
K         0.025 -0.272  0.171 -0.499  0.066 -0.090  0.205 -0.157  0.211   
Ca        0.210 -0.067  0.378 -0.487 -0.086  0.320  0.344  0.070  0.006   
Mg        0.213 -0.059  0.394 -0.356  0.051  0.134  0.287  0.348  0.143   
S        -0.059  0.022  0.129 -0.624  0.222 -0.131  0.046  0.032  0.331   
Al       -0.323  0.290 -0.466 -0.196  0.191  0.021 -0.185 -0.138  0.506   
Fe       -0.314  0.095 -0.434  0.011 -0.016  0.048 -0.206  0.031  0.449   
Mn        0.088 -0.677  0.259 -0.266 -0.218  0.001 -0.051 -0.053 -0.249   
Zn        0.035 -0.122  0.172 -0.487  0.393  0.287  0.296  0.164  0.023   
Mo       -0.229  0.091 -0.077 -0.179  0.519 -0.173 -0.196  0.168  0.483   
Baresoil  0.416 -0.101  0

## Non-metric Multidimensional Scaling (NMDS)

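NMDS is a rank-based method. It places sites in a small number of dimensions
so that the rank order of between-site distances in the plot matches the rank
order of the chosen dissimilarities as closely as possible. It assumes neither
linear nor unimodal species responses and works with any dissimilarity measure.
Stress measures the remaining mismatch, so lower values are better.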

In [None]:
# Perform NMDS
nmds_result = nuee.metaMDS(varespec, distance='bray', k=2)

print("NMDS Results:")
print(f"Stress: {nmds_result.stress:.3f}")
print(f"Converged: {nmds_result.converged}")
print(f"Number of runs: {nmds_result.nruns}")

# Stress interpretation
if nmds_result.stress < 0.05:
    stress_interpretation = "excellent"
elif nmds_result.stress < 0.1:
    stress_interpretation = "good"
elif nmds_result.stress < 0.2:
    stress_interpretation = "fair"
else:
    stress_interpretation = "poor"

print(f"Stress interpretation: {stress_interpretation}")

NMDS Results:
Stress: 0.133
Converged: True
Number of runs: 20
Stress interpretation: fair
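
The `distance='bray'` argument presumably selects the Bray-Curtis dissimilarity, as it does in vegan. As a minimal sketch of what that measure computes for one pair of sites (the function name `bray_curtis` and the site vectors are hypothetical):

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 0 = identical sites, 1 = no shared species."""
    if len(x) != len(y):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        raise ValueError("both sites are empty; dissimilarity is undefined")
    return numerator / denominator

# Hypothetical abundances of three species at three sites
site_a = [10, 5, 0]
site_b = [5, 5, 5]
site_c = [0, 0, 8]

print(round(bray_curtis(site_a, site_b), 3))  # 0.333
print(bray_curtis(site_a, site_c))            # 1.0
```

NMDS uses only the rank order of these pairwise values, which is why it tolerates skewed abundance data better than eigenvalue-based methods.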


## NMDS Visualization

In [None]:
# Create NMDS plot
nmds_plot = nmds_result.plot(
    title=f"NMDS of Lichen Communities (stress = {nmds_result.stress:.3f})",
    figsize=(10, 8),
    n_species=12,
)
nmds_plot

## Canonical Correspondence Analysis (CCA)

CCA is the unimodal equivalent of RDA, appropriate for long environmental gradients.

In [None]:
# Perform CCA with environmental constraints
cca_result = nuee.cca(varespec, varechem)

print("CCA Results:")
print(f"Constrained eigenvalues: {cca_result.constrained_eigenvalues[:4]}")
print(f"Unconstrained eigenvalues: {cca_result.unconstrained_eigenvalues[:4]}")
print(f"Constrained proportion: {cca_result.constrained_proportion[:4]}")

# Total inertia in CCA is the sum of all eigenvalues
total_inertia = cca_result.total_inertia
constrained_inertia = cca_result.constrained_inertia

print(f"\nTotal inertia: {total_inertia:.3f}")
print(f"Constrained inertia: {constrained_inertia:.3f}")
print(f"Proportion constrained: {constrained_inertia/total_inertia:.3f}")

CCA Results:
Constrained eigenvalues: [0.43887043 0.29177526 0.16284652 0.14213024]
Unconstrained eigenvalues: [0.19776451 0.14192556 0.1011741  0.07078684]
Constrained proportion: [0.21067146 0.1400612  0.0781714  0.06822694]

Total inertia: 2.083
Constrained inertia: 1.441
Proportion constrained: 0.692


## CCA Visualization

In [None]:
# Create CCA triplot
cca_plot = cca_result.biplot(
    title="CCA: Canonical Correspondence Analysis",
    figsize=(10, 8),
    n_species=10,
)
cca_plot

## Permutation Tests for Constrained Ordination

A permutation test asks whether the constraints explain more variation than
expected if the species data were unrelated to the environmental variables.

In [None]:
# Test overall significance of RDA
rda_perm = nuee.permutest(rda_result, permutations=999)

print("RDA Permutation Test:")
print(f"F-statistic: {rda_perm['f_observed']:.3f}")
print(f"P-value: {rda_perm['tab'].loc['Model', 'Pr(>F)']:.3f}")
print(f"Permutations: {rda_perm['permutations']}")

print(f"\nANOVA Table:")
print(rda_perm['tab'])

RDA Permutation Test:
F-statistic: 2.566
P-value: 0.006
Permutations: 999

ANOVA Table:
          Df     Variance         F  Pr(>F)
Model     14  1459.889052  2.565818   0.006
Residual   9   365.770352       NaN     NaN
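
The F-statistic in this table can be reproduced from its own rows: it is the constrained variance per model degree of freedom divided by the residual variance per residual degree of freedom. The sketch below recomputes it and also shows a common rule for turning permuted statistics into a p-value. The function names and the permuted values are hypothetical, and the counting rule is the one vegan uses; that nuee follows it is an assumption.

```python
def pseudo_f(model_var, model_df, resid_var, resid_df):
    """Pseudo-F: model variance per df over residual variance per df."""
    return (model_var / model_df) / (resid_var / resid_df)

def permutation_p(f_observed, f_permuted):
    """P-value that counts the observed statistic as one of the permutations."""
    n_extreme = sum(1 for f in f_permuted if f >= f_observed)
    return (n_extreme + 1) / (len(f_permuted) + 1)

# Values copied from the ANOVA table above
f_obs = pseudo_f(1459.889052, 14, 365.770352, 9)
print(f"F = {f_obs:.3f}")  # F = 2.566

# Hypothetical permuted F values, only to illustrate the counting rule
f_perm = [0.8, 1.1, 2.9, 0.6, 1.4, 0.9, 1.7, 1.0, 1.2]
print(f"p = {permutation_p(f_obs, f_perm):.3f}")  # p = 0.200
```

Under this rule, the smallest p-value attainable with 999 permutations is 1/1000 = 0.001, and the reported 0.006 would correspond to five permuted statistics at least as large as the observed one.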


## Variable Selection in Constrained Ordination

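Compare the full RDA with a model fitted on a hand-picked subset of
predictors. This is a manual comparison, not an automated selection
procedure, so it shows how much a smaller model retains rather than which
variables are best.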

In [None]:
# Select a few key environmental variables and compare models
# Full model
full_var = rda_result.constrained_proportion.sum()
print(f"Full model ({varechem.shape[1]} variables): variance explained = {full_var:.3f}")

# Reduced model with a subset of variables
selected_vars = ['Al', 'P', 'K']
selected_env = varechem[selected_vars]
rda_reduced = nuee.rda(varespec, selected_env)

reduced_var = rda_reduced.constrained_proportion.sum()
print(f"Reduced model ({len(selected_vars)} variables: {selected_vars}): variance explained = {reduced_var:.3f}")

# Test significance of reduced model
rda_reduced_perm = nuee.permutest(rda_reduced, permutations=999)
print(f"\nReduced model significance:")
print(f"  F = {rda_reduced_perm['f_observed']:.3f}, "
      f"p = {rda_reduced_perm['tab'].loc['Model', 'Pr(>F)']:.3f}")

Full model (14 variables): variance explained = 0.800
Reduced model (3 variables: ['Al', 'P', 'K']): variance explained = 0.377

Reduced model significance:
  F = 4.034, p = 0.001
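
The unadjusted proportions above favour the full model partly because it has more predictors. One common correction is Ezekiel's adjusted R-squared, which vegan applies to RDA. The sketch below applies it to the two values printed above; `adjusted_r2` is a hypothetical helper, not a nuee function, and it assumes all predictors are quantitative (one degree of freedom each).

```python
def adjusted_r2(r2, n_sites, n_predictors):
    """Ezekiel's adjusted R-squared for a model with an intercept."""
    resid_df = n_sites - n_predictors - 1
    if resid_df <= 0:
        raise ValueError("need more sites than predictors + 1")
    return 1 - (1 - r2) * (n_sites - 1) / resid_df

n_sites = 24  # rows in varespec
full = adjusted_r2(0.800, n_sites, 14)
reduced = adjusted_r2(0.377, n_sites, 3)
print(f"Full model:    R2 = 0.800, adjusted R2 = {full:.3f}")     # 0.489
print(f"Reduced model: R2 = 0.377, adjusted R2 = {reduced:.3f}")  # 0.284
```

After adjustment the gap between the models narrows considerably, which is a reminder that the 80% figure for the full model owes much to fitting 14 predictors to 24 sites.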


## Comparing Ordination Methods

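Compare different ordination approaches for the same dataset. Treat the axis
proportions as a rough guide rather than a ranking: CA and CCA partition
inertia rather than variance, and the constrained methods use the
environmental predictors while PCA and CA do not. The PCA eigenvalues reported
earlier also imply a total of about 44 (8.898 / 0.202), the number of species,
which suggests the PCA was run on standardized data, whereas the RDA total
(about 1826) reflects raw abundances.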

In [None]:
# Create comparison table
comparison_data = {
    'Method': ['PCA', 'CA', 'RDA', 'CCA', 'NMDS'],
    'Type': ['Unconstrained', 'Unconstrained', 'Constrained', 'Constrained', 'Distance-based'],
    'Response_model': ['Linear', 'Unimodal', 'Linear', 'Unimodal', 'Non-parametric'],
    'Axis1_variance': [
        pca_result.proportion_explained[0],
        ca_result.proportion_explained[0],
        rda_result.constrained_proportion[0],
        cca_result.constrained_proportion[0],
        np.nan  # NMDS doesn't have explained variance
    ],
    'Axis2_variance': [
        pca_result.proportion_explained[1],
        ca_result.proportion_explained[1],
        rda_result.constrained_proportion[1],
        cca_result.constrained_proportion[1],
        np.nan
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("Ordination Method Comparison:")
print(comparison_df.round(3))

Ordination Method Comparison:
  Method            Type  Response_model  Axis1_variance  Axis2_variance
0    PCA   Unconstrained          Linear           0.202           0.108
1     CA   Unconstrained        Unimodal           0.252           0.171
2    RDA     Constrained          Linear           0.449           0.219
3    CCA     Constrained        Unimodal           0.211           0.140
4   NMDS  Distance-based  Non-parametric             NaN             NaN


## Interpreting Ordination Results

**Key principles for interpretation**:

1. **Eigenvalues**: Measure the amount of variation explained by each axis
2. **Species scores**: Show species positions in ordination space
3. **Site scores**: Show sample positions in ordination space
4. **Environmental arrows**: Show direction and strength of environmental gradients

**Ecological interpretation**:
- **Proximity**: Similar sites/species are close in ordination space
- **Gradients**: Environmental arrows show the direction of increasing variable values
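- **Length**: Longer arrows indicate variables more strongly related to the displayed axes
- **Angles**: Small angles between arrows indicate positively correlated variables, right angles indicate little correlation, and opposite directions indicate negative correlation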
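
The length and angle rules can be checked numerically. The sketch below takes the Axis1 and Axis2 values for Al, Fe, and Mn from the envfit output in the Environmental Fitting section and treats them as arrow coordinates, which is an assumption about how the arrows would be drawn. The helper `arrow_angle_deg` is hypothetical. Angles read from two axes are only a rough guide when those axes explain a modest share of the variation.

```python
import math

def arrow_angle_deg(u, v):
    """Angle in degrees between two 2-D arrows drawn from the origin."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        raise ValueError("arrows must have non-zero length")
    # Clamp to [-1, 1] to guard against floating-point overshoot
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))

# (Axis1, Axis2) values copied from the envfit output
al = (-0.323, 0.290)
fe = (-0.314, 0.095)
mn = (0.088, -0.677)

print(f"Al vs Fe: {arrow_angle_deg(al, fe):.0f} degrees")
print(f"Al vs Mn: {arrow_angle_deg(al, mn):.0f} degrees")
print(f"Arrow lengths: Al = {math.hypot(*al):.2f}, Mn = {math.hypot(*mn):.2f}")
```

An acute angle, as between Al and Fe, points to variables that increase together along the plotted axes, while an obtuse angle, as between Al and Mn, points to opposing trends.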

**Statistical considerations**:
- Check for outliers that might distort ordination
- Consider data transformations (log, square root)
- Validate results with independent data when possible
- Use permutation tests to assess significance
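
As a minimal sketch of the transformations mentioned above, the following applies a square-root and a log(1 + x) transformation to made-up cover values and shows how each reduces the weight of a dominant species. The function name `transform` and the data are hypothetical.

```python
import math

def transform(abundances, method="sqrt"):
    """Apply a simple transformation to non-negative abundances."""
    if any(a < 0 for a in abundances):
        raise ValueError("abundances must be non-negative")
    if method == "sqrt":
        return [math.sqrt(a) for a in abundances]
    if method == "log":
        # log(1 + x) keeps zeros at zero
        return [math.log1p(a) for a in abundances]
    raise ValueError(f"unknown method: {method}")

# Hypothetical cover values at one site: one dominant species, several minor ones
cover = [0.0, 0.5, 2.0, 8.0, 64.0]

print(f"raw: dominant species share = {max(cover) / sum(cover):.2f}")
for name in ("sqrt", "log"):
    values = transform(cover, name)
    print(f"{name}: dominant species share = {max(values) / sum(values):.2f}")
```

On a pandas DataFrame such as `varespec`, `np.sqrt(varespec)` and `np.log1p(varespec)` apply the same transformations element-wise.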

## Practical Guidelines

**Choosing an ordination method**:

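1. **Gradient length**: Estimate it with DCA (first-axis length in SD units) or check beta-diversity
   - Short gradients (< 3 SD): Linear methods (PCA, RDA)
   - Intermediate gradients (3 to 4 SD): Either family can work
   - Long gradients (> 4 SD): Unimodal methods (CA, CCA)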

2. **Research question**:
   - Explore patterns: Unconstrained ordination (PCA, CA, NMDS)
   - Test hypotheses: Constrained ordination (RDA, CCA)

3. **Data characteristics**:
   - Species abundances: All methods applicable
   - Presence/absence: CA, CCA may be preferred
   - Highly skewed: Consider transformations or NMDS

4. **Sample size**:
   - Small samples: NMDS may be unstable
   - Large samples: All methods applicable
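
The first two guidelines can be combined into a small decision helper. This is a sketch of the rule of thumb only, with a hypothetical function name; it does not replace inspecting the data, and it treats gradient lengths between 3 and 4 SD as undecided.

```python
def suggest_methods(gradient_length_sd, constrained):
    """Suggest ordination methods from DCA gradient length (SD units) and study aim."""
    linear = "RDA" if constrained else "PCA"
    unimodal = "CCA" if constrained else "CA"
    if gradient_length_sd < 3:
        return [linear]
    if gradient_length_sd > 4:
        return [unimodal]
    # Between 3 and 4 SD the guideline does not favour either family
    return [linear, unimodal]

print(suggest_methods(2.1, constrained=False))  # ['PCA']
print(suggest_methods(5.0, constrained=True))   # ['CCA']
print(suggest_methods(3.5, constrained=True))   # ['RDA', 'CCA']
```

NMDS is left out on purpose, since it is chosen for its distance-based, rank-order approach rather than for a particular gradient length.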

**Common pitfalls**:
- Don't over-interpret axes with low eigenvalues
- Be cautious with rare species, which can have a large influence in CA and CCA (consider downweighting or removing them)
- Check for outlying sites that dominate ordination
- Validate environmental relationships independently

## Summary and Next Steps

In this chapter, we covered:

- **Ordination concepts**: Different methods and their applications
- **PCA**: Linear unconstrained ordination
- **CA**: Unimodal unconstrained ordination
- **RDA**: Linear constrained ordination
- **CCA**: Unimodal constrained ordination
- **NMDS**: Distance-based ordination
- **Environmental fitting**: Testing variable significance
- **Variable selection**: Identifying important predictors
- **Interpretation**: Understanding ecological patterns

**Next chapter**: Data quality assessment - outlier detection and imputation

**Key takeaways**:
- Choose ordination method based on gradient length and research question
- Constrained ordination tests specific hypotheses about species-environment relationships
- Permutation tests provide statistical validation
- Environmental fitting helps identify important variables
- Visualization is crucial for interpretation