# Mercury Correlations in Lake Sediments: Pre- and Post-1970 Analysis

This notebook analyzes the relationships between mercury (Hg) concentrations and selected elemental proxies in lake sediment cores. 

We group elements into three categories:

- **Al & Si**: Proxies for aluminosilicates
- **Ti & Zr**: Proxies for erosion
- **Fe & Br**: Related to organic matter content

We compute Pearson correlation coefficients (r) and p-values separately for data before and after 1970 to highlight potential shifts in sediment geochemistry linked to environmental changes.

Scatter plots show Hg against each element, colored by sample age and drawn without regression lines. Correlation statistics are summarized in tables.

In [148]:
# --- Imports ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from scipy.stats import linregress

# Plot styling
plt.rc('font', size=15)
plt.rc('axes', titlesize=15, labelsize=15)
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.rc('legend', fontsize=15)
plt.rc('figure', titlesize=15)

## Data Loading

Load all necessary data:  
- Elemental scan data, with ages in `'Age_X_EYC'`  
- Mercury concentrations, with ages in `'age_EYC'`  

The tables are read from separate Excel files and combined column-wise into a single DataFrame holding each element, Hg, and the two age columns.

In [149]:
# Relative paths (assuming starting from erosion_proxies/ folder)
path_data = '../../Data/'

# Load the files
df_age = pd.read_excel(path_data + '210_Pb_dating/Age.xlsx')
df_Hg = pd.read_excel(path_data + 'Hg.xlsx')
df_Xray = pd.read_excel(path_data + 'X_ray.xlsx')

# Select relevant columns
df_age_sel = df_age[['age_EYC']]
df_Hg_sel = df_Hg[['Hg_conc_EYC']]
df_Xray_sel = df_Xray[['CLR_Al', 'CLR_Si', 'CLR_Ti', 'CLR_Zr', 'CLR_Fe', 'CLR_Br', 'Age_X_EYC']]

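# Concatenate columns into a single DataFrame
# pd.concat(axis=1) pairs rows by position only: the age/Hg tables and the
# X-ray scan are sampled at different resolutions, so a row is not one sample
df = pd.concat([df_age_sel.reset_index(drop=True),
                df_Hg_sel.reset_index(drop=True),
                df_Xray_sel.reset_index(drop=True)], axis=1)

# Drop rows with NaNs in relevant columns
# Where the age/Hg tables are shorter than the X-ray scan, the extra rows are
# NaN there, so this also discards the X-ray rows at those positions
relevant_cols = ['age_EYC', 'Hg_conc_EYC', 'CLR_Al', 'CLR_Si', 'CLR_Ti', 'CLR_Zr', 'CLR_Fe', 'CLR_Br']
df = df.dropna(subset=relevant_cols)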

print(df.head())

      age_EYC  Hg_conc_EYC    CLR_Al    CLR_Si    CLR_Ti    CLR_Zr    CLR_Fe  \
0  2023.00000    66.497230 -0.392258  1.469406 -0.671633 -0.895438  2.225043   
1  2020.36297    63.486721 -0.379071  1.469166 -0.717179 -0.982412  2.255241   
2  2017.72594    59.526447 -0.352728  1.504679 -0.786079 -1.031684  2.249497   
3  2014.92409    59.938530 -0.369098  1.504981 -0.719286 -0.945888  2.252495   
4  2011.95743    63.450665 -0.264869  1.543930 -0.760564 -1.137432  2.244747   

     CLR_Br    Age_X_EYC  
0 -1.942426  2023.000000  
1 -1.544312  2022.835186  
2 -1.845394  2022.670371  
3 -2.003616  2022.505557  
4 -1.779186  2022.340742  
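
The printed ages already show that the rows of the combined table are not the same samples: `age_EYC` steps by roughly 2.6 to 3 years per row, while `Age_X_EYC` steps by about 0.16 years. A minimal sketch with made-up numbers (the tables `hg` and `xray` are hypothetical, not the study data) illustrates what positional concatenation followed by `dropna` does when the two tables have different lengths:

```python
import pandas as pd

# Hypothetical coarse table: 3 Hg samples with their ages
hg = pd.DataFrame({"age": [2020.0, 2010.0, 2000.0],
                   "Hg": [60.0, 55.0, 40.0]})

# Hypothetical fine table: 9 scan points spanning the same period
xray = pd.DataFrame({"age_x": [2020.0 - 2.5 * i for i in range(9)],
                     "Ti": [0.1 * i for i in range(9)]})

# Positional concatenation: row i of hg sits next to row i of xray
combined = pd.concat([hg, xray], axis=1)

# Rows 3..8 have NaN in the Hg columns, so dropna removes them
truncated = combined.dropna()

print(len(combined), len(truncated))   # 9 rows before, 3 rows after
print(truncated["age_x"].min())        # oldest scan age kept: 2015.0
```

In this toy case the scan originally reaches back to 2000, but only the points down to 2015 survive, even though the Hg record itself covers the full period.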


## Aggregation Function

This function averages each element over the interval between two consecutive Hg sample ages, aligning the high-resolution scan data with the coarser Hg record.

In [150]:
def aggregate_element_by_Hg_sample(df, element_list, depth_col='Age_X_EYC', hg_col='Hg_conc_EYC', hg_depth_col='age_EYC'):
    """
    Aggregates element values over intervals defined by consecutive Hg sample ages.
    Each interval is [age_i, age_i+1); the mean of each element is computed within it
    and paired with the mean Hg of samples dated within it (in practice the sample
    at the lower bound).

    Parameters:
    - df: pandas DataFrame containing Hg and element data
    - element_list: list of elements (columns) to average
    - depth_col: column name for depth/age of the elements (default 'Age_X_EYC')
    - hg_col: column name for Hg values (default 'Hg_conc_EYC')
    - hg_depth_col: column name for depth/age of Hg values (default 'age_EYC')

    Returns:
    - pd.DataFrame with averaged elements and Hg for each interval
    """
    # Drop rows with missing values in any relevant column
    df_clean = df[[hg_col, hg_depth_col, depth_col] + element_list].dropna()

    # Extract unique Hg depths and sort them
    hg_depths = sorted(df_clean[hg_depth_col].unique())
    aggregated_data = []

    for i in range(len(hg_depths) - 1):
        # Define the interval based on two consecutive Hg sample ages
        depth_min = hg_depths[i]
        depth_max = hg_depths[i + 1]

        # Select rows whose element depth falls within the interval
        mask = (df_clean[depth_col] >= depth_min) & (df_clean[depth_col] < depth_max)
        subset = df_clean[mask]

        if not subset.empty:
            # Compute mean concentrations for each element in the interval
            element_means = subset[element_list].mean()

            # Associate the mean Hg value in the interval
            mean_hg = df_clean[
                (df_clean[hg_depth_col] >= depth_min) & (df_clean[hg_depth_col] < depth_max)
            ][hg_col].mean()

            # Save results
            aggregated_data.append({
                'depth_min': depth_min,
                'depth_max': depth_max,
                hg_col: mean_hg,
                **element_means.to_dict()
            })

    return pd.DataFrame(aggregated_data)
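
Because the function above starts from a joint `dropna`, it can only use scan rows that sit next to a valid Hg row in the combined table. One possible alternative, sketched here under the assumption that the Hg and scan tables are kept as two separate DataFrames, bins the scan ages with `pd.cut` using the Hg sample ages as edges. The function name `aggregate_by_age_bins`, its arguments, and the demo values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def aggregate_by_age_bins(hg, xray, elements, hg_age="age_EYC",
                          hg_col="Hg_conc_EYC", x_age="Age_X_EYC"):
    """Average scan elements in [age_i, age_i+1) bins set by Hg sample ages."""
    # Mean Hg per unique age; groupby returns the ages sorted ascending
    hg_mean = hg[[hg_age, hg_col]].dropna().groupby(hg_age)[hg_col].mean()
    edges = hg_mean.index.to_numpy()
    scan = xray[[x_age] + elements].dropna()

    # right=False gives left-closed bins, matching a (>= min) & (< max) mask;
    # each bin is labeled with its lower edge, the age of its Hg sample
    bins = pd.cut(scan[x_age], bins=edges, right=False, labels=list(edges[:-1]))
    means = scan.groupby(bins, observed=True)[elements].mean()
    means.index = means.index.astype(float)
    means.index.name = hg_age

    # Attach the Hg value measured at each bin's lower edge
    return means.join(hg_mean, how="inner").reset_index()

# Synthetic demo: 3 Hg samples, 4 scan points
hg_demo = pd.DataFrame({"age_EYC": [2000.0, 2010.0, 2020.0],
                        "Hg_conc_EYC": [40.0, 55.0, 60.0]})
xray_demo = pd.DataFrame({"Age_X_EYC": [2001.0, 2004.0, 2012.0, 2018.0],
                          "CLR_Ti": [1.0, 3.0, 5.0, 7.0]})
result = aggregate_by_age_bins(hg_demo, xray_demo, ["CLR_Ti"])
print(result)
```

As in the original function, the youngest Hg sample only serves as an upper edge and does not produce a row of its own.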

## Aggregate Data

Run the aggregation to produce a table in which each element is averaged within the interval between consecutive Hg sample ages.

In [151]:
elements = ['CLR_Al', 'CLR_Si', 'CLR_Ti', 'CLR_Zr', 'CLR_Fe', 'CLR_Br']
df_agg = aggregate_element_by_Hg_sample(df, element_list=elements)

# Check result
print(df_agg)

    depth_min   depth_max  Hg_conc_EYC    CLR_Al    CLR_Si    CLR_Ti  \
0  2014.92409  2017.72594    59.938530 -0.359395  1.500958 -0.736232   
1  2017.72594  2020.36297    59.526447 -0.330759  1.488888 -0.810287   
2  2020.36297  2023.00000    63.486721 -0.331581  1.510642 -0.751655   

     CLR_Zr    CLR_Fe    CLR_Br  
0 -1.173573  2.199302 -1.703041  
1 -0.982284  2.231027 -1.722973  
2 -1.015815  2.238764 -1.794820  


## Correlation Analysis

We will analyze correlations between Hg and elements grouped as:  
- Aluminum and Silicon (Al, Si) – proxies for aluminosilicates  
- Titanium and Zirconium (Ti, Zr) – proxies for erosion  
- Iron and Bromine (Fe, Br) – related to organic matter  

Correlation statistics (r, p-value) will be computed separately for pre-1970 and post-1970 samples.
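
pandas' `corrwith` returns only the coefficient r, so p-values need a separate call. A minimal sketch using `scipy.stats.pearsonr` follows; the helper name `pearson_table`, the `age_col` argument, and the `min_n` guard are assumptions for illustration, and the demo values are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pearson_table(df, elements, hg_col="Hg_conc_EYC", age_col="age_EYC",
                  split_year=1970, min_n=3):
    """Return r, p and n for Hg vs each element, before and after split_year."""
    periods = {f"before_{split_year}": df[df[age_col] < split_year],
               f"after_{split_year}": df[df[age_col] >= split_year]}
    rows = []
    for label, sub in periods.items():
        for el in elements:
            pair = sub[[hg_col, el]].dropna()
            if len(pair) >= min_n:
                r, p = pearsonr(pair[hg_col], pair[el])
            else:
                r, p = np.nan, np.nan  # too few samples for a meaningful test
            rows.append({"period": label, "element": el,
                         "r": r, "p": p, "n": len(pair)})
    return pd.DataFrame(rows)

# Synthetic demo: Ti is perfectly linear in Hg after 1970, only 2 samples before
demo = pd.DataFrame({"age_EYC": [1950, 1960, 1980, 1990, 2000, 2010],
                     "Hg_conc_EYC": [10.0, 12.0, 30.0, 40.0, 50.0, 60.0],
                     "CLR_Ti": [0.5, 0.4, 1.0, 2.0, 3.0, 4.0]})
table = pearson_table(demo, ["CLR_Ti"])
print(table)
```

Reporting n next to r and p makes it visible when one period has too few samples to support a comparison.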

In [152]:
def compute_correlations_by_period(df_agg, element_list=None, split_year=1970, method='pearson'):
    """
    Compute correlations between Hg and selected elements before and after a given year.

    Parameters:
    - df_agg (pd.DataFrame): Aggregated DataFrame; must contain 'age_EYC', 'Hg_conc_EYC', and element columns
    - element_list (list): Element columns to include (default: every column except Hg, age, and interval bounds)
    - split_year (int): Year to split the dataset (e.g., 1970)
    - method (str): Correlation method ('pearson', 'spearman', or 'kendall')

    Returns:
    - pd.DataFrame: Table with correlation coefficients before and after the split year
    """
    if element_list is None:
        # Exclude Hg, the sample age, and the interval bounds added by the aggregation
        non_elements = ['Hg_conc_EYC', 'age_EYC', 'depth_min', 'depth_max']
        element_list = [col for col in df_agg.columns if col not in non_elements]

    # Subset the DataFrame before and after the split year
    df_before = df_agg[df_agg['age_EYC'] < split_year]
    df_after = df_agg[df_agg['age_EYC'] >= split_year]

    # Compute correlations for both periods
    corr_before = df_before[element_list].corrwith(df_before['Hg_conc_EYC'], method=method)
    corr_after = df_after[element_list].corrwith(df_after['Hg_conc_EYC'], method=method)

    # Combine results into a single DataFrame
    correlation_table = pd.DataFrame({
        f'Corr_before_{split_year}': corr_before,
        f'Corr_after_{split_year}': corr_after
    })

    return correlation_table

In [153]:
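# Define list of element columns
elements = ['CLR_Al', 'CLR_Si', 'CLR_Ti', 'CLR_Zr', 'CLR_Fe', 'CLR_Br']

# The aggregated table has no 'age_EYC' column (this raised KeyError: 'age_EYC').
# Each interval's Hg value belongs to the sample dated depth_min, so use that age.
df_agg['age_EYC'] = df_agg['depth_min']

# Compute correlations before and after 1970
correlation_by_period = compute_correlations_by_period(df_agg, element_list=elements)

# Display the result
correlation_by_period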

## Summary

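- Element concentrations were averaged within Hg sampling intervals to align the two datasets.  
- Correlations between Hg and the element proxies are computed separately before and after 1970 to highlight temporal shifts.  
- In the run shown here, the aggregated table holds only three intervals (2014.9 to 2023), all after 1970, so the pre-1970 correlations remain undetermined.  
- Scatter plots colored by sample age, annotated with the correlation coefficients for both periods, visualize these relationships.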
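
A minimal sketch of such a scatter plot, colored by sample age and without a regression line, could look as follows. The helper name `scatter_hg_vs_elements`, the colormap, the axis labels, and the synthetic demo values are assumptions, not the notebook's original figure code.

```python
import matplotlib.pyplot as plt
import pandas as pd

def scatter_hg_vs_elements(df, elements, hg_col="Hg_conc_EYC", age_col="age_EYC"):
    """One panel per element: Hg against the element, points colored by age."""
    fig, axes = plt.subplots(1, len(elements), figsize=(5 * len(elements), 4),
                             squeeze=False)
    for ax, el in zip(axes[0], elements):
        sc = ax.scatter(df[el], df[hg_col], c=df[age_col], cmap="viridis")
        ax.set_xlabel(el)
        ax.set_ylabel("Hg")
    # One shared colorbar for all panels, mapping color to sample age
    fig.colorbar(sc, ax=axes[0].tolist(), label="Age (year)")
    return fig

# Synthetic demo values, not the study data
demo_plot = pd.DataFrame({"age_EYC": [1950, 1975, 2000],
                          "Hg_conc_EYC": [10.0, 30.0, 50.0],
                          "CLR_Ti": [0.5, 1.0, 2.0],
                          "CLR_Zr": [0.2, 0.1, 0.4]})
fig = scatter_hg_vs_elements(demo_plot, ["CLR_Ti", "CLR_Zr"])
```

Correlation coefficients for each period can then be added to a panel with `ax.text` once they have been computed.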