# Demo 3: Loss Landscape Post-Processing and Feature Extraction

This notebook demonstrates how to **post-process loss landscape data** to extract meaningful features and statistics for machine learning analysis. We'll show both the **manual step-by-step approach** and the **automated script method**.

## **What This Demo Covers**

This notebook teaches you how to:
1. **Load and inspect loss landscape data** from previous demos
2. **Extract chemical and structural features** using matminer
3. **Compute loss landscape metrics** (origin loss, average loss, standard deviation, optimality checks)
4. **Apply statistical transformations** (normalization, standardization)
5. **Save processed data** for machine learning analysis
6. **Use automated post-processing scripts** with configuration files

## **Expected Input Data**

This demo expects loss landscape data generated from **Demo 1** or **Demo 2**:

### **Required Input Files:**
- `computed_loss_landscapes/demo2_automated_landscape/loss_landscapes_df.pkl` - Loss landscape DataFrame
- `computed_loss_landscapes/demo2_automated_landscape/config.yml` - Original experiment configuration
- `demo/demo_JVDFT_dHf_dataset_50.pkl` - Original dataset with material information
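
Before starting, it can help to confirm that these inputs are present. The sketch below is one possible way to do that, assuming the paths listed above are relative to the repository root; `missing_inputs` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# Paths listed under "Required Input Files", relative to the repository root.
REQUIRED_INPUTS = [
    "computed_loss_landscapes/demo2_automated_landscape/loss_landscapes_df.pkl",
    "computed_loss_landscapes/demo2_automated_landscape/config.yml",
    "demo/demo_JVDFT_dHf_dataset_50.pkl",
]


def missing_inputs(root=".", required=REQUIRED_INPUTS):
    """Return the required paths that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_file()]


missing = missing_inputs()
if missing:
    print("Missing inputs (run Demo 1 or Demo 2 first):")
    for rel in missing:
        print(f"  - {rel}")
else:
    print("All required inputs found.")
```

An empty list means every input was found, so the rest of the notebook should be able to load its data.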

## **Expected Generated Files**

This post-processing workflow will create several analysis-ready files:

### **Manual Processing Results:**
- Various intermediate DataFrames with computed features
- Statistical metrics for each loss landscape
- Normalized and standardized data arrays

### **Automated Script Results (`computed_loss_landscapes/demo2_automated_landscape/`):**
- `feat_sample_df.pkl` - Enhanced sample data with additional features
- `feat_sample_composition_df.pkl` - Chemical composition features (optional)
- `feat_sample_structure_df.pkl` - Crystal structure features (optional)  
- `processed_loss_function_dict.pkl` - Complete loss landscape metrics and statistics

## **Key Processing Steps**

1. **Loss Landscape Metrics**: Origin loss, average loss, standard deviation, optimality checks
2. **Chemical Features**: Elemental properties, stoichiometry, valence orbitals (optional)
3. **Structural Features**: Density, symmetry features (optional)
4. **Statistical Transformations**: Log transforms, z-score normalization, min-max scaling

## **Script We'll Use**

- **`post_process_loss_landscapes.py`** - Automated post-processing from config file

---

**Let's get started with the manual approach first!**


# Part 1: Manual Step-by-Step Post-Processing

## 1. Import Libraries and Setup

First, let's import all the necessary libraries for data processing, feature extraction, and analysis.


In [None]:
import sys
import os
os.chdir('..') # Change working directory to the parent directory
import numpy as np
import pandas as pd
import yaml
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Any

# For loss landscape processing
from util.landscape_processing import (
    extract_loss_landscapes,
    apply_function_to_column_and_add,
    loss_at_origin,
    average_loss,
    standard_deviation_of_loss,
    is_original_loss_the_lowest,
    lowest_over_original_loss,
    euclidean_distance_best_to_original,
    log_of_array,
    z_transform_standardize,
    min_max_normalize
)

# For sample featurizing (optional chemical/structural features)
from util.sample_featurizing import get_pymatgen_structures_to_df, add_num_elements_column

print("Libraries imported successfully!")
print("Note: Matminer features are optional and will be shown later")

## 2. Load Loss Landscape Data

Let's load the loss landscape data generated from Demo 2 (or Demo 1) and examine its structure.


In [None]:
# Load loss landscape data from Demo 2
landscape_folder = os.path.join("computed_loss_landscapes", "demo2_automated_landscape")
landscape_pkl_path = os.path.join(landscape_folder, "loss_landscapes_df.pkl")
config_path = os.path.join(landscape_folder, "config.yml")

# Check if files exist
if os.path.exists(landscape_pkl_path):
    print(f"Found loss landscape data: {landscape_pkl_path}")
    loss_landscapes_df = pd.read_pickle(landscape_pkl_path)
    print(f"Loaded DataFrame shape: {loss_landscapes_df.shape}")
    print(f"Columns: {list(loss_landscapes_df.columns)}")
else:
    print(f"Loss landscape file not found: {landscape_pkl_path}")
    print("Please run Demo 2 first to generate loss landscape data.")

# Load original experiment config
if os.path.exists(config_path):
    print(f"\nFound experiment config: {config_path}")
    with open(config_path, 'r') as f:
        expt_config = yaml.safe_load(f)
    print("Configuration:")
    for key, value in expt_config.items():
        print(f"  {key}: {value}")
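else:
    print(f"Config file not found: {config_path}")
    expt_config = {}  # Fall back to an empty config so the lookup below does not raise NameError

# Load original dataset (its path is stored under 'data_path' in the experiment config)
original_data_path = expt_config.get('data_path')
if original_data_path and os.path.exists(original_data_path):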
    print(f"\nLoading original dataset: {original_data_path}")
    sample_df = pd.read_pickle(original_data_path)
    print(f"Original dataset shape: {sample_df.shape}")
    print(f"Original dataset columns: {list(sample_df.columns)}")
else:
    print(f"Original dataset not found: {original_data_path}")


## 3. Examine Loss Landscape Structure

Let's examine the structure of the loss landscape data and understand what we're working with.
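
As the cell in this section suggests, each landscape is stored as a square grid with a trailing singleton axis, and the original model is taken to sit at the grid center. The toy example below illustrates that layout with made-up values; it assumes an odd grid size so that the center is an actual grid point.

```python
import numpy as np

# Toy stand-in for one entry of 'raw_loss_landscapes': a 5x5 grid with a
# trailing singleton axis. The values are illustrative, not real losses.
steps = np.arange(-2, 3)
xx, yy = np.meshgrid(steps, steps, indexing="ij")
toy_landscape = (0.1 + 0.05 * (xx**2 + yy**2))[:, :, np.newaxis]

landscape_2d = toy_landscape[:, :, 0]      # drop the singleton axis
center_idx = landscape_2d.shape[0] // 2    # index of the middle row/column
center_loss = landscape_2d[center_idx, center_idx]

print(toy_landscape.shape)   # (5, 5, 1)
print(landscape_2d.shape)    # (5, 5)
print(center_idx)            # 2
```

In this toy bowl the center is also the minimum, which is the situation the optimality check later in the notebook tests for.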


In [None]:
# Examine the loss landscape data structure
print("Loss Landscape Data Analysis:")
print("=" * 40)

# Look at the first few rows
print("\nFirst few rows:")
print(loss_landscapes_df.head())

# Examine a sample loss landscape array
print(f"\nSample loss landscape analysis:")
sample_landscape = loss_landscapes_df['raw_loss_landscapes'].iloc[0]
print(f"Individual landscape shape: {sample_landscape.shape}")
print(f"Data type: {sample_landscape.dtype}")
print(f"Value range: {sample_landscape.min():.6f} to {sample_landscape.max():.6f}")

# Get the 2D landscape (remove singleton dimension)
landscape_2d = sample_landscape[:, :, 0]
center_idx = landscape_2d.shape[0] // 2
center_loss = landscape_2d[center_idx, center_idx]
min_loss = landscape_2d.min()
max_loss = landscape_2d.max()

print(f"\n2D Landscape Analysis:")
print(f"Grid size: {landscape_2d.shape}")
print(f"Center loss (original model): {center_loss:.6f}")
print(f"Minimum loss in grid: {min_loss:.6f}")
print(f"Maximum loss in grid: {max_loss:.6f}")
print(f"Loss range: {max_loss - min_loss:.6f}")

# Create a simple visualization
plt.figure(figsize=(8, 6))
im = plt.imshow(landscape_2d, cmap='viridis', origin='lower',
                extent=[-center_idx, center_idx, -center_idx, center_idx])
plt.colorbar(label='Loss Value')
plt.title(f'Sample Loss Landscape: {loss_landscapes_df["jid"].iloc[0]}')
plt.xlabel('Max Eigenvector Direction')
plt.ylabel('Min Eigenvector Direction')
plt.plot(0, 0, 'r*', markersize=15, label='Original Model')
plt.legend()
plt.tight_layout()
plt.show()


## 4. Prepare Data for Processing

We need to rename the loss landscape column and extract the landscapes into the expected format for processing.
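
The naming convention can be illustrated without the repository code. The sketch below assumes that the loss function key is whatever precedes the `_loss_landscape_array` suffix in a column name; `loss_function_key` and the run ID `demo2` are hypothetical, and the real `extract_loss_landscapes` may work differently.

```python
SUFFIX = "_loss_landscape_array"


def loss_function_key(column_name):
    """Return the prefix before SUFFIX, or None if the column does not match.

    Hypothetical helper: it only illustrates the assumed naming convention.
    """
    if column_name.endswith(SUFFIX):
        return column_name[: -len(SUFFIX)]
    return None


# Column names as they might look after the rename, with a made-up run ID.
columns = ["jid", "demo2_mse_loss_landscape_array"]
keys = {col: loss_function_key(col) for col in columns}
print(keys)
```

Under this assumption, metric columns added later would share the same prefix, for example `demo2_mse_loss_at_origin`.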


In [None]:
# Prepare the loss landscape data for processing
run_id = expt_config['run_id']
print(f"Run ID: {run_id}")

# Rename the column to include the run_id (this is what the processing functions expect)
loss_landscapes_processed = loss_landscapes_df.rename(
    columns={'raw_loss_landscapes': f'{run_id}_mse_loss_landscape_array'}
)

print(f"Renamed column to: {run_id}_mse_loss_landscape_array")
print(f"Processed DataFrame columns: {list(loss_landscapes_processed.columns)}")

# Extract loss landscapes into the expected dictionary format
print(f"\nExtracting loss landscapes...")
loss_function_dict = extract_loss_landscapes(loss_landscapes_processed)

# Examine the extracted data
for loss_function, df in loss_function_dict.items():
    print(f"\nLoss function: {loss_function}")
    print(f"  DataFrame shape: {df.shape}")
    print(f"  Columns: {list(df.columns)}")
    print(f"  Sample landscape array shape: {df.iloc[0, 1].shape}")


## 5. Compute Loss Landscape Metrics

Now let's compute various metrics from the loss landscapes that characterize their properties.
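
To make the metrics concrete, here is a minimal NumPy sketch of what they could look like for a single grid. It assumes the original model sits at the grid center and that distance is measured in grid steps; `sketch_metrics` is a hypothetical function, and the implementations in `util.landscape_processing` may define these quantities differently.

```python
import numpy as np


def sketch_metrics(landscape):
    """Compute illustrative metrics for one square loss grid (hypothetical helper)."""
    grid = np.squeeze(np.asarray(landscape, dtype=float))  # (N, N, 1) -> (N, N)
    center = (grid.shape[0] // 2, grid.shape[1] // 2)       # assumed origin
    best = np.unravel_index(np.argmin(grid), grid.shape)    # grid index of the minimum
    origin_loss = grid[center]
    return {
        "loss_at_origin": float(origin_loss),
        "average_of_landscape": float(grid.mean()),
        "std_dev_of_landscape": float(grid.std()),
        "is_original_loss_lowest": bool(origin_loss <= grid.min()),
        "lowest_loss_over_original": float(grid.min() / origin_loss),
        "euclidean_distance_best_to_original": float(
            np.hypot(best[0] - center[0], best[1] - center[1])
        ),
    }


# Toy 3x3 grid whose minimum (0.5) lies one diagonal step from the center (1.0).
toy = np.array([[0.5, 2.0, 3.0],
                [2.0, 1.0, 2.0],
                [3.0, 2.0, 4.0]])
metrics_for_toy = sketch_metrics(toy)
print(metrics_for_toy)
```

For this toy grid the original model is not optimal: the ratio of lowest to original loss is 0.5, and the best point is one diagonal grid step away.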


In [None]:
# Compute various loss landscape metrics
print("Computing Loss Landscape Metrics:")
print("=" * 40)

# Work with the first (and only) loss function in our dictionary
loss_function = list(loss_function_dict.keys())[0]
df = loss_function_dict[loss_function].copy()

print(f"Processing loss function: {loss_function}")
print(f"Number of samples: {len(df)}")

# Define the metrics to compute
metrics = [
    (loss_at_origin, "loss_at_origin", "Loss value at the original model (center point)"),
    (average_loss, "average_of_landscape", "Average loss across the entire landscape"),
    (standard_deviation_of_loss, "std_dev_of_landscape", "Standard deviation of loss values"),
    (is_original_loss_the_lowest, "is_original_loss_lowest", "Whether original model is optimal"),
    (lowest_over_original_loss, "lowest_loss_over_original", "Ratio of minimum to original loss"),
    (euclidean_distance_best_to_original, "euclidean_distance_best_to_original", "Distance from original to optimal point"),
    (log_of_array, "log_loss_landscape_array", "Log-transformed loss landscape")
]

print(f"\nComputing {len(metrics)} metrics for each landscape...")

# Apply each metric function
for i, (metric_func, column_name, description) in enumerate(metrics):
    print(f"  {i+1}. Computing {column_name}...")
    df = apply_function_to_column_and_add(
        df, metric_func, "loss_landscape_array", column_name, loss_function
    )

print(f"\nCompleted metric computation!")
print(f"DataFrame now has {len(df.columns)} columns:")
for col in df.columns:
    print(f"  - {col}")

# Display some sample results
print(f"\nSample metric values (first 5 samples):")
metric_columns = [col for col in df.columns if any(metric[1] in col for metric in metrics)]
sample_metrics = df[['jid'] + metric_columns].head()
sample_metrics


## 6. Apply Statistical Transformations

Let's apply normalization and standardization to make the data suitable for machine learning analysis.
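
As a rough illustration, z-score standardization maps values to zero mean and unit standard deviation, while min-max normalization rescales them to the range 0 to 1. The sketch below applies both to a toy log-transformed array. It assumes the statistics are taken over a single array and that the logarithm is base 10; the repository's `z_transform_standardize`, `min_max_normalize`, and `log_of_array` may instead use dataset-wide statistics or a different base.

```python
import numpy as np


def z_score(values):
    """Shift to zero mean and scale to unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def min_max(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())


# Toy losses spanning several orders of magnitude (illustrative values only).
toy_losses = np.array([[0.001, 0.01], [0.1, 1.0]])
log_losses = np.log10(toy_losses)  # the log base is an assumption

standardized = z_score(log_losses)
normalized = min_max(log_losses)
print(standardized)
print(normalized)
```

Taking the log first keeps a few very large loss values from dominating either rescaling.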


In [None]:
# Apply statistical transformations
print("Applying Statistical Transformations:")
print("=" * 40)

# Apply z-score standardization to log-transformed arrays
print("1. Applying z-score standardization to log-transformed landscapes...")
df = z_transform_standardize(df, "log_loss_landscape_array", loss_function)

# Apply min-max normalization to log-transformed arrays  
print("2. Applying min-max normalization to log-transformed landscapes...")
df = min_max_normalize(df, "log_loss_landscape_array", loss_function)

print("Statistical transformations completed!")

# Check what new columns were added
new_columns = [col for col in df.columns if 'z_transform' in col or 'min_max' in col]
print(f"\nNew transformation columns added:")
for col in new_columns:
    print(f"  - {col}")

# Update the loss function dictionary
loss_function_dict[loss_function] = df

print(f"\nFinal DataFrame shape: {df.shape}")
print(f"Total columns: {len(df.columns)}")

# Show a summary of the metric values
print(f"\nMetric Value Summary:")
numeric_columns = [col for col in df.columns if col not in ['jid'] and not any(x in col for x in ['array', 'log_loss'])]
for col in numeric_columns:
    if df[col].dtype in ['float64', 'int64', 'bool']:
        if df[col].dtype == 'bool':
            print(f"  {col}: {df[col].sum()}/{len(df)} samples are True")
        else:
            print(f"  {col}: {df[col].mean():.4f} ± {df[col].std():.4f}")
    else:
        print(f"  {col}: {df[col].dtype}")


## 7. Enhance Sample Data

Let's add some basic features to the original sample data to prepare it for the complete analysis.
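
The implementation of `add_num_elements_column` is not shown in this notebook, so the following is only a sketch of the underlying idea. It assumes a plain chemical formula string is available for each sample; `count_elements` is a hypothetical helper, and the repository function may read a different column or use pymatgen instead.

```python
import re

# An element symbol is one uppercase letter optionally followed by one lowercase letter.
ELEMENT_PATTERN = re.compile(r"[A-Z][a-z]?")


def count_elements(formula):
    """Count distinct element symbols in a simple formula string such as 'Fe2O3'."""
    return len(set(ELEMENT_PATTERN.findall(formula)))


for formula in ["Si", "Fe2O3", "BaTiO3", "Ca(OH)2"]:
    print(formula, count_elements(formula))
```

This prints 1, 2, 3, and 3 for the four formulas, matching the unary, binary, and ternary labels one would expect.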


In [None]:
# Enhance the sample data with additional features
print("Enhancing Sample Data:")
print("=" * 30)

# Add number of elements column (a simple structural feature)
print("Adding number of elements feature...")
sample_df_enhanced = add_num_elements_column(sample_df.copy())

print(f"Enhanced sample DataFrame shape: {sample_df_enhanced.shape}")
print(f"New columns: {list(sample_df_enhanced.columns)}")

# Show some examples
print(f"\nSample data with number of elements:")
print(sample_df_enhanced[['jid', 'formation_energy_peratom', 'num_elements']].head())

# Analyze the number of elements distribution
print(f"\nNumber of elements distribution:")
element_counts = sample_df_enhanced['num_elements'].value_counts().sort_index()
for num_elem, count in element_counts.items():
    print(f"  {num_elem} elements: {count} samples")

print(f"\nBasic enhanced sample data is ready!")
print(f"Note: Chemical composition and crystal structure features require matminer, \nwhich is not shown here but is available in the full script.")


## 8. Visualize Key Metrics

Let's create some visualizations to understand the computed loss landscape metrics.


In [None]:
# Create visualizations of the key metrics
print("Creating Loss Landscape Metric Visualizations:")

# Get the processed DataFrame
df_viz = loss_function_dict[loss_function]

# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# Define metrics to visualize
viz_metrics = [
    (f'{loss_function}_loss_at_origin', 'Loss at Origin'),
    (f'{loss_function}_average_of_landscape', 'Average Loss'),
    (f'{loss_function}_std_dev_of_landscape', 'Loss Standard Deviation'),
    (f'{loss_function}_lowest_loss_over_original', 'Min/Original Loss Ratio'),
    (f'{loss_function}_euclidean_distance_best_to_original', 'Distance to Optimum'),
    (f'{loss_function}_is_original_loss_lowest', 'Original is Optimal (%)'),
]

# Create plots
for i, (metric_col, title) in enumerate(viz_metrics):
    if i < len(axes):
        if 'is_original_loss_lowest' in metric_col:
            # For boolean data, show percentage
            pct_true = df_viz[metric_col].mean() * 100
            axes[i].bar(['False', 'True'], 
                       [100-pct_true, pct_true], 
                       color=['red', 'green'], alpha=0.7)
            axes[i].set_ylabel('Percentage (%)')
            axes[i].set_title(f'{title}\n({pct_true:.1f}% are optimal)')
        else:
            # For numeric data, show histogram
            axes[i].hist(df_viz[metric_col], bins=15, alpha=0.7, color='skyblue', edgecolor='black')
            axes[i].set_xlabel('Value')
            axes[i].set_ylabel('Frequency')
            axes[i].set_title(f'{title}\n(μ={df_viz[metric_col].mean():.4f}, σ={df_viz[metric_col].std():.4f})')
        
        axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Loss Landscape Metrics Distribution', fontsize=16, y=1.02)
plt.show()


# Part 2: Automated Post-Processing with Script

Now let's see how to accomplish the same post-processing using the automated script.

## 9. Create Configuration for Automated Post-Processing

We need to create a configuration file for the automated post-processing script.


In [None]:
# Create configuration for automated post-processing
print("Creating Post-Processing Configuration:")
print("=" * 40)

# Configuration for post-processing
post_process_config = {
    'folder': landscape_folder,  # The folder containing loss landscape results
    'featurize': 'True',  # Set to 'True' to enable chemical/structural features
    'n_jobs': 4,  # Number of parallel jobs for featurization
    'output_format': 'pkl',  # Output format
    'verbose': True  # Enable detailed logging
}

# Save configuration file
config_file_path = os.path.join('demo', 'demo3_post_process_config.yml')

with open(config_file_path, 'w') as f:
    yaml.dump(post_process_config, f, default_flow_style=False)

print(f"\nConfiguration saved to: {config_file_path}")

# Display the generated YAML file
print(f"\nContents of {config_file_path}:")
print("-" * 40)
with open(config_file_path, 'r') as f:
    print(f.read())

print("\nConfiguration file created successfully!")


### Configuration Explanation

**Key Parameters for Post-Processing:**

- **`folder`**: Path to the directory containing loss landscape results (must contain .pkl and .yml files)
- **`featurize`**: String 'True'/'False' - whether to compute chemical/structural features using matminer
- **`n_jobs`**: Number of CPU cores to use for parallel feature computation  
- **`output_format`**: Output file format (currently only 'pkl' supported)
- **`verbose`**: Enable detailed progress logging

**Important Notes:**
- Setting `featurize: 'True'` requires the matminer library and significantly increases computation time
- All results are saved back to the same folder
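
Because `featurize` is stored as a string, it is worth remembering that Python treats any non-empty string as truthy, so `bool('False')` evaluates to `True`. How `post_process_loss_landscapes.py` interprets the value is not shown here; the sketch below, with the hypothetical helper `parse_flag`, is one way such a flag could be read safely.

```python
def parse_flag(value):
    """Interpret a flag given either as a bool or as 'True'/'False' text (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "yes", "1"):
        return True
    if text in ("false", "no", "0"):
        return False
    raise ValueError(f"Cannot interpret {value!r} as a flag")


print(bool("False"))        # True: any non-empty string is truthy
print(parse_flag("False"))  # False
print(parse_flag("True"))   # True
```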


## 10. Execute Automated Post-Processing Script

Now let's run the automated post-processing script with our configuration.

```bash
python post_process_loss_landscapes.py demo/demo3_post_process_config.yml
```


In [None]:
import subprocess
import time

print("Starting automated post-processing...")
print("This may take 1-2 minutes depending on the dataset size and featurization settings.")

start_time = time.time()

try:
    # Run the post-processing script
    # sys.executable runs the script with the same interpreter as this notebook
    cmd = [sys.executable, 'post_process_loss_landscapes.py', config_file_path]
    print(f"\nExecuting: {' '.join(cmd)}")
    
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)  # 10 min timeout
    
    end_time = time.time()
    computation_time = end_time - start_time
    
    if result.returncode == 0:
        print(f"\nPost-processing completed successfully!")
        print(f"Total time: {computation_time:.2f} seconds ({computation_time/60:.2f} minutes)")
        print(f"\nScript output:")
        print(result.stdout)
    else:
        print(f"\nError occurred during post-processing:")
        print(f"Return code: {result.returncode}")
        print(f"Error output: {result.stderr}")
        print(f"Standard output: {result.stdout}")
        
except subprocess.TimeoutExpired:
    print(f"\nScript timed out after 10 minutes")
except Exception as e:
    print(f"\nUnexpected error: {str(e)}")


## 11. Verify Automated Post-Processing Results

Let's examine what files were created by the automated post-processing script.


In [None]:
# Check the automated post-processing results
print("Checking Automated Post-Processing Results:")
print("=" * 50)

# Expected output files from the automated script
expected_files = [
    'feat_sample_df.pkl',
    'feat_sample_composition_df.pkl', 
    'feat_sample_structure_df.pkl',
    'processed_loss_function_dict.pkl'
]

print(f"Checking output directory: {landscape_folder}")

# Check what files were created
created_files = []
for filename in expected_files:
    file_path = os.path.join(landscape_folder, filename)
    if os.path.exists(file_path):
        size = os.path.getsize(file_path)
        created_files.append(filename)
        print(f"  Found: {filename} ({size:,} bytes)")
    else:
        print(f"  Missing: {filename}")

print(f"\nSuccessfully created {len(created_files)}/{len(expected_files)} expected files")


In [None]:
# Load and examine the processed loss function dictionary
import pickle  # Imported up front so every loading block below can use it

processed_dict_path = os.path.join(landscape_folder, 'processed_loss_function_dict.pkl')
if os.path.exists(processed_dict_path):
    print(f"\nLoading processed loss function dictionary...")
    
    with open(processed_dict_path, 'rb') as f:
        automated_loss_dict = pickle.load(f)
    
    print(f"Loss functions processed: {list(automated_loss_dict.keys())}")
    
    for loss_func, processed_df in automated_loss_dict.items():
        print(f"\nLoss function: {loss_func}")
        print(f"  DataFrame shape: {processed_df.shape}")
        print(f"  Columns ({len(processed_df.columns)}):")
        for col in processed_df.columns:
            print(f"    - {col}")
            
        # Compare with our manual results
        if loss_func in loss_function_dict:
            manual_df = loss_function_dict[loss_func]
            print(f"\n  Comparison with manual processing:")
            print(f"    Manual shape: {manual_df.shape}")
            print(f"    Automated shape: {processed_df.shape}")
            print(f"    Columns match: {set(manual_df.columns) == set(processed_df.columns)}")

# Load enhanced sample data
sample_data_path = os.path.join(landscape_folder, 'feat_sample_df.pkl')
if os.path.exists(sample_data_path):
    print(f"\nLoading enhanced sample data...")
    
    with open(sample_data_path, 'rb') as f:
        automated_sample_df = pickle.load(f)
    
    print(f"Enhanced sample data shape: {automated_sample_df.shape}")
    print(f"Columns: {list(automated_sample_df.columns)}")

# Load and examine the feat_sample_composition_df
composition_data_path = os.path.join(landscape_folder, 'feat_sample_composition_df.pkl')
if os.path.exists(composition_data_path):
    print(f"\nLoading sample composition data...")
    
    with open(composition_data_path, 'rb') as f:
        feat_sample_composition_df = pickle.load(f)
    
    print(f"Sample composition data shape: {feat_sample_composition_df.shape}")
    print(f"Columns: {list(feat_sample_composition_df.columns)}")

# Load and examine the feat_sample_structure_df
structure_data_path = os.path.join(landscape_folder, 'feat_sample_structure_df.pkl')
if os.path.exists(structure_data_path):
    print(f"\nLoading sample structure data...")
    
    with open(structure_data_path, 'rb') as f:
        feat_sample_structure_df = pickle.load(f)
    
    print(f"Sample structure data shape: {feat_sample_structure_df.shape}")
    print(f"Columns: {list(feat_sample_structure_df.columns)}")


print(f"\nAutomated post-processing verification completed!")
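
If the "Columns match" check prints `False`, it helps to see which columns differ between the manual and automated results. The sketch below shows one way to report that; `column_differences` is a hypothetical helper and the column names are made up for illustration.

```python
def column_differences(manual_columns, automated_columns):
    """Return the columns unique to each side, sorted for stable output."""
    manual, automated = set(manual_columns), set(automated_columns)
    return sorted(manual - automated), sorted(automated - manual)


# Hypothetical column lists; in the notebook these would be
# manual_df.columns and processed_df.columns.
only_manual, only_automated = column_differences(
    ["jid", "demo2_mse_loss_at_origin", "demo2_mse_extra"],
    ["jid", "demo2_mse_loss_at_origin", "demo2_mse_other"],
)
print("Only in manual:", only_manual)
print("Only in automated:", only_automated)
```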
