# Variational Autoencoder for Materials Discovery

## Workshop: Machine Learning Methods for Materials Science at the Centre for Space Research, NWU
https://mesfind.github.io/matgnn/

### Project 1: Materials Representation Learning using Variational Autoencoders

**Learning Objectives:**
- Understand the principles behind VAEs and how to implement one
- Learn how representation learning applies to materials data
- Implement property prediction from the latent space
- Explore the latent space for materials discovery


**VAE Task:** Learn a compact latent representation of materials that captures:
- Chemical composition patterns
- Formation energy relationships
- Materials similarity metrics

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import random
import os
import json
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
warnings.filterwarnings('ignore')

# Deep learning libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

# Plotting style
plt.style.use('default')
sns.set_theme(style="whitegrid")
sns.set_palette("husl")

print("🧠 Variational Autoencoder for Materials Discovery")
print("🎯 Goal: Learn meaningful representations of materials for discovery")


Using device: cpu
🧠 Variational Autoencoder for Materials Discovery
🎯 Goal: Learn meaningful representations of materials for discovery


In [3]:
# Create output directories
def create_output_directories():
    """Create organized output directories for materials, images, and models."""
    directories = ['materials', 'images', 'models']
    for directory in directories:
        Path(directory).mkdir(exist_ok=True)
    print("📁 Created output directories: materials/, images/, models/")
    
create_output_directories()

📁 Created output directories: materials/, images/, models/


## 1. Introduction to Variational Autoencoders for Materials

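Variational Autoencoders (VAEs) are generative models that learn a low-dimensional, probabilistic representation of their inputs, from which the inputs can be reconstructed and new samples can be drawn.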

### Key Concepts:
- **Encoder**: Maps materials features to latent distribution parameters (μ, σ²)
- **Latent Space**: Low-dimensional probabilistic representation of materials
- **Decoder**: Reconstructs materials features from latent codes
- **Reparameterization Trick**: Enables gradient flow through random sampling
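
The reparameterization trick listed above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the workshop's model: the function name `reparameterize`, the log-variance parameterization, and the toy shapes (4 materials, 8 latent dimensions) are assumptions made for the example.

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness lives only in eps, so gradients can flow back
    to mu and logvar through ordinary arithmetic.
    """
    std = torch.exp(0.5 * logvar)   # sigma recovered from log(sigma^2)
    eps = torch.randn_like(std)     # noise that does not depend on the parameters
    return mu + std * eps

torch.manual_seed(42)

# Toy encoder outputs (assumed shapes): 4 materials, 8 latent dimensions.
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)

z = reparameterize(mu, logvar)
z.sum().backward()  # gradients reach both distribution parameters

print("latent sample shape:", tuple(z.shape))
print("mu receives gradients:", mu.grad is not None)
print("logvar receives gradients:", logvar.grad is not None)
```

Predicting the log-variance rather than σ² directly is a common choice because it keeps the variance positive without any constraint on the encoder output.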

### VAE Loss Function:
- **Reconstruction Loss**: How well can we reconstruct input materials?
- **KL Divergence**: How close is the latent distribution to a prior (N(0,I))?
- **Property Loss**: Can we predict materials properties from latent codes?
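
One common way to combine the three terms above is a weighted sum. The form below is a sketch under stated assumptions: mean squared error for both the reconstruction and the property term, a diagonal Gaussian encoder, and weights $\beta$ and $\lambda$ that are introduced here for illustration. Here $x \in \mathbb{R}^d$ is the feature vector, $\hat{x}$ its reconstruction, $y$ the target property (for example, formation energy per atom), $\hat{y}$ its prediction from the latent code, and $k$ the latent dimension.

$$
\mathcal{L}(x, y) = \underbrace{\frac{1}{d}\lVert x - \hat{x} \rVert^2}_{\text{reconstruction}} + \beta \, \underbrace{D_{\mathrm{KL}}\big(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\big)}_{\text{KL divergence}} + \lambda \, \underbrace{(y - \hat{y})^2}_{\text{property}}
$$

For an encoder that outputs a mean $\mu_j$ and variance $\sigma_j^2$ per latent dimension, the KL term has the closed form

$$
D_{\mathrm{KL}}\big(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\big) = -\frac{1}{2}\sum_{j=1}^{k}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
$$

The KL term is zero when $\mu_j = 0$ and $\sigma_j^2 = 1$ for every $j$, that is, when the encoder output matches the prior exactly.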

### Applications for Materials Science:
- **Materials Discovery**: Sample new materials from latent space
- **Property Prediction**: Predict properties from compact representations
- **Materials Similarity**: Use latent distance as materials similarity
- **Feature Engineering**: Use latent codes as features for downstream tasks
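
The similarity and discovery ideas above can be sketched with NumPy. The latent codes here are random placeholders rather than outputs of a trained encoder, and `nearest_materials` is a hypothetical helper written for this example; Euclidean distance is one possible choice of latent distance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder latent codes (assumption): 5 materials in an 8-dimensional latent space.
latent_codes = rng.normal(size=(5, 8))

def nearest_materials(codes: np.ndarray, query_index: int, k: int = 2) -> list:
    """Return the indices of the k materials closest to the query in latent space."""
    distances = np.linalg.norm(codes - codes[query_index], axis=1)
    order = np.argsort(distances)
    # Drop the query itself, then keep the k closest remaining materials.
    return [int(i) for i in order if i != query_index][:k]

neighbors = nearest_materials(latent_codes, query_index=0, k=2)
print("Materials most similar to material 0:", neighbors)

# Discovery: draw new latent codes from the prior N(0, I).
# A trained decoder would map each row back to a materials feature vector.
z_new = rng.standard_normal(size=(3, 8))
print("Sampled latent codes:", z_new.shape)
```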

In [4]:
data_dir = Path("./data")  # common data directory
processed_file = data_dir / "formation_energy_processed.csv"

if processed_file.exists():
    print("📂 Loading processed formation energy dataset...")
    df = pd.read_csv(processed_file)
    print(f"✅ Loaded {len(df)} materials with processed features")
    print(f"📊 Dataset shape: {df.shape}")
    print(f"🎯 Formation energy range: {df['e_form_per_atom'].min():.3f} to {df['e_form_per_atom'].max():.3f} eV/atom")
else:
    # Fail early: the cells that follow need `df` to be defined.
    raise FileNotFoundError(f"Processed dataset not found: {processed_file}")

📂 Loading processed formation energy dataset...
✅ Loaded 5000 materials with processed features
📊 Dataset shape: (5000, 42)
🎯 Formation energy range: -4.343 to 2.443 eV/atom
