# Data Preparation and Atom Embeddings

This notebook covers:
1. Generating an elemental property dictionary (`atom_init.json`)
2. Creating an atom embedding configuration (`atom_embed_config.json`)
3. Building atom embeddings from the properties and the configuration (`atom_embedding.json`)

**Purpose**: Create a lookup table of numerical embeddings for each element that can be used as node features in graph neural networks.

## Part 1: Generate Elemental Property Dictionary

We extract physical, chemical, and electronic properties for elements Z = 1 to 118 using pymatgen's `Element` class. Any property that cannot be read (the getter raises or returns NaN) is stored as `null`.

In [1]:
# Import required libraries
from pymatgen.core.periodic_table import Element
from collections import defaultdict
import warnings
import math
import json
import numpy as np

### Helper Functions for Property Extraction

In [2]:
# One-hot encode block (s/p/d/f)
def block_one_hot(block_char):
    blocks = ['s', 'p', 'd', 'f']
    encoding = [0] * 4
    if block_char is None:
        return encoding
    b = str(block_char).lower()
    if b in blocks:
        encoding[blocks.index(b)] = 1
    return encoding

In [3]:
# Normalize subshell label from full_electronic_structure
def subshell_label(subshell):
    if hasattr(subshell, "name"):
        return subshell.name.lower()
    if hasattr(subshell, "label"):
        return subshell.label.lower()
    return str(subshell).lower()
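
The label helper matters because the extraction loop sums electron counts per subshell letter over `(n, subshell, electron_count)` tuples. A minimal sketch of that sum, using carbon's ground-state configuration (1s2 2s2 2p2) written out by hand rather than fetched from pymatgen; `count_by_subshell` is a hypothetical name used only here:

```python
# Carbon ground state, 1s2 2s2 2p2, as (n, subshell, electron_count) tuples.
# Hand-written stand-in for the sequence pymatgen would return.
carbon_cfg = [(1, "s", 2), (2, "s", 2), (2, "p", 2)]

def count_by_subshell(cfg):
    """Sum electron counts per subshell letter across ALL occupied shells."""
    totals = {"s": 0, "p": 0, "d": 0, "f": 0}
    for _n, label, count in cfg:
        key = str(label).lower()
        if key in totals:
            totals[key] += count
    return totals

totals = count_by_subshell(carbon_cfg)
print(totals)  # {'s': 4, 'p': 2, 'd': 0, 'f': 0}
```

The counts include core electrons, so carbon yields s = 4 rather than its 2 valence s electrons.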

In [4]:
# Safely fetch a property: suppress UserWarnings, return None on error or NaN
def safe_prop(getter):
    """Call getter() and return its value, or None if it raises or yields NaN."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        try:
            val = getter()
        except Exception:
            return None
    # Convert NaN to None (np.floating covers NumPy scalars such as np.float32)
    if isinstance(val, (float, np.floating)) and math.isnan(val):
        return None
    return val

### Extract Properties for All Elements

In [5]:
# Initialize dictionary to store element properties
elements_dict = defaultdict(dict)

# Iterate over known atomic numbers (1-118)
for Z in range(1, 119):
    try:
        el = Element.from_Z(Z)
    except Exception:
        continue  # skip undefined elements

    # full_electronic_structure: sequence of (n, subshell, electron_count).
    # The sums below run over every occupied subshell, core included, so they are
    # total s/p/d/f electron counts (carbon, 1s2 2s2 2p2, gives s=4, p=2).
    cfg = safe_prop(lambda: el.full_electronic_structure) or []
    s_count = sum(e[2] for e in cfg if subshell_label(e[1]) == 's')
    p_count = sum(e[2] for e in cfg if subshell_label(e[1]) == 'p')
    d_count = sum(e[2] for e in cfg if subshell_label(e[1]) == 'd')
    f_count = sum(e[2] for e in cfg if subshell_label(e[1]) == 'f')

    ionizations = safe_prop(lambda: el.ionization_energies) or []
    first_ie = ionizations[0] if ionizations else None

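    elements_dict[el.symbol] = {
        "Z": el.Z,
        "group": safe_prop(lambda: el.group),
        # NOTE: `period` and `covalent_radius` are null in the recorded carbon output,
        # which suggests these attribute names are not available on Element in the
        # pymatgen version used. `el.row` (periodic-table row) is a likely substitute
        # for `el.period`; verify against your installed version.
        "period": safe_prop(lambda: el.period),
        "atomic_mass": safe_prop(lambda: el.atomic_mass),
        "covalent_radius": safe_prop(lambda: el.covalent_radius),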
        "van_der_waals_radius": safe_prop(lambda: el.van_der_waals_radius),
        "pauling_electronegativity": safe_prop(lambda: el.X),
        "electron_affinity": safe_prop(lambda: el.electron_affinity),
        "first_ionization_energy": first_ie,
        "valence_electrons": {"s": s_count, "p": p_count, "d": d_count, "f": f_count},
        "is_metal": 1 if safe_prop(lambda: el.is_metal) else 0,
        "block": el.block,
        "mendeleev_number": safe_prop(lambda: el.mendeleev_no),
    }

print(f"Extracted properties for {len(elements_dict)} elements")

Extracted properties for 118 elements


### Verify Data - Example for Carbon

In [6]:
# Example: print Carbon properties
if "C" in elements_dict:
    print("Carbon (C) properties:")
    print(json.dumps(elements_dict["C"], indent=2))

Carbon (C) properties:
{
  "Z": 6,
  "group": 14,
  "period": null,
  "atomic_mass": 12.0107,
  "covalent_radius": null,
  "van_der_waals_radius": 1.7,
  "pauling_electronegativity": 2.55,
  "electron_affinity": 1.262122611,
  "first_ionization_energy": 11.260288,
  "valence_electrons": {
    "s": 4,
    "p": 2,
    "d": 0,
    "f": 0
  },
  "is_metal": 0,
  "block": "p",
  "mendeleev_number": 95.0
}
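
The carbon record shows `null` for `period` and `covalent_radius`. Counting missing values per property is a quick way to tell whether a gap is specific to one element or affects the whole table. A minimal sketch on a two-record stand-in for `elements_dict`; the `"Xx"` record and the helper name `missing_counts` are made up for illustration:

```python
def missing_counts(elements):
    """Return {property: number of elements whose value is None}."""
    counts = {}
    for props in elements.values():
        for key, value in props.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts

# Stand-in data: a subset of the carbon record plus a fake element "Xx".
sample = {
    "C":  {"group": 14, "period": None, "covalent_radius": None, "atomic_mass": 12.0107},
    "Xx": {"group": 1,  "period": None, "covalent_radius": None, "atomic_mass": None},
}

for key, n in missing_counts(sample).items():
    print(f"{key}: {n}/{len(sample)} missing")
```

Running the same helper on the real `elements_dict` would show which properties are worth enabling in the configuration, since a property that is missing for every element adds no information to the embedding.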


### Save Elemental Properties to JSON

In [7]:
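# Save the nested property dictionary (None is written as JSON null)
with open("atom_init.json", "w") as f:
    json.dump(elements_dict, f, indent=2)

# "feature length" here is the number of top-level properties per element,
# not the length of the final embedding built in Part 3
print(f"✅ Wrote atom_init.json with feature length = {len(next(iter(elements_dict.values())))} for {len(elements_dict)} elements.")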

✅ Wrote atom_init.json with feature length = 13 for 118 elements.


## Part 2: Create Atom Embedding Configuration

The configuration file maps each elemental property to an encoding (`onehot`, `gaussian`, or `linear`) and carries a `use` flag that controls whether the property contributes to the embedding.

In [8]:
CONFIG_PATH = "atom_embed_config.json"

# Configuration for encoding each feature
# You can tweak this anytime and rebuild the embeddings
config = {
    # --- categorical/ordinal as one-hot ---
    "Z":            {"use": True,  "type": "onehot",  "size": 119},  # 0..118, use Z as index
    "group":        {"use": True,  "type": "onehot",  "size": 19},   # 0..18 (1..18 groups; 0=unknown)
    "period":       {"use": True,  "type": "onehot",  "size": 8},    # 0..7  (1..7 periods; 0=unknown)
    "block":        {"use": True,  "type": "onehot",  "categories": ["s","p","d","f"]},

    # --- continuous as Gaussian RBF (auto-range from data) ---
    "atomic_mass":               {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},
    "covalent_radius":           {"use": True,  "type": "gaussian", "bins": 32, "range": "auto"},
    "van_der_waals_radius":      {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},
    "pauling_electronegativity": {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},
    "electron_affinity":         {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},
    "first_ionization_energy":   {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},
    "mendeleev_number":          {"use": False, "type": "gaussian", "bins": 32, "range": "auto"},

    # --- valence counts as linear scalars (you can switch to onehot later) ---
    "valence_s":   {"use": True,  "type": "linear"},
    "valence_p":   {"use": True,  "type": "linear"},
    "valence_d":   {"use": True,  "type": "linear"},
    "valence_f":   {"use": True,  "type": "linear"},

    # --- binary/indicator ---
    "is_metal":    {"use": True,  "type": "linear"},

    # --- optional behavior knobs ---
    "_behavior": {
        "missing_scalar_fill": 0.0,    # fallback for missing linear values
        "gaussian_gamma": None         # None = auto from bin spacing; or set a float manually
    }
}
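
Because the configuration is edited by hand, a small check before saving can catch a misspelled `type` or a missing `size`. The sketch below is one possible validator; `validate_config` and its rules are assumptions inferred from how the fields are used in this notebook, not part of any library:

```python
def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, spec in cfg.items():
        if field.startswith("_"):  # behavior knobs, not a feature
            continue
        ftype = spec.get("type")
        if ftype == "onehot":
            # one-hot needs either an explicit category list or a size
            if "categories" not in spec and "size" not in spec:
                problems.append(f"{field}: onehot needs 'size' or 'categories'")
        elif ftype == "gaussian":
            if int(spec.get("bins", 32)) < 1:
                problems.append(f"{field}: 'bins' must be >= 1")
        elif ftype != "linear":
            problems.append(f"{field}: unknown type {ftype!r}")
    return problems

# Illustrative inputs: one well-formed config and one with two mistakes
good = {"Z": {"use": True, "type": "onehot", "size": 119},
        "is_metal": {"use": True, "type": "linear"},
        "_behavior": {"missing_scalar_fill": 0.0}}
bad = {"group": {"use": True, "type": "onehot"},
       "atomic_mass": {"use": True, "type": "gausian", "bins": 32}}

print(validate_config(good))  # []
print(validate_config(bad))
```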

In [9]:
# Save configuration file
with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)

print(f"✅ Saved configuration file to '{CONFIG_PATH}'")

✅ Saved configuration file to 'atom_embed_config.json'


## Part 3: Build Atom Embeddings

Each enabled field is encoded according to its configuration entry, in configuration order, and the pieces are concatenated into one fixed-length numeric vector per element.

In [10]:
# File paths
ATOM_INIT_PATH = "atom_init.json"
CONFIG_PATH    = "atom_embed_config.json"
EMB_PATH       = "atom_embedding.json"

### Define Encoding Helper Functions

In [11]:
def safe_float(x, default=None):
    """Convert safely to float, replacing None/NaN with a default value."""
    if x is None:
        return default
    try:
        xf = float(x)
        if math.isnan(xf):
            return default
        return xf
    except Exception:
        return default

def as_int(x, default=None):
    """Convert safely to int via rounding, or return default."""
    xf = safe_float(x, None)
    if xf is None:
        return default
    try:
        return int(round(xf))
    except Exception:
        return default

In [12]:
def one_hot(index, size):
    """Return one-hot vector of given size."""
    v = [0] * size
    if index is not None and 0 <= index < size:
        v[index] = 1
    return v

In [13]:
def gaussian_expand(x, min_v, max_v, bins=32, gamma=None):
    """
    Gaussian (RBF) expansion of a scalar value.

    Places `bins` evenly spaced centers c_i on [min_v, max_v] and returns
    exp(-gamma * (x - c_i)^2) for each center. If gamma is None, it is set to
    1 / (2 * sigma^2) with sigma equal to the center spacing.
    Missing values (None/NaN) map to an all-zero vector.
    """
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return [0.0] * bins
    # Center spacing; non-positive when the range is degenerate (max_v <= min_v)
    spacing = (max_v - min_v) / max(1, bins - 1)
    if max_v <= min_v:
        centers = [min_v] * bins
    else:
        centers = [min_v + i * spacing for i in range(bins)]
    if gamma is None:
        sigma = spacing if spacing > 0 else 1.0
        gamma = 1.0 / (2.0 * sigma * sigma)
    return [math.exp(-gamma * ((x - c) ** 2)) for c in centers]
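
As a worked example of the expansion, take 5 centers on [0, 4] and x = 1. The center spacing is 1, so sigma = 1 and gamma = 0.5, and each output is exp(-0.5 * (x - c)^2). The sketch restates the formula in a compact helper (`rbf`, a name used only here, valid for max_v > min_v and bins > 1) so the block runs on its own:

```python
import math

def rbf(x, min_v, max_v, bins):
    """Compact restatement of the Gaussian expansion with automatic gamma."""
    spacing = (max_v - min_v) / (bins - 1)      # assumes bins > 1, max_v > min_v
    centers = [min_v + i * spacing for i in range(bins)]
    gamma = 1.0 / (2.0 * spacing * spacing)     # sigma = center spacing
    return [math.exp(-gamma * (x - c) ** 2) for c in centers]

values = rbf(1.0, 0.0, 4.0, bins=5)
print([round(v, 4) for v in values])
# [0.6065, 1.0, 0.6065, 0.1353, 0.0111]
```

The response peaks at the center nearest to x and decays smoothly on either side, so nearby property values receive similar vectors, unlike a hard histogram bin.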

### Load Data and Configuration

In [14]:
# Load element properties
with open(ATOM_INIT_PATH, "r") as f:
    ELEMENTS = json.load(f)  # {symbol -> dict}

# Load configuration
with open(CONFIG_PATH, "r") as f:
    CFG = json.load(f)

behavior = CFG.get("_behavior", {})
missing_scalar_fill = behavior.get("missing_scalar_fill", 0.0)
gauss_gamma_override = behavior.get("gaussian_gamma", None)

print(f"Loaded {len(ELEMENTS)} elements and configuration")

Loaded 118 elements and configuration


### Precompute Ranges for Gaussian Features

In [15]:
def gather_values(key_path):
    """
    Collect all valid float values from nested dicts given a key path.
    Example:
      ("atomic_mass",) → d["atomic_mass"]
      ("valence_electrons", "s") → d["valence_electrons"]["s"]
    """
    vals = []
    for sym, d in ELEMENTS.items():
        cur = d
        ok = True
        for k in key_path:
            if isinstance(cur, dict) and (k in cur):
                cur = cur[k]
            else:
                ok = False
                break
        if not ok:
            continue
        v = safe_float(cur, None)
        if v is not None:
            vals.append(v)
    return vals

In [16]:
# Determine min/max for each enabled Gaussian-type field;
# fall back to (0.0, 1.0) when no valid values or no usable range is given
gauss_ranges = {}
for field, spec in CFG.items():
    if field.startswith("_"):
        continue
    if not spec.get("use", True):
        continue
    if spec.get("type") == "gaussian":
        if spec.get("range") == "auto":
            vals = gather_values((field,))
            if vals:
                gauss_ranges[field] = (min(vals), max(vals))
            else:
                gauss_ranges[field] = (0.0, 1.0)
        else:
            rng = spec.get("range")
            if isinstance(rng, (list, tuple)) and len(rng) == 2:
                gauss_ranges[field] = (float(rng[0]), float(rng[1]))
            else:
                gauss_ranges[field] = (0.0, 1.0)

print(f"Computed ranges for {len(gauss_ranges)} Gaussian features")

Computed ranges for 1 Gaussian features


### Core Encoder Function

In [17]:
def embedding_length_template():
    """Compute total embedding length based on config definition."""
    length = 0
    for field, spec in CFG.items():
        if field.startswith("_") or not spec.get("use", True):
            continue
        t = spec["type"]
        if t == "onehot":
            if "categories" in spec:
                length += len(spec["categories"])
            else:
                length += int(spec["size"])
        elif t == "gaussian":
            length += int(spec.get("bins", 32))
        elif t == "linear":
            length += 1
    return length

In [18]:
def encode_symbol(symbol):
    """Encode one element symbol (e.g., 'H') into a numeric feature vector."""
    d = ELEMENTS.get(symbol)
    if d is None:
        return [0.0] * embedding_length_template()

    out = []
    for field, spec in CFG.items():
        if field.startswith("_"):
            continue
        if not spec.get("use", True):
            continue

        ftype = spec["type"]

        # --- ONEHOT ---
        if ftype == "onehot":
            if "categories" in spec:
                cats = spec["categories"]
                raw = d.get(field, None)
                idx = None
                if raw is not None:
                    s = str(raw).lower()
                    if s in cats:
                        idx = cats.index(s)
                out += one_hot(idx, len(cats))
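            else:
                # Integer-valued field used directly as the one-hot index;
                # missing or out-of-range values fall back to slot 0 ("unknown")
                size = int(spec["size"])
                idx = as_int(d.get(field), 0)
                if idx is None or idx < 0 or idx >= size:
                    idx = 0
                out += one_hot(idx, size)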

        # --- GAUSSIAN ---
        elif ftype == "gaussian":
            bins = int(spec.get("bins", 32))
            min_v, max_v = gauss_ranges[field]
            x = safe_float(d.get(field), None)
            out += gaussian_expand(x, min_v, max_v, bins=bins, gamma=gauss_gamma_override)

        # --- LINEAR ---
        elif ftype == "linear":
            if field.startswith("valence_"):
                orb = field.split("_", 1)[1]
                x = safe_float(d.get("valence_electrons", {}).get(orb), missing_scalar_fill)
            elif field == "is_metal":
                x = 1.0 if int(safe_float(d.get("is_metal"), 0)) == 1 else 0.0
            else:
                x = safe_float(d.get(field), missing_scalar_fill)
            out.append(0.0 if x is None else float(x))

        else:
            raise ValueError(f"Unknown encoding type '{ftype}' for field '{field}'")

    return out
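
Since `encode_symbol` concatenates fields in configuration order, the position of each field in the final vector can be recovered from the configuration alone. A sketch of that bookkeeping, using a trimmed copy of the enabled fields from Part 2; `field_width` and `field_slices` are hypothetical helpers:

```python
def field_width(spec):
    """Number of vector entries one field occupies."""
    if spec["type"] == "onehot":
        return len(spec["categories"]) if "categories" in spec else int(spec["size"])
    if spec["type"] == "gaussian":
        return int(spec.get("bins", 32))
    return 1  # linear

def field_slices(cfg):
    """Map each enabled field to its (start, stop) range in the embedding."""
    slices, start = {}, 0
    for field, spec in cfg.items():
        if field.startswith("_") or not spec.get("use", True):
            continue
        stop = start + field_width(spec)
        slices[field] = (start, stop)
        start = stop
    return slices

# Enabled fields only, in the same order as the Part 2 configuration.
cfg = {
    "Z": {"type": "onehot", "size": 119},
    "group": {"type": "onehot", "size": 19},
    "period": {"type": "onehot", "size": 8},
    "block": {"type": "onehot", "categories": ["s", "p", "d", "f"]},
    "covalent_radius": {"type": "gaussian", "bins": 32},
    "valence_s": {"type": "linear"}, "valence_p": {"type": "linear"},
    "valence_d": {"type": "linear"}, "valence_f": {"type": "linear"},
    "is_metal": {"type": "linear"},
}
slices = field_slices(cfg)
print(slices["covalent_radius"])  # (150, 182)
print(slices["is_metal"])         # (186, 187)
```

The final stop index, 187, equals the sum of the enabled field widths (119 + 19 + 8 + 4 + 32 + 4 + 1).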

### Build Embeddings for All Elements

In [19]:
# Build embeddings for all elements
embeddings = {sym: encode_symbol(sym) for sym in ELEMENTS.keys()}
emb_len = embedding_length_template()

print(f"Generated embeddings for {len(embeddings)} elements with length {emb_len}")

Generated embeddings for 118 elements with length 187


### Verify Embedding Consistency

In [20]:
# Sanity check
assert all(len(v) == emb_len for v in embeddings.values()), "Inconsistent embedding length!"

print("✅ All embeddings have consistent length")

✅ All embeddings have consistent length


### Save Embedding Lookup Table

In [21]:
# Save embedding lookup table
with open(EMB_PATH, "w") as f:
    json.dump({"embedding_length": emb_len, "embeddings": embeddings}, f, indent=2)

print(f"✅ Saved '{EMB_PATH}' with {len(embeddings)} symbols; embedding_length = {emb_len}")

✅ Saved 'atom_embedding.json' with 118 symbols; embedding_length = 187


### Example: View Embedding for a Specific Element

In [22]:
# Example: view embedding for Carbon
if "C" in embeddings:
    c_embedding = embeddings["C"]
    print(f"Carbon embedding (first 20 values): {c_embedding[:20]}")
    print(f"Total embedding length: {len(c_embedding)}")

Carbon embedding (first 20 values): [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Total embedding length: 187
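
The first 20 values all belong to the `Z` one-hot segment (indices 0 to 118), and the single 1 at index 6 is carbon's atomic number. A small sketch of reading a one-hot segment back, run on a hand-built vector rather than the saved file; `decode_onehot` is a hypothetical helper:

```python
def decode_onehot(vec, start, stop):
    """Return the index of the hot entry within vec[start:stop], or None."""
    segment = vec[start:stop]
    return segment.index(1) if 1 in segment else None

# Hand-built stand-in for the start of carbon's embedding:
# Z one-hot (119 entries, hot at 6) followed by group one-hot (19 entries, hot at 14).
vec = [0] * (119 + 19)
vec[6] = 1
vec[119 + 14] = 1

print(decode_onehot(vec, 0, 119))    # 6
print(decode_onehot(vec, 119, 138))  # 14
```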


## Summary

We have successfully:
1. ✅ Extracted elemental properties for 118 elements → `atom_init.json`
2. ✅ Created encoding configuration → `atom_embed_config.json`
3. ✅ Generated numeric embeddings → `atom_embedding.json`

These embeddings can now be used as node features in graph neural networks for materials property prediction.
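
As an illustration of that use, a structure's node-feature matrix is the per-atom lookup stacked row by row. A minimal sketch with a toy three-dimensional lookup in place of the real 187-dimensional table; `node_features` is a hypothetical helper, and its zero-row fallback for unknown symbols mirrors the fallback in `encode_symbol`:

```python
def node_features(symbols, lookup, length):
    """Stack one embedding row per atom; unknown symbols get a zero row."""
    return [list(lookup.get(sym, [0.0] * length)) for sym in symbols]

# Toy lookup with the same layout as atom_embedding.json
table = {
    "embedding_length": 3,
    "embeddings": {"H": [1.0, 0.0, 0.0], "O": [0.0, 1.0, 0.0]},
}

# Water: two hydrogens and one oxygen
X = node_features(["H", "H", "O"], table["embeddings"], table["embedding_length"])
print(len(X), len(X[0]))  # 3 3
```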