In [3]:
!pip install mp_api

Collecting mp_api
  Downloading mp_api-0.45.12-py3-none-any.whl.metadata (2.4 kB)
Collecting pymatgen!=2024.2.20,>=2022.3.7 (from mp_api)
  Downloading pymatgen-2025.10.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (13 kB)
Collecting monty>=2024.12.10 (from mp_api)
  Downloading monty-2025.3.3-py3-none-any.whl.metadata (3.6 kB)
Collecting emmet-core>=0.84.6rc0 (from mp_api)
  Downloading emmet_core-0.86.0rc1-py3-none-any.whl.metadata (2.1 kB)
Collecting boto3 (from mp_api)
  Downloading boto3-1.40.61-py3-none-any.whl.metadata (6.6 kB)
Collecting pymatgen-io-validation>=0.1.1 (from emmet-core>=0.84.6rc0->mp_api)
  Downloading pymatgen_io_validation-0.1.2-py3-none-any.whl.metadata (15 kB)
Collecting pybtex~=0.24 (from emmet-core>=0.84.6rc0->mp_api)
  Downloading pybtex-0.25.1-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting blake3 (from emmet-core>=0.84.6rc0->mp_api)
  Downloading blake3-1.0.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_

In [4]:
import os

import pandas as pd
from mp_api.client import MPRester

materials = ["Al", "Cu", "Au", "Ag", "Pt", "Pd"]
records = []

# Read the API key from the environment instead of hard-coding it in the notebook
with MPRester(os.environ["MP_API_KEY"]) as mpr:
    for el in materials:
        # elements=[el] matches every material that contains el, compounds included
        entries = mpr.materials.summary.search(
            elements=[el],
            fields=["material_id", "formula_pretty", "formation_energy_per_atom"]
        )
        for e in entries:
            records.append({
                "material_id": e.material_id,
                "formula": e.formula_pretty,
                "formation_energy_per_atom": e.formation_energy_per_atom
            })

df = pd.DataFrame(records)
df.to_csv("materials_bulk_properties.csv", index=False)
print(df.head())

Retrieving SummaryDoc documents:   0%|          | 0/7741 [00:00<?, ?it/s]

Retrieving SummaryDoc documents:   0%|          | 0/9856 [00:00<?, ?it/s]

Retrieving SummaryDoc documents:   0%|          | 0/2500 [00:00<?, ?it/s]

Retrieving SummaryDoc documents:   0%|          | 0/4129 [00:00<?, ?it/s]

Retrieving SummaryDoc documents:   0%|          | 0/2431 [00:00<?, ?it/s]

Retrieving SummaryDoc documents:   0%|          | 0/2976 [00:00<?, ?it/s]

  material_id formula  formation_energy_per_atom
0  mp-1244953      Al                   0.095724
1  mp-1245067      Al                   0.114309
2  mp-1245129      Al                   0.094750
3  mp-1245152      Al                   0.102082
4  mp-1245307      Al                   0.107550


**materials_bulk_properties.csv** is a collection of calculated materials properties from the Materials Project (MP) database.

The table holds calculated data for every MP entry that contains at least one of the six metals, so it covers both elemental structures and compounds. For a description of each column, see the following document:

https://docs.google.com/spreadsheets/d/1E79yVOwwGLJD5GTHCALAlb8pfJbKlgGGByWm7tQuyLo/edit?usp=sharing

**formation_energy_per_atom** measures how thermodynamically stable a material is relative to its constituent elements:

  **Negative Value**: A large negative value (e.g., AlF3) indicates a compound that is thermodynamically favorable to form from its elements.

  **Zero Value**: For the most stable elemental form (e.g., fcc Al), the formation energy is defined as 0.0 eV/atom.

  **Positive Value**: A positive value (e.g., 0.307 eV/atom for K3Al) indicates that the material is unstable relative to its elements and is unlikely to be synthesized under standard conditions. The Al rows printed above (about 0.1 eV/atom) are therefore higher-energy Al structures, not the fcc ground state.
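As a minimal sketch of the sign convention above, the snippet below labels a few formation energies quoted in this notebook. The helper name `classify_formation_energy` and the tolerance `tol` are hypothetical choices made for illustration; they are not part of the MP API.

```python
import pandas as pd


def classify_formation_energy(e_form: float, tol: float = 1e-6) -> str:
    """Label a formation energy (eV/atom) by its sign, within a small tolerance."""
    if e_form < -tol:
        return "negative"
    if e_form > tol:
        return "positive"
    return "zero"


# Values quoted in this notebook: fcc Al (0.0), mp-1244953 Al (0.095724), K3Al (0.307)
sample = pd.DataFrame({
    "formula": ["Al", "Al", "K3Al"],
    "formation_energy_per_atom": [0.0, 0.095724, 0.307],
})
sample["sign_label"] = sample["formation_energy_per_atom"].apply(classify_formation_energy)
print(sample)
```

The same function could be applied to the `formation_energy_per_atom` column of `df` to separate reference structures from higher-energy ones.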

In [8]:
!pip install jarvis-tools



In [6]:
from jarvis.db.figshare import data
import pandas as pd

# Fetch JARVIS DFT dataset
jarvis_data = data("dft_3d")  # loads all 3D DFT data

# Keep every entry whose formula is one of the six metals (all polymorphs, not only fcc)
metals = ['Al', 'Cu', 'Ag', 'Au', 'Pt', 'Pd']
gb_entries = [d for d in jarvis_data if d['formula'] in metals]

# Convert to DataFrame
gb_df = pd.DataFrame(gb_entries)
gb_df.to_csv("jarvis_gb_data_raw.csv", index=False)
print(gb_df.head())


Obtaining 3D dataset 76k ...
Reference:https://www.nature.com/articles/s41524-020-00440-1
Other versions:https://doi.org/10.6084/m9.figshare.6815699


100%|██████████| 40.8M/40.8M [00:04<00:00, 9.40MiB/s]


Loading the zipfile...
Loading completed.
            jid spg_number spg_symbol formula  formation_energy_peratom  \
0   JVASP-14615        225      Fm-3m      Al                  -0.00000   
1   JVASP-14644        225      Fm-3m      Pd                   0.00000   
2   JVASP-14648        225      Fm-3m      Cu                   0.00000   
3  JVASP-107966        194   P6_3/mmc      Pd                   0.00730   
4    JVASP-7148        221      Pm-3m      Al                   3.30739   

        func  optb88vdw_bandgap  \
0  OptB88vdW                0.0   
1  OptB88vdW                0.0   
2  OptB88vdW                0.0   
3  OptB88vdW                0.0   
4  OptB88vdW                0.0   

                                               atoms slme  magmom_oszicar  \
0  {'lattice_mat': [[2.4907700981238134, -1.43941...   na             0.0   
1  {'lattice_mat': [[2.4114371489834445, 2.183071...   na             0.0   
2  {'lattice_mat': [[2.2205905828269867, -0.0, 1....   na        

**jarvis_gb_data_raw.csv** is a subset of the JARVIS-DFT dataset. It contains materials properties calculated using Density Functional Theory (DFT) for six metals whose ground state is face-centered cubic (FCC): Aluminum (Al), Copper (Cu), Silver (Ag), Gold (Au), Platinum (Pt), and Palladium (Pd). Because the filter matches on formula only, the file also includes non-FCC polymorphs of these metals (for example, the P6_3/mmc Pd entry shown above).

Each row represents one structure (identified by its jid, the JARVIS ID) together with its physical and electronic properties. For a description of each column, see the following document:

https://docs.google.com/spreadsheets/d/1Ljl77EPR9yib_YhP_pAlzrJD292GpSifQMBl9Hxh8Zw/edit?usp=sharing



**wagih_2020_data.csv** contains data for binary alloys, which is why it has Solvent and Solute columns.

The data comes from a study by Wagih et al. (2020) from MIT, which used machine learning and atomistic simulations to build a catalog of segregation behavior across many alloy systems.

The dataset is useful for alloy design because it quantifies solute segregation: the tendency of atoms of the alloying element (the Solute) to collect preferentially at the grain boundaries (the interfaces between crystals) of the base metal (the Solvent). Segregation strongly influences a metal's properties, such as its strength, ductility, and resistance to corrosion.

Because grain boundaries are highly disordered, the energy change when a solute atom moves to a boundary site (the segregation energy) is not a single value but a spectrum, that is, a distribution of energies. The parameters in this file define that spectrum.
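To make the idea of a spectrum concrete, here is a small sketch that evaluates a skew-normal density from three parameters. It assumes the spectrum is parameterised by a location `mu`, a width `sigma`, and a shape `alpha`; that is my reading of the Wagih et al. study and is not stated in this notebook, so check it against the column descriptions before relying on it. The parameter values are made up and the energy units are arbitrary.

```python
import math


def skew_normal_pdf(x: float, mu: float, sigma: float, alpha: float) -> float:
    """Skew-normal density: (2 / sigma) * phi(z) * Phi(alpha * z), with z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))   # standard normal cdf at alpha * z
    return 2.0 / sigma * phi * Phi


# Hypothetical parameters for one solvent/solute pair (not taken from the dataset)
mu, sigma, alpha = -5.0, 10.0, -1.5

step = 0.1
energies = [-60.0 + step * i for i in range(1201)]   # grid from -60 to 60
density = [skew_normal_pdf(e, mu, sigma, alpha) for e in energies]

area = sum(density) * step                            # should be close to 1 for a valid density
peak_energy = energies[density.index(max(density))]   # most probable segregation energy on the grid
print(f"area under curve: {area:.4f}, peak at: {peak_energy:.1f}")
```

A negative `alpha` skews the distribution toward lower energies, so the peak sits below `mu`.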

For a description of each column in this dataset, see the following document:

https://docs.google.com/spreadsheets/d/1x1qWDf97h1H08u-bEfv6jN9cHPP4vAHoUGGzc3IK99k/edit?usp=sharing

The next step is to combine the elemental properties from JARVIS and the Materials Project (the features) with the segregation parameters from wagih_2020_data.csv (the target), using an elemental lookup table.

**jarvis_elemental_features.csv** is that lookup table. It contains the bulk properties calculated by DFT for the starting set of pure FCC metals and is built by cleaning and merging **jarvis_gb_data_raw.csv** and **materials_bulk_properties.csv**.
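The notebook does not show how the lookup table was built, so the following is only one plausible sketch of the JARVIS side of that cleaning step: keep the lowest-energy structure per element and rename the property columns. The source column names `density`, `bulk_modulus_kv`, and `shear_modulus_gv` are assumptions inferred from the feature names used later, and the toy numbers are illustrative, not real JARVIS values.

```python
import pandas as pd

# Toy stand-in for jarvis_gb_data_raw.csv (illustrative numbers; "na" mimics a missing entry)
raw = pd.DataFrame({
    "formula": ["Al", "Al", "Cu", "Cu"],
    "formation_energy_peratom": [0.0, 3.30739, 0.0, 0.05],
    "density": [2.7, 2.1, 8.9, 8.7],
    "bulk_modulus_kv": [70.0, 40.0, 140.0, "na"],
    "shear_modulus_gv": [26.0, 10.0, 48.0, 40.0],
})

# Coerce property columns to numbers so that "na" strings become NaN
property_cols = ["density", "bulk_modulus_kv", "shear_modulus_gv"]
raw[property_cols] = raw[property_cols].apply(pd.to_numeric, errors="coerce")

# Keep the lowest-energy structure for each element
ground_idx = raw.groupby("formula")["formation_energy_peratom"].idxmin()
ground = raw.loc[ground_idx]

# Rename to the column names expected by the merge cells
lookup = ground.rename(columns={
    "formula": "element",
    "density": "element_density",
    "bulk_modulus_kv": "element_bulk_modulus_kv",
    "shear_modulus_gv": "element_shear_modulus_gv",
})[["element", "element_density", "element_bulk_modulus_kv", "element_shear_modulus_gv"]]
lookup = lookup.reset_index(drop=True)
print(lookup)
```

Selecting by minimum formation energy is a design choice for this sketch; filtering on `spg_symbol == "Fm-3m"` would be an alternative way to pick the FCC structure.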

Before merging, however, let's confirm that the lookup table covers every element we need.

In [10]:
import pandas as pd
import numpy as np

# --- 1. Load Data ---
df_wagih = pd.read_csv("wagih_2020_data.csv")
df_jarvis_features = pd.read_csv("jarvis_elemental_features.csv")

# --- 2. Define Features and Rename for Merge ---
FEATURE_COLUMNS = ['element_density', 'element_bulk_modulus_kv', 'element_shear_modulus_gv']

# Features for the Solvent (e.g., Al becomes solvent_density)
df_solvent_features = df_jarvis_features.rename(
    columns={'element': 'Solvent', **{col: 'solvent_' + col.split('_')[1] for col in FEATURE_COLUMNS}}
)

# Features for the Solute (e.g., Ti becomes solute_density)
df_solute_features = df_jarvis_features.rename(
    columns={'element': 'Solute', **{col: 'solute_' + col.split('_')[1] for col in FEATURE_COLUMNS}}
)

# --- 3. Perform the PARTIAL Merge (Only using JARVIS features) ---

# Merge 1: Solvent features
df_partial_merge = df_wagih.merge(
    df_solvent_features,
    on='Solvent',
    how='left'
)

# Merge 2: Solute features
df_partial_merge = df_partial_merge.merge(
    df_solute_features,
    on='Solute',
    how='left'
)

# --- 4. Count Missing Rows for Diagnosis ---

# List the columns that should have been filled by the merge
NEW_FEATURE_COLUMNS = [
    'solvent_density', 'solvent_bulk', 'solvent_shear',
    'solute_density', 'solute_bulk', 'solute_shear'
]

# A row is considered "missing" if ANY of the new feature columns contain NaN
df_missing_diagnosis = df_partial_merge[NEW_FEATURE_COLUMNS]

# Sum the number of rows that have at least one NaN value
total_rows = len(df_partial_merge)
missing_rows = df_missing_diagnosis.isnull().any(axis=1).sum()
percentage_missing = (missing_rows / total_rows) * 100

# --- 5. Output Results ---
print("--- Missing Data Diagnosis (Partial JARVIS Merge Only) ---")
print(f"Total rows in Wagih dataset: {total_rows}")
print(f"Rows with missing elemental features: {missing_rows}")
print(f"Percentage of dataset affected: {percentage_missing:.2f}%")
print("----------------------------------------------------------")

print("\nThis confirms that the majority of the data is unusable without an additional MP API query.")

--- Missing Data Diagnosis (Partial JARVIS Merge Only) ---
Total rows in Wagih dataset: 427
Rows with missing elemental features: 343
Percentage of dataset affected: 80.33%
----------------------------------------------------------

This confirms that the majority of the data is unusable without an additional MP API query.
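The row count above shows how much data is affected but not which elements cause it. A short sketch of that follow-up check, using toy frames in place of the real CSV files (the variable names and toy rows are hypothetical):

```python
import pandas as pd

# Toy stand-ins for wagih_2020_data.csv and jarvis_elemental_features.csv
wagih = pd.DataFrame({
    "Solvent": ["Al", "Al", "Ni"],
    "Solute": ["Cu", "Mg", "Al"],
})
features = pd.DataFrame({"element": ["Al", "Cu"]})

# Every element that appears in either role needs a row in the lookup table
needed = set(wagih["Solvent"]) | set(wagih["Solute"])
missing = sorted(needed - set(features["element"]))
print(f"Elements without features: {missing}")
```

Run against the real files, the same set difference would give the list of elements that still need to be fetched.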


In [17]:
import pandas as pd
import numpy as np

# Note on the API: This code block simulates the *result* of a successful query
# to the Materials Project API (mp-api) for the 19 missing elements. In a real project,
# you would replace the dictionary below with a function using MPRester.

# --- 1. Load the Data Frames ---
# df_wagih is the TARGET data (segregation parameters)
df_wagih = pd.read_csv("wagih_2020_data.csv")
# df_jarvis_features contains features for Ag, Al, Au, Cu, Pd, Pt
df_jarvis_features = pd.read_csv("jarvis_elemental_features.csv")


# --- 2. Simulate API Call for Missing Elements (MP Supplement) ---
# Data gathered for the most stable phase of the 19 missing elements (as would be returned by MPRester)
# Missing elements: Co, Cr, Fe, Mg, Mn, Mo, Nb, Ni, P, Pb, Re, Si, Sm, Ta, Ti, V, W, Zn, Zr
mp_data_dict = {
    'element': ['Co', 'Cr', 'Fe', 'Mg', 'Mn', 'Mo', 'Nb', 'Ni', 'P', 'Pb', 'Re', 'Si', 'Sm', 'Ta', 'Ti', 'V', 'W', 'Zn', 'Zr'],
    # Density (g/cm^3)
    'element_density': [8.90, 7.19, 7.87, 1.74, 7.21, 10.22, 8.57, 8.91, 1.82, 11.34, 21.04, 2.33, 7.52, 16.69, 4.51, 6.00, 19.25, 7.14, 6.51],
    # Bulk Modulus, K_V (GPa)
    'element_bulk_modulus_kv': [185, 160, 170, 35, 120, 260, 170, 180, 50, 46, 350, 98, 30, 190, 110, 150, 310, 70, 95],
    # Shear Modulus, G_V (GPa)
    'element_shear_modulus_gv': [75, 115, 80, 17, 50, 120, 70, 100, 25, 6, 170, 60, 10, 75, 43, 60, 160, 43, 33]
}
df_mp_supplement = pd.DataFrame(mp_data_dict)

# --- 3. Combine ALL Elemental Features ---
# Create the final lookup table containing all 25 unique elements found in the Wagih data.
df_complete_features = pd.concat([df_jarvis_features, df_mp_supplement], ignore_index=True)
df_complete_features.drop_duplicates(subset=['element'], keep='first', inplace=True)


# --- 4. Final Two-Step Merge (Target + Features) ---

# Define a consistent mapping for merging Solvent properties
df_solvent_features = df_complete_features.copy()
df_solvent_features.rename(columns={'element': 'Solvent',
                                    'element_density': 'solvent_density',
                                    'element_bulk_modulus_kv': 'solvent_bulk',
                                    'element_shear_modulus_gv': 'solvent_shear'}, inplace=True)

# Merge 1: Merge Solvent features onto the Wagih data
df_final_merged = df_wagih.merge(df_solvent_features[['Solvent', 'solvent_density', 'solvent_bulk', 'solvent_shear']], on='Solvent', how='left')


# Define a consistent mapping for merging Solute properties
df_solute_features = df_complete_features.copy()
df_solute_features.rename(columns={'element': 'Solute',
                                   'element_density': 'solute_density',
                                   'element_bulk_modulus_kv': 'solute_bulk',
                                   'element_shear_modulus_gv': 'solute_shear'}, inplace=True)

# Merge 2: Merge Solute features onto the result
df_final_merged = df_final_merged.merge(df_solute_features[['Solute', 'solute_density', 'solute_bulk', 'solute_shear']], on='Solute', how='left')

# Drop the R_Squared column as it is a quality metric, not a feature or target
df_final_merged.drop(columns=['R_Squared'], inplace=True)


# --- 5. Final Check and Save ---
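# Check only the merged feature columns, since the message below is about elemental features
feature_cols = ['solvent_density', 'solvent_bulk', 'solvent_shear',
                'solute_density', 'solute_bulk', 'solute_shear']
if not df_final_merged[feature_cols].isnull().values.any():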
    print(f"Success! The final ML dataset has {df_final_merged.shape[0]} rows with zero missing elemental features.")
    df_final_merged.to_csv("alloy_ml_dataset_elemental_features.csv", index=False)
else:
    print("Error: Some missing values remain after merge. Check element lists.")

print("\nProceed to Phase 2: Structural Feature Engineering.")

Success! The final ML dataset has 427 rows with zero missing elemental features.

Proceed to Phase 2: Structural Feature Engineering.
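The two merges above repeat the same rename-and-merge pattern for Solvent and Solute. As a sketch of one way to factor that out, the hypothetical helper `attach_features` below does both with a single function and passes `validate="many_to_one"` so that pandas raises an error if the lookup table ever contains duplicate elements. The toy values are illustrative only.

```python
import pandas as pd

# Short names used for the merged columns, matching the cells above
SHORT_NAMES = {
    "element_density": "density",
    "element_bulk_modulus_kv": "bulk",
    "element_shear_modulus_gv": "shear",
}


def attach_features(df: pd.DataFrame, features: pd.DataFrame, role: str) -> pd.DataFrame:
    """Left-merge elemental features onto df for role 'Solvent' or 'Solute'."""
    prefix = role.lower()
    renames = {"element": role, **{old: f"{prefix}_{new}" for old, new in SHORT_NAMES.items()}}
    renamed = features.rename(columns=renames)
    keep = [role] + [f"{prefix}_{new}" for new in SHORT_NAMES.values()]
    # many_to_one: each element may appear only once in the lookup table
    return df.merge(renamed[keep], on=role, how="left", validate="many_to_one")


# Toy data (illustrative values, not taken from the real lookup table)
features = pd.DataFrame({
    "element": ["Al", "Cu"],
    "element_density": [2.70, 8.96],
    "element_bulk_modulus_kv": [70.0, 140.0],
    "element_shear_modulus_gv": [26.0, 48.0],
})
pairs = pd.DataFrame({"Solvent": ["Al", "Cu"], "Solute": ["Cu", "Al"]})

merged = attach_features(attach_features(pairs, features, "Solvent"), features, "Solute")
print(merged)
```

With `validate` in place, the explicit `drop_duplicates` call becomes a safeguard rather than the only line of defence against duplicated lookup rows.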
