# Dataset Documentation: matbench_mp_gap

**Name:** `matbench_mp_gap`  
**Benchmark:** MatBench v0.1  
**Source repository:** Materials Project / MatBench via Matminer

---

## 1. Overview  
The `matbench_mp_gap` dataset is a supervised regression dataset for predicting the **DFT-PBE computed electronic band gap** (in eV) of inorganic materials from their crystal structures.  
Each row corresponds to one material structure (bulk inorganic crystal) represented via a `pymatgen.Structure` object.  

**Task type:** Regression  
**Input:** `structure` (crystal structure)  
**Target:** `band_gap_eV` (original column name `gap pbe`)  
**Size:** ~106,000 entries.  

---

## 2. Data origin & construction  
- Derived from the **Materials Project (MP)** database, filtered and curated by the MatBench authors.  


---

## 3. Column descriptions  
| Column       | Type                     | Description                                                 |
|--------------|--------------------------|-------------------------------------------------------------|
| `structure`  | `pymatgen.Structure`     | Crystal structure of the material (atoms + lattice)        |
| `band_gap_eV`| `float`                  | Electronic band gap in eV computed using PBE DFT           |

Note: After loading via Matminer, the column `gap pbe` is renamed to `band_gap_eV` in this project.
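The load-and-rename step can be sketched as follows. This is a minimal illustration, not the project's actual loader: the helper names `rename_target` and `load_mp_gap` are hypothetical, and the Matminer import is deferred so the rename helper can be used without that dependency installed.

```python
RAW_TARGET = "gap pbe"      # target column name as shipped by Matminer
TARGET = "band_gap_eV"      # project-specific name used throughout this document


def rename_target(df):
    """Return a copy of a pandas DataFrame with the target column renamed."""
    if RAW_TARGET not in df.columns:
        raise KeyError(f"expected column {RAW_TARGET!r}, found {list(df.columns)}")
    return df.rename(columns={RAW_TARGET: TARGET})


def load_mp_gap():
    """Load the dataset via Matminer and rename the target column.

    The first call downloads the data, so it needs network access.
    """
    # Imported lazily because matminer is a heavy optional dependency.
    from matminer.datasets import load_dataset

    return rename_target(load_dataset("matbench_mp_gap"))
```

Raising on a missing column, rather than silently returning the frame unchanged, makes it obvious if an upstream release ever changes the column name.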

---

## 4. Dataset statistics (as processed in this project)  
*Derived from the full dataset imported via Matminer and processed through our pipeline:*

- Number of rows: ~106,000  
- Mean band gap: ≈ 1.21 eV  
- Standard deviation: ≈ 1.59 eV  
- Median band gap: ≈ 0.25 eV  
- Percentage of “metals” (gap ≈ 0): ≈ 46%  

These values were computed after initial load (before any feature drop or splitting) and saved into `artifacts/eda_summary.json`.
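A summary of this kind could be computed along the following lines. This is a sketch using only the standard library; the JSON key names, the `metal_tol` threshold for counting a material as a metal, and the function names are assumptions, not the project's actual schema.

```python
import json
import statistics
from pathlib import Path


def summarize_gaps(gaps, metal_tol=1e-6):
    """Summary statistics for a sequence of band gaps in eV.

    metal_tol is an assumed threshold: gaps at or below it count as metallic.
    """
    gaps = list(gaps)
    n_metals = sum(1 for g in gaps if g <= metal_tol)
    return {
        "n_rows": len(gaps),
        "mean_eV": statistics.fmean(gaps),
        "std_eV": statistics.stdev(gaps),  # sample standard deviation
        "median_eV": statistics.median(gaps),
        "pct_metals": 100.0 * n_metals / len(gaps),
    }


def save_summary(summary, path="artifacts/eda_summary.json"):
    """Write the summary dict as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2))
    return path


# Toy input; in practice pass the full band-gap column.
summary = summarize_gaps([0.0, 0.0, 1.0, 3.0])
```

Because the distribution is strongly skewed toward zero, reporting the median and the metal fraction alongside the mean gives a more faithful picture than the mean alone.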

---

## 5. Usage notes & limitations  
- The band gaps are **DFT-PBE** values, which typically **underestimate experimental band gaps** by ~0.5–1 eV or more, depending on material chemistry. ML models trained on this dataset therefore approximate DFT-level gaps, not necessarily experimental values.  
- Because this is a **structure-based task**, it suits models that either convert crystal structures into feature vectors or operate directly on graph representations of the structures.  
- The dataset contains a **large fraction of “metals”** (gap ≈ 0) and also insulators/semiconductors (nonzero gaps). Care must be taken in splitting to avoid leakage between similar chemical systems.  
- Missing data may appear in derived descriptors (e.g., when featurizers cannot compute some elemental property). In our pipeline we drop high‐missing‐rate features, then impute missing values using median + indicator columns, fitted only on the **training set** to avoid leakage.
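The imputation step in the last bullet can be illustrated with a small standard-library sketch. It assumes features are stored as lists of floats, with `None` or `NaN` marking missing entries, and that columns with no observed training values have already been dropped; the function names are hypothetical. scikit-learn's `SimpleImputer(strategy="median", add_indicator=True)` offers similar behaviour and would be the more usual choice in a real pipeline.

```python
import math
import statistics


def _is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_median_imputer(train_rows):
    """Learn one median per column, using training rows only."""
    n_cols = len(train_rows[0])
    medians = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if not _is_missing(row[j])]
        # Raises StatisticsError if a column is entirely missing;
        # such columns are assumed to have been dropped beforehand.
        medians.append(statistics.median(observed))
    return medians


def apply_median_imputer(rows, medians):
    """Fill missing entries with training medians and append 0/1 indicator columns."""
    out = []
    for row in rows:
        flags = [1.0 if _is_missing(v) else 0.0 for v in row]
        filled = [m if f else v for v, m, f in zip(row, medians, flags)]
        out.append(filled + flags)
    return out


nan = float("nan")
train = [[1.0, nan], [3.0, 4.0], [nan, 8.0]]
medians = fit_median_imputer(train)                   # [2.0, 6.0]
test_filled = apply_median_imputer([[nan, 5.0]], medians)
```

The same `medians` list is reused for validation and test rows, so no statistic from held-out data enters the transformation.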

---

## 6. Suggested train/val/test split strategy (used in this project)  
- Split by **chemical system group** (unique unordered element-sets) to avoid overlapping chemistries between train, val, and test.  
- Stratify based on target (band gap) deciles to preserve distribution across splits.  
- Recommended final split proportions: ~60–70% train, ~15–20% validation, ~15–20% test.
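The grouping part of this strategy can be sketched as below. This is an illustration under stated assumptions, not the project's splitter: the function names are hypothetical, the 70/15/15 fractions are one choice within the recommended ranges, and the decile stratification is omitted for brevity. With pymatgen, the group key could instead be taken from `structure.composition.chemical_system`.

```python
import random
from collections import defaultdict


def chemical_system(elements):
    """Canonical key for an unordered element set, e.g. ["O", "Fe"] -> "Fe-O"."""
    return "-".join(sorted(set(elements)))


def group_split(systems, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign rows to train/val/test so that no chemical system spans two splits.

    systems: one chemical-system key per row.
    Returns a list of split names aligned with the input rows.
    """
    rows_by_system = defaultdict(list)
    for i, key in enumerate(systems):
        rows_by_system[key].append(i)

    keys = sorted(rows_by_system)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    names = ("train", "val", "test")
    targets = [f * len(systems) for f in fractions]
    counts = [0, 0, 0]
    assignment = [None] * len(systems)
    for key in keys:
        # Greedily place the whole group in the split furthest below its target size.
        k = max(range(3), key=lambda s: targets[s] - counts[s])
        for i in rows_by_system[key]:
            assignment[i] = names[k]
        counts[k] += len(rows_by_system[key])
    return assignment


systems = [chemical_system(e) for e in
           [["Fe", "O"], ["O", "Fe"], ["Li"], ["Li"], ["Na", "Cl"], ["Cl", "Na"]]]
split = group_split(systems)
```

Because whole groups are assigned at once, the realized proportions only approximate the targets; on a dataset of this size the deviation should be small, but it is worth checking the split sizes and the band-gap distribution in each split after the fact.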

---

## 7. Citation  
If you use this dataset in your work, please cite:  
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. “Benchmarking materials property prediction methods: the MatBench test set and Automatminer reference algorithm.” *npj Computational Materials* 6, 138 (2020). DOI:10.1038/s41524-020-00406-3.  

---

## 8. Licensing & access  
- The Materials Project data underlying this dataset is made available under the terms of the Materials Genome Initiative / MP data sharing policy.  
- The MatBench benchmark (including this dataset) is open and available via Matminer (MIT licensed).  
- In this project we use the dataset via `matminer.datasets.load_dataset("matbench_mp_gap")`.
