# The NIST JARVIS Dataset

---

## Modules

In [1]:
import gzip
import json
import os
from pathlib import Path

import numpy as np
import pandas as pd
import pymatgen as pmg

## Reading the data

### Data sources

The data sources are listed below as key-value pairs. Each key is the name I gave the data file after downloading it, and each value is the FigShare link I downloaded it from.

*   `jdft_3d_2018-07-07.json`: <https://figshare.com/articles/jdft_3d-7-7-2018_json/6815699>
*   `jdft_2d_2018-07-07.json`: <https://figshare.com/articles/jdft_2d-7-7-2018_json/6815705>
*   `jarvisml_cfid.json`: <https://figshare.com/articles/JARVIS-ML-CFID-descriptors_and_material_properties/6870101>

After downloading and renaming the data files, I compressed them using `gzip` and moved them to this repository's `data/` directory. You can reproduce my results if you do the same.

### Data dictionary

Source: https://github.com/usnistgov/jarvis/blob/master/jarvis/db/static/jarvis_ml-train.ipynb

| predictor      | description                                                                                                                                      |
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| desc           | 1557 descriptors                                                                                                                                 |
| form_enp       | formation energy per atom (eV/atom)                                                                                                              |
| op_gap         | OptB88vdW functional based bandgap (eV)                                                                                                          |
| mbj_gap        | TBmBJ functional based bandgap (eV)                                                                                                              |
| kv             | Voigt bulk mod. (GPa)                                                                                                                            |
| gv             | Voigt shear mod. (GPa)                                                                                                                           |
| elastic        | elastic tensor in string format, use the function 'get_et' to convert into pymatgen `ElasticTensor` object                                       |
| epsx           | Static dielectric function value in x-direction based on OptB88vdW (no unit)                                                                     |
| mepsx          | Static dielectric function value in x-direction based on TBmBJ (no unit)                                                                         |
| magmom         | Magnetic moment (Bohr magneton) [from OUTCAR and from OSZICAR], generally OSZICAR value is preferred                                             |
| kp_leng        | Kpoint automatic line density obtained after automatic convergence (Angstrom), subtract 25 because 5 extra points were taken during convergence  |
| encut          | Plane wave cut-off value obtained after automatic convergence                                                                                    |
| exfoliation_en | exfoliation energy                                                                                                                               |
| strt           | final structure after relaxation with OptB88vdW                                                                                                  |
| el/hl_mass     | effective mass with BoltzTrap at 300K for electrons/holes                                                                                        |

### Loading the data

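Define the paths to the data files relative to this Jupyter notebook. The paths below point to uncompressed files whose names differ from the renamed files listed under Data sources, so adjust them to match the files in your `data/` directory.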

In [2]:
datadir = Path("data/")
jdft_2d = datadir / "jdft_2d-7-7-2018.json"
jdft_3d = datadir / "jdft_3d-7-7-2018.json"
jdft_3d_cfid = datadir / "jarvisml_cfid.json"
# os.listdir(datadir)

Read the JSON files on disk and convert them into `pandas` data frames.

In [3]:
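with open(jdft_2d) as json_data:
    # json.load returns a plain list of records, so wrap it in a data frame
    dft_2d_df = pd.DataFrame(json.load(json_data))

In [4]:
with open(jdft_3d) as json_data:
    dft_3d_df = pd.DataFrame(json.load(json_data))

In [5]:
with open(jdft_3d_cfid) as json_data:
    dft_3d_with_desc_df = pd.DataFrame(json.load(json_data))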

In [6]:
# with gzip.open(jdft_2d, "rt") as f:
#     dft_2d_df = pd.DataFrame(json.load(f))

In [7]:
# with gzip.open(jdft_3d, "rt") as f:
#     dft_3d_df = pd.DataFrame(json.load(f))

In [8]:
# with gzip.open(jdft_3d_cfid, "rt") as f:
#     dft_3d_with_desc_df = pd.DataFrame(json.load(f))
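
The plain and gzip-compressed loading cells above differ only in how the file is opened, so they could be folded into one helper. The following is a minimal sketch, not part of the original notebook: `load_json_frame` is a hypothetical name, and it assumes that compressed files carry a `.gz` suffix and that each file holds a JSON list of records.

```python
import gzip
import json
from pathlib import Path

import pandas as pd


def load_json_frame(path):
    """Load a JSON list of records into a data frame.

    Hypothetical helper: files ending in ``.gz`` are opened with gzip,
    anything else is opened as plain text.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        return pd.DataFrame(json.load(f))
```

With this helper, a call such as `load_json_frame(jdft_2d)` would return a data frame whether or not the file on disk was compressed.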

## Data cleaning and reshaping

I rename the columns to be more descriptive and human-readable and then reorder them to match my preferred (and arbitrary) aesthetic.

In [9]:
rename_columns = {
    "eff_mass": "effective_mass",
    "elastic": "elastic_tensor",
    "encut": "vasp_plane_wave_cutoff",
    "epsx": "dielectric_x_direction_optb88vdw",
    "epsy": "dielectric_y_direction_optb88vdw",
    "epsz": "dielectric_z_direction_optb88vdw",
    "fin_en": "final_energy",
    "final_str": "final_structure_parameters",
    "form_enp": "formation_energy_per_atom",
    "gv": "bulk_modulus_shear_gpa",
    "icsd": "icsd_id",
    "incar": "vasp_incar_parameters",
    "initial_str": "initial_structure_parameters",
    "jid": "jarvis_calculation_id",
    "kp_leng": "kpoint_line_density",
    "kpoints": "kpoints_parameters",
    "kv": "bulk_modulus_voigt_gpa",
    "magmom": "magnetic_moment",
    "mbj_gap": "band_gap_mbj",
    "mepsx": "dielectric_x_direction_mbj",
    "mepsy": "dielectric_y_direction_mbj",
    "mepsz": "dielectric_z_direction_mbj",
    "mpid": "materials_project_id",
    "op_gap": "band_gap_optb88vdw",
}
column_order = [
    "jarvis_calculation_id",
    "icsd_id",
    "materials_project_id",
    "initial_structure_parameters",
    "final_structure_parameters",
    "vasp_incar_parameters",
    "vasp_plane_wave_cutoff",
    "kpoints_parameters",
    "kpoint_line_density",
    "final_energy",
    "formation_energy_per_atom",
    "magnetic_moment",
    "band_gap_optb88vdw",
    "band_gap_mbj",
    "elastic_tensor",
    "effective_mass",
    "bulk_modulus_shear_gpa",
    "bulk_modulus_voigt_gpa",
    "dielectric_x_direction_optb88vdw",
    "dielectric_y_direction_optb88vdw",
    "dielectric_z_direction_optb88vdw",
    "dielectric_x_direction_mbj",
    "dielectric_y_direction_mbj",
    "dielectric_z_direction_mbj",
]

Apply the cleaning protocol: rename the columns, reorder them, and replace the `na` placeholder strings with `NaN`.

In [10]:
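# Wrapping in pd.DataFrame guards against dft_3d_df being the plain list that
# json.load returns; the wrapper is harmless if it is already a data frame.
dft_3d_df_clean = pd.DataFrame(dft_3d_df) \
    .rename(index=str, columns=rename_columns) \
    .loc[:, column_order] \
    .replace(to_replace=r"na", value=np.nan, regex=True)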
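
To see what the rename, reorder, and replace chain does, it can help to run it on a tiny frame first. The sketch below uses made-up values; only the short column names come from the data set. It also swaps in two deliberate variations on the cell above, both my own choices rather than the notebook's method: `reindex(columns=...)` instead of `.loc[:, ...]`, because `reindex` fills a missing column with `NaN` where `.loc` raises a `KeyError`, and an exact-match `replace("na", ...)` instead of the regex form.

```python
import numpy as np
import pandas as pd

# Toy records with invented values; "na" marks a missing entry.
toy_rows = pd.DataFrame(
    {
        "jid": ["JVASP-1", "JVASP-2"],
        "form_enp": [-0.42, "na"],
        "op_gap": ["na", 1.1],
    }
)
toy_rename = {
    "jid": "jarvis_calculation_id",
    "form_enp": "formation_energy_per_atom",
    "op_gap": "band_gap_optb88vdw",
    "mbj_gap": "band_gap_mbj",  # deliberately absent from the toy frame
}
toy_order = [
    "jarvis_calculation_id",
    "band_gap_optb88vdw",
    "band_gap_mbj",
    "formation_energy_per_atom",
]

toy_clean = (
    toy_rows.rename(columns=toy_rename)  # unknown keys in the mapping are ignored
    .reindex(columns=toy_order)          # missing columns are created and filled with NaN
    .replace("na", np.nan)               # only cells equal to the string "na" change
)
print(toy_clean)
```

If a missing column should be treated as an error, as it is in the cell above, keep `.loc[:, column_order]` instead of `reindex`.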

In [None]:
dft_3d_df_clean["magnetic_moment"] = pd.DataFrame(dft_3d_df_clean["magnetic_moment"].tolist()) \
    .loc[:, "magmom_osz"] \
    .values

In [None]:
dft_3d_df_clean["icsd_id"] = pd.DataFrame(dft_3d_df_clean["icsd_id"].tolist()) \
    .replace(to_replace=r"[\[\]]", value="", regex=True) \
    .replace(to_replace=r"None", value=np.nan, regex=True) \
    .values
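
The two cells above flatten a column of dictionaries and tidy a column of bracketed strings. The sketch below reproduces both steps on invented rows so the effect is easy to inspect. The record layout is an assumption: the `magmom_out` key and the `"[12345]"` and `"None"` string formats are guesses made for illustration, and only the `magmom_osz` key appears in the notebook itself.

```python
import numpy as np
import pandas as pd

# Invented rows; the dictionary keys and string formats are assumptions.
toy_rows = pd.DataFrame(
    {
        "magnetic_moment": [
            {"magmom_osz": 0.0, "magmom_out": 0.0},
            {"magmom_osz": 2.1, "magmom_out": 2.0},
        ],
        "icsd_id": ["[12345]", "None"],
    }
)

# Expand each dictionary into columns, then keep only the OSZICAR value.
expanded = pd.DataFrame(toy_rows["magnetic_moment"].tolist())
toy_rows["magnetic_moment"] = expanded["magmom_osz"].values

# Strip the square brackets, then turn the "None" placeholder into NaN.
toy_rows["icsd_id"] = (
    toy_rows["icsd_id"]
    .str.replace(r"[\[\]]", "", regex=True)
    .replace("None", np.nan)
)
print(toy_rows)
```

Note that expanding the dictionaries this way assumes every entry in the column is a dictionary; rows that were already replaced by `NaN` would need to be handled separately.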

In [None]:
# dft_3d_df_clean = dft_3d_df \
#     .rename(index=str, columns=rename_columns) \
#     .loc[:, column_order] \
#     .replace(to_replace=r"na", value=np.nan, regex=True)
# dft_3d_df_clean["magnetic_moment"] = pd.DataFrame(dft_3d_df_clean["magnetic_moment"].tolist()) \
#     .loc[:, "magmom_osz"] \
#     .values
# dft_3d_df_clean["icsd_id"] = pd.DataFrame(dft_3d_df_clean["icsd_id"].tolist()) \
#     .replace(to_replace=r"[\[\]]", value="", regex=True) \
#     .replace(to_replace=r"None", value=np.nan, regex=True) \
#     .values

Check the current state of the data frame.

In [None]:
dft_3d_df_clean.head()