## Data Collection and Preparation Notebook

This notebook collects as much training data as possible from different sources, cleans and processes it, and stores it in a format that the training process can consume directly.

## Notes

#### Links
 - Crystallography Open Database (COD), worth browsing: https://www.crystallography.net/cod/

#### Utilities to verify if a Structure or CIF file is valid:

1. Pymatgen's built-in checks:
Pymatgen provides several methods to check if a Structure object is properly constituted:

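Check for overlapping atoms: The `Structure.is_valid()` method returns `False` when any two sites are closer together than a distance tolerance (the `tol` argument), which catches overlapping atoms. It does not test periodicity; a `Structure` is periodic by construction.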

```python
is_valid = structure.is_valid()
```

2. Symmetry validation:
You can check if the structure has a valid space group and crystal symmetry using SpacegroupAnalyzer:

```python
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

analyzer = SpacegroupAnalyzer(structure)
symmetry_valid = analyzer.get_space_group_symbol() is not None
```

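This confirms that symmetry detection runs on the structure and returns a space group. Note that a structure with no symmetry at all is still reported as `P1`, so in practice the check mainly catches structures for which the analysis fails outright.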

3. Reading the CIF back (for CIF validation):
Once you've written a CIF file, you can read it back using Pymatgen’s CifParser to ensure that the file format is correct and can be parsed:

```python
from pymatgen.io.cif import CifParser

try:
    parser = CifParser("your_structure.cif")
    cif_structure = parser.get_structures()[0]  # newer pymatgen releases prefer parser.parse_structures()
    valid_cif = True
except Exception as e:
    valid_cif = False
    print(f"Invalid CIF: {e}")
```
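
Before handing thousands of downloaded files to a full parser, a cheap text-level pre-screen can reject obviously broken CIFs. The sketch below is an illustration, not part of Pymatgen: `quick_cif_check` and `REQUIRED_CELL_TAGS` are hypothetical names, and the choice of which tags to require is an assumption based on the standard CIF cell and atom-site data names.

```python
import re

# Cell tags that a CIF describing a crystal structure is expected to carry
# (assumed minimum set for this pre-screen).
REQUIRED_CELL_TAGS = (
    "_cell_length_a", "_cell_length_b", "_cell_length_c",
    "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
)


def quick_cif_check(cif_text: str) -> list:
    """Return a list of problems found; an empty list means the pre-screen passed."""
    problems = []
    # Every CIF data block starts with a line beginning with "data_".
    if not re.search(r"^data_\S*", cif_text, flags=re.MULTILINE | re.IGNORECASE):
        problems.append("no data_ block header")
    # Each cell tag must appear at the start of a line, followed by a value.
    for tag in REQUIRED_CELL_TAGS:
        pattern = rf"^\s*{re.escape(tag)}\s+\S+"
        if not re.search(pattern, cif_text, flags=re.MULTILINE):
            problems.append(f"missing {tag}")
    # Atom positions normally live in a loop_ of _atom_site_ tags.
    if "_atom_site_" not in cif_text:
        problems.append("no _atom_site_ loop")
    return problems


sample = (
    "data_Si\n"
    "_cell_length_a 5.43\n_cell_length_b 5.43\n_cell_length_c 5.43\n"
    "_cell_angle_alpha 90\n_cell_angle_beta 90\n_cell_angle_gamma 90\n"
    "loop_\n_atom_site_label\nSi1\n"
)
print(quick_cif_check(sample))
```

A file that passes this screen can still fail in `CifParser`, so the pre-screen complements the read-back test above rather than replacing it.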

4. Visualization:
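Visual inspection is often useful. You can export the structure to a file and open it in a tool like VESTA or Avogadro, or render it inside the notebook with a widget such as nglview (used later in this notebook). Passing a `filename` writes the file to disk; without it, `to()` only returns the contents as a string:

```python
structure.to(filename="structure.vasp", fmt="poscar")  # Export structure to POSCAR format
structure.to(filename="structure.cif", fmt="cif")      # Export structure to CIF format
```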

5. Volume and bond length sanity checks:
You can check if the structure's volume and bond lengths are within reasonable ranges:

Volume check: Ensure that the volume isn't unusually small or large for the system.

```python
volume = structure.volume
```

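Minimum distance between atoms: Ensure that the shortest interatomic distance is reasonable, which guards against overlapping atoms. The diagonal of the distance matrix is always zero (each atom's distance to itself), so it must be excluded before taking the minimum.

```python
import numpy as np
min_dist = structure.distance_matrix[np.triu_indices(len(structure), k=1)].min()  # upper triangle only: skips the zero diagonal
```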
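
The individual checks above can be gathered into one report per structure. The following is a minimal sketch under stated assumptions: `sanity_report` is a hypothetical helper, the thresholds (`min_allowed_dist`, `vol_per_atom_range`) are illustrative values rather than anything defined by Pymatgen, and the function only relies on `len(structure)`, `structure.volume`, `structure.distance_matrix`, and `structure.is_valid()`.

```python
def sanity_report(structure, min_allowed_dist=0.5, vol_per_atom_range=(5.0, 100.0)):
    """Collect basic sanity checks for a pymatgen-style Structure object.

    The default thresholds are illustrative assumptions (in Å and Å^3 per atom).
    """
    n_sites = len(structure)
    dm = structure.distance_matrix
    # Skip the diagonal (i == j): an atom is always 0 Å from itself.
    pair_dists = [dm[i][j] for i in range(n_sites) for j in range(i + 1, n_sites)]
    min_dist = min(pair_dists) if pair_dists else None
    vol_per_atom = structure.volume / n_sites
    low, high = vol_per_atom_range
    return {
        "is_valid": bool(structure.is_valid()),
        "volume_per_atom": vol_per_atom,
        "volume_ok": low <= vol_per_atom <= high,
        "min_distance": min_dist,
        "distance_ok": min_dist is None or min_dist >= min_allowed_dist,
    }


class FakeStructure:
    """Stand-in with the same attributes, so the sketch runs without pymatgen."""
    volume = 40.0
    distance_matrix = [[0.0, 2.35], [2.35, 0.0]]

    def __len__(self):
        return 2

    def is_valid(self):
        return True


report = sanity_report(FakeStructure())
print(report)
```

With a real `Structure`, the call is simply `sanity_report(structure)`; suitable thresholds depend on the chemistry being collected.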

## Globals and Utilities

In [9]:
## Module Installs
# pip install mp_api
# pip install python-dotenv
# pip install requests
# pip install python-slugify
# pip install mysql-connector-python

In [31]:
## Import modules
import os
import io
import re
import json
import lzma
import gzip
import random
import zipfile
import requests
from tqdm import tqdm
import mysql.connector
from slugify import slugify
from mp_api.client import MPRester
from emmet.core.summary import HasProps
from dotenv import load_dotenv

load_dotenv()

True

In [32]:
## Variables
MPKEY = os.getenv('MPKEY')

MYSQL_USER = os.getenv('MYSQL_USER')
MYSQL_PASS = os.getenv('MYSQL_PASS')
MYSQL_HOST = os.getenv('MYSQL_HOST')
MYSQL_DB_NAME = os.getenv('MYSQL_DB_NAME')

ROOT_DIR = os.getcwd()
TEMP_DIR = os.path.join(ROOT_DIR, 'tmp')
DATA_DIR = os.path.join(ROOT_DIR, 'data')

In [34]:
## Create nonexistent directories
for _dir in [TEMP_DIR, DATA_DIR]:
    if not os.path.exists(_dir):
        print(f"Not found dir: {_dir}, creating one...")
        os.makedirs(_dir)

In [4]:
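def pretty_file_size(size_bytes: int) -> str:
    """Format a byte count as a human-readable string using 1024-based units."""
    units = ['TB', 'GB', 'MB', 'KB', 'B']
    # Pair each unit with its power of 1024: TB -> 4, GB -> 3, ..., B -> 0
    for power, unit in zip(range(len(units) - 1, -1, -1), units):
        if size_bytes >= 1024 ** power:
            return f"{size_bytes / 1024 ** power:.2f} {unit}"
    return "0 B"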

In [5]:
def zip_dir(directory_path: str, zip_filename: str = None) -> str:
    """Zip a directory with LZMA compression and return the path of the archive."""

    # Check if the input is a valid directory
    if not os.path.isdir(directory_path):
        raise ValueError(f"'{directory_path}' is not a valid directory.")

    # Create the zip file in the same parent directory as the zipped dir and name it the same as the source dir if no name is given
    directory_path = os.path.normpath(directory_path)  # drop any trailing slash so basename is not empty
    parent_dir_path = os.path.dirname(directory_path)
    source_dir_name = os.path.basename(directory_path)
    
    if zip_filename is None:
        zip_filename = os.path.join(parent_dir_path, f"{slugify(source_dir_name, separator='_')}.zip")

    # Write the files straight into an LZMA-compressed zip archive.
    # LZMA-compressing the bytes of a finished zip would produce an .xz stream that zipfile (and unzip below) cannot open.
    print("Creating the zip file and writing its contents with LZMA compression...")
    with zipfile.ZipFile(zip_filename, 'w', compression=zipfile.ZIP_LZMA) as zip_file:
        for root, dirs, files in os.walk(directory_path):
            for file in files:
                file_path = os.path.join(root, file)
                rel_path = os.path.relpath(file_path, directory_path)
                zip_file.write(file_path, rel_path)

    print(f"Saved the zip file to: {zip_filename}")
    
    return zip_filename
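
A quick round trip is a useful check that an archive is a real zip container and can be read back. The sketch below uses only the standard library; `zip_dir_lzma` is a hypothetical stand-alone variant written for this illustration, and the directory and file names are throwaway examples.

```python
import os
import tempfile
import zipfile


def zip_dir_lzma(directory_path: str, zip_path: str) -> str:
    """Archive a directory into a zip container using LZMA compression."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_LZMA) as zf:
        for root, _dirs, files in os.walk(directory_path):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the zipped directory
                zf.write(full, os.path.relpath(full, directory_path))
    return zip_path


# Round trip on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "nested"))
    with open(os.path.join(src, "nested", "a.txt"), "w") as f:
        f.write("hello")

    archive = zip_dir_lzma(src, os.path.join(tmp, "src.zip"))
    archive_is_zip = zipfile.is_zipfile(archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        content = zf.read(names[0]).decode()

print(archive_is_zip, names, content)
```

Compressing inside the container (rather than compressing the finished zip as a whole) keeps the file readable by any tool that supports LZMA-compressed zip entries.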

In [6]:
def unzip(path_to_zip_file: str, directory_to_extract_to: str = None) -> str:
    """
    Extracts a zip file (or a .gz compressed file) to a specified directory.
    
    If no destination is set, the function will create a new directory with a slugified name in the same path as the input file.
    
    Args:
        path_to_zip_file (str): Path to the input zip (.zip or .gz) file
        directory_to_extract_to (str): Optional path to extract the files to
    
    Returns:
        str: The path where the files were extracted to
    """

    ## Extract to the same directory if a destination is not set:
    if directory_to_extract_to is None:
        new_dir_name = slugify(os.path.splitext(os.path.basename(path_to_zip_file))[0], separator='_')
        directory_to_extract_to = os.path.join(os.path.dirname(path_to_zip_file), new_dir_name)

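    ## Create the destination path if it does not exist
    os.makedirs(directory_to_extract_to, exist_ok=True)
    
    # Check if it's a .gz file and extract accordingly
    if path_to_zip_file.endswith('.gz'):
        # Drop the '.gz' suffix so the output is not mistaken for a compressed file
        dest_path = os.path.join(directory_to_extract_to, os.path.basename(path_to_zip_file)[:-3])
        with gzip.open(path_to_zip_file, 'rb') as gz_ref:
            with open(dest_path, 'wb') as dest:
                # Copy in 1 MiB chunks so large dumps are never held fully in memory
                for chunk in iter(lambda: gz_ref.read(1024 * 1024), b''):
                    dest.write(chunk)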
        
        print(f"Finished decompressing {path_to_zip_file} to: {directory_to_extract_to}")
    else:
        # If it's not a .gz file, assume it's a zip and extract using zipfile
        with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
            zip_ref.extractall(directory_to_extract_to)

        print(f"Finished extracting files to: {directory_to_extract_to}")
    
    return directory_to_extract_to
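
For very large `.gz` files, such as database dumps, it helps to decompress as a stream instead of reading the whole file into memory. The sketch below shows one way to do that with the standard library; `gunzip_file` is a hypothetical helper name, the 1 MiB chunk size is an arbitrary choice, and the file names are made up for the round-trip demo.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_file(gz_path: str, dest_dir: str) -> str:
    """Stream-decompress a .gz file into dest_dir, dropping the .gz suffix."""
    os.makedirs(dest_dir, exist_ok=True)
    base = os.path.basename(gz_path)
    out_name = base[:-3] if base.endswith(".gz") else base + ".out"
    out_path = os.path.join(dest_dir, out_name)
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks
    return out_path


# Round trip on a small throwaway file
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "dump.sql.gz")
    with gzip.open(gz_path, "wb") as f:
        f.write(b"SELECT 1;\n")

    out_path = gunzip_file(gz_path, os.path.join(tmp, "out"))
    out_name = os.path.basename(out_path)
    with open(out_path, "rb") as f:
        restored = f.read()

print(out_name, restored)
```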

In [46]:
def download_file(url: str, file_name: str, destination: str = None, overwrite: bool = False) -> str:

    ## Slugify the file name but keep its extension, so later steps can still detect '.gz' or '.zip'
    stem, ext = os.path.splitext(file_name)
    file_name = slugify(stem, separator='_') + ext
    
    ## Create the full destination path if given
    download_path = os.path.join(destination, file_name) if destination else os.path.join(TEMP_DIR, file_name)

    ## Make sure the destination directory exists
    if destination is not None:
        if not os.path.isdir(destination):
            print(f"Creating download destination directories... [{destination}]")
            os.makedirs(destination)
    
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))
        downloaded_size = 0
        block_size = 8192  # 8 Kilobytes

        ## Skip download if a file of the same size exists and the overwrite flag is not set
        if os.path.exists(download_path) and not overwrite and os.path.getsize(download_path) == total_size:
            print("File of the same size as remote file already exists and overwrite is set to False\nSkipping Download...")
            return download_path
        
        with open(download_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=block_size):
                file.write(chunk)
                downloaded_size += len(chunk)
                if total_size > 0:
                    print(f"\rDownloading: {pretty_file_size(downloaded_size)} of {pretty_file_size(total_size)} ({downloaded_size * 100.0 / total_size:.2f}%)", end='')
                else:
                    print(f"\rDownloading: {pretty_file_size(downloaded_size)}", end='')

    print(f"\nFinished downloading file: {file_name}")
    print(
        f"Destination: {download_path}\n"
        f"File: {file_name}\n"
        f"File Size: {pretty_file_size(total_size)}\n"
        f"Downloaded Size: {pretty_file_size(os.path.getsize(download_path))}\n"
    )

    return download_path
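
The decision to skip a download can be isolated into a small function, which makes it easy to test without touching the network. This is a sketch only: `should_skip_download` is a hypothetical helper, and treating a remote size of `0` as "unknown" is an assumption that mirrors the `content-length` fallback used above.

```python
import os
import tempfile


def should_skip_download(local_path: str, remote_size: int, overwrite: bool = False) -> bool:
    """Return True when an existing local file already matches the remote size."""
    if overwrite or not os.path.exists(local_path):
        return False
    # A remote size of 0 means no Content-Length was sent, so sizes cannot be compared.
    return remote_size > 0 and os.path.getsize(local_path) == remote_size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.bin")
    with open(path, "wb") as f:
        f.write(b"x" * 10)

    same_size = should_skip_download(path, remote_size=10)
    different_size = should_skip_download(path, remote_size=11)
    forced = should_skip_download(path, remote_size=10, overwrite=True)

print(same_size, different_size, forced)
```

A size match is only a weak signal that two files are identical; comparing checksums would be stricter where the server publishes them.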

## 1. Materials Project Data

In this section, the goal is to pull all available data from the Materials Project and store a local copy for further analysis.

In [68]:
with MPRester(MPKEY, monty_decode=False, use_document_model=True) as mpr:
    # Query a few known materials by ID while testing (the stable-materials filter is commented out below)
    docs = mpr.summary.search(
        material_ids=["mp-149", "mp-13", "mp-22526"]
        # is_stable=True,
        # _limit=10,
        # fields=[
        #     "material_id", 'builder_meta', 'nsites', 
        #     'elements', 'nelements', 'composition', 
        #     'composition_reduced', 'formula_pretty', 
        #     'formula_anonymous', 'chemsys', 'volume', 
        #     'density', 'density_atomic', 'symmetry', 
        #     'property_name', 'deprecated', 'deprecation_reasons', 
        #     'last_updated', 'origins', 'warnings', 'structure', 
        #     'task_ids', 'uncorrected_energy_per_atom', 'energy_per_atom', 
        #     'formation_energy_per_atom', 'energy_above_hull', 'is_stable', 
        #     'equilibrium_reaction_energy_per_atom', 'decomposes_to', 
        #     'xas', 'grain_boundaries', 'band_gap', 'cbm', 'vbm', 'efermi', 
        #     'is_gap_direct', 'is_metal', 'es_source_calc_id', 'bandstructure', 
        #     'dos', 'dos_energy_up', 'dos_energy_down', 'is_magnetic', 
        #     'ordering', 'total_magnetization', 'total_magnetization_normalized_vol', 
        #     'total_magnetization_normalized_formula_units', 'num_magnetic_sites', 
        #     'num_unique_magnetic_sites', 'types_of_magnetic_species', 'bulk_modulus', 
        #     'shear_modulus', 'universal_anisotropy', 'homogeneous_poisson', 
        #     'e_total', 'e_ionic', 'e_electronic', 'n', 'e_ij_max', 
        #     'weighted_surface_energy_EV_PER_ANG2', 'weighted_surface_energy', 
        #     'weighted_work_function', 'surface_anisotropy', 'shape_factor', 
        #     'has_reconstructed', 'possible_species', 'has_props', 'theoretical', 'database_IDs'
        # ]
    )


In [61]:
len(docs)

3

In [62]:
type(docs[0])

dict

In [69]:
rand_material = docs[random.randint(0, len(docs)-1)]
rand_material

[4m[1mMPDataDoc<SummaryDoc>[0;0m[0;0m(
[1mbuilder_meta[0;0m=EmmetMeta(emmet_version='0.72.20', pymatgen_version='2023.11.12', run_id=None, database_version='2023.11.1', build_date=datetime.datetime(2023, 11, 22, 19, 46, 57, 168000), license='BY-C'),
[1mnsites[0;0m=2,
[1melements[0;0m=[Element Si],
[1mnelements[0;0m=1,
[1mcomposition[0;0m=Composition('Si2'),
[1mcomposition_reduced[0;0m=Composition('Si1'),
[1mformula_pretty[0;0m='Si',
[1mformula_anonymous[0;0m='A',
[1mchemsys[0;0m='Si',
[1mvolume[0;0m=40.32952684741405,
[1mdensity[0;0m=2.312800253345134,
[1mdensity_atomic[0;0m=20.164763423707026,
[1msymmetry[0;0m=SymmetryData(crystal_system=<CrystalSystem.cubic: 'Cubic'>, symbol='Fd-3m', number=227, point_group='m-3m', symprec=0.1, version='2.0.2'),
[1mproperty_name[0;0m='summary',
[1mmaterial_id[0;0m=MPID(mp-149),
[1mdeprecated[0;0m=False,
[1mdeprecation_reasons[0;0m=None,
[1mlast_updated[0;0m=datetime.datetime(2023, 11, 22, 19, 46, 57, 169000),


In [67]:
if isinstance(rand_material, dict):
    x = json.dumps(
        rand_material,
        indent=2
    )
    
    # print(x)
    print(rand_material.keys())
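
Since the goal of this section is a local copy of the data, one possible way to persist the results is a gzip-compressed JSON Lines file. The sketch below assumes the documents are plain dictionaries (as `type(docs[0])` reported above); `save_docs_jsonl` and `load_docs_jsonl` are hypothetical helper names, and the sample records are trimmed stand-ins rather than full API output.

```python
import gzip
import json
import os
import tempfile
from datetime import datetime


def save_docs_jsonl(docs, path: str) -> int:
    """Write an iterable of dict documents to a gzip-compressed JSON Lines file."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in docs:
            # default=str is a fallback for values json cannot encode (e.g. datetimes)
            f.write(json.dumps(doc, default=str) + "\n")
            count += 1
    return count


def load_docs_jsonl(path: str) -> list:
    """Read the documents back, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Trimmed stand-in records (not full API output)
sample_docs = [
    {"material_id": "mp-149", "formula_pretty": "Si", "nsites": 2,
     "last_updated": datetime(2023, 11, 22, 19, 46, 57)},
    {"material_id": "mp-13"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_file = os.path.join(tmp, "mp_summary.jsonl.gz")
    written = save_docs_jsonl(sample_docs, out_file)
    loaded = load_docs_jsonl(out_file)

print(written, loaded[0]["material_id"])
```

One line per document keeps the file appendable and lets later steps stream through it without loading everything at once. Values converted by `default=str` come back as strings, so they need explicit parsing if their original types matter.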



In [50]:
rand_material.formula_pretty

'LiCoO2'

In [51]:
from pymatgen.core import Structure
from pymatgen.io.cif import CifWriter

# Assuming you have a Structure object called 'structure'
structure = rand_material.structure

# Create a CifWriter object
cif_writer = CifWriter(structure)

# Write the CIF file
# cif_writer.write_file(f"my_cif_file.cif")

In [52]:
import nglview as nv
from pymatgen.core import Structure

# pick a random material
material = rand_material

# Load the structure from a CIF file
# structure = Structure.from_file('/home/alen/projects/Inverse-Design-of-Materials-with-AI/my_cif_file.cif')

# Load the structure directly from the MP results
structure = rand_material.structure

# Create a visualization widget for the structure
print(rand_material.formula_pretty)
view = nv.show_pymatgen(structure)

# Display the structure in the notebook
view


LiCoO2


NGLWidget()

In [53]:
import nglview as nv
from pymatgen.core import Structure
from pymatgen.io.ase import AseAtomsAdaptor

# Load your structure (replace with your actual structure loading code)
# structure = Structure.from_file("your_structure_file.cif")

# Convert the structure to ASE atoms
atoms = AseAtomsAdaptor.get_atoms(structure)

# Create the nglview widget
view = nv.show_ase(atoms)

# Customize the view (optional)
view.add_unitcell()
view.center()

# Display the widget
display(view)

NGLWidget()

## 2. Open Quantum Materials Database

#### OQMD API

OQMD provides a simple-to-use API ([docs](https://static.oqmd.org/static/docs/restful.html#querying)) that can be used to query the data hosted there.

Querying [examples](https://static.oqmd.org/static/docs/restful.html#more-example-queries)

#### OQMD Python Module

OQMD also provides a Python SDK for interacting with its data, similar to the Materials Project client.

The documentation can be found on [GitHub](https://github.com/mohanliu/qmpy_rester) or [PyPI](https://pypi.org/project/qmpy-rester/).

 - Installation:
`pip install qmpy-rester`

#### OQMD SQL DATA DUMP

OQMD provides its whole database as a MySQL dump.

Instructions are available [here](https://static.oqmd.org/static/docs/getting_started.html#setting-up-the-database)

 - Instructions from OQMD:

The MySQL data folder (e.g. "/var/lib/mysql" for the system-MySQL on GNU/Linux systems) may occupy around 100GB of additional disk space when the OQMD database is imported

For a better user convenience, the latest version of the database is also available for direct download at http://oqmd.org/static/downloads/qmdb.sql.gz

Once you have the database file, you need to unzip it and load it into a database MySQL. On a typical linux installation this process will look like:

``` bash
$ wget http://oqmd.org/static/downloads/qmdb.sql.gz
$ gunzip qmdb.sql.gz
$ mysql < qmdb.sql
```

*Note*
Assuming your install is on Linux and you haven’t used MySQL at all, you will need to enter a MySQL session as root (`mysql -u root -p`), create a user within MySQL (`CREATE USER 'newuser'@'localhost';`), and grant that user permissions (`GRANT ALL PRIVILEGES ON * . * TO 'newuser'@'localhost'; FLUSH PRIVILEGES;`).

The name of the deployed database has changed since previous releases (`qmdb_prod`).

To verify that the database is properly installed and has appropriate permissions, run:

``` sql
mysql> select count(*) from entries;
+----------+
| count(*) |
+----------+
|   815654 |
+----------+
```

The number may not match what is shown above, but as long as you don’t receive any errors, your database should be working properly.

#### We will go the API route.

In [42]:
## Get the base OQMD API url and Setup any other variables
base_oqmd_url = "https://oqmd.org/oqmdapi/formationenergy?limit=10&offset=0&sort_offset=0&noduplicate=False&desc=False"

In [43]:
response = requests.get(base_oqmd_url)

In [45]:
sub = response.json()
# sub.pop('data')
print(json.dumps(sub, indent=2))

{
  "links": {
    "next": "http://oqmd.org/oqmdapi/formationenergy?desc=False&limit=10&noduplicate=False&offset=10&sort_offset=0",
    "previous": null,
    "base_url": {
      "href": "https://oqmd.org/oqmdapi",
      "meta": {
        "_oqmd_version": "1.0"
      }
    }
  },
  "resource": {},
  "data": [
    {
      "name": "Lu",
      "entry_id": 1216058,
      "calculation_id": 2454,
      "icsd_id": null,
      "formationenergy_id": 4061142,
      "duplicate_entry_id": null,
      "composition": "Lu1",
      "composition_generic": "A",
      "prototype": "C19_alpha_Sm",
      "spacegroup": "R-3m",
      "volume": 86.3513,
      "ntypes": 1,
      "natoms": 3,
      "unit_cell": [
        [
          -1.733802,
          3.003035,
          0.0
        ],
        [
          3.467606,
          0.0,
          0.0
        ],
        [
          1.733802,
          -1.001012,
          -8.292372
        ]
      ],
      "sites": [
        "Lu @ 0 0 0",
        "Lu @ 0.778 0.222 0.3

In [None]:
## Loop over all the 
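
One way to loop over every page of results is to follow the `links.next` URL in each response until it is `null`, as the JSON output above suggests. The sketch below is an assumption-laden illustration: `fetch_json` and `iter_oqmd_entries` are hypothetical helpers, the fetcher is injectable so the loop can be exercised offline, and the fake pages only mimic the `links` and `data` keys seen in the real response.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_oqmd_entries(start_url: str, fetch=fetch_json, max_pages=None):
    """Yield every item in `data`, following `links.next` until it is null."""
    url, pages = start_url, 0
    while url and (max_pages is None or pages < max_pages):
        page = fetch(url)
        yield from page.get("data", [])
        url = page.get("links", {}).get("next")
        pages += 1


# Offline demo with fake pages shaped like the response above
fake_pages = {
    "page1": {"links": {"next": "page2"}, "data": [{"entry_id": 1}, {"entry_id": 2}]},
    "page2": {"links": {"next": None}, "data": [{"entry_id": 3}]},
}
entry_ids = [e["entry_id"] for e in iter_oqmd_entries("page1", fetch=fake_pages.__getitem__)]
print(entry_ids)
```

Against the live API, the call would be `iter_oqmd_entries(base_oqmd_url)`, ideally with a larger `limit` in the URL and a `max_pages` cap while testing.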

## 3. AFLOW

In [51]:
from aflow import search
import time

# Search for all materials
results = search()

start_time = time.time()
total_materials = len(results)
print(f"Total materials found: {total_materials}")

for i, material in enumerate(results, 1):
    print(f"Processing material {i} of {total_materials}: {material.auid}")
    print(f"  Composition: {material.composition}")
    print(f"  Space Group: {material.spacegroup_relax}")
    
    # Retrieve additional properties safely
    properties = [
        ('Energy', 'energy_atom', 'eV/atom'),
        ('Volume', 'volume_cell', 'Å³'),
        ('Density', 'density', 'g/cm³'),
        ('Bulk modulus', 'ael_bulk_modulus_vrh', 'GPa'),
        ('Shear modulus', 'ael_shear_modulus_vrh', 'GPa')
    ]
    
    for prop_name, prop_key, unit in properties:
        value = getattr(material, prop_key, None)
        if value is not None:
            print(f"  {prop_name}: {value} {unit}")
    
    print()
    
    # Print progress every 100 materials
    if i % 100 == 0:
        elapsed_time = time.time() - start_time
        avg_time_per_material = elapsed_time / i
        estimated_total_time = avg_time_per_material * total_materials
        print(f"Processed {i} of {total_materials} materials.")
        print(f"Estimated total time: {estimated_total_time/60:.2f} minutes")
        print(f"Elapsed time: {elapsed_time/60:.2f} minutes")
        print(f"Estimated time remaining: {(estimated_total_time - elapsed_time)/60:.2f} minutes")
        print()


ERROR: http://aflowlib.duke.edu/search/API/?,paging(1,100)

Lux Fail: Expected token named DATUM and type STR, found BINAL instead.


TypeError: 'NoneType' object cannot be interpreted as an integer