## Data Collection and Preparation Notebook

The goal of this notebook is to collect as much training data as possible from different sources, clean and process it, and store it in a format that is easy to use during training.

## Notes

#### Links
 - Visit https://www.crystallography.net/cod/ for perusal

#### Utilities to verify if a Structure or CIF file is valid:

1. Pymatgen's built-in checks:
Pymatgen provides several methods to check if a Structure object is properly constituted:

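Check for overlapping atoms: The Structure.is_valid() method returns True when no two sites are closer than a distance tolerance (0.5 Å by default), taking periodic boundary conditions into account. It does not test anything else about the structure.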

```python
is_valid = structure.is_valid()
```

2. Symmetry validation:
You can check if the structure has a valid space group and crystal symmetry using SpacegroupAnalyzer:

```python
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

analyzer = SpacegroupAnalyzer(structure)
symmetry_valid = analyzer.get_space_group_symbol() is not None
```

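This confirms that symmetry analysis completes and returns a space group. Note that a space group is found for any structure the analyzer can process (P1 in the least symmetric case), so a non-empty symbol does not by itself show that the symmetry is physically sensible.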

3. Reading the CIF back (for CIF validation):
Once you've written a CIF file, you can read it back using Pymatgen’s CifParser to ensure that the file format is correct and can be parsed:

```python
from pymatgen.io.cif import CifParser

try:
    parser = CifParser("your_structure.cif")
    cif_structure = parser.get_structures()[0]
    valid_cif = True
except Exception as e:
    valid_cif = False
    print(f"Invalid CIF: {e}")
```

4. Visualization:
Visual inspection is often useful. You can export the structure to a format that a viewer such as VESTA or Avogadro can open, then inspect the rendering there:

```python
structure.to(fmt="poscar")  # Export structure to POSCAR format
structure.to(fmt="cif")     # Export structure to CIF format
```

5. Volume and bond length sanity checks:
You can check if the structure's volume and bond lengths are within reasonable ranges:

Volume check: Ensure that the volume isn't unusually small or large for the system.

```python
volume = structure.volume
```
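As a rough illustration of the volume check, the sketch below computes the cell volume from a 3×3 lattice matrix in plain Python and derives the volume per atom. The helper names (`cell_volume`, `volume_per_atom`, `volume_looks_reasonable`) and the 5 to 100 Å³ per atom bounds are assumptions chosen for illustration; they are not part of pymatgen and not established limits.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def cell_volume(matrix: Matrix) -> float:
    """Return |a . (b x c)| for lattice vectors a, b, c given as the rows of the matrix."""
    a, b, c = matrix
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return abs(sum(ai * xi for ai, xi in zip(a, cross)))


def volume_per_atom(matrix: Matrix, n_sites: int) -> float:
    """Cell volume divided by the number of sites in the cell."""
    return cell_volume(matrix) / n_sites


def volume_looks_reasonable(matrix: Matrix, n_sites: int,
                            low: float = 5.0, high: float = 100.0) -> bool:
    """Flag cells whose volume per atom falls outside [low, high] (hypothetical bounds, in Å³)."""
    return low <= volume_per_atom(matrix, n_sites) <= high


# Lattice of the Tb2Mn12P7 entry (mp-1191573) printed later in this notebook; it has 21 sites.
lattice = [
    [0.0, 0.0, -3.604837],
    [-4.659918, -8.068325, 0.0],
    [-4.654614, 8.065277, 0.0],
]
print(cell_volume(lattice), volume_per_atom(lattice, 21))
print(volume_looks_reasonable(lattice, 21))
```

The printed values can be compared with the `volume` and `density_atomic` fields of the same record.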

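Minimum distance between atoms: Ensure that the shortest interatomic distance is reasonable, which rules out overlapping atoms. The diagonal of the distance matrix holds each site's zero distance to itself, so it has to be excluded (this needs a structure with at least two sites).

```python
import numpy as np
min_dist = structure.distance_matrix[np.triu_indices(len(structure), k=1)].min()  # upper triangle only, skips the zero diagonal
```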

## Globals and Utilities

In [1]:
## Module Installs
# pip install mp_api
# pip install python-dotenv
# pip install requests
# pip install python-slugify
# pip install mysql-connector-python

In [54]:
## Import modules
import os
import io
import re
import gc
import math
import json
import lzma
import gzip
import time
import pickle
import random
import zipfile
import requests
import pandas as pd
import nglview as nv
from tqdm import tqdm
import mysql.connector
import qmpy_rester as qr
import concurrent.futures
from slugify import slugify
from typing import List, Dict
from mp_api.client import MPRester
from emmet.core.summary import HasProps
from IPython.display import clear_output
from pymatgen.core import Structure
from dotenv import load_dotenv

load_dotenv()

True

In [3]:
## Variables
MPKEY = os.getenv('MPKEY')

MYSQL_USER = os.getenv('MYSQL_USER')
MYSQL_PASS = os.getenv('MYSQL_PASS')
MYSQL_HOST = os.getenv('MYSQL_HOST')
MYSQL_DB_NAME = os.getenv('MYSQL_DB_NAME')

ROOT_DIR = os.getcwd()
TEMP_DIR = os.path.join(ROOT_DIR, 'tmp')
DATA_DIR = os.path.join(ROOT_DIR, 'data')

raw_mp_data_compressed_pickle_path = os.path.join(DATA_DIR, 'raw_mp_data_dump.xz')
raw_oqmd_data_compressed_pickle_path = os.path.join(DATA_DIR, 'raw_oqmd_data_dump.xz')
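Since `os.getenv` silently returns `None` for a variable that is not set, it can help to check the environment before any API or database call is made. The sketch below is one way to do that; `missing_env_vars` is a hypothetical helper, not part of python-dotenv.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str],
                     environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


# Variable names read by this notebook
required = ["MPKEY", "MYSQL_USER", "MYSQL_PASS", "MYSQL_HOST", "MYSQL_DB_NAME"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```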

In [4]:
## Create non-existent directories
for _dir in [TEMP_DIR, DATA_DIR]:
    if not os.path.exists(_dir):
        print(f"Not found dir: {_dir}, creating one...")
        os.makedirs(_dir)

In [5]:
def pretty_file_size(size_bytes: int) -> str:
    """Format a byte count as a human-readable string using 1024-based units."""
    units = ['TB', 'GB', 'MB', 'KB', 'B']
    for i, unit in enumerate(units):
        threshold = 1024 ** (len(units) - 1 - i)  # TB -> 1024**4, ..., B -> 1024**0
        if size_bytes >= threshold:
            return f"{size_bytes / threshold:.2f} {unit}"
    return "0 B"

In [6]:
def zip_dir(directory_path: str, zip_filename: str = None) -> str:
    """Zip a directory with LZMA compression and return the path of the archive."""

    # Check if the input is a valid directory
    if not os.path.isdir(directory_path):
        raise ValueError(f"'{directory_path}' is not a valid directory.")

    # Create the zip file in the same parent directory as the zipped dir and name it the same as the source dir if no name is given
    directory_path = os.path.normpath(directory_path)  # drop any trailing separator so basename is not empty
    parent_dir_path = os.path.dirname(directory_path)
    source_dir_name = os.path.basename(directory_path)
    
    if zip_filename is None:
        zip_filename = os.path.join(parent_dir_path, f"{slugify(source_dir_name, separator='_')}.zip")

    # Write the files straight into an LZMA-compressed zip archive.
    # (Compressing the finished archive a second time with lzma.compress would
    # produce an .xz stream that zipfile, and therefore unzip(), cannot open.)
    print("Creating the zip file and writing its contents to it...")
    with zipfile.ZipFile(zip_filename, 'w', compression=zipfile.ZIP_LZMA) as zip_file:
        for root, dirs, files in os.walk(directory_path):
            for file in files:
                file_path = os.path.join(root, file)
                rel_path = os.path.relpath(file_path, directory_path)
                zip_file.write(file_path, rel_path)

    print(f"Saved the zip file to: {zip_filename}")
    return zip_filename
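One way to check that an archive of a directory can actually be read back is a small round trip in a temporary directory. The sketch below uses only the standard library; `zip_directory_lzma` is a hypothetical stand-in written for this example, not the notebook's own helper.

```python
import os
import tempfile
import zipfile


def zip_directory_lzma(directory_path: str, zip_path: str) -> str:
    """Write every file under directory_path into an LZMA-compressed zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_LZMA) as zf:
        for root, _dirs, files in os.walk(directory_path):
            for name in files:
                file_path = os.path.join(root, name)
                zf.write(file_path, os.path.relpath(file_path, directory_path))
    return zip_path


with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny source tree: source/nested/a.txt
    source = os.path.join(workdir, "source")
    os.makedirs(os.path.join(source, "nested"))
    with open(os.path.join(source, "nested", "a.txt"), "w") as fh:
        fh.write("hello")

    # Keep the archive outside the source tree so it does not include itself
    archive = zip_directory_lzma(source, os.path.join(workdir, "source.zip"))

    # Extract it again and read the file back
    extracted = os.path.join(workdir, "extracted")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(extracted)
    with open(os.path.join(extracted, "nested", "a.txt")) as fh:
        round_trip_text = fh.read()

print(names, round_trip_text)
```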

In [7]:
def unzip(path_to_zip_file: str, directory_to_extract_to: str = None) -> str:
    """
    Extracts a zip file (or a .gz compressed file) to a specified directory.
    
    If no destination is set, the function will create a new directory with a slugified name in the same path as the input file.
    
    Args:
        path_to_zip_file (str): Path to the input zip (.zip or .gz) file
        directory_to_extract_to (str): Optional path to extract the files to
    
    Returns:
        str: The path where the files were extracted to
    """

    ## Extract to the same directory if a destination is not set:
    if directory_to_extract_to is None:
        new_dir_name = slugify(os.path.splitext(os.path.basename(path_to_zip_file))[0], separator='_')
        directory_to_extract_to = os.path.join(os.path.dirname(path_to_zip_file), new_dir_name)

    ## Create the destination path if not exist
    if not os.path.exists(directory_to_extract_to):
        os.makedirs(directory_to_extract_to)
    
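    # Check if it's a .gz file and extract accordingly
    if path_to_zip_file.endswith('.gz'):
        # Drop the .gz suffix so the output file is named after the decompressed content
        dest_name = os.path.basename(path_to_zip_file)[:-len('.gz')]
        with gzip.open(path_to_zip_file, 'rb') as gz_ref:
            with open(os.path.join(directory_to_extract_to, dest_name), 'wb') as dest:
                # Stream in 1 MiB chunks instead of reading the whole file into memory
                for chunk in iter(lambda: gz_ref.read(1024 * 1024), b''):
                    dest.write(chunk)
        
        print(f"Finished decompressing {path_to_zip_file} to: {directory_to_extract_to}")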
    else:
        # If it's not a .gz file, assume it's a zip and extract using zipfile
        with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
            zip_ref.extractall(directory_to_extract_to)

        print(f"Finished extracting files to: {directory_to_extract_to}")
    
    return directory_to_extract_to

In [8]:
def download_file(url: str, file_name: str, destination: str = None, overwrite: bool = False) -> str:

    ## Slugify the file name but keep its extension(s), so later steps can still detect the file type
    stem, *suffixes = file_name.split('.')
    file_name = '.'.join([slugify(stem, separator='_'), *suffixes])
    
    ## Create the full destination path, defaulting to TEMP_DIR
    destination = destination or TEMP_DIR
    download_path = os.path.join(destination, file_name)

    ## Make sure the destination directory exists
    if not os.path.isdir(destination):
        print(f"Creating download destination directories... [{destination}]")
        os.makedirs(destination, exist_ok=True)
    
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        total_size = int(response.headers.get('content-length', 0))
        downloaded_size = 0
        block_size = 8192  # 8 Kilobytes

        ## Skip download if a file of the same size already exists and overwrite is not set
        if (
            os.path.exists(download_path)
            and not overwrite
            and os.path.getsize(download_path) == total_size
        ):
            print("File of the same size as remote file already exists and overwrite is set to False\nSkipping Download...")
            return download_path
        
        with open(download_path, 'wb') as file:
            for chunk in response.iter_content(chunk_size=block_size):
                file.write(chunk)
                downloaded_size += len(chunk)
                if total_size > 0:
                    print(f"\rDownloading: {pretty_file_size(downloaded_size)} of {pretty_file_size(total_size)} ({downloaded_size * 100.0 / total_size:.2f}%)", end='')
                else:
                    print(f"\rDownloading: {pretty_file_size(downloaded_size)}", end='')

    print(f"\nFinished downloading file: {file_name}")
    print(
        f"Destination: {download_path}\n"
        f"File: {file_name}\n"
        f"File Size: {pretty_file_size(total_size)}\n"
        f"Downloaded Size: {pretty_file_size(os.path.getsize(download_path))}\n"
    )

    return download_path

## Data Collection - Materials Project Data

In this section, the goal is to pull all the data from the Materials Project and store a local copy for further analysis.

In [17]:
## Get all the materials in the MP Db or load a local copy.

if os.path.exists(raw_mp_data_compressed_pickle_path):
    # get the data from my local copy
    print("Got a local copy of the MP data.\nDecompressing and loading the data...")
    with lzma.open(raw_mp_data_compressed_pickle_path, "rb") as f:
        raw_mp_data = pickle.load(f, fix_imports=False)
    print("Done reading and loading the file.")
        
else:
    # get the data from MP
    with MPRester(MPKEY, monty_decode=False, use_document_model=False) as mpr:
        raw_mp_data = mpr.materials.summary.search(
            fields=[
                'elements', 'composition', 'composition_reduced', 'formula_pretty', 'volume',
                'density', 'density_atomic', 'material_id', 'energy_above_hull', 'band_gap', 'efermi',
                'shear_modulus', 'bulk_modulus', 'structure'
            ]
        )

Got a local copy of the MP data.
Decompressing and loading the data...
Done reading and loading the file.


In [18]:
len(raw_mp_data)

155361

In [19]:
## Now compress and store the data locally for future use:
if not os.path.exists(raw_mp_data_compressed_pickle_path):
    with lzma.open(raw_mp_data_compressed_pickle_path, "wb") as f:
        pickle.dump(raw_mp_data, f, fix_imports=False)
        
print(f"Stored ({len(raw_mp_data)} materials)")

Stored (155361 materials)
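The load-or-download logic and the store step above can also be folded into one helper. The following is a minimal sketch of that caching pattern; `load_or_build` and `fake_download` are hypothetical names, and the temporary-file rename is an added precaution rather than something the notebook does. As with any pickle, only load cache files you created yourself.

```python
import lzma
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_build(cache_path: str, build: Callable[[], Any]) -> Any:
    """Return the cached object if the file exists, else build it and cache it."""
    if os.path.exists(cache_path):
        with lzma.open(cache_path, "rb") as f:
            return pickle.load(f)

    data = build()
    # Write to a temporary name first so an interrupted dump never
    # leaves a partial file at the real cache path.
    tmp_path = cache_path + ".tmp"
    with lzma.open(tmp_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, cache_path)
    return data


calls = []


def fake_download():
    """Stand-in for the expensive API call; records how often it runs."""
    calls.append(1)
    return [{"material_id": "mp-1191573", "formula_pretty": "Tb2Mn12P7"}]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw_dump.xz")
    first = load_or_build(path, fake_download)   # builds and writes the cache
    second = load_or_build(path, fake_download)  # served from the cache file

print(len(calls), first == second)
```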


In [20]:
rand_material = raw_mp_data[random.randint(0, len(raw_mp_data)-1)]
rand_material

{'elements': ['Mn', 'P', 'Tb'],
 'composition': {'Tb': 2.0, 'Mn': 12.0, 'P': 7.0},
 'composition_reduced': {'Tb': 2.0, 'Mn': 12.0, 'P': 7.0},
 'formula_pretty': 'Tb2Mn12P7',
 'volume': 270.86192945737486,
 'density': 7.319436663297077,
 'density_atomic': 12.89818711701785,
 'material_id': 'mp-1191573',
 'structure': {'@module': 'pymatgen.core.structure',
  '@class': 'Structure',
  'charge': 0,
  'lattice': {'matrix': [[0.0, 0.0, -3.604837],
    [-4.659918, -8.068325, 0.0],
    [-4.654614, 8.065277, 0.0]],
   'pbc': [True, True, True],
   'a': 3.604837,
   'b': 9.317333528019107,
   'c': 9.312041912262046,
   'alpha': 120.00111049590895,
   'beta': 90.0,
   'gamma': 90.0,
   'volume': 270.86192945737486},
  'properties': {},
  'sites': [{'species': [{'element': 'Tb', 'occu': 1}],
    'abc': [0.0, 0.666822, 0.333267],
    'xyz': [-4.6585650845340005, -2.6922459431910006, 0.0],
    'properties': {'magmom': -0.072},
    'label': 'Tb'},
   {'species': [{'element': 'Tb', 'occu': 1}],
    'ab

In [21]:
if isinstance(rand_material, dict):
    x = json.dumps(
        rand_material,
        indent=2
    )
    
    print(x)
    print(rand_material.keys())

{
  "elements": [
    "Mn",
    "P",
    "Tb"
  ],
  "composition": {
    "Tb": 2.0,
    "Mn": 12.0,
    "P": 7.0
  },
  "composition_reduced": {
    "Tb": 2.0,
    "Mn": 12.0,
    "P": 7.0
  },
  "formula_pretty": "Tb2Mn12P7",
  "volume": 270.86192945737486,
  "density": 7.319436663297077,
  "density_atomic": 12.89818711701785,
  "material_id": "mp-1191573",
  "structure": {
    "@module": "pymatgen.core.structure",
    "@class": "Structure",
    "charge": 0,
    "lattice": {
      "matrix": [
        [
          0.0,
          0.0,
          -3.604837
        ],
        [
          -4.659918,
          -8.068325,
          0.0
        ],
        [
          -4.654614,
          8.065277,
          0.0
        ]
      ],
      "pbc": [
        true,
        true,
        true
      ],
      "a": 3.604837,
      "b": 9.317333528019107,
      "c": 9.312041912262046,
      "alpha": 120.00111049590895,
      "beta": 90.0,
      "gamma": 90.0,
      "volume": 270.86192945737486
    },
    

In [22]:
structure = Structure.from_dict(rand_material['structure'])
print(structure)

view = nv.show_pymatgen(structure)

# Set the representation to ball+stick for better visibility
view.add_ball_and_stick()

# Display the unit cell
view.add_unitcell()

# # Optionally, set colors for specific elements (e.g., coloring Silicon atoms differently)
# view.update_ball_and_stick(color_scheme="element", color_value={"Si": "orange"})

# Adjust background color
view.background = "black"

# Add labels to the atoms
for i, site in enumerate(structure):
    view.add_label(position=site.coords, color="yellow", label=str(site.species_string))

# Display the enhanced view
display(view)




Full Formula (Tb2 Mn12 P7)
Reduced Formula: Tb2Mn12P7
abc   :   3.604837   9.317334   9.312042
angles: 120.001110  90.000000  90.000000
pbc   :       True       True       True
Sites (21)
  #  SP      a         b         c    magmom
---  ----  ---  --------  --------  --------
  0  Tb    0    0.666822  0.333267    -0.072
  1  Tb    0.5  0.333408  0.666725    -0.093
  2  Mn    0    0.377101  0.951052     1.504
  3  Mn    0    0.048935  0.425909     1.506
  4  Mn    0    0.573894  0.622898     1.502
  5  Mn    0    0.879065  0.723855    -0.415
  6  Mn    0    0.27637   0.155382    -0.386
  7  Mn    0    0.844584  0.120805    -0.417
  8  Mn    0.5  0.954374  0.564723     1.791
  9  Mn    0.5  0.435112  0.389471     1.786
 10  Mn    0.5  0.610239  0.045506     1.782
 11  Mn    0.5  0.124107  0.896479     2.547
 12  Mn    0.5  0.103792  0.227733     2.544
 13  Mn    0.5  0.772209  0.87615      2.553
 14  P     0    0.118952  0.709628    -0.061
 15  P     0    0.290476  0.409514    -0.061
 1

NGLWidget(background='black')

In [23]:
for i, site in enumerate(structure):
    print(site.species_string)

Tb
Tb
Mn
Mn
Mn
Mn
Mn
Mn
Mn
Mn
Mn
Mn
Mn
Mn
P
P
P
P
P
P
P


In [26]:
rand_material.get('formula_pretty')

'Tb2Mn12P7'

In [51]:
from pymatgen.core import Structure
from pymatgen.io.cif import CifWriter

# rand_material is a plain dict here, so rebuild the Structure object from its 'structure' entry
structure = Structure.from_dict(rand_material['structure'])

# Create a CifWriter object
cif_writer = CifWriter(structure)

# Write the CIF file
# cif_writer.write_file(f"my_cif_file.cif")

In [27]:
import nglview as nv
from pymatgen.core import Structure
from pymatgen.io.ase import AseAtomsAdaptor

# Load your structure (replace with your actual structure loading code)
# structure = Structure.from_file("your_structure_file.cif")

# Convert the structure to ASE atoms
atoms = AseAtomsAdaptor.get_atoms(structure)

# Create the nglview widget
view = nv.show_ase(atoms)

# Customize the view (optional)
view.add_unitcell()
view.center()

# Display the widget
display(view)

NGLWidget()

## Data Collection - Open Quantum Materials Database

#### OQMD API

OQMD provides a simple-to-use API ([docs](https://static.oqmd.org/static/docs/restful.html#querying)) that can be used to query the data they host.

Querying [examples](https://static.oqmd.org/static/docs/restful.html#more-example-queries)

#### OQMD Python Module

OQMD also provides a Python SDK to interact with their data, similar to the Materials Project client.

The documentation can be found on [GitHub](https://github.com/mohanliu/qmpy_rester) or [PyPI](https://pypi.org/project/qmpy-rester/)

 - Installation:
`pip install qmpy-rester`

#### OQMD SQL DATA DUMP

OQMD also provides its whole database as a MySQL dump.

Instructions are available [here](https://static.oqmd.org/static/docs/getting_started.html#setting-up-the-database)

 - Instructions from OQMD:

The MySQL data folder (e.g. "/var/lib/mysql" for the system MySQL on GNU/Linux systems) may occupy around 100 GB of additional disk space when the OQMD database is imported.

For convenience, the latest version of the database is also available for direct download at http://oqmd.org/static/downloads/qmdb.sql.gz

Once you have the database file, you need to unzip it and load it into a MySQL database. On a typical Linux installation this process will look like:

``` bash
$ wget http://oqmd.org/static/downloads/qmdb.sql.gz
$ gunzip qmdb.sql.gz
$ mysql < qmdb.sql
```

*Note*
Assuming your install is on Linux and you haven't used MySQL before, you will need to enter a MySQL session as root (`mysql -u root -p`), create a user within MySQL (`CREATE USER 'newuser'@'localhost';`), and grant that user permissions (`GRANT ALL PRIVILEGES ON *.* TO 'newuser'@'localhost'; FLUSH PRIVILEGES;`).

The name of the deployed database has changed since previous releases (`qmdb_prod`).

To verify that the database is properly installed and has appropriate permissions, run:

``` sql
mysql> select count(*) from entries;
+----------+
| count(*) |
+----------+
|   815654 |
+----------+
```

The number may not match what is shown above, but as long as you don't receive any errors, your database should be working properly.

#### We will go the API route.

In [56]:
## Get the base OQMD API url and Setup any other variables
#base_oqmd_url = "https://oqmd.org/oqmdapi/formationenergy?limit=10&offset=0&sort_offset=0&noduplicate=False&desc=False"
RECORDS_PER_REQUEST = 1000
BASE_URL = "https://oqmd.org/oqmdapi/formationenergy"
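To get a feel for how many requests a full download takes at this batch size, the arithmetic can be sketched as below. `batch_plan` is a hypothetical helper; the total of 1,226,781 records is the `data_available` value the API reports in the metadata output further down, and it may change over time.

```python
from typing import Iterator, Tuple


def batch_plan(total_records: int, per_request: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that cover total_records without overlap."""
    for offset in range(0, total_records, per_request):
        yield offset, min(per_request, total_records - offset)


total = 1_226_781  # 'data_available' as reported by the API on 2024-11-04
plan = list(batch_plan(total, 1000))

print(len(plan), plan[0], plan[-1])
# → 1227 (0, 1000) (1226000, 781)
```

With a fixed pause between requests, the number of batches gives a lower bound on the total run time before any network latency is counted.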

In [None]:
# Check for local cache first
if os.path.exists(raw_oqmd_data_compressed_pickle_path):
    print(f"Loading data from local storage: {raw_oqmd_data_compressed_pickle_path}")
    with lzma.open(raw_oqmd_data_compressed_pickle_path, "rb") as f:
        raw_oqmd_data = pickle.load(f, fix_imports=False)
    print(f"Loaded {len(raw_oqmd_data)} records from local storage.")

else:
    # Initialize session for better performance
    session = requests.Session()

    # Reconnaissance request to get total count
    recon_params = {
        'limit': 1,
        'offset': 0,
        'sort_offset': 0,
        'noduplicate': False,
        'desc': False
    }
    
    print("Performing reconnaissance request...")
    response = session.get(BASE_URL, params=recon_params)
    
    if response.status_code != 200:
        raise Exception(f"API Error: {response.status_code}")
        
    recon_data = response.json()
    total_records = recon_data['meta']['data_available']
    print(f"Total records available: {total_records:,}")
    print(f"Will fetch in batches of {RECORDS_PER_REQUEST:,} records")
    
    # Initialize data collection
    raw_oqmd_data = []
    current_url = BASE_URL
    
    # Calculate total number of iterations for progress bar
    total_iterations = (total_records + RECORDS_PER_REQUEST - 1) // RECORDS_PER_REQUEST
    
    # Setup progress bar
    pbar = tqdm(total=total_records, desc="Fetching records", 
                unit="records", dynamic_ncols=True)
    
    while current_url:
        try:
            # Update parameters for each request
            params = {
                'limit': RECORDS_PER_REQUEST,
                'offset': len(raw_oqmd_data),
                'sort_offset': 0,
                'noduplicate': False,
                'desc': False
            }
            
            response = session.get(BASE_URL, params=params)
            
            if response.status_code != 200:
                print(f"\nError on request: {response.status_code}")
                time.sleep(30)  # Back off on error
                continue
                
            response_data = response.json()
            
            # Extend data list with new records
            batch_data = response_data['data']
            raw_oqmd_data.extend(batch_data)
            
            # Update progress bar
            pbar.update(len(batch_data))
            
            # Clear previous output and show current status
            clear_output(wait=True)
            pbar.display()
            
            # Check if more data is available
            if not response_data['meta']['more_data_available']:
                current_url = None
            else:
                # Small delay to be nice to the API
                time.sleep(10)
                
        except Exception as e:
            print(f"\nError occurred: {str(e)}")
            time.sleep(60)  # Back off on error
            continue
    
    pbar.close()
    
    # Save to compressed pickle file
    print(f"\nSaving {len(raw_oqmd_data):,} records to {raw_oqmd_data_compressed_pickle_path}")
    with lzma.open(raw_oqmd_data_compressed_pickle_path, "wb") as f:
        pickle.dump(raw_oqmd_data, f, protocol=pickle.HIGHEST_PROTOCOL)
    
    print("Data fetch and save complete!")

# Data is now available in raw_oqmd_data
print(f"Total records in memory: {len(raw_oqmd_data):,}")

Fetching records:   1%|          | 7000/1226781 [18:38<102:10:08,  3.32records/s]


Error on request: 502

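The fetch loop above retries forever when the API keeps answering with errors such as 502. A common alternative is to cap the number of attempts and grow the wait between them. The sketch below shows that idea in isolation; `call_with_retries` and its parameters are assumptions for illustration, and the simulated `flaky_request` stands in for the real HTTP call so the example runs without network access.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            # Wait only if another attempt will follow
            if attempt < max_attempts - 1:
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError(f"Giving up after {max_attempts} attempts") from last_error


attempts = {"count": 0}


def flaky_request() -> str:
    """Fail twice, then succeed (simulates a server that recovers)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated 502")
    return "ok"


delays = []
result = call_with_retries(flaky_request, sleep=delays.append)  # record waits instead of sleeping
print(result, delays)
# → ok [1.0, 2.0]
```

In a real run, `func` would wrap one batch request and raise on a non-200 status, so a persistent outage stops the loop instead of spinning for hours.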

In [None]:
# Function that will get the data from the API
def get_materials(url: str) -> dict:

    st = time.time()
    recon_response = requests.get(url)
    et = time.time()
    print(f"Finished request in {(et - st)} Sec")

    return recon_response.json()

In [None]:
## Get the materials

# Check for local data first
if os.path.exists(raw_oqmd_data_compressed_pickle_path):
    print(f"Found local data at {raw_oqmd_data_compressed_pickle_path}")
    with lzma.open(raw_oqmd_data_compressed_pickle_path, "rb") as f:
        raw_oqmd_data = pickle.load(f)
    print(f"Successfully loaded {len(raw_oqmd_data):,} records from local storage")

else:
    # Make reconnaissance request
    response = requests.get(base_oqmd_url)
    print(response)
    metadata = response.json()["meta"]
    
    print("\nAPI Metadata:")
    print(f"Total records available: {metadata['data_available']:,}")
    print(f"API version: {metadata['api_version']}")
    print(f"Timestamp: {metadata['time_stamp']}")
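    print(f"Total batches: {math.ceil(metadata['data_available'] / records_per_request):,}")
    
    total_records = metadata["data_available"]

    # Shared progress bar; each worker thread updates it from fetch_chunk
    pbar = tqdm(total=total_records, desc="Fetching records", unit="records", dynamic_ncols=True)

    # Calculate chunks for parallel processing
    # (base_oqmd_url, records_per_request, chunk_size, max_workers and rate_limit_delay
    #  must already be defined in the session; this cell does not set them)
    num_chunks = math.ceil(total_records / chunk_size)
    chunk_params_list = []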

    for i in range(num_chunks):
        start_offset = i * chunk_size
        end_offset = min((i + 1) * chunk_size, total_records)
        chunk_params_list.append({
            'start_offset': start_offset,
            'end_offset': end_offset,
            'pbar': pbar
        })

    # Fetch data using thread pool
    raw_oqmd_data = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all chunks to the thread pool
        future_to_chunk = {
            executor.submit(fetch_chunk, chunk_params): chunk_params
            for chunk_params in chunk_params_list
        }
        
        # Collect results as they complete
        for future in concurrent.futures.as_completed(future_to_chunk):
            chunk_data = future.result()
            raw_oqmd_data.extend(chunk_data)

    pbar.close()
    
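    # Sort by ID to ensure consistent ordering: threads finish in arbitrary order and
    # the records carry no 'offset' field (see the sample record), so sort on 'formationenergy_id'
    raw_oqmd_data.sort(key=lambda x: x.get('formationenergy_id', 0))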
    
    # Save the collected data
    if raw_oqmd_data:
        print(f"\nSaving {len(raw_oqmd_data):,} records to {raw_oqmd_data_compressed_pickle_path}")
        with lzma.open(raw_oqmd_data_compressed_pickle_path, "wb") as f:
            pickle.dump(raw_oqmd_data, f)
        print("Save complete!")

print(f"\nTotal records available: {len(raw_oqmd_data):,}")

In [19]:
len(list_of_data.get('data'))

50

In [21]:
list_of_data['meta']

{'query': {'representation': 'http://oqmd.org/oqmdapi/formationenergy'},
 'api_version': '1.0',
 'time_stamp': '2024-11-04 03:31:33',
 'data_returned': 50,
 'data_available': 1226781,
 'comments': '',
 'query_tree': '',
 'more_data_available': True}

In [45]:
# Get the stats on the available data from the API
st = time.time()
recon_response = requests.get(base_oqmd_url)
et = time.time()
print(f"Finished request in {(et - st)} Sec")

metadata = recon_response.json()["meta"]

Finished request in 9.3672935962677 Sec


In [46]:
print(json.dumps(metadata, indent=2))

{
  "query": {
    "representation": "http://oqmd.org/oqmdapi/formationenergy?noduplicate=True&desc=False"
  },
  "api_version": "1.0",
  "time_stamp": "2024-11-04 04:12:29",
  "data_returned": 50,
  "data_available": 1093389,
  "comments": "",
  "query_tree": "",
  "more_data_available": true
}


In [47]:
number = int(1093389 / (2**10))
number = 500
url = f"{base_oqmd_url}&limit={number}"

print(f"{number = }")
print(f"{url = }")

input()

st = time.time()
response = requests.get(url)
et = time.time()
print(f"Finished request in {(et - st)} Sec")

print(response)

r_metadata = response.json()["meta"]

number = 500
url = 'https://oqmd.org/oqmdapi/formationenergy?noduplicate=True&desc=False&limit=500'


 


Finished request in 22.011234998703003 Sec
<Response [200]>


In [29]:
response

<Response [502]>

In [17]:
# Function to fetch a chunk of data
def fetch_chunk(chunk_params: Dict) -> List:
    start_offset = chunk_params['start_offset']
    end_offset = chunk_params['end_offset']
    chunk_data = []
    
    for offset in range(start_offset, end_offset, records_per_request):
        try:
            url = f"{base_oqmd_url}&limit={records_per_request}&offset={offset}"
            response = requests.get(url)
            response.raise_for_status()
            batch_data = response.json()["data"]
            chunk_data.extend(batch_data)
            
            # Update progress bar
            chunk_params['pbar'].update(len(batch_data))
            
            # Rate limiting per thread
            time.sleep(rate_limit_delay)
            
        except Exception as e:
            print(f"\nError fetching data at offset {offset}: {str(e)}")
            break
            
    return chunk_data

In [11]:
response = requests.get(base_oqmd_url)

In [18]:
len(response.json())

5

In [22]:
sub1 = response.json()
#sub1['data'] = []
print(json.dumps(sub1['data'][3], indent=2))

{
  "name": "La",
  "entry_id": 8130,
  "calculation_id": 3588,
  "icsd_id": 43568,
  "formationenergy_id": 4061151,
  "duplicate_entry_id": 8130,
  "composition": "La1",
  "composition_generic": "A",
  "prototype": "W",
  "spacegroup": "Im-3m",
  "volume": 37.7862,
  "ntypes": 1,
  "natoms": 1,
  "unit_cell": [
    [
      2.113933,
      2.113933,
      2.113933
    ],
    [
      2.113933,
      -2.113933,
      -2.113933
    ],
    [
      -2.113933,
      2.113933,
      -2.113933
    ]
  ],
  "sites": [
    "La @ 0 0 0"
  ],
  "band_gap": 0.0,
  "delta_e": 0.131232160000001,
  "stability": 0.131232160000001,
  "fit": "standard",
  "calculation_label": "static"
}


In [83]:
sub['data'][random.randint(0, len(sub['data'])-1)]

{'name': 'Tm',
 'entry_id': 1216098,
 'calculation_id': 2689,
 'icsd_id': None,
 'formationenergy_id': 4061145,
 'duplicate_entry_id': None,
 'composition': 'Tm1',
 'composition_generic': 'A',
 'prototype': 'C19_alpha_Sm',
 'spacegroup': 'R-3m',
 'volume': 88.6111,
 'ntypes': 1,
 'natoms': 3,
 'unit_cell': [[1.747609, -3.026948, 0.0],
  [1.747609, 3.026948, 0.0],
  [0.0, -2.017965, 8.375468]],
 'sites': ['Tm @ 0 0 0', 'Tm @ 0.222 0.778 0.334', 'Tm @ 0.778 0.222 0.666'],
 'band_gap': 0.0,
 'delta_e': 0.00639353000000042,
 'stability': 0.00639353000000042,
 'fit': 'standard',
 'calculation_label': 'static'}
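The `sites` field of an OQMD record is a list of strings rather than structured data, so it needs parsing before it can be used alongside `unit_cell`. Below is a minimal sketch, assuming every site string follows the `<element> @ <x> <y> <z>` pattern seen in the two sample records; `parse_site` and `record_to_sites` are hypothetical helpers.

```python
from typing import List, Tuple

Site = Tuple[str, Tuple[float, float, float]]


def parse_site(site: str) -> Site:
    """Split a string such as 'Tm @ 0.222 0.778 0.334' into (element, fractional coords)."""
    element, _, coords = site.partition("@")
    x, y, z = (float(value) for value in coords.split())
    return element.strip(), (x, y, z)


def record_to_sites(record: dict) -> List[Site]:
    """Parse all sites of a record and check the count against 'natoms'."""
    sites = [parse_site(s) for s in record["sites"]]
    if len(sites) != record["natoms"]:
        raise ValueError(
            f"Expected {record['natoms']} sites, parsed {len(sites)}"
        )
    return sites


# Subset of the Tm record shown above
record = {
    "name": "Tm",
    "natoms": 3,
    "sites": ["Tm @ 0 0 0", "Tm @ 0.222 0.778 0.334", "Tm @ 0.778 0.222 0.666"],
}
sites = record_to_sites(record)
print(sites)
```

The parsed species and fractional coordinates, together with `unit_cell` as the lattice, are the kind of inputs the pymatgen `Structure` constructor takes.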

## Data Collection - AFLOW

In [None]:
from aflow import search
import time

# Search for all materials
results = search()

start_time = time.time()
total_materials = len(results)
print(f"Total materials found: {total_materials}")

for i, material in enumerate(results, 1):
    print(f"Processing material {i} of {total_materials}: {material.auid}")
    print(f"  Composition: {material.composition}")
    print(f"  Space Group: {material.spacegroup_relax}")
    
    # Retrieve additional properties safely
    properties = [
        ('Energy', 'energy_atom', 'eV/atom'),
        ('Volume', 'volume_cell', 'Å³'),
        ('Density', 'density', 'g/cm³'),
        ('Bulk modulus', 'ael_bulk_modulus_vrh', 'GPa'),
        ('Shear modulus', 'ael_shear_modulus_vrh', 'GPa')
    ]
    
    for prop_name, prop_key, unit in properties:
        value = getattr(material, prop_key, None)
        if value is not None:
            print(f"  {prop_name}: {value} {unit}")
    
    print()
    
    # Print progress every 100 materials
    if i % 100 == 0:
        elapsed_time = time.time() - start_time
        avg_time_per_material = elapsed_time / i
        estimated_total_time = avg_time_per_material * total_materials
        print(f"Processed {i} of {total_materials} materials.")
        print(f"Estimated total time: {estimated_total_time/60:.2f} minutes")
        print(f"Elapsed time: {elapsed_time/60:.2f} minutes")
        print(f"Estimated time remaining: {(estimated_total_time - elapsed_time)/60:.2f} minutes")
        print()


## Clean the data pulled