# Generating data

All the code for data generation is in the module `generate_lib`.

The quantum chemistry data used in this project are produced with the quantum chemistry package Psi4 (http://www.psicode.org/).
Thanks to the interface package `openfermionpsi4`, the resulting molecular data can be stored in a molecule `.hdf5` file, to be directly loaded inside an `openfermion.MolecularData` object.

The molecule files are stored in this repository's `data/molecules/` directory.
The name of each file is obtained by converting the molecule's geometry to a string in an unambiguous way (see `MoleculeDataGenerator._generate_filename`).

For each molecule, the data relevant to the ML model are also saved as a dictionary in a `.json` file with the same name, in the directory `data/json/`.
The data in the JSON files can be read and used without depending on any of the aforementioned packages.

To generate and save a molecule and its related data from a given geometry, it is sufficient to instantiate `MoleculeDataGenerator(geometry)`.

In [2]:
from generate_lib import MoleculeDataGenerator

# Set the geometry and generate the molecule

geometry = (
    ('H', (0., 0., 0.)),
    ('H', (1., 0., 0.)),
    ('H', (0., 1., 0.)),
    ('H', (0., 0., 1.))
)

# instantiating a MoleculeDataGenerator is enough to create the molecule and data files
gen = MoleculeDataGenerator(geometry)

# the object contains the molecule and the related data dictionary
print(gen.molecule)
print(gen.data_dict.keys())

<openfermion.hamiltonians._molecular_data.MolecularData object at 0x10b638650>
dict_keys(['geometry', 'multiplicity', 'canonical_orbitals', 'canonical_to_oao', 'orbital_energies', 'exact_energy', 'ground_states'])


## Chosen molecule family

The first chosen molecule family for this project is $\mathrm{H}_4$ in various geometries.
Some physical limits for the geometry are set:
- For each pair of H atoms, the interatomic distance is at least $0.4Å$. This avoids excessive orbital overlaps.
- For each pair of adjacent atoms (in the order in which they are listed in the geometry), the interatomic distance is at most $1.5Å$. This avoids completely dissociated molecules.

Additionally, parameters that are irrelevant for the calculation of translation-invariant and rotation-invariant properties are (for now) fixed:
- the first atom is always at position $(0, 0, 0)$
- the second atom is always on the positive X half-axis $(x_1, 0, 0)$, $x_1>0$
- the third atom is always on the XY plane $(x_2, y_2, 0)$
- the fourth can be anywhere in space, within the previously set limits $(x_3, y_3, z_3)$

Finally, for convenience in file naming and data exchange, we keep only 4 decimals in all the $x_i, y_i, z_i$ values:
- all positions are forced onto a grid with resolution $0.0001Å$
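
The limits and conventions listed above can be sketched as a few small checks. This is only an illustration and not the code of `check_geometry` or `H4_generate_valid_geometry` in `generate_lib`: the function names `snap_to_grid`, `satisfies_distance_limits` and `is_in_fixed_frame` are hypothetical, and the constants simply restate the values given in the text (distances in Å).

```python
import itertools
import math

MIN_DISTANCE = 0.4   # lower bound for every pair of atoms (from the text)
MAX_ADJACENT = 1.5   # upper bound for consecutive atoms in the list (from the text)
GRID_DECIMALS = 4    # number of decimals kept for every coordinate


def snap_to_grid(geometry, decimals=GRID_DECIMALS):
    """Round every coordinate to the given number of decimals."""
    return [(symbol, tuple(round(c, decimals) for c in position))
            for symbol, position in geometry]


def satisfies_distance_limits(geometry):
    """Check the two interatomic-distance limits listed above."""
    positions = [position for _, position in geometry]
    all_pairs_ok = all(math.dist(p, q) >= MIN_DISTANCE
                       for p, q in itertools.combinations(positions, 2))
    adjacent_ok = all(math.dist(p, q) <= MAX_ADJACENT
                      for p, q in zip(positions, positions[1:]))
    return all_pairs_ok and adjacent_ok


def is_in_fixed_frame(geometry):
    """Check the frame convention: origin, positive X half-axis, XY plane."""
    p0, p1, p2 = [position for _, position in geometry[:3]]
    return (tuple(p0) == (0, 0, 0)
            and p1[0] > 0 and p1[1] == 0 and p1[2] == 0
            and p2[2] == 0)
```

For example, `satisfies_distance_limits` rejects a geometry in which two atoms are only $0.3Å$ apart, or in which two consecutive atoms are $1.6Å$ apart.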

In [5]:
from generate_lib import H4_generate_random_molecule
help(H4_generate_random_molecule)

Help on function H4_generate_random_molecule in module generate_lib:

H4_generate_random_molecule(rng=Generator(PCG64) at 0x1019573650)
    Generate and save molecule and data for a valid random geomertry of H4.
    For detailed help see: 
        `H4_generate_valid_geometry`
        `check_geometry` 
        `MoleculeDataGenerator`
        
    Args:
        rng: a numpy.random generator
        
    Returns:
        MoleculeDataGenerator
        
    Raises:
        FailedGeneration



In [6]:
from generate_lib import H4_generate_valid_geometry
H4_generate_valid_geometry()

[('H', (0.0, 0.0, 0.0)),
 ('H', (0.4023, 0.0, 0.0)),
 ('H', (-0.024, -0.499, 0.0)),
 ('H', (-0.1734, 0.3813, -0.4048))]

## Generate random H4 data

In [1]:
from tqdm import tqdm
from generate_lib import H4_generate_random_molecule, FailedGeneration

n_molecules_to_generate = 100

for _ in tqdm(range(n_molecules_to_generate)):
    for attempt in range(10):
        try:
            H4_generate_random_molecule()
        except FailedGeneration as exc:
            print('Failed to generate random molecule because of:\n' + str(exc))
        else:
            break

100%|██████████| 100/100 [05:28<00:00,  3.29s/it]


# Loading data for QML model

Only the function `load_data` in `load_lib` is needed to load the relevant data for the QML model.
`load_lib` also defines `JSON_DIR` and `MOLECULES_DIR` for convenience.


In [8]:
import os  # used below to list the files in JSON_DIR
from load_lib import *

To load all data:

In [9]:
dataset = [load_data(filename) 
           for filename in os.listdir(JSON_DIR)
           if filename.endswith('.json')]

print('length of the dataset:', len(dataset))

length of the dataset: 4


**Example:** count how many of the molecules in the dataset have a singlet ground state and how many have a triplet

In [10]:
multiplicities = [load_data(filename)['multiplicity']
                  for filename in os.listdir(JSON_DIR)
                  if filename.endswith('.json')]

from collections import Counter
print(Counter(multiplicities))

Counter({1: 2, 3: 2})


# What data are saved

Let's take as an example one data dictionary:

In [11]:
from load_lib import *
import os
import numpy as np  # needed for the isinstance check below

filename = 'H,0,0,0.0;H,1,0,0.0;H,0,1,0.0;H,0,0,1.0'
data_dict = load_data(filename)

print('\ncontent of each data dictionary\n')
print(f"{'KEY:':20} {'VALUE TYPE:':20}\n{'-'*60}")
for k, v in data_dict.items():
    print(f'{k:20} {str(type(v)):20}',
          f'with shape {v.shape}' if isinstance(v, np.ndarray) else "")


content of each data dictionary

KEY:                 VALUE TYPE:         
------------------------------------------------------------
geometry             <class 'list'>       
multiplicity         <class 'int'>        
canonical_orbitals   <class 'numpy.ndarray'> with shape (4, 4)
canonical_to_oao     <class 'numpy.ndarray'> with shape (4, 4)
orbital_energies     <class 'numpy.ndarray'> with shape (4,)
exact_energy         <class 'float'>      
ground_states        <class 'numpy.ndarray'> with shape (256, 3)


`geometry` is a list of pairs `[atom_symbol, [x, y, z]]` (the tuples passed to the generator are stored as lists in the JSON file):

In [12]:
data_dict['geometry']

[['H', [0.0, 0.0, 0.0]],
 ['H', [1.0, 0.0, 0.0]],
 ['H', [0.0, 1.0, 0.0]],
 ['H', [0.0, 0.0, 1.0]]]

`multiplicity` indicates whether the ground state of this molecule is a singlet (1) or a triplet (3).
All the ground states are saved as complex **column vectors** in a matrix of shape ($2^n$, `multiplicity`), where $n$ is the number of spin-orbitals ($n = 8$ for H4, hence 256 rows).

In [14]:
print('multiplicity: ', data_dict['multiplicity'])
print('ground states: \n', data_dict['ground_states'].round(3))

multiplicity:  3
ground states: 
 [[ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.   +0.j     0.   +0.j   ]
 [ 0.   +0.j     0.  

`exact_energy` encodes the exact ground-state energy of the molecule, i.e. the target output of the QML model for this first stage of the project.

In [21]:
data_dict['exact_energy']

-1.8917479285527619

## Atomic and molecular orbitals

Atomic orbitals are a physically motivated, incomplete and non-orthogonal parametrization of scalar functions of space (fields $\phi(\vec{x})$).

To each atom we can assign a set of spherical harmonics centered in the atom's position and with radii dependent on the atom's charge. 
Each (infinite) set of spherical harmonics forms an orthonormal basis for fields.
If we model an atom as a point charge, these spherical harmonics will be the eigen-wavefunctions for a single electron in the atom's field.
Truncating the number of spherical harmonics allows us to construct a *finite* parametrization of space that can approximate the low-energy electron field, hopefully also in the interacting and poly-atomic case.
The basis functions of this parametrization are called atomic orbitals.

The *minimal basis* approximation, in the case of Hydrogen atoms, prescribes that we take a single spherically symmetric *atomic orbital* for each Hydrogen - to which we have to add spin information.
This results in two *spin-orbitals* for each Hydrogen atom: a total of 4 atomic orbitals (i.e. 8 spin-orbitals) for an H4 molecule.

To construct an orthonormal (of course still incomplete) parametrization of fields, we take linear combinations of these 4 orbitals. The *canonical orbitals* are a special orthonormal combination of atomic orbitals, obtained through the Hartree-Fock method.
The ground-state Slater determinant constructed on the canonical orbitals minimizes the total energy, accounting for electron-electron interactions with a mean-field approach.

The `canonical_orbitals` matrix encodes the linear combination of atomic orbitals taken to construct the canonical orbitals:

In [16]:
data_dict['canonical_orbitals'].round(3)

array([[ 0.474, -0.   ,  0.   ,  1.287],
       [ 0.282,  0.105,  0.963, -0.563],
       [ 0.282, -0.887, -0.391, -0.563],
       [ 0.282,  0.782, -0.572, -0.563]])

The rows relate to each atomic orbital, ordered as the respective atoms appear in `geometry`.
The columns represent molecular orbitals, ordered by increasing single-particle energy.
These are the energies saved in `orbital_energies`.

In [20]:
data_dict['orbital_energies'].round(3)

array([-0.682,  0.043,  0.043,  0.569])

Typically on a quantum computer, under Jordan-Wigner encoding, each pair of qubits will represent a canonical orbital (2 qubits because for each orbital there are two spins, i.e. two spin-orbitals).
The `ground_states` saved in these data are encoded in this way.
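
As a small illustration of this encoding, the sketch below maps a computational-basis index (a row index of `ground_states`) to spin-orbital and orbital occupations. It assumes that qubit 0 is the most significant bit of the index and that qubits $2k$ and $2k+1$ are the two spin-orbitals of canonical orbital $k$; neither convention is stated in this notebook, so both should be checked against the actual data. The helper names are hypothetical.

```python
N_ORBITALS = 4               # one minimal-basis orbital per hydrogen atom
N_QUBITS = 2 * N_ORBITALS    # two spin-orbitals (qubits) per orbital
DIMENSION = 2 ** N_QUBITS    # 256, the number of rows of `ground_states`


def spin_orbital_occupations(index, n_qubits=N_QUBITS):
    """Occupation (0 or 1) of each qubit for a computational-basis index.

    Assumption: qubit 0 is the most significant bit of the index.
    """
    return tuple((index >> (n_qubits - 1 - q)) & 1 for q in range(n_qubits))


def orbital_occupations(index, n_qubits=N_QUBITS):
    """Number of electrons (0, 1 or 2) in each spatial orbital.

    Assumption: qubits 2k and 2k+1 belong to orbital k.
    """
    bits = spin_orbital_occupations(index, n_qubits)
    return tuple(bits[i] + bits[i + 1] for i in range(0, n_qubits, 2))


# Basis state with the first four spin-orbitals occupied
print(orbital_occupations(0b11110000))  # (2, 2, 0, 0)
```

Under these assumptions, the row index of an amplitude tells directly how the 4 electrons are distributed over the canonical orbitals.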

The `canonical_to_oao` matrix encodes which linear combination of atomic orbitals needs to be taken to construct an orthonormal version of the atomic orbitals (Orthogonal Atomic Orbitals, OAO).
This might be useful in later stages of the project: using a Givens rotation circuit, we can change the state encoding so that each pair of qubits corresponds to one OAO.
This would allow us to connect the quantum state directly to the geometry, as each orbital would be "localized" at the position of the respective atom.

In [19]:
data_dict['canonical_to_oao'].round(3)

array([[ 0.696,  0.414,  0.414,  0.414],
       [-0.   ,  0.088, -0.747,  0.659],
       [ 0.   ,  0.812, -0.329, -0.482],
       [ 0.797, -0.349, -0.349, -0.349]])
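
To illustrate what orthogonalizing atomic orbitals means, here is a minimal sketch of symmetric (Löwdin) orthogonalization, a common way to build OAOs. This is an assumption: the notebook does not say how `generate_lib` computes `canonical_to_oao`, and the overlap matrix below is a made-up toy example for two orbitals, not data from this project.

```python
import numpy as np


def lowdin_orthogonalizer(overlap):
    """Return S^(-1/2) for a symmetric positive-definite overlap matrix S.

    Column j holds the atomic-orbital coefficients of the j-th
    orthogonalized orbital.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(overlap)
    return eigenvectors @ np.diag(eigenvalues ** -0.5) @ eigenvectors.T


# Toy overlap matrix of two normalized, non-orthogonal orbitals (illustrative values)
toy_overlap = np.array([[1.0, 0.5],
                        [0.5, 1.0]])

X = lowdin_orthogonalizer(toy_overlap)

# In the new basis the overlap matrix becomes the identity
print(np.allclose(X.T @ toy_overlap @ X, np.eye(2)))  # True
```

The symmetric choice keeps each orthogonalized orbital as close as possible to the atomic orbital it comes from, which is why such orbitals stay approximately localized on their atoms.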

# Utilities

## List files

In [21]:
import os
from load_lib import *
    
molecule_files = sorted(os.listdir(MOLECULES_DIR))
data_files = sorted(os.listdir(JSON_DIR))
    
print(f'MOLECULES_DIR content: {len(molecule_files)} files')
print(*molecule_files[:5], '...', sep='\n')
print(f'\nDATA_DIR content: {len(data_files)} files')
print(*data_files[:5], '...', sep='\n')

MOLECULES_DIR content: 107 files
H,0,0,0.0;H,0.4146,0,0.0;H,-0.1098,-0.7803,0.0;H,-0.3798,-0.2336,-0.3309.hdf5
H,0,0,0.0;H,0.4151,0,0.0;H,-0.3414,0.4347,0.0;H,0.9643,0.1957,0.1622.hdf5
H,0,0,0.0;H,0.4176,0,0.0;H,0.9165,-0.0483,0.0;H,0.284,1.2326,-0.0431.hdf5
H,0,0,0.0;H,0.4182,0,0.0;H,-0.491,0.2999,0.0;H,-0.3502,0.1393,-0.3468.hdf5
H,0,0,0.0;H,0.4304,0,0.0;H,0.5835,-0.8789,0.0;H,0.16,-0.4559,-0.5024.hdf5
...

DATA_DIR content: 107 files
H,0,0,0.0;H,0.4146,0,0.0;H,-0.1098,-0.7803,0.0;H,-0.3798,-0.2336,-0.3309.json
H,0,0,0.0;H,0.4151,0,0.0;H,-0.3414,0.4347,0.0;H,0.9643,0.1957,0.1622.json
H,0,0,0.0;H,0.4176,0,0.0;H,0.9165,-0.0483,0.0;H,0.284,1.2326,-0.0431.json
H,0,0,0.0;H,0.4182,0,0.0;H,-0.491,0.2999,0.0;H,-0.3502,0.1393,-0.3468.json
H,0,0,0.0;H,0.4304,0,0.0;H,0.5835,-0.8789,0.0;H,0.16,-0.4559,-0.5024.json
...


## Prompt to delete molecule and data files 

In [3]:
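import os
from load_lib import MOLECULES_DIR, JSON_DIR


def clean_data():
    """Ask for confirmation, then delete all files in MOLECULES_DIR and JSON_DIR."""
    print('remove all data files from MOLECULE_DIR? [y/n]')
    inp = input()
    if inp == 'y':
        for f in os.listdir(MOLECULES_DIR):
            # os.path.join works whether or not the directory ends with a separator
            os.remove(os.path.join(MOLECULES_DIR, f))
    print('remove all data files from JSON_DIR? [y/n]')
    inp = input()
    if inp == 'y':
        for f in os.listdir(JSON_DIR):
            os.remove(os.path.join(JSON_DIR, f))


clean_data()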

remove all data files from MOLECULE_DIR? [y/n]
y
remove all data files from JSON_DIR? [y/n]
y
