# QCPortal-Next

This notebook explores how to access QCArchive using the "next" branch of the code, i.e., the new development branch.

* Preliminary documentation of the new version can be found here:
https://molssi.github.io/QCFractal/index.html

* How to install:
https://molssi.github.io/QCFractal/user_guide/client_setup.html

## Notes: 
There does not appear to be a way to download the data to local storage at the moment (no HDF5 view); only an in-memory cache is available. The current version on GitHub does seem to allow setting a cache directory (it uses dbm to create a local database, which can be reused later to avoid downloading again), but the "stable" release on conda does not yet support this.

I will have to see whether the "next" branch is planned to have HDF5 support for local access. Generating our own HDF5 files from the data in QCArchive is probably the way to go for fast access.
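To make the dbm-backed caching idea in these notes concrete, here is a minimal sketch using only the Python standard library. It is not qcportal's implementation: `fetch_with_cache` and `fake_download` are hypothetical names, and the record contents are stand-in values rather than real data.

```python
import dbm
import json
import os
import tempfile


def fetch_with_cache(cache_path, key, fetch):
    """Return the value for `key`, calling `fetch(key)` only on a cache miss."""
    # 'c' opens the database for reading and writing, creating it if needed.
    with dbm.open(cache_path, 'c') as db:
        raw_key = key.encode('utf-8')
        if raw_key in db:
            return json.loads(db[raw_key].decode('utf-8'))
        value = fetch(key)
        # dbm stores bytes, so serialize the record as JSON first.
        db[raw_key] = json.dumps(value).encode('utf-8')
        return value


# Hypothetical stand-in for a slow server call; it records every invocation.
calls = []


def fake_download(name):
    calls.append(name)
    return {'record_name': name, 'energy': -40.0}  # stand-in value, not real data


cache_path = os.path.join(tempfile.mkdtemp(), 'records_cache')
first = fetch_with_cache(cache_path, 'dsgdb9nsd_000001', fake_download)
second = fetch_with_cache(cache_path, 'dsgdb9nsd_000001', fake_download)
print(first == second, len(calls))
```

The second call is served from the on-disk database, so the stand-in download function runs only once.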

In [1]:
from qcportal import PortalClient
import h5py
import numpy as np

In [2]:
class QM9:
    
    def __init__(self, name='QM9_b3lyp_def2-svp'):
        self._set_dataset_handle(dataset_type='singlepoint', dataset_name='QM9')
        self.name = name
        self.n_datapoints = len(self.ds.entry_names)
    
    @property
    def name(self):
        return self._name
        
    @name.setter
    def name(self, name):
        self._name = name 

    @property
    def n_datapoints(self):
        return self._n_datapoints
        
    @n_datapoints.setter
    def n_datapoints(self, n_points):
        self._n_datapoints = n_points 

    def description(self):
        return self.ds.dict()['description']

    def citations(self):
        return self.ds.metadata['citations']
        
    def _set_dataset_handle(self, dataset_type='singlepoint', dataset_name='QM9'):
        """Fetch a handle to the QM9 dataset from QCArchive via qcportal.
        
        Parameters
        ----------
        dataset_type : str, optional, default='singlepoint'
            The type of dataset to fetch, e.g. 'singlepoint'.
        dataset_name : str, optional, default='QM9'
            The name of the dataset on the server.
    
        """
        client = PortalClient()
    
        # fail loudly if the dataset cannot be fetched, instead of printing and continuing without self.ds
        try: 
            self.ds = client.get_dataset(dataset_type=dataset_type, dataset_name=dataset_name)
        except Exception as e:
            raise RuntimeError(f"Dataset {dataset_name} of type {dataset_type} is not available.") from e
        
        self.entry_names = self.ds.entry_names

    def get_record(self, index):
        if index >= self.n_datapoints:
            raise IndexError(f'Index {index} is out of range; {self.n_datapoints} datapoints in dataset.')
        
        # use the requested index, not the iterator counter self.n
        record_name = self.entry_names[index]
        # fetch the entry once and reuse the molecule dict for all fields
        molecule = self.ds.get_entry(record_name).dict()['molecule']
        geometry = molecule['geometry']
        atomic_numbers = molecule['atomic_numbers']
        mol_form = molecule['identifiers']['molecular_formula']

        # Note: there seems to be a small bug in how this dataset was set up. specification_name resolves to
        # what appear to be default assigned names (spec_1, spec_2, ...) instead of the actual methods.
        # In this case, spec_2 corresponds to 'method': 'b3lyp', 'basis': 'def2-svp'.
        
        energy = self.ds.get_record(record_name, specification_name='spec_2').dict()['properties']['dft functional total energy']
        result = {'record_name': record_name, 'molecular_formula': mol_form, 'energy': energy, 'atomic_numbers': atomic_numbers, 'geometry': geometry}
        return result
    
    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n < self.n_datapoints:
            result =  self.get_record(self.n)
            self.n += 1
            return result
        else:
            raise StopIteration
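
Because the specification names in this dataset are generic (spec_1, spec_2), it can help to look the name up from the method and basis rather than hard-coding it. The sketch below assumes the specifications can be represented as a plain dictionary; the dictionary is hand-written stand-in data, not fetched from the server, and `find_specification_name` is a hypothetical helper. Only the spec_2 entry reflects the note in the class above; spec_1 holds placeholder values.

```python
# Hand-written stand-in for the dataset's specifications (an assumption).
# Only spec_2 comes from the notebook; spec_1 uses placeholder values.
specifications = {
    'spec_1': {'method': 'placeholder-method', 'basis': 'placeholder-basis'},
    'spec_2': {'method': 'b3lyp', 'basis': 'def2-svp'},
}


def find_specification_name(specs, method, basis):
    """Return the name of the first specification matching method and basis."""
    for name, spec in specs.items():
        # Compare case-insensitively so 'B3LYP' matches 'b3lyp'.
        if spec['method'].lower() == method.lower() and spec['basis'].lower() == basis.lower():
            return name
    raise KeyError(f'No specification found for {method}/{basis}')


spec_name = find_specification_name(specifications, 'B3LYP', 'def2-SVP')
print(spec_name)
```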
    

In [3]:
qm9 = QM9()

In [4]:
qm9.description()

'Small organic molecules with up to 9 heavy atoms sampled from GDB-17, optimized at the B3LYP/6-31G(2df,p) level of theory. Ground state, orbital, and thermodynamic properties are available (at the B3LYP/6-31G(2df,p) level). All molecules are neutral singlets. This dataset was sourced from <a href="http://quantum-machine.org/datasets/">quantum-machine.org</a> and <a href="https://qmml.org/datasets.html">qmml.org</a>.'

In [5]:
qm9.citations()

[{'doi': '10.1021/ja902302h',
  'url': 'https://pubs.acs.org/doi/abs/10.1021/ja902302h',
  'bibtex': '\n@article{blum2009970,\n  title={970 million druglike small molecules for virtual screening in the chemical universe database GDB-13},\n  author={Blum, Lorenz C and Reymond, Jean-Louis},\n  journal={Journal of the American Chemical Society},\n  volume={131},\n  number={25},\n  pages={8732--8733},\n  year={2009},\n  publisher={ACS Publications}\n}\n',
  'acs_citation': ' Blum, L. C. &amp; Reymond, J.-L. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. <em>JACS, </em><b>2009</b><i>, 131</i>, 8732-8733.'},
 {'doi': '10.1038/sdata.2014.22',
  'url': 'https://www.nature.com/articles/sdata201422',
  'bibtex': '\n@article{ramakrishnan2014quantum,\n  title={Quantum chemistry structures and properties of 134 kilo molecules},\n  author={Ramakrishnan, Raghunathan and Dral, Pavlo O and Rupp, Matthias and Von Lilienfeld, O Anatole},\n  journal={S

Print the first 5 records, showing the record name, molecular formula, energy, atomic numbers, and geometry of each.

In [6]:
for i, record in enumerate(qm9):
    if i < 5:
        for item in record.items(): print(item)
        print('----')
    else:
        break

('record_name', 'dsgdb9nsd_000001')
('molecular_formula', 'CH4')
('energy', -40.48768186718216)
('atomic_numbers', array([6, 1, 1, 1, 1], dtype=int16))
('geometry', array([[-2.39960000e-02,  2.05187248e+00,  1.51196900e-02],
       [ 4.06370000e-03, -1.13975400e-02,  3.73433000e-03],
       [ 1.91189421e+00,  2.76608881e+00,  5.22650000e-04],
       [-1.02199236e+00,  2.73542886e+00, -1.65661653e+00],
       [-9.89864310e-01,  2.71729888e+00,  1.71284265e+00]]))
----
('record_name', 'dsgdb9nsd_000002')
('molecular_formula', 'H3N')
('energy', -56.50942558886665)
('atomic_numbers', array([7, 1, 1, 1], dtype=int16))
('geometry', array([[-0.07639417,  1.93528318,  0.11822845],
       [ 0.03261188,  0.023707  , -0.05173533],
       [ 1.73059109,  2.56765629, -0.05434429],
       [-0.98318243,  2.53890776, -1.46556314]]))
----
('record_name', 'dsgdb9nsd_000003')
('molecular_formula', 'H2O')
('energy', -76.35828223086327)
('atomic_numbers', array([8, 1, 1], dtype=int16))
('geometry', array([[
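
The enumerate-and-break loop above can also be written with `itertools.islice`, which stops after a fixed number of items without an explicit counter. This is a minimal sketch: `fake_records` is a hypothetical stand-in for iterating over the `QM9` object, and its record names are made up.

```python
from itertools import islice


def fake_records():
    """Hypothetical stand-in for iterating over the QM9 object."""
    n = 0
    while True:
        n += 1
        yield {'record_name': f'record_{n:06d}'}


# islice takes the first 5 items and then stops pulling from the iterator.
first_five = list(islice(fake_records(), 5))
for record in first_five:
    for item in record.items():
        print(item)
    print('----')
```

With the class defined earlier, the equivalent call would be `islice(qm9, 5)`, since `islice` calls `iter()` on its argument.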

In [None]:
qm9.ds.get_entry('dsgdb9nsd_000001').dict()['molecule']['atomic_numbers']


Example of grabbing info from an individual record:

In [None]:
qm9.ds.get_record('dsgdb9nsd_000001', specification_name='spec_2').dict()


Main info in the top-level dataset:

In [None]:
qm9.ds.dict()

Generate a minimal hdf5 file from the first 20 entries as a test. 

In [None]:
dt = h5py.special_dtype(vlen=str) 

# the context manager closes the file even if writing a record fails
with h5py.File('n20_qm9.hdf5', 'w') as f:
    for i, record in enumerate(qm9):
        if i >= 20:
            break
        group = f.create_group(record['record_name'])

        group.create_dataset(name='energy', data=record['energy'])
        group.create_dataset(name='molecular_formula', data=record['molecular_formula'], dtype=dt)
        group.create_dataset(name='index', data=i)
        group.create_dataset(name='geometry', data=record['geometry'], chunks=True, compression='gzip')
        group.create_dataset(name='atomic_numbers', data=np.array(record['atomic_numbers']), chunks=True, compression='gzip')

Read in the file we just created and shuffle the entries.

In [None]:
with h5py.File('n20_qm9.hdf5','r') as hf: 
    print('n_entries ', len(hf.keys()))

    mols = [mol for mol in hf.keys()]
    
    np.random.shuffle(mols)
    for mol in mols:
        print(f'---\n{mol}')
        for it in hf[mol].keys():
            print(f'{it} {hf[mol][it][()]}')
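
The shuffle above gives a different order on every run. If a repeatable order is wanted, one option is a seeded generator. The sketch below uses the standard library's `random.Random` instead of `np.random.shuffle`; `shuffled` is a hypothetical helper, and the `mols` list is a stand-in for the keys read from the file.

```python
import random

# Stand-in for list(hf.keys()); the names follow the pattern seen in the output above.
mols = [f'dsgdb9nsd_{i:06d}' for i in range(1, 21)]


def shuffled(names, seed):
    """Return a shuffled copy of `names`; the same seed gives the same order."""
    rng = random.Random(seed)
    out = list(names)  # copy so the input list is left untouched
    rng.shuffle(out)
    return out


order_a = shuffled(mols, seed=0)
order_b = shuffled(mols, seed=0)
print(order_a == order_b)
```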

In [None]:
dt = h5py.special_dtype(vlen=str) 

# write every record in the dataset; the context manager closes the file even if a record fails
with h5py.File('test_qm9.hdf5', 'w') as f:
    for i, record in enumerate(qm9):
        group = f.create_group(record['record_name'])

        group.create_dataset(name='energy', data=record['energy'])
        group.create_dataset(name='molecular_formula', data=record['molecular_formula'], dtype=dt)
        group.create_dataset(name='index', data=i)
        group.create_dataset(name='geometry', data=record['geometry'], chunks=True, compression='gzip')
        group.create_dataset(name='atomic_numbers', data=np.array(record['atomic_numbers']), chunks=True, compression='gzip')