# Preparing and loading your data
This tutorial introduces how SchNetPack stores and loads data.
Before we can start training neural networks with SchNetPack, we need to prepare our data.
This is because SchNetPack has to stream the reference data from disk during training in order to be able to handle large datasets.
Therefore, it is crucial to use a data format that allows for fast random read access.
We found that the [ASE database format](https://wiki.fysik.dtu.dk/ase/ase/db/db.html) meets this requirement well.
To further improve the performance, we internally encode properties in binary.
However, as long as you only access the ASE database via the provided SchNetPack `AtomsData` class, you don't have to worry about that.
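To make the idea of fast random read access to binary-encoded properties more concrete, here is a minimal sketch using Python's built-in `sqlite3` and `struct` modules (ASE `.db` files are SQLite databases). The table name `systems`, its columns, the helper `load_energy` and the example values are hypothetical and chosen for illustration only; this is not the schema that ASE or SchNetPack actually uses.

```python
import sqlite3
import struct

# Hypothetical toy schema; the real ASE/SchNetPack layout differs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE systems (id INTEGER PRIMARY KEY, energy BLOB)")

# Pack each energy as a 4-byte little-endian float32 before storing it.
energies = [-97208.41, -97208.375, -97208.04]  # example values only
conn.executemany(
    "INSERT INTO systems (id, energy) VALUES (?, ?)",
    [(i, struct.pack("<f", e)) for i, e in enumerate(energies)],
)


def load_energy(idx):
    """Fetch one row by primary key and decode its binary payload."""
    row = conn.execute(
        "SELECT energy FROM systems WHERE id = ?", (idx,)
    ).fetchone()
    return struct.unpack("<f", row[0])[0]


# Any entry can be read directly, without scanning the preceding ones.
print(load_energy(2))
```

Looking up a row by its primary key is what allows a training loop to draw samples in random order without reading the whole file.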

In [1]:
from schnetpack import AtomsData

## Predefined datasets
SchNetPack supports several benchmark datasets that can be used without preparation.
Each one can be accessed using a corresponding class that inherits from `DownloadableAtomsData`, which supports automatic download and conversion. Here, we show how to use these datasets using the QM9 benchmark as an example.

First, we have to import the dataset class and instantiate it. This will automatically download the data to the specified location.

In [2]:
from schnetpack.datasets import QM9

qm9data = QM9('./qm9.db', download=True)

Let's have a closer look at this dataset.
We can find out how large it is and which properties it supports:

In [3]:
print('Number of reference calculations:', len(qm9data))
print('Available properties:')

for p in qm9data.available_properties:
    print('-', p)

Number of reference calculations: 133885
Available properties:
- rotational_constant_A
- rotational_constant_B
- rotational_constant_C
- dipole_moment
- isotropic_polarizability
- homo
- lumo
- gap
- electronic_spatial_extent
- zpve
- energy_U0
- energy_U
- enthalpy_H
- free_energy
- heat_capacity


We can load data points using zero-based indexing. The result is a dictionary containing the geometry and properties:

In [4]:
example = qm9data[0]
print('Properties:')

for k, v in example.items():
    print('-', k, ':', v.shape)

Properties:
- rotational_constant_A : torch.Size([1])
- rotational_constant_B : torch.Size([1])
- rotational_constant_C : torch.Size([1])
- dipole_moment : torch.Size([1])
- isotropic_polarizability : torch.Size([1])
- homo : torch.Size([1])
- lumo : torch.Size([1])
- gap : torch.Size([1])
- electronic_spatial_extent : torch.Size([1])
- zpve : torch.Size([1])
- energy_U0 : torch.Size([1])
- energy_U : torch.Size([1])
- enthalpy_H : torch.Size([1])
- free_energy : torch.Size([1])
- heat_capacity : torch.Size([1])
- _atomic_numbers : torch.Size([5])
- _positions : torch.Size([5, 3])
- _cell : torch.Size([3, 3])
- _neighbors : torch.Size([5, 4])
- _cell_offset : torch.Size([5, 4, 3])
- _idx : torch.Size([1])


We see that all available properties have been loaded as torch tensors with the given shapes. Keys with a leading underscore are reserved for internal use. They cover the geometry (`_atomic_numbers`, `_positions`, `_cell`), the index within the dataset (`_idx`), and information about neighboring atoms and periodic boundary conditions (`_neighbors`, `_cell_offset`).
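The naming convention makes it easy to separate reserved entries from reference properties. The following sketch uses a plain dictionary of shapes as a stand-in for `qm9data[0]` (a subset of the keys printed above), so it runs without SchNetPack; the variable names are illustrative assumptions.

```python
# Stand-in for `qm9data[0]`: keys mapped to shapes copied from the output above
example_shapes = {
    "energy_U0": (1,),
    "homo": (1,),
    "_atomic_numbers": (5,),
    "_positions": (5, 3),
    "_idx": (1,),
}

# Keys with a leading underscore are reserved; the rest are properties
reserved = {k: s for k, s in example_shapes.items() if k.startswith("_")}
properties = {k: s for k, s in example_shapes.items() if not k.startswith("_")}

print("Reserved keys:", sorted(reserved))
print("Property keys:", sorted(properties))
```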

<div class="alert alert-info">
**Note:** Neighbors are collected using an `EnvironmentProvider`, which can be passed to the `AtomsData` constructor. The default is the `SimpleEnvironmentProvider`, which constructs the neighbor list using a full distance matrix. This is suitable for small molecules. We also supply cutoff-based environment providers (`AseEnvironmentProvider`, `TorchEnvironmentProvider`) that can handle larger molecules and periodic boundary conditions.
</div>
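The shapes printed above are consistent with every atom having all other atoms as neighbors: 5 atoms yield a `_neighbors` tensor of shape `[5, 4]`. A minimal sketch of such a full neighbor list in plain Python follows. The helper name `full_neighbor_list` is hypothetical, and the code illustrates the idea rather than the actual `SimpleEnvironmentProvider` implementation.

```python
def full_neighbor_list(n_atoms):
    """Return, for every atom i, the indices of all other atoms j != i."""
    return [[j for j in range(n_atoms) if j != i] for i in range(n_atoms)]


neighbors = full_neighbor_list(5)  # methane (CH4) has 5 atoms
for i, row in enumerate(neighbors):
    print(i, row)
```

Because the number of entries grows quadratically with the number of atoms, this approach is only practical for small molecules, which is where cutoff-based providers come in.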

We can directly obtain an ASE atoms object as follows:

In [5]:
at = qm9data.get_atoms(idx=0)
print('Atoms object:', at)

at2, props = qm9data.get_properties(idx=0)
print('Atoms object (not the same):', at2)
print('Equivalent:', at2 == at, '; not the same object:', at2 is at)

Atoms object: Atoms(symbols='CH4', pbc=False)
Atoms object (not the same): Atoms(symbols='CH4', pbc=False)
Equivalent: True ; not the same object: False


All property names are also predefined as class variables for convenient access:

In [6]:
print('Total energy at 0K:', props[QM9.U0])
print('HOMO:', props[QM9.homo])

Total energy at 0K: tensor([-1101.4878])
HOMO: tensor([-10.5499])


## Preparing your own data
In the following we will create an ASE database from our own data.
For this tutorial, we will use a dataset containing a molecular dynamics (MD) trajectory of ethanol, which can be downloaded [here](http://quantum-machine.org/gdml/data/xyz/ethanol_dft.zip).

In [7]:
import os
if not os.path.exists('./ethanol_dft.zip'):
    !wget http://quantum-machine.org/gdml/data/xyz/ethanol_dft.zip
        
if not os.path.exists('./ethanol.xyz'):
    !unzip ./ethanol_dft.zip

The dataset is in xyz format with the total energy given in the comment line. For this kind of data, we supply a script that converts it into the SchNetPack ASE DB format.
```
spk_parse.py ./ethanol.xyz ./ethanol.db --atomic_properties Properties=species:S:1:pos:R:3:forces:R:3 --molecular_properties energy
```
It is generally possible to use the parsing script for other datasets, too. Currently, the script supports the **xyz** and **extended xyz** file formats (see this [link](https://libatoms.github.io/QUIP/io.html#extendedxyz) for further information). Both file formats consist of one or more time steps of a molecular dynamics trajectory with different atomic and/or molecular properties. Each time step starts with a line containing the number of atoms, followed by a comment line. The remaining lines contain the atomic properties, starting with the first atom of the molecule. Different properties are separated by tabs.

The comment line distinguishes the **basic** from the **extended** file format. While **basic xyz** files only have unlabeled molecular properties in their comment line, **extended xyz** files provide the molecular properties in a dict-style manner and also describe the atomic properties via a property string. The property string starts with "Properties=" and is followed by the column names, data types and numbers of columns. The default property string with atomic species and positions is `Properties=species:S:1:pos:R:3`. If your **xyz** file also contains other atomic properties, you need to append them to the property string with the argument ``--atomic_properties <property name>:<property type>:<number of columns>``. The only two property types are ``S`` for **strings** and ``R`` for **numeric data types**. It is also possible to pass the full property string to the script. If your file contains molecular data, you can define the property names with ``--molecular_properties p1 p2 ...``. If you use an **extended xyz** file, all information is already stored in the comment line, so the additional keywords of the parsing script can be ignored.
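To illustrate how a property string maps onto the columns of an atom line, here is a small sketch in plain Python. The helpers `parse_property_string` and `split_atom_line` are hypothetical and written for this tutorial; they are not part of `spk_parse.py` or ASE, and the numbers in the sample atom line are made up.

```python
def parse_property_string(prop_string):
    """Split 'Properties=name:type:ncols:...' into (name, type, ncols) triples."""
    prefix = "Properties="
    if not prop_string.startswith(prefix):
        raise ValueError("property string must start with 'Properties='")
    fields = prop_string[len(prefix):].split(":")
    if len(fields) % 3 != 0:
        raise ValueError("expected name:type:ncols triples")
    return [
        (fields[i], fields[i + 1], int(fields[i + 2]))
        for i in range(0, len(fields), 3)
    ]


def split_atom_line(line, columns):
    """Assign the whitespace-separated values of one atom line to named columns."""
    values = line.split()
    result, pos = {}, 0
    for name, kind, ncols in columns:
        chunk = values[pos:pos + ncols]
        # 'S' columns stay strings, 'R' columns are converted to floats
        result[name] = chunk if kind == "S" else [float(v) for v in chunk]
        pos += ncols
    return result


columns = parse_property_string("Properties=species:S:1:pos:R:3:forces:R:3")
atom = split_atom_line("C 0.0 0.1 0.2 1.0 -2.0 3.5", columns)  # made-up values
print(columns)
print(atom)
```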

In the following, we show how this can be done in general, so that you can apply it to any other data format.

First, we need to parse our data. For this we use the IO functionality supplied by ASE.
In order to create a SchNetPack DB, we require a **list of ASE `Atoms` objects** as well as a corresponding **list of dictionaries** `[{property_name1: property1_molecule1}, {property_name1: property1_molecule2}, ...]` containing the mapping from property names to values.

In [8]:
from ase.io import read
import numpy as np

# load atoms from xyz file. Here, we only parse the first 10 molecules
atoms = read('./ethanol.xyz', index=':10')

# ASE stores the comment line as a key of the info dictionary; here it holds the energy
print('Energy:', atoms[0].info)
print()

# parse properties as list of dictionaries
property_list = []
for at in atoms:
    # All properties need to be stored as numpy arrays.
    # Note: The shape for scalars should be (1,), not ()
    # Note: GPUs work best with float32 data
    energy = np.array([float(list(at.info.keys())[0])], dtype=np.float32)    
    property_list.append(
        {'energy': energy}
    )
    
print('Properties:', property_list)

Energy: {'-97208.40600498248': True}

Properties: [{'energy': array([-97208.41], dtype=float32)}, {'energy': array([-97208.375], dtype=float32)}, {'energy': array([-97208.04], dtype=float32)}, {'energy': array([-97207.5], dtype=float32)}, {'energy': array([-97206.84], dtype=float32)}, {'energy': array([-97206.1], dtype=float32)}, {'energy': array([-97205.266], dtype=float32)}, {'energy': array([-97204.29], dtype=float32)}, {'energy': array([-97203.16], dtype=float32)}, {'energy': array([-97201.875], dtype=float32)}]


Once we have our data in this format, it is straightforward to create a new SchNetPack DB and store it.

In [9]:
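# start from a clean database file (os was imported in the download cell above)
if os.path.exists('./new_dataset.db'):
    os.remove('./new_dataset.db')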
new_dataset = AtomsData('./new_dataset.db', available_properties=['energy'])
new_dataset.add_systems(atoms, property_list)

Now we can have a look at the data in the same way we did before for QM9:

In [10]:
print('Number of reference calculations:', len(new_dataset))
print('Available properties:')

for p in new_dataset.available_properties:
    print('-', p)
print()    

example = new_dataset[0]
print('Properties of molecule with id 0:')

for k, v in example.items():
    print('-', k, ':', v.shape)

Number of reference calculations: 10
Available properties:
- energy

Properties of molecule with id 0:
- energy : torch.Size([1])
- _atomic_numbers : torch.Size([9])
- _positions : torch.Size([9, 3])
- _cell : torch.Size([3, 3])
- _neighbors : torch.Size([9, 8])
- _cell_offset : torch.Size([9, 8, 3])
- _idx : torch.Size([1])


In the same way, we can store multiple properties, including atomic properties such as forces, or tensorial properties such as polarizability tensors.
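As a rough sketch of what a property list with more than one entry per molecule could look like, the example below pairs each energy with a per-atom force array. The key name `forces` is an assumption, the force values are zero placeholders rather than real reference data, and the energies are example values.

```python
import numpy as np

n_atoms = 9  # ethanol has 9 atoms

property_list = []
for energy_value in (-97208.41, -97208.375):  # example energies
    property_list.append(
        {
            # scalar property: shape (1,), not ()
            "energy": np.array([energy_value], dtype=np.float32),
            # atomic property: one 3-vector per atom (zeros as placeholder)
            "forces": np.zeros((n_atoms, 3), dtype=np.float32),
        }
    )

for props in property_list:
    print({name: (value.shape, str(value.dtype)) for name, value in props.items()})
```

Following the pattern used for the energy above, such a list would presumably be stored by listing both names in `available_properties` when constructing `AtomsData` and then passing the list to `add_systems`.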

In the following tutorials, we will describe how these datasets can be used to train neural networks.