# Writing Data

## Foreword on availability and potential data loss

The database is hosted on a single machine in SciNet's infrastructure. You access it through the network of UofT's computer science department. This means there are many potential points of failure (the CS network, SciNet's network, hardware failure). Please keep that in mind when using the database.

Of course, there is a daily backup. However, if a failure occurs just before the daily backup, we could lose a day of work. So please be careful and understanding and, most importantly, **keep the data on your side too**.

## Overview of the schema

The figure below shows the tables of the schema and their relationships. For now, all relationships are required, meaning that an `experiment_machine` cannot be added without being linked to a `lab`, and a `calculation` cannot exist without referencing a `conformer`. The direction of each arrow shows the direction of the dependency. For instance, a `conformer` needs to reference a `molecule`, not the other way around.

For most relationships, the referencing and referenced columns share the same name. For instance, both `lab` and `experiment_machine` have a column named `lab_id`. All ids in this schema have the type [UUID](https://en.wikipedia.org/wiki/Universally_unique_identifier).
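Since every id is a UUID, it can be handy to check locally that a string is a well-formed id before sending it to the database. The following is a minimal sketch using Python's standard `uuid` module; the helper name `is_valid_uuid` is hypothetical and is not part of the `mdb` client.

```python
import uuid


def is_valid_uuid(value):
    """Return True if `value` parses as a UUID (hypothetical helper, not part of mdb)."""
    try:
        uuid.UUID(str(value))
    except ValueError:
        return False
    return True


# A freshly generated UUID has the same format as the ids stored in the schema.
print(is_valid_uuid(str(uuid.uuid4())))
# A short name is not an id.
print(is_valid_uuid('MAD'))
```

The first call prints `True` and the second prints `False`.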

A complete overview of the schema can be found in the [Schema section](06_schema.rst).

## JSON fields

Most tables contain a column called `metadata` of type `json`. These columns are the dynamic part of the schema and can be defined by the users. Any data that can be serialized as [JSON](https://en.wikipedia.org/wiki/JSON) can be stored in these columns.

**We will need a naming convention so that everyone uses the same keys and structure.**

When updating a `json` column, the database merges the new keys into the current content of the row. If a provided key matches an existing key, its content is updated.
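The update rule described above behaves like a dictionary merge. The sketch below mimics it with plain Python dictionaries. It is an illustration only, not the database's implementation, and it assumes the merge is shallow (top-level keys only), which the description above does not specify. `merge_metadata` is a hypothetical helper and the keys are made up.

```python
def merge_metadata(current, new):
    """Return a copy of `current` with the keys of `new` added or overwritten."""
    merged = dict(current)  # copy, so the original content is left untouched
    merged.update(new)      # new keys are added, matching keys are replaced
    return merged


current = {'brand': 'Mad Brand', 'room': 'B12'}
new = {'room': 'C3', 'calibrated': True}

print(merge_metadata(current, new))
# {'brand': 'Mad Brand', 'room': 'C3', 'calibrated': True}
```

Under this assumption, a nested dictionary stored under an existing key would be replaced as a whole rather than merged key by key.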

## From an empty database to an experiment

In this section, we will show how to add data to an empty database, considering only the "experimentalist" side of the schema. Our goal is to add a row to the table `experiment` with its `xy_data`. To satisfy the relationship constraints, we need to fill rows in the following tables: `experiment_machine`, `experiment_type`, `lab`, `synthesis`, `synthesis_machine` and `data_units`.

First we need to connect to a database. Here I am using a database that is running on my computer. You can safely ignore the warning when initializing the client.

In [1]:
import mdb

client = mdb.MDBClient(hostname='localhost',
                       username='postgres',
                       password='',
                       database='mdb')

  "Did not recognize type '%s' of column '%s'" % (attype, name)
  % (item.__module__, item.__name__)


Then, let's add a lab. A lab has a `name` and a `short_name`, both of which have to be unique. The `short_name` is just 3 letters, there for convenience when selecting your lab.

In [2]:
rec = client.add_lab(name='Mad Lab', short_name='MAD')

All the `add_*` methods return their corresponding row in the [eventstore](05_event_sourcing.ipynb). This row contains all the information about the change made to the data in the database:

In [3]:
print(f'event:    {rec.event}')
print(f'type:     {rec.type}')
print(f'uuid:     {rec.uuid}')
print(f'data:     {rec.data}')
print(f'timestamp {rec.timestamp}')

event:    create
type:     lab
uuid:     2b8e9ae2-a5ce-4075-af85-8fdcedbd27fd
data:     {'name': 'Mad Lab', 'short_name': 'MAD'}
timestamp 2020-03-31 14:05:51.974644


Returning this information has the advantage that we do not need to query the database again to find the `lab_id` of the lab we just added: `rec.uuid` contains it. Thus we can go on and add a `synthesis_machine`.

In [4]:
rec = client.add_synthesis_machine(name='Mad Machine Doing Chemistry',
                                   make='MadChem',
                                   model='v2048',
                                   metadata={'brand': 'Mad Brand'},
                                   lab_id=rec.uuid)

Note that a `synthesis_machine` could also be a real human chemist!

In order to add a `synthesis` we need some molecules:

In [5]:
event = client.add_molecule_type('fragment')
client.add_molecule(smiles='A', molecule_type_id=event.uuid)
client.add_molecule(smiles='B', molecule_type_id=event.uuid)
client.add_molecule(smiles='C', molecule_type_id=event.uuid)

event = client.add_molecule_type('rule_based_molecule')
rec = client.add_molecule(smiles='A-UGLY-TYPO-C', 
                          molecule_type_id=event.uuid,
                          reactant_id=[client.get_id('molecule', smiles='A'),
                                        client.get_id('molecule', smiles='B'),
                                        client.get_id('molecule', smiles='C')])

There is a lot going on in this last chunk of code. Let's decompose it step-by-step:

First I added 3 fragments `A`, `B` and `C`. The only check the database makes on the SMILES is that they are unique. Providing valid, unique SMILES is up to the user.

Then I added a molecule made from the 3 fragments. This time, instead of getting the fragment ids from the output of `add_molecule`, I used the `get_id` method. It takes the name of a table as its first argument, followed by the column values used to fetch the id. We could do the same for a lab:

In [6]:
client.get_id('lab', short_name='MAD')

'2b8e9ae2-a5ce-4075-af85-8fdcedbd27fd'

Lastly, I made an ugly typo. :(

Let's fix it!

First let's get all the molecules in the database (this can be slow if there are a lot of molecules):

In [7]:
df = client.get('molecule')
df

| | molecule_id | inchi | cid | molecule_type_id | created_on | metadata | iupac_name | cas | smiles | updated_on |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | d2c69e72-ff8a-47f1-b034-3b100a194f87 | | | 61c3157b-e646-4225-982c-c7bd5c63fa8c | 2020-03-31 14:06:21.403313 | {} | | | A-UGLY-TYPO-C | 2020-03-31 14:06:21.403313 |
| 1 | 553d4524-67a8-447a-b592-60f58488128e | InChI=1S/CH4/h1H4 | 297.0 | 8399fe19-2523-4b86-b504-3df5bdf39c9b | 2020-03-31 14:06:10.040051 | {} | methane | 74-82-8 | C | 2020-03-31 14:06:10.040051 |
| 2 | b209c5b1-ea69-47ef-8df0-5f7de2f710f2 | InChI=1S/BH3/h1H3 | 6331.0 | 8399fe19-2523-4b86-b504-3df5bdf39c9b | 2020-03-31 14:06:01.454134 | {} | borane | 13283-31-3 | B | 2020-03-31 14:06:01.454134 |
| 3 | 172cb0af-7b94-406a-9197-91aa1f40a74b | | | 8399fe19-2523-4b86-b504-3df5bdf39c9b | 2020-03-31 14:05:55.782201 | {} | | | A | 2020-03-31 14:05:55.782201 |


By default, the `get` method of the client returns a `pandas.DataFrame`.

Then let's edit the ugly typo and give it back to the client.

In [8]:
df.at[0, 'smiles'] = 'ABC'
rec = client.update('molecule', df)

print(f'event:    {rec[0].event}')
print(f'type:     {rec[0].type}')
print(f'uuid:     {rec[0].uuid}')
print(f'data:     {rec[0].data}')
print(f'timestamp {rec[0].timestamp}')

4it [00:00, 58.23it/s]

event:    update
type:     molecule
uuid:     d2c69e72-ff8a-47f1-b034-3b100a194f87
data:     {'cas': None, 'cid': None, 'inchi': None, 'smiles': 'ABC', 'metadata': {}, 'iupac_name': None, 'molecule_type_id': '61c3157b-e646-4225-982c-c7bd5c63fa8c'}
timestamp 2020-03-31 14:06:23.050206





Similarly to the `add_*` methods, the `update` method returns a list of records with the data of the change. Note the event is now `update` and not `create`.

The `update` method accepts a dictionary, a list of dictionaries or a `pandas.DataFrame` as its second argument.
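To illustrate how a single argument can cover these three input types, here is a minimal sketch that normalizes them to a list of dictionaries. This is not the `mdb` implementation: `normalize_rows` is a hypothetical helper, and the DataFrame case is detected by duck typing (any object with a `to_dict` method) so that the sketch runs without pandas installed.

```python
def normalize_rows(data):
    """Turn a dict, a list of dicts, or a DataFrame-like object into a list of dicts."""
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list):
        return [dict(row) for row in data]
    if hasattr(data, 'to_dict'):
        # pandas.DataFrame.to_dict('records') returns one dict per row.
        return data.to_dict('records')
    raise TypeError(f'unsupported type: {type(data).__name__}')


print(normalize_rows({'smiles': 'ABC'}))
# [{'smiles': 'ABC'}]
print(normalize_rows([{'smiles': 'A'}, {'smiles': 'B'}]))
# [{'smiles': 'A'}, {'smiles': 'B'}]
```

Normalizing early keeps the rest of an update routine simple, since it only ever has to loop over one dictionary per row.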

Now that we have a molecule with no typo, we can add a synthesis:

In [9]:
rec = client.add_synthesis(synthesis_machine_id=client.get_id('synthesis_machine', 
                                                              name='Mad Machine Doing Chemistry'),
                           targeted_molecule_id=client.get_id('molecule', smiles='ABC'),
                           xdl='<xdl><recipe></recipe></xdl>',
                           notes='')

When data is inserted into the `synthesis` table, some database magic happens and a human-readable id is added to the row:

In [10]:
rec.data['hid']

'MAD_2020-03-31_0'

This `hid` has the format `LAB_YYYY-MM-DD_N`, where LAB is the `lab.short_name` and N is an increasing number. The idea is that you can use this `hid` to identify your samples.

Now, maybe you will have some observations during the synthesis that you wish to record.
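If you need to recover the lab, the date and the counter from an `hid`, a small parser along these lines could work. It is a sketch based only on the example value `'MAD_2020-03-31_0'` shown above; the helper name `parse_hid` is hypothetical, and the pattern assumes a 3-letter short name and an ISO-formatted date.

```python
import re
from datetime import date

# Assumed layout, inferred from the example value: <3 letters>_<YYYY-MM-DD>_<number>
HID_PATTERN = re.compile(
    r'^(?P<lab>[A-Za-z]{3})_(?P<date>\d{4}-\d{2}-\d{2})_(?P<count>\d+)$')


def parse_hid(hid):
    """Split an hid such as 'MAD_2020-03-31_0' into (lab short name, date, counter)."""
    match = HID_PATTERN.match(hid)
    if match is None:
        raise ValueError(f'unexpected hid format: {hid!r}')
    return (match['lab'],
            date.fromisoformat(match['date']),
            int(match['count']))


print(parse_hid('MAD_2020-03-31_0'))
# ('MAD', datetime.date(2020, 3, 31), 0)
```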

Here is another way to update a row with its `uuid`:

In [11]:
notes = """
# Synthesis of ABC

Everything went so **smoooothly**, my yield is at **110%**!

"""

# This is another way of using the update method:
client.update('synthesis', {'notes': notes}, id=rec.uuid)

1it [00:00, 17.32it/s]


[<sqlalchemy.ext.automap.eventstore at 0x1111fdc18>]

Now, with a bit of python magic, we could render those notes:

In [12]:
from IPython.display import HTML
from markdown import markdown

synthesis = client.get('synthesis', filters=[client.models.synthesis.synthesis_id == rec.uuid])

HTML(markdown(synthesis.at[0, 'notes']))

Isn't that cool?

Let's keep going and add an `experiment_type`, an `experiment_machine` and an `experiment`:

In [13]:
# getting the lab_id
lab_id = client.get_id('lab', short_name='MAD')

# Adding an experiment type
exp_type = client.add_experiment_type('NMR')

# Adding a machine
exp_machine = client.add_experiment_machine(name='Mad NMR', 
                                            make='Super NMR',
                                            model='V3000',
                                            metadata={},
                                            experiment_type_id=exp_type.uuid,
                                            lab_id=lab_id)

# Adding an experiment
experiment = client.add_experiment(synthesis_id=synthesis.at[0, 'synthesis_id'],
                                   experiment_machine_id=exp_machine.uuid,
                                   raw_data_path=r'C:\On\The\Computer\The\Machine\Is\Connected\To.raw',
                                   metadata={'param1': 100, 'knob 22': 'on'},
                                   notes='')


The last step is to add data to this experiment. Let's generate some dummy data with numpy:

In [14]:
import numpy as np

x = np.linspace(0, 10, num=100)
y = np.cos(x)

Let's add a dummy `data_unit` and the `xy_data`. You might have data that is not (x, y) data but, say, (x, y1, y2); in that case you can add (x, y1) and (x, y2) as two separate entries in the database.

In [15]:
units = client.add_data_unit('stupid_units')

client.add_xy_data_experiment(experiment_id=experiment.uuid,
                              name='NMR v1',
                              x=x.tolist(),
                              y=y.tolist(),
                              x_units_id=units.uuid,
                              y_units_id=units.uuid)

<sqlalchemy.ext.automap.eventstore at 0x111493780>
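For the (x, y1, y2) case mentioned above, the split into separate (x, y) entries could be prepared as follows. This is a sketch using plain Python lists; `split_xy_series` and the naming scheme for the entries are hypothetical. Each resulting dictionary would then be passed to `add_xy_data_experiment` together with the `experiment_id` and the unit ids.

```python
import math


def split_xy_series(name, x, *ys):
    """Build one {'name', 'x', 'y'} entry per y series, all sharing the same x values."""
    entries = []
    for index, y in enumerate(ys, start=1):
        if len(y) != len(x):
            raise ValueError(f'y{index} has {len(y)} points, expected {len(x)}')
        entries.append({'name': f'{name} y{index}', 'x': list(x), 'y': list(y)})
    return entries


# Dummy data: one x axis shared by two y series.
x = [i * 0.1 for i in range(100)]
y1 = [math.cos(v) for v in x]
y2 = [math.sin(v) for v in x]

for entry in split_xy_series('NMR v1', x, y1, y2):
    print(entry['name'], len(entry['x']), len(entry['y']))
# NMR v1 y1 100 100
# NMR v1 y2 100 100
```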

## From an empty database to a calculation

See [API Reference](07_client_api.rst). (Yes, you've just been RTFM'ed in a manual, deal with it).