                      propbox 0.5

Summary
=======

Propbox is a Python package for computing molecular properties and
models, and handling the dependencies between the calculations.

The dependencies form a workflow. For example, the steps in building a
consensus model may look like this:

  - the input is a SMILES string
  - turn the SMILES into a molecule
  - desalt it and standardize the charge model
  - use the clean molecule to compute logP,
      molecular weight, and a few other descriptors
  - use the descriptors to compute model-1,
      model-2, and model-3
  - use model-1, model-2, and model-3 to compute
      a consensus model

Rather than arrange the steps by hand, propbox uses a set of resolvers
to fill out a table of properties. The table starts with the input
data - one row per record. You ask the table for the output columns
you want. If a property isn't available, the table asks the resolver
to fill in the missing column. That operation may require additional
data, in which case the resolver goes back to the table to ask for
those columns. This process continues recursively until it gets to
available data. (Or, if there's a cycle, until Python reaches its
maximum recursion depth and raises an exception.) Each resolver then
resolves the column data and the process unwinds until all of the
needed columns are filled in.
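
The recursive fill-in described above can be sketched in a few lines
of plain Python. This is only an illustration of the idea, not
propbox's own code: the RULES dictionary, the get_column() function,
and the 'n_letters' stand-in descriptor are all made up for this
example.

```python
# Toy sketch of recursive property resolution; NOT propbox's actual code.
# Each rule maps an output name to (input names, function). All names
# here are hypothetical.
RULES = {
    "n_letters": (("smiles",), lambda smiles: sum(ch.isalpha() for ch in smiles)),
    "model_1": (("n_letters",), lambda n: n * 2.0),
    "model_2": (("n_letters",), lambda n: n + 10.0),
    "consensus": (("model_1", "model_2"), lambda a, b: (a + b) / 2.0),
}

def get_column(table, name):
    """Return the column 'name', computing it (and its inputs) if missing."""
    if name not in table:
        input_names, func = RULES[name]
        # Recursive step: each input may itself need to be resolved.
        inputs = [get_column(table, input_name) for input_name in input_names]
        # Apply the function row by row and store the new column.
        table[name] = [func(*row) for row in zip(*inputs)]
    return table[name]

# The table starts with only the input data, one row per record.
table = {"smiles": ["C", "CCO"]}
consensus = get_column(table, "consensus")
```

Asking for 'consensus' fills in 'n_letters', 'model_1', and 'model_2'
along the way, and all of them stay in the table afterwards.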


Installation
============

This package does not yet support the standard Python installer. You
can run it from the current directory, or copy/move/link the 'propbox'
subdirectory to your location of choice.

License
=======

The propbox package is distributed under the MIT license. (See
COPYING.) The package includes a distribution of the third-party
pylru.py, which is copyright Jay Hutchinson and distributed under the
GPLv2 or later. (See COPYING.pylru.)



'rdprops' command-line tool
===========================

The 'rdprops' command-line program computes molecular descriptors
using the RDKit cheminformatics toolkit from rdkit.org . It implements
the descriptors from rdkit.Chem.Descriptors as well as a few versions
of SMILES strings.

By default it reads a SMILES file from stdin and writes the results to
stdout. I'll ask it to read from a named SMILES file instead, and only
show the first few lines of output::

  % ./rdprops tests/benzodiazepine.smi | head
  id	smiles	MolWt
  1688	CN1C(=O)CN=C(c2ccc(Cl)cc2)c2cc(Cl)ccc21	319.191
  1963	OCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1Cl)=NC2	359.216
  2118	Cc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1)=NC2	308.772
  2802	O=C1CN=C(c2ccccc2Cl)c2cc([N+](=O)[O-])ccc2N1	315.716
  2809	O=C(O)C1N=C(c2ccccc2)c2cc(Cl)ccc2NC1=O	314.728
  2997	O=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1	270.719
  3016	CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21	284.746
  3261	Clc1ccc2c(c1)C(c1ccccc1)=NCc1nncn1-2	294.745
  3299	CCOC(=O)C1N=C(c2ccccc2F)c2cc(Cl)ccc2NC1=O	360.772


The default output contains the record identifier ("id"), the
canonical isomeric SMILES ("smiles"), and the molecular weight
("MolWt"). Use the `--columns` option to specify different columns::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' | head
  id	HeavyAtomCount	MolWt
  1688	21	319.191
  1963	24	359.216
  2118	22	308.772
  2802	22	315.716
  2809	22	314.728
  2997	19	270.719
  3016	20	284.746
  3261	21	294.745
  3299	25	360.772

Propbox uses the RDKit descriptor names for the columns, and by
default uses the names for the column headers. You might prefer
a different header::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --headers 'ID,HEAVIES,MW' | head
  ID	HEAVIES	MW
  1688	21	319.191
  1963	24	359.216
  2118	22	308.772
  2802	22	315.716
  2809	22	314.728
  2997	19	270.719
  3016	20	284.746
  3261	21	294.745
  3299	25	360.772

or perhaps don't want a header at all::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --no-header | head
  1688	21	319.191
  1963	24	359.216
  2118	22	308.772
  2802	22	315.716
  2809	22	314.728
  2997	19	270.719
  3016	20	284.746
  3261	21	294.745
  3299	25	360.772
  3369	21	302.736

The default output is tab-separated, but you can change that with the
`--dialect` option, which can be one of 'tab', 'space', 'whitespace',
'excel' or 'excel-tab'. (The 'whitespace' option is the same as
'space', and the Excel dialects are as defined by Python's csv
module, and include the special rules for quoting)::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --dialect excel | head
  id,HeavyAtomCount,MolWt
  1688,21,319.191
  1963,24,359.216
  2118,22,308.772
  2802,22,315.716
  2809,22,314.728
  2997,19,270.719
  3016,20,284.746
  3261,21,294.745
  3299,25,360.772


List the available descriptors
------------------------------

Use the `--list` option to get a list of the available descriptors::

  % ./rdprops --list | wc -l
       124

That's rather a lot, so I'll elide some of them::

  % ./rdprops --list
  _chargeDescriptors
  BalabanJ
  BertzCT
  cansmiles
  chargeDescriptorVersion
  Chi0
  Chi0n
  Chi0v
     ...
  ExactMolWt
  FractionCSP3
  HallKierAlpha
  HeavyAtomCount
  HeavyAtomMolWt
  id
  input_format
  input_mol
  input_record
     ...
  mol
  MolLogP
  MolMR
  MolWt
  MolWt_version
  nci_iupac_name
  nci_names
    ...
  TPSA
    ...
  VSA_EState8
  VSA_EState9

A future version will include a way to get a description of each
descriptor.

What's also missing is a naming convention or some other mechanism to
describe if it makes sense to print a descriptor as text. For example,
the 'mol' property is the RDKit molecule object for the input
structure, after de-salting. It doesn't make sense to display the
opaque text representation of a molecule object ::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,mol' | head -5 
  id	mol
  1688	<rdkit.Chem.rdchem.Mol object at 0x105c44910>
  1963	<rdkit.Chem.rdchem.Mol object at 0x105c44980>
  2118	<rdkit.Chem.rdchem.Mol object at 0x105c449f0>
  2802	<rdkit.Chem.rdchem.Mol object at 0x105c44a60>

Similarly, the _chargeDescriptors property is another internal
property that shouldn't really be exposed. (I'll use this as an
example of how the quoting rules work for the 'excel' dialect.)::

  % ./rdprops tests/benzodiazepine.smi --columns 'id,_chargeDescriptors' --dialect excel | head -3
  id,_chargeDescriptors
  1688,"ChargeDescriptor(minCharge=-0.31319991842931816, maxCharge=0.24791727974294836)"
  1963,"ChargeDescriptor(minCharge=-0.38834256479943147, maxCharge=0.16298797813009208)"
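
The quoting seen above is the standard behavior of the 'excel' dialect
in Python's csv module, which quotes a field only when it contains the
delimiter, a quote character, or a line break. The following sketch
uses only the csv module. The format_rows() helper is a name made up
for this example and the descriptor text is abbreviated.

```python
import csv
import io

# Sample rows; the descriptor text is shortened for the example.
rows = [
    ["id", "_chargeDescriptors"],
    ["1688", "ChargeDescriptor(minCharge=-0.31, maxCharge=0.25)"],
]

def format_rows(rows, dialect):
    """Write the rows with the named csv dialect and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, dialect=dialect)
    writer.writerows(rows)
    return buffer.getvalue()

# 'excel' uses a comma delimiter, so the field containing a comma is quoted.
excel_text = format_rows(rows, "excel")

# 'excel-tab' uses a tab delimiter, so the same field needs no quotes.
excel_tab_text = format_rows(rows, "excel-tab")
```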

I may move to the convention that a leading '_', and perhaps also a
leading lowercase character, indicate an internal variable. Or I may
have some way to mark certain descriptors as only being for internal
use. Then again, I like how IPython supports adapters to, for example,
show inline images for a molecule in a table. Perhaps I'll do that.

Specify the format
------------------

Propbox uses the filename extension to determine the file format, and
to see if the file is gzip compressed. The following case-insensitive
extensions are supported:

  .smi, .ism, .isosmi - SMILES
  .smi.gz, .ism.gz, .isosmi.gz - gzip compressed SMILES

  .sdf, .sd, .mdl - SD file
  .sdf.gz, .sd.gz, .mdl.gz - gzip compressed SD file

If propbox does not recognize the file format extension, or if the
input comes from stdin, then it will assume the input is an
uncompressed file format.
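
A minimal sketch of this kind of extension-based detection follows.
It is not propbox's auto-detection code. The guess_format() name is
hypothetical, and falling back to 'smi' for an unrecognized extension
is an assumption based on rdprops expecting SMILES by default.

```python
def guess_format(filename, default="smi"):
    """Return (format, is_gzipped) guessed from the filename extension.

    Hypothetical helper; the 'smi' fallback is an assumption.
    """
    name = filename.lower()            # the extensions are case-insensitive
    is_gzipped = name.endswith(".gz")
    if is_gzipped:
        name = name[:-len(".gz")]      # look at the extension before ".gz"
    if name.endswith((".smi", ".ism", ".isosmi")):
        return "smi", is_gzipped
    if name.endswith((".sdf", ".sd", ".mdl")):
        return "sdf", is_gzipped
    # Unrecognized extension: fall back to an uncompressed default format.
    return default, False
```

For example, guess_format("tests/CHEMBL11862.sdf") gives
('sdf', False) and guess_format("DATA.SMI.GZ") gives ('smi', True).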

You can specify the format directly using `--format` instead of
depending on propbox's auto-detection code. For example, since rdprops
expects a SMILES file from stdin, piping in an SD file will cause a
problem::


  % ./rdprops < tests/CHEMBL11862.sdf
  [01:33:10] SMILES Parse Error: syntax error for input: CHEMBL11862
  [01:33:10] SMILES Parse Error: syntax error for input: SciTegic11101117232D
  Traceback (most recent call last):
    File "rdprops", line 9, in <module>
      rdprops.main()
    File "/Users/dalke/cvses/propbox/propbox/rdprops.py", line 174, in main
      ids_and_mols = list(batch_reader)
    File "/Users/dalke/cvses/propbox/propbox/rdkit_toolkit.py", line 183, in _read_smiles
      raise ValueError("Line %d is empty" % (lineno,))
  ValueError: Line 3 is empty
  

I'll instead tell it the input is an uncompressed SD file::

  % ./rdprops --format sdf < tests/CHEMBL11862.sdf
  id	smiles	MolWt
  CHEMBL11862	Oc1cc2c(cc1O)CNCC2	165.192

The supported formats are 'smi', 'smi.gz', 'sdf', and 'sdf.gz', with
the expected meanings.


Use an SD tag as a title
------------------------

By default propbox will use the title line of the SD file as the
identifier. Sometimes the identifier is instead in one of the tags, as
in ChEBI and older ChEMBL data sets, or you may want to use the InChI
or another primary key stored in a tag. Use the `--id-tag` option to
name the tag to use.

For example, the title line in CHEMBL11862.sdf is "CHEMBL11862"::

  % ./rdprops tests/CHEMBL11862.sdf
  id	smiles	MolWt
  CHEMBL11862	Oc1cc2c(cc1O)CNCC2	165.192

while the SD tag 'nci_iupac_name' contains the IUPAC name that I got
from passing the structure over to NCI::

  % ./rdprops --id-tag nci_iupac_name tests/CHEMBL11862.sdf 
  id	smiles	MolWt
  1,2,3,4-tetrahydroisoquinoline-6,7-diol	Oc1cc2c(cc1O)CNCC2	165.192
  



Reader arguments
----------------

The RDKit SMILES and SDF readers support a few options:

  SMILES:
    has_header - Is the first line of the SMILES file a
       header line? (boolean, with default of False)
       
    delimiter - How should the fields of a SMILES
       file be parsed? (One of 'space'/" ", 'tab'/"\t",
       'whitespace', or 'to-eol', with default of 'to-eol')

    sanitize - Should the newly parsed molecule be
       sanitized? (boolean, with default of True)


  SDF:
    strictParsing - Use strict parsing rules? (boolean,
       with default of True)
    
    removeHs - Should hydrogens be removed from the
       molecule? (boolean, with default of True)
    
    sanitize - same as in SMILES


The "delimiter" option is a bit unusual. Different people have
different interpretations of what a SMILES file is. The original
Daylight definition was that each line contains a SMILES, followed by
whitespace, and the rest of the line is the identifier.

In propbox (and in chemfp) this is called the 'to-eol' delimiter, and
is the default.

Other people think of a SMILES file as a space, tab, or whitespace
separated file, where the first column is the SMILES, the second
column is the identifier, and additional columns are ignored.  In
propbox these are referred to as the "space", "tab", and "whitespace"
delimiter styles, respectively. ("Whitespace" means that each word is
treated as its own field.)
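
The difference between the delimiter styles can be sketched with plain
string splitting. The split_smiles_line() function below is a
hypothetical illustration of the four styles as described here, not
the parser that propbox uses.

```python
def split_smiles_line(line, delimiter="to-eol"):
    """Split one SMILES file line into (smiles, identifier).

    Hypothetical helper. The identifier is None if the line has only
    one field.
    """
    line = line.rstrip("\r\n")
    if delimiter == "to-eol":
        # SMILES, then whitespace, then the rest of the line is the id.
        fields = line.split(None, 1)
    elif delimiter == "whitespace":
        # Every whitespace-separated word is its own field.
        fields = line.split()
    elif delimiter in ("space", " "):
        fields = line.split(" ")
    elif delimiter in ("tab", "\t"):
        fields = line.split("\t")
    else:
        raise ValueError("Unknown delimiter style: %r" % (delimiter,))
    if not fields or not fields[0]:
        raise ValueError("Missing SMILES field")
    smiles = fields[0]
    # Columns after the second one are ignored.
    identifier = fields[1] if len(fields) > 1 else None
    return smiles, identifier
```

With the line "CCO ethyl alcohol", the 'to-eol' style gives the
identifier "ethyl alcohol" while the 'whitespace' style gives "ethyl".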

You can specify these reader arguments on the command line. For
example, "tests/drugs.smi" is a file I got from Daylight many years
ago::

  % cat tests/drugs.smi 
  N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine
  c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine
  COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine
  OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lyseric acid
  CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD
  C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine
  C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin
  c1ncccc1C1CCCN1C nicotine
  CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
  C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a

Two of the identifiers, "lyseric acid" and "vitamin a", contain a
space in them. The default delimiter style is 'to-eol', which is why
the following shows the full names::

  % ./rdprops --columns 'id,MolWt' tests/drugs.smi
  id	MolWt
  Strychnine	334.419
  cocaine	303.358
  quinine	324.424
  lyseric acid	282.343
  LSD	323.44
  morphine	285.343
  heroin	369.417
  nicotine	162.236
  caffeine	194.194
  vitamin a	272.432

To specify the 'whitespace' delimiter style, use the `-R` parameter,
which takes a NAME=VALUE setting::

  % ./rdprops --columns 'id,MolWt' -R delimiter=whitespace tests/drugs.smi 
  id	MolWt
  Strychnine	334.419
  cocaine	303.358
  quinine	324.424
  lyseric	282.343
  LSD	323.44
  morphine	285.343
  heroin	369.417
  nicotine	162.236
  caffeine	194.194
  vitamin	272.432
  

The boolean reader args interpret the strings "True", "true", or "1" as
a true value, and "False", "false", or "0" as a false value. For
example, the following will skip the first line of drugs.smi on the
assumption that it's a header line::

  % ./rdprops --columns 'id,MolWt' -R has_header=true tests/drugs.smi
  id	MolWt
  cocaine	303.358
  quinine	324.424
  lyseric acid	282.343
  LSD	323.44
  morphine	285.343
  heroin	369.417
  nicotine	162.236
  caffeine	194.194
  vitamin a	272.432
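
One way to turn a NAME=VALUE setting into a Python value, following
the boolean rules above, might look like this. The function names and
the explicit list of boolean argument names are assumptions made for
the sketch. They are not taken from the rdprops source.

```python
# Hypothetical helpers; not the rdprops implementation.
TRUE_STRINGS = ("True", "true", "1")
FALSE_STRINGS = ("False", "false", "0")

# The boolean reader arguments listed in this section.
BOOLEAN_NAMES = ("has_header", "sanitize", "strictParsing", "removeHs")

def parse_boolean(text):
    """Convert one of the accepted strings to True or False."""
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError("Not a boolean value: %r" % (text,))

def parse_reader_arg(setting):
    """Split 'NAME=VALUE' into (name, value), converting booleans."""
    name, sep, value = setting.partition("=")
    if not sep or not name:
        raise ValueError("Expected NAME=VALUE, got %r" % (setting,))
    if name in BOOLEAN_NAMES:
        return name, parse_boolean(value)
    return name, value
```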
  

Batch size
----------

The 'nci_iupac_name' descriptor uses the NCI web service API to turn a SMILES
into an IUPAC name. This is mostly a proof-of-concept API, and it's
rather slow since I make a request for each record. (Does the NCI
resolver have a batch mode API?) Still, let's give it a whirl::

  % ./rdprops --columns 'id,nci_iupac_name' tests/drugs.smi
  id	nci_iupac_name
  Strychnine	*
  cocaine	methyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate
  quinine	(5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol
  lyseric acid	*
  LSD	*
  morphine	*
  heroin	*
  nicotine	3-(1-methylpyrrolidin-2-yl)pyridine
  caffeine	1,3,7-trimethylpurine-2,6-dione
  vitamin a	*

This took about 3 seconds, but you'll notice that there was no output
until everything was ready. This is because propbox by default
processes the records in batches of 1,000 records. It will compute the
properties for the first 1,000 structures, then display the result,
then compute the properties for the second 1,000 structures, then
display those results, etc.

I can ask it to process one record at a time using the `--batch-size`
parameter::

  % ./rdprops --columns 'id,nci_iupac_name' --batch-size 1 tests/drugs.smi 
  id	nci_iupac_name
  Strychnine	*
  cocaine	methyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate
  quinine	(5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol
  lyseric acid	*
  LSD	*
  morphine	*
  heroin	*
  nicotine	3-(1-methylpyrrolidin-2-yl)pyridine
  caffeine	1,3,7-trimethylpurine-2,6-dione
  vitamin a	*


(Propbox uses a '*' for records which had a problem. There is currently
no way to use another symbol.)

In the NCI case there is no timing difference between a batch size of
1 and of 1,000 records because the propbox NCI client makes one
request at a time. Batch mode exists because in some cases it's faster
to process N molecules at once than to process each one
individually. For example, in the future propbox might be able to send all of
the queries to the server in a single request, which would save a lot
of network overhead.

Use `--batch-size all` to process all of the structures in a single
batch.
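
The batching behavior can be sketched with itertools.islice. The
iter_batches() helper is hypothetical and only mirrors the behavior
described above, with None standing in for `--batch-size all`.

```python
from itertools import islice

def iter_batches(records, batch_size=1000):
    """Yield lists of up to batch_size records.

    Hypothetical helper. A batch_size of None puts everything in a
    single batch.
    """
    iterator = iter(records)
    if batch_size is None:
        batch = list(iterator)
        if batch:
            yield batch
        return
    while True:
        # Pull the next batch_size items without reading any further.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Results for a batch can be written as soon as that batch is done,
which is why a batch size of 1 shows output after every record.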


Add a resolver
--------------

Use `-r` or `--resolver` to add a resolver to the built-in resolver.

I'll cover the details in the next section. For an example of how it
works, I'll create a simple model based on the molecular weight and
the number of hydrogen bond donors. The descriptor will be called
'model', and located in a file called "model.py" in the current
directory (or somewhere else on the Python path)::

  % cat model.py
  
  from propbox import calculate, collect_resolvers
  
  @calculate()
  def calc_model(MolWt, NumHDonors):
    return MolWt * 12.34 / (NumHDonors + 1)
  
  resolver = collect_resolvers()

This is a non-standard resolver, so I need to tell rdprops the path
for how to load it::

  % ./rdprops --columns 'id,model' -r model.resolver tests/CHEMBL11862.sdf
  id	model
  CHEMBL11862	509.61732


To double-check, I'll get the molecular weight and number of hbond
donors to do the math myself::

  % ./rdprops --columns 'id,MolWt,NumHDonors,model' -r model.resolver tests/CHEMBL11862.sdf
  id	MolWt	NumHDonors	model
  CHEMBL11862	165.192	3	509.61732

And what do you know, it matches!

  >>> 165.192 * 12.34 / (3 + 1)
  509.61732000000001
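
In model.py the function's argument names match the descriptors it
needs, and the descriptor is called 'model' while the function is
called 'calc_model'. The following sketch shows one way a decorator
could pick up that information with the inspect module. It is a guess
at the general technique, not the propbox implementation. The
toy_calculate() decorator and the REGISTRY dictionary are made up
for the example.

```python
import inspect

# Hypothetical registry: output name -> (input names, function).
REGISTRY = {}

def toy_calculate(func):
    """Register func using its argument names as the input columns."""
    input_names = tuple(inspect.signature(func).parameters)
    output_name = func.__name__
    # Assumption: a leading "calc_" is not part of the descriptor name.
    if output_name.startswith("calc_"):
        output_name = output_name[len("calc_"):]
    REGISTRY[output_name] = (input_names, func)
    return func

@toy_calculate
def calc_model(MolWt, NumHDonors):
    return MolWt * 12.34 / (NumHDonors + 1)

# Look up the inputs by name and evaluate one record.
input_names, func = REGISTRY["model"]
row = {"MolWt": 165.192, "NumHDonors": 3}
value = func(*(row[name] for name in input_names))
```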
  

The propbox resolver framework
==============================


Propbox is built around two concepts: a table and a resolver. The rows
of the table are structure records, and the columns are molecular
properties, referenced by name. A resolver is an object which can fill
in columns of a table. A resolver may get columns from the table in
order to do its job.

Create a table
--------------

There are two ways to create a table; by rows ("records") or by
columns. I'll create a table with no resolver and a single column,
"smiles", with some SMILES data::

  >>> import propbox
  >>> table = propbox.make_table_from_columns(None, {"smiles": ["C", "O=O"]})
  >>> table.get_values("smiles")
  ['C', 'O=O']

Missing identifiers will be created automatically::

  >>> table.get_values("id")
  ['ID1', 'ID2']

or you can specify the identifiers yourself::

  >>> table = propbox.make_table_from_columns(None, 
  ...   {"smiles": ["C", "O=O"], "id": ["methane", "oxygen"]})
  >>> table.get_values("smiles")
  ['C', 'O=O']
  >>> table.get_values("id")
  ['methane', 'oxygen']
  
Use make_table_from_records() if you have per-record dictionary data::

  >>> table = propbox.make_table_from_records(None,
  ...   [{"smiles": "O=O", "id": "oxygen"}, {"smiles": "c1ccccc1O", "id": "phenol"}])
  >>> table.get_values("smiles")
  ['O=O', 'c1ccccc1O']
  >>> table.get_values("id")
  ['oxygen', 'phenol']


I used None as the resolver, but the None object doesn't support the
resolver protocol, so if I try to get a column that doesn't yet exist,
I'll get the following::

  >>> table.get_values("MW")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "propbox/__init__.py", line 709, in get_values
      futures = self.get_futures(name)
    File "propbox/__init__.py", line 693, in get_futures
      self.resolver.resolve_column(name, self)
  AttributeError: 'NoneType' object has no attribute 'resolve_column'


Define a resolver
-----------------

Here's a resolver which returns a constant value::

  from __future__ import print_function
  import propbox
  
  class Constant(propbox.Resolver):
      output_names = ["value"]
  
      def __init__(self, value):
          self.value = value
  
      def resolve_column(self, name, table):
          table.set_values("value", [self.value] * len(table))
  
  table = propbox.make_table_from_records(Constant(4), [{}, {}])
  print("ids", table.get_values("id"))
  print("values", table.get_values("value"))
  

This creates the following output::

  ids ['ID1', 'ID2']
  values [4, 4]

Here's how this works: the table doesn't know about the 'value'
column, so it asks the resolver to resolve the column 'value'. The
table passes itself as the 'table' argument, so the resolver can use
the table to get or set data.

The Constant resolver uses len(table) to get the number of rows in the
table -- two in this case -- and create the list [4, 4], which it then
uses to set the table column named 'value', which is then available to
the table.

The 'output_names' attribute contains a list of the column names that
the resolver can compute. It isn't actually used in this case, since
the table will ask the resolver to handle any unknown column. I could,
for example, ask for 'xyzzy' and it would call the resolver::

  table = propbox.make_table_from_records(Constant(4), [{}, {}])
  print("ids", table.get_values("id"))
  print("xyzzy", table.get_values("xyzzy"))


However, the table does double-check that the resolver adds the
requested column, so the above will generate the following error::

  ids ['ID1', 'ID2']
  Traceback (most recent call last):
    File "tmp.py", line 15, in <module>
      print("xyzzy", table.get_values("xyzzy"))
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 709, in get_values
      futures = self.get_futures(name)
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 698, in get_futures
      % (self.resolver, name))
  AssertionError: Resolver <__main__.Constant object at 0x1007e74d0> did not set values for column 'xyzzy'



Define a Propbox
----------------

A Propbox is a resolver which contains other resolvers. It uses the
'output_names' of the other resolvers to figure out which resolver to
use. For example, I'll modify the Constant resolver so I can specify
which column it will set::

  from __future__ import print_function
  import propbox
  
  class Constant(propbox.Resolver):
      def __init__(self, descriptor, value):
          self.value = value
          self.descriptor = descriptor
          self.output_names = [descriptor]
  
      def resolve_column(self, name, table):
          table.set_values(self.descriptor, [self.value] * len(table))
  

then create a Propbox which contains two Constants; one which sets
'value' to 8 and the other which sets 'xyzzy' to 13::

  resolver = propbox.Propbox()
  resolver.add_resolver(Constant("value", 8))
  resolver.add_resolver(Constant("xyzzy", 13))
  
and finally create a table which uses that Propbox resolver::

  table = propbox.make_table_from_records(resolver, [{}, {}])
  print("value", table.get_values("value"))
  print("xyzzy", table.get_values("xyzzy"))
  print("unknown", table.get_values("unknown"))
  

When I run it, I get the following output::

  value [8, 8]
  xyzzy [13, 13]
  Traceback (most recent call last):
    File "tmp.py", line 22, in <module>
      print("unknown", table.get_values("unknown"))
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 709, in get_values
      futures = self.get_futures(name)
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 693, in get_futures
      self.resolver.resolve_column(name, self)
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 171, in resolve_column
      raise PropboxKeyError(self, name)
  propbox.PropboxKeyError: unknown

In case you were wondering, the PropboxKeyError inherits from the
regular KeyError, as well as from propbox.PropboxError.
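
The practical effect of that double inheritance is that existing code
which catches KeyError keeps working. The classes below are stand-ins
written for this example. They are not the propbox classes, and the
order of the base classes is an assumption.

```python
# Stand-in classes for illustration only; not the propbox classes.
class ToyPropboxError(Exception):
    """Base class for all errors in this toy package."""

class ToyPropboxKeyError(ToyPropboxError, KeyError):
    """Raised when no resolver can provide the requested column."""

def lookup(columns, name):
    if name not in columns:
        raise ToyPropboxKeyError(name)
    return columns[name]

# A plain "except KeyError" catches the package-specific error.
try:
    lookup({"value": [8, 8]}, "unknown")
except KeyError as err:
    caught = err
```

The same exception would also be caught by "except ToyPropboxError",
which is handy for catching every error from the package at once.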


A resolver that uses the table
------------------------------

The Constant resolver is pretty boring. What about a resolver which
returns the length of the SMILES string, stored in the 'smiles'
column, and sets the column 'len'?::

  from __future__ import print_function
  import propbox
  
  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          table.set_values("len", [len(smiles) for smiles in smiles_list])
  
  
  resolver = Len()
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})
                                          
  print(table.get_values("len"))

The output from this is::

  [1, 3, 9]

which I think you expected.

What's new here is that the resolver asked the table to get the
"smiles" column. This is a recursive call, since the table was the one
to call the resolver in the first place.

The recursion might go several levels deep. I'll also create a
DoubleLen resolver, which sets "len2" to double the value of "len".
Since I have two resolvers, I'll need to put them into a Propbox::

  from __future__ import print_function
  import propbox
  
  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          table.set_values("len", [len(smiles) for smiles in smiles_list])
  
  class DoubleLen(propbox.Resolver):
      output_names = ["len2"]
      def resolve_column(self, name, table):
          len_list = table.get_values("len")
          table.set_values("len2", [value*2 for value in len_list])
  
  resolver = propbox.Propbox([Len(), DoubleLen()])
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})
  
  print(table.get_values("len2"))


Here's what happened:

  - The column "len2" does not exist, so the table asks the
      Propbox resolver to fill it in;
  - The Propbox resolver used the output_names to figure out
      that the DoubleLen resolver could resolve that column.
  - The DoubleLen resolver needs the values for the "len" column
      from the table;
  - The column 'len' doesn't exist, so the table asks the
      Propbox resolver to fill it in;
  - The Propbox resolver used the output_names to figure out
      that the Len resolver could resolve that column;
  - The Len resolver needs the values for the "smiles" column
      from the table;
  - The table returns the "smiles" column;
  - The Len resolver computes the string lengths and sets the
      values for the "len" column;
  - The DoubleLen resolver doubles those values, and sets the
      "len2" column;
  - The calculations are complete and returned to the caller.


All of the intermediate values are stored in the table in case they
are needed for additional calculations.
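
To make the call order and the stored intermediate values concrete,
here is a self-contained toy version that does not import propbox.
ToyTable is a made-up stand-in which folds the table and the Propbox
lookup into one class and records each resolver call in a log. The
real classes do more than this.

```python
# Toy stand-in for the table/resolver protocol; NOT the propbox classes.
class ToyTable:
    def __init__(self, resolvers, columns):
        self.columns = dict(columns)
        # Map each output name to the resolver which can produce it.
        self.by_output = {name: resolver
                          for resolver in resolvers
                          for name in resolver.output_names}
        self.log = []

    def get_values(self, name):
        if name not in self.columns:
            self.log.append("resolve " + name)
            self.by_output[name].resolve_column(name, self)
        return self.columns[name]

    def set_values(self, name, values):
        self.columns[name] = list(values)

class Len:
    output_names = ["len"]
    def resolve_column(self, name, table):
        smiles_list = table.get_values("smiles")
        table.set_values("len", [len(smiles) for smiles in smiles_list])

class DoubleLen:
    output_names = ["len2"]
    def resolve_column(self, name, table):
        len_list = table.get_values("len")
        table.set_values("len2", [value * 2 for value in len_list])

table = ToyTable([Len(), DoubleLen()], {"smiles": ["C", "C#N", "c1ccccc1O"]})
len2 = table.get_values("len2")
# "len" was stored while resolving "len2", so no resolver is called here.
len_again = table.get_values("len")
```

After this runs, the log holds only two entries, "resolve len2" and
"resolve len", even though "len" was requested twice.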


Futures
-------

What if there was an error during the calculation? For that matter,
how does a resolver even indicate an error?

You'll need to understand 'futures' to understand how errors work in
propbox.

A future is something which wraps a return value, or raised
exception. It's often used in modern asynchronous I/O libraries,
including Python 3.4, where it is used as a placeholder for the actual
return value, which will be available in the future. (It's called a
'promise' in some libraries.)

Propbox is not asynchronous, though I want it to go that way. I use the
'future' concept as a way to keep track of whether something was a
return value or an exception.

Here's what it looks like, using part of the propbox API that should
only be used by resolvers. I'll store the value 12 as if it were a
successfully computed descriptor::

  >>> from propbox import simple_futures
  >>> future = simple_futures.new_future(12)
  >>> future
  <propbox.simple_futures.Future object at 0x100666490>
  >>> future.result()
  12

This being Python, I can store anything in the future's result::

  >>> future = simple_futures.new_future("twelve")
  >>> future.result()
  'twelve'

I can even store an exception instance as the value::

  >>> future = simple_futures.new_future(ValueError("must be a string"))
  >>> future.result()
  ValueError('must be a string',)


What if I have a "real" exception, that is, something which shouldn't
be treated as a return value? I'll create a future that contains an
exception::

  >>> future = simple_futures.new_future_exception(ValueError("must be a string"))
  >>> future.result()
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "propbox/simple_futures.py", line 34, in result
      raise self._exception
  ValueError: must be a string

I asked the future for its result, but since it contained an
exception, it raised the exception.

(One limitation is that the exception no longer has stack
information. If you enable propbox.DEBUG=1 then you'll see the stack
trace printed to stderr when a resolver raises an unhandled exception.)

If you want the exception value, without going through the try/except
mechanism, then ask the future for it::

  >>> future.exception()
  ValueError('must be a string',)

This will return None if there was no exception.
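
The same result()/exception() behavior can be tried with the Future
class from Python's standard concurrent.futures module (Python 3.2 and
later). This is offered only as a comparison. It is an assumption that
the two classes behave alike beyond what is shown in this section.
Creating a Future by hand like this is meant for testing and
demonstration.

```python
from concurrent.futures import Future

# A future which holds an ordinary return value.
ok = Future()
ok.set_result(12)

# A future which holds an exception.
failed = Future()
failed.set_exception(ValueError("must be a string"))

# Asking a failed future for its result re-raises the exception.
try:
    failed.result()
except ValueError as err:
    message = str(err)

# exception() returns the exception without raising it,
# or None if there was no exception.
no_error = ok.exception()
stored_error = failed.exception()
```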


Setting descriptor exceptions
-----------------------------

Two sections earlier I used get_values() and set_values() to get and
set the column values. If there's an error then get_values() by
default will use None as a placeholder error value, and there's no way
to use set_values() to specify an error.

The get_values()/set_values() functions are really just wrappers
around the underlying futures data. You can access the futures
directly with get_futures() and set_futures(). In the following, I'll
modify the "len" resolver so it gives an error if the SMILES string
contains the letter 'O'::


  from __future__ import print_function
  import propbox
  from propbox import simple_futures
  
  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          futures = []
          for smiles in smiles_list:
              if "O" in smiles:
                  err = ValueError("No 'O's allowed: %r" % (smiles,))
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len(smiles))
              futures.append(future)
          table.set_futures("len", futures)
  
  resolver = Len()
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})
  
  print(table.get_values("len"))

The result from when I run this is::

  [1, 3, None]

because None is the placeholder error value. I can change that to
something else. In the following I use zero::

  print(table.get_values("len", 0))

which creates the output::

  [1, 3, 0]
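
The relationship between the futures and the values with an error
placeholder can be sketched as follows. This uses the standard
library's concurrent.futures.Future as a stand-in for propbox's own
futures. The completed() and values_from_futures() helpers are
hypothetical names for this example.

```python
from concurrent.futures import Future

def completed(value=None, exception=None):
    """Make an already finished future (hypothetical helper)."""
    future = Future()
    if exception is not None:
        future.set_exception(exception)
    else:
        future.set_result(value)
    return future

def values_from_futures(futures, error_value=None):
    """Return each future's result, or error_value if it holds an exception."""
    values = []
    for future in futures:
        if future.exception() is not None:
            values.append(error_value)
        else:
            values.append(future.result())
    return values

futures = [
    completed(1),
    completed(3),
    completed(exception=ValueError("No 'O's allowed: 'c1ccccc1O'")),
]
default_values = values_from_futures(futures)
zero_values = values_from_futures(futures, 0)
```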


Exception chaining (advanced)
-----------------------------

This is an advanced topic. In almost all cases you can use the
'Calculator' class in the next section, which handles exception
chaining automatically.

Suppose you want a "doubled" property, which is twice the "len"
property. What if "len" has an error? Since "len" is supposed to be a
number, it's easy to check if it's the None value, and do something
different in that case. In the following, the 'Len' class is unchanged
from the previous section. What's new is the 'DoubleLen' class, the
propbox which includes both Len and DoubleLen, and the output, where
this time I show the future's exception for each record::

  from __future__ import print_function
  import propbox
  from propbox import simple_futures
  
  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          futures = []
          for smiles in smiles_list:
              if "O" in smiles:
                  err = ValueError("No 'O's allowed: %r" % (smiles,))
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len(smiles))
              futures.append(future)
          table.set_futures("len", futures)
  
  # This version does not implement exception chaining
  class DoubleLen(propbox.Resolver):
      output_names = ["doubled"]
      def resolve_column(self, name, table):
          len_list = table.get_values("len")
          futures = []
          for len_value in len_list:
              if len_value is None:
                  err = Exception("No 'len' available")
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len_value*2)
              futures.append(future)
          table.set_futures("doubled", futures)
          
  resolver = propbox.Propbox([Len(), DoubleLen()])
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})
  
  print("id  exception")
  for id, double_future in zip(table.get_values("id"),
                               table.get_futures("doubled")):
      print(id, double_future.exception())

This gives the following output::

  id  exception
  ID1 None
  ID2 None
  ID3 No 'len' available
  
It would be nice to know *why* 'len' isn't available. Propbox
implements exception chaining, which is where one resolver exception
can wrap another, all the way back to the actual exception that caused
the problem.

To do that correctly, I'll need to make two changes. The first is to
wrap the ValueError of the Len class inside of a ResolverError
exception, so that callers know the descriptor that caused the
original problem. That's one new line of code in the following::

  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          futures = []
          for smiles in smiles_list:
              if "O" in smiles:
                  err = ValueError("No 'O's allowed: %r" % (smiles,))
                  # I added the next line for better exception chaining.
                  # It will include the name of the descriptor that has the problem.
                  err = propbox.ResolverError(err, table.table_name, "len")
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len(smiles))
              futures.append(future)
          table.set_futures("len", futures)


The second is to change DoubleLen to use get_futures() instead of
get_values(), and if one of the 'len' futures has an exception, to
wrap it inside of another ResolverError::


  class DoubleLen(propbox.Resolver):
      output_names = ["doubled"]
      def resolve_column(self, name, table):
          len_futures = table.get_futures("len")
          futures = []
          for len_future in len_futures:
              prev_exception = len_future.exception()
              if prev_exception is None:
                  # There was no error
                  len_value = len_future.result()
                  future = simple_futures.new_future(len_value*2)
              else:
                  # Create the chain.
                  err = propbox.ResolverError(prev_exception, table.table_name, "doubled")
                  future = simple_futures.new_future_exception(err)
              futures.append(future)
          table.set_futures("doubled", futures)

The resulting code now generates::

  id  exception
  ID1 None
  ID2 None
  ID3 doubled -> len: ValueError("No 'O's allowed: 'c1ccccc1O'",)

This is more helpful because it says that 'doubled' failed because
'len' failed because of a ValueError.

Use 'get_original_exception()' if you only care about the actual
exception that caused the problem, and not the full resolver chain, as
in the following variation, which only prints those records with an
error::

  print("id  exception")
  for id, double_future in zip(table.get_values("id"),
                               table.get_futures("doubled")):
      exception = double_future.exception()
      if exception is not None:
          print(id, exception.get_original_exception())


which produces the following output::

  id  exception
  ID3 No 'O's allowed: 'c1ccccc1O'
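
To make the chaining idea concrete, here is a minimal stand-alone
sketch in Python 3. It does not use propbox. 'ChainedError' and its
'chain_names()' method are hypothetical names I made up for
illustration; only the "doubled -> len" message layout and the
'get_original_exception()' name are taken from the examples above.

```python
class ChainedError(Exception):
    """Hypothetical stand-in for an error that wraps an earlier exception."""

    def __init__(self, cause, name):
        super().__init__(cause, name)
        self.cause = cause  # the wrapped exception (may itself be a ChainedError)
        self.name = name    # the property that could not be computed

    def chain_names(self):
        # Collect property names from the outermost failure to the innermost.
        names = [self.name]
        if isinstance(self.cause, ChainedError):
            names.extend(self.cause.chain_names())
        return names

    def get_original_exception(self):
        # Unwrap until reaching an exception that is not a ChainedError.
        err = self.cause
        while isinstance(err, ChainedError):
            err = err.cause
        return err

    def __str__(self):
        return "%s: %r" % (" -> ".join(self.chain_names()),
                           self.get_original_exception())


original = ValueError("No 'O's allowed: 'c1ccccc1O'")
len_error = ChainedError(original, "len")
doubled_error = ChainedError(len_error, "doubled")

print(doubled_error)                           # the full chain
print(doubled_error.get_original_exception())  # only the root cause
```

Each wrapper only needs to remember its own property name and the
exception it wraps. The full chain can then be rebuilt by walking the
links.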


Calculator
----------

The previous section showed the nitty-gritty of how to handle and
report errors during a calculation. For the most part, you don't need
to write that sort of low-level code. Instead, use the 'Calculator'
class. Here's an example::


  from __future__ import print_function
  import propbox
  from propbox import simple_futures
  
  class Len(propbox.Calculator):
      input_names = ["smiles"]
      output_names = ["len"]
      def calculate(self, name, table, input_values, output):
          for (smiles,) in input_values:
              if "O" in smiles:
                  output.add_exception(ValueError("No 'O's allowed: %r" % (smiles,)))
              else:
                  output.add_result(len(smiles))
                  
  class DoubleLen(propbox.Calculator):
      input_names = ["len"]
      output_names = ["doubled"]
      def calculate(self, name, table, input_values, output):
          for (len_value,) in input_values:
              output.add_result(len_value*2)
          
  resolver = propbox.Propbox([Len(), DoubleLen()])
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})
  
  print("id  exception")
  for id, doubled_future in zip(table.get_values("id"),
                                table.get_futures("doubled")):
      print(id, doubled_future.exception() or doubled_future.result())
  

If you compare it to the previous section you'll see it's much
shorter. If you skipped the previous section, then great! You didn't
need to read it to understand this section.

(For that matter, you can skip to 'Decorators for simple functions'
for an even easier way to handle this code.)

The 'input_names' attribute lists the columns that will be passed in
as input, and the calculator must set values for all of the
'output_names'. This is a bit more strict than a normal resolver,
which doesn't need to list its input names, and only needs to compute
the requested output name.

The Calculator's own resolve_column() will filter out the inputs which
contain an exception, and pass only the actual values to the
'calculate()' method. By "actual values" I mean there aren't even
placeholders for the error values. If one of the inputs causes an
exception then the Calculator will automatically set up the resolver
chain.

The values are passed to the "calculate()" function as a list of
lists, where the order depends on the order in input_names. For
example, in the following::

  class Example(propbox.Calculator):
      input_names = ["smiles", "doubled"]
      output_names = ["example"]
      def calculate(self, name, table, input_values, output):
          print("input_values", input_values)
          for (smiles, doubled) in input_values:
              output.add_result("2*len(%r)=%d" % (smiles, doubled))

the inputs are 'smiles' and 'doubled', which are passed in as::

  input_values [['C', 2], ['C#N', 6]]

This is row order, so the first element contains the columns for the
first non-error record, the second for the second non-error record,
and so on.
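
As a rough illustration of what "row order without error placeholders"
means, the following stand-alone sketch builds such a list from
per-column data. It is not propbox code. The use of None to mark a
failed record and the 'kept_rows' bookkeeping are my own
simplifications, not a description of how Calculator is implemented.

```python
# Each column holds one entry per record; None marks a record whose
# value could not be computed (a simplification of a failed future).
columns = {
    "smiles": ["C", "C#N", "c1ccccc1O"],
    "doubled": [2, 6, None],
}
input_names = ["smiles", "doubled"]

input_values = []   # row-order values for the records without errors
kept_rows = []      # original record index of each kept row
for row_index, row in enumerate(zip(*(columns[name] for name in input_names))):
    if any(value is None for value in row):
        continue  # no placeholder is emitted for an error record
    input_values.append(list(row))
    kept_rows.append(row_index)

print("input_values", input_values)
print("kept_rows", kept_rows)
```

Something like the 'kept_rows' list is needed so the computed results
can be put back next to the right records once the error rows have
been skipped.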


It's a bit tricky to remember that a single input name still gets a
list of lists, even though each inner list has only one element. I use
"for (len_value,) in" in the following as an explicit reminder that I
am unpacking a single-element list::

  class DoubleLen(propbox.Calculator):
      input_names = ["len"]
      output_names = ["doubled"]
      def calculate(self, name, table, input_values, output):
          for (len_value,) in input_values:
              output.add_result(len_value*2)

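Otherwise it's very easy to make a mistake and do "for len_value in
input_values". In that case len_value is a one-element list, and
"len_value*2" quietly produces a two-element list instead of a number.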

The 'name' and 'table' should look familiar by this time. While you
can use the table to get or set columns, you really shouldn't because
your code will end up interfering with the Calculator code. It's there
in case the calculator needs to access any of the table configuration
information, or needs to get/set a cache value on the table.


The 'output' term is special. Use it to specify information for the
current record number, or specify information for all of the remaining
records.

The 'add_result()' method can be used when output_names contains only
a single name. add_result() sets the corresponding column for the
current record, then advances to the next record.

Use 'add_results()' when there are more outputs. The function takes
the result values as a list or tuple, and sets the corresponding
futures. Here's an example of it in use::

  class Scaling(propbox.Calculator):
      input_names = ["len"]
      output_names = ["tripled", "third"]
      def calculate(self, name, table, input_values, output):
          for (n,) in input_values:
              output.add_results((n*3, n/3.0))

In this case the "n*3" sets the 'tripled' column, and 'n/3.0' sets the
'third' column.


Sometimes it's more convenient to set all of the results at once,
which you can do with the 'add_column_results()' method::

  class Scaling(propbox.Calculator):
      input_names = ["len"]
      output_names = ["tripled", "third"]
      def calculate(self, name, table, input_values, output):
          tripled_list = []
          third_list = []
          for (n,) in input_values:
              tripled_list.append(n*3)
              third_list.append(n/3.0)
          output.add_column_results( (tripled_list, third_list) )

However, this doesn't appear to be one of those cases where the result
is simpler.
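
The two styles carry the same data in different orientations. The
following plain-Python sketch (no propbox involved, with made-up
'len' values) shows that the per-record tuples and the per-column
lists are transposes of each other.

```python
len_values = [1, 3, 9]

# Row-wise: one (tripled, third) tuple per record.
rows = [(n * 3, n / 3.0) for n in len_values]

# Column-wise: one list per output name.
tripled_list, third_list = (list(column) for column in zip(*rows))

# Transposing the columns recovers the rows.
assert list(zip(tripled_list, third_list)) == rows

print(tripled_list)
print(third_list)
```

The column form tends to pay off when the values already arrive as
whole arrays, for example from a vectorized calculation.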


You saw already how to tell the output to use an exception for a given
row, by using the 'add_exception()' method::

  class Len(propbox.Calculator):
      input_names = ["smiles"]
      output_names = ["len"]
      def calculate(self, name, table, input_values, output):
          for (smiles,) in input_values:
              if "O" in smiles:
                  output.add_exception(ValueError("No 'O's allowed: %r" % (smiles,)))
              else:
                  output.add_result(len(smiles))

You can also specify futures directly, either for a single record via
'add_futures' or for whole columns via 'add_column_futures'.



CalculateName / CalculateNames
------------------------------

You may have functions which you want to turn into propbox
descriptors. The CalculateName and CalculateNames classes are
subclasses of Calculator which know how to call a function to compute a
property or a set of properties, respectively.

For example, here are three functions that might be useful for
propbox::

  from rdkit import Chem
  
  def smilin(smiles):
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
          raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
      return mol
  
  def num_heavies(mol):
      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)
  
  def heavy_range(mol):
      "Return the lightest and heaviest element numbers of the heavy atoms"
      atomic_nums = []
      for atom in mol.GetAtoms():
          atomic_num = atom.GetAtomicNum()
          if atomic_num > 1:
              atomic_nums.append(atomic_num)
      if not atomic_nums:
          return (0, 0)
      return min(atomic_nums), max(atomic_nums)
  

The first two only return a single value, so I'll use a CalculateName
instance for each of them. The last returns two values, so I'll use a
CalculateNames for it::

  import propbox
  
  resolver = propbox.Propbox([
      propbox.CalculateName(["smiles"], "mol", smilin),
      propbox.CalculateName(["mol"], "nHEAVIES", num_heavies),
      propbox.CalculateNames(["mol"], ["LIGHTEST_HEAVY", "HEAVIEST_HEAVY"], heavy_range),
      ])
  
The first parameter is the input_names list. The second parameter is
the output name (for CalculateName) or the list of output names (for
CalculateNames). The third parameter is the function to call.
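
Conceptually, wrapping a plain function this way means calling it once
per record and keeping any exception with that record instead of
aborting the whole batch. Here is a stand-alone sketch of that idea.
'apply_per_record' and 'count_carbons' are hypothetical names for this
illustration. They are not part of propbox, and the real classes may
be implemented differently.

```python
def apply_per_record(func, rows):
    """Call func once per row; return a list of (result, exception) pairs.

    Hypothetical helper for illustration only.
    """
    outcomes = []
    for args in rows:
        try:
            outcomes.append((func(*args), None))
        except Exception as err:
            # Keep the error with its record instead of stopping the batch.
            outcomes.append((None, err))
    return outcomes


def count_carbons(smiles):
    # Toy function: counts the letter C, which is not real chemistry.
    if not smiles:
        raise ValueError("empty SMILES")
    return smiles.upper().count("C")


outcomes = apply_per_record(count_carbons, [("CCO",), ("",), ("c1ccccc1",)])
for result, err in outcomes:
    print(result if err is None else "error: %s" % (err,))
```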

I'll use that resolver to make a table::

  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})

then use the table to generate CSV output, by default as a tab-separated file::

  import sys
  table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"])

The output in this case is::

  [14:20:24] SMILES Parse Error: syntax error for input: Q
  id	nHEAVIES	LIGHTEST_HEAVY	HEAVIEST_HEAVY
  ID1	1	6	6
  ID2	2	6	7
  ID3	*	*	*
  ID4	7	6	8
  ID5	7	9	92


Suppose though you want this in 'excel' format, which uses commas
instead of tabs, and knows how to quote terms that contain a
comma. And suppose you wanted to use '???' when the nHEAVIES could not
be computed, and 'n/a' for when the element range could not be
computed. In that case, pass one missing-value string per output
column (the "x" below is for the 'id' column, and does not show up in
this output)::

  import sys
  table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"],
             dialect="excel", missing_values=["x", "???", "n/a", "n/a"])

which generates::

  id,nHEAVIES,LIGHTEST_HEAVY,HEAVIEST_HEAVY
  ID1,1,6,6
  ID2,2,6,7
  ID3,???,n/a,n/a
  ID4,7,6,8
  ID5,7,9,92
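
The dialect names used here match the ones in Python's standard 'csv'
module. Whether 'table.save()' delegates to that module is an
assumption on my part, but the following stand-alone sketch shows how
the same effect (a dialect plus one placeholder per column) can be
had with 'csv.writer'. The 'format_table' helper and the use of None
for a missing value are made up for this example.

```python
import csv
import io

header = ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"]
rows = [
    ["ID1", 1, 6, 6],
    ["ID3", None, None, None],   # None marks a value that could not be computed
]
missing_values = ["x", "???", "n/a", "n/a"]  # one placeholder per column


def format_table(dialect):
    """Write the rows with the given csv dialect and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, dialect=dialect, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([missing if value is None else value
                         for value, missing in zip(row, missing_values)])
    return buffer.getvalue()


print(format_table("excel"))      # comma separated
print(format_table("excel-tab"))  # tab separated
```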


Decorators for simple functions
-------------------------------

The previous section assumed that you couldn't modify the code that
computed the property values. If on the other hand you can modify
them, then the 'propbox.calculate' decorator will create a
CalculateName (if output_names is a string) or CalculateNames (if
output_names is a list) for each function, and store it as the
function attribute 'propbox_resolver'.

The function 'collect_resolvers()' will look for resolvers in the
module's namespace. If an object has a 'propbox_resolver' attribute
then that will be used as a resolver. Objects which are instances of
propbox.Resolver will also be treated as resolvers. All of the
resolvers will be placed into a Propbox.

Here's an example::


  from __future__ import print_function
  
  from rdkit import Chem
  import propbox
  #propbox.DEBUG = True # Uncomment for a bit better debugging
  import sys
  
  
  @propbox.calculate(output_names="mol")
  def smilin(smiles):
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
          raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
      return mol
  
  @propbox.calculate(output_names="nHEAVIES")
  def num_heavies(mol):
      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)
  
  @propbox.calculate(output_names=["LIGHTEST_HEAVY", "HEAVIEST_HEAVY"])
  def heavy_range(mol):
      "Return the lightest and heaviest element numbers of the heavy atoms"
      atomic_nums = []
      for atom in mol.GetAtoms():
          atomic_num = atom.GetAtomicNum()
          if atomic_num > 1:
              atomic_nums.append(atomic_num)
      if not atomic_nums:
          return (0, 0)
      return min(atomic_nums), max(atomic_nums)
  
  
  resolver = propbox.collect_resolvers()
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})
  
  table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"],
             dialect="excel", missing_values=["x", "???", "n/a", "n/a"])

Not surprisingly, this gives the same output as before::

  id,nHEAVIES,LIGHTEST_HEAVY,HEAVIEST_HEAVY
  ID1,1,6,6
  ID2,2,6,7
  ID3,???,n/a,n/a
  ID4,7,6,8
  ID5,7,9,92

But wait! Why didn't I need to configure the list of input_names?

I could have. I could have said::

  @propbox.calculate(input_names=["smiles"], output_names="mol")
  def smilin(s):
      mol = Chem.MolFromSmiles(s)
      if mol is None:
          raise ValueError("RDKit cannot parse the SMILES %r" % (s,))
      return mol

If input_names isn't given then the decorator assumes that the
function's argument names are the expected property names. The
original function took a 'smiles' parameter, which happened to be the
same name as the property, so I let it be.

In this modified version, I changed the function to take an 's'
instead of a 'smiles', so I needed to specify the input_names to get
the input values from the 'smiles' column instead of the 's' column.
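
Reading argument names from a function is something Python's standard
'inspect' module can do. The sketch below shows one way a decorator
could pick its default input names in Python 3. 'infer_input_names'
is a hypothetical helper, and I am not claiming this is how
'propbox.calculate' does it internally.

```python
import inspect


def infer_input_names(func, input_names=None):
    """Return the explicit input_names, else the function's parameter names.

    Hypothetical helper for illustration only.
    """
    if input_names is not None:
        return list(input_names)
    return list(inspect.signature(func).parameters)


def smilin(smiles):
    return smiles


def smilin_short(s):
    return s


print(infer_input_names(smilin))
print(infer_input_names(smilin_short))
print(infer_input_names(smilin_short, input_names=["smiles"]))
```

The second call shows the failure mode described above: without an
explicit list, the 's' parameter would be looked up as an 's' column.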

There's another shortcut. If the function computes a single
descriptor, and the function name starts with 'calc_', then the rest
of the function name will be used as the descriptor.

That is, the first two functions could be rewritten as::

  @propbox.calculate()
  def calc_mol(smiles):
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
          raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
      return mol
  
  @propbox.calculate()
  def calc_nHEAVIES(mol):
      return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)


I'll use this technique to rewrite the 'len' and 'doubled' descriptors
from an earlier section::

  from __future__ import print_function
  
  import propbox
  import sys
  
  @propbox.calculate()
  def calc_len(smiles):
      if "O" in smiles:
          raise ValueError("No 'O's allowed: %r" % (smiles,))
      return len(smiles)
  
  @propbox.calculate()
  def calc_doubled(len):
      return len*2
  
  
  resolver = propbox.collect_resolvers()
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})
  
  table.save(sys.stdout, ["id", "doubled", "smiles"])
  
The output from this is::

  id	doubled	smiles
  ID1	2	C
  ID2	6	C#N
  ID3	2	Q
  ID4	*	c1ccccc1O
  ID5	38	[U](F)(F)(F)(F)(F)F

If I want to see the exception message I can traverse the rows myself::

  for id, doubled in table.get_future_rows(["id", "doubled"]):
      print(id.result(), doubled.exception() or doubled.result())

This prints::

  ID1 2
  ID2 6
  ID3 2
  ID4 doubled -> len: ValueError("No 'O's allowed: 'c1ccccc1O'",)
  ID5 38


Property Aliases
----------------

You'll sometimes need multiple names for the same descriptor.

For example, you might have a resolver which expects "MW" for the
molecular weight, but you follow the RDKit convention and use
"MolWt". Propbox comes with an "Aliases" resolver, which will forward
a request to the right value.

Here's an example, which says that a molecule is "large" if it has a
molecular weight of more than 75.0 (yes, this is made up)::

  from __future__ import print_function
  
  import propbox
  import sys
  
  from rdkit import Chem
  from rdkit.Chem import Descriptors
  
  
  @propbox.calculate()
  def calc_mol(smiles):
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
          raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
      return mol
  
  @propbox.calculate()
  def calc_MolWt(mol):
      return Descriptors.MolWt(mol)
  

  # This wants a "MW" property, but the molecular weight is available
  # as the "MolWt" property.
  @propbox.calculate()
  def calc_is_large(MW):
      return MW > 75.0

  # Set up an alias from 'MW' to 'MolWt'
  aliases = propbox.Aliases({"MW": "MolWt"})
      
  resolver = propbox.collect_resolvers()
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})
  
  table.save(sys.stdout, ["is_large", "smiles"])

The output table from the above is::

  is_large	smiles
  False	C
  False	C#N
  *	Q
  True	c1ccccc1O
  True	[U](F)(F)(F)(F)(F)F


The alias resolver also takes part in exception chaining. Printing the
exception for the 'Q' entry gives::

  MW -> MolWt -> mol: ValueError("RDKit cannot parse the SMILES 'Q'",)


This MW/MolWt example is a bit contrived. It's best that you make
everything use a consistent naming scheme.

You're more likely to use aliases as a predictive model evolves. You
might start off with a blood-brain barrier model called "BBB". After a
year, you retrain. Now you have BBB_v1 and BBB_v2. That's fine - you
can implement both models.

What aliases do is let you define that "BBB" means "the most recent
validated BBB model", and have it point to BBB_v1 while BBB_v2 is
under evaluation. Once it's validated, change the alias to point to
BBB_v2. (You'll likely need BBB_v1 for a while, since other models may
have been validated against it, and now need to be revalidated on
BBB_v2.)
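
The bookkeeping behind that kind of switch-over is small. In the
following stand-alone sketch an alias table is just a dictionary from
the public name to the current implementation name. 'resolve_alias'
is a hypothetical helper. Following chains of aliases and detecting
cycles are my own additions, and I don't know whether
'propbox.Aliases' does either.

```python
def resolve_alias(name, aliases):
    """Follow alias links until reaching a name that is not an alias."""
    seen = [name]
    while name in aliases:
        name = aliases[name]
        if name in seen:
            raise ValueError("alias cycle: %s" % " -> ".join(seen + [name]))
        seen.append(name)
    return name


aliases = {"BBB": "BBB_v1"}           # BBB_v2 is still under evaluation
print(resolve_alias("BBB", aliases))

aliases["BBB"] = "BBB_v2"             # BBB_v2 has been validated
print(resolve_alias("BBB", aliases))

# The versioned names stay directly available the whole time.
print(resolve_alias("BBB_v1", aliases))
```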


Property Modules
----------------

Over time you'll likely run into naming conflicts, where one person
uses $NAME to mean concept X while another person uses $NAME to mean
related concept Y. Or to make things more fun, you might have a
resolver based on OEChem, and another based on RDKit, and both provide
overlapping functionality.

Propbox implements a module system. The main resolver is in the main
module, and gets/sets the columns for the main table.

What a propbox.Module does is create a subtable. The resolver for the
module can get/set values in the subtable. The table and a subtable
are independent, so there is no conflict between the names.

The exceptions are the two sets of aliases. The module has a set of
output aliases which says that property X for the main table should be
resolved as property Y in the subtable. It also has input aliases,
which say that property B for the subtable should be resolved as
property A in the main table.

For example, the following file (named 'oceania.py') uses OEChem to
compute the OEGraphMols as 'mol', and the molecular weight as 'MW'. It
requires a SMILES string as the 'smiles' property::

  # This is 'oceania.py', based on OEChem
  from openeye.oechem import *
  
  from propbox import calculate, collect_resolvers
  
  @calculate()
  def calc_mol(smiles):
      mol = OEGraphMol()
      if OEParseSmiles(mol, smiles):
          return mol
      raise ValueError("OEChem cannot parse %r" % (smiles,))
  
  @calculate()
  def calc_MW(mol):
      return OECalculateMolecularWeight(mol)
  
  
  resolver = collect_resolvers()


While the following file (named 'eurasia.py') uses RDKit to compute
roughly equivalent properties::

  # This is 'eurasia.py', based on RDKit
  from rdkit import Chem
  from rdkit.Chem import Descriptors
  
  from propbox import calculate, collect_resolvers
  
  @calculate()
  def calc_mol(smiles):
      mol = Chem.MolFromSmiles(smiles)
      if mol is None:
          raise ValueError("RDKit cannot parse %r" % (smiles,))
      return mol
  
  @calculate()
  def calc_MW(mol):
      return Descriptors.MolWt(mol)
  
  
  resolver = collect_resolvers()

In "smith.py", I'll try to combine both resolvers into the same Propbox::


  import propbox
  
  import eurasia, oceania
  
  resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])

This doesn't work. Propbox complains, saying::

                   
  Traceback (most recent call last):
    File "smith.py", line 5, in <module>
      resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 134, in __init__
      self.add_resolver(resolver)
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 163, in add_resolver
      self._name_to_resolver[output_name]))
  ValueError: Resolver <propbox.Propbox object at 0x108d5f810> defines the output name 'mol', which was already defined by resolver <propbox.Propbox object at 0x108d4b910>


To resolve the conflict, I'll place the oceania resolver in its own
propbox.Module. I'll also say that "OE_MW" in the main table is an
alias for "MW" in the subtable, and that "smiles" in the subtable is
an alias for "smiles" in the main table::

  import sys
  import propbox
  
  import eurasia, oceania
  
  resolver = propbox.Propbox([
      eurasia.resolver,
      propbox.Module("oceania", oceania.resolver,
                     {"smiles": "smiles"},
                     {"OE_MW": "MW"}),
      ])
  
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["CC", "CCO", "O=O"]})
  
  table.save(sys.stdout, ["smiles", "MW", "OE_MW"])



As a result I can now compare the two molecular weights::

  smiles	MW	OE_MW
  CC	30.07	30.06904
  CCO	46.069	46.06844
  O=O	31.998	31.9988
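
To show why the subtable avoids the name clash, here is a toy model
that uses two plain dictionaries instead of propbox tables. Everything
in it is hypothetical: the 'sub_get' and 'main_get' helpers, the
direction I chose for the input alias dictionary (subtable name to
main table name), and the placeholder arithmetic standing in for a
real molecular weight calculation.

```python
main_table = {"smiles": ["CC", "CCO"], "MW": [30.07, 46.069]}
sub_table = {}   # the module's private columns

input_aliases = {"smiles": "smiles"}   # subtable name -> main table name
output_aliases = {"OE_MW": "MW"}       # main table name -> subtable name


def sub_get(name):
    # A missing subtable column may be pulled in from the main table.
    if name not in sub_table and name in input_aliases:
        sub_table[name] = main_table[input_aliases[name]]
    return sub_table[name]


# The module fills in its own "MW" without touching the main table's "MW".
# (Placeholder arithmetic, not a real molecular weight.)
sub_table["MW"] = [len(smiles) * 10.0 for smiles in sub_get("smiles")]


def main_get(name):
    # A missing main table column may be forwarded to the subtable.
    if name not in main_table and name in output_aliases:
        main_table[name] = sub_table[output_aliases[name]]
    return main_table[name]


print(main_get("MW"))      # the main table's own column
print(main_get("OE_MW"))   # forwarded to the subtable's "MW"
```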
  



Table configuration
-------------------

The table supports a "config" dictionary, which can be used to pass
configuration around. It's still experimental, and I don't really want
to document it.

It exists so you can define configuration information like the object
to use to de-salt, or if you don't want to specify the object, the
configuration file or the configuration data to use.

I can't help but wonder if it would be better to do configuration
through the resolvers, when I create the resolvers, rather than
through the table's "config".

An additional question is, how do I configure modules? I'm
experimenting with namespaces, so "oceania.SaltRemove_filename" would
be the salt remover for the oceania module. It's still up in the air.


Table cache
-----------

This is another experimental feature. The get_cache_value() and
set_cache_value() methods are used to get/set the subtable
information. They're also used for more long-term storage by the
resolver.

For example, if the config defines a SaltRemover filename, then the
resolver which actually needs to remove the salts must create a
SaltRemover, configured to use that filename. What then?

Obviously I don't want to recreate the SaltRemover for each
record. Instead, my options are to 1) store it in the table (in which
case it's reloaded each time I process a batch), 2) store it in some
sort of local cache for each resolver (but then when is the cache
reset, and does everything need a unique cache key?), or 3) do
something else.

The get/set cache value API is used for #1. I don't think I like it
though.



Credits
=======

Andrew Dalke, Dalke Scientific, dalke@dalkescientific.com
9 June 2015, Trollhättan, Sweden

About

mirror of https://bitbucket.org/dalke/propbox
