mirror of https://bitbucket.org/dalke/propbox
UnixJunkie/propbox
propbox 0.5

Summary
=======

Propbox is a Python package for computing molecular properties and
models, and handling the dependencies between the calculations. The
dependencies form a workflow. For example, the steps in building a
consensus model may look like this:

- the input is a SMILES string
- turn the SMILES into a molecule
- desalt it and standardize the charge model
- use the clean molecule to compute logP, molecular weight, and a
  few other descriptors
- use the descriptors to compute model-1, model-2, and model-3
- use model-1, model-2, and model-3 to compute a consensus model

Rather than arrange the steps by hand, propbox uses a set of resolvers
to fill out a table of properties. The table starts with the input
data, one row per record. You ask the table for the output columns you
want. If a property isn't available, the table asks the resolver to
fill in the missing column. That operation may require additional
data, in which case the resolver goes back to the table to ask for
those columns. This process continues recursively until it gets to
available data. (Or, if there's a cycle, until Python reaches its
maximum recursion depth and raises an exception.) Each resolver then
resolves the column data and the process unwinds until all of the
needed columns are filled in.

Installation
============

This package does not yet support the standard Python installer. You
can run it from the current directory, or copy/move/link the 'propbox'
subdirectory to your location of choice.

License
=======

The propbox package is distributed under the MIT license. (See
COPYING.) The package includes a distribution of the third-party
pylru.py, which is copyright Jay Hutchinson and distributed under the
GPLv2 or later. (See COPYING.pylru.)

'rdprops' command-line tool
===========================

The 'rdprops' command-line program computes molecular descriptors
using the RDKit cheminformatics toolkit from rdkit.org.
It implements the descriptors from rdkit.Chem.Descriptors as well as a few versions of SMILES strings. By default it reads a SMILES file from stdin and writes the results to stdout. I'll ask it to read from a named SMILES file instead, and only show the first few lines of output:: % ./rdprops tests/benzodiazepine.smi | head id smiles MolWt 1688 CN1C(=O)CN=C(c2ccc(Cl)cc2)c2cc(Cl)ccc21 319.191 1963 OCc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1Cl)=NC2 359.216 2118 Cc1nnc2n1-c1ccc(Cl)cc1C(c1ccccc1)=NC2 308.772 2802 O=C1CN=C(c2ccccc2Cl)c2cc([N+](=O)[O-])ccc2N1 315.716 2809 O=C(O)C1N=C(c2ccccc2)c2cc(Cl)ccc2NC1=O 314.728 2997 O=C1CN=C(c2ccccc2)c2cc(Cl)ccc2N1 270.719 3016 CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21 284.746 3261 Clc1ccc2c(c1)C(c1ccccc1)=NCc1nncn1-2 294.745 3299 CCOC(=O)C1N=C(c2ccccc2F)c2cc(Cl)ccc2NC1=O 360.772 The default output contains the record identifier ("id"), the canonical isomeric SMILES ("smiles"), and the molecular weight ("MolWt"). Use the `--columns` option to specify different columns:: % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' | head id HeavyAtomCount MolWt 1688 21 319.191 1963 24 359.216 2118 22 308.772 2802 22 315.716 2809 22 314.728 2997 19 270.719 3016 20 284.746 3261 21 294.745 3299 25 360.772 Propbox uses the RDKit descriptor names for the columns, and by default uses the names for the column headers. 
You might prefer a different header:: % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --headers 'ID,HEAVIES,MW' | head ID HEAVIES MW 1688 21 319.191 1963 24 359.216 2118 22 308.772 2802 22 315.716 2809 22 314.728 2997 19 270.719 3016 20 284.746 3261 21 294.745 3299 25 360.772 or perhaps don't want a header at all:: % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --no-header | head 1688 21 319.191 1963 24 359.216 2118 22 308.772 2802 22 315.716 2809 22 314.728 2997 19 270.719 3016 20 284.746 3261 21 294.745 3299 25 360.772 3369 21 302.736 The default output is tab-separated, but you can change that with the `--dialect` option, which can be one of 'tab', 'space', 'whitespace', 'excel' or 'excel-tab'. (The 'whitespace' option is the same as 'space', and the Excel dialects are as defined by Python's csv module, and include the special rules for quoting):: % ./rdprops tests/benzodiazepine.smi --columns 'id,HeavyAtomCount,MolWt' --dialect excel | head id,HeavyAtomCount,MolWt 1688,21,319.191 1963,24,359.216 2118,22,308.772 2802,22,315.716 2809,22,314.728 2997,19,270.719 3016,20,284.746 3261,21,294.745 3299,25,360.772 List the available descriptors ------------------------------ use the `--list` option to get a list of the available descriptors:: % ./rdprops --list | wc -l 124 That's rather a lot, so I'll elide some of them:: % ./rdprops --list _chargeDescriptors BalabanJ BertzCT cansmiles chargeDescriptorVersion Chi0 Chi0n Chi0v ... ExactMolWt FractionCSP3 HallKierAlpha HeavyAtomCount HeavyAtomMolWt id input_format input_mol input_record ... mol MolLogP MolMR MolWt MolWt_version nci_iupac_name nci_names ... TPSA ... VSA_EState8 VSA_EState9 A future version will include a way to get a description of each descriptor. What's also missing is a naming convention or some other mechanism to describe if it makes sense to print a descriptor as text. 
For example, the 'mol' property is the RDKit molecule object for the input structure, after de-salting. It doesn't make sense to display the opaque text representation of a molecule object :: % ./rdprops tests/benzodiazepine.smi --columns 'id,mol' | head -5 id mol 1688 <rdkit.Chem.rdchem.Mol object at 0x105c44910> 1963 <rdkit.Chem.rdchem.Mol object at 0x105c44980> 2118 <rdkit.Chem.rdchem.Mol object at 0x105c449f0> 2802 <rdkit.Chem.rdchem.Mol object at 0x105c44a60> Similarly, the _chargeDescriptors property is another internal property that shouldn't really be exposed. (I'll use this as an example of how the quoting rules work for the 'excel' dialect.):: % ./rdprops tests/benzodiazepine.smi --columns 'id,_chargeDescriptors' --dialect excel | head -3 id,_chargeDescriptors 1688,"ChargeDescriptor(minCharge=-0.31319991842931816, maxCharge=0.24791727974294836)" 1963,"ChargeDescriptor(minCharge=-0.38834256479943147, maxCharge=0.16298797813009208)" I may move to the convention that a leading '_', and perhaps also a leading lowercase character, indicate an internal variable. Or I may have some way to mark certain descriptors as only being for internal use. Then again, I like how IPython supports adapters to, for example, show inline images for a molecule in a table. Perhaps I'll do that. Specify the format ------------------ Propbox uses the filename extension to determine the file format, and to see if the file is gzip compressed. The following case-insensitive extensions are supported: .smi, .ism, .isosmi - SMILES .smi.gz, .ism.gz, .isosmi.gz - gzip compressed SMILES .sdf, .sd, .mdl - SD file .sdf.gz, .sd.gz, .mdl.gz - gzip compressed SD file If propbox does not recognize the file format extension, or if the input comes from stdin, then it will assume the input is an uncompressed file format. You can specify the format directly using `--format` instead of depending on propbox's auto-detection code. 
For example, since rdprops expects a SMILES file from stdin, piping in
an SD file will cause a problem::

  % ./rdprops < tests/CHEMBL11862.sdf
  [01:33:10] SMILES Parse Error: syntax error for input: CHEMBL11862
  [01:33:10] SMILES Parse Error: syntax error for input: SciTegic11101117232D
  Traceback (most recent call last):
    File "rdprops", line 9, in <module>
      rdprops.main()
    File "/Users/dalke/cvses/propbox/propbox/rdprops.py", line 174, in main
      ids_and_mols = list(batch_reader)
    File "/Users/dalke/cvses/propbox/propbox/rdkit_toolkit.py", line 183, in _read_smiles
      raise ValueError("Line %d is empty" % (lineno,))
  ValueError: Line 3 is empty

I'll instead tell it the input is an uncompressed SD file::

  % ./rdprops --format sdf < tests/CHEMBL11862.sdf
  id           smiles              MolWt
  CHEMBL11862  Oc1cc2c(cc1O)CNCC2  165.192

The supported formats are 'smi', 'smi.gz', 'sdf', and 'sdf.gz', with
the expected meanings.

Use an SD tag as a title
------------------------

By default propbox will use the title line of the SD file as the
identifier. Sometimes the identifier is in one of the tags, as in
ChEBI and older ChEMBL data sets, or you may want to use the InChI or
another primary key stored in a tag.

For example, the title line in CHEMBL11862.sdf is "CHEMBL11862"::

  % ./rdprops tests/CHEMBL11862.sdf
  id           smiles              MolWt
  CHEMBL11862  Oc1cc2c(cc1O)CNCC2  165.192

while the SD tag 'nci_iupac_name' contains the IUPAC name that I got
from passing the structure over to NCI::

  % ./rdprops --id-tag nci_iupac_name tests/CHEMBL11862.sdf
  id                                       smiles              MolWt
  1,2,3,4-tetrahydroisoquinoline-6,7-diol  Oc1cc2c(cc1O)CNCC2  165.192

Reader arguments
----------------

The RDKit SMILES and SDF readers support a few options:

SMILES:
  has_header - Is the first line of the SMILES file a header line?
     (boolean, with default of False)
  delimiter - How should the fields of a SMILES file be parsed?
     (One of 'space'/" ", 'tab'/"\t", 'whitespace', or 'to-eol',
     with default of 'to-eol')
  sanitize - Should the newly parsed molecule be sanitized?
     (boolean, with default of True)

SDF:
  strictParsing - Use strict parsing rules?
     (boolean, with default of True)
  removeHs - Should hydrogens be removed from the molecule?
     (boolean, with default of True)
  sanitize - same as in SMILES

The "delimiter" option is a bit unusual. Different people have
different interpretations of what a SMILES file means. The original
Daylight definition was that the file contains a SMILES, followed by a
whitespace, and the rest of the line is the identifier. In propbox
(and in chemfp) this is called the 'to-eol' delimiter, and is the
default.

Other people think of a SMILES file as a space, tab, or whitespace
separated file, where the first column is the SMILES, the second
column is the identifier, and additional columns are ignored. In
propbox these are referred to as the "space", "tab", and "whitespace"
delimiter styles, respectively. ("Whitespace" means that each word is
treated as its own field.)

You can specify these reader arguments on the command line. For
example, "tests/drugs.smi" is a file I got from Daylight many years
ago::

  % cat tests/drugs.smi
  N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine
  c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine
  COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine
  OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lyseric acid
  CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD
  C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine
  C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin
  c1ncccc1C1CCCN1C nicotine
  CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
  C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a

Two of the identifiers, "lyseric acid" and "vitamin a", contain a
space in them.
The default delimiter style is 'to-eol', which is why the following
shows the full names::

  % ./rdprops --columns 'id,MolWt' tests/drugs.smi
  id            MolWt
  Strychnine    334.419
  cocaine       303.358
  quinine       324.424
  lyseric acid  282.343
  LSD           323.44
  morphine      285.343
  heroin        369.417
  nicotine      162.236
  caffeine      194.194
  vitamin a     272.432

To specify the 'whitespace' delimiter style, use the `-R` parameter,
which takes a NAME=VALUE setting::

  % ./rdprops --columns 'id,MolWt' -R delimiter=whitespace tests/drugs.smi
  id          MolWt
  Strychnine  334.419
  cocaine     303.358
  quinine     324.424
  lyseric     282.343
  LSD         323.44
  morphine    285.343
  heroin      369.417
  nicotine    162.236
  caffeine    194.194
  vitamin     272.432

The boolean reader args interpret the strings "True", "true", or "1"
as a true value, and "False", "false", or "0" as a false value. For
example, the following will skip the first line of drugs.smi on the
assumption that it's a header line::

  % ./rdprops --columns 'id,MolWt' -R has_header=true tests/drugs.smi
  id            MolWt
  cocaine       303.358
  quinine       324.424
  lyseric acid  282.343
  LSD           323.44
  morphine      285.343
  heroin        369.417
  nicotine      162.236
  caffeine      194.194
  vitamin a     272.432

Batch size
----------

The 'nci_iupac_name' descriptor uses the NCI web service API to turn a
SMILES into an IUPAC name. This is mostly a proof-of-concept API, and
it's rather slow since I make a request for each record. (Does the NCI
resolver have a batch mode API?) Still, let's give it a whirl::

  % ./rdprops --columns 'id,nci_iupac_name' tests/drugs.smi
  id            nci_iupac_name
  Strychnine    *
  cocaine       methyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate
  quinine       (5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol
  lyseric acid  *
  LSD           *
  morphine      *
  heroin        *
  nicotine      3-(1-methylpyrrolidin-2-yl)pyridine
  caffeine      1,3,7-trimethylpurine-2,6-dione
  vitamin a     *

This took about 3 seconds, but you'll notice that there was no output
until everything was ready. This is because propbox by default
processes the records in batches of 1,000 records.
It will compute the properties for the first 1,000 structures, then display the result, then compute the properties for the second 1,000 structures, then display those results, etc. I can ask it to process one record at a time using the `--batch-size` parameter:: % ./rdprops --columns 'id,nci_iupac_name' --batch-size 1 tests/drugs.smi id nci_iupac_name Strychnine * cocaine methyl 3-(benzoyloxy)-8-methyl-8-azabicyclo[3.2.1]octane-2-carboxylate quinine (5-ethenyl-1-azabicyclo[2.2.2]octan-7-yl)-(6-methoxyquinolin-4-yl)methanol lyseric acid * LSD * morphine * heroin * nicotine 3-(1-methylpyrrolidin-2-yl)pyridine caffeine 1,3,7-trimethylpurine-2,6-dione vitamin a * (Propbox uses a '*' for records which had a problem. There is currently no way to use another symbol.) In the NCI case there is no timing difference between a batch size of 1 and of 1,000 records because the propbox NCI client makes one request at a time. Batch mode exists because in some cases it's faster to process N molecules at once than to process each one individually. Eg, in the future propbox might be able to send all of the queries to the server in a single request, which would save a lot of network overhead. Use `--batch-size all` to process all of the structures in a single batch. Add a resolver -------------- Use `-r` or `--resolver` to add a resolver to the built-in resolver. I'll cover the details in the next section. For an example of how it works, I'll create a simple model based on the molecular weight and the number of hydrogen bond donors. 
The descriptor will be called 'model', and located in a file called "model.py" in the current directory (or somewhere else on the Python path):: % cat model.py from propbox import calculate, collect_resolvers @calculate() def calc_model(MolWt, NumHDonors): return MolWt * 12.34 / (NumHDonors + 1) resolver = collect_resolvers() This is a non-standard resolver, so I need to tell rdprops the path for how to load it:: % ./rdprops --columns 'id,model' -r model.resolver tests/CHEMBL11862.sdf id model CHEMBL11862 509.61732 To double-check, I'll get the molecular weight and number of hbond donors to do the math myself:: % ./rdprops --columns 'id,MolWt,NumHDonors,model' -r model.resolver tests/CHEMBL11862.sdf id MolWt NumHDonors model CHEMBL11862 165.192 3 509.61732 And what do you know, it matches! >>> 165.192 * 12.34 / (3 + 1) 509.61732000000001 The propbox resolver framework ============================== Propbox is built around two concepts: a table and a resolver. The rows of the table are structure records, and the columns are molecular properties, referenced by name. A resolver is an object which can fill in columns of a table. A resolver may get columns from the table in order to do its job. Create a table -------------- There are two ways to create a table; by rows ("records") or by columns. I'll create a table with no resolver and a single column, "smiles", with some SMILES data:: >>> import propbox >>> table = propbox.make_table_from_columns(None, {"smiles": ["C", "O=O"]}) >>> table.get_values("smiles") ['C', 'O=O'] Missing identifiers will be created automatically:: >>> table.get_values("id") ['ID1', 'ID2'] or you can specify the identifiers yourself:: >>> table = propbox.make_table_from_columns(None, ... 
{"smiles": ["C", "O=O"], "id": ["methane", "water"]}) >>> table.get_values("smiles") ['C', 'O=O'] >>> table.get_values("id") ['methane', 'water'] Use make_table_from_records() if you have per-record dictionary data:: >>> table = propbox.make_table_from_records(None, ... [{"smiles": "O=O", "id": "water"}, {"smiles": "c1ccccc1O", "id": "phenol"}]) >>> table.get_values("smiles") ['O=O', 'c1ccccc1O'] >>> table.get_values("id") ['water', 'phenol'] I used None as the resolver, but the None object doesn't support the resolver protocol, so if I try to get a column that doesn't yet exist, I'll get the following:: >>> table.get_values("MW") Traceback (most recent call last): File "<stdin>", line 1, in <module> File "propbox/__init__.py", line 709, in get_values futures = self.get_futures(name) File "propbox/__init__.py", line 693, in get_futures self.resolver.resolve_column(name, self) AttributeError: 'NoneType' object has no attribute 'resolve_column' Define a resolver ----------------- Here's a resolver which returns a constant value:: from __future__ import print_function import propbox class Constant(propbox.Resolver): output_names = ["value"] def __init__(self, value): self.value = value def resolve_column(self, name, table): table.set_values("value", [self.value] * len(table)) table = propbox.make_table_from_records(Constant(4), [{}, {}]) print("ids", table.get_values("id")) print("values", table.get_values("value")) This creates the following output:: ids ['ID1', 'ID2'] values [4, 4] How this works is, the table doesn't know about the 'value' column, so it asks the resolver to resolve the column 'value'. The table passes itself as the table, so the resolver can use the table to get or set data. The Constant resolver uses len(table) to get the number of rows in the table -- two in this case -- and create the list [4, 4], which it then uses to set the table column named 'value', which is then available to the table. 
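The exchange between the table and the resolver can be mimicked with a few lines of plain Python. This is only a toy sketch of the protocol described above, not propbox's implementation: the names ``ToyTable`` and ``ToyConstant`` are made up for illustration, and the real table stores futures rather than bare lists.

```python
class ToyTable(object):
    """Hypothetical stand-in for a propbox table (not the real class)."""
    def __init__(self, resolver, columns):
        self.resolver = resolver
        self.columns = dict(columns)
        # Every column has one entry per record; use any column for the size.
        self.num_rows = len(next(iter(self.columns.values())))

    def __len__(self):
        return self.num_rows

    def set_values(self, name, values):
        self.columns[name] = list(values)

    def get_values(self, name):
        if name not in self.columns:
            # Unknown column: pass the table itself to the resolver ...
            self.resolver.resolve_column(name, self)
            # ... then double-check that the resolver filled the column in.
            if name not in self.columns:
                raise AssertionError(
                    "Resolver did not set values for column %r" % (name,))
        return self.columns[name]


class ToyConstant(object):
    """Hypothetical resolver which always fills in the 'value' column."""
    def __init__(self, value):
        self.value = value

    def resolve_column(self, name, table):
        table.set_values("value", [self.value] * len(table))


table = ToyTable(ToyConstant(4), {"id": ["ID1", "ID2"]})
values = table.get_values("value")
print("values", values)
```

Asking the toy table for any other column name still calls the resolver, but the double-check then fails because only 'value' was set.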
The 'output_names' attribute contains a list of the column names that
the resolver can compute. It isn't actually used in this case, since
the table will ask the resolver to handle any unknown column. I could,
for example, ask for 'xyzzy' and it would call the resolver::

  table = propbox.make_table_from_records(Constant(4), [{}, {}])
  print("ids", table.get_values("id"))
  print("xyzzy", table.get_values("xyzzy"))

However, the table does double-check that the resolver adds the
requested column, so the above will generate the following error::

  ids ['ID1', 'ID2']
  Traceback (most recent call last):
    File "tmp.py", line 15, in <module>
      print("xyzzy", table.get_values("xyzzy"))
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 709, in get_values
      futures = self.get_futures(name)
    File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 698, in get_futures
      % (self.resolver, name))
  AssertionError: Resolver <__main__.Constant object at 0x1007e74d0> did not set values for column 'xyzzy'

Define a Propbox
----------------

A Propbox is a resolver which contains other resolvers. It uses the
'output_names' of the other resolvers to figure out which resolver to
use.
For example, I'll modify the Constant resolver so I can specify which column it will set:: from __future__ import print_function import propbox class Constant(propbox.Resolver): def __init__(self, descriptor, value): self.value = value self.descriptor = descriptor self.output_names = [descriptor] def resolve_column(self, name, table): table.set_values(self.descriptor, [self.value] * len(table)) then create a Propbox which contains two Constants; one which sets 'value' to 8 and the other which sets 'xyzzy' to 13:: resolver = propbox.Propbox() resolver.add_resolver(Constant("value", 8)) resolver.add_resolver(Constant("xyzzy", 13)) and finally create a table which uses that Propbox resolver:: table = propbox.make_table_from_records(resolver, [{}, {}]) print("value", table.get_values("value")) print("xyzzy", table.get_values("xyzzy")) print("unknown", table.get_values("unknown")) When I run it, I get the following output:: value [8, 8] xyzzy [13, 13] Traceback (most recent call last): File "tmp.py", line 22, in <module> print("unknown", table.get_values("unknown")) File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 709, in get_values futures = self.get_futures(name) File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 693, in get_futures self.resolver.resolve_column(name, self) File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 171, in resolve_column raise PropboxKeyError(self, name) propbox.PropboxKeyError: unknown In case you were wondering, the PropboxKeyError inherits from the regular KeyError, as well as from propbox.PropboxError. A resolver that uses the table ------------------------------ The Constant resolver is pretty boring. 
What about a resolver which returns the length of the SMILES string,
stored in the 'smiles' column, and sets the column 'len'?::

  from __future__ import print_function

  import propbox

  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          table.set_values("len", [len(smiles) for smiles in smiles_list])

  resolver = Len()
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})

  print(table.get_values("len"))

The output from this is::

  [1, 3, 9]

which I think you expected.

What's new here is that the resolver asked the table to get the
"smiles" column. This is a recursive call, since the table was the one
to call the resolver in the first place.

The recursion might go several levels deep. I'll also create a
DoubleLen resolver, which doubles the value of "len" and stores it in
"len2". Since I have two resolvers, I'll need to put them into a
Propbox::

  from __future__ import print_function

  import propbox

  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          table.set_values("len", [len(smiles) for smiles in smiles_list])

  class DoubleLen(propbox.Resolver):
      output_names = ["len2"]
      def resolve_column(self, name, table):
          len_list = table.get_values("len")
          table.set_values("len2", [value*2 for value in len_list])

  resolver = propbox.Propbox([Len(), DoubleLen()])
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})

  print(table.get_values("len2"))

Here's what happened:

- The column "len2" does not exist, so the table asks the Propbox
  resolver to fill it in;
- The Propbox resolver used the output_names to figure out that the
  DoubleLen resolver could resolve that column;
- The DoubleLen resolver needs the values for the "len" column from
  the table;
- The column "len" doesn't exist, so the table asks the Propbox
  resolver to fill it in;
- The Propbox resolver used the output_names to figure out that the
  Len resolver could resolve that column;
- The Len resolver needs the values for the "smiles" column from the
  table;
- The table returns the "smiles" column;
- The Len resolver computes the string lengths and sets the values for
  the "len" column;
- The DoubleLen resolver doubles those values, and sets the "len2"
  column;
- The calculations are complete and returned to the caller.

All of the intermediate values are stored in the table in case they
are needed for additional calculations.

Futures
-------

What if there was an error during the calculation? For that matter,
how does a resolver even indicate an error?

You'll need to understand 'futures' to understand how errors work in
propbox. A future is something which wraps a return value or a raised
exception. It's often used in modern asynchronous I/O libraries,
including Python 3.4, where it is used as a placeholder for the actual
return value, which will be available in the future. (It's called a
'promise' in some libraries.)

Propbox is not asynchronous, though I want it to go that way. I use
the 'future' concept as a way to keep track of whether something was a
return value or an exception.

Here's what it looks like, using part of the propbox API that should
only be used by resolvers.
I'll store the value 12 as if it were a successfully computed descriptor:: >>> from propbox import simple_futures >>> future = simple_futures.new_future(12) >>> future <propbox.simple_futures.Future object at 0x100666490> >>> future.result() 12 This being Python, I can store anything in the future's result:: >>> future = simple_futures.new_future("twelve") >>> future.result() 'twelve' I can even store an exception instance as the value:: >>> future = simple_futures.new_future(ValueError("must be a string")) >>> future.result() ValueError('must be a string',) What if I have a "real" exception, that is, something which shouldn't be treated as a return value? I'll create a future that contains an exception:: >>> future = simple_futures.new_future_exception(ValueError("must be a string")) >>> future.result() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "propbox/simple_futures.py", line 34, in result raise self._exception ValueError: must be a string I asked the future for its result, but since it contained an exception, it raised the exception. (One limitation is that the exception no longer has stack information. If you enable propbox.DEBUG=1 then you'll see the stack trace printed to stderr when a resolver raises an unhandled exception.) If you want the exception value, without going through the try/except mechanism, then ask the future for it:: >>> future.exception() ValueError('must be a string',) This will return None if there was no exception. Setting descriptor exceptions ----------------------------- Two sections earlier I used get_values() and set_values() to get and set the column values. If there's an error then get_values() by default will use None as a placeholder error value, and there's no way to use set_values() to specify an error. The get_values()/set_values() functions are really just wrappers around the underlying futures data. You can access the futures directly with get_futures() and set_futures(). 
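To make the wrapper relationship concrete, here is a toy sketch of a future and of a get_values()-style helper built on top of it. It is an illustration only: ``ToyFuture`` and ``values_from_futures`` are hypothetical names, and the real propbox.simple_futures module may be organized differently.

```python
class ToyFuture(object):
    """Hypothetical future: holds either a result or an exception."""
    def __init__(self, result=None, exception=None):
        self._result = result
        self._exception = exception

    def result(self):
        # Re-raise a stored exception, otherwise return the stored value.
        if self._exception is not None:
            raise self._exception
        return self._result

    def exception(self):
        # None means the calculation succeeded.
        return self._exception


def values_from_futures(futures, error_value=None):
    """Unwrap futures into plain values, using a placeholder for errors."""
    values = []
    for future in futures:
        if future.exception() is None:
            values.append(future.result())
        else:
            values.append(error_value)
    return values


futures = [
    ToyFuture(result=1),
    ToyFuture(result=3),
    ToyFuture(exception=ValueError("must be a string")),
]
default_values = values_from_futures(futures)
zero_values = values_from_futures(futures, 0)
print(default_values)
print(zero_values)
```

The placeholder throws away the reason for the failure, which is why a resolver that wants to report or propagate an error has to work with the futures themselves.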
In the following, I'll modify the "len" resolver so it gives an error if the SMILES string contains the letter 'O':: from __future__ import print_function import propbox from propbox import simple_futures class Len(propbox.Resolver): output_names = ["len"] def resolve_column(self, name, table): smiles_list = table.get_values("smiles") futures = [] for smiles in smiles_list: if "O" in smiles: err = ValueError("No 'O's allowed: %r" % (smiles,)) future = simple_futures.new_future_exception(err) else: future = simple_futures.new_future(len(smiles)) futures.append(future) table.set_futures("len", futures) resolver = Len() table = propbox.make_table_from_columns( resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]}) print(table.get_values("len")) The result from when I run this is:: [1, 3, None] because None is the placeholder error value. I can change that to something else. In the following I use zero:: print(table.get_values("len", 0)) which creates the output:: [1, 3, 0] Exception chaining (advanced) ----------------------------- This is an advanced topic. In almost all cases you can use the 'Calculator' class in the next section, which handles exception chaining automatically. Suppose you want a "doubled" property, which is twice the "len" property. What if "len" has an error? Since "len" is supposed to be a number, it's easy to check if it's the None value, and do something different in that case. In the following, the 'Len' class is unchanged from the previous section. 
What's new is the 'DoubleLen' class, the Propbox which includes both
Len and DoubleLen, and the output, where this time I show the future's
exception for each record::

  from __future__ import print_function

  import propbox
  from propbox import simple_futures

  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          futures = []
          for smiles in smiles_list:
              if "O" in smiles:
                  err = ValueError("No 'O's allowed: %r" % (smiles,))
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len(smiles))
              futures.append(future)
          table.set_futures("len", futures)

  # This version does not implement exception chaining
  class DoubleLen(propbox.Resolver):
      output_names = ["doubled"]
      def resolve_column(self, name, table):
          len_list = table.get_values("len")
          futures = []
          for len_value in len_list:
              if len_value is None:
                  err = Exception("No 'len' available")
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len_value*2)
              futures.append(future)
          table.set_futures("doubled", futures)

  resolver = propbox.Propbox([Len(), DoubleLen()])
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})

  print("id exception")
  for id, double_future in zip(table.get_values("id"), table.get_futures("doubled")):
      print(id, double_future.exception())

This gives the following output::

  id exception
  ID1 None
  ID2 None
  ID3 No 'len' available

It would be nice to know *why* 'len' isn't available. Propbox
implements exception chaining, which is where one resolver exception
can wrap another, all the way back to the actual exception that caused
the problem.

To do that correctly, I'll need to make two changes. The first is to
wrap the ValueError of the Len class inside of a ResolverError
exception, so that callers know the descriptor that caused the
original problem.
That's one new line of code (plus its comment) in the following::

  class Len(propbox.Resolver):
      output_names = ["len"]
      def resolve_column(self, name, table):
          smiles_list = table.get_values("smiles")
          futures = []
          for smiles in smiles_list:
              if "O" in smiles:
                  err = ValueError("No 'O's allowed: %r" % (smiles,))
                  # I added the next line for better exception chaining.
                  # It will include the name of the descriptor that has the problem.
                  err = propbox.ResolverError(err, table.table_name, "len")
                  future = simple_futures.new_future_exception(err)
              else:
                  future = simple_futures.new_future(len(smiles))
              futures.append(future)
          table.set_futures("len", futures)

The second is to change DoubleLen to use get_futures() instead of
get_values(), and if one of the 'len' futures has an exception, to
wrap it inside of another ResolverError::

  class DoubleLen(propbox.Resolver):
      output_names = ["doubled"]
      def resolve_column(self, name, table):
          len_futures = table.get_futures("len")
          futures = []
          for len_future in len_futures:
              prev_exception = len_future.exception()
              if prev_exception is None:
                  # There was no error
                  len_value = len_future.result()
                  future = simple_futures.new_future(len_value*2)
              else:
                  # Create the chain.
                  err = propbox.ResolverError(prev_exception, table.table_name, "doubled")
                  future = simple_futures.new_future_exception(err)
              futures.append(future)
          table.set_futures("doubled", futures)

The resulting code now generates::

  id exception
  ID1 None
  ID2 None
  ID3 doubled -> len: ValueError("No 'O's allowed: 'c1ccccc1O'",)

This is more helpful because it says that 'doubled' failed because
'len' failed because of a ValueError.
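The chaining idea itself is small enough to sketch without propbox. The class below is a hypothetical stand-in, not propbox.ResolverError: it omits the table name, and ``ToyResolverError`` and ``root_cause`` are names invented for this illustration. It only shows how wrapping one error in another preserves both the path of descriptor names and the first exception.

```python
class ToyResolverError(Exception):
    """Hypothetical chained error: wraps the exception that caused it."""
    def __init__(self, cause, column_name):
        Exception.__init__(self, cause, column_name)
        self.cause = cause
        self.column_name = column_name

    def root_cause(self):
        # Walk back through the wrappers to the first exception.
        err = self
        while isinstance(err, ToyResolverError):
            err = err.cause
        return err

    def __str__(self):
        # Collect the descriptor names from the outermost wrapper inward.
        names = []
        err = self
        while isinstance(err, ToyResolverError):
            names.append(err.column_name)
            err = err.cause
        return "%s: %r" % (" -> ".join(names), err)


original = ValueError("No 'O's allowed: 'c1ccccc1O'")
len_error = ToyResolverError(original, "len")            # raised for 'len'
doubled_error = ToyResolverError(len_error, "doubled")   # wraps the 'len' error

chain_text = str(doubled_error)
print(chain_text)
print(doubled_error.root_cause())
```

Each resolver in the chain adds one wrapper, so the cost of good error messages is one extra object per failed record per level.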
Use 'get_original_exception()' if you only care about the actual
exception that caused the problem, and not the full resolver chain, as
in the following variation, which only prints those records with an
error::

  print("id exception")
  for id, double_future in zip(table.get_values("id"), table.get_futures("doubled")):
      exception = double_future.exception()
      if exception is not None:
          print(id, exception.get_original_exception())

which produces the following output::

  id exception
  ID3 No 'O's allowed: 'c1ccccc1O'

Calculator
----------

The previous section showed the nitty-gritty of how to handle and
report errors during a calculation. For the most part, you don't need
to write that sort of low-level code. Instead, use the 'Calculator'
class. Here's an example::

  from __future__ import print_function

  import propbox
  from propbox import simple_futures

  class Len(propbox.Calculator):
      input_names = ["smiles"]
      output_names = ["len"]
      def calculate(self, name, table, input_values, output):
          for (smiles,) in input_values:
              if "O" in smiles:
                  output.add_exception(ValueError("No 'O's allowed: %r" % (smiles,)))
              else:
                  output.add_result(len(smiles))

  class DoubleLen(propbox.Calculator):
      input_names = ["len"]
      output_names = ["doubled"]
      def calculate(self, name, table, input_values, output):
          for (len_value,) in input_values:
              output.add_result(len_value*2)

  resolver = propbox.Propbox([Len(), DoubleLen()])
  table = propbox.make_table_from_columns(
      resolver, {"smiles": ["C", "C#N", "c1ccccc1O"]})

  print("id exception")
  for id, doubled_future in zip(table.get_values("id"), table.get_futures("doubled")):
      print(id, doubled_future.exception() or doubled_future.result())

If you compare it to the previous section you'll see it's much
shorter. If you skipped the previous section, then great! You didn't
need to read it to understand this section. (For that matter, you can
skip to 'Decorators for simple functions' for an even easier way to
handle this code.)
The 'input_names' attribute lists the columns that will be passed in
as input, and the calculator must set values for all of the
'output_names'. This is a bit more strict than a normal resolver,
which doesn't need to list its input names, and only needs to compute
the requested output name.

The Calculator's own resolve_column() will filter out the inputs which
contain an exception, and pass only the actual values to the
'calculate()' method. By "actual values" I mean there aren't even
placeholders for the error values. If one of the inputs causes an
exception then the Calculator will automatically set up the resolver
chain.

The values are passed to the "calculate()" function as a list of
lists, where the order depends on the order in input_names. For
example, in the following::

    class Example(propbox.Calculator):
        input_names = ["smiles", "doubled"]
        output_names = ["example"]
        def calculate(self, name, table, input_values, output):
            print("input_values", input_values)
            for (smiles, doubled) in input_values:
                output.add_result("2*len(%r)=%d" % (smiles, doubled))

the inputs are 'smiles' and 'doubled', which are passed in as::

    input_values [['C', 2], ['C#N', 6]]

This is row order, so the first element contains the columns for the
first non-error record, the second for the second non-error record,
and so on.

It's a bit tricky to remember that a single input name still gets a
list of lists, even though each inner list has only one element. I use
"for (len_value,) in" in the following as an explicit reminder that I
am unpacking a term from a single-element list::

    class DoubleLen(propbox.Calculator):
        input_names = ["len"]
        output_names = ["doubled"]
        def calculate(self, name, table, input_values, output):
            for (len_value,) in input_values:
                output.add_result(len_value*2)

Otherwise it's very easy to make a mistake and do "for len_value in
input_values".

The 'name' and 'table' should look familiar by this time.
While you can use the table to get or set columns, you really
shouldn't because your code will end up interfering with the
Calculator code. The table is passed in case the calculator needs to
access any of the table configuration information, or needs to get/set
a cache value on the table.

The 'output' term is special. Use it to specify information for the
current record number, or specify information for all of the remaining
records.

The 'add_result()' method can be used when output_names contains only
a single name. add_result() sets the corresponding column for the
current record, then advances to the next record.

Use 'add_results()' when there are more outputs. The function takes
the result values as a list or tuple, and sets the corresponding
futures. Here's an example of it in use::

    class Scaling(propbox.Calculator):
        input_names = ["len"]
        output_names = ["tripled", "third"]
        def calculate(self, name, table, input_values, output):
            for (n,) in input_values:
                output.add_results((n*3, n/3.0))

In this case the "n*3" sets the 'tripled' column, and 'n/3.0' sets the
'third' column.

Sometimes it's more convenient to set all of the results at once,
which you can do with the 'add_column_results()' method::

    class Scaling(propbox.Calculator):
        input_names = ["len"]
        output_names = ["tripled", "third"]
        def calculate(self, name, table, input_values, output):
            tripled_list = []
            third_list = []
            for (n,) in input_values:
                tripled_list.append(n*3)
                third_list.append(n/3.0)
            output.add_column_results( (tripled_list, third_list) )

However, this doesn't appear to be one of those cases where the result
is simpler.
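The filtering and the "current record" behavior described above can be sketched in a few lines of plain Python. This is only an illustration of the idea and not propbox's implementation: 'done_future', 'SketchOutput', and 'run_calculation' are hypothetical names, the standard library's concurrent.futures.Future stands in for propbox's simple_futures, and an input error is passed through as-is instead of being wrapped in a resolver chain.

```python
from concurrent.futures import Future


def done_future(value=None, error=None):
    """Make an already-completed Future holding a value or an exception."""
    future = Future()
    if error is not None:
        future.set_exception(error)
    else:
        future.set_result(value)
    return future


class SketchOutput(object):
    """Hypothetical 'output' object: each add_*() call fills the next record."""
    def __init__(self):
        self.futures = []

    def add_result(self, value):
        self.futures.append(done_future(value=value))

    def add_exception(self, error):
        self.futures.append(done_future(error=error))


def run_calculation(calculate, input_columns):
    """Call calculate() on the error-free rows; return one future per record."""
    rows = list(zip(*input_columns))  # column order -> row order
    good_rows = [[f.result() for f in row] for row in rows
                 if all(f.exception() is None for f in row)]
    output = SketchOutput()
    calculate(good_rows, output)      # sees values only, no placeholders
    computed = iter(output.futures)
    merged = []
    for row in rows:
        errors = [f.exception() for f in row if f.exception() is not None]
        if errors:
            # Assumption: pass the first input error along unwrapped.
            merged.append(done_future(error=errors[0]))
        else:
            merged.append(next(computed))
    return merged


def double(input_values, output):
    for (len_value,) in input_values:
        output.add_result(len_value * 2)


len_column = [done_future(1), done_future(3),
              done_future(error=ValueError("No 'O's allowed"))]
doubled = run_calculation(double, [len_column])
for future in doubled:
    print(future.exception() or future.result())
```

Because the error rows never reach 'double()', the function needs no error handling of its own, which is the main convenience the Calculator class provides.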
You saw already how to tell the output to use an exception for a given
row, by using the 'add_exception()' method::

    class Len(propbox.Calculator):
        input_names = ["smiles"]
        output_names = ["len"]
        def calculate(self, name, table, input_values, output):
            for (smiles,) in input_values:
                if "O" in smiles:
                    output.add_exception(ValueError("No 'O's allowed: %r" % (smiles,)))
                else:
                    output.add_result(len(smiles))

You can also specify futures directly, either for a record via
'add_futures' or for all of the columns via 'add_column_futures'.

CalculateName / CalculateNames
------------------------------

You may have functions which you want to turn into propbox
descriptors. The CalculateName and CalculateNames classes are
subclasses of Calculator which know how to call a function to compute
a property or a set of properties, respectively.

For example, here are three functions that might be useful for
propbox::

    from rdkit import Chem

    def smilin(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
        return mol

    def num_heavies(mol):
        return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)

    def heavy_range(mol):
        "Return the lightest and heaviest element numbers of the heavy atoms"
        atomic_nums = []
        for atom in mol.GetAtoms():
            atomic_num = atom.GetAtomicNum()
            if atomic_num > 1:
                atomic_nums.append(atomic_num)
        if not atomic_nums:
            return (0, 0)
        return min(atomic_nums), max(atomic_nums)

The first two only return a single value, so I'll use a CalculateName
instance for them. The last returns two values, so I'll use a
CalculateNames for it::

    import propbox

    resolver = propbox.Propbox([
        propbox.CalculateName(["smiles"], "mol", smilin),
        propbox.CalculateName(["mol"], "nHEAVIES", num_heavies),
        propbox.CalculateNames(["mol"], ["LIGHTEST_HEAVY", "HEAVIEST_HEAVY"], heavy_range),
        ])

The first parameter is the input_names list.
The second parameter is the output name (for CalculateName) or the
list of output names (for CalculateNames). The third parameter is the
function to call.

I'll use that resolver to make a table::

    table = propbox.make_table_from_columns(
        resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})

then use the table to generate CSV output, by default as a
tab-separated file::

    import sys
    table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"])

The output in this case is::

    [14:20:24] SMILES Parse Error: syntax error for input: Q
    id   nHEAVIES  LIGHTEST_HEAVY  HEAVIEST_HEAVY
    ID1  1         6               6
    ID2  2         6               7
    ID3  *         *               *
    ID4  7         6               8
    ID5  7         9               92

Suppose though you want this in 'excel' format, which uses commas
instead of tabs, and knows how to quote terms that contain a
comma. And suppose you want to use '???' when the nHEAVIES could not
be computed, and 'n/a' when the element range could not be
computed. In that case, use the following::

    import sys
    table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"],
               dialect="excel",
               missing_values=["x", "???", "n/a", "n/a"])

which generates::

    id,nHEAVIES,LIGHTEST_HEAVY,HEAVIEST_HEAVY
    ID1,1,6,6
    ID2,2,6,7
    ID3,???,n/a,n/a
    ID4,7,6,8
    ID5,7,9,92

Decorators for simple functions
-------------------------------

The previous section assumed that you couldn't modify the code that
computed the property values. If on the other hand you can modify
them, then the 'propbox.calculate' decorator will create a
CalculateName (if output_names is a string) or CalculateNames (if
output_names is a list) for each function, and store it as the
function attribute 'propbox_resolver'.

The function 'collect_resolvers()' will look for resolvers in the
module's namespace. If an object has a 'propbox_resolver' attribute
then that will be used as a resolver. Objects which are instances of
propbox.Resolver will also be treated as resolvers. All of the
resolvers will be placed into a Propbox.
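As a rough illustration of the mechanism just described, the following self-contained sketch shows one way a decorator can attach a 'propbox_resolver' attribute to a function and how a collector can then find those attributes in a namespace. The names 'sketch_calculate' and 'sketch_collect' are hypothetical, and a plain dictionary stands in for the CalculateName and CalculateNames objects. The real propbox.calculate and collect_resolvers() may well work differently.

```python
def sketch_calculate(input_names, output_names):
    """Hypothetical decorator: describe how to call the function."""
    def decorator(func):
        single = isinstance(output_names, str)
        # A plain dict stands in for a CalculateName/CalculateNames instance.
        func.propbox_resolver = {
            "kind": "CalculateName" if single else "CalculateNames",
            "input_names": list(input_names),
            "output_names": [output_names] if single else list(output_names),
            "function": func,
        }
        return func  # the function itself is left unchanged
    return decorator


def sketch_collect(namespace):
    """Gather every object in the namespace carrying 'propbox_resolver'."""
    found = []
    for name in sorted(namespace):
        resolver = getattr(namespace[name], "propbox_resolver", None)
        if resolver is not None:
            found.append(resolver)
    return found


@sketch_calculate(["smiles"], "len")
def smiles_length(smiles):
    return len(smiles)


@sketch_calculate(["len"], ["tripled", "third"])
def scaling(n):
    return n * 3, n / 3.0


# An explicit dictionary plays the role of the module's namespace.
namespace = {"smiles_length": smiles_length, "scaling": scaling, "unrelated": 42}
resolvers = sketch_collect(namespace)
for resolver in resolvers:
    print(resolver["kind"], resolver["input_names"], resolver["output_names"])
```

Storing the description on the function, instead of replacing the function, means the decorated functions can still be called and tested directly.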
Here's an example::

    from __future__ import print_function

    from rdkit import Chem
    import propbox
    #propbox.DEBUG = True  # Uncomment for a bit better debugging
    import sys

    @propbox.calculate(output_names="mol")
    def smilin(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
        return mol

    @propbox.calculate(output_names="nHEAVIES")
    def num_heavies(mol):
        return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)

    @propbox.calculate(output_names=["LIGHTEST_HEAVY", "HEAVIEST_HEAVY"])
    def heavy_range(mol):
        "Return the lightest and heaviest element numbers of the heavy atoms"
        atomic_nums = []
        for atom in mol.GetAtoms():
            atomic_num = atom.GetAtomicNum()
            if atomic_num > 1:
                atomic_nums.append(atomic_num)
        if not atomic_nums:
            return (0, 0)
        return min(atomic_nums), max(atomic_nums)

    resolver = propbox.collect_resolvers()

    table = propbox.make_table_from_columns(
        resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})

    table.save(sys.stdout, ["id", "nHEAVIES", "LIGHTEST_HEAVY", "HEAVIEST_HEAVY"],
               dialect="excel",
               missing_values=["x", "???", "n/a", "n/a"])

Not surprisingly, this gives the same output as before::

    ID1,1,6,6
    ID2,2,6,7
    ID3,???,n/a,n/a
    ID4,7,6,8
    ID5,7,9,92

But wait! Why didn't I need to configure the list of input_names? I
could have. I could have said::

    @propbox.calculate(input_names=["smiles"], output_names="mol")
    def smilin(s):
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            raise ValueError("RDKit cannot parse the SMILES %r" % (s,))
        return mol

If input_names isn't given then the decorator assumes that the
function arguments are the expected property names. In the original
function, the function took a 'smiles' parameter, which happened to be
the same name as the property, so I let it be.
In this modified version, I changed the function to take an 's'
instead of a 'smiles', so I needed to specify the input_names to get
the input values from the 'smiles' column instead of the 's' column.

There's another shortcut. If the function computes a single
descriptor, and the function name starts with 'calc_', then the rest
of the function name will be used as the descriptor name. That is, the
first two functions could be rewritten as::

    @propbox.calculate()
    def calc_mol(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
        return mol

    @propbox.calculate()
    def calc_nHEAVIES(mol):
        return sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() > 1)

I'll use this technique to rewrite the 'len' and 'doubled' descriptors
from an earlier section::

    from __future__ import print_function

    import propbox
    import sys

    @propbox.calculate()
    def calc_len(smiles):
        if "O" in smiles:
            raise ValueError("No 'O's allowed: %r" % (smiles,))
        return len(smiles)

    @propbox.calculate()
    def calc_doubled(len):
        return len*2

    resolver = propbox.collect_resolvers()

    table = propbox.make_table_from_columns(
        resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})

    table.save(sys.stdout, ["id", "doubled", "smiles"])

The output from this is::

    id   doubled  smiles
    ID1  2        C
    ID2  6        C#N
    ID3  2        Q
    ID4  *        c1ccccc1O
    ID5  38       [U](F)(F)(F)(F)(F)F

If I want to see the exception message I can traverse the rows
myself::

    for id, doubled in table.get_future_rows(["id", "doubled"]):
        print(id.result(), doubled.exception() or doubled.result())

This prints::

    ID1 2
    ID2 6
    ID3 2
    ID4 doubled -> len: ValueError("No 'O's allowed: 'c1ccccc1O'",)
    ID5 38

Property Aliases
----------------

You'll sometimes need multiple names for the same descriptor. For
example, you might have a resolver which expects "MW" for the
molecular weight, but you follow the RDKit convention and use "MolWt".
Propbox comes with an "Aliases" resolver, which will forward a request
to the right value. Here's an example, which says that a molecule is
"large" if it has a molecular weight of more than 75.0 (yes, this is
made up)::

    from __future__ import print_function

    import propbox
    import sys
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    @propbox.calculate()
    def calc_mol(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("RDKit cannot parse the SMILES %r" % (smiles,))
        return mol

    @propbox.calculate()
    def calc_MolWt(mol):
        return Descriptors.MolWt(mol)

    # This wants a "MW" property, but the molecular weight is available
    # as the "MolWt" property.
    @propbox.calculate()
    def calc_is_large(MW):
        return MW > 75.0

    # Set up an alias from 'MW' to 'MolWt'
    aliases = propbox.Aliases({"MW": "MolWt"})

    resolver = propbox.collect_resolvers()

    table = propbox.make_table_from_columns(
        resolver, {"smiles": ["C", "C#N", "Q", "c1ccccc1O", "[U](F)(F)(F)(F)(F)F"]})

    table.save(sys.stdout, ["is_large", "smiles"])

The output table from the above is::

    is_large  smiles
    False     C
    False     C#N
    *         Q
    True      c1ccccc1O
    True      [U](F)(F)(F)(F)(F)F

The alias resolver also takes part in exception chaining. Printing the
exception for the 'Q' entry gives::

    MW -> MolWt -> mol: ValueError("RDKit cannot parse the SMILES 'Q'",)

This MW/MolWt example is a bit contrived. It's best that you make
everything use a consistent naming scheme.

You're more likely to use aliases as a predictive model evolves. You
might start off with a blood-brain barrier model called "BBB". After a
year, you retrain. Now you have BBB_v1 and BBB_v2. That's fine - you
can implement both models. What aliases do is let you define that
"BBB" means "the most recent validated BBB model", and have it point
to BBB_v1 while BBB_v2 is under evaluation. Once it's validated,
change the alias to point to BBB_v2.
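The versioned-model idea comes down to a lookup table from the public name to the current implementation name. Here is a minimal sketch of that lookup in plain Python. The 'resolve_alias' helper is hypothetical and is not part of propbox, which does the forwarding through its Aliases resolver instead.

```python
def resolve_alias(name, aliases):
    """Follow alias links until reaching a name that is not itself an alias."""
    seen = [name]
    while name in aliases:
        name = aliases[name]
        if name in seen:
            # Guard against an alias table that loops back on itself.
            raise ValueError("alias cycle: %s" % " -> ".join(seen + [name]))
        seen.append(name)
    return name


aliases = {"BBB": "BBB_v1", "MW": "MolWt"}
before = resolve_alias("BBB", aliases)   # BBB_v2 is still under evaluation

aliases["BBB"] = "BBB_v2"                # BBB_v2 has now been validated
after = resolve_alias("BBB", aliases)

print(before, after, resolve_alias("MolWt", aliases))
```

Code that asks for "BBB" does not change when the alias is updated, while anything that asks for "BBB_v1" by name keeps getting the older model.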
(You'll likely need BBB_v1 for a while, since other models may have
been validated against it, and now need to be revalidated on BBB_v2.)

Property Modules
----------------

Over time you'll likely run into naming conflicts, where one person
uses $NAME to mean concept X while another person uses $NAME to mean
related concept Y. Or to make things more fun, you might have a
resolver based on OEChem, and another based on RDKit, and both provide
overlapping functionality.

Propbox implements a module system. The main resolver is in the main
module, and gets/sets the columns for the main table. What a
propbox.Module does is create a subtable. The resolver for the module
can get/set values in the subtable. The table and a subtable are
independent, so there is no conflict between the names.

The exceptions are the two sets of aliases. The module has a set of
output aliases which says that property X for the main table should be
resolved as property Y in the subtable. It also has input aliases,
which say that property B for the subtable should be resolved as
property A in the main table.

For example, the following file (named 'oceania.py') uses OEChem to
compute the OEGraphMols as 'mol', and the molecular weight as 'MW'.
It requires a SMILES string as the 'smiles' property::

    # This is 'oceania.py', based on OEChem
    from openeye.oechem import *
    from propbox import calculate, collect_resolvers

    @calculate()
    def calc_mol(smiles):
        mol = OEGraphMol()
        if OEParseSmiles(mol, smiles):
            return mol
        raise ValueError("OEChem cannot parse %r" % (smiles,))

    @calculate()
    def calc_MW(mol):
        return OECalculateMolecularWeight(mol)

    resolver = collect_resolvers()

While the following file (named 'eurasia.py') uses RDKit to compute
roughly equivalent properties::

    # This is 'eurasia.py', based on RDKit
    from rdkit import Chem
    from rdkit.Chem import Descriptors
    from propbox import calculate, collect_resolvers

    @calculate()
    def calc_mol(smiles):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError("RDKit cannot parse %r" % (smiles,))
        return mol

    @calculate()
    def calc_MW(mol):
        return Descriptors.MolWt(mol)

    resolver = collect_resolvers()

In "smith.py", I'll try to combine both resolvers into the same
Propbox::

    import propbox
    import eurasia, oceania
    resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])

This doesn't work. Propbox complains, saying::

    Traceback (most recent call last):
      File "smith.py", line 5, in <module>
        resolver = propbox.Propbox([eurasia.resolver, oceania.resolver])
      File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 134, in __init__
        self.add_resolver(resolver)
      File "/Users/dalke/cvses/propbox/propbox/__init__.py", line 163, in add_resolver
        self._name_to_resolver[output_name]))
    ValueError: Resolver <propbox.Propbox object at 0x108d5f810> defines the output name 'mol', which was already defined by resolver <propbox.Propbox object at 0x108d4b910>

To resolve the conflict, I'll place the oceania resolver in its own
propbox.Module.
I'll also say that "OE_MW" in the main table is an alias for "MW" in
the subtable, and that "smiles" in the subtable is an alias for
"smiles" in the main table::

    import sys
    import propbox
    import eurasia, oceania

    resolver = propbox.Propbox([
        eurasia.resolver,
        propbox.Module("oceania", oceania.resolver,
                       {"smiles": "smiles"}, {"OE_MW": "MW"}),
        ])

    table = propbox.make_table_from_columns(
        resolver, {"smiles": ["CC", "CCO", "O=O"]})
    table.save(sys.stdout, ["smiles", "MW", "OE_MW"])

As a result I can now compare the two molecular weights::

    smiles  MW      OE_MW
    CC      30.07   30.06904
    CCO     46.069  46.06844
    O=O     31.998  31.9988

Table configuration
-------------------

The table supports a "config" dictionary, which can be used to pass
configuration around. It's still experimental, and I don't really want
to document it.

It exists so you can define configuration information like the object
to use to de-salt, or if you don't want to specify the object, the
configuration file or the configuration data to use.

I can't help but wonder if it would be better to do configuration
through the resolvers, when I create the resolvers, rather than
through the table's "config".

An additional question is, how do I configure modules? I'm
experimenting with namespaces, so "oceania.SaltRemove_filename" would
be the salt remover for the oceania module. It's still up in the air.

Table cache
-----------

This is another experimental feature. The get_cache_value() and
set_cache_value() methods are used to get/set the subtable
information. The cache is also used for more long-term storage by the
resolver.

For example, if the config defines a SaltRemover filename, then the
resolver which actually needs to remove the salts must create a
SaltRemover, configured to use that filename. What then? Obviously I
don't want to recreate the SaltRemover for each record.
Instead, my options are to 1) store it in the table (in which case
it's reloaded each time I process a batch), 2) store the information
in some sort of local cache for each resolver (but when is the cache
reset? Does everything have a unique cache key?), or 3) something
else.

The get/set cache value API is used for #1. I don't think I like it
though.

Credits
=======

Andrew Dalke, Dalke Scientific, dalke@dalkescientific.com

9 June 2015, Trollhättan, Sweden