Smiles processing #135

cjmcgill · 2021-02-04T15:05:48Z

This PR consolidates the logic checks and changes to args.smiles_columns in one function and applies it during argument processing. These checks and changes are now held within an expanded preprocess_smiles_columns in chemprop.data.utils. The redundant checks and changes have been removed from other code locations.

For normal function of chemprop, preprocess_smiles_columns will be run during argument processing before args.smiles_columns is referenced anywhere in the code. This simplifies references to args.smiles_columns by enforcing that it contains a list of strings corresponding to the header of the data file from early on.

The following changes to smiles_columns are consolidated into the preprocess_smiles_columns function, with their previous locations in parentheses:

If None, make it a list of None (CommonArgs, SklearnPredictArgs)
Error if the number of smiles columns is different than the number of molecules (CommonArgs, SklearnPredictArgs)
If not a list, make it a list (preprocess_smiles_columns as constructed previously)
If None, make the smiles_columns the first n columns in the data file for n number of molecules (save_smiles_splits, get_task_names, get_smiles, get_data, make_predictions)
Error if the smiles_columns do not appear in the header of the data file (not previously checked)

For some of the utils functions, the smiles_columns argument is optional, with None as the default value: get_smiles, get_task_names, get_data, and save_smiles_splits. In these cases preprocess_smiles_columns is applied after a check of whether the value is a list. That check catches the default value None as well as calls from scripts that supply a single string instead of a list. Applying preprocess_smiles_columns here is less preferred but necessary to preserve flexible use of the functions for scripting. However, if this flexibility is not important, I would prefer to go back, make a list value of smiles_columns required for these functions, and remove the check.
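The list-type backstop described above can be sketched roughly as follows. This is only an illustration, not chemprop's code: read_smiles and to_column_list are hypothetical stand-ins for a utility such as get_smiles and for preprocess_smiles_columns, and the demo file contents are made up.

```python
import csv
import os
import tempfile
from typing import List, Optional, Union


def to_column_list(path: str, smiles_columns: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical stand-in for preprocess_smiles_columns (single-molecule case only)."""
    if smiles_columns is None:
        with open(path, newline="") as f:
            return [next(csv.reader(f))[0]]  # default to the first header column
    return [smiles_columns] if isinstance(smiles_columns, str) else list(smiles_columns)


def read_smiles(path: str,
                smiles_columns: Optional[Union[str, List[str]]] = None) -> List[List[str]]:
    """Hypothetical utility: returns one list of SMILES per data row."""
    # Backstop: anything that is not already a list (None or a bare string)
    # is normalized first; a list coming from argument processing passes through.
    if not isinstance(smiles_columns, list):
        smiles_columns = to_column_list(path, smiles_columns)
    with open(path, newline="") as f:
        return [[row[c] for c in smiles_columns] for row in csv.DictReader(f)]


# Small demo file with made-up data.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    f.write("smiles,target\nCCO,1.0\nCCN,0.5\n")

from_default = read_smiles(demo_path)            # None -> first column
from_string = read_smiles(demo_path, "smiles")   # bare string -> wrapped in a list
from_list = read_smiles(demo_path, ["smiles"])   # list -> used as-is
os.remove(demo_path)
```

All three calls take the same code path after the backstop, which is the flexibility for scripting that the check is meant to preserve.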

One issue with this implementation is a circular import that happens when using the from module import x form of imports, as is the style in chemprop. The preprocess_smiles_columns function is in chemprop.data.utils and must be imported into chemprop.args. However, chemprop.data.utils imports argument classes from chemprop.args for several of its functions, causing a circular import. I resolved this by using import chemprop.data.utils in chemprop.args and then referencing the function as chemprop.data.utils.preprocess_smiles_columns. This looks inelegant, but does resolve the circularity. I'm open to other ways of resolving it, but am not sure which is stylistically best (performing from chemprop.data import preprocess_smiles_columns within functions or doing a broad import chemprop.args in the utils file instead are possible alternatives).
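A toy reproduction of the two import styles may help make the difference concrete. The module names below (demo_args_from, demo_utils_from, demo_args_mod, demo_utils_mod) are hypothetical and only mimic the args/utils dependency; they are written to a temporary directory and imported from there. Note that in this toy setup the utils-side module is imported first, and the outcome of a circular import can depend on which module is loaded first.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical modules that mimic the args <-> utils dependency.
MODULES = {
    # Style 1: both sides use `from module import name`.
    "demo_args_from": """
        from demo_utils_from import preprocess

        class Args:
            def process(self, cols):
                return preprocess(cols)
    """,
    "demo_utils_from": """
        from demo_args_from import Args  # needed only for type hints

        def preprocess(cols):
            return [cols] if isinstance(cols, str) else cols
    """,
    # Style 2: the args side imports the whole module and looks the
    # function up at call time, after both modules have finished loading.
    "demo_args_mod": """
        import demo_utils_mod

        class Args:
            def process(self, cols):
                return demo_utils_mod.preprocess(cols)
    """,
    "demo_utils_mod": """
        from demo_args_mod import Args  # needed only for type hints

        def preprocess(cols):
            return [cols] if isinstance(cols, str) else cols
    """,
}

tmp = Path(tempfile.mkdtemp())
for name, source in MODULES.items():
    (tmp / f"{name}.py").write_text(textwrap.dedent(source))
sys.path.insert(0, str(tmp))
importlib.invalidate_caches()


def try_import(name: str) -> str:
    """Return 'ok' if the module imports, else the exception class name."""
    try:
        importlib.import_module(name)
        return "ok"
    except ImportError as exc:
        return type(exc).__name__


from_style = try_import("demo_utils_from")    # fails: name requested from a half-loaded module
module_style = try_import("demo_utils_mod")   # succeeds: only the module object is bound
processed = sys.modules["demo_args_mod"].Args().process("smiles")
```

The from-style import needs the function to exist the moment the import statement runs, while the module-style import only needs the module object and defers the attribute lookup until the function is actually called.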

cjmcgill · 2021-02-04T18:45:36Z

Added a dummy input file to the web app workflow so that it will not fail header checks in preprocess_smiles_columns.

kevingreenman

These changes make sense to me. I've confirmed that this works for training and predicting with 1 or 2 SMILES inputs. I'm not sure what the stylistically best way to handle the circular imports is, but I'm not sure that there will be any downsides to the way it's written here.

mliu49

I don't have any good alternatives for addressing the circular import. It's unfortunate that it's caused by importing the Arg classes just for typing. I think importing the full utils module is preferred over moving the import inside a function. For convenience, you could change the name of the module while importing, e.g. import chemprop.data.utils as dutils or something similar, but it's not necessary.

chemprop/args.py

chemprop/data/__init__.py

mliu49 · 2021-02-08T21:48:25Z

chemprop/utils.py


        indices_by_smiles = {}
-        for i, line in enumerate(reader):
-            smiles = tuple(line[j] for j in smiles_columns_index)
+        for i, row in tqdm(enumerate(reader)):


I think tqdm may not work properly when wrapping enumerate. enumerate(tqdm(reader)) should be used instead. https://github.com/tqdm/tqdm#faq-and-known-issues

I copied the pattern I saw in get_data which was previously the same. It seems to work there in a vital part of the chemprop code, so it might be okay for the specific case of when it's wrapping a csv reader. I can change the ordering in both places.
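One way to see why the ordering matters in general, and why it may matter less for a csv reader: tqdm takes its total from len() of the object it wraps when that is available. The sketch below uses only the standard library (no tqdm) to check which objects expose a length; has_len is a hypothetical helper and the rows are made-up data.

```python
import csv
import io


def has_len(obj) -> bool:
    """Hypothetical helper: True if len(obj) works."""
    try:
        len(obj)
        return True
    except TypeError:
        return False


rows = [["smiles", "target"], ["CCO", "1.0"], ["CCN", "0.5"]]

# A list has a length, so a progress bar wrapped directly around it can show a total.
list_has_len = has_len(rows)

# enumerate() hides the length, so wrapping it leaves a progress bar without a total.
enumerate_has_len = has_len(enumerate(rows))

# A csv reader has no length either way, so neither ordering yields a total for it.
reader = csv.reader(io.StringIO("smiles,target\nCCO,1.0\nCCN,0.5\n"))
reader_has_len = has_len(reader)
```

This is consistent with the observation that the existing pattern appears to work in get_data, although enumerate(tqdm(reader)) is still the ordering the tqdm FAQ recommends.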

chemprop/data/utils.py

mliu49 · 2021-02-08T22:26:53Z

chemprop/data/utils.py

+def preprocess_smiles_columns(path: str,
+                              smiles_columns: Optional[Union[str, List[Optional[str]]]],
+                              number_of_molecules: int = 1) -> List[Optional[str]]:


I have a high level idea for how this function could behave to avoid needing the list check in the util functions and the dummy file in the website. Let me know whether you think it would work.

(pseudocode)

if smiles_columns is None
    if path is a valid file
        return first number_of_molecules columns of file header
    else (e.g. `'None'` from the web view)
        return list of None of length number_of_molecules
else
    if smiles_columns is not a list
        wrap in list
    if path is a valid file
        check that smiles_columns exists in file header
    return smiles_columns
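A direct Python rendering of the pseudocode above might look like the following. It is a sketch, not the merged implementation: the signature is taken from the diff, but reading the header with csv, raising ValueError for a missing column, and the demo file are all assumptions made for illustration.

```python
import csv
import os
import tempfile
from typing import List, Optional, Union


def preprocess_smiles_columns(path: str,
                              smiles_columns: Optional[Union[str, List[Optional[str]]]],
                              number_of_molecules: int = 1) -> List[Optional[str]]:
    """Sketch of the pseudocode above; not the merged implementation."""
    if smiles_columns is None:
        if os.path.isfile(path):
            with open(path, newline="") as f:
                header = next(csv.reader(f))
            return header[:number_of_molecules]
        # No readable file (e.g. a placeholder path): fall back to a list of None.
        return [None] * number_of_molecules

    if not isinstance(smiles_columns, list):
        smiles_columns = [smiles_columns]
    if os.path.isfile(path):
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        missing = [c for c in smiles_columns if c not in header]
        if missing:
            # The error type is an assumption; the pseudocode only says "check".
            raise ValueError(f"Columns not found in header of {path}: {missing}")
    return smiles_columns


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")
    with open(data_path, "w", newline="") as f:
        f.write("smiles_a,smiles_b,target\nCCO,CCN,1.0\n")
    missing_path = os.path.join(tmp, "does_not_exist.csv")

    first_two = preprocess_smiles_columns(data_path, None, number_of_molecules=2)
    no_file = preprocess_smiles_columns(missing_path, None, number_of_molecules=2)
    wrapped = preprocess_smiles_columns(data_path, "smiles_a")
    try:
        preprocess_smiles_columns(data_path, ["not_a_column"])
        header_error = False
    except ValueError:
        header_error = True
```

As written, this follows the pseudocode literally and therefore does not compare the length of smiles_columns against number_of_molecules.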

This is a good structure. The valid file check seems a better way to address the dummy web input. And this is a good way to run if we want to always pass through it on the way through the utils.

My preference would be to make smiles_columns a required input to those functions with no default value and remove the preprocess_smiles_columns backstop from the utility functions entirely. But, always running through them maintains the flexibility well without changing the outside view of the utilities.

Wait, actually there's still a problem. Your structure doesn't include the check for whether the length of the smiles_columns is the same as the number_of_molecules. We need that check in the initial arguments processing, when the input will be a list but no guarantees it's the right length.

But the different utils functions don't have access to args.number_of_molecules, so preprocess_smiles_columns will error out for a mismatch if we ever try to run with 2 molecules.

I guess there are three options: 1) Stay with the existing structure with awkward type checks in the utils to detect if it needs to pass through the function again. 2) Make number_of_molecules an input to all the associated utils and always pass through the function. 3) Make a list input for smiles_columns required for the utils and remove the function check so it never passes through the function.

Thought about this a little more. Making more information required (either smiles_columns or number_of_molecules) really does hamper the usefulness of general utilities. More information is required to either run through all the time or run through never. So I'm going to keep it as a backstop and go with a similar check to what is there now, but paring it down to checking for None instead of being a list.

Thanks for catching that omission. I think keeping the check in the utils makes sense then.

I do think it would be good to add the file check to avoid needing a dummy file when only parsing certain args manually.

I am otherwise adopting your file structure. The file check is much better than making a dummy file just to read.

cjmcgill · 2021-02-09T16:55:42Z

@mliu49 I've incorporated the changes from your comments and our subsequent discussion.

Paring back the check to a check for None from a check for List meant that I needed to update a bunch of scripts. But we're better off with them changed.

Please take a look and see if it's ready to merge.

mliu49 · 2021-02-09T19:33:28Z

Sorry, it takes me a while to review since I'm not familiar with the code.

I have some general comments/questions about the new changes.

It seems that some of the scripts (like similarity.py) are not designed to support multiple SMILES columns. For those, it does not seem like a good idea to change the arguments to lists.
Since preprocess_smiles_columns does properly handle string input, and you've added preprocessing to all of the scripts, is changing the arg type necessary?
I think I'm a bit confused about the change from checking list type to checking for None in the utils. It seems that switching to only check for None has led to more complexity?
If preprocessing smiles_columns in the scripts is kept, I think it would be cleaner to define the process_args method in the Args classes.

cjmcgill · 2021-02-10T01:52:26Z

@mliu49 sorry that I'm putting reviews on you that need so much back and forth. I thought the changes were cleaner than this. Looking at your notes, I think that I got ahead of myself and made a few changes on a somewhat arbitrary judgement that the list check didn't feel like a good thing, and that cascaded into making a lot of changes.

You're right, not everything needs to be updated to smiles_columns. I've ended up adding "fixes" that are redundant to the checks I made in preprocess_smiles_columns which were intended to remove redundancy!

I've gone back and reverted the check in the utils to a list check. And I've gone back and removed the preprocess_smiles_columns from each script because none of them had a particular need for it (no multiple instances demanding centralization and consistency). I did leave minor changes in them to make them accept smiles_columns as inputs and feed it to util functions where available.

mliu49 · 2021-02-10T15:35:47Z

Thanks for making the changes!

I noticed that TAP does not automatically support Union types (https://github.com/swansonk14/typed-argument-parser#complex-types). Running the scripts gives the following error. I think either str or List[str] should be used for each script depending on whether or not multiple SMILES columns are supported (though that might need testing to determine). It seems that find_similar_mols.py, overlap.py, and save_features.py definitely don't support multiple SMILES, based on their use of the flatten argument to get_smiles.

Traceback (most recent call last):
  File "/Users/mjliu/Code/chemprop/scripts/save_features.py", line 116, in <module>
    generate_and_save_features(Args().parse_args())
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 108, in __init__
    self._configure()
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 317, in _configure
    self._add_arguments()
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 269, in _add_arguments
    self._add_argument(f'--{variable}')
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 202, in _add_argument
    and isinstance(None, get_args(var_type)[1])
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/typing.py", line 697, in __instancecheck__
    return self.__subclasscheck__(type(obj))
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/typing.py", line 700, in __subclasscheck__
    raise TypeError("Subscripted generics cannot be used with"
TypeError: Subscripted generics cannot be used with class and instance checks
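The failing expression in the traceback, isinstance(None, get_args(var_type)[1]), can be reproduced with the standard typing module alone (Python 3.8 or later). The annotations below are assumptions chosen to illustrate the two cases; the exact annotation used in the scripts is not shown in this thread.

```python
from typing import List, Optional, Union, get_args

# Optional[str] is Union[str, None]; its second member is NoneType, a real class.
optional_args = get_args(Optional[str])
optional_check = isinstance(None, optional_args[1])

# For Union[str, List[str]] the second member is the subscripted generic List[str],
# which cannot be used as the second argument of isinstance().
union_args = get_args(Union[str, List[str]])
try:
    isinstance(None, union_args[1])
    union_error = None
except TypeError as exc:
    union_error = str(exc)
```

The same isinstance check that works for an Optional annotation therefore raises a TypeError for a Union that contains a subscripted generic such as List[str], regardless of what the user passes on the command line.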

mliu49 · 2021-02-10T15:38:55Z

scripts/class_balance.py

@@ -19,14 +19,14 @@ class Args(Tap):
    split_type: Literal['random', 'scaffold'] = 'scaffold'  # Split type, either "random" or "scaffold"


-def class_balance(data_path: str, split_type: str):
+def class_balance(data_path: str, split_type: str, args: Args):


args is currently a global variable in this module. I think making it an argument is a good idea, but it would be best to remove the global variable definition below.

In the if __name__ == '__main__' section, you could pass Args().parse_args() directly to the class_balance function instead of storing it in args first.

This is a good suggestion! Thanks!

When you say it's a global variable in this module, how can you tell that in this script? Do you mean just passively because of the ordering? That's not the case for chemprop generally though I don't think.

Yeah, the first hint was that the script (presumably) worked before you passed args as an argument, even though args was used in the class_balance function. The second is that PyCharm flags the argument name as shadowing a name from the outer scope.

Any variable defined at the module level is automatically global, and using a global variable is also automatic if it does not exist within the function scope. The global keyword is only needed to "export" a local variable to the global scope, e.g. when setting a global variable from inside a function.
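A small self-contained illustration of these scoping rules, with made-up names and values (nothing here is taken from class_balance.py):

```python
args = "module-level value"  # defined at module level, so it is global


def reads_global() -> str:
    # No local named `args`, so the lookup falls through to the module scope.
    return args


def shadows_global(args: str) -> str:
    # The parameter shadows the module-level name inside this function.
    return args


def assigns_local() -> None:
    args = "local only"  # creates a new local; the global is untouched


def assigns_global() -> None:
    global args  # required to rebind the module-level name
    args = "rebound"


read_value = reads_global()
shadow_value = shadows_global("parameter value")
assigns_local()
after_local = args
assigns_global()
after_global = args
```

The reads_global case is why a script can appear to work even though a function never receives args as a parameter, and the shadows_global case is what an IDE warning about shadowing a name from the outer scope refers to.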

cjmcgill · 2021-02-11T01:33:50Z

@mliu49 I've removed the Union typing. And now I have gone through the different scripts and individually tested them so I know which ones can function with multiple molecules. Should be good to go now.

create_crossval_splits works with a list of multiple smiles_columns.
find_similar_mols could work with multiple smiles_columns but would probably be unintended behavior so I reverted to a single string entry for smiles_column.
overlap didn't originally take multiple smiles_columns but I updated the code so it could and tested it accordingly.
save_features doesn't work with multiple smiles_columns. I reverted all changes to it.
split_data works with multiple smiles_columns.
class_balance has a hardcoded file reference and I don't have a way to test it. I can't guarantee that this script is working without the intended file context, but having fixed minor errors in it, I am at least confident that it's closer to working now than it was before.

mliu49 · 2021-02-11T15:23:26Z

Thanks, I think it looks good to me now!

Would you be comfortable with rebasing this branch on top of master and dropping or squashing some of the reverted commits?

cjmcgill · 2021-02-12T06:44:02Z

@mliu49 I am not sure if I did that right. I rebased and squashed all the associated commits. But not sure if I did it in the right order to clean up the commit log.

mliu49 · 2021-02-12T15:38:50Z

I think you did it right, but then you did a git pull at the end which merged in the original version of the branch. Instead, you needed to do git push --force to overwrite the remote branch. I went ahead and removed the merge commit.

cjmcgill · 2021-02-12T16:26:26Z

Are we clear to merge it now?

cjmcgill requested review from swansonk14, mliu49, maforsuelo, oscarwumit, fhvermei and hesther February 4, 2021 15:05

chemprop deleted a comment from lgtm-com bot Feb 4, 2021

cjmcgill requested a review from kevingreenman February 8, 2021 16:42

kevingreenman approved these changes Feb 8, 2021

mliu49 reviewed Feb 8, 2021

cjmcgill requested review from mliu49 and kevingreenman February 9, 2021 19:03

chemprop deleted a comment from lgtm-com bot Feb 10, 2021

mliu49 reviewed Feb 10, 2021

All smiles_columns changes consolidated into preprocess_smiles_columns

bc39eba

mliu49 force-pushed the smiles_processing branch from 3837992 to bc39eba Compare February 12, 2021 15:36

mliu49 merged commit 0c46945 into master Feb 12, 2021

mliu49 deleted the smiles_processing branch February 12, 2021 16:34

cjmcgill mentioned this pull request Feb 13, 2021

New Function - Save the latent representation for a molecules #119

Merged
