Smiles processing #135

Merged
merged 1 commit into from
Feb 12, 2021

Contributor

@cjmcgill cjmcgill commented Feb 4, 2021

This PR consolidates the logic checks and changes to args.smiles_columns in one function and applies it during argument processing. These checks and changes are now held within an expanded preprocess_smiles_columns in chemprop.data.utils. The redundant checks and changes have been removed from other code locations.

For normal function of chemprop, preprocess_smiles_columns will be run during argument processing before args.smiles_columns is referenced anywhere in the code. This simplifies references to args.smiles_columns by enforcing that it contains a list of strings corresponding to the header of the data file from early on.

The following changes to smiles_columns are consolidated into the preprocess_smiles_columns function, with the previous locations of each change in parentheses:

  • If None, make it a list of None (CommonArgs,SklearnPredictArgs)
  • Error if the number of smiles columns is different than the number of molecules (CommonArgs,SklearnPredictArgs)
  • If not a list, make it a list (preprocess_smiles_columns as constructed previously)
  • If None, make the smiles_columns the first n columns in the data file for n number of molecules (save_smiles_splits,get_task_names,get_smiles,get_data,make_predictions)
  • Error if the smiles_columns do not appear in the header of the data file (not previously checked)

Some of the utils functions take smiles_columns as an optional argument with None as the default value: get_smiles, get_task_names, get_data, and save_smiles_splits. In these cases preprocess_smiles_columns is applied after a check on whether the value is a list, which catches the default value None as well as direct calls from scripts that supply a single string instead of a list. Applying preprocess_smiles_columns here is less preferred but necessary to preserve flexible use of the functions for scripting. However, if this flexibility is not important, I would prefer to go back and make a list entry of smiles_columns required for these functions and remove the check.
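
A minimal sketch of what such a list-type backstop inside a utility function might look like. The names `get_smiles_sketch` and `normalize_columns` are hypothetical stand-ins, not chemprop's functions, and the stand-in only covers the single-molecule `None` and bare-string cases.

```python
import csv
import io
from typing import List, Optional, Union


def normalize_columns(header: List[str],
                      smiles_columns: Optional[Union[str, List[str]]]) -> List[str]:
    """Hypothetical stand-in for preprocess_smiles_columns (single molecule only)."""
    if smiles_columns is None:
        return header[:1]            # default: first column of the data file
    if isinstance(smiles_columns, str):
        return [smiles_columns]      # wrap a bare string in a list
    return smiles_columns


def get_smiles_sketch(csv_text: str,
                      smiles_columns: Optional[Union[str, List[str]]] = None) -> List[List[str]]:
    """Hypothetical utility with an optional smiles_columns argument."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Backstop: anything that is not already a list (None or a bare string)
    # is routed through the preprocessing step.
    if not isinstance(smiles_columns, list):
        smiles_columns = normalize_columns(list(reader.fieldnames), smiles_columns)
    return [[row[c] for c in smiles_columns] for row in reader]


data = "smiles,solvent,target\nCCO,O,1.0\nCCN,CO,2.0\n"
print(get_smiles_sketch(data))                         # default None
print(get_smiles_sketch(data, "solvent"))              # bare string
print(get_smiles_sketch(data, ["smiles", "solvent"]))  # already a list
```

A caller that already holds a list skips the preprocessing step entirely, which is the scripting flexibility the paragraph above wants to preserve.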

One issue with this implementation is a circular import that happens when using the from module import x form of imports, as is the style in chemprop. The preprocess_smiles_columns function is in chemprop.data.utils and must be imported into chemprop.args. However, chemprop.data.utils imports argument classes from chemprop.args for several of its functions, causing a circular import. I resolved this by using import chemprop.data.utils in chemprop.args and then referencing the function as chemprop.data.utils.preprocess_smiles_columns. This looks inelegant, but does resolve the circularity. I'm open to other ways of resolving it, but am not sure which is stylistically best (performing from chemprop.data import preprocess_smiles_columns within functions or doing a broad import chemprop.args in the utils file instead are possible alternatives).

@cjmcgill
Contributor Author

cjmcgill commented Feb 4, 2021

Added a dummy input file to the web app workflow so that it will not fail header checks in preprocess_smiles_columns.

@chemprop chemprop deleted a comment from lgtm-com bot Feb 4, 2021
Member

@kevingreenman kevingreenman left a comment

These changes make sense to me. I've confirmed that this works for training and predicting with 1 or 2 SMILES inputs. I'm not sure what the stylistically best way to handle the circular imports is, but I'm not sure that there will be any downsides to the way it's written here.

Contributor

@mliu49 mliu49 left a comment

I don't have any good alternatives for addressing the circular import. It's unfortunate that it's caused by importing the Arg classes just for typing. I think importing the full utils module is preferred over moving the import inside a function. For convenience, you could change the name of the module while importing, e.g. import chemprop.data.utils as dutils or something similar, but it's not necessary.

chemprop/args.py (outdated)
chemprop/data/__init__.py (outdated)

  indices_by_smiles = {}
- for i, line in enumerate(reader):
-     smiles = tuple(line[j] for j in smiles_columns_index)
+ for i, row in tqdm(enumerate(reader)):
Contributor

I think tqdm may not work properly when wrapping enumerate. enumerate(tqdm(reader)) should be used instead. https://github.com/tqdm/tqdm#faq-and-known-issues

Contributor Author

@cjmcgill cjmcgill Feb 9, 2021

I copied the pattern I saw in get_data which was previously the same. It seems to work there in a vital part of the chemprop code, so it might be okay for the specific case of when it's wrapping a csv reader. I can change the ordering in both places.
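
To illustrate why the ordering can matter: tqdm reads the total from `len()` of the iterable it wraps when one is available, and `enumerate` hides that length. The sketch below checks this with the standard library only; `wrapper` is a placeholder standing in for `tqdm`, which is deliberately not imported here.

```python
import csv
import io

rows = ["CCO,1.0", "CCN,2.0", "CCC,3.0"]

# A list knows its length, so a progress bar wrapped directly around it
# can display a total.
print(hasattr(rows, "__len__"))

# enumerate() hides the length of whatever it wraps, so a progress bar
# wrapped around enumerate(...) cannot infer the total.
print(hasattr(enumerate(rows), "__len__"))

# A csv reader has no length in the first place, which may be why the
# existing tqdm(enumerate(reader)) pattern showed no visible difference.
reader = csv.reader(io.StringIO("\n".join(rows)))
print(hasattr(reader, "__len__"))

# The ordering suggested in the review: progress wrapper inside,
# enumerate outside. `wrapper` is a placeholder for tqdm.
wrapper = iter
for i, row in enumerate(wrapper(reader)):
    print(i, row)
```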

chemprop/data/utils.py (outdated)
Comment on lines +19 to +21
def preprocess_smiles_columns(path: str,
                              smiles_columns: Optional[Union[str, List[Optional[str]]]],
                              number_of_molecules: int = 1) -> List[Optional[str]]:
Contributor

I have a high level idea for how this function could behave to avoid needing the list check in the util functions and the dummy file in the website. Let me know whether you think it would work.

(pseudocode)

if smiles_columns is None
    if path is a valid file
        return first number_of_molecules columns of file header
    else (e.g. `'None'` from the web view)
        return list of None of length number_of_molecules
else
    if smiles_columns is not a list
        wrap in list
    if path is a valid file
        check that smiles_columns exists in file header
    return smiles_columns
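
One way the pseudocode above might translate to Python is sketched below. This is an illustration under assumptions, not the merged chemprop implementation: the function name carries a `_sketch` suffix to mark it as hypothetical, the header is read with the standard `csv` module, and a missing column raises `ValueError`.

```python
import csv
import os
import tempfile
from typing import List, Optional, Union


def preprocess_smiles_columns_sketch(
        path: str,
        smiles_columns: Optional[Union[str, List[Optional[str]]]],
        number_of_molecules: int = 1) -> List[Optional[str]]:
    """Follows the pseudocode above; hypothetical, not chemprop's function."""
    header = None
    if os.path.isfile(path):
        with open(path, newline="") as f:
            header = next(csv.reader(f))

    if smiles_columns is None:
        if header is not None:
            return header[:number_of_molecules]
        # No readable file (e.g. the string 'None' from the web view).
        return [None] * number_of_molecules

    if not isinstance(smiles_columns, list):
        smiles_columns = [smiles_columns]
    if header is not None:
        missing = [c for c in smiles_columns if c not in header]
        if missing:
            raise ValueError(f"Columns not found in header: {missing}")
    return smiles_columns


# Usage with a throwaway CSV file and a path that does not exist.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.csv")
    with open(data_path, "w", newline="") as f:
        f.write("smiles,solvent,target\nCCO,O,1.0\n")
    print(preprocess_smiles_columns_sketch(data_path, None, 2))
    print(preprocess_smiles_columns_sketch(data_path, "solvent"))
    print(preprocess_smiles_columns_sketch(os.path.join(tmp, "missing.csv"), None, 2))
```

Because the file check replaces the header lookup when no file exists, no dummy input file is needed for callers that only parse arguments.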

Contributor Author

This is a good structure. The valid file check seems a better way to address the dummy web input. And this is a good way to run if we want to always pass through it on the way through the utils.

My preference would be to make smiles_columns a required input to those functions with no default value and remove the preprocess_smiles_columns backstop from the utility functions entirely. But, always running through them maintains the flexibility well without changing the outside view of the utilities.

Contributor Author

Wait, actually there's still a problem. Your structure doesn't include the check for whether the length of the smiles_columns is the same as the number_of_molecules. We need that check in the initial arguments processing, when the input will be a list but no guarantees it's the right length.

But the different utils functions don't have access to args.number_of_molecules, so preprocess_smiles_columns will error out for a mismatch if we ever try to run with 2 molecules.

I guess there are three options: 1) Stay with the existing structure with awkward type checks in the utils to detect if it needs to pass through the function again. 2) Make number_of_molecules an input to all the associated utils and always pass through the function. 3) Make a list input for smiles_columns required for the utils and remove the function check so it never passes through the function.
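
The mismatch described above can be sketched as follows. `check_smiles_columns_length` is a hypothetical helper written for illustration; the source only states that such a length check exists, not how it is implemented.

```python
from typing import List, Optional


def check_smiles_columns_length(smiles_columns: List[Optional[str]],
                                number_of_molecules: int = 1) -> None:
    """Hypothetical helper: raise if column count and molecule count differ."""
    if len(smiles_columns) != number_of_molecules:
        raise ValueError(
            f"Got {len(smiles_columns)} smiles columns "
            f"but number_of_molecules is {number_of_molecules}.")


# Argument processing knows number_of_molecules, so the check passes.
check_smiles_columns_length(["smiles", "solvent"], number_of_molecules=2)

# A utility that cannot supply number_of_molecules falls back to the default
# of 1 and rejects the same, otherwise valid, two-column list.
try:
    check_smiles_columns_length(["smiles", "solvent"])
except ValueError as err:
    print(f"Rejected: {err}")
```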

Contributor Author

@cjmcgill cjmcgill Feb 9, 2021

Thought about this a little more. Making more information required (either smiles_columns or number_of_molecules) really does hamper the usefulness of general utilities. More information is required to either run through all the time or run through never. So I'm going to keep it as a backstop and go with a similar check to what is there now, but paring it down to checking for None instead of being a list.

Contributor

Thanks for catching that omission. I think keeping the check in the utils makes sense then.

I do think it would be good to add the file check to avoid needing a dummy file when only parsing certain args manually.

Contributor Author

I am otherwise adopting your file structure. The file check is much better than making a dummy file just to read.

@cjmcgill
Contributor Author

cjmcgill commented Feb 9, 2021

@mliu49 I've incorporated the changes from your comments and our subsequent discussion.

Paring back the check to a check for None from a check for List meant that I needed to update a bunch of scripts. But we're better off with them changed.

Please take a look and see if it's ready to merge.

@mliu49
Contributor

mliu49 commented Feb 9, 2021

Sorry, it takes me a while to review since I'm not familiar with the code.

I have some general comments/questions about the new changes.

  • It seems that some of the scripts (like similarity.py) are not designed to support multiple SMILES columns. For those, it does not seem like a good idea to change the arguments to lists.
  • Since preprocess_smiles_columns does properly handle string input, and you've added preprocessing to all of the scripts, is changing the arg type necessary?
  • I think I'm a bit confused about the change from checking list type to checking for None in the utils. It seems that switching to only check for None has led to more complexity?
  • If preprocessing smiles_columns in the scripts is kept, I think it would be cleaner to define the process_args method in the Args classes.

@chemprop chemprop deleted a comment from lgtm-com bot Feb 10, 2021
@cjmcgill
Contributor Author

@mliu49 sorry I'm putting reviews on you that need so much back and forth. I thought they were cleaner than this. Looking at your notes, I think that I got ahead of myself and made a few changes on a somewhat arbitrary judgement that the list check didn't feel like a good thing, and that cascaded into making a lot of changes.

You're right, not everything needs to be updated to smiles_columns. I've ended up adding "fixes" that are redundant to the checks I made in preprocess_smiles_columns which were intended to remove redundancy!

I've gone back and reverted the check in the utils to a list check. And I've gone back and removed the preprocess_smiles_columns from each script because none of them had a particular need for it (no multiple instances demanding centralization and consistency). I did leave minor changes in them to make them accept smiles_columns as inputs and feed it to util functions where available.

@mliu49
Contributor

mliu49 commented Feb 10, 2021

Thanks for making the changes!

I noticed that TAP does not automatically support Union types (https://github.com/swansonk14/typed-argument-parser#complex-types). Running the scripts gives the following error. I think either str or List[str] should be used for each script depending on whether or not multiple SMILES columns are supported (though that might need testing to determine). It seems that find_similar_mols.py, overlap.py, and save_features.py definitely don't support multiple SMILES, based on their use of the flatten argument to get_smiles.

Traceback (most recent call last):
  File "/Users/mjliu/Code/chemprop/scripts/save_features.py", line 116, in <module>
    generate_and_save_features(Args().parse_args())
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 108, in __init__
    self._configure()
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 317, in _configure
    self._add_arguments()
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 269, in _add_arguments
    self._add_argument(f'--{variable}')
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/site-packages/tap/tap.py", line 202, in _add_argument
    and isinstance(None, get_args(var_type)[1])
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/typing.py", line 697, in __instancecheck__
    return self.__subclasscheck__(type(obj))
  File "/Users/mjliu/miniconda3/envs/chemprop/lib/python3.9/typing.py", line 700, in __subclasscheck__
    raise TypeError("Subscripted generics cannot be used with"
TypeError: Subscripted generics cannot be used with class and instance checks
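
The final frames of the traceback can be reproduced with the standard library alone. The sketch below assumes the script's annotation was something like `Union[str, List[str]]`; the exact annotation is not shown in this conversation, so `var_type` is an assumption.

```python
from typing import List, Union, get_args

# Assumed annotation; the real one in the scripts is not shown above.
var_type = Union[str, List[str]]
members = get_args(var_type)
print(members)

# isinstance works with a plain class such as str ...
print(isinstance(None, members[0]))

# ... but not with a subscripted generic such as List[str], which is what
# the failing line isinstance(None, get_args(var_type)[1]) ends up checking.
try:
    isinstance(None, members[1])
except TypeError as err:
    print(f"TypeError: {err}")
```

Annotating each script argument as plain `str` or `List[str]`, as suggested above, avoids handing a subscripted generic to `isinstance` through this code path.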

@@ -19,14 +19,14 @@ class Args(Tap):
split_type: Literal['random', 'scaffold'] = 'scaffold' # Split type, either "random" or "scaffold"


- def class_balance(data_path: str, split_type: str):
+ def class_balance(data_path: str, split_type: str, args: Args):
Contributor

args is currently a global variable in this module. I think making it an argument is a good idea, but it would be best to remove the global variable definition below.

In the if __name__ == '__main__' section, you could pass Args().parse_args() directly to the class_balance function instead of storing it in args first.

Contributor Author

@cjmcgill cjmcgill Feb 10, 2021

This is a good suggestion! Thanks!

When you say it's a global variable in this module, how can you tell that in this script? Do you mean just passively because of the ordering? That's not the case for chemprop generally though I don't think.

Contributor

Yeah, the first hint was that the script (presumably) worked before you passed args as an argument, even though args was used in the class_balance function. The second is that PyCharm flags the argument name as shadowing a name from the outer scope.

Any variable defined at the module level is automatically global, and using a global variable is also automatic if it does not exist within the function scope. The global keyword is only needed to "export" a local variable to the global scope, e.g. when setting a global variable from inside a function.
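
A small, self-contained illustration of the scoping rules described above. The names `counter`, `read_counter`, `shadow_counter`, and `bump_counter` are made up for this example and do not come from the chemprop scripts.

```python
counter = 0  # defined at module level, so it is global


def read_counter() -> int:
    # No local named `counter`, so the lookup falls through to the global.
    return counter


def shadow_counter(counter: int) -> int:
    # The parameter shadows the module-level name inside this function,
    # which is the situation PyCharm flags.
    return counter + 1


def bump_counter() -> None:
    # Rebinding the global from inside a function needs the keyword;
    # without it, `counter += 1` would raise UnboundLocalError.
    global counter
    counter += 1


print(read_counter())
print(shadow_counter(10))
bump_counter()
print(read_counter())
```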

@cjmcgill
Contributor Author

cjmcgill commented Feb 11, 2021

@mliu49 I've removed the Union typing. And now I have gone through the different scripts and individually tested them so I know which ones can function with multiple molecules. Should be good to go now.

  • create_crossval_splits works with a list of multiple smiles_columns.
  • find_similar_mols could work with multiple smiles_columns, but that would probably be unintended behavior, so I reverted to a single string entry for smiles_column.
  • overlap didn't originally take multiple smiles_columns, but I updated the code so it could and tested it accordingly.
  • save_features doesn't work with multiple smiles_columns. I reverted all changes to it.
  • split_data works with multiple smiles_columns.
  • class_balance has a hardcoded file reference and I don't have a way to test it. I can't guarantee that this script is working without the intended file context, but having fixed minor errors in it, I am at least confident that it's closer to working now than it was before.

@mliu49
Contributor

mliu49 commented Feb 11, 2021

Thanks, I think it looks good to me now!

Would you be comfortable with rebasing this branch on top of master and dropping or squashing some of the reverted commits?

@cjmcgill
Contributor Author

cjmcgill commented Feb 12, 2021

> Thanks, I think it looks good to me now!
>
> Would you be comfortable with rebasing this branch on top of master and dropping or squashing some of the reverted commits?

@mliu49 I am not sure if I did that right. I rebased and squashed all the associated commits. But not sure if I did it in the right order to clean up the commit log.

@mliu49
Contributor

mliu49 commented Feb 12, 2021

I think you did it right, but then you did a git pull at the end which merged in the original version of the branch. Instead, you needed to do git push --force to overwrite the remote branch. I went ahead and removed the merge commit.

@cjmcgill
Contributor Author

cjmcgill commented Feb 12, 2021

> I think you did it right, but then you did a git pull at the end which merged in the original version of the branch. Instead, you needed to do git push --force to overwrite the remote branch. I went ahead and removed the merge commit.

Are we clear to merge it now?

@mliu49 mliu49 merged commit 0c46945 into master Feb 12, 2021
@mliu49 mliu49 deleted the smiles_processing branch February 12, 2021 16:34