-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Smiles processing #135
Smiles processing #135
Conversation
Added a dummy input file to the web app workflow so that it will not fail header checks in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes make sense to me. I've confirmed that this works for training and predicting with 1 or 2 SMILES inputs. I'm not sure what the stylistically best way to handle the circular imports is, but I'm not sure that there will be any downsides to the way it's written here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any good alternatives for addressing the circular import. It's unfortunate that it's caused by importing the Arg classes just for typing. I think importing the full utils module is preferred over moving the import inside a function. For convenience, you could change the name of the module while importing, e.g. import chemprop.data.utils as dutils
or something similar, but it's not necessary.
chemprop/utils.py
Outdated
|
||
indices_by_smiles = {} | ||
for i, line in enumerate(reader): | ||
smiles = tuple(line[j] for j in smiles_columns_index) | ||
for i, row in tqdm(enumerate(reader)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think tqdm may not work properly when wrapping enumerate. enumerate(tqdm(reader))
should be used instead. https://github.com/tqdm/tqdm#faq-and-known-issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied the pattern I saw in get_data
which was previously the same. It seems to work there in a vital part of the chemprop code, so it might be okay for the specific case of when it's wrapping a csv reader. I can change the ordering in both places.
def preprocess_smiles_columns(path: str, | ||
smiles_columns: Optional[Union[str, List[Optional[str]]]], | ||
number_of_molecules: int = 1) -> List[Optional[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a high level idea for how this function could behave to avoid needing the list check in the util functions and the dummy file in the website. Let me know whether you think it would work.
(pseudocode)
if smiles_columns is None
if path is a valid file
return first number_of_molecules columns of file header
else (e.g. `'None'` from the web view)
return list of None of length number_of_molecules
else
if smiles_columns is not a list
wrap in list
if path is a valid file
check that smiles_columns exists in file header
return smiles_columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good structure. The valid file check seems a better way to address the dummy web input. And this is a good way to run if we want to always pass through it on the way through the utils.
My preference would be to make smiles_columns a required input to those functions with no default value and remove the preprocess_smiles_columns
backstop from the utility functions entirely. But, always running through them maintains the flexibility well without changing the outside view of the utilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, actually there's still a problem. Your structure doesn't include the check for whether the length of the smiles_columns is the same as the number_of_molecules. We need that check in the initial arguments processing, when the input will be a list but no guarantees it's the right length.
But the different utils functions don't have access the args.number_of_molecules, so preprocess_smiles_columns
will error out for a mismatch if we ever try to run with 2 molecules.
I guess there are three options: 1) Stay with the existing structure with awkward type checks in the utils to detect if it needs to pass through the function again. 2) Make number_of_molecules an input to all the associated utils and always pass through the function. 3) Make a list input for smiles_columns required for the utils and remove the function check so it never passes through the function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about this a little more. Making more information required (either smiles_columns or number_of_molecules) really does hamper the usefulness of general utilities. More information is required to either run through all the time or run through never. So I'm going to keep it as a backstop and go with a similar check to what is there now, but paring it down to checking for None
instead of being a list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching that omission. I think keeping the check in the utils makes sense then.
I do think it would be good to add the file check to avoid needing a dummy file when only parsing certain args manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am otherwise adopting your file structure. The file check is much better than making a dummy file just to read.
@mliu49 I've incorporated the changes from your comments and our subsequent discussion. Paring back the check to a check for None from a check for List meant that I needed to update a bunch of scripts. But we're better off with them changed. Please take a look and see if it's ready to merge. |
Sorry, it takes me a while to review since I'm not familiar with the code. I have some general comments/questions about the new changes.
|
@mliu49 sorry I'm putting reviews on you that's needing so much back and forth. Thought they were cleaner than this. Looking at your notes, I think that I got ahead of myself and made a few changes on a somewhat arbitrary judgement that the list check didn't feel like a good thing, that cascaded into making a lot of changes. You're right, not everything needs to be updated to smiles_columns. I've ended up adding "fixes" that are redundant to the checks I made in I've gone back and reverted the check in the utils to a list check. And I've gone back and removed the |
Thanks for making the changes! I noticed that TAP does not automatically support Union types (https://github.com/swansonk14/typed-argument-parser#complex-types). Running the scripts gives the following error. I think either str or List[str] should be used for each script depending on whether or not multiple SMILES columns is supported (though that might need testing to determine). It seems that
|
scripts/class_balance.py
Outdated
@@ -19,14 +19,14 @@ class Args(Tap): | |||
split_type: Literal['random', 'scaffold'] = 'scaffold' # Split type, either "random" or "scaffold" | |||
|
|||
|
|||
def class_balance(data_path: str, split_type: str): | |||
def class_balance(data_path: str, split_type: str, args: Args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args
is currently a global variable in this module. I think making it an argument is a good idea, but it would be best to remove the global variable definition below.
In the if __name__ == '__main__'
section, You could pass Args().parse_args()
directly to the class_balance
function instead of storing it in args
first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good suggestion! Thanks!
When you say it's a global variable in this module, how can you tell that in this script? Do you mean just passively because of the ordering? That's not the case for chemprop generally though I don't think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the first hint was that the script (presumably) worked before you passed args as an argument, even though args was used in the class_balance
function. The second is that PyCharm flags the argument name as shadowing a name from the outer scope.
Any variable defined at the module level is automatically global, and using a global variable is also automatic if it does not exist within the function scope. The global
keyword is only needed to "export" a local variable to the global scope, e.g. when setting a global variable from inside a function.
@mliu49 I've removed the Union typing. And now I have gone through the different scripts and individually tested them so I know which ones can function with multiple molecules. Should be good to go now.
|
Thanks, I think it looks good to me now! Would you be comfortable with rebasing this branch on top of master and dropping or squashing some of the reverted commits? |
@mliu49 I am not sure if I did that right. I rebased and squashed all the associated commits. But not sure if I did it in the right order to clean up the commit log. |
3837992
to
bc39eba
Compare
I think you did it right, but then you did a git pull at the end which merged in the original version of the branch. Instead, you needed to do |
Are we clear to merge it now? |
This PR consolidates the logic checks and changes to
args.smiles_columns
in one function and applies it during argument processing. These checks and changes are now held within an expandedpreprocess_smiles_columns
inchemprop.data.utils
. The redundant checks and changes have been removed from other code locations.For normal function of chemprop,
preprocess_smiles_columns
will be run during argument processing beforeargs.smiles_columns
is referenced anywhere in the code. This simplifies references toargs.smiles_columns
by enforcing that it contains a list of strings corresponding to the header of the data file from early on.The changes to
smiles_columns
consolidated into thepreprocess_smiles_columns
function, with the previous locations for the changes in parentheses:None
, make it a list ofNone
(CommonArgs
,SklearnPredictArgs
)CommonArgs
,SklearnPredictArgs
)preprocess_smiles_columns
as constructed previously)None
, make thesmiles_columns
the first n columns in the data file for n number of molecules (save_smiles_splits
,get_task_names
,get_smiles
,get_data
,make_predictions
)smiles_columns
do not appear in the header of the data file (not previously checked)Some of the utils functions are optional for the
smiles_columns
arguments withNone
as the default value:get_smiles
,get_task_names
,get_data
, andsave_smiles_splits
. In these casespreprocess_smiles_columns
will be used following a logic check to see if the value type is list, which will catch the default valueNone
as well as direct references from scripts that are supplying a single string value instead of a list. Applyingpreprocess_smiles_columns
here is less preferred but necessary to preserve flexible use of the functions for scripting. However, if this flexibility is not important, I would prefer to go back and make a list entry ofsmiles_columns
required for these functions and remove the check.One issue with this implementation is a circular import that happens when using the
from module import x
form of imports as is the style in chemprop. Thepreprocess_smiles_columns
function is inchemprop.data.utils
and must be imported intochemprop.args
. However,chemprop.data.utils
imports argument classes fromchemprop.args
for several of its functions, causing a circular import. I resolved this by usingimport chemprop.data.utils
inchemprop.args
and then referencing the function aschemprop.data.utils.preprocess_smiles_columns
. This looks inelegant, but does resolve the circularity. I'm open to other ways of resolving it, but am not sure which is stylistically best (performingfrom chemprop.data import preprocess_smiles columns
within functions or doing a broadimport chemprop.args
in the utils file instead are possible alternatives).