-
Notifications
You must be signed in to change notification settings - Fork 572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Output options for make_predictions #235
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, I added some minor comments
full_preds = [] | ||
for full_index in range(len(full_data)): | ||
valid_index = full_to_valid_indices.get(full_index, None) | ||
preds = avg_preds[valid_index] if valid_index is not None else ['Invalid SMILES'] * num_tasks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to also return the invalid SMILES themselves, or is it not really beneficial?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that it's beneficial. But if people do want it, the right place is in the load_data function where these are getting split up in the first place. I've added it there as one of the outputs (unused in the full workflow). Is that a reasonable return structure do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alongd I've actually rethought this and I don't want to add another output to load_data where it would break existing code for people. The right place would be in a separate data/utils function. I've added new functions for that from both files and lists.
chemprop/train/make_predictions.py
Outdated
@@ -259,6 +272,8 @@ def make_predictions(args: PredictArgs, smiles: List[List[str]] = None, | |||
loading data and a model and making predictions. | |||
:param smiles: List of list of SMILES to make predictions on. | |||
:param model_objects: Tuple of output of load_model function which can be called separately. | |||
:param return_invalid_smiles: Whether to return None values for invalid SMILES, otherwise will skip them in returned predictions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to make the docstring description of return_invalid_smiles
identical in both predict_and_save()
and make_predictions()
, or is it different on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was unintentional, I will make them the same.
chemprop/train/make_predictions.py
Outdated
@@ -122,6 +123,7 @@ def predict_and_save(args: PredictArgs, train_args: TrainArgs, test_data: Molecu | |||
:param full_to_valid_indices: A dictionary dictionary mapping full to valid indices. | |||
:param models: A list or generator object of :class:`~chemprop.models.MoleculeModel`\ s. | |||
:param scalers: A list or generator object of :class:`~chemprop.features.scaler.StandardScaler` objects. | |||
:param return_invalid_smiles: Whether to include invalid SMILES with a value None in the returned predictions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably misunderstood, but it looks like the description says that a None
value is returned, when in practice an 'Invalid SMILES'
string is returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are completely right here. This is a mistake in the docstring and I will fix.
03aa8c5
to
8b96051
Compare
Thanks for the modifications, @cjmcgill! |
8b96051
to
2dc8594
Compare
@alongd I refactored the commits so the spacing changes were grouped in a separate commit. Also you are right on about testing. Thus far, we have not had unit tests in this repo, only relying on only integration tests. I am adding unit tests in the in-progress #232, and tests on these functions will be included in it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
User noted that for the make_predictions function, invalid SMILES can be included in the arguments but the returned values will exclude the invalid entries without any notation of which ones are invalid. The default behavior before for the values saved to file has always been to include the entries with invalid SMILES with prediction values of "Invalid SMILES". Following the refactoring of make_predictions in #200, it's more accessible than before to use this function directly and so it's now more obvious that the saved behavior and the function return behavior do not match.
I've added two new options to the make_predictions function.
return_invalid_smiles
. This option indicates whether to include predictions for invalid datapoints, with the value "Invalid SMILES" in the function return. This makes it consistent with the saved value behavior. This option has been set default to True. The previous behavior before the option would have been False.return_index_dict
. This option indicates whether to return a dictionary keyed with the index of the initial dataset or smiles list for the predictions rather than a list of list of values. This option defaults to False. Ifreturn_invalid_smiles
is set to False, the invalid index values will be missing from the returned dictionary.