-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing Featurizer Logging Issues #2040
Conversation
Ok, I think this PR is now ready for review. I've refactored the loader classes to use the new invokable featurizer style. This cleans up the code (letting us remove some helper functions) and fixes our logging issues @ncfrey Could you take a look? I've refactored the CC @peastman @nd-02110114 |
Forgot to mention, but I've also added some tests for |
deepchem/data/data_loader.py
Outdated
id_field=None, | ||
featurizer=None, | ||
featurizer: Optional[Featurizer] = None, | ||
log_every_n=1000): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing type annotations on id_field
and log_every_n
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix!
deepchem/data/data_loader.py
Outdated
""" | ||
|
||
def __init__(self, | ||
tasks, | ||
smiles_field=None, | ||
tasks: OneOrMany[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be List[str]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix!
deepchem/data/data_loader.py
Outdated
features = [ | ||
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid | ||
] | ||
return np.array(features), valid_inds | ||
|
||
|
||
class UserCSVLoader(CSVLoader): | ||
""" | ||
Handles loading of CSV files with user-defined featurizers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that be "user-defined features" instead of "user-defined featurizers"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good catch! Will fix
Looks good. I do notice some inconsistent use of type annotations. It sometimes has annotations from some arguments of a method, but not others. Or annotates the return type, but not the arguments. |
self.tasks = tasks | ||
self.smiles_field = smiles_field | ||
self.feature_field = feature_field | ||
self.id_field = id_field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is necessary...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we use self.id_field
later down so I think so. I might be misunderstanding your comment though!
|
||
Returns | ||
------- | ||
Iterator over shards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of Numpy docstring style, the first line is type annotation.
https://numpydoc.readthedocs.io/en/latest/format.html
Iterator[pd.DataFrame]
Iterator over shards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix!
deepchem/data/data_loader.py
Outdated
>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf")) # doctest:+ELLIPSIS | ||
Reading ... | ||
>>> len(dataset) | ||
2 | ||
""" | ||
|
||
def __init__(self, tasks, sanitize=False, featurizer=None, log_every_n=1000): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to add type annotation...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good suggestion, will add in!
@@ -674,13 +725,13 @@ def _get_shards(self, input_files, shard_size): | |||
|
|||
def _featurize_shard(self, shard): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to add type annotation...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, will do!
features = [elt for elt in self.featurizer(shard[self.feature_field])] | ||
valid_inds = np.array( | ||
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool) | ||
features = [ | ||
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the view point of the performance, is it better to reduce the number of the loop...?
features = []
valid_inds = []
for feat in self.featurizer(shard[self.feature_field])):
is_valid = True if feat.size > 0 else False
valid_inds.append(is_valid)
if is_valid:
features.append(feat)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can test it and see which is faster. Based on my recent work, I think these lines probably have a negligible impact on performance, so it won't make a difference either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interests of keeping the code succinct, I'm going to keep the current lines, but if it turns out to be a performance issue, will fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I left a few suggestions.
deepchem/data/data_loader.py
Outdated
logger.info("About to featurize shard.") | ||
features = [elt for elt in self.featurizer(shard[self.feature_field])] | ||
valid_inds = np.array( | ||
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool) | ||
features = [ | ||
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid | ||
] | ||
return np.array(features), valid_inds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I had consolidated the loops, following @nd-02110114's suggestion - but as long as it's consistent with the other loaders it seems ok.
features = [elt for elt in self.featurizer(shard[self.mol_field])] | ||
valid_inds = np.array( | ||
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool) | ||
features = [ | ||
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above for these list comprehensions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now, I'll keep the list comprehensions, but if this turns out to be a performance issue will replace!
featurizer = dc.feat.CircularFingerprint(size=1024) | ||
tasks = ["endpoint"] | ||
loader = dc.data.CSVLoader( | ||
tasks=tasks, smiles_field="smiles", featurizer=featurizer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be updated to feature_field
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix!
input_file = os.path.join(current_dir, "../../data/tests/no_labels.csv") | ||
featurizer = dc.feat.CircularFingerprint(size=1024) | ||
loader = dc.data.CSVLoader( | ||
tasks=[], smiles_field="smiles", featurizer=featurizer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also feature_field
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix!
featurizer = dc.feat.CircularFingerprint(size=1024) | ||
loader = dc.data.CSVLoader( | ||
tasks=[], smiles_field="smiles", featurizer=featurizer) | ||
loader.create_dataset(input_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add another assert len(X)
here to check the dataset creation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, will add!
for idm, mol in enumerate(dataset.X): | ||
assert dataset.X[idm].get_num_atoms() == len(dataset.X[idm].parents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be good to run a list comprehension that generates a boolean array, and then check at the end with one assert
statement that everything is True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was cruft leftover in this PR and was merged in an earlier PR, so the changes went away in the rebase. It might be good to do a later pass to simplify some of these tests though!
I've addressed all open comments so going ahead and merging this one in to get the fix in place. |
This is #2032 reopened to see if it solves mysterious travis vs. local test issues.