Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing Featurizer Logging Issues #2040

Merged
merged 8 commits into from
Jul 31, 2020
Merged

Fixing Featurizer Logging Issues #2040

merged 8 commits into from
Jul 31, 2020

Conversation

rbharath
Copy link
Member

This is #2032 reopened to see if it solves mysterious travis vs. local test issues.

@coveralls
Copy link

coveralls commented Jul 22, 2020

Coverage Status

Coverage increased (+0.4%) to 75.44% when pulling 5811e3b on log_fix2 into 5984e9e on master.

@rbharath
Copy link
Member Author

Ok, I think this PR is now ready for review. I've refactored the loader classes to use the new invokable featurizer style. This cleans up the code (letting us remove some helper functions) and fixes our logging issues

@ncfrey Could you take a look? I've refactored the JsonLoader as well.

CC @peastman @nd-02110114

@rbharath
Copy link
Member Author

Forgot to mention, but I've also added some tests for SDFLoader (we didn't have any before!), which required adding some test data files, and added examples in the docstrings.

@rbharath rbharath changed the title Fixing Logging Issues Fixing Featurizer Logging Issues Jul 23, 2020
Comment on lines 331 to 333
id_field=None,
featurizer=None,
featurizer: Optional[Featurizer] = None,
log_every_n=1000):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing type annotations on id_field and log_every_n.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix!

"""

def __init__(self,
tasks,
smiles_field=None,
tasks: OneOrMany[str],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be List[str].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix!

features = [
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
]
return np.array(features), valid_inds


class UserCSVLoader(CSVLoader):
"""
Handles loading of CSV files with user-defined featurizers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be "user-defined features" instead of "user-defined featurizers"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch! Will fix

@peastman
Copy link
Contributor

Looks good. I do notice some inconsistent use of type annotations. It sometimes has annotations from some arguments of a method, but not others. Or annotates the return type, but not the arguments.

self.tasks = tasks
self.smiles_field = smiles_field
self.feature_field = feature_field
self.id_field = id_field
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is necessary...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we use self.id_field later down so I think so. I might be misunderstanding your comment though!


Returns
-------
Iterator over shards
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of Numpy docstring style, the first line is type annotation.
https://numpydoc.readthedocs.io/en/latest/format.html

Iterator[pd.DataFrame]
  Iterator over shards

Copy link
Member Author

@rbharath rbharath Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix!

>>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf")) # doctest:+ELLIPSIS
Reading ...
>>> len(dataset)
2
"""

def __init__(self, tasks, sanitize=False, featurizer=None, log_every_n=1000):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to add type annotation...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good suggestion, will add in!

@@ -674,13 +725,13 @@ def _get_shards(self, input_files, shard_size):

def _featurize_shard(self, shard):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to add type annotation...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, will do!

Comment on lines +411 to +430
features = [elt for elt in self.featurizer(shard[self.feature_field])]
valid_inds = np.array(
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
features = [
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
]
Copy link
Member

@nissy-dev nissy-dev Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the view point of the performance, is it better to reduce the number of the loop...?

features = []
valid_inds = []
for feat in self.featurizer(shard[self.feature_field])):
    is_valid = True if feat.size > 0 else False
    valid_inds.append(is_valid)
    if is_valid:
        features.append(feat)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can test it and see which is faster. Based on my recent work, I think these lines probably have a negligible impact on performance, so it won't make a difference either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interests of keeping the code succinct, I'm going to keep the current lines, but if it turns out to be a performance issue, will fix!

Copy link
Contributor

@ncfrey ncfrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I left a few suggestions.

Comment on lines 669 to 694
logger.info("About to featurize shard.")
features = [elt for elt in self.featurizer(shard[self.feature_field])]
valid_inds = np.array(
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
features = [
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
]
return np.array(features), valid_inds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I had consolidated the loops, following @nd-02110114's suggestion - but as long as it's consistent with the other loaders it seems ok.

Comment on lines +728 to +754
features = [elt for elt in self.featurizer(shard[self.mol_field])]
valid_inds = np.array(
[1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
features = [
elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above for these list comprehensions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now, I'll keep the list comprehensions, but if this turns out to be a performance issue will replace!

featurizer = dc.feat.CircularFingerprint(size=1024)
tasks = ["endpoint"]
loader = dc.data.CSVLoader(
tasks=tasks, smiles_field="smiles", featurizer=featurizer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be updated to feature_field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix!

input_file = os.path.join(current_dir, "../../data/tests/no_labels.csv")
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(
tasks=[], smiles_field="smiles", featurizer=featurizer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also feature_field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, will fix!

featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(
tasks=[], smiles_field="smiles", featurizer=featurizer)
loader.create_dataset(input_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add another assert len(X) here to check the dataset creation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, will add!

Comment on lines 105 to 106
for idm, mol in enumerate(dataset.X):
assert dataset.X[idm].get_num_atoms() == len(dataset.X[idm].parents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be good to run a list comprehension that generates a boolean array, and then check at the end with one assert statement that everything is True.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was cruft leftover in this PR and was merged in an earlier PR, so the changes went away in the rebase. It might be good to do a later pass to simplify some of these tests though!

@rbharath
Copy link
Member Author

I've addressed all open comments so going ahead and merging this one in to get the fix in place.

@rbharath rbharath merged commit b13f82d into master Jul 31, 2020
@rbharath rbharath deleted the log_fix2 branch July 31, 2020 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants