Fixing Featurizer Logging Issues #2040

rbharath · 2020-07-22T19:36:05Z

This is #2032 reopened to see if it resolves the mysterious discrepancies between Travis and local test runs.

coveralls · 2020-07-22T23:24:38Z

Coverage increased (+0.4%) to 75.44% when pulling 5811e3b on log_fix2 into 5984e9e on master.

rbharath · 2020-07-23T00:11:47Z

Ok, I think this PR is now ready for review. I've refactored the loader classes to use the new invokable featurizer style. This cleans up the code (letting us remove some helper functions) and fixes our logging issues.

@ncfrey Could you take a look? I've refactored the JsonLoader as well.

CC @peastman @nd-02110114
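For readers new to the "invokable featurizer style" mentioned above, here is a minimal sketch. It assumes the term means a featurizer object can be called directly on a batch of inputs, with `__call__` delegating to a `featurize` method that owns the progress logging. `SketchFeaturizer` and `LengthFeaturizer` are hypothetical names for illustration, not DeepChem's actual classes.

```python
import logging
from typing import Any, Iterable, List

logger = logging.getLogger(__name__)


class SketchFeaturizer:
  """Hypothetical base class; not DeepChem's real Featurizer."""

  def featurize(self, datapoints: Iterable[Any],
                log_every_n: int = 1000) -> List[Any]:
    features = []
    for i, point in enumerate(datapoints):
      # Logging lives in one place, so loaders need no logging helpers.
      if i % log_every_n == 0:
        logger.info("Featurizing datapoint %d", i)
      try:
        features.append(self._featurize(point))
      except Exception:
        # Assumption: a failed datapoint yields an empty feature.
        logger.warning("Failed to featurize datapoint %d", i)
        features.append([])
    return features

  def __call__(self, datapoints: Iterable[Any]) -> List[Any]:
    # Calling the object is shorthand for calling featurize().
    return self.featurize(datapoints)

  def _featurize(self, datapoint: Any) -> Any:
    raise NotImplementedError


class LengthFeaturizer(SketchFeaturizer):
  """Toy featurizer: maps a string to a one-element list with its length."""

  def _featurize(self, datapoint: str) -> List[int]:
    if not datapoint:
      raise ValueError("empty input")
    return [len(datapoint)]
```

Under these assumptions a loader only needs `self.featurizer(shard[self.feature_field])`, and empty features mark the datapoints that failed.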

rbharath · 2020-07-23T02:34:16Z

Forgot to mention, but I've also added some tests for SDFLoader (we didn't have any before!), which required adding some test data files, and added examples in the docstrings.

peastman · 2020-07-23T03:58:30Z

deepchem/data/data_loader.py

               id_field=None,
-               featurizer=None,
+               featurizer: Optional[Featurizer] = None,
               log_every_n=1000):


Missing type annotations on id_field and log_every_n.

Good catch, will fix!

peastman · 2020-07-23T04:01:26Z

deepchem/data/data_loader.py

  """

  def __init__(self,
-               tasks,
-               smiles_field=None,
+               tasks: OneOrMany[str],


This should be List[str].

Good catch, will fix!

peastman · 2020-07-23T04:03:31Z

deepchem/data/data_loader.py

+    features = [
+        elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
+    ]
+    return np.array(features), valid_inds


 class UserCSVLoader(CSVLoader):
  """
  Handles loading of CSV files with user-defined featurizers.


Should that be "user-defined features" instead of "user-defined featurizers"?

Ah good catch! Will fix

peastman · 2020-07-23T04:07:41Z

Looks good. I do notice some inconsistent use of type annotations. Sometimes a method has annotations for some arguments but not others, or annotates the return type but not the arguments.
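As an illustration of consistent annotation, here is a sketch of a fully annotated constructor plus a small helper that lists any parameters still missing annotations. The parameter names follow the diff hunks above; the types for `id_field`, `feature_field`, and `log_every_n` are assumptions, and `SketchLoader`, the `Featurizer` alias, and `unannotated_parameters` are hypothetical.

```python
import inspect
from typing import Any, Callable, List, Optional, Sequence

# Hypothetical stand-in for the real Featurizer type.
Featurizer = Callable[[Sequence[Any]], Sequence[Any]]


class SketchLoader:
  """Hypothetical loader; argument types are assumed, not confirmed."""

  def __init__(self,
               tasks: List[str],
               feature_field: Optional[str] = None,
               id_field: Optional[str] = None,
               featurizer: Optional[Featurizer] = None,
               log_every_n: int = 1000) -> None:
    self.tasks = tasks
    self.feature_field = feature_field
    self.id_field = id_field
    self.featurizer = featurizer
    self.log_every_n = log_every_n


def unannotated_parameters(func: Callable[..., Any]) -> List[str]:
  """Return the names of parameters (other than self) lacking annotations."""
  signature = inspect.signature(func)
  return [
      name for name, param in signature.parameters.items()
      if name != "self" and param.annotation is inspect.Parameter.empty
  ]
```

A helper like this could be run over each loader method to spot the partially annotated signatures described above.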

nissy-dev · 2020-07-23T06:34:37Z

deepchem/data/data_loader.py

    self.tasks = tasks
-    self.smiles_field = smiles_field
+    self.feature_field = feature_field
+    self.id_field = id_field


Is this line necessary?

I think we use self.id_field later down so I think so. I might be misunderstanding your comment though!

nissy-dev · 2020-07-23T06:36:45Z

deepchem/data/data_loader.py

+
+    Returns
+    -------
+    Iterator over shards


In NumPy docstring style, the first line of the Returns section gives the type:
https://numpydoc.readthedocs.io/en/latest/format.html

    Iterator[pd.DataFrame]
        Iterator over shards
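A minimal sketch of that convention in a complete docstring. `iter_shards` is a hypothetical helper that shards a plain sequence rather than a DataFrame, so that the example runs without pandas.

```python
from typing import Iterator, List, Sequence


def iter_shards(rows: Sequence[int], shard_size: int) -> Iterator[List[int]]:
  """Split rows into consecutive shards.

  Parameters
  ----------
  rows : Sequence[int]
    The rows to split.
  shard_size : int
    Maximum number of rows per shard.

  Returns
  -------
  Iterator[List[int]]
    Iterator over shards, each holding at most `shard_size` rows.
  """
  for start in range(0, len(rows), shard_size):
    yield list(rows[start:start + shard_size])
```

The type goes on its own line under the `Returns` header, and the description is indented beneath it.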

nissy-dev · 2020-07-23T06:44:57Z

deepchem/data/data_loader.py

+  >>> dataset = loader.create_dataset(os.path.join(current_dir, "tests", "membrane_permeability.sdf")) # doctest:+ELLIPSIS
+  Reading ...
+  >>> len(dataset)
+  2
  """

  def __init__(self, tasks, sanitize=False, featurizer=None, log_every_n=1000):


Would it be better to add type annotations here?

Ah good suggestion, will add in!

nissy-dev · 2020-07-23T06:45:10Z

deepchem/data/data_loader.py

@@ -674,13 +725,13 @@ def _get_shards(self, input_files, shard_size):

  def _featurize_shard(self, shard):


Would it be better to add type annotations here?

Good suggestion, will do!

nissy-dev · 2020-07-23T06:49:19Z

deepchem/data/data_loader.py

+    features = [elt for elt in self.featurizer(shard[self.feature_field])]
+    valid_inds = np.array(
+        [1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
+    features = [
+        elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
+    ]


From a performance standpoint, would it be better to reduce the number of loops?

    features = []
    valid_inds = []
    for feat in self.featurizer(shard[self.feature_field]):
      is_valid = feat.size > 0
      valid_inds.append(is_valid)
      if is_valid:
        features.append(feat)

You can test it and see which is faster. Based on my recent work, I think these lines probably have a negligible impact on performance, so it won't make a difference either way.

In the interests of keeping the code succinct, I'm going to keep the current lines, but if it turns out to be a performance issue, will fix!
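To make the two options in this thread concrete, here is a sketch that puts both variants side by side as standalone functions. The function names are hypothetical, and `raw` stands in for the output of `self.featurizer(shard[self.feature_field])`. The single-loop version wraps each element in `np.array` before checking `.size`, which is an assumption made so that both variants accept the same inputs.

```python
import numpy as np


def filter_with_comprehensions(raw):
  """Three passes over the data, as in the PR."""
  features = [elt for elt in raw]
  valid_inds = np.array(
      [1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
  features = [
      elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
  ]
  return np.array(features), valid_inds


def filter_with_single_loop(raw):
  """One pass over the data, as suggested in review."""
  features = []
  valid_inds = []
  for feat in raw:
    is_valid = np.array(feat).size > 0
    valid_inds.append(is_valid)
    if is_valid:
      features.append(feat)
  return np.array(features), np.array(valid_inds, dtype=bool)
```

Both functions should return the same features and validity mask for the same input, so the choice comes down to readability and measured speed. The standard library's `timeit` module is one way to compare them on a representative shard.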

ncfrey

Looks good! I left a few suggestions.

ncfrey · 2020-07-23T12:32:42Z

deepchem/data/data_loader.py

+    logger.info("About to featurize shard.")
+    features = [elt for elt in self.featurizer(shard[self.feature_field])]
+    valid_inds = np.array(
+        [1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
+    features = [
+        elt for (is_valid, elt) in zip(valid_inds, features) if is_valid
+    ]
+    return np.array(features), valid_inds


Here I had consolidated the loops, following @nd-02110114's suggestion - but as long as it's consistent with the other loaders it seems ok.

ncfrey · 2020-07-23T12:33:27Z

deepchem/data/data_loader.py

+    features = [elt for elt in self.featurizer(shard[self.mol_field])]
+    valid_inds = np.array(
+        [1 if np.array(elt).size > 0 else 0 for elt in features], dtype=bool)
+    features = [
+        elt for (is_valid, elt) in zip(valid_inds, features) if is_valid


Same comment as above for these list comprehensions.

I think for now, I'll keep the list comprehensions, but if this turns out to be a performance issue will replace!

ncfrey · 2020-07-23T12:35:15Z

deepchem/data/tests/test_csv_loader.py

+  featurizer = dc.feat.CircularFingerprint(size=1024)
+  tasks = ["endpoint"]
+  loader = dc.data.CSVLoader(
+      tasks=tasks, smiles_field="smiles", featurizer=featurizer)


Should this be updated to feature_field?

Good catch, will fix!

ncfrey · 2020-07-23T12:36:20Z

deepchem/data/tests/test_data_loader.py

+  input_file = os.path.join(current_dir, "../../data/tests/no_labels.csv")
+  featurizer = dc.feat.CircularFingerprint(size=1024)
+  loader = dc.data.CSVLoader(
+      tasks=[], smiles_field="smiles", featurizer=featurizer)


Also feature_field?

Good catch, will fix!

ncfrey · 2020-07-23T12:37:23Z

deepchem/data/tests/test_data_loader.py

+  featurizer = dc.feat.CircularFingerprint(size=1024)
+  loader = dc.data.CSVLoader(
+      tasks=[], smiles_field="smiles", featurizer=featurizer)
+  loader.create_dataset(input_file)


Maybe add another assert len(X) here to check the dataset creation?

Good suggestion, will add!

ncfrey · 2020-07-23T12:55:09Z

deepchem/trans/tests/test_transformers.py

+  for idm, mol in enumerate(dataset.X):
+    assert dataset.X[idm].get_num_atoms() == len(dataset.X[idm].parents)


It could be good to run a list comprehension that generates a boolean array, and then check at the end with one assert statement that everything is True.

This code was leftover cruft in this PR. It had already been merged in an earlier PR, so these changes went away in the rebase. It might be good to do a later pass to simplify some of these tests though!
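For reference, the suggested pattern (build a list of booleans, then assert once) might look like the sketch below. `FakeMol` is a hypothetical stand-in for the objects in `dataset.X`; only `get_num_atoms()` and `parents` are taken from the test snippet above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FakeMol:
  """Hypothetical stand-in for an element of dataset.X."""
  num_atoms: int
  parents: List[int]

  def get_num_atoms(self) -> int:
    return self.num_atoms


def atoms_match_parents(mols: Sequence[FakeMol]) -> List[bool]:
  """One boolean per molecule: does the atom count equal len(parents)?"""
  return [mol.get_num_atoms() == len(mol.parents) for mol in mols]


mols = [FakeMol(2, [0, 1]), FakeMol(3, [0, 1, 2])]
checks = atoms_match_parents(mols)
# A single assert at the end; the message lists every failing index.
failing = [i for i, ok in enumerate(checks) if not ok]
assert all(checks), f"atom/parent mismatch at indices {failing}"
```

Compared with asserting inside the loop, this reports all mismatching indices at once instead of stopping at the first one.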

rbharath · 2020-07-31T02:58:19Z

I've addressed all open comments so going ahead and merging this one in to get the fix in place.

rbharath changed the title ~~Fixing Logging Issues~~ Fixing Featurizer Logging Issues Jul 23, 2020

peastman reviewed Jul 23, 2020

nissy-dev reviewed Jul 23, 2020

ncfrey reviewed Jul 23, 2020

Bharath Ramsundar added 8 commits July 30, 2020 17:28

615cff2 Changes
0dd1de3 changes
d62f215 Changes
90c131e changes
0c0a6a6 Changes
89701cd Adding in example SDFs
b0c3877 Fixes
5811e3b Addressing open review comments

rbharath force-pushed the log_fix2 branch from 58f600e to 5811e3b Compare July 31, 2020 01:00

rbharath merged commit b13f82d into master Jul 31, 2020

rbharath deleted the log_fix2 branch July 31, 2020 02:58

rbharath mentioned this pull request Jul 31, 2020

Logging improvements #2023

Open
