Use tensorflow-macos and clean up some test running warning noise #741

Merged: 5 commits into capitalone:main, Dec 19, 2022

Conversation

@leos (Contributor) commented Dec 18, 2022:

DataProfiler doesn't install on macOS on an M1 right now because the tensorflow package can't be installed there. The package for darwin should be tensorflow-macos (https://developer.apple.com/metal/tensorflow-plugin/). We could also install tensorflow-metal, but I'm not sure whether we want that to be an optional install.
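
A minimal sketch of how a platform-conditional TensorFlow requirement could be expressed with PEP 508 environment markers (the setup() call and version pins here are illustrative only, not the project's actual packaging configuration):

    # Hypothetical setup.py fragment; version pins are illustrative.
    from setuptools import setup

    setup(
        name="example-package",
        install_requires=[
            # Plain tensorflow everywhere except Apple Silicon macOS
            "tensorflow>=2.0; sys_platform != 'darwin' or platform_machine != 'arm64'",
            # tensorflow-macos on darwin/arm64 (M1 and later)
            "tensorflow-macos>=2.0; sys_platform == 'darwin' and platform_machine == 'arm64'",
        ],
    )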

When running the tests there's also a lot of noise in the test output. This PR fixes some of the lowest-hanging fruit (a short sketch of a few of these fixes follows the list):

  • Fixing a warning when running tests; for reference:
/repos/DataProfiler/venv/lib/python3.10/site-packages/charset_normalizer/legacy.py:64: DeprecationWarning: staticmethod from_fp, from_bytes, from_path and normalize are deprecated and scheduled to be removed in 3.0
  warnings.warn(  # pragma: nocover
  • Fixing warning when running tests by updating requests:
/repos/DataProfiler/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (5.1.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
  • Fixing warning:
/repos/DataProfiler/venv/lib/python3.10/site-packages/keras/engine/training.py:2416: UserWarning: Metric F1Score implements a `reset_states()` method; rename it to `reset_state()` (without the final "s"). The name `reset_states()` has been deprecated to improve API consistency.
  m.reset_state()
  • Fixing warning and type ambiguity:
/repos/DataProfiler/dataprofiler/profilers/profile_builder.py:610: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  df_series = df_series.loc[true_sample_list]
  • Fixing NumPy deprecation warning:
/repos/DataProfiler/dataprofiler/tests/profilers/test_profile_builder.py:1792: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.float,
/repos/DataProfiler/dataprofiler/tests/profilers/test_profile_builder.py:1795: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
/repos/DataProfiler/dataprofiler/reports/graphs.py:354: PendingDeprecationWarning: The set_tight_layout function will be deprecated in a future version. Use set_layout_engine instead.
  fig.set_tight_layout(True)
  • Fixing missing spaces at the ends of joined strings in warning messages.
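
For illustration, a minimal standalone sketch of a few of the renames above (assuming recent TF 2.x and matplotlib >= 3.6; simplified calls, not the actual DataProfiler call sites):

    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf

    # np.float is a deprecated alias for the builtin float; use float
    # (or np.float64 if the NumPy scalar type is specifically wanted).
    values = np.array([1.0, 2.0], dtype=float)

    # set_tight_layout(True) is pending deprecation; set_layout_engine
    # is the replacement.
    fig, ax = plt.subplots()
    fig.set_layout_engine("tight")

    # Keras metrics: reset_states() is deprecated in favor of reset_state().
    metric = tf.keras.metrics.Mean()
    metric.update_state([1.0, 2.0])
    metric.reset_state()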

@leos (Contributor, Author) commented Dec 18, 2022:

I'm seeing this test failure in CI:

_____________________ TestRegexPostProcessor.test_process ______________________
self = <dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor testMethod=test_process>

    def test_process(self):
    
        label_mapping = label_mapping = {"PAD": 0, "UNKNOWN": 1, "TEST1": 2}
        data = None
        results = dict(
            pred=[
                np.array([[1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]),
                np.array([[0, 1, 0], [1, 1, 1], [1, 0, 1]]),
            ]
        )
    
        # aggregation_func = 'split'
        expected_output = dict(
            pred=[
                np.array([[0.5, 0.5, 0], [0, 0, 1], [0, 0, 1], [1 / 3, 1 / 3, 1 / 3]]),
                np.array([[0, 1, 0], [1 / 3, 1 / 3, 1 / 3], [0.5, 0, 0.5]]),
            ]
        )
        processor = RegexPostProcessor(aggregation_func="split")
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # aggregation_func = 'priority'
        priority_order = [1, 0, 2]
        expected_output = dict(pred=[np.array([1, 2, 2, 1]), np.array([1, 1, 0])])
        processor = RegexPostProcessor(
            aggregation_func="priority", priority_order=priority_order
        )
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # aggregation_func = 'random'
    
        # first random
        random_state = random.Random(0)
        expected_output = dict(pred=[np.array([0, 2, 2, 0]), np.array([1, 0, 0])])
        processor = RegexPostProcessor(
            aggregation_func="random", random_state=random_state
        )
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # second random
        random_state = random.Random(1)
        expected_output = dict(
>           pred=np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])
        )
E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

dataprofiler/tests/labelers/test_data_processing.py:2811: ValueError

This appears to fail on main as well; I'm not sure how it could have worked. The nested arrays being passed to np.array don't actually form a rectangular "box"?

🐍 DataProfiler ~/repos/DataProfiler ❯❯❯ DATAPROFILER_SEED=0 python3 -m unittest dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor.test_process

/Users/leo/repos/DataProfiler/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (5.1.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
E
======================================================================
ERROR: test_process (dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/leo/repos/DataProfiler/dataprofiler/tests/labelers/test_data_processing.py", line 2811, in test_process
    pred=np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Also concerning: running make test locally did not trigger this failure.

Review thread on dataprofiler/labelers/labeler_utils.py (outdated, resolved).
Review comment on lines +606 to +610 of dataprofiler/profilers/profile_builder.py:
true_sample_list = (
sorted(true_sample_set)
if min_true_samples > 0 or sample_ids is None
else list(true_sample_set)
)
Contributor:

Concern about increased memory usage as well.

@leos (Contributor, Author):

@JGSweets @taylorfturner I didn't do this just to fix the typing issue; I did it to fix the warning below, which says that sets can't be used as indexers anymore (there's more discussion on why this was done in pandas-dev/pandas#42825):

/repos/DataProfiler/dataprofiler/profilers/profile_builder.py:610: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  df_series = df_series.loc[true_sample_list]

I'm open to other suggestions on how to address this; I don't really know the numpy/pandas best practices here, but I'd guess there's an idiomatic way to do it.
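
For illustration, a standalone sketch of the indexer change (hypothetical data, not the actual profile_builder code):

    import pandas as pd

    df_series = pd.Series(["a", "b", "c", "d", "e"])
    true_sample_set = {0, 2, 4}

    # df_series.loc[true_sample_set] triggers the FutureWarning and will
    # raise in future pandas versions; a list is the supported indexer.
    # sorted() also makes the resulting order deterministic.
    true_sample_list = sorted(true_sample_set)
    df_series = df_series.loc[true_sample_list]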

Collaborator:

I guess the counterargument is that sorted() was already going to convert it to a list, so the memory was already being used whether or not it was "in place".

Hence, this should be fine.

@JGSweets (Collaborator) commented:

np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])

I believe this is a deprecated usage, and we are now required to specify 'dtype=object' for ragged tensors. See below:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

The reason for the varying failures could be platform-based / NumPy-version differences.

When I updated from 1.22.4 -> 1.24.0 it errors, as opposed to showing a deprecation warning.
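
A small standalone sketch of the behavior change (assuming NumPy >= 1.24, where the old DeprecationWarning became an error):

    import numpy as np

    ragged = [np.array([1, 2, 2, 1]), np.array([1, 1, 2])]

    # np.array(ragged) warned on older NumPy and raises ValueError on
    # 1.24+, because the sub-arrays have different lengths.
    # Explicitly requesting an object array keeps the old behavior:
    boxed = np.array(ragged, dtype=object)

    # Alternatively, keep the outer container a plain Python list.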

@leos (Contributor, Author) commented Dec 19, 2022:

I believe this is a deprecated usage, and we are now required to specify 'dtype=object' for ragged tensors. See below:

@JGSweets Ah, got it. That makes sense. Regardless, the outer array doesn't need to be an np.array for this test, so the fix I did should work.

When I updated from 1.22.4 -> 1.24.0 it errors, as opposed to showing a deprecation warning.

Have you guys considered using something like Poetry for dependency management? It would simplify a bunch of things and add the ability to have a fairly loose requirements.in alongside a requirements.lock that pins specific versions, guaranteeing that folks running or testing dataprofiler are on the exact same set of transitive libraries.

@taylorfturner enabled auto-merge (squash) on December 19, 2022 at 18:11.
@JGSweets (Collaborator) commented:

Have you guys considered using something like Poetry for dependency management? It would simplify a bunch of things and add the ability to have a fairly loose requirements.in alongside a requirements.lock that pins specific versions, guaranteeing that folks running or testing dataprofiler are on the exact same set of transitive libraries.

I am interested, but haven't had time to consider it yet.

@taylorfturner merged commit 9600701 into capitalone:main on Dec 19, 2022.