Use tensorflow-macos and clean up some test running warning noise #741

Merged: 5 commits into capitalone:main, Dec 19, 2022

Conversation

@leos (Contributor) commented Dec 18, 2022:

DataProfiler doesn't install on macOS on an M1 right now because the tensorflow package can't be installed there. The package for darwin should be tensorflow-macos (https://developer.apple.com/metal/tensorflow-plugin/). We could also install tensorflow-metal, but I'm not sure whether we want that to be an optional install.
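
A minimal sketch of how a platform-conditional TensorFlow requirement could be expressed with PEP 508 environment markers (the setup() call and version pins here are illustrative only, not the project's actual packaging configuration):

    # Hypothetical setup.py fragment; version pins are illustrative.
    from setuptools import setup

    setup(
        name="example-package",
        install_requires=[
            # Plain tensorflow everywhere except Apple Silicon macOS
            "tensorflow>=2.0; sys_platform != 'darwin' or platform_machine != 'arm64'",
            # tensorflow-macos on darwin/arm64 (M1 and later)
            "tensorflow-macos>=2.0; sys_platform == 'darwin' and platform_machine == 'arm64'",
        ],
    )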

When running the tests there's also a lot of noise in the test output. This PR fixes some of the lowest-hanging fruit (a short sketch of a few of these fixes follows the list):

  • Fixing a warning when running tests; for reference:
/repos/DataProfiler/venv/lib/python3.10/site-packages/charset_normalizer/legacy.py:64: DeprecationWarning: staticmethod from_fp, from_bytes, from_path and normalize are deprecated and scheduled to be removed in 3.0
  warnings.warn(  # pragma: nocover
  • Fixing warning when running tests by updating requests:
/repos/DataProfiler/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (5.1.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
  • Fixing warning:
/repos/DataProfiler/venv/lib/python3.10/site-packages/keras/engine/training.py:2416: UserWarning: Metric F1Score implements a `reset_states()` method; rename it to `reset_state()` (without the final "s"). The name `reset_states()` has been deprecated to improve API consistency.
  m.reset_state()
  • Fixing warning and type ambiguity:
/repos/DataProfiler/dataprofiler/profilers/profile_builder.py:610: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  df_series = df_series.loc[true_sample_list]
  • Fixing NumPy deprecation warning:
/repos/DataProfiler/dataprofiler/tests/profilers/test_profile_builder.py:1792: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.float,
/repos/DataProfiler/dataprofiler/tests/profilers/test_profile_builder.py:1795: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
/repos/DataProfiler/dataprofiler/reports/graphs.py:354: PendingDeprecationWarning: The set_tight_layout function will be deprecated in a future version. Use set_layout_engine instead.
  fig.set_tight_layout(True)
  • Fixing missing spaces at the ends of joined strings in warning messages.
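
For illustration, a minimal standalone sketch of a few of the renames above (assuming recent TF 2.x and matplotlib >= 3.6; simplified calls, not the actual DataProfiler call sites):

    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf

    # np.float is a deprecated alias for the builtin float; use float
    # (or np.float64 if the NumPy scalar type is specifically wanted).
    values = np.array([1.0, 2.0], dtype=float)

    # set_tight_layout(True) is pending deprecation; set_layout_engine
    # is the replacement.
    fig, ax = plt.subplots()
    fig.set_layout_engine("tight")

    # Keras metrics: reset_states() is deprecated in favor of reset_state().
    metric = tf.keras.metrics.Mean()
    metric.update_state([1.0, 2.0])
    metric.reset_state()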

@leos (Contributor, Author) commented Dec 18, 2022:

I'm seeing this test failure in CI:

_____________________ TestRegexPostProcessor.test_process ______________________
self = <dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor testMethod=test_process>

    def test_process(self):
    
        label_mapping = label_mapping = {"PAD": 0, "UNKNOWN": 1, "TEST1": 2}
        data = None
        results = dict(
            pred=[
                np.array([[1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 1]]),
                np.array([[0, 1, 0], [1, 1, 1], [1, 0, 1]]),
            ]
        )
    
        # aggregation_func = 'split'
        expected_output = dict(
            pred=[
                np.array([[0.5, 0.5, 0], [0, 0, 1], [0, 0, 1], [1 / 3, 1 / 3, 1 / 3]]),
                np.array([[0, 1, 0], [1 / 3, 1 / 3, 1 / 3], [0.5, 0, 0.5]]),
            ]
        )
        processor = RegexPostProcessor(aggregation_func="split")
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # aggregation_func = 'priority'
        priority_order = [1, 0, 2]
        expected_output = dict(pred=[np.array([1, 2, 2, 1]), np.array([1, 1, 0])])
        processor = RegexPostProcessor(
            aggregation_func="priority", priority_order=priority_order
        )
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # aggregation_func = 'random'
    
        # first random
        random_state = random.Random(0)
        expected_output = dict(pred=[np.array([0, 2, 2, 0]), np.array([1, 0, 0])])
        processor = RegexPostProcessor(
            aggregation_func="random", random_state=random_state
        )
        process_output = processor.process(data, results, label_mapping)
    
        self.assertIn("pred", process_output)
        for expected, output in zip(expected_output["pred"], process_output["pred"]):
            self.assertTrue(np.array_equal(expected, output))
    
        # second random
        random_state = random.Random(1)
        expected_output = dict(
>           pred=np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])
        )
E       ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

dataprofiler/tests/labelers/test_data_processing.py:2811: ValueError

This appears to fail on main as well; I'm not sure how it could have worked. The nested arrays being passed to np.array don't actually form a rectangular "box"?

🐍 DataProfiler ~/repos/DataProfiler ❯❯❯ DATAPROFILER_SEED=0 python3 -m unittest dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor.test_process

/Users/leo/repos/DataProfiler/venv/lib/python3.10/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.13) or chardet (5.1.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
E
======================================================================
ERROR: test_process (dataprofiler.tests.labelers.test_data_processing.TestRegexPostProcessor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/leo/repos/DataProfiler/dataprofiler/tests/labelers/test_data_processing.py", line 2811, in test_process
    pred=np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)

Also concerning: running make test locally did not trigger this failure.

Review thread on dataprofiler/labelers/labeler_utils.py (outdated, resolved).
Review comment on lines +606 to +610 of dataprofiler/profilers/profile_builder.py:
true_sample_list = (
sorted(true_sample_set)
if min_true_samples > 0 or sample_ids is None
else list(true_sample_set)
)
Contributor:

Concern about increased memory usage as well.

@leos (Contributor, Author):

@JGSweets @taylorfturner I didn't do this just to fix the typing issue; I did it to fix the warning below, which says that sets can't be used as indexers anymore (there's more discussion on why this was done in pandas-dev/pandas#42825):

/repos/DataProfiler/dataprofiler/profilers/profile_builder.py:610: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
  df_series = df_series.loc[true_sample_list]

I'm open to other suggestions on how to address this; I don't really know the numpy/pandas best practices here, but I'd guess there's an idiomatic way to do it.
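
For illustration, a standalone sketch of the indexer change (hypothetical data, not the actual profile_builder code):

    import pandas as pd

    df_series = pd.Series(["a", "b", "c", "d", "e"])
    true_sample_set = {0, 2, 4}

    # df_series.loc[true_sample_set] triggers the FutureWarning and will
    # raise in future pandas versions; a list is the supported indexer.
    # sorted() also makes the resulting order deterministic.
    true_sample_list = sorted(true_sample_set)
    df_series = df_series.loc[true_sample_list]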

Collaborator:

I guess the counterargument is that sorted() was already going to convert it to a list, so the memory was already being used whether or not it was "in place".

Hence, this should be fine.

@JGSweets (Collaborator) commented:

np.array([np.array([1, 2, 2, 1]), np.array([1, 1, 2])])

I believe this is a deprecated usage, and we are now required to specify 'dtype=object' for ragged tensors. See below:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

The reason for the varying failures could be platform-based / NumPy-version differences.

When I updated from 1.22.4 -> 1.24.0 it errors, as opposed to showing a deprecation warning.
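
A small standalone sketch of the behavior change (assuming NumPy >= 1.24, where the old DeprecationWarning became an error):

    import numpy as np

    ragged = [np.array([1, 2, 2, 1]), np.array([1, 1, 2])]

    # np.array(ragged) warned on older NumPy and raises ValueError on
    # 1.24+, because the sub-arrays have different lengths.
    # Explicitly requesting an object array keeps the old behavior:
    boxed = np.array(ragged, dtype=object)

    # Alternatively, keep the outer container a plain Python list.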

@leos (Contributor, Author) commented Dec 19, 2022:

I believe this is a deprecated usage, and we are now required to specify 'dtype=object' for ragged tensors. See below:

@JGSweets Ah, got it. That makes sense. Regardless, the outer array doesn't need to be an np.array for this test, so the fix I did should work.

When I updated from 1.22.4 -> 1.24.0 it errors, as opposed to showing a deprecation warning.

Have you guys considered using something like Poetry for dependency management? It would simplify a bunch of things and add the ability to have a fairly loose requirements.in alongside a requirements.lock that pins specific versions, guaranteeing that folks running or testing dataprofiler are on the exact same set of transitive libraries.

@taylorfturner enabled auto-merge (squash) on December 19, 2022 at 18:11.
@JGSweets (Collaborator) commented:

Have you guys considered using something like Poetry for dependency management? It would simplify a bunch of things and add the ability to have a fairly loose requirements.in alongside a requirements.lock that pins specific versions, guaranteeing that folks running or testing dataprofiler are on the exact same set of transitive libraries.

I am interested, but haven't had time to consider it yet.

@taylorfturner merged commit 9600701 into capitalone:main on Dec 19, 2022.