Added static typing to *_data classes in data_readers #677

Sanketh7 · 2022-10-05T14:34:29Z

''

taylorfturner · 2022-10-05T15:02:16Z

@Sanketh7 make sure you create your new branches off of capitalone/main for DP so your branch isn't out of date

taylorfturner · 2022-10-07T16:30:52Z

looks like tests are failing because options is None

>       self.SAMPLES_PER_LINE_DEFAULT: int = options.get("record_samples_per_line", 1)
E       AttributeError: 'NoneType' object has no attribute 'get'
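
For reference, a minimal sketch of this failure mode and one common guard. `normalize_options` and `Reader` are hypothetical names used only for illustration, not DataProfiler's actual API. Calling `.get` on `None` raises the `AttributeError` above, so the sketch normalizes the options to a dict first.

```python
from typing import Any, Dict, Optional


def normalize_options(options: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return an empty dict when no options were passed (hypothetical helper)."""
    return {} if options is None else options


class Reader:
    """Hypothetical stand-in for a data reader; not DataProfiler's class."""

    def __init__(self, options: Optional[Dict[str, Any]] = None) -> None:
        # Normalizing first means `.get` is never called on None.
        opts = normalize_options(options)
        self.samples_per_line: int = opts.get("record_samples_per_line", 1)
```

With this guard, `Reader()` falls back to the default of 1 instead of failing.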

dataprofiler/data_readers/csv_data.py

dataprofiler/data_readers/graph_data.py

taylorfturner · 2022-10-11T14:47:30Z

@Sanketh7 tests failing 🤔

taylorfturner

A couple of material comments and some notes around code additions. I can't approve since I merged main.

dataprofiler/data_readers/graph_data.py

taylorfturner · 2022-10-12T18:32:14Z

dataprofiler/data_readers/graph_data.py

                    continue
                if (
                    column is not self._source_node
                    or column is not self._destination_node
-                ):
+                ) and self._column_names is not None:


Note: code change here

taylorfturner · 2022-10-12T18:32:34Z

dataprofiler/data_readers/graph_data.py

            self._column_names = self.csv_column_names(
                self.input_file_path, self._header, self._delimiter, self.file_encoding
            )
-        if self._source_node is None:
+        if self._source_node is None and self._column_names is not None:


Note: code change here

taylorfturner · 2022-10-12T18:32:37Z

dataprofiler/data_readers/graph_data.py

            self._source_node = self._find_target_string_in_column(
                self._column_names, self._source_keywords
            )
-        if self._destination_node is None:
+        if self._destination_node is None and self._column_names is not None:


Note: code change here

dataprofiler/data_readers/text_data.py

JGSweets · 2022-10-13T14:33:05Z

dataprofiler/data_readers/csv_data.py

@@ -18,9 +20,14 @@
 class CSVData(SpreadSheetDataMixin, BaseData):
    """SpreadsheetData class to save and load spreadsheet data."""

-    data_type = "csv"
+    data_type: Optional[str] = "csv"


I think this might be an issue with the original code. The value shouldn't be optional; it is only `None` for the base class b/c it needs to be set by the derived class. Hence, typing it as Optional seems inaccurate.

Maybe in the base we could make it an abstract property?

    import abc

    class Base(abc.ABC):
        @property
        @abc.abstractmethod
        def data_type(self) -> str: ...

        @data_type.setter
        @abc.abstractmethod
        def data_type(self, value: str) -> None: ...

    class CSVData(Base):
        data_type: str = "csv"

Would something like this work?

+1, I'd also prefer that the child classes don't use Optional[str] for the data_type attribute. CC @Sanketh7

Better yet, not the above...

    class Base():
        name: str

        def print_name(self):
            # will raise an AttributeError at runtime if `name` isn't defined in subclass
            print(self.name)

    class Derived(Base):
        name = "derived one"

This last solution works well and I've committed that change.
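
A minimal sketch of the annotation-only approach, using hypothetical class names rather than the real readers. A bare annotation tells the type checker that `data_type` is a `str` without creating a class attribute, so the base class has no value to fall back on at runtime.

```python
class Base:
    # Annotation only: declares the type for the checker,
    # but creates no class attribute at runtime.
    data_type: str

    def describe(self) -> str:
        return f"reader for {self.data_type}"


class Derived(Base):
    # The subclass supplies the actual value.
    data_type = "csv"
```

Calling `describe()` on a `Derived` instance works, while calling it on a bare `Base` instance raises `AttributeError`, which is the behavior that later affects the base-class test.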

Oops. Looks like it makes a test fail.

=================================== FAILURES ===================================
_______________ TestBaseDataClass.test_can_apply_data_functions ________________

self = <dataprofiler.tests.data_readers.test_base_data.TestBaseDataClass testMethod=test_can_apply_data_functions>

    def test_can_apply_data_functions(self):
        class FakeDataClass:
            # matches the `data_type` value in BaseData for validating priority
            data_type = "FakeData"

            def func1(self):
                return "success"

        # initialize the data class
        data = BaseData(input_file_path="", data=FakeDataClass(), options={})

        # if the function exists in BaseData fail the test because the results
        # may become inaccurate.
        self.assertFalse(hasattr(BaseData, "func1"))

        with self.assertRaisesRegex(
            AttributeError,
            "Neither 'BaseData' nor 'FakeDataClass' "
            "objects have attribute 'test'",
        ):
            data.test

        # validate it will take BaseData attribute over the data attribute
>       self.assertIsNone(data.data_type)
E       AssertionError: 'FakeData' is not None

dataprofiler/tests/data_readers/test_base_data.py:96: AssertionError

I assume we could change the test to detect an attribute error instead of checking if the field is None?

Interesting, basically bc data_type doesn't exist in BaseData, this test is no longer validating priority.
An AttributeError still wouldn't be correct bc FakeData has it.
We would need to test a property they both have. I think instead we can update the test to use input_file_path

    class FakeDataClass:
        # matches the `data_type` value in BaseData for validating priority
        options = {"not_empty": "data"}

        def func1(self):
            return "success"

In the test we can assert the options is empty.

Made the change here: 2e85001

I didn't remove the data_type field from FakeDataClass because I wasn't sure if that would cause unintended side effects.
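
A minimal sketch of the priority being tested, assuming the wrapper falls back to the wrapped object through `__getattr__`. `Wrapper` is a hypothetical stand-in, not the real `BaseData`. Attributes set on the wrapper win, and anything missing is looked up on the wrapped data.

```python
from typing import Any, Dict


class Wrapper:
    """Hypothetical wrapper that falls back to the wrapped object's attributes."""

    def __init__(self, data: Any, options: Dict[str, Any]) -> None:
        self.data = data
        self.options = options

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup on the wrapper itself fails.
        try:
            return getattr(self.data, name)
        except AttributeError:
            raise AttributeError(
                f"Neither '{type(self).__name__}' nor "
                f"'{type(self.data).__name__}' objects have attribute '{name}'"
            ) from None


class FakeDataClass:
    options = {"not_empty": "data"}
    data_type = "FakeData"

    def func1(self) -> str:
        return "success"


wrapped = Wrapper(data=FakeDataClass(), options={})
```

Here `wrapped.options` is the wrapper's own empty dict, while `wrapped.data_type` falls through to `FakeDataClass` because the wrapper never defines it. That is why the assertion had to move to a field both classes have.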

JGSweets · 2022-10-13T14:39:57Z

dataprofiler/data_readers/csv_data.py

@@ -58,8 +65,9 @@ def __init__(self, input_file_path=None, data=None, options=None):
        :return: None
        """
        options = self._check_and_return_options(options)
+        options = cast(Dict, options)


shouldn't the line above already indicate that it is a dict?

You're right. I probably forgot to delete this from a previous implementation.

JGSweets · 2022-10-13T14:40:41Z

dataprofiler/data_readers/csv_data.py

        BaseData.__init__(self, input_file_path, data, options)
-        SpreadSheetDataMixin.__init__(self, input_file_path, data, options)
+        SpreadSheetDataMixin.__init__(self, input_file_path, data, cast(Dict, options))


Do we need to cast here and above, if at all?

Same as above.

JGSweets · 2022-10-13T14:42:47Z

dataprofiler/data_readers/csv_data.py

        """
        Ensure options are valid inputs to the data reader.

        :param options: dictionary of options for the csv reader to validate
        :type options: dict
        :return: None
        """
-        options = super()._check_and_return_options(options)
+        options = super(CSVData, CSVData)._check_and_return_options(options)


What's going on here? this seems abnormal.

I had to change _check_and_return_options to be a static method because BaseData implements it as a static method. However, super() with no arguments only really works with instance and class methods. I ended up following https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=When%20you%20call%20a%20class,or%20class%20it%20was%20called. to figure out how to make it work with static methods.

@JGSweets thoughts on this? Any additional thoughts around this implementation?

Found this example, which helps show how @Sanketh7 implemented this: https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=class%20Second(First)%3A%0A%20%20%40staticmethod%0A%20%20def%20getlist()%3A%0A%20%20%20%20l%20%3D%20super(Second%2C%20Second).getlist()%20%20%23%20note%20the%202nd%20argument%0A%20%20%20%20l.append(%27second%27)

I'm fine with it. We can refactor in the future if necessary.
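
For anyone reading later, a minimal sketch of the pattern with hypothetical class and method names, not the actual DataProfiler code. Zero-argument `super()` binds to the method's first argument, which in a static method is not `self` or `cls`, so the class is passed explicitly as both arguments.

```python
from typing import Any, Dict, Optional


class BaseReader:
    @staticmethod
    def check_options(options: Optional[Dict[str, Any]]) -> Dict[str, Any]:
        # Copy so callers' dicts are never mutated.
        return {} if options is None else dict(options)


class CsvReader(BaseReader):
    @staticmethod
    def check_options(options: Optional[Dict[str, Any]]) -> Dict[str, Any]:
        # Two-argument form: look up the parent implementation starting
        # after CsvReader in the MRO, bound to the class itself.
        checked = super(CsvReader, CsvReader).check_options(options)
        checked.setdefault("delimiter", ",")
        return checked
```

Calling `BaseReader.check_options(options)` directly would also work, at the cost of hard-coding the parent class name.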

JGSweets · 2022-10-13T14:49:09Z

dataprofiler/data_readers/csv_data.py

@@ -186,7 +198,7 @@ def _guess_delimiter_and_quotechar(
        vocab = Counter(data_as_str)
        if "\n" in vocab:
            vocab.pop("\n")
-        for char in omitted + [quotechar]:
+        for char in omitted + ([quotechar] if quotechar is not None else []):


I think this line gets a little confusing if it's all in one. If this if is necessary, can we do it before the for?

    omitted_list: list[str] = omitted
    if quotechar is not None:
        omitted_list = omitted + [quotechar]

then we can use that in the for
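
A minimal runnable sketch of that suggestion, wrapped in a hypothetical helper named `chars_to_omit` (not a function in csv_data.py). The variable is annotated once and only extended when `quotechar` is set.

```python
from typing import List, Optional


def chars_to_omit(omitted: List[str], quotechar: Optional[str]) -> List[str]:
    """Hypothetical helper: characters to drop before counting candidates."""
    omitted_list: List[str] = omitted
    if quotechar is not None:
        # Build a new list so the caller's `omitted` is left unchanged.
        omitted_list = omitted + [quotechar]
    return omitted_list
```

The `for` loop can then iterate over the returned list without an inline conditional.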

@Sanketh7 in case you missed

JGSweets · 2022-10-13T14:50:52Z

dataprofiler/data_readers/csv_data.py

@@ -534,13 +546,13 @@ def _load_data_from_str(self, data_as_str):
            )
        return data_utils.read_csv_df(
            data_buffered,
-            self.delimiter,
-            self.header,
+            cast(str, self.delimiter),


do these need to be casted?

Looks like with one of my recent commits to data_utils, self.delimiter doesn't need to be cast. However, self.header does, because read_csv_df only supports Optional[int] but self.header could also be a string in the "auto" case. As far as I can tell, that case can't occur at this point at runtime, but mypy doesn't seem able to detect that.
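
A minimal sketch of why a cast can be needed here, using hypothetical names (`Reader`, `read_rows`, `resolve_header`) rather than the real csv_data or data_utils code. The attribute is typed to allow `"auto"`, the value is resolved in another method, and the type checker cannot see that resolution at the call site.

```python
from typing import Optional, Union, cast


def read_rows(header: Optional[int]) -> str:
    """Hypothetical reader that only accepts a resolved header row index."""
    return "no header" if header is None else f"header at row {header}"


class Reader:
    def __init__(self, header: Union[int, str, None] = "auto") -> None:
        self.header = header

    def resolve_header(self) -> None:
        # Stand-in for header detection; replaces "auto" with a row index.
        if self.header == "auto":
            self.header = 0

    def load(self) -> str:
        self.resolve_header()
        # cast() has no runtime effect; it only tells the type checker that
        # the "auto" string can no longer occur at this point.
        return read_rows(cast(Optional[int], self.header))
```

An `assert not isinstance(self.header, str)` before the call would narrow the type as well, and would also check the assumption at runtime.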

taylorfturner · 2022-10-18T11:44:59Z

dataprofiler/data_readers/csv_data.py

        """
        Ensure options are valid inputs to the data reader.

        :param options: dictionary of options for the csv reader to validate
        :type options: dict
        :return: None
        """
-        options = super()._check_and_return_options(options)
+        options = super(CSVData, CSVData)._check_and_return_options(options)


@JGSweets thoughts on this? Any additional thoughts around this implementation?

Found this example, which helps show how @Sanketh7 implemented this: https://stackoverflow.com/questions/26788214/super-and-staticmethod-interaction#:~:text=class%20Second(First)%3A%0A%20%20%40staticmethod%0A%20%20def%20getlist()%3A%0A%20%20%20%20l%20%3D%20super(Second%2C%20Second).getlist()%20%20%23%20note%20the%202nd%20argument%0A%20%20%20%20l.append(%27second%27)

taylorfturner · 2022-10-18T11:47:46Z

dataprofiler/data_readers/csv_data.py

+    def reload(
+        self,
+        input_file_path: Optional[str] = None,
+        data: Optional[pd.DataFrame] = None,


The docstring says `:type data: multiple types` for data, but the static typing only allows None or pd.DataFrame. I think at least one of them should be updated so the code and docstring match.

taylorfturner · 2022-10-18T11:48:36Z

dataprofiler/data_readers/csv_data.py

@@ -737,4 +759,4 @@ def reload(self, input_file_path=None, data=None, options=None):
            header=self.header, delimiter=self.delimiter, quotechar=self.quotechar
        )
        super(CSVData, self).reload(input_file_path, data, options)
-        self.__init__(self.input_file_path, data, options)
+        self.__init__(self.input_file_path, data, options)  # type: ignore


Can we get rid of this #type: ignore by resolving an issue upstream?

The issue is that mypy doesn't like using self.__init__ because it doesn't know which constructor will end up being called (and therefore doesn't know what types are needed). However, the current code assumes that behavior so I just told mypy to ignore that line.
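
A minimal sketch of the pattern in question, with a hypothetical `Reader` class rather than the real readers. Because a subclass may define `__init__` with a different signature, mypy typically reports calling `__init__` through an instance as unsound, and the comment silences that one line.

```python
from typing import Any, Dict, Optional


class Reader:
    """Hypothetical reader; not DataProfiler's class."""

    def __init__(
        self,
        input_file_path: Optional[str] = None,
        options: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.input_file_path = input_file_path
        self.options = options or {}
        self.loaded = False

    def reload(
        self,
        input_file_path: Optional[str] = None,
        options: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Re-running __init__ resets all state; mypy cannot know which
        # subclass constructor this resolves to, hence the ignore.
        path = input_file_path or self.input_file_path
        self.__init__(path, options)  # type: ignore
```

Moving the shared setup into a separate, normally typed method that both `__init__` and `reload` call would be one way to drop the ignore, though that is a larger refactor than this PR.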

taylorfturner · 2022-10-18T11:50:07Z

dataprofiler/data_readers/graph_data.py

+    def __init__(
+        self,
+        input_file_path: Optional[str] = None,
+        data: Optional[nx.Graph] = None,


Same comment as above in csv_data: the docstring says `:type data: multiple types` for data, but we only allow nx.Graph or None in this __init__. These should be updated so they match.

taylorfturner · 2022-10-18T11:54:08Z

dataprofiler/data_readers/parquet_data.py

+            file_path = cast(
+                Union[StringIO, BytesIO], file_path
+            )  # guaranteed by is_stream_buffer


let's see if we can get rid of these cast statements
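
One possible way to avoid the cast, sketched with a hypothetical function rather than the parquet_data code: an `isinstance` check narrows the union for the type checker directly, whereas a boolean helper generally does not.

```python
from io import BytesIO, StringIO
from typing import Union


def describe_source(file_path: Union[str, StringIO, BytesIO]) -> str:
    """Hypothetical function showing narrowing without cast()."""
    if isinstance(file_path, (StringIO, BytesIO)):
        # Inside this branch the type checker treats file_path as a buffer.
        file_path.seek(0)
        return "buffer"
    # Here file_path has been narrowed to str.
    return f"path: {file_path}"
```

Whether this fits depends on how the existing buffer check is used elsewhere, so treat it as an option rather than a drop-in change.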

taylorfturner · 2022-10-18T11:54:13Z

dataprofiler/data_readers/parquet_data.py

+            file_path = cast(
+                Union[StringIO, BytesIO], file_path
+            )  # guaranteed by is_stream_buffer


let's see if we can get rid of these cast statements

taylorfturner · 2022-10-18T11:54:27Z

dataprofiler/data_readers/parquet_data.py

@@ -148,4 +178,4 @@ def reload(self, input_file_path=None, data=None, options=None):
        :return: None
        """
        super(ParquetData, self).reload(input_file_path, data, options)
-        self.__init__(self.input_file_path, data, options)
+        self.__init__(self.input_file_path, data, options)  # type: ignore


Any chance we can get rid of this `# type: ignore`?

dataprofiler/data_readers/text_data.py

taylorfturner · 2022-10-18T11:55:36Z

dataprofiler/tests/data_readers/test_base_data.py

@@ -93,7 +94,7 @@ def func1(self):
            data.test

        # validate it will take BaseData attribute over the data attribute
-        self.assertIsNone(data.data_type)


why is this getting removed?

In order to make data_type have type str instead of Optional[str], I had to make it so it doesn't get set to anything in BaseData (and just has a type definition line for mypy). This makes it so you get an exception if you try getting the data_type in BaseData so @JGSweets recommended looking at another field to test the same behavior. In this case, I looked to make sure options is taken from BaseData instead of FakeData which would mean that options is an empty dict.

it's replaced below with a different assert

Sanketh7 added 4 commits October 5, 2022 10:22

added static typing for csv_data

bb57a02

added static typing for graph_data

c79b025

added static typing to parquet_data

3ae6f9e

added static typing for text_data

2fc5c9d

Sanketh7 requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners October 5, 2022 14:34

taylorfturner added the static_typing label (mypy static typing issues) Oct 6, 2022

taylorfturner assigned Sanketh7 Oct 6, 2022

Merge branch 'main' into data_readers_static_typing3

390a884

JGSweets reviewed Oct 7, 2022

dataprofiler/data_readers/csv_data.py (outdated)

JGSweets reviewed Oct 7, 2022

dataprofiler/data_readers/csv_data.py (outdated)

JGSweets reviewed Oct 7, 2022

dataprofiler/data_readers/graph_data.py

JGSweets reviewed Oct 7, 2022

dataprofiler/data_readers/graph_data.py (outdated)

JGSweets reviewed Oct 7, 2022

dataprofiler/data_readers/graph_data.py (outdated)

Sanketh7 added 4 commits October 7, 2022 15:38

removed if statement

0fe32bf

changed repeated conditionals to single assert

c3383a4

fixed formatting

fb7807c

Merge branch 'main' into data_readers_static_typing3

8e53e5b

Merge branch 'main' into data_readers_static_typing3

d242b67

taylorfturner reviewed Oct 12, 2022

ksneab7 previously approved these changes Oct 12, 2022

JGSweets reviewed Oct 13, 2022

taylorfturner added the High Priority label (Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable) Oct 13, 2022

changed data_type from Optional[str] to str

e2f18c0

Sanketh7 dismissed ksneab7’s stale review via e2f18c0 October 13, 2022 21:28

Sanketh7 and others added 6 commits October 13, 2022 17:37

removed extra casts

5817232

removed cast to self.delimiter

8e815e8

cleaned up omitted list

48269df

changed base_data test to work with new static typing

2e85001

Merge branch 'main' into data_readers_static_typing3

744a137

Merge branch 'main' into data_readers_static_typing3

ef67679

taylorfturner reviewed Oct 18, 2022

Sanketh7 added 3 commits October 18, 2022 09:28

removed IO casts in parquet_data

6b67f3e

removed options cast in text_data

aef07a9

fixed pre-commit failure

ddc391e

taylorfturner enabled auto-merge (squash) October 18, 2022 13:52

JGSweets approved these changes Oct 18, 2022

micdavis approved these changes Oct 18, 2022

taylorfturner merged commit 44a3256 into capitalone:main Oct 18, 2022
