Add data loader to GraphData class #528

MisterPNP · 2022-07-12T14:59:23Z

The previous GraphData only contained is_match, a function that differentiates between tabular and graph CSV datasets.

This PR adds the ability to load graph data from the input CSV into a NetworkX graph. The graph contains nodes, edges, and any attributes for both. NetworkX is useful for downstream profiling of the data because of its built-in functions for analyzing and extracting information from graphs.

JGSweets

Changes requested from the previously closed PR #527. Will unblock once updated.

JGSweets · 2022-07-12T20:08:22Z

dataprofiler/data_readers/graph_data.py

+        :type options: dict
+        :return: None
+        """
+


I think we are missing:
options = self._check_and_return_options(options)
here

@JGSweets do we need a _check_and_return_options specific to GraphData? The _check_and_return_options in CSVData would get run when instantiating the class.

@taylorfturner It would, but we aren't inheriting CSVData

Right ... so we'll just need to add a _check_and_return_options for GraphData options since _check_and_return_options would be specific to CSVData. Low priority for now as long as we check CSVData options as part of this PR
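A minimal sketch of what a GraphData-specific options check could look like. The function name and the validation rules are assumptions for illustration; only the option keys (`delimiter`, `quotechar`, `header`) and the `'auto'` header default come from the diff in this PR.

```python
def check_and_return_graph_options(options=None):
    """Validate reader options and return them as a dict (hypothetical helper)."""
    if options is None:
        options = {}
    if not isinstance(options, dict):
        raise ValueError("Options must be a dictionary.")

    delimiter = options.get("delimiter")
    if delimiter is not None and not isinstance(delimiter, str):
        raise ValueError("'delimiter' must be a string or None.")

    quotechar = options.get("quotechar")
    if quotechar is not None and not isinstance(quotechar, str):
        raise ValueError("'quotechar' must be a string or None.")

    header = options.get("header", "auto")
    # bool is a subclass of int, so it is rejected explicitly.
    valid_header = header is None or header == "auto" or (
        isinstance(header, int) and not isinstance(header, bool) and header >= 0
    )
    if not valid_header:
        raise ValueError("'header' must be 'auto', None, or a non-negative integer.")
    return options
```

Calling this at the top of `__init__` would surface bad options before any file is read.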

JGSweets · 2022-07-12T20:09:46Z

dataprofiler/data_readers/graph_data.py

+        self._quotechar = options.get("quotechar", None)
+        self._header = options.get("header", 'auto')
+
+        self._data = self._format_data_networkx()


Remove this from here; loading should happen automatically when someone references the data. We only want to load the data if we need to load it :)
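The lazy-loading pattern being asked for can be sketched as follows. The class name, `_load_data`, and `load_count` are hypothetical and exist only to make the behavior visible; this is not the base class used in the library.

```python
class LazyGraphReader:
    """Hypothetical reader that defers loading until `data` is first accessed."""

    def __init__(self, input_file_path, options=None):
        options = options or {}
        self.input_file_path = input_file_path
        self._quotechar = options.get("quotechar", None)
        self._header = options.get("header", "auto")
        self._data = None  # nothing is read at construction time
        self.load_count = 0  # only here to make the behavior observable

    def _load_data(self):
        # Stand-in for the real parsing step (e.g. building the graph).
        self.load_count += 1
        return {"source": self.input_file_path}

    @property
    def data(self):
        # Load on first access, then reuse the cached object.
        if self._data is None:
            self._data = self._load_data()
        return self._data
```

Constructing the reader does no I/O; the first read of `reader.data` triggers the load and later reads return the cached object.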

JGSweets · 2022-07-12T20:17:16Z

dataprofiler/data_readers/graph_data.py

        return target_index

-
    @classmethod
    def csv_column_names(cls, file_path, options):


Right now this presumes the first row is the header. Some CSVs have comments at the top. We can use the info from CSVData.is_match and options to determine this. (This will also need a test, e.g. a single graph data file that has comments before the column names start.)
To make this more generic we could do:

    @classmethod
    def csv_column_names(cls, file_path, delimiter, quotechar, header):
        column_names = []
        if header is None:
            return column_names
        for _ in range(header + 1):
            row = next(csv_reader)
        . . .

csv.reader can be updated to handle quotechar too:

csv_reader = csv.reader(csv_file, delimiter=delimiter, quotechar=quotechar)
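A runnable sketch of the suggestion above, written as a plain function that takes an already open file object instead of `file_path` (an assumption made so the example stays self-contained). Note that `csv.reader` rejects `quotechar=None` under its default quoting mode, so the sketch defaults to a double quote.

```python
import csv
import io


def csv_column_names(csv_file, delimiter=",", quotechar='"', header=0):
    """Return the column names found on row index `header` of an open CSV file."""
    if header is None:
        return []
    csv_reader = csv.reader(csv_file, delimiter=delimiter, quotechar=quotechar)
    row = []
    # Skip any rows above the header (e.g. comments), stopping on the header row.
    for _ in range(header + 1):
        row = next(csv_reader, [])
    return [name.strip() for name in row]


sample = io.StringIO(
    "# exported graph\n"
    "# two comment lines precede the header\n"
    'node_src,node_dst,"weight, kg"\n'
    "1,2,0.5\n"
)
print(csv_column_names(sample, header=2))
```

With `header=2` the two comment rows are skipped, and the quoted column name keeps its embedded comma.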

JGSweets · 2022-07-12T20:18:11Z

dataprofiler/data_readers/graph_data.py

@@ -90,7 +143,37 @@ def is_match(cls, file_path, options=None):
            options.update(destination_node = destination_index)
            options.update(destination_list = target_keywords)
            options.update(source_list = source_keywords)
-            options.update(column_name = column_names)
+            options.update(column_names = column_names)


JGSweets · 2022-07-12T20:19:00Z

dataprofiler/data_readers/graph_data.py

+        csv_as_list = []
+        data_as_pd = data_utils.read_csv_df(self.input_file_path,self._delimiter,self._header,[],read_in_string=True,encoding=self.file_encoding)
+        data_as_pd = data_as_pd.apply(lambda x: x.str.strip())
+        csv_as_list = data_as_pd.values.tolist()


Do we need to convert to a list? This will increase memory use, as we now need to store two objects.

Can we loop through the df?
Or can we use:
https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html
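For reference, a small sketch of the linked `from_pandas_edgelist` call. The DataFrame contents and column names are made up for illustration and are not taken from the PR; values are kept as strings to mirror the `read_in_string=True` load in the diff.

```python
import networkx as nx
import pandas as pd

# Toy edge list; the column names are illustrative, not taken from the PR.
edges = pd.DataFrame(
    {
        "node_src": ["1", "1", "2"],
        "node_dst": ["2", "3", "3"],
        "weight": ["0.5", "1.0", "2.5"],
    }
)

# Build the graph directly from the DataFrame; no intermediate list is created.
graph = nx.from_pandas_edgelist(
    edges, source="node_src", target="node_dst", edge_attr=["weight"]
)
```

Each remaining column named in `edge_attr` becomes an attribute on the corresponding edge, so the extra copy produced by `values.tolist()` is not needed.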

JGSweets · 2022-07-12T20:21:26Z

dataprofiler/tests/data_readers/test_csv_graph_data.py

+            self.assertFalse(GraphData.is_match(input_file["path"]))  
+
+    # test loading data
+    def test_data_loader_nodes(self):


Do you think we could do a single test that loops through all of our self.file_or_buf_list?

Maybe we can add the expected nodes and edges to the dict within setUpClass?
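One way the suggested test layout could look, as a sketch. `count_nodes_and_edges` is a hypothetical stand-in for the real loader, and the inputs are in-memory strings instead of the test files used in the repository.

```python
import csv
import io
import unittest


def count_nodes_and_edges(csv_text):
    """Hypothetical stand-in for the loader: count distinct nodes and edge rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header row
    nodes = {value for row in rows for value in row[:2]}
    return len(nodes), len(rows)


class TestGraphLoader(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Expected results live next to each input, as suggested in the review.
        cls.file_or_buf_list = [
            dict(data="src,dst\n1,2\n2,3\n", nodes=3, edges=2),
            dict(data="src,dst\na,b\n", nodes=2, edges=1),
        ]

    def test_data_loader(self):
        for input_file in self.file_or_buf_list:
            # subTest reports each failing input separately instead of
            # stopping at the first one.
            with self.subTest(data=input_file["data"]):
                nodes, edges = count_nodes_and_edges(input_file["data"])
                self.assertEqual(input_file["nodes"], nodes)
                self.assertEqual(input_file["edges"], edges)
```

Adding a new input then only requires a new dict entry in `setUpClass`.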

taylorfturner · 2022-07-12T20:16:51Z

dataprofiler/data_readers/graph_data.py

+        self._quotechar = options.get("quotechar", None)
+        self._header = options.get("header", 'auto')
+
+        self._data = self._format_data_networkx()


taylorfturner · 2022-07-12T20:17:17Z

dataprofiler/tests/data_readers/test_csv_graph_data.py


 if __name__ == '__main__':
-    unittest.main()
+    unittest.main()


new line EOF

taylorfturner · 2022-07-12T20:34:38Z

dataprofiler/data_readers/graph_data.py

+        :type options: dict
+        :return: None
+        """
+


Right ... so we'll just need to add a _check_and_return_options for GraphData options since _check_and_return_options would be specific to CSVData. Low priority for now as long as we check CSVData options as part of this PR

MisterPNP requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners July 12, 2022 14:59

JGSweets suggested changes Jul 12, 2022

JGSweets enabled auto-merge (squash) July 12, 2022 15:53

JGSweets added the Work In Progress, Medium Priority, and New Feature labels Jul 12, 2022

auto-merge was automatically disabled July 12, 2022 19:02
Head branch was pushed to by a user without write access

JGSweets reviewed Jul 12, 2022

MisterPNP added 5 commits July 12, 2022 17:08

create up to date branch w/ graph data loader

71a959b

local props updated in _init_, add error handling in _init_, use read_csv_df from data_utils, optimize load, add class doc

0672be9

cleanup

2fe7b0b

add check options to GraphData

0f4eaed

rebase with Jake's formatting

d513da4

MisterPNP force-pushed the graph_loader branch from 6c9b90f to d513da4 July 12, 2022 21:20

JGSweets enabled auto-merge (squash) July 12, 2022 21:20

JGSweets removed the Work In Progress label Jul 13, 2022

add comment handling in header for csv_column_name, reformat tests to test all files

bb4b8c7

auto-merge was automatically disabled July 13, 2022 15:45
Head branch was pushed to by a user without write access

change csv_column_names to handle no header

1cb21bf

taylorfturner approved these changes Jul 13, 2022

Merge branch 'main' into graph_loader

69e143f

taylorfturner enabled auto-merge (squash) July 13, 2022 17:38

taylorfturner approved these changes Jul 13, 2022

JGSweets approved these changes Jul 13, 2022

taylorfturner merged commit 4db38b5 into capitalone:main Jul 13, 2022
