Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data loader to GraphData class #528

Merged
merged 8 commits into from
Jul 13, 2022

Conversation

MisterPNP
Copy link
Contributor

The previous GraphData only contained is_match, a function allowing to differentiate between tabular and graph CSV datasets.

This PR adds the functionality of reading/loading the graph data from the input CSV into a NetworkX graph. This graph contains nodes, edges, and possible attributes for both. The use of NetworkX is very useful for downstream profiling of the data due to NetworkX's integrated functions to analyze and pull information from graphs.

Copy link
Collaborator

@JGSweets JGSweets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes requested from previous closed PR #527 . Will unblock once updated.

@JGSweets JGSweets enabled auto-merge (squash) July 12, 2022 15:53
@JGSweets JGSweets added Work In Progress Solution is being developed Medium Priority Significant improvement or bug / feature reducing overall performance New Feature A feature addition not currently in the library labels Jul 12, 2022
auto-merge was automatically disabled July 12, 2022 19:02

Head branch was pushed to by a user without write access

:type options: dict
:return: None
"""

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing:
options = self._check_and_return_options(options)
here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JGSweets need a _check_and_return_options specific to Graph Data? _check_and_return_options in CSVData would get run when instantiating the class

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taylorfturner It would, but we aren't inheriting CSVData

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ... so we'll just need to add a _check_and_return_options for GraphData options since _check_and_return_options would be specific to CSVData. Low priority for now as long as we check CSVData options as part of this PR

self._quotechar = options.get("quotechar", None)
self._header = options.get("header", 'auto')

self._data = self._format_data_networkx()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove from here as this should occur if someone references the data auto. We only want to load the data if we need to load it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

return target_index


@classmethod
def csv_column_names(cls, file_path, options):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now this presumes the first row is the header. Some CSVs have comments at the top. We cause use the info from CSVData.is_match and options to determine this. (will also need a test, e.g. single graph data file which has comments before the column names start).
to make this more generic we could do:

@classmethod
def csv_column_names(cls, file_path, delimiter, quotechar, header): 
    column_names = []
    if header is None:
        return column_names

    for _ in range(header + 1):
        row = next(csv_reader)
    .
    .
    .

csv.reader can be updated to handle quotechar too:

csv_reader = csv.reader(csv_file, delimiter=delimiter, quotechar=quotechar)

@@ -90,7 +143,37 @@ def is_match(cls, file_path, options=None):
options.update(destination_node = destination_index)
options.update(destination_list = target_keywords)
options.update(source_list = source_keywords)
options.update(column_name = column_names)
options.update(column_names = column_names)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

csv_as_list = []
data_as_pd = data_utils.read_csv_df(self.input_file_path,self._delimiter,self._header,[],read_in_string=True,encoding=self.file_encoding)
data_as_pd = data_as_pd.apply(lambda x: x.str.strip())
csv_as_list = data_as_pd.values.tolist()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to convert to a list? this will increase the memory as we now need to store two objects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.assertFalse(GraphData.is_match(input_file["path"]))

# test loading data
def test_data_loader_nodes(self):
Copy link
Collaborator

@JGSweets JGSweets Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we could do a single test that loops through all our self. file_or_buf_list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add the expected result of nodes, edges in the dict within the setUpClass?

@JGSweets JGSweets enabled auto-merge (squash) July 12, 2022 21:20
@JGSweets JGSweets removed the Work In Progress Solution is being developed label Jul 13, 2022
auto-merge was automatically disabled July 13, 2022 15:45

Head branch was pushed to by a user without write access

self._quotechar = options.get("quotechar", None)
self._header = options.get("header", 'auto')

self._data = self._format_data_networkx()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


if __name__ == '__main__':
unittest.main()
unittest.main()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line EOF

:type options: dict
:return: None
"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ... so we'll just need to add a _check_and_return_options for GraphData options since _check_and_return_options would be specific to CSVData. Low priority for now as long as we check CSVData options as part of this PR

@taylorfturner taylorfturner enabled auto-merge (squash) July 13, 2022 17:38
@taylorfturner taylorfturner merged commit 4db38b5 into capitalone:main Jul 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Medium Priority Significant improvement or bug / feature reducing overall performance New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants