-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data loader to GraphData class #528
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes requested from previous closed PR #527 . Will unblock once updated.
Head branch was pushed to by a user without write access
:type options: dict | ||
:return: None | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are missing:
options = self._check_and_return_options(options)
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JGSweets need a _check_and_return_options
specific to Graph Data? _check_and_return_options
in CSVData would get run when instantiating the class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taylorfturner It would, but we aren't inheriting CSVData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ... so we'll just need to add a _check_and_return_options
for GraphData
options since _check_and_return_options
would be specific to CSVData
. Low priority for now as long as we check CSVData
options as part of this PR
self._quotechar = options.get("quotechar", None) | ||
self._header = options.get("header", 'auto') | ||
|
||
self._data = self._format_data_networkx() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove from here as this should occur if someone references the data auto. We only want to load the data if we need to load it :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
return target_index | ||
|
||
|
||
@classmethod | ||
def csv_column_names(cls, file_path, options): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now this presumes the first row is the header. Some CSVs have comments at the top. We cause use the info from CSVData.is_match
and options to determine this. (will also need a test, e.g. single graph data file which has comments before the column names start).
to make this more generic we could do:
@classmethod
def csv_column_names(cls, file_path, delimiter, quotechar, header):
column_names = []
if header is None:
return column_names
for _ in range(header + 1):
row = next(csv_reader)
.
.
.
csv.reader can be updated to handle quotechar too:
csv_reader = csv.reader(csv_file, delimiter=delimiter, quotechar=quotechar)
@@ -90,7 +143,37 @@ def is_match(cls, file_path, options=None): | |||
options.update(destination_node = destination_index) | |||
options.update(destination_list = target_keywords) | |||
options.update(source_list = source_keywords) | |||
options.update(column_name = column_names) | |||
options.update(column_names = column_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
csv_as_list = [] | ||
data_as_pd = data_utils.read_csv_df(self.input_file_path,self._delimiter,self._header,[],read_in_string=True,encoding=self.file_encoding) | ||
data_as_pd = data_as_pd.apply(lambda x: x.str.strip()) | ||
csv_as_list = data_as_pd.values.tolist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to convert to a list? this will increase the memory as we now need to store two objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we loop through the df?
or can we use:
https://networkx.org/documentation/stable/reference/generated/networkx.convert_matrix.from_pandas_edgelist.html
self.assertFalse(GraphData.is_match(input_file["path"])) | ||
|
||
# test loading data | ||
def test_data_loader_nodes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we could do a single test that loops through all our self. file_or_buf_list
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add the expected result of nodes, edges in the dict within the setUpClass
?
…_csv_df from data_utils, optimize load, add class doc
Head branch was pushed to by a user without write access
self._quotechar = options.get("quotechar", None) | ||
self._header = options.get("header", 'auto') | ||
|
||
self._data = self._format_data_networkx() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
||
if __name__ == '__main__': | ||
unittest.main() | ||
unittest.main() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line EOF
:type options: dict | ||
:return: None | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ... so we'll just need to add a _check_and_return_options
for GraphData
options since _check_and_return_options
would be specific to CSVData
. Low priority for now as long as we check CSVData
options as part of this PR
The previous GraphData only contained is_match, a function allowing to differentiate between tabular and graph CSV datasets.
This PR adds the functionality of reading/loading the graph data from the input CSV into a NetworkX graph. This graph contains nodes, edges, and possible attributes for both. The use of NetworkX is very useful for downstream profiling of the data due to NetworkX's integrated functions to analyze and pull information from graphs.