Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: FsNeo4jCSVLoader fails if nodes have disjoint keys #408

Merged
merged 4 commits into from Nov 17, 2020

Conversation

Spacerat
Copy link
Contributor

@Spacerat Spacerat commented Nov 10, 2020

Summary of Changes

I discovered a bug where FsNeo4jCSVLoader fails if two nodes of the same type have the same number of attributes, but with different names. This happened to us for DashboardChart, due to this code:

def _create_node_iterator(self) -> Iterator[GraphNode]:
node_attributes = {
'id': self._chart_id
}
if self._chart_name:
node_attributes['name'] = self._chart_name
if self._chart_type:
node_attributes['type'] = self._chart_type
if self._chart_url:
node_attributes['url'] = self._chart_url
node = GraphNode(
key=self._get_chart_node_key(),
label=DashboardChart.DASHBOARD_CHART_LABEL,
attributes=node_attributes
)
yield node

The root cause of this is that FsNeo4jCSVLoader names CSV files by the number of keys the node has. My new test demonstrates the problem. When the first node is loaded, a CSV is created with a column for job. On attempting to load the second node, the loader fails because it cannot find a column in the CSV named pet.

This fixes the problem by making the file key dependent on the actual set of record keys. I actually tried two implementations:

  1. The first just concatenates the keys, sorted
  2. The second builds a dictionary of fieldset -> ID, and assigns increasing IDs.

I went with the second to avoid excessively long filenames.

Tests

I added a unit test which catches the bug. You can check out the "Add failing test" commit and run make test to observe it.

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • PR includes a summary of changes.
  • PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
  • PR passes make test

@Spacerat Spacerat marked this pull request as ready for review November 10, 2020 04:04
Joseph Atkins-Turkish added 4 commits November 9, 2020 20:07
Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>
Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>
Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>
Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>
@feng-tao
Copy link
Member

will take a look this pr as well.

Copy link
Member

@feng-tao feng-tao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks @Spacerat fixing it!

@feng-tao feng-tao merged commit c07cec9 into amundsen-io:master Nov 17, 2020
@Spacerat
Copy link
Contributor Author

Thanks @feng-tao !

@Spacerat Spacerat deleted the joe/fix-csvloader-bug branch November 19, 2020 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants