fix: FsNeo4jCSVLoader fails if nodes have disjoint keys #408

Spacerat · 2020-11-10T03:49:03Z

Summary of Changes

I discovered a bug where FsNeo4jCSVLoader fails if two nodes of the same type have the same number of attributes, but with different names. This happened to us for DashboardChart, due to this code:

amundsendatabuilder/databuilder/models/dashboard/dashboard_chart.py

Lines 58 to 77 in d91c0c5

    
    def _create_node_iterator(self) -> Iterator[GraphNode]:
        node_attributes = {
            'id': self._chart_id
        }

        if self._chart_name:
            node_attributes['name'] = self._chart_name

        if self._chart_type:
            node_attributes['type'] = self._chart_type

        if self._chart_url:
            node_attributes['url'] = self._chart_url

        node = GraphNode(
            key=self._get_chart_node_key(),
            label=DashboardChart.DASHBOARD_CHART_LABEL,
            attributes=node_attributes
        )
        yield node

The root cause is that FsNeo4jCSVLoader names CSV files by the number of keys the node has, so two nodes with the same number of keys but different key names end up in the same file. My new test demonstrates the problem. When the first node is loaded, a CSV is created with a column for job. On attempting to load the second node, the loader fails because it cannot find a column named pet in that CSV.
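The failure mode can be reproduced outside the loader. This is a minimal sketch, not the loader's actual code: it assumes rows are written with the standard library's csv.DictWriter, and key_by_field_count and write_records are hypothetical helpers that only mimic naming files by key count.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Tuple

Record = Dict[str, str]


def key_by_field_count(label: str, record: Record) -> str:
    # Hypothetical stand-in for the old scheme: the file key depends only
    # on the label and on how many fields the record has.
    return f"{label}_{len(record)}"


def write_records(
    records: Iterable[Tuple[str, Record]],
    key_fn: Callable[[str, Record], str],
) -> Dict[str, str]:
    buffers: Dict[str, io.StringIO] = {}
    writers: Dict[str, csv.DictWriter] = {}
    for label, record in records:
        key = key_fn(label, record)
        if key not in writers:
            # The header is fixed by the first record that maps to this key.
            buffers[key] = io.StringIO()
            writers[key] = csv.DictWriter(buffers[key], fieldnames=list(record))
            writers[key].writeheader()
        # DictWriter raises ValueError if the record has a field that is
        # not in the header.
        writers[key].writerow(record)
    return {key: buf.getvalue() for key, buf in buffers.items()}


records = [
    ("Person", {"KEY": "1", "job": "engineer"}),
    ("Person", {"KEY": "2", "pet": "cat"}),  # same field count, different name
]

try:
    write_records(records, key_by_field_count)
    failed = False
except ValueError:
    failed = True
```

Both records have two fields, so they share one writer whose header contains job but not pet, and the second write is rejected.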

This fixes the problem by making the file key dependent on the actual set of record keys. I actually tried two implementations:

The first just concatenates the sorted keys.
The second builds a dictionary of fieldset -> ID and assigns increasing IDs.

I went with the second to avoid excessively long filenames.
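The two approaches can be sketched roughly as follows. The names key_from_sorted_fields and FieldsetIds are hypothetical and the key format is an assumption for illustration; the merged code may differ in detail.

```python
from typing import Dict, FrozenSet, Iterable


def key_from_sorted_fields(label: str, fields: Iterable[str]) -> str:
    # Approach 1: concatenate the sorted field names. Unique per fieldset,
    # but the key (and therefore the filename) grows with the field count.
    return f"{label}_" + "_".join(sorted(fields))


class FieldsetIds:
    """Approach 2: map each distinct fieldset to a small increasing ID."""

    def __init__(self) -> None:
        self._ids: Dict[FrozenSet[str], int] = {}

    def key_for(self, label: str, fields: Iterable[str]) -> str:
        fieldset = frozenset(fields)  # order-insensitive and hashable
        if fieldset not in self._ids:
            # Assign the next unused ID the first time a fieldset is seen.
            self._ids[fieldset] = len(self._ids)
        return f"{label}_{self._ids[fieldset]}"


ids = FieldsetIds()
first = ids.key_for("Person", ["KEY", "job"])
second = ids.key_for("Person", ["KEY", "pet"])
repeat = ids.key_for("Person", ["job", "KEY"])
```

With the ID mapping, key length stays bounded no matter how many attributes a node has, at the cost of keeping the dictionary in memory for the lifetime of the loader.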

Tests

I added a unit test which catches the bug. You can check out the "Add failing test" commit and run make test to observe it.

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
PR includes a summary of changes.
PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
PR passes make test

Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>

feng-tao · 2020-11-10T05:56:57Z

Will take a look at this PR as well.

feng-tao

LGTM, thanks @Spacerat for fixing it!

Spacerat · 2020-11-19T20:38:31Z

Thanks @feng-tao !

Spacerat marked this pull request as ready for review November 10, 2020 04:04

Spacerat requested review from allisonsuarez, dikshathakur3119, feng-tao, jinhyukchang and a team as code owners November 10, 2020 04:04

Joseph Atkins-Turkish added 4 commits November 9, 2020 20:07

Prepare for test_fs_neo4j_csv_loader to run multiple tests

57ea2b8

Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>

Add failing test

11c4d35

Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>

Fix failing test

5d489c0

Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>

Implement using numeric keys

78e19c6

Signed-off-by: Joseph Atkins-Turkish <jatkins-turkish@brex.com>

Spacerat force-pushed the joe/fix-csvloader-bug branch from 1399f4a to 78e19c6 Compare November 10, 2020 04:07

feng-tao approved these changes Nov 17, 2020


feng-tao merged commit c07cec9 into amundsen-io:master Nov 17, 2020

Spacerat deleted the joe/fix-csvloader-bug branch November 19, 2020 20:38
