fix: use parameters to allow special characters in neo4j cypher statement #382

rogertangcn · 2020-10-14T18:45:22Z

Summary of Changes

This change fixes the issue "Use query parameters rather than string building in Neo4jCsvPublisher" (amundsen-io/amundsen#513).

- Replace the csv package with pandas so that data types can be inferred from the CSV files.
- Replace string.Template with jinja2.Template so that the templates can support conditional tokens.
- Use parameters instead of string concatenation to construct Cypher statements.
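To illustrate why parameters help with special characters, here is a minimal sketch that compares the two approaches for a key containing an apostrophe. The function names, the `Table` label and the key value are hypothetical and are not taken from the publisher code.

```python
from typing import Any, Dict, Tuple


def build_by_concatenation(key: str) -> str:
    """String building: the value becomes part of the Cypher text itself."""
    return "MERGE (node:Table {key: '" + key + "'}) RETURN node.key"


def build_with_parameters(key: str) -> Tuple[str, Dict[str, Any]]:
    """Parameterized: the text only carries a $KEY token; the value travels separately."""
    statement = "MERGE (node:Table {key: $KEY}) RETURN node.key"
    return statement, {"KEY": key}


# Hypothetical key containing an apostrophe.
key = "hive://gold.test_schema/test's_table4"

concatenated = build_by_concatenation(key)
statement, params = build_with_parameters(key)

# The concatenated text now holds an odd number of quote characters,
# because the apostrophe in the value closes the string literal early.
print(concatenated.count("'"))

# The parameterized text is the same for every key; only params changes.
print(statement)
print(params)
```

With the Neo4j Python driver, a pair like this would typically be handed over as `session.run(statement, params)`, so the value never has to be quoted or escaped inside the statement text.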

Tests

N/A

Documentation

N/A

CheckList

Make sure you have checked all steps below to ensure a timely review.

- PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  - In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
- PR includes a summary of changes.
- PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
- In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain docstrings that explain what they do
- PR passes make test

rogertangcn · 2020-10-14T19:01:07Z

databuilder/publisher/neo4j_csv_publisher.py

@@ -232,7 +212,7 @@ def _create_indices(self, node_file: str) -> None:
        LOGGER.info('Creating indices. (Existing indices will be ignored)')

        with open(node_file, 'r', encoding='utf8') as node_csv:
-            for node_record in csv.DictReader(node_csv):
+            for node_record in pandas.read_csv(node_csv).to_dict(orient='records'):


With csv.DictReader every record field is a string, which is inadequate once values are passed as parameters: a property that is numeric in Neo4j would be supplied with a string value.

pandas can infer data types when reading a CSV file, so the parameters generated from a pandas record carry the correct types, which prevents the issue.
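A small sketch of the difference, assuming a made-up two-line CSV (the column names are hypothetical). The csv part uses only the standard library; the pandas helper imports pandas lazily so the rest runs without it.

```python
import csv
import io

# Hypothetical sample: one text column, one numeric column, one boolean column.
SAMPLE_CSV = "name,col_count,is_view\ntest_table1,12,false\n"


def read_with_csv(text: str) -> list:
    """csv.DictReader returns every field as a str."""
    return list(csv.DictReader(io.StringIO(text)))


def read_with_pandas(text: str) -> list:
    """pandas.read_csv infers a dtype per column before converting to records."""
    import pandas  # third-party; imported here so the csv part works without it
    return pandas.read_csv(io.StringIO(text)).to_dict(orient="records")


csv_record = read_with_csv(SAMPLE_CSV)[0]
csv_types = {k: type(v).__name__ for k, v in csv_record.items()}
print(csv_types)
```

Calling `read_with_pandas(SAMPLE_CSV)` on the same text should return `col_count` as a numeric value and `is_view` as a boolean instead of strings. One thing to keep in mind with this approach is that pandas reads empty cells as NaN by default, so they may need separate handling before being sent as parameters.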

rogertangcn · 2020-10-14T19:03:07Z

databuilder/publisher/neo4j_csv_publisher.py

+        template = Template("""
+            MERGE (node:{{ LABEL }} {key: $KEY})
+            ON CREATE SET {{ PROP_BODY }}
+            {% if update %} ON MATCH SET {{ PROP_BODY }} {% endif %}


Replace string.Template with jinja2.Template, which supports if-conditions. Putting the conditional clause in the template helps readability.
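As a rough sketch of how the conditional behaves, the snippet below renders the template from the diff with and without `update`. The diff hunk is cut off after the `{% if %}` line, so closing the template right there is an assumption, as are the `render` helper and the sample `PROP_BODY` value.

```python
from jinja2 import Template

# Template text copied from the diff; ending it after the if-block is assumed.
template = Template("""
    MERGE (node:{{ LABEL }} {key: $KEY})
    ON CREATE SET {{ PROP_BODY }}
    {% if update %} ON MATCH SET {{ PROP_BODY }} {% endif %}
""")


def render(label: str, prop_body: str, update: bool) -> str:
    """Render the statement and collapse whitespace so it is easy to compare."""
    rendered = template.render(LABEL=label, PROP_BODY=prop_body, update=update)
    return " ".join(rendered.split())


with_update = render("Table", "node.name = $name", update=True)
without_update = render("Table", "node.name = $name", update=False)

print(with_update)
print(without_update)
```

The `$KEY` and `$name` tokens pass through jinja untouched, so they are still available as Cypher parameters when the statement is executed.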

Also, the {{ }} syntax in jinja is distinct from the $VAR syntax in Cypher (which is also the syntax of string.Template, a source of confusion). The distinction helps readability too.
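The overlap between the two `$` syntaxes can be seen with the standard library alone. In this sketch (the statement text and `Table` label are illustrative), string.Template treats a Cypher parameter as one of its own placeholders unless the dollar sign is doubled.

```python
from string import Template

# $LABEL is meant for the template, $KEY is meant to stay as a Cypher parameter.
cypher_with_param = "MERGE (node:$LABEL {key: $KEY})"

try:
    Template(cypher_with_param).substitute(LABEL="Table")
    collision = None
except KeyError as missing:
    # string.Template also tried to fill in $KEY and found no value for it.
    collision = missing.args[0]

# The workaround is to escape the Cypher parameter as $$KEY in the template text.
escaped = Template("MERGE (node:$LABEL {key: $$KEY})").substitute(LABEL="Table")

print(collision)
print(escaped)
```

With jinja's `{{ }}` delimiters no such escaping is needed, which is the readability gain described above.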

thanks for the info

rogertangcn · 2020-10-14T23:06:45Z

databuilder/publisher/neo4j_csv_publisher.py


-            template_params[k] = v
+            props.append('{id}.{key} = {val}'.format(id=identifier, key=k, val=f'${k}'))


The property value is taken out of the statement text, and a parameter token (e.g. $param) is put in its place.
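A minimal sketch of that idea, reusing the `props.append(...)` line from the diff. The surrounding function, its name, and the way values are collected into a separate `params` dict are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def build_prop_body(identifier: str, record: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return the SET body with $tokens plus the values to send as parameters."""
    props = []
    params = {}
    for k, v in record.items():
        # Same formatting as the diff: only the token "$<key>" enters the text.
        props.append('{id}.{key} = {val}'.format(id=identifier, key=k, val=f'${k}'))
        # Assumed: the value itself is kept aside and passed with the query.
        params[k] = v
    return ', '.join(props), params


# Hypothetical record; the name contains an apostrophe.
record = {"name": "test's_table4", "is_view": False}
prop_body, params = build_prop_body("node", record)

print(prop_body)
print(params)
```

Because the apostrophe only ever appears in `params`, it cannot break the quoting of the generated statement.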

rogertangcn · 2020-10-14T23:08:47Z

example/sample_data/sample_table.csv

@@ -3,3 +3,4 @@ hive,gold,test_schema,test_table1,"1st test table","tag1,tag2,pii,high_quality",
 dynamo,gold,test_schema,test_table2,"2nd test table","high_quality,recommended",false,
 hive,gold,test_schema,test_view1,"1st test view","tag1",true,
 hive,gold,test_schema,test_table3,"3rd test","needs_documentation",false,
+hive,gold,test_schema,"test's_table4","4th test","needs_documentation",false,


This is test data from here

cc @instazackwu

feng-tao · 2020-10-22T21:19:20Z

@allisonsuarez will do some testing in Lyft staging.

feng-tao

thanks @allisonsuarez for quick test, lgtm as well with a few small nits

feng-tao · 2020-10-22T23:04:09Z

databuilder/publisher/neo4j_csv_publisher.py

-# Setting field_size_limit to solve the error below
-# _csv.Error: field larger than field limit (131072)
-# https://stackoverflow.com/a/54517228/5972935
-csv.field_size_limit(int(ctypes.c_ulong(-1).value // 2))


oh, we no longer set this line?

Right. The csv module is no longer used in this module, so I took it out.

I'm wondering whether csv.field_size_limit() is a global setting. If it is, removing it could affect other modules that use csv. I will put it back to make sure the change is backward compatible. It would be better to move it somewhere "global" later.

sounds good, let's keep it here for now and we could remove it later. cc @jinhyukchang
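On the question of whether the setting is global: `csv.field_size_limit()` applies to the whole csv module in the current process, not to a single reader. The sketch below shows a helper that never touches the limit being affected by a change made elsewhere; the helper name and the field sizes are made up for the demonstration.

```python
import csv
import ctypes
import io


def read_first_field(text: str) -> str:
    """Stand-in for code in some other module that just uses csv.reader."""
    return next(csv.reader(io.StringIO(text)))[0]


# Start from a small limit so the effect is cheap to show; keep the old value.
original_limit = csv.field_size_limit(1024)
big_field = "x" * 2048

try:
    read_first_field(big_field)
    failed_with_small_limit = False
except csv.Error:
    # "field larger than field limit", the error the removed comment refers to.
    failed_with_small_limit = True

# Same expression as the line removed in the diff.
csv.field_size_limit(int(ctypes.c_ulong(-1).value // 2))
parsed_length = len(read_first_field(big_field))

# Restore the limit that was in place before this demonstration.
csv.field_size_limit(original_limit)

print(failed_with_small_limit, parsed_length)
```

This is why dropping the call from one module could change behavior for any other code in the same process that reads large CSV fields.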

feng-tao · 2020-10-22T23:04:54Z

databuilder/publisher/neo4j_csv_publisher.py

-MERGE (n1)-[r1:$TYPE]->(n2)-[r2:$REVERSE_TYPE]->(n1)
-$PROP_STMT RETURN n1.key, n2.key""")
-
-CREATE_UNIQUE_INDEX_TEMPLATE = Template('CREATE CONSTRAINT ON (node:${LABEL}) ASSERT node.key IS UNIQUE')


hey @rogertangcn, any reason to delete all the template constants here?

oh, nvm, read your later comment

feng-tao · 2020-10-22T23:07:31Z

databuilder/publisher/neo4j_csv_publisher.py

+        template = Template("""
+            MERGE (node:{{ LABEL }} {key: $KEY})
+            ON CREATE SET {{ PROP_BODY }}
+            {% if update %} ON MATCH SET {{ PROP_BODY }} {% endif %}


thanks for the info

feng-tao · 2020-10-22T23:09:04Z

requirements.txt

@@ -57,6 +57,8 @@ unicodecsv==0.14.1,<1.0

 httplib2>=0.18.0
 unidecode
+Jinja2==2.11.2


given the databuilder is a lib, we should make it a range

how about larger than 2.10.0 and less than 2.12 for now

feng-tao · 2020-10-22T23:09:09Z

requirements.txt

@@ -57,6 +57,8 @@ unicodecsv==0.14.1,<1.0

 httplib2>=0.18.0
 unidecode
+Jinja2==2.11.2
+pandas==1.1.3


how about larger than 0.21.0 and less than 1.2.0

feng-tao · 2020-10-22T23:09:40Z

example/sample_data/sample_table.csv

@@ -3,3 +3,4 @@ hive,gold,test_schema,test_table1,"1st test table","tag1,tag2,pii,high_quality",
 dynamo,gold,test_schema,test_table2,"2nd test table","high_quality,recommended",false,
 hive,gold,test_schema,test_view1,"1st test view","tag1",true,
 hive,gold,test_schema,test_table3,"3rd test","needs_documentation",false,
+hive,gold,test_schema,"test's_table4","4th test","needs_documentation",false,


cc @instazackwu

Signed-off-by: Roger Tang <roger.tang@workday.com>

feng-tao · 2020-10-23T05:45:47Z

thanks @rogertangcn for the update! @instazackwu the pr is going to merge, let us know if it resolves your issue!

…ement (#382) Signed-off-by: Roger Tang <roger.tang@workday.com> Co-authored-by: Roger Tang <roger.tang@workday.com> Signed-off-by: dikshathakur3119 <dikshathakur@lyft.com>

rogertangcn requested review from allisonsuarez, dikshathakur3119, feng-tao, jinhyukchang and a team as code owners October 14, 2020 18:45

feng-tao added the keep fresh Disables stalebot from closing an issue label Oct 14, 2020

rogertangcn changed the title ~~fix - use parameters to allow special characters in neo4j cypher statement~~ fix: use parameters to allow special characters in neo4j cypher statement Oct 14, 2020

rogertangcn force-pushed the special-char-as-params branch 4 times, most recently from 3cb86fe to d7bb9df Compare October 14, 2020 23:34

feng-tao approved these changes Oct 22, 2020

use parameters to allow special characters in neo4j cypher statement

3653b25

Signed-off-by: Roger Tang <roger.tang@workday.com>

rogertangcn force-pushed the special-char-as-params branch from d7bb9df to 3653b25 Compare October 23, 2020 05:32

feng-tao merged commit 6fd5035 into amundsen-io:master Oct 23, 2020

This was referenced Oct 23, 2020

Use query parameters rather than string building in Neo4jCsvPublisher amundsen-io/amundsen#513

Closed

Bug Report: Neo4j key values with an apostrophe ` cause syntax errors in publishing amundsen-io/amundsen#685

Closed

rogertangcn deleted the special-char-as-params branch October 23, 2020 16:59

hhobson mentioned this pull request Nov 3, 2020

databuilder 4.0.2 missing dependencies amundsen-io/amundsen#793

Closed

feng-tao mentioned this pull request Dec 2, 2020

chore: using csv.DictReader to read csv files to remove pandas #416

Closed
