Running sample_data_loader.py raises a ValueError #803

Closed
HenryLiMN opened this issue Nov 10, 2020 · 7 comments
Labels
status:in_progress Issue that is being worked on right now type:bug An unexpected problem or unintended behavior

Comments

@HenryLiMN

Expected Behavior

I'm following the quick start guide: https://github.com/amundsen-io/amundsen/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker

Everything up to the sample data ingestion step works. The expected behavior is that the sample table metadata is ingested into the Neo4j graph.

Current Behavior

When I get to the step that ingests sample data with python3 example/scripts/sample_data_loader.py, it raises a ValueError and fails to ingest the sample data:

(venv) heli@Hes-MacBook-Pro amundsendatabuilder % python3 example/scripts/sample_data_loader.py
Traceback (most recent call last):
  File "example/scripts/sample_data_loader.py", line 285, in <module>
    'databuilder.models.table_stats.TableColumnStats')
  File "example/scripts/sample_data_loader.py", line 115, in run_csv_job
    publisher=Neo4jCsvPublisher()).launch()
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/job/job.py", line 77, in launch
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/job/job.py", line 67, in launch
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/task/task.py", line 65, in run
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/loader/file_system_neo4j_csv_loader.py", line 119, in load
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/csv.py", line 155, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/csv.py", line 151, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'stat_val'

Possible Solution

Steps to Reproduce

  1. Follow the steps in https://github.com/amundsen-io/amundsen/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker
  2. Get to the python3 example/scripts/sample_data_loader.py step and run it

Screenshots (if appropriate)

Context

Your Environment

Followed the quick start guide's steps exactly

  • Amundsen version used:
  • Data warehouse stores:
  • Deployment (k8s or native):
  • Link to your fork or repository:
@nadamakram

I am having the same issue too

@feng-tao feng-tao added Project: Databuilder status:needs_reproducing For bugs that need to be reproduced in order to get fixed type:bug An unexpected problem or unintended behavior labels Nov 10, 2020
@feng-tao
Member

will take a look. thanks for reporting.

@dorianj
Contributor

dorianj commented Nov 10, 2020

i'm debugging this, will post stuff as I find it

I changed the log line at databuilder/loader/file_system_neo4j_csv_loader.py:160 to LOGGER.info('Creating file for {}: {}'.format(key, csv_record_dict.keys())), and it prints:

INFO:databuilder.loader.file_system_neo4j_csv_loader:Creating file for ('Stat', 6): dict_keys(['LABEL', 'KEY', 'stat_val:UNQUOTED', 'stat_name', 'start_epoch', 'end_epoch'])

Later, when the exception is raised, the loader is trying to write a row keyed by stat_val (without the :UNQUOTED suffix) into a file whose header has stat_val:UNQUOTED, hence the error. I'm not yet sure whether the header in the file should carry the :UNQUOTED suffix or not.

--

edit 1:
What seems to be happening comes from this sample data:

cluster,db,schema,table_name,col_name,stat_name,stat_val,start_epoch,end_epoch
gold,hive,test_schema,test_table1,col1,"distinct values","8",1432300762,1562300762
gold,hive,test_schema,test_table1,col1,"min","""aardvark""",1432300762,1562300762

When the first stat_val is read/written, it's a number and thus gets the :UNQUOTED suffix. When the second one is read, it's a string, so it should stay quoted and gets no suffix. Its key is plain stat_val, which doesn't match the stat_val:UNQUOTED header, and the mismatch occurs.
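
The mismatch can be reproduced in isolation with the standard library's csv.DictWriter. This is a minimal sketch, not the databuilder code: suffix_key, serialize, and write_rows are hypothetical helpers, and the rule "numeric values get :UNQUOTED" is an assumption based on the description above. The header is taken from the first row only, as the log line suggests.

```python
import csv
import io


def suffix_key(key, value):
    # Hypothetical stand-in for the serializer: numeric values get ':UNQUOTED'.
    return f"{key}:UNQUOTED" if isinstance(value, (int, float)) else key


def serialize(record):
    # Rename each key according to the type of its value.
    return {suffix_key(k, v): v for k, v in record.items()}


def write_rows(records):
    buffer = io.StringIO()
    writer = None
    for record in records:
        row = serialize(record)
        if writer is None:
            # The header is fixed by the keys of the first row only.
            writer = csv.DictWriter(buffer, fieldnames=list(row.keys()))
            writer.writeheader()
        writer.writerow(row)
    return buffer.getvalue()


records = [
    {"stat_name": "distinct values", "stat_val": 8},       # numeric
    {"stat_name": "min", "stat_val": '"aardvark"'},        # string
]

try:
    write_rows(records)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The second row fails because its stat_val key is not among the fieldnames derived from the first row, which is the same ValueError as in the traceback in the report.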

Previously this column was force-set to :UNQUOTED at the source, but if you do that now, the suffix gets appended twice. I have an idea for a fix; I'm not familiar enough with the surrounding code to be confident it's a good one, but I'll try it.
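
To illustrate the double append and one way to guard against it, here is a small sketch. The function names are hypothetical and this is not the actual serializer code. It only shows that an unconditional append suffixes an already-suffixed key twice, while a check with str.endswith keeps the operation idempotent.

```python
UNQUOTED_SUFFIX = ":UNQUOTED"


def naive_suffix(key):
    # Always appends, so an already-suffixed key is suffixed twice.
    return key + UNQUOTED_SUFFIX


def idempotent_suffix(key):
    # Appends only when the suffix is missing, so repeated calls are safe.
    return key if key.endswith(UNQUOTED_SUFFIX) else key + UNQUOTED_SUFFIX


source_key = "stat_val:UNQUOTED"  # column forced to unquoted at the source

double_appended = naive_suffix(source_key)
deduped = idempotent_suffix(source_key)
plain = idempotent_suffix("stat_val")

print(double_appended)
print(deduped)
print(plain)
```

With a guard like this, every row would produce the same key whether or not the source already carries the suffix, so the CSV header and later rows would agree.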

--

edit 2: I tried this fix by setting the column to stat_val:UNQUOTED in the source and adding de-dupe logic to databuilder/serializers/neo4_serializer.py:serialize_node. This gets past the ValueError, but a new error appears:

Traceback (most recent call last):
  File "example/scripts/sample_data_loader.py", line 328, in <module>
    job_es_table.launch()
[...]
  File "/Users/dorian/dev/amundsen/amundsendatabuilder/databuilder/task/task.py", line 58, in run
    record = self.extractor.extract()
  File "/Users/dorian/dev/amundsen/amundsendatabuilder/databuilder/extractor/neo4j_search_data_extractor.py", line 156, in extract
    return self.neo4j_extractor.extract()
[...]
  File "/Users/dorian/dev/miniconda3/envs/adb/lib/python3.7/site-packages/neobolt/direct.py", line 755, in on_failure
    raise CypherError.hydrate(**metadata)
neobolt.exceptions.CypherTypeError: SUM(read.read_count) can only handle numerical values, or null.

I'll look closer, but I'm a bit stuck. I'm not sure how the data loaded before, given the mix of strings and numbers.

@dorianj
Contributor

dorianj commented Nov 10, 2020

Meant to mention: this was caused (or, possibly more accurately, revealed) by amundsen-io/amundsendatabuilder#380.

@feng-tao
Member

@HenryLiMN the fixes are merged in databuilder. Could you pull the latest master and retry?

@feng-tao feng-tao added status:in_progress Issue that is being worked on right now and removed status:needs_reproducing For bugs that need to be reproduced in order to get fixed labels Nov 11, 2020
@feng-tao
Member

feel free to reopen if the issue persists.

@HenryLiMN
Author

It worked! Thanks for the quick fix 🚀

dorianj pushed a commit to dorianj/amundsen that referenced this issue Apr 25, 2021
Signed-off-by: Marcos Iglesias <miglesiasvalle@lyft.com>
feng-tao pushed a commit that referenced this issue May 7, 2021
Signed-off-by: Marcos Iglesias <miglesiasvalle@lyft.com>
zacr pushed a commit to SaltIO/amundsen that referenced this issue May 13, 2022
Signed-off-by: Marcos Iglesias <miglesiasvalle@lyft.com>
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this issue Jun 30, 2022
Signed-off-by: Marcos Iglesias <miglesiasvalle@lyft.com>