Running sample_data_loader.py raises a ValueError #803

HenryLiMN · 2020-11-10T02:21:10Z

Expected Behavior

I'm following the quick start guide: https://github.com/amundsen-io/amundsen/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker

Everything up to the sample data ingestion step works. The expected behavior is that the sample table metadata is ingested into the Neo4j graph.

Current Behavior

When I get to the step where it's supposed to ingest sample data using python3 example/scripts/sample_data_loader.py, it raises a ValueError and fails to ingest the sample data:

(venv) heli@Hes-MacBook-Pro amundsendatabuilder % python3 example/scripts/sample_data_loader.py
Traceback (most recent call last):
  File "example/scripts/sample_data_loader.py", line 285, in <module>
    'databuilder.models.table_stats.TableColumnStats')
  File "example/scripts/sample_data_loader.py", line 115, in run_csv_job
    publisher=Neo4jCsvPublisher()).launch()
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/job/job.py", line 77, in launch
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/job/job.py", line 67, in launch
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/task/task.py", line 65, in run
  File "/Users/heli/Projects/amundsen/amundsendatabuilder/venv/lib/python3.7/site-packages/amundsen_databuilder-4.0.3-py3.7.egg/databuilder/loader/file_system_neo4j_csv_loader.py", line 119, in load
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/csv.py", line 155, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/csv.py", line 151, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))
ValueError: dict contains fields not in fieldnames: 'stat_val'
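
The last frame is standard csv.DictWriter behavior: by default it raises this ValueError whenever a row dict has a key that is not in the writer's fieldnames. A minimal sketch that reproduces the same message outside of databuilder follows. The field names mirror the ones in this issue; the in-memory buffer and the try_write helper are illustrative assumptions, not databuilder code.

```python
import csv
import io


def try_write(fieldnames, row):
    """Write one row with csv.DictWriter; return the error text, or None on success."""
    writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
    writer.writeheader()
    try:
        writer.writerow(row)
    except ValueError as exc:
        return str(exc)
    return None


# Header as it would be fixed by a first row whose value was treated as a number.
header = ['KEY', 'stat_val:UNQUOTED']

# A row keyed the same way as the header is accepted.
ok = try_write(header, {'KEY': 'k1', 'stat_val:UNQUOTED': 8})

# A row keyed by plain 'stat_val' is rejected, as in the traceback above.
err = try_write(header, {'KEY': 'k2', 'stat_val': '"aardvark"'})

print(ok)
print(err)
```

So the failure does not depend on Neo4j at all; it happens while the loader is still writing the intermediate CSV file.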

Possible Solution

Steps to Reproduce

Follow the steps in https://github.com/amundsen-io/amundsen/blob/master/docs/installation.md#bootstrap-a-default-version-of-amundsen-using-docker
Get to the python3 example/scripts/sample_data_loader.py step and run it

Screenshots (if appropriate)

Context

Your Environment

Followed the quick start guide's steps exactly

Amundsen version used:
Data warehouse stores:
Deployment (k8s or native):
Link to your fork or repository:


nadamakram · 2020-11-10T14:36:09Z

I am having the same issue too

feng-tao · 2020-11-10T17:46:07Z

will take a look. thanks for reporting.

dorianj · 2020-11-10T19:13:45Z

i'm debugging this, will post stuff as I find it

i changed the log line at databuilder/loader/file_system_neo4j_csv_loader.py:160 to LOGGER.info('Creating file for {}: {}'.format(key, csv_record_dict.keys())) and it gives:

INFO:databuilder.loader.file_system_neo4j_csv_loader:Creating file for ('Stat', 6): dict_keys(['LABEL', 'KEY', 'stat_val:UNQUOTED', 'stat_name', 'start_epoch', 'end_epoch'])

then later, when the exception is raised, the loader is trying to write a row keyed by stat_val (without the :UNQUOTED), which isn't in the header, thus the error. I'm not clear yet whether the header should have the :UNQUOTED suffix in the file or not

--

edit 1:
what seems to be happening is from this sample data:

cluster,db,schema,table_name,col_name,stat_name,stat_val,start_epoch,end_epoch
gold,hive,test_schema,test_table1,col1,"distinct values","8",1432300762,1562300762
gold,hive,test_schema,test_table1,col1,"min","""aardvark""",1432300762,1562300762

when the first stat_val is read/written, it's a number, and thus gets the :UNQUOTED suffix. when the second one is read, it is a string, so it doesn't get the suffix, and its key (stat_val) no longer matches the header (stat_val:UNQUOTED).

previously this column was force set to :UNQUOTED at the source, but if you do that now, it results in a double append. i have an idea for a fix but i'm not familiar enough with the surroundings to be confident it's a good one, but i'll try it
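
A rough sketch of the mismatch described in edit 1, assuming the header suffix is chosen per value. serialize_row is a hypothetical stand-in written for illustration, not the real databuilder serializer, and the rows are trimmed to two columns.

```python
UNQUOTED_SUFFIX = ':UNQUOTED'


def serialize_row(row):
    """Hypothetical serializer: numeric values get an ':UNQUOTED' header suffix."""
    out = {}
    for name, value in row.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        out[name + UNQUOTED_SUFFIX if is_number else name] = value
    return out


# The first stat is numeric, the second is a string.
first = serialize_row({'stat_name': 'distinct values', 'stat_val': 8})
second = serialize_row({'stat_name': 'min', 'stat_val': '"aardvark"'})

# The CSV header is fixed by the first row, so a key that only appears
# in a later row is rejected by the writer.
fieldnames = list(first)
unexpected = [key for key in second if key not in fieldnames]

print(fieldnames)
print(unexpected)
```

With a per-value suffix, two rows of the same node type can end up with different keys, which is exactly what the CSV writer complains about.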

--

edit 2: i tried out this fix by making the stat_val:UNQUOTED in the source, and adding de-dupe logic to databuilder/serializers/neo4_serializer.py:serialize_node. this successfully pushes through this error, but we get a new error:

Traceback (most recent call last):
  File "example/scripts/sample_data_loader.py", line 328, in <module>
    job_es_table.launch()
[...]
  File "/Users/dorian/dev/amundsen/amundsendatabuilder/databuilder/task/task.py", line 58, in run
    record = self.extractor.extract()
  File "/Users/dorian/dev/amundsen/amundsendatabuilder/databuilder/extractor/neo4j_search_data_extractor.py", line 156, in extract
    return self.neo4j_extractor.extract()
[...]
  File "/Users/dorian/dev/miniconda3/envs/adb/lib/python3.7/site-packages/neobolt/direct.py", line 755, in on_failure
    raise CypherError.hydrate(**metadata)
neobolt.exceptions.CypherTypeError: SUM(read.read_count) can only handle numerical values, or null.

i'll look closer, but I'm getting a bit stuck. I'm not sure how the data were loading before, given the mix of strings and numbers
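
for reference, the de-dupe idea from edit 2 amounts to something like the sketch below. This is not the actual change to neo4_serializer.py; add_unquoted_suffix is a hypothetical helper that only shows how forcing the suffix at the source can avoid the double append.

```python
UNQUOTED_SUFFIX = ':UNQUOTED'


def add_unquoted_suffix(field_name):
    """Append ':UNQUOTED' once; leave names that already carry it unchanged."""
    if field_name.endswith(UNQUOTED_SUFFIX):
        return field_name
    return field_name + UNQUOTED_SUFFIX


once = add_unquoted_suffix('stat_val')
twice = add_unquoted_suffix(once)  # second call is a no-op

print(once, twice)
```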

dorianj · 2020-11-10T20:11:25Z

meant to mention -- this was caused (or possibly more accurately, revealed) by amundsen-io/amundsendatabuilder#380

feng-tao · 2020-11-11T18:24:00Z

@HenryLiMN fixes are merged in databuilder, could you pull the latest master and retry?

feng-tao · 2020-11-11T21:35:02Z

feel free to reopen if the issue persists.

HenryLiMN · 2020-11-12T06:00:08Z

It worked! Thanks for the quick fix 🚀


feng-tao added Project: Databuilder status:needs_reproducing For bugs that need to be reproduced in order to get fixed type:bug An unexpected problem or unintended behavior labels Nov 10, 2020

feng-tao added status:in_progress Issue that is being worked on right now and removed status:needs_reproducing For bugs that need to be reproduced in order to get fixed labels Nov 11, 2020

DataBrenes mentioned this issue Nov 11, 2020

Sqlite3 error when running sample_data_loader.py #807

Closed

feng-tao closed this as completed Nov 11, 2020

lxlguy mentioned this issue Nov 12, 2020

Unable to load sample data #808

Closed

