Data file 'exists' check needs to be updated as duplicates are still being saved to json_files #50

stuchalk · 2021-04-23T16:51:58Z

Currently, when the same data file is ingested a second time, the 'exists' check can fail to detect the duplicate because the file being ingested is compared only to the most recent stored version (df_functions.py, def:updatedatafile, lines 81-89 of v0.2.1).

Therefore, the code needs to be updated to check the new data file against all versions that have been ingested. This should be done using the new 'jhash' field already added to the 'json_files' table, which stores an md5 hash of the 'file' field. Although 'jhash' might well be unique across the table, searching on both 'file_lookup_id' and 'jhash' would verify whether the file has already been uploaded.
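As a rough sketch of the lookup described above (not the project's actual code), the check could match on 'file_lookup_id' and 'jhash' together across every stored version. The function name `already_ingested`, the use of `sqlite3`, and the simplified table layout are assumptions made only for illustration; the real 'json_files' table and its database layer may differ.

```python
import hashlib
import sqlite3


def already_ingested(conn, file_lookup_id, jhash):
    """Hypothetical helper: True if ANY stored version of this file has this hash."""
    row = conn.execute(
        "SELECT 1 FROM json_files "
        "WHERE file_lookup_id = ? AND jhash = ? LIMIT 1",
        (file_lookup_id, jhash),
    ).fetchone()
    return row is not None


# Demo on an in-memory stand-in for json_files (assumed, simplified columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE json_files ("
    "id INTEGER PRIMARY KEY, file_lookup_id INTEGER, file TEXT, jhash TEXT)"
)
# Two versions of the same data file (file_lookup_id = 7).
for text in ('{"v": 1}', '{"v": 2}'):
    conn.execute(
        "INSERT INTO json_files (file_lookup_id, file, jhash) VALUES (?, ?, ?)",
        (7, text, hashlib.md5(text.encode("utf-8")).hexdigest()),
    )

old_hash = hashlib.md5(b'{"v": 1}').hexdigest()
new_hash = hashlib.md5(b'{"v": 3}').hexdigest()

# Re-ingesting the OLDER version is caught, not just the most recent one.
older_version_found = already_ingested(conn, 7, old_hash)
# Content that was never stored is not reported as a duplicate.
unseen_version_found = already_ingested(conn, 7, new_hash)
# The same hash under a different file_lookup_id does not match.
other_file_found = already_ingested(conn, 8, old_hash)
```

Because the query is not restricted to the latest row, re-ingesting any earlier version is reported as a duplicate, which is the case the current comparison misses.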

Note: in the code, the current 'generatedAt' field must be emptied (set to '') before the md5 hash is generated, so that the timestamp does not change the hash of otherwise identical files.
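A minimal sketch of that hashing step follows. It assumes 'generatedAt' is a top-level key of the JSON document and that the document is serialized with `json.dumps` using sorted keys; the name `compute_jhash` and the serialization settings are hypothetical, and the real code would need to hash exactly the text it stores in the 'file' field.

```python
import copy
import hashlib
import json


def compute_jhash(doc):
    """Hypothetical helper: md5 hex digest of doc with 'generatedAt' emptied."""
    normalized = copy.deepcopy(doc)  # leave the caller's document untouched
    if "generatedAt" in normalized:
        normalized["generatedAt"] = ""  # timestamp must not affect the hash
    # Assumed canonical form: sorted keys, compact separators.
    text = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two ingests of the same data that differ only in their generation timestamp.
first = {"generatedAt": "2021-04-23T16:51:58Z", "data": [1, 2, 3]}
second = {"generatedAt": "2021-04-24T09:00:00Z", "data": [1, 2, 3]}
# An ingest whose actual content differs.
changed = {"generatedAt": "2021-04-24T09:00:00Z", "data": [1, 2, 4]}

same_content_matches = compute_jhash(first) == compute_jhash(second)
changed_content_matches = compute_jhash(first) == compute_jhash(changed)
```

Hashing a copy keeps the original 'generatedAt' value available for whatever is eventually written to the table.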


stuchalk added the bug label Apr 23, 2021

stuchalk added this to the Beta 2 milestone Apr 23, 2021

stuchalk self-assigned this Apr 23, 2021
