Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use batched update to clean up empty JSON objects in tags fields #4091

Closed
AetherUnbound opened this issue Apr 11, 2024 · 2 comments
Closed
Assignees
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@AetherUnbound
Copy link
Contributor

Description

While investigating the existing Clarifai tags for #3948 (comment), I had some trouble with the Postgres query used because I assumed the tags field would either be NULL or an array. It turns out that some of our older records which haven't been touched since 2020 (30,376,519 to be exact) have empty objects in them instead (e.g. {}). Example:

deploy@localhost:openledger> select * from image where jsonb_typeof(tags) = 'object' limit 1;
-[ RECORD 1 ]-------------------------
identifier              | 8eb4ea84-d61b-41cd-b5e7-4b37a084506d
created_on              | 2020-03-14 07:56:02.078247+00
updated_on              | 2020-12-08 05:34:25.570206+00
ingestion_type          | provider_api
provider                | flickr
source                  | flickr
foreign_identifier      | 27734505288
foreign_landing_url     | https://www.flickr.com/photos/81751903@N08/27734505288
url                     | https://live.staticflickr.com/882/27734505288_9575435144_b.jpg
thumbnail               | <null>
width                   | 1024
height                  | 682
filesize                | <null>
license                 | by
license_version         | 2.0
creator                 | russimages
creator_url             | https://www.flickr.com/photos/81751903@N08
title                   | wood duck drake surfacing
meta_data               | {"views": "131", "pub_date": "1524326359", "date_taken": "2018-04-20 16:33:28", "license_url": "https://creativecommons.org/licenses/by/2.0/"}
tags                    | {}
watermarked             | False
last_synced_with_source | 2020-12-08 05:34:25.570206+00
removed_from_source     | False
filetype                | <null>
category                | <null>
standardized_popularity | 0.752619128083812

This can complicate some of the logic necessary for updating tags down the line, and it may even be causing issues with updating those tags now.

I'd like to use the batched update DAG to set all of these values to NULL

Additional context

Related to #3948

@AetherUnbound AetherUnbound added 🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 11, 2024
@AetherUnbound AetherUnbound self-assigned this Apr 11, 2024
@AetherUnbound
Copy link
Contributor Author

I've kicked this off with the following run config:

{
    "batch_size": 50000,
    "dry_run": false,
    "query_id": "clean_empty_tags_objects",
    "resume_update": false,
    "select_query": "WHERE tags = '{}'",
    "table_name": "image",
    "update_query": "SET tags = NULL",
    "update_timeout": 3600
}

@AetherUnbound
Copy link
Contributor Author

This update has been completed 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

No branches or pull requests

1 participant