Filesize exceeds postgres integer column maximum size #1583
Labels
💻 aspect: code (concerns the software code in the repository)
🛠 goal: fix (bug fix)
🟨 priority: medium (not blocking but should be addressed soon)
🧱 stack: catalog (related to the catalog and Airflow DAGs)
🔧 tech: airflow (involves Apache Airflow)
💾 tech: postgres (involves PostgreSQL)
🐍 tech: python (involves Python)
Description
We recently had a Wikimedia provider script run failure due to the following:
This unfortunately appears to be a legitimately large file size, as the upstream audio source is over 8 hours in length: https://commons.wikimedia.org/w/index.php?curid=123060206
This raises an interesting question for us, though: should we accept files that are larger than the integer maximum of 2147483647 bytes (~2 GB)? Is that something we care to have indexed? If so, we'll need to perform an `ALTER` on this column in the catalog, which could take quite some time. Alternatively, we could simply reject records with file sizes this large, perhaps as part of the `MediaStore`
logic. @WordPress/openverse-catalog, what do you think?

Update: Our intent now is to modify the column type to `bigint` to allow values which exceed the current maximum. A proposal for how this could be done can be found on the Make WP blog.

Reproduction
1. Set the `delay` attribute of the `WikimediaCommonsDataIngester` class to `0.1`
2. Run `just recreate`
3. Run `just shell`
4. Inside the shell, run `airflow dags backfill -s 2022-09-16 -e 2022-09-16 -v wikimedia_commons_workflow`
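For context, the failure boundary described above is the signed 32-bit maximum of a postgres `integer` column. A minimal sketch of the "reject oversized records" alternative mentioned in the description (the function name and placement are hypothetical, not the actual `MediaStore` code):

```python
# Maximum value storable in a postgres `integer` column (signed 32-bit).
POSTGRES_INT_MAX = 2_147_483_647


def clean_filesize(filesize):
    """Drop file sizes that would overflow an `integer` column.

    Hypothetical helper illustrating the rejection alternative:
    returns None (stored as NULL) for oversized or missing values,
    otherwise the value unchanged.
    """
    if filesize is None or filesize > POSTGRES_INT_MAX:
        return None
    return int(filesize)
```

With the column migrated to `bigint` instead, the effective ceiling becomes 9223372036854775807 (signed 64-bit), comfortably above any realistic media file size.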
Additional context
Resolution