
Filesize exceeds postgres integer column maximum size #1583

Open
1 task
AetherUnbound opened this issue Sep 19, 2022 · 5 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow 💾 tech: postgres Involves PostgreSQL 🐍 tech: python Involves Python
Comments


AetherUnbound commented Sep 19, 2022

Description

We recently had a Wikimedia provider script run failure due to the following:

[2022-09-17, 00:16:40 UTC] {dbapi.py:231} INFO - Running statement: 
SELECT aws_s3.table_import_from_s3(
  'provider_data_audio_wikimedia_commons_20220916T000000',
  '',
  'DELIMITER E''	''',
  'openverse-catalog',
  'audio/wikimedia_commons/year=2022/month=09/wikimedia_audio_audio_v001_20220917000001.tsv',
  'us-east-1'
);
, parameters: None
[2022-09-17, 00:16:40 UTC] {taskinstance.py:1909} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/loader.py", line 45, in load_from_s3
    loaded, missing_columns, foreign_id_dup = sql.load_s3_data_to_intermediate_table(
  File "/usr/local/airflow/openverse_catalog/dags/common/loader/sql.py", line 156, in load_s3_data_to_intermediate_table
    loaded = postgres.run(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 211, in run
    self._run_command(cur, sql_statement, parameters)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/hooks/dbapi.py", line 235, in _run_command
    cur.execute(sql_statement)
psycopg2.errors.NumericValueOutOfRange: value "3381853454" is out of range for type integer
CONTEXT:  COPY provider_data_audio_wikimedia_commons_20220916t000000, line 96, column filesize: "3381853454"

This unfortunately appears to be a legitimately large file size, as the upstream audio source is over 8 hours in length: https://commons.wikimedia.org/w/index.php?curid=123060206

This raises an interesting question for us, though: should we accept files that are larger than the integer maximum of 2147483647 bytes (~2GB)? Is that something we care to have indexed? If so, we'll need to alter this column in the catalog, which could take quite some time. Alternatively, we could simply reject records with file sizes this large, perhaps as part of the MediaStore logic. @WordPress/openverse-catalog what do you think?

Update: Our intent now is to modify the column type to bigint to allow values which exceed the current maximum. A proposal for how this could be done can be found on the Make WP blog.
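The "reject oversized records" stopgap could look roughly like the following. This is an illustrative sketch, not the actual MediaStore implementation; `validate_filesize` and `PG_INTEGER_MAX` are hypothetical names introduced here:

```python
# Hypothetical filesize guard (illustrative, not the actual MediaStore code).
# PostgreSQL `integer` is a signed 32-bit value, so 2**31 - 1 is the ceiling.
PG_INTEGER_MAX = 2**31 - 1  # 2147483647 bytes, the "~2GB" limit above


def validate_filesize(filesize):
    """Return the filesize if it fits in a Postgres integer column, else None.

    Returning None lets the record's filesize be dropped before the row
    reaches the TSV that aws_s3.table_import_from_s3 copies in.
    """
    if filesize is None:
        return None
    if not 0 <= filesize <= PG_INTEGER_MAX:
        return None
    return filesize
```

With this guard, the failing value from the traceback (`3381853454`) would be discarded rather than crashing the COPY.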

Reproduction

  1. Set the delay attribute of the WikimediaCommonsDataIngester class to 0.1
diff --git a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
index d73c9f2..c28b128 100644
--- a/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
+++ b/openverse_catalog/dags/providers/provider_api_scripts/wikimedia_commons.py
@@ -47,6 +47,7 @@ class WikimediaCommonsDataIngester(ProviderDataIngester):
         "audio": prov.WIKIMEDIA_AUDIO_PROVIDER,
     }
     endpoint = f"https://{HOST}/w/api.php"
+    delay = 0.1
     headers = {"User-Agent": prov.UA_STRING}
 
     # The batch_limit applies to the number of pages received by the API, rather
  2. just recreate
  3. just shell
  4. airflow dags backfill -s 2022-09-16 -e 2022-09-16 -v wikimedia_commons_workflow

Additional context

Resolution

  • 🙋 I would be interested in resolving this bug.
@AetherUnbound AetherUnbound added 🐍 tech: python Involves Python 💻 aspect: code Concerns the software code in the repository 💾 tech: postgres Involves PostgreSQL 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon labels Sep 19, 2022
@AetherUnbound

Speaking with @stacimc on this, we think the best path forward would be to reject records which exceed this current maximum. We'll create a Make WP post about this decision in order to communicate it and provide a chance for feedback. This also doesn't preclude us from taking a different solution for handling data like this down the road.

@AetherUnbound

Per the discussion in the post above, we have decided to use a stopgap solution in the form of #1358 and follow that up with a series of database migrations down the line.
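For context on why `bigint` is the right target type, a quick sketch comparing the signed 32-bit and 64-bit column ranges against the failing value from the traceback (the table name in the SQL comment is illustrative):

```python
# PostgreSQL column ranges: integer is signed 32-bit, bigint is signed 64-bit.
INTEGER_MAX = 2**31 - 1  # 2147483647
BIGINT_MAX = 2**63 - 1   # 9223372036854775807

# The filesize that broke the Wikimedia Commons load, from the traceback above.
failing_filesize = 3381853454

assert failing_filesize > INTEGER_MAX   # out of range for integer, as reported
assert failing_filesize <= BIGINT_MAX   # fits comfortably in bigint

# The eventual migration would be along the lines of (illustrative SQL,
# hypothetical table name):
#   ALTER TABLE audio ALTER COLUMN filesize TYPE bigint;
# Changing the type rewrites the table, which is why the alter
# "could take quite some time" on a large catalog.
```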


stacimc commented Jan 27, 2023

@AetherUnbound Do you mind updating the issue description with what this issue is now tracking? I'm not sure what the scope of this ticket is now.

Should this be part of the stability project now that #1358 is complete or would this make sense in the Data Normalization milestone?

@AetherUnbound

I've added a note to the issue description and moved this over to the other milestone!

@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@openverse-bot openverse-bot added this to Backlog in Openverse Apr 17, 2023
Projects
Status: 📋 Backlog
Development

No branches or pull requests

3 participants