ccMixter: Space present in URL #3906

Closed
AetherUnbound opened this issue Mar 12, 2024 · 0 comments · Fixed by #3907
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs 🔧 tech: airflow Involves Apache Airflow

Comments

@AetherUnbound
Contributor

Airflow log link

Note: Airflow is currently only accessible to maintainers and those given access. If you would like access to Airflow, please reach out to a member of @WordPress/openverse-maintainers.

https://airflow.openverse.engineering/log?execution_date=2024-03-12T21%3A01%3A43%2B00%3A00&task_id=ingest_data.pull_audio_data&dag_id=cc_mixter_workflow&map_index=-1

Description

[2024-03-12, 14:02:30 PDT] {taskinstance.py:2728} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: space is present in URL: https://ccmixter.org/content/7OOP3D/Vocals Mixdown.mp3
query_params: {"format": "json", "limit": 100, "offset": 31000}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 200, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 217, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/catalog/dags/providers/factory_utils.py", line 55, in pull_media_wrapper
    data = ingester.ingest_records()
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 276, in ingest_records
    raise error from ingestion_error
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 241, in ingest_records
    self.record_count += self.process_batch(batch)
  File "/opt/airflow/catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 473, in process_batch
    store.add_item(**record)
  File "/opt/airflow/catalog/dags/common/storage/audio.py", line 182, in add_item
    audio = self._get_audio(**audio_data)
  File "/opt/airflow/catalog/dags/common/storage/audio.py", line 189, in _get_audio
    audio_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/catalog/dags/common/storage/media.py", line 132, in clean_media_metadata
    media_data[field] = urls.validate_url_string(
  File "/opt/airflow/catalog/dags/common/urls.py", line 47, in validate_url_string
    raise SpaceInUrlError(f"space is present in URL: {url_string}")
common.urls.SpaceInUrlError: space is present in URL: https://ccmixter.org/content/7OOP3D/Vocals Mixdown.mp3
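
The traceback shows the record being rejected because the provider returned an audio URL with a literal space in its path. One possible way to handle this, sketched here under assumptions, is to percent-encode the path before validation. This is not necessarily the approach taken in the linked fix, and `encode_url_path` is a hypothetical helper, not a function from Openverse's `common.urls` module.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url: str) -> str:
    """Percent-encode unsafe characters (such as spaces) in a URL's path.

    Hypothetical helper for illustration only; not part of the Openverse codebase.
    """
    parts = urlsplit(url)
    # Keep "/" so path segments stay intact, and "%" so sequences that are
    # already percent-encoded are not encoded a second time.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


# The URL from the traceback above.
raw_url = "https://ccmixter.org/content/7OOP3D/Vocals Mixdown.mp3"
print(encode_url_path(raw_url))
```

Treating `%` as safe keeps the helper idempotent, so a URL that the provider has already encoded passes through unchanged. The trade-off is that a literal `%` in a file name would not be escaped.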

Reproduction

Confirmed this occurs locally with initial_query_params set to {"format": "json", "limit": 100, "offset": 31000}

DAG status

@AetherUnbound AetherUnbound added 💻 aspect: code Concerns the software code in the repository 🔧 tech: airflow Involves Apache Airflow 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Mar 12, 2024
@AetherUnbound AetherUnbound self-assigned this Mar 12, 2024