Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Results considered dead if SSL fails during dead link check, even though they might not actually be dead #4371

Open
sarayourfriend opened this issue May 22, 2024 · 1 comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 💬 talk: discussion Open for discussions and feedback

Comments

@sarayourfriend
Copy link
Contributor

Description

Some results are considered "dead" even though they are actually available with cleartext (and in this case, followed by a redirect).

    "event": "Failed to validate image! Reason: Cannot connect to host collection.mobiliernational.culture.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1006)')]",

collection.mobiliernational.culture.gouv.fr indeed has an expired certificate, but if you visit the page in cleartext, it redirects to a URL with a valid certificate, https://collection.mobilier-national.fr/recherche

There are other examples of this (and the ones I saw all looked to be French government agency related, but that could be just a coincidence from a cluster of results in particular queries).

@WordPress/openverse-catalog I'm not sure whether this needs a fix in the API or if it's something that we should address during data refresh? It would be nice if the API could follow these redirects, maybe it's safe to retry requests with HTTP when HTTPS fails on an SSL error? What do y'all think @WordPress/openverse-api and catalogue folks?

Reproduction

The logs unfortunately do not show a specific URL. I'm adding the URL to this particular log line in #4333 but for now I don't know the exact works that are failing with this. We can pull it from Elasticsearch by querying on the image urls for this pattern.

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 💬 talk: discussion Open for discussions and feedback 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs labels May 22, 2024
@AetherUnbound
Copy link
Contributor

Since we're moving away from adding cleaning steps as part of the data refresh, I think we'd like to try and avoid adding new steps there. I think having the API follow those redirects makes sense! We could then report the updated links in Redis, and (when the time comes) follow a procedure like #3585 for reincorporating that data back into the catalog. That feels like a good balance between checking all of the catalog URLs and updating links as they are come across by users.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 💬 talk: discussion Open for discussions and feedback
Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants