
Decode and deduplicate tags in the catalog with the batched_update DAG #4452

Open
krysal opened this issue Jun 6, 2024 · 11 comments
Assignees
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@krysal
Member

krysal commented Jun 6, 2024

Problem

In #4143, @obulat proposed adding new cleaning steps to fix tags in the Catalog, but the option of including them in the Ingestion Server was declined in favor of using the batched_update DAG.

It is important to fix the encoding quickly because it can cause a gray Nuxt error screen on pages that contain tags with character sequences that cannot be URI-encoded (such as \udadd from "ciudaddelassiencias").

Description

We want to take the functions that were planned for inclusion in that PR and translate them into parameters for this DAG. Given the complexity of the decoding transformation, it might require some advanced PostgreSQL features, like a combination of pattern matching and PL/Python functions.

Duplicated tags were previously removed in #1566, so we will apply the same solution here, since the decoding may create new duplicates.
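
For illustration, here is a minimal sketch of what the deduplication step could look like, assuming the catalog's image table and its jsonb tags column; in practice the batched_update DAG would apply something like this in batches rather than as one statement:

UPDATE image
SET tags = (
    -- Re-aggregate the array, keeping one copy of each distinct tag object.
    -- Note that jsonb_agg(DISTINCT ...) does not preserve the original order.
    SELECT COALESCE(jsonb_agg(DISTINCT tag), '[]'::jsonb)
    FROM jsonb_array_elements(image.tags) AS tag
)
WHERE tags IS NOT NULL;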

Additional context

Related to #4125 and #4199 (similar issue).

@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🛠 goal: fix Bug fix 🧱 stack: catalog Related to the catalog and Airflow DAGs 🗄️ aspect: data Concerns the data in our catalog and/or databases labels Jun 6, 2024
@krysal krysal added this to the Data normalization milestone Jun 6, 2024
@sarayourfriend sarayourfriend self-assigned this Jun 11, 2024
@sarayourfriend
Contributor

I tried to run the DAG again today, and it turns out RDS does not support the PL/Python (plpythonu) extension 😞

RDS does support plperl, plrust, plpgsql, and plv8. Of those, plv8 (V8 being the same JS engine Chrome uses) might be the closest fit for this use case, but I'll see. It might be that reingesting these records is the easiest, most reliable way to do it.
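
For reference, a minimal plv8 sketch of the decoding step could look like the following (the function name is hypothetical, and it deliberately leaves lone surrogate halves such as \udadd untouched, since those are exactly the ambiguous cases):

CREATE OR REPLACE FUNCTION decode_unicode_escapes(val text) RETURNS text AS $$
    // Replace each literal "\uXXXX" sequence with the character it encodes.
    // Surrogate halves (D800-DFFF) are skipped: a lone surrogate is not a
    // valid character on its own and cannot be stored as UTF-8 text.
    return val.replace(/\\u([0-9a-fA-F]{4})/g, function (match, hex) {
        var code = parseInt(hex, 16);
        if (code >= 0xD800 && code <= 0xDFFF) {
            return match;
        }
        return String.fromCharCode(code);
    });
$$ LANGUAGE plv8 IMMUTABLE STRICT;

-- e.g. SELECT decode_unicode_escapes('caf\u00e9');  -- returns 'café'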

@sarayourfriend
Contributor

sarayourfriend commented Jul 16, 2024

@stacimc can you chime in on whether you think it's okay to go ahead and start thinking about how we would reingest these specific records identified as having potential unicode escape literals?

I think that's the safest approach anyway, because even an escaped literal unicode sequence could be valid metadata from the source. In other words, it might be wrong to assume that every escaped literal sequence was caused by some previous bad transformation on our end. Most of them certainly were (we know there's some historical occurrence of that), but if even one wasn't, then we've mangled that data with no easy way to get it back (other than reingesting it anyway). The only way to be sure is to reingest those records.

Can you give any hints or tips of things you think need to be looked out for when thinking about this? The general idea I shared in the initial PR is here, for reference: #4475 (comment)

Basically: I don't think a DAG like the one I merged in that PR is actually safe. This isn't a safe or reliable thing to infer based on our data alone. Therefore, to fix this, we need to reingest the records.

A temporary fix for instances like https://openverse.org/image/3e3bc924-a7c0-4a6b-bcea-4a17160355d2 is to apply the frontend's automated fix only to works with indexation dates before a certain cutoff (let's say 2021-01-01, based on Olga's query in this comment).

@stacimc
Contributor

stacimc commented Jul 16, 2024

@sarayourfriend if the only way to safely determine what the tags should be is to reingest, then I think that's fine.

The Flickr reingestion DAG is absolutely not reliable for fixing this for a variety of reasons, but I do think we could implement something like you've described using a lot of existing code from the ProviderDataIngester and the loader tasks.

Actually, I would look into extending the existing FlickrDataIngester. You can add create_preingestion_tasks and create_postingestion_tasks hooks to it (see iNaturalist for an example) for creating a temp table where you select the foreign_identifiers of the rows you want to update, in the style of the batched_update DAG, and eventually dropping that table. Then modify the new ingester class to select batches of ids from this table and query the individual record endpoint from Flickr. You might need to tinker with the base class a little, but I think this should all be doable pretty easily. The advantage is that records all get committed through the MediaStore, we'd get TSVs in S3 just like a regular provider DAG, etc. You'd just register a new DAG in provider_workflows.py with the new ingester class.
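
As a rough sketch of that preingestion step (the table name and the regex are illustrative only; the exact pattern depends on how the escapes actually ended up stored), the temp table could be built with something like:

-- Collect the Flickr foreign_identifiers of rows whose tags appear to contain
-- literal "\uXXXX" escape sequences. The jsonb-to-text cast renders a literal
-- backslash as \\, and the regex has to escape each of those again, hence the
-- four backslashes in the pattern.
CREATE TABLE flickr_tag_fix_rows AS
SELECT foreign_identifier
FROM image
WHERE provider = 'flickr'
  AND tags::text ~ '\\\\u[0-9a-fA-F]{4}';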

The only thing that would be an issue is that the upsert step of our provider DAGs currently merges tags rather than overwriting them 🤔 I brought this up on the linked PR too... I'm interested in changing the upsert strategy for tags to overwrite tags from the provider. Obviously we'd have to be careful to ensure we're still merging tags with a different provider field, so that we don't overwrite machine-generated tags for example, but this honestly feels like a bug as implemented. If we could make that upsert a little more sophisticated then I think this would work pretty simply. @AetherUnbound @WordPress/openverse-catalog any thoughts/objections to that in particular?

@AetherUnbound
Contributor

That's a good point about potentially overwriting the tags rather than merging them! I wonder, though, if it might actually be better to store all of the tags but (similar to provider and accuracy) add another field (perhaps live or current? or even deleted?) which shows that a tag is no longer present on the upstream source. That would feel more appropriate to me, in line with our desire to keep as much data in the catalog as possible in the long term. What do you think?

@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Jul 16, 2024
@sarayourfriend
Contributor

Madison, in your version, would we basically update the existing tags where provider == current_provider with "live": False or something like that? And then merge in the new tags?

I don't mind that personally; if we're doing that, I'd think we should add dates to new tags for provenance's sake. It's certainly the least destructive approach, which is great!

I suppose the main difficulty there is that if we set current or live to False/0/whatever, any query to the tags would now need to be aware of those fields. It's a broader change.

How about instead of keeping those "former" tags in tags, we put them in meta_data under something like a tag_history field, with date-string keys mapped to the list of tags from that date? Thinking from a general perspective of tracking changes to individual works, that establishes a generic pattern we could follow. I haven't checked whether the top-level keys of meta_data could clash with provider data, or how we organise the data in that column. If it's possible to have a general history field in meta_data that maps a field name to an object of date -> value-at-that-date (where date is the date that value was replaced by the "next" value), it would remove the need for the overhead (operational more than performance) of checking for a current/live/version field on individual tags, and it has the additional benefit of being able to track changes to any field on a given work.
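
Purely as an illustration of that shape (nothing here is decided, and the field name is just the tag_history example above), moving a work's current tags into such a history field could look roughly like:

-- Applied only to the affected rows in practice; shown unscoped for brevity.
UPDATE image
SET meta_data = COALESCE(meta_data, '{}'::jsonb)
    || jsonb_build_object(
        'tag_history',
        COALESCE(meta_data -> 'tag_history', '{}'::jsonb)
            || jsonb_build_object(CURRENT_DATE::text, COALESCE(tags, '[]'::jsonb))
    );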

We could also have a separate table for this, like image_history, etc, that is just (identifier UUID NOT NULL, data jsonb NOT NULL, date DATE NOT NULL, summary TEXT, ???). At least that way there's no chance of meta_data keys clashing, if that's a potential issue, and it doesn't add any logistical/operational overhead to querying the main tables. The same schema could be replicated in S3 if we didn't need it in the database.
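
For concreteness, that separate table would be roughly the schema written out above (leaving the open columns out):

CREATE TABLE image_history (
    identifier uuid NOT NULL,
    data jsonb NOT NULL,
    date date NOT NULL,
    summary text
);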

I'm on board to take it slow on these considerations for this issue. I've downgraded the priority to medium from high, I think our current stop-gap is OK, and this kind of data-quality and revision work is worth making sure we get right and don't make destructive mistakes.

@stacimc
Contributor

stacimc commented Jul 16, 2024

I wonder though, if it might actually be better to store all of the tags but (similar to provider and accuracy) add another field (perhaps live or current? or even deleted?)

I'm not opposed, although that raises some questions for me (for example, should deleted tags be ignored in search? I'd argue that might be a good idea, since it's very likely that if a tag was deleted from the record at source, it wasn't a "good quality" tag. Similarly, should they be displayed on the frontend?).

That would feel more appropriate to me in-line with our desire to keep as much data in the catalog as possible in the long-term.

I also feel like this is a little different from, for example, the machine-generated tags below a certain accuracy threshold. In that case those are all perfectly valid tags, and the issue is that we're making decisions about which ones we want to use/display. This case could be thought of as correcting data that has been invalidated upstream (in the way that when the title or description of a record changes at the source, we simply update that in the catalog). But... that is certainly a nuanced distinction! I certainly have no objection to keeping the data (beyond noting that it'll make the upsert a little more complicated, which is totally fine). Those questions about deleted tags in display/search can also be answered later. There's no regression from current behavior :) And the great thing about keeping the data is you can always change your mind later, which is obviously not the case if we delete it.

As far as a tag_history table or something more complicated, I think we should take a step back and do a full project proposal/etc. For example, immediately I'm questioning if we're going to do that, whether we should also be keeping history for other fields that can change upstream. And if not, why?

@sarayourfriend
Contributor

As far as a tag_history table or something more complicated, I think we should take a step back and do a full project proposal/etc. For example, immediately I'm questioning if we're going to do that, whether we should also be keeping history for other fields that can change upstream. And if not, why?

Agreed that this is scope enough for a project. My assumption is that yes we should keep that data, if we can, but I'm also wary of the fact that I actually don't have a good reason why we should, and that without a good reason, we risk spending a considerable amount of money storing data we don't currently have a known use for. You're absolutely right to point out that it's very different in that respect from the machine generated tags where the argument to keep all of them is easier to make.

We also don't need to start a full historical record for all works right away. We could, for example, develop an idea for a generic approach but only use it for this targeted reingestion, and see how it goes, and then implement it more broadly if we decide it suits our purposes (and probably more freely iterate with a smaller amount of pre-existing data than if we started doing this for everything right away).

I suppose a big question is about non-dated provider DAGs? Would we diff every field for every work to see if it needs a new history entry?

@stacimc
Contributor

stacimc commented Jul 16, 2024

I suppose a big question is about non-dated provider DAGs? Would we diff every field for every work to see if it needs a new history entry?

I think we could use triggers here to insert into some history table when there are updates to the media tables, although I'd have to play around with it -- I know you can create a trigger that runs BEFORE UPDATE, but I'm not sure off the top of my head what the best way to grab just the changed fields would be. You could have a separate BEFORE UPDATE trigger for each field... Just not sure about the performance implications 🤔

It would be very interesting to see if we can come up with any use case for storing this information. Honestly, I have some reservation about storing historical user-generated data from third party sites -- for example, I wonder if a Flickr user who changed the title/tags/etc on an image would be surprised that Openverse is retaining the old data, and if that could be cause for concern. If we scope a project for this I'd like to see that addressed in the proposal.

Circling back to the decode/deduplicate tags issue, do you think it should block on this? We're potentially talking about a pretty big and as-yet unprioritized/unplanned project. One thing we could do, using the approach I outlined in this comment, is to add another preingestion task, just to the new DAG, that drops at least the tags with potentially invalid characters. You could argue that tags that have gotten into an invalid state can be dropped and replaced with current data from the provider. Then the regular load steps will merge in the current tags just fine.

Alternatively, the preingestion task could add a stale field to all existing tags, just for this DAG for now. If we go that route instead of dropping, we'd still need to consider the questions I asked earlier about whether to search on/display those tags.
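
For illustration, marking all of a work's existing tags as stale could be something like this (assuming tags is a jsonb array of tag objects; whether we want a stale field at all is exactly the open question above):

UPDATE image
SET tags = (
    -- Merge {"stale": true} into every existing tag object.
    SELECT COALESCE(jsonb_agg(tag || '{"stale": true}'::jsonb), '[]'::jsonb)
    FROM jsonb_array_elements(image.tags) AS tag
)
WHERE tags IS NOT NULL;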

@zackkrida
Member

Just wanted to mention that I've implemented this postgres trigger history logging pattern in a project before. It was for a Gutenberg-esque CMS which stored page revision history. Postgres triggers are given access to some useful variables:

  • TG_OP - The operation being performed (INSERT, UPDATE, or DELETE)
  • OLD - The row before the operation is applied
  • NEW - The updated row (NULL for DELETE)

My project had a generic history table shared between different "post types" so I had columns for the "old" record and the "new" record that could be used to produce a diff. Something like this:

CREATE OR REPLACE FUNCTION log_history() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'UPDATE' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (OLD.id, row_to_json(OLD), row_to_json(NEW), 'UPDATE', current_timestamp);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (OLD.id, row_to_json(OLD), NULL, 'DELETE', current_timestamp);
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO history (id, old, new, operation_type, change_timestamp)
        VALUES (NEW.id, NULL, row_to_json(NEW), 'INSERT', current_timestamp);
    END IF;
    -- NEW is NULL for DELETE; COALESCE keeps a BEFORE DELETE trigger from
    -- silently cancelling the row removal.
    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

It worked well! The only thing we'll need to test is performance with large batch updates.
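
For completeness, wiring the function up is one statement per table; the trigger name and the choice of an AFTER trigger here are just an example (with AFTER, the function's return value is ignored):

CREATE TRIGGER image_log_history
    AFTER INSERT OR UPDATE OR DELETE ON image
    FOR EACH ROW EXECUTE FUNCTION log_history();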


Agreed all around that this would be a standalone "data preservation/archiving" project, though.

@sarayourfriend
Contributor

It would be very interesting to see if we can come up with any use case for storing this information. Honestly, I have some reservation about storing historical user-generated data from third party sites -- for example, I wonder if a Flickr user who changed the title/tags/etc on an image would be surprised that Openverse is retaining the old data, and if that could be cause for concern. If we scope a project for this I'd like to see that addressed in the proposal.

This occurred to me as well. There's a certain extent to which we should describe works the way the upstream provider describes them. Retaining changed information for search sounds bad, especially if extrapolated to a GLAM institution where descriptions are much more intentional and changes might be reflective of changing institutional policies that we would not want to misrepresent.

Circling back to the decode/deduplicate tags issue, do you think it should block on this? [...] You could argue that tags that have gotten into an invalid state, can be dropped and replaced with current data from the provider.

The issue with this, to me, is that there is no reliable way to detect tags in an invalid state. At the very least, there's no way to determine whether something is in that state because of a mistake we made with the data in the past, or whether it's something intentional from the upstream provider.

Even when scoped to just the ones we "suspect" to have the issue, I don't think there's a way to reliably compare the new set of tags from the provider against our own to determine which ones we should drop vs. keep. Dropping all the old tags (from search) and keeping just the new ones is the only 100% reliable way to make sure we match the upstream.

Your point about it being a big blocker, though, is important. We don't need to block on this so long as we store the old tags in such a way that we could backfill any future, broader history-tracking solution we decide on. For example, storing the old tags in S3 with minimal metadata stored in the key (identifier + date?) is a fine temporary solution that would unblock this specific issue whilst preserving the ability to backfill whatever history-tracking method we try. If anything, choosing the most generic/abstract version of this (e.g., one that is explicitly unoptimised for anything other than storage) is probably the best first choice, given we don't have a precise use case for this data now, and it unbinds us from needing to turn this into a big project. So long as whatever we choose is minimally viable (date, identifier, explanatory note/event marker, plus the JSON blob, I think), we can defer an actual design for this kind of thing until later.

Would that be alright? If we did that, we'd implement the changes to the DAG you suggested, Staci, in addition to loading the "old" row as JSON into S3 before replacing it altogether in the catalog DB. Would we create the table based on the query, as well as upload the "old" rows, in create_preingestion_tasks?

The only thing that would be an issue is that the upsert step of our provider DAGs currently merges tags rather than overwriting them

This seems like the biggest blocker to me; it isn't deferrable like tracking the old data, right? Can we introduce a new strategy that does what you've suggested? It should be possible to filter the current provider's tags out of old with something like this (passing the provider through the jsonpath vars argument), I should think:

jsonb_path_query_array(
    old.tags,
    '$[*] ? (@.provider != $p)',
    jsonb_build_object('p', old.provider)
)
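
A quick standalone check of that expression, with made-up sample tags:

-- Keep only the tags whose provider differs from the given one.
SELECT jsonb_path_query_array(
    '[{"name": "cat", "provider": "flickr"},
      {"name": "feline", "provider": "clarifai"}]'::jsonb,
    '$[*] ? (@.provider != $p)',
    '{"p": "flickr"}'::jsonb
);
-- => [{"name": "feline", "provider": "clarifai"}]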

@AetherUnbound
Contributor

These are all really good points, and I think asking what we hope to get from this data is always a good place to start before we invest time into building out a way to store/track/record it.

I'm on board with taking a minimally viable, storage-optimized approach to unblock this specific issue. That's similar to the plan we've had with #4610 and the data normalization efforts thus far. As long as we're preserving the old values somewhere, I think that's best here!

Projects
Status: 📅 To Do

5 participants