
Migrate previously archived events to object storage #420

Closed
mimurawil wants to merge 4 commits into WikipediaLibrary:master from mimurawil:T386477

Conversation

@mimurawil
Contributor

Description

This PR is a continuation of #418 (parent task T380735).

This patch migrates the schema of old link event archives to the current production model, AND uploads the archives to the Swift object store.

For performance, the migration works on the JSON archive files directly rather than loading them into the database. Reviewing all 14 migrations applied to the LinkEvent table shows:

  • Only two new fields were added, in 0013_add_linkevent_url_linkevent_content_type_and_more.py: content_type and object_id
  • These two fields are computed in 0014_migrate_url_pattern_relationships.py for existing rows in the table; this patch uses similar logic to fill them for old archives
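To illustrate the idea, here is a minimal sketch (not the PR's actual code) of filling the two fields on a single serialized record. The names fill_new_fields and pattern_index are hypothetical, and the mapping from URL to a (content_type_id, object_id) pair stands in for whatever lookup migration 0014 performs against the URL-pattern table.

```python
def fill_new_fields(record, pattern_index):
    """Fill content_type/object_id on one serialized LinkEvent record.

    `pattern_index` is an assumed stand-in: it maps a URL string to the
    (content_type_id, object_id) pair of the matching URL pattern.
    """
    fields = record["fields"]
    urls = fields.get("url") or []
    # Only fill when the record matches exactly one URL, mirroring the
    # "url contains only one value" condition described in the testing notes.
    if len(urls) == 1 and urls[0] in pattern_index:
        content_type_id, object_id = pattern_index[urls[0]]
        fields.setdefault("content_type", content_type_id)
        fields.setdefault("object_id", object_id)
    return record
```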

The new script is executed like below:

python manage.py linkevents_archive_fix_schema (update|upload) (filepaths) [--output] [--skip-validation] [--skip-upload]
  • update | upload: the desired action. update rewrites the schema where possible AND uploads the archive to Swift unless --skip-upload is given; upload only uploads the archive to Swift
  • filepaths: a list of files to process
  • --output: optional directory for the migrated archives; if omitted, the script overwrites each archive in place
  • --skip-validation: before writing a migrated archive, the script validates it against the model to make sure it is in the correct shape; this option skips that check (update action only)
  • --skip-upload: after writing a migrated archive, the script uploads it to Swift; this option skips that step (update action only)
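The interface above can be sketched in plain argparse (the real command is a Django management command, so the actual declarations would live in its add_arguments method; option names are taken from the list above, help strings are assumptions):

```python
import argparse

# Sketch of the command-line interface described above, expressed with
# plain argparse rather than a Django management command.
parser = argparse.ArgumentParser(prog="linkevents_archive_fix_schema")
parser.add_argument("action", choices=["update", "upload"],
                    help="update rewrites the schema; upload only sends to Swift")
parser.add_argument("filepaths", nargs="+", help="archive files to process")
parser.add_argument("--output",
                    help="directory for migrated archives (default: overwrite in place)")
parser.add_argument("--skip-validation", action="store_true")
parser.add_argument("--skip-upload", action="store_true")

args = parser.parse_args(
    ["update", "a.json.gz", "b.json.gz", "--output=test_archive"]
)
```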

example:

python manage.py linkevents_archive_fix_schema update backup/links_linkevent_201907.0.json.gz backup/links_linkevent_201907.1.json.gz --output=test_archive

(This makes it possible to use a shell tool like xargs to run the script on all files in a directory)
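For example, the batching pattern could look like the following sketch. To keep it runnable outside the repo, `echo` stands in for the real `python manage.py ...` invocation, and the files are created empty on the spot:

```shell
# Create illustrative archive files (stand-ins for real backups).
mkdir -p backup
touch backup/links_linkevent_201907.0.json.gz \
      backup/links_linkevent_201907.1.json.gz

# NUL-delimited find + xargs handles any filename safely; drop `echo`
# to actually run the management command.
find backup -name 'links_linkevent_*.json.gz' -print0 \
  | xargs -0 echo python manage.py linkevents_archive_fix_schema update
```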

Rationale

Currently, when we serialize and compress events for archiving, json.gz files are dumped to the local filesystem and then manually rotated to external storage that is not publicly available. We got the go-ahead to share these archives, so we should save them directly to a publicly readable object store.
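The read-modify-write cycle on a .json.gz archive is simple enough to sketch with the standard library (rewrite_archive is a hypothetical name; the assumption that an archive is a single JSON array of records matches Django's serialized-fixture format):

```python
import gzip
import json

def rewrite_archive(path, transform, out_path=None):
    """Load a .json.gz archive, apply `transform` to each record, write it back.

    If `out_path` is omitted, the archive is overwritten in place, matching
    the behaviour described for the --output option above.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        records = json.load(fh)  # assumed: one JSON array of records
    records = [transform(r) for r in records]
    with gzip.open(out_path or path, "wt", encoding="utf-8") as fh:
        json.dump(records, fh)
```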

Phabricator Ticket

https://phabricator.wikimedia.org/T386477

How Has This Been Tested?

  • Similar to Store newly archived events in object storage #418, unit tests were added to verify functionality, though python-swiftclient is mocked
  • I manually tested by following the steps below:
    • Download some old archives (if you don't have any in your backup folder)
    • Run the command on those files (like the example command above), adding the --output=temp_archive option so you keep a copy of the old archive
    • Compare both files (pick a few samples) and confirm the new archive has content_type and object_id; both should have values as long as the url attribute contains only one value
    • Run the command again, this time omitting the --output option. This overwrites the old archive
    • Compare the previously generated file with the archive; they should be identical
    • Repeat this test with a new archive

Screenshots of your changes (if appropriate):

n/a

Types of changes

What types of changes does your code introduce? Add an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

@mimurawil mimurawil changed the title T386477 Migrate previously archived events to object storage Mar 17, 2025
- This also updates old archives schema before uploading to Swift
- Adding unit tests as well

Bug: T386477
- Add instructions for local setup
@jsnshrmn (Member) left a comment
Thanks for your patience on this. I did local testing, and the schema changes and Swift upload look good.

e.g.

$ diff <(cat links_linkevent_201907.0.json | jq .[-1]) <(cat links_linkevent_201907.0.munged.json | jq .[-1])
20c20,22
<     ]
---
>     ],
>     "content_type": 7,
>     "object_id": 2

Based on our conversation in the phab task, I thought we would move to a filename scheme that included the day? Right now it's year.month.counter (based on chunk size), whereas I was expecting year.month.day.

@jsnshrmn
Member

Oh yeah, the conversation was in this ticket:
https://phabricator.wikimedia.org/T387887

I don't think I was communicating very well here. I think it's fine to leave the grouping as it is, as long as it is consistent across all of the files we provide via object storage, e.g. if everything is year.month.x then that's okay; we just don't want a mix of year.month.x and year.month.day.x.

@mimurawil
Contributor Author

Thanks for all the feedback @jsnshrmn and @katydidnot.

Considering I'm already running the scripts to "load" + "migrate" + "dump" in https://phabricator.wikimedia.org/T387887, I'll be sending you the updated old archives in that ticket, so it seems the script in this PR is no longer necessary, as we won't need to run it on any file. I believe I can reuse the "upload" function and incorporate it into the existing linkevents_archive script in case we want to re-upload an archive to Swift.

Does that sound reasonable? I can close this PR and quickly open a new one changing just the linkevents_archive script in that case.

@katydidnot
Contributor

> Thanks for all the feedback @jsnshrmn and @katydidnot.
>
> Considering I'm already running the scripts to "load" + "migrate" + "dump" in https://phabricator.wikimedia.org/T387887, I'll be sending you the updated old archives in that ticket, so it seems the script in this PR is no longer necessary, as we won't need to run it on any file. I believe I can reuse the "upload" function and incorporate it into the existing linkevents_archive script in case we want to re-upload an archive to Swift.
>
> Does that sound reasonable? I can close this PR and quickly open a new one changing just the linkevents_archive script in that case.

Sounds reasonable to me. Thanks for all your work!

@jsnshrmn
Member

jsnshrmn commented Apr 2, 2025

@mimurawil shall we close out this pr?

@mimurawil mimurawil closed this Apr 2, 2025
@mimurawil mimurawil deleted the T386477 branch April 2, 2025 17:53