
Migrate previously archived events to object storage #420

Closed
mimurawil wants to merge 4 commits into WikipediaLibrary:master from mimurawil:T386477

Conversation

@mimurawil
Contributor

Description

This PR is a continuation of #418 (parent task T380735).

This patch migrates the schema of old link event archives to the current production model, AND uploads the archives to the Swift object store.

For performance, the migration works on the JSON archive files directly rather than loading them into the database. Reviewing all 14 migrations applied to the LinkEvent table shows:

  • Only two new fields were added, in 0013_add_linkevent_url_linkevent_content_type_and_more.py: content_type and object_id
  • These two fields are computed in 0014_migrate_url_pattern_relationships.py for existing rows in the table; this patch uses similar logic to fill them for old archives
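To illustrate the idea, here is a minimal sketch (not the PR's actual code) of filling the two fields on a single serialized record. The names fill_new_fields and pattern_index are hypothetical, and the mapping from URL to a (content_type_id, object_id) pair stands in for whatever lookup migration 0014 performs against the URL-pattern table.

```python
def fill_new_fields(record, pattern_index):
    """Fill content_type/object_id on one serialized LinkEvent record.

    `pattern_index` is an assumed stand-in: it maps a URL string to the
    (content_type_id, object_id) pair of the matching URL pattern.
    """
    fields = record["fields"]
    urls = fields.get("url") or []
    # Only fill when the record matches exactly one URL, mirroring the
    # "url contains only one value" condition described in the testing notes.
    if len(urls) == 1 and urls[0] in pattern_index:
        content_type_id, object_id = pattern_index[urls[0]]
        fields.setdefault("content_type", content_type_id)
        fields.setdefault("object_id", object_id)
    return record
```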

The new script is executed like below:

python manage.py linkevents_archive_fix_schema (update|upload) (filepaths) [--output] [--skip-validation] [--skip-upload]
  • update | upload: the desired action. update rewrites the schema where possible AND uploads the archive to Swift unless --skip-upload is given; upload only uploads the archive to Swift
  • filepaths: a list of files to process
  • --output: optional directory for the migrated archives; if omitted, the script overwrites each archive in place
  • --skip-validation: before writing a migrated archive, the script validates it against the model to make sure it is in the correct shape; this option skips that check (update action only)
  • --skip-upload: after writing a migrated archive, the script uploads it to Swift; this option skips that step (update action only)
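The interface above can be sketched in plain argparse (the real command is a Django management command, so the actual declarations would live in its add_arguments method; option names are taken from the list above, help strings are assumptions):

```python
import argparse

# Sketch of the command-line interface described above, expressed with
# plain argparse rather than a Django management command.
parser = argparse.ArgumentParser(prog="linkevents_archive_fix_schema")
parser.add_argument("action", choices=["update", "upload"],
                    help="update rewrites the schema; upload only sends to Swift")
parser.add_argument("filepaths", nargs="+", help="archive files to process")
parser.add_argument("--output",
                    help="directory for migrated archives (default: overwrite in place)")
parser.add_argument("--skip-validation", action="store_true")
parser.add_argument("--skip-upload", action="store_true")

args = parser.parse_args(
    ["update", "a.json.gz", "b.json.gz", "--output=test_archive"]
)
```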

example:

python manage.py linkevents_archive_fix_schema update backup/links_linkevent_201907.0.json.gz backup/links_linkevent_201907.1.json.gz --output=test_archive

(This makes it possible to use a shell tool like xargs to run the script on all files in a directory)
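For example, the batching pattern could look like the following sketch. To keep it runnable outside the repo, `echo` stands in for the real `python manage.py ...` invocation, and the files are created empty on the spot:

```shell
# Create illustrative archive files (stand-ins for real backups).
mkdir -p backup
touch backup/links_linkevent_201907.0.json.gz \
      backup/links_linkevent_201907.1.json.gz

# NUL-delimited find + xargs handles any filename safely; drop `echo`
# to actually run the management command.
find backup -name 'links_linkevent_*.json.gz' -print0 \
  | xargs -0 echo python manage.py linkevents_archive_fix_schema update
```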

Rationale

Currently, when we serialize and compress events for archiving, json.gz files are dumped to the local filesystem and then manually rotated to external storage that is not publicly available. We got the go-ahead to share these archives, so we should save them directly to a publicly readable object store.
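The read-modify-write cycle on a .json.gz archive is simple enough to sketch with the standard library (rewrite_archive is a hypothetical name; the assumption that an archive is a single JSON array of records matches Django's serialized-fixture format):

```python
import gzip
import json

def rewrite_archive(path, transform, out_path=None):
    """Load a .json.gz archive, apply `transform` to each record, write it back.

    If `out_path` is omitted, the archive is overwritten in place, matching
    the behaviour described for the --output option above.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        records = json.load(fh)  # assumed: one JSON array of records
    records = [transform(r) for r in records]
    with gzip.open(out_path or path, "wt", encoding="utf-8") as fh:
        json.dump(records, fh)
```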

Phabricator Ticket

https://phabricator.wikimedia.org/T386477

How Has This Been Tested?

  • Similar to Store newly archived events in object storage #418, unit tests were added to verify functionality, though python-swiftclient is mocked
  • I manually tested by following the steps below:
    • Download some old archives (if you don't have any in your backup folder)
    • Run the command on those files (like the example command above), adding the --output=temp_archive option so you keep a copy of the old archive
    • Compare both files (pick a few samples) and confirm the new archive has content_type and object_id; both should have values as long as the url attribute contains only one value
    • Run the command again, this time omitting the --output option. This overwrites the old archive
    • Compare the previously generated file with the archive; they should be identical
    • Repeat this test with a new archive

Screenshots of your changes (if appropriate):

n/a

Types of changes

What types of changes does your code introduce? Add an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

@mimurawil mimurawil changed the title T386477 Migrate previously archived events to object storage Mar 17, 2025
- This also updates old archives schema before uploading to Swift
- Adding unit tests as well

Bug: T386477
- Add instructions for local setup
@jsnshrmn (Member) left a comment
Thanks for your patience on this. I did local testing, and the schema changes and Swift upload look good.

e.g.

$ diff <(cat links_linkevent_201907.0.json | jq .[-1]) <(cat links_linkevent_201907.0.munged.json | jq .[-1])
20c20,22
<     ]
---
>     ],
>     "content_type": 7,
>     "object_id": 2

Based on our conversation in the phab task, I thought we would move to a filename scheme that included the day? Right now it's year.month.counter (based on chunk size), whereas I was expecting year.month.day.

@jsnshrmn
Member

Oh yeah, the conversation was in this ticket:
https://phabricator.wikimedia.org/T387887

I don't think I was communicating very well here. I think it's fine to leave the grouping as it is, as long as it is consistent across all of the files we provide via object storage, e.g. if everything is year.month.x then that's okay; we just don't want a mix of year.month.x and year.month.day.x.

@mimurawil
Contributor Author

Thanks for all the feedback @jsnshrmn and @katydidnot.

Considering I'm already running the scripts to "load" + "migrate" + "dump" in https://phabricator.wikimedia.org/T387887, I'll be sending you the updated old archives in that ticket, so it seems the script in this PR is no longer necessary, as we won't need to run it on any file. I believe I can reuse the "upload" function and incorporate it into the existing linkevents_archive script in case we want to re-upload an archive to Swift.

Does that sound reasonable? I can close this PR and quickly open a new one changing just the linkevents_archive script in that case.

@katydidnot
Contributor

> Thanks for all the feedback @jsnshrmn and @katydidnot.
>
> Considering I'm already running the scripts to "load" + "migrate" + "dump" in https://phabricator.wikimedia.org/T387887, I'll be sending you the updated old archives in that ticket, so it seems the script in this PR is no longer necessary, as we won't need to run it on any file. I believe I can reuse the "upload" function and incorporate it into the existing linkevents_archive script in case we want to re-upload an archive to Swift.
>
> Does that sound reasonable? I can close this PR and quickly open a new one changing just the linkevents_archive script in that case.

Sounds reasonable to me. Thanks for all your work!

@jsnshrmn
Member

jsnshrmn commented Apr 2, 2025

@mimurawil shall we close out this pr?

@mimurawil mimurawil closed this Apr 2, 2025
@mimurawil mimurawil deleted the T386477 branch April 2, 2025 17:53