Migrate previously archived events to object storage #420
mimurawil wants to merge 4 commits into WikipediaLibrary:master
Conversation
- This also updates old archives schema before uploading to Swift
- Adding unit tests as well
Bug: T386477
- Add instructions for local setup
Thanks for your patience on this. I did local testing, and the schema changes and Swift upload look good.
eg.
$ diff <(cat links_linkevent_201907.0.json | jq .[-1]) <(cat links_linkevent_201907.0.munged.json | jq .[-1])
20c20,22
< ]
---
> ],
> "content_type": 7,
> "object_id": 2
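The same spot-check can be scripted. A minimal Python sketch, assuming (as the diff output suggests) that each archive is a JSON array of Django-serialized records and that the new keys land in each record's "fields" dict:

```python
import json

def added_fields(original_path, munged_path):
    """Compare the last record of two archive files and return the keys
    (inside "fields") that appear only in the munged version."""
    with open(original_path) as f:
        before = json.load(f)[-1]["fields"]
    with open(munged_path) as f:
        after = json.load(f)[-1]["fields"]
    return {key: after[key] for key in after.keys() - before.keys()}
```

Run against the pair of files in the diff above, this should report only the two new keys, content_type and object_id.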
Based on our conversation in the phab task, I thought we would move to a filename scheme that included the day? right now it's year.month.counter (based on chunk size), whereas I was expecting year.month.day.
|
Oh yeah, the conversation was in this ticket: I don't think I was communicating very well here. I think it's fine to leave the grouping as it is, as long as it is consistent across all of the files we provide via object storage. eg. if everything is year.month.x then that's okay, we just don't want a mix of year.month.x and year.month.day.x
Thanks for all the feedback @jsnshrmn and @katydidnot. Considering I'm already running the scripts to "load" + "migrate" + "dump" in https://phabricator.wikimedia.org/T387887, I'll be sending you the updated old archives in that ticket, so it seems the script in this PR is no longer necessary, as we won't need to execute it for any file. Does that sound reasonable? I can close this PR and quickly open a new one changing just the
Sounds reasonable to me. Thanks for all your work!
@mimurawil shall we close out this PR?
Description
This PR is a continuation of #418 (parent task T380735).
This patch aims to migrate the schema of the old link event archives to the current model we have in production, AND upload the archives to the Swift store.
For performance, the migration handles the JSON archive files directly. Investigating all migrations applied to the LinkEvent table (14 in total):
- 0013_add_linkevent_url_linkevent_content_type_and_more.py added the content_type and object_id fields
- 0014_migrate_url_pattern_relationships.py filled those fields for existing rows in the table. This patch uses similar logic to fill these fields for the old archives.

The new script is executed as follows:
- update will update the schema if possible AND upload the archive to Swift unless --skip-upload is given.
- upload will only upload the archive to Swift
- --output: optional directory to output the migrated archive; if omitted, the script will overwrite the same archive
- --skip-validation: before writing the migrated archive, the script performs a model validation to make sure the archive is in correct shape; this option skips that check (update action only)
- --skip-upload: after writing the migrated archive, the script uploads the file to Swift; this option skips that step (update action only)

example:
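Based only on the options listed above, the script's command-line surface could be sketched with argparse as follows. This is a hypothetical sketch, not the PR's actual implementation: the real management-command wiring, program name, and the positional files argument are assumptions.

```python
import argparse

def build_parser():
    """Sketch of a CLI matching the described options (names assumed)."""
    parser = argparse.ArgumentParser(
        description="Migrate archived link events and upload them to Swift (sketch)."
    )
    parser.add_argument(
        "action", choices=["update", "upload"],
        help="update: migrate the schema (and upload unless --skip-upload); "
             "upload: only upload the archive to Swift",
    )
    parser.add_argument("files", nargs="+", help="archive files to process")
    parser.add_argument(
        "--output",
        help="directory for migrated archives; omit to overwrite in place",
    )
    parser.add_argument(
        "--skip-validation", action="store_true",
        help="skip model validation before writing (update only)",
    )
    parser.add_argument(
        "--skip-upload", action="store_true",
        help="skip the Swift upload step (update only)",
    )
    return parser
```

A shell glob such as *.json expands to multiple paths, which a nargs="+" positional accepts in a single invocation.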
(This way, it is possible to use a shell command with nargs to execute this script on all files in a directory.)

Rationale
Currently, when we serialize and compress events for archiving, json.gz files are dumped to the local filesystem and then manually rotated to external storage that is not publicly available. We got the go-ahead to share these archives, so we should save them directly to a publicly-readable object store.
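Uploading a json.gz archive to a publicly readable Swift container could look roughly like this. A hypothetical sketch using python-swiftclient: the auth environment variables and container name are made up, and the import is deferred so the function can be defined without the library installed.

```python
def upload_archive(path, container="archives"):
    """Sketch: put a json.gz archive into a publicly readable Swift container.

    Assumptions (not from the PR): credentials come from SWIFT_* environment
    variables and the container name is "archives".
    """
    import os
    from swiftclient.client import Connection  # pip install python-swiftclient

    conn = Connection(
        authurl=os.environ["SWIFT_AUTH_URL"],
        user=os.environ["SWIFT_USER"],
        key=os.environ["SWIFT_KEY"],
    )
    # The ".r:*" read ACL grants anonymous read access to the container.
    conn.put_container(container, headers={"X-Container-Read": ".r:*"})
    with open(path, "rb") as f:
        conn.put_object(
            container,
            os.path.basename(path),
            contents=f,
            content_type="application/gzip",
        )
```

The X-Container-Read ACL is what makes the archives publicly available, in contrast to the current manually rotated private storage.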
Phabricator Ticket
https://phabricator.wikimedia.org/T386477
How Has This Been Tested?
- (… backup folder)
- ran the script with the --output=temp_archive option so you have a copy of the old archive
- checked content_type and object_id; they both should have values as long as the attribute url contains only one value
- ran the script without the --output option. This will overwrite the old archive

Screenshots of your changes (if appropriate):
n/a
Types of changes
What types of changes does your code introduce? Add an x in all the boxes that apply: