-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add distributed reindex steps #4572
Conversation
56b026b
to
1f02039
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running this DAG locally. The DAG created the index, but it was empty. After some investigation I saw that the id
in my temp_import_audio
start with 20001, not 1. Could that be the reason.
Another question: indexer logs say that the data was copied to ES from audio
table. Shouldn't we copy from the temporary table?
I also tried and noticed this as well - I think it was because the index was not refreshed after the documents were added. I hit "Refresh index" in the Elasticvue UI and the 5000 documents showed up! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a huge bolus of code, great work on getting this off the ground Staci! I've got a number of comments here, the most blocking one being what @obulat pointed out - we should probably be referencing the temp table on the API rather than the base one for creating the indices. Other than that, I have a few code structure comments, but nothing else major!
return f"{media_type}-{uuid.uuid4().hex}" | ||
|
||
|
||
@task_group(group_id="create_temp_index") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay for being able to do this with decorators!
One other thought here - given our recent strategy around PR review requirements, it would be great to have some new tests for all of the relevant pieces here alongside this PR. I know much of this is using operators, but for any of the pieces that you think make the most sense 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had some time to review this, so I've left some requested changes. Namely, the blockers for me are:
- Use launch template version number to reference what is retrieved in the
get_launch_template_version
task, so that it is disambiguated from the launch template version resource: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2/client/describe_launch_template_versions.html - Move retrieval of the launch template version number out of the mapped task group to avoid the race condition possible if each task group retrieves the version number independently of the others
- Add a default value to the
next
call to avoid an unhandled StopIteration, or, if there are always values, then replace theget
calls with indexer syntax and add a comment explaining why it is safe to proceed without default values.
Otherwise, I agree with Madison's comments, but the rest of mine are not blockers, just suggestions or extra context/information to help you make choices if you want.
ad209e3
to
cb31813
Compare
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @AetherUnbound Excluding weekend1 days, this PR was ready for review 20 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2. @stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
162fc49
to
2e0be2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic, we're getting closer to having this whole thing complete!! 🤩 Just a few small things but nothing blocking, local testing worked as expected 🚀
|
||
|
||
class TempConnectionHTTPOperator(HttpOperator): | ||
def setup_ec2_hook(func: callable) -> callable: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I've only re-reviewed this from a process perspective rather than testing it locally (didn't want to duplicate what Madison has done) and it makes good sense! Nice work, really excited to see this come together 😀
I managed to run the audio refresh locally.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something is wrong with my setup: audio refresh ran fine, but running image refresh locally didn't work as expected. I don't want to block on it since you and Madison have successfully run the process locally.
I went through the code in Pycharm and marked all places where it complained of unused function parameters. Are they going to be used in the future, or can they be removed?
@obulat is there any chance you already had a |
I've retested this several times and cannot reproduce @obulat's errors; given that multiple other people are also unable to reproduce, and that I'm fairly sure the issue can be explained by the temp table already existing (which will be fixed as the remaining steps are added to the DAG), I'm going to go ahead and merge. This will be tested many more times before it goes live :) |
Fixes
Fixes #4148 by @stacimc
Description
Adds steps to the new data refresh DAGs to create the temporary ES index (which will later be promoted), and populating it via the distributed reindex:
Testing Instructions
./ov just api/init && ./ov just up
and then run a few DAGs in order to ingest some new data (I used Jamendo for audio and Cleveland for image, and manually marked them as success as soon as some records were recorded). Then run the new data refresh DAGs (egstaging_audio_data_refresh
). They should run successfully, but skip the tasks to create/terminate EC2 instances. Go to elasticvue and you should see a new index matching the output of thegenerate_index_name
task.It should have no aliases. If you refresh the index in Elasticvue, you'll also see that it has the expected number of records (so for example, if you started from sample data the main audio index will have 5000 records. If you ingested 100 records from Jamendo before running the data refresh, your new temp index will have 5100 records.)
Run
./ov just api/pgcli
andselect count(*) from audio
(or image, depending on which data refresh you ran). Then verify the count is the same as the number of docs in the new index in Elasticvue.Also note that you'll see somewhat unexpected behavior if you run the DAGs multiple times in a row right now, because we don't yet delete the temp table. That's expected! If you need to run it a second time, just drop the temp import tables manually in the API DB first.
It's not simple to test this against the staging environment when running locally because of the requirement that we skip these tasks locally, but I did so by configuring the
AIRFLOW_CONN_AWS_DEFAULT
locally and commenting out the logic to skip. This should technically be safe even in production because no data is destroyed or promoted, but if you do want to test this you should ONLY run the staging DAGs to be safe, and we should clean up the dangling tables/indices afterward. There are additional issues in the milestone for testing this formally in those environments so I do not think it is a requirement for testers here.Checklist
Update index.md
).main
) or a parent feature branch../ov just catalog/generate-docs
for catalogPRs) or the media properties generator (
./ov just catalog/generate-docs media-props
for the catalog or
./ov just api/generate-docs
for the API) where applicable.Developer Certificate of Origin
Developer Certificate of Origin