feat(corpus_importer): Add Celery to recap import #4116
Conversation
Move importer to task and add celery
Looks about right to me, but for a more careful review, @ERosendo, can you please give it a look too, when you have a moment?
@mlissner Sure!
Looks good overall!
I've added some comments to the code to improve the efficiency of data transfer to Redis and minimize database retrieval.
Here are some additional comments to further enhance the code.
---
It seems there's no mechanism to resume progress if an error disrupts the command execution. I understand the intention is for this command to run only once, but what happens if an unexpected failure occurs midway through the process?
In that scenario, without a way to resume, we would have to manually re-run all the individual queries from the very beginning. That could be time-consuming and tedious, especially since the command involves a large number of queries and some complex operations.
To address this, we could periodically save the command's progress at predefined checkpoints, so execution can resume from the most recent checkpoint after a failure.
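A minimal sketch of what that checkpointing could look like, assuming the command iterates court by court and that Redis is reachable through Django's cache framework (the cache key, function names, and court-level granularity are hypothetical, not the command's actual design):

```python
# Hypothetical checkpointing sketch: persist the last completed court so a
# re-run can skip straight past it. The cache key and the per-court
# granularity are assumptions, not the command's actual implementation.
from django.core.cache import cache

CHECKPOINT_KEY = "recap_import:last_processed_court"


def process_courts(courts, handle_court):
    """Run handle_court for each court, resuming after the last checkpoint."""
    last_done = cache.get(CHECKPOINT_KEY)
    skipping = last_done is not None
    for court in courts:
        if skipping:
            if court.pk == last_done:
                skipping = False  # resume with the next court
            continue
        handle_court(court)
        # Record progress after each court so a crash repeats at most one court.
        cache.set(CHECKPOINT_KEY, court.pk, timeout=None)
```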
---
There's also no easy way to track which documents failed during the extraction process. We can use Sentry reports to figure out what's causing the problems and fix it, but pinpointing which specific documents failed is a pain. If we kept track of failed extractions, it would be much easier to retry just those records instead of re-running the whole thing, which would save a lot of time on big batches.
On a similar note, the current command tries to extract data from every document, even if we only care about a few. It would be great if we could tell the command which documents to focus on by passing their IDs; that would make things much faster, especially when re-running failed extractions.
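A rough sketch of how both ideas could fit together, i.e. an option that accepts specific IDs plus a record of failures for a targeted retry (the option name, cache key, and the extraction stub are all hypothetical placeholders):

```python
# Hypothetical sketch of ID filtering plus failure tracking. The option name,
# cache key, and extract_document stub are placeholders, not the PR's code.
from django.core.cache import cache
from django.core.management.base import BaseCommand

FAILED_IDS_KEY = "recap_import:failed_document_ids"


def extract_document(pk: int) -> None:
    """Placeholder for the real per-document extraction logic."""


class Command(BaseCommand):
    help = "Extract documents, optionally limited to a list of IDs."

    def add_arguments(self, parser):
        parser.add_argument(
            "--document-ids",
            nargs="+",
            type=int,
            default=None,
            help="Only process these document IDs (e.g. previously failed ones).",
        )

    def handle(self, *args, **options):
        ids = options["document_ids"] or []
        failed = []
        for pk in ids:
            try:
                extract_document(pk)
            except Exception:
                failed.append(pk)
        # Keep the failed IDs around so the next run can target just them.
        cache.set(FAILED_IDS_KEY, failed, timeout=None)
```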
---
I believe we can use the replica database for all data extraction queries within this command. This approach offers several key advantages (a sketch follows this list):
- By offloading the extraction workload to the replica, we completely shield the production database from any performance impact. Long-running queries associated with data extraction won't slow down production operations.
- The replica database also provides a good environment for creating temporary indexes specifically designed to speed up these extraction queries. We used this strategy during the `make_aws_manifest_files` command execution and it worked well. By strategically leveraging temporary indexes on the replica, we can significantly speed up the data extraction process.
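To make that concrete, here is a rough sketch of what the two ideas could look like together. The `replica` database alias, import paths, table name, and index definition are all assumptions about the schema, not verified DDL:

```python
# Hypothetical sketch: route the extraction query to the replica and build a
# temporary index there first. The alias, import paths, table and index names
# are assumed; the real schema may differ.
from django.db import connections

from cl.search.models import SOURCES, OpinionCluster

TEMP_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS temp_cluster_docket_datefiled_idx
ON search_opinioncluster (docket_id, date_filed DESC);
"""


def create_temp_index() -> None:
    # Created on the replica only; drop it once the command finishes.
    with connections["replica"].cursor() as cursor:
        cursor.execute(TEMP_INDEX_SQL)


def latest_non_recap_cluster(court):
    # .using("replica") keeps this long-running read off the production DB.
    return (
        OpinionCluster.objects.using("replica")
        .filter(docket__court=court)
        .exclude(source=SOURCES.RECAP)
        .order_by("-date_filed")
        .first()
    )
```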
I've analyzed some of the database queries used within this command and identified opportunities to significantly improve their performance through custom indexes. Here's a specific example:
The following query retrieves the most recent non-RECAP cluster for each federal district court. Based on the development database, that translates to roughly 123 queries (we have 123 courts that meet the criteria in that db). Each query currently takes around 100 seconds in the development database, so the total execution time exceeds 3 hours for this step alone. To validate these findings, I asked Ramiro to run a single court's query against the production database and observed a similar execution time.
```python
cluster = (
    OpinionCluster.objects.filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
    .first()
)
```
After running `EXPLAIN ANALYZE` on the query, I identified that PostgreSQL is traversing one of the available indexes in a backwards direction, which appears to be the primary cause of the slow execution. Switching to an index that allows forward scans would improve query execution time.
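For reference, the same plan can be reproduced from a Django shell with `QuerySet.explain()`, which on PostgreSQL accepts the `ANALYZE` option. This assumes the same imports as the query above and a `court` instance already in scope:

```python
# Reproducing the plan from a Django shell; on PostgreSQL, explain() passes
# analyze=True through and prints the same output as a raw EXPLAIN ANALYZE.
qs = (
    OpinionCluster.objects.filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
)
print(qs.explain(analyze=True))
```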
I apologize for posting the entire review in a single comment; GitHub only allows inline comments on areas with code changes.
Let me know what you think.
Co-authored-by: Eduardo Rosendo <eduardojra96@gmail.com>
Update ingest recap document to accept pk instead of recap document object.
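Passing a primary key instead of a model instance keeps the Celery message small and avoids serializing ORM objects across the broker. A hedged sketch of the pattern follows; the task name, decorator options, and import paths are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of a pk-based task: only an integer crosses the broker
# and the row is re-fetched inside the worker. Names and paths are assumed.
from cl.celery_init import app
from cl.search.models import RECAPDocument


@app.task(bind=True, max_retries=3, ignore_result=True)
def ingest_recap_document(self, recap_document_pk: int) -> None:
    try:
        rd = RECAPDocument.objects.get(pk=recap_document_pk)
    except RECAPDocument.DoesNotExist:
        # The row may not be committed yet if the task raced the transaction.
        raise self.retry(countdown=5)
    # The actual ingest/extraction work would run here using `rd`.
```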
Thank you @ERosendo for your thorough analysis -
Also - add selected replica db with addt'l index
LGTM
Suspect Issues: This pull request was deployed and Sentry observed the following issues:
Refactor recap into opinion using celery
Some minor tweaks and celerization of the recap into opinions code.
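As a rough illustration of what the dispatch side of that celerization can look like, the command might queue one task per record in bounded batches so the broker isn't flooded. This reuses the pk-based task sketched earlier; the batch size, queue name, and pacing are arbitrary choices, not the PR's actual values:

```python
# Hypothetical dispatch loop: fan document IDs out to the pk-based task in
# small batches. Batch size, queue name, and sleep interval are illustrative.
import time

from celery import group


def queue_documents(document_ids, batch_size=100):
    for start in range(0, len(document_ids), batch_size):
        batch = document_ids[start : start + batch_size]
        group(
            ingest_recap_document.s(pk) for pk in batch
        ).apply_async(queue="recap_import")
        time.sleep(1)  # brief pause between batches so the queue stays bounded
```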