
feat(corpus_importer): Add Celery to recap import #4116

Merged: 9 commits into main from upgrade-importer-recap-opinions, Jun 18, 2024

Conversation

@flooie (Contributor) commented Jun 14, 2024

Refactor the recap-into-opinions import to use Celery

Some minor tweaks and celerization of the recap-into-opinions code.

Move the importer into a task and add Celery.


@flooie flooie requested a review from mlissner June 14, 2024 17:19
@flooie flooie requested a review from albertisfu June 14, 2024 20:36

@mlissner (Member) left a comment:

Looks about right to me, but for a more careful review, @ERosendo, can you please give it a look too, when you have a moment?

@flooie (Contributor, Author) commented Jun 14, 2024

Thanks @mlissner. @ERosendo, I await your computer wisdom.

@ERosendo (Contributor) commented:

@mlissner Sure!

@flooie flooie requested review from ERosendo and removed request for albertisfu June 14, 2024 20:52
@ERosendo (Contributor) left a comment:

Looks good overall!

I've added some comments to the code to improve the efficiency of data transfer to Redis and minimize database retrieval.

Here are some additional comments to further enhance the code.

  1. It seems like there's no mechanism to resume progress if an error disrupts the command execution. I understand the intention is for this command to run only once, but what happens if an unexpected failure occurs midway through the process?

    In such a scenario, without a mechanism to resume progress, we would be forced to manually re-run all the individual queries from the very beginning. This could be a time-consuming and tedious task, especially if the command involves a large number of queries or complex operations.

    To address this potential drawback, we could implement a mechanism to periodically save the application's progress at predefined checkpoints. This would allow us to resume execution from the most recent checkpoint in case of a failure.
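
    As a minimal sketch of such a checkpoint, assuming a plain Redis client (the key name and helper functions are illustrative, not code from this PR):

import redis

# Hypothetical checkpoint helpers; the key name is an example only.
r = redis.Redis(host="localhost", port=6379, db=0)
CHECKPOINT_KEY = "recap_import:last_court_id"

def save_checkpoint(court_id: str) -> None:
    """Record the most recently completed court."""
    r.set(CHECKPOINT_KEY, court_id)

def get_checkpoint() -> str | None:
    """Return the last completed court, or None on a fresh run."""
    value = r.get(CHECKPOINT_KEY)
    return value.decode() if value else None

    On a re-run, the command would skip courts up to and including the stored value before queueing new work.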

  2. There's no way to easily track which documents failed during the extraction process. We can use Sentry reports to figure out what's causing the problems and fix them, but pinpointing which specific documents failed can be a pain. If we had a way to keep tabs on failed extractions, it would be much easier to retry those specific records instead of having to run the whole thing again, saving a lot of time, especially for big batches.

    On a similar note, the current command tries to extract data from every document, even if we only care about a few. Wouldn't it be great if we could just tell the command which documents to focus on by passing their IDs? This would make things way faster, especially when re-running failed extractions.
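
    As a sketch of both ideas, the command could accept an --ids flag and the task could record failures in a Redis set. The flag name, the set name, and the command structure here are hypothetical, not part of this PR; only the ingest_recap_document task name comes from the actual code:

from django.core.management.base import BaseCommand

from cl.corpus_importer.tasks import ingest_recap_document
from cl.search.models import RECAPDocument


class Command(BaseCommand):
    help = "Queue RECAP documents for extraction, optionally by pk."

    def add_arguments(self, parser):
        parser.add_argument(
            "--ids",
            nargs="+",
            type=int,
            default=None,
            help="Only queue these RECAPDocument pks (e.g. failed ones).",
        )

    def handle(self, *args, **options):
        qs = RECAPDocument.objects.values_list("pk", flat=True)
        if options["ids"]:
            qs = qs.filter(pk__in=options["ids"])
        for pk in qs.iterator():
            # The task itself could sadd() the pk to a Redis set such as
            # "recap_import:failed_ids" when extraction raises, so a later
            # run can retry just those documents with --ids.
            ingest_recap_document.delay(pk)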

  3. I believe we can use the replica database for all data extraction queries within this command. This approach offers several key advantages:

    • By offloading the extraction workload to the replica, we completely shield the production database from any potential performance impact. Long-running queries associated with data extraction won't cause any slowdown in production operations.

    • The replica database provides the perfect environment for creating temporary indexes specifically designed to speed up these extraction queries. We used this strategy during the make_aws_manifest_files command execution, and it worked properly. By strategically leveraging temporary indexes within the replica, we can significantly enhance the overall speed of the data extraction process.

    I've analyzed some of the database queries used within this command and identified opportunities to significantly improve their performance through custom indexes. Here's a specific example:

    The following query retrieves the most recent cluster for each federal district court, which, based on the development database, translates to roughly 123 queries (we have 123 courts that meet the criteria in that db). Each query currently takes around 100 seconds to complete in the development database, resulting in a total execution time exceeding 3 hours for this step alone. To validate these findings, I asked Ramiro to run a single-court query in the production database and observed a similar execution time.

cluster = (
    OpinionCluster.objects.filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
    .first()
)

After running EXPLAIN ANALYZE on the query, I identified that PostgreSQL is attempting to traverse one of the available indexes in a backwards direction. This appears to be the primary cause of the query's slow execution. Switching to an index that allows forward scans would improve query execution time.
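
Combining both suggestions, here is a hedged sketch. The "replica" database alias, the index name, the indexed column, and the search_opinioncluster table name are assumptions rather than code from this PR:

from django.db import connections

from cl.search.models import SOURCES, OpinionCluster

# Sketch only: build a temporary index on the replica that matches the
# descending sort, so the planner scans it forward instead of walking an
# existing index backwards. CREATE INDEX CONCURRENTLY must run outside a
# transaction block, which Django's default autocommit mode allows.
with connections["replica"].cursor() as cursor:
    cursor.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS tmp_cluster_date_filed_desc "
        "ON search_opinioncluster (date_filed DESC);"
    )

# The extraction query itself can then be pinned to the replica.
# `court` comes from the surrounding loop over federal district courts.
cluster = (
    OpinionCluster.objects.using("replica")
    .filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
    .first()
)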

I apologize for posting the entire comment at once. GitHub limits comments to areas with code changes.

Let me know what you think.

Review threads on cl/corpus_importer/tasks.py: resolved (one marked outdated).
flooie and others added 4 commits June 18, 2024
Co-authored-by: Eduardo Rosendo <eduardojra96@gmail.com>
Update ingest recap document to accept a pk instead of a recap document object.
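
That last commit follows the common Celery practice of passing a primary key instead of a model instance, so only an integer is serialized into the Redis broker and the worker loads fresh data from the database. A minimal sketch of the pattern (the decorator options and function body are illustrative, not the PR's exact code):

from celery import shared_task

from cl.search.models import RECAPDocument


@shared_task
def ingest_recap_document(recap_document_pk: int) -> None:
    """Load the document inside the task so only the pk crosses Redis."""
    rd = RECAPDocument.objects.get(pk=recap_document_pk)
    # ... extraction logic operates on the freshly loaded instance ...


# Callers queue work with just the integer:
# ingest_recap_document.delay(some_recap_document.pk)
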
@flooie (Contributor, Author) commented Jun 18, 2024

Thank you @ERosendo for your thorough analysis. I will incorporate your suggestions into the upcoming PR for the rerun/criminal/bankruptcy runs.

@flooie flooie requested a review from ERosendo June 18, 2024 18:39
@ERosendo (Contributor) left a comment:

LGTM :shipit:

@ERosendo merged commit 73da414 into main, Jun 18, 2024
13 checks passed
@ERosendo deleted the upgrade-importer-recap-opinions branch June 18, 2024 20:02

sentry-io bot commented Jun 19, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ RemoteProtocolError: Server disconnected without sending a response. (cl.corpus_importer.tasks.ingest_recap_document)
  • ‼️ ReadError (cl.corpus_importer.tasks.ingest_recap_document)
  • ‼️ ConnectError: All connection attempts failed (cl.corpus_importer.tasks.ingest_recap_document)
  • ‼️ HTTPStatusError: Server error '500 Internal Server Error' for url 'http://cl-doctor:5050/extract/recap/text/?strip... (cl.corpus_importer.tasks.ingest_recap_document)

