feat(corpus_importer): Add Celery to recap import #4116
Conversation
Move importer to task and add celery
Looks about right to me, but for a more careful review, @ERosendo, can you please give it a look too, when you have a moment?
@mlissner Sure!
Looks good overall!
I've added some comments to the code to improve the efficiency of data transfer to Redis and minimize database retrieval.
Here are some additional comments to further enhance the code.
---
It seems there's no mechanism to resume progress if an error disrupts the command execution. I understand the intention is for this command to run only once, but what happens if an unexpected failure occurs midway through the process?
In that scenario, without a way to resume, we would have to manually re-run all the individual queries from the very beginning. That could be time-consuming and tedious, especially since the command involves a large number of queries and some complex operations.
To address this, we could periodically save the command's progress at predefined checkpoints, so execution can resume from the most recent checkpoint after a failure.
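A minimal sketch of what that checkpointing could look like, assuming the command iterates court by court and that Redis is reachable through Django's cache framework (the cache key, function names, and court-level granularity are hypothetical, not the command's actual design):

```python
# Hypothetical checkpointing sketch: persist the last completed court so a
# re-run can skip straight past it. The cache key and the per-court
# granularity are assumptions, not the command's actual implementation.
from django.core.cache import cache

CHECKPOINT_KEY = "recap_import:last_processed_court"


def process_courts(courts, handle_court):
    """Run handle_court for each court, resuming after the last checkpoint."""
    last_done = cache.get(CHECKPOINT_KEY)
    skipping = last_done is not None
    for court in courts:
        if skipping:
            if court.pk == last_done:
                skipping = False  # resume with the next court
            continue
        handle_court(court)
        # Record progress after each court so a crash repeats at most one court.
        cache.set(CHECKPOINT_KEY, court.pk, timeout=None)
```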
---
There's also no easy way to track which documents failed during the extraction process. We can use Sentry reports to figure out what's causing the problems and fix it, but pinpointing which specific documents failed is a pain. If we kept track of failed extractions, it would be much easier to retry just those records instead of re-running the whole thing, which would save a lot of time on big batches.
On a similar note, the current command tries to extract data from every document, even if we only care about a few. It would be great if we could tell the command which documents to focus on by passing their IDs; that would make things much faster, especially when re-running failed extractions.
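A rough sketch of how both ideas could fit together, i.e. an option that accepts specific IDs plus a record of failures for a targeted retry (the option name, cache key, and the extraction stub are all hypothetical placeholders):

```python
# Hypothetical sketch of ID filtering plus failure tracking. The option name,
# cache key, and extract_document stub are placeholders, not the PR's code.
from django.core.cache import cache
from django.core.management.base import BaseCommand

FAILED_IDS_KEY = "recap_import:failed_document_ids"


def extract_document(pk: int) -> None:
    """Placeholder for the real per-document extraction logic."""


class Command(BaseCommand):
    help = "Extract documents, optionally limited to a list of IDs."

    def add_arguments(self, parser):
        parser.add_argument(
            "--document-ids",
            nargs="+",
            type=int,
            default=None,
            help="Only process these document IDs (e.g. previously failed ones).",
        )

    def handle(self, *args, **options):
        ids = options["document_ids"] or []
        failed = []
        for pk in ids:
            try:
                extract_document(pk)
            except Exception:
                failed.append(pk)
        # Keep the failed IDs around so the next run can target just them.
        cache.set(FAILED_IDS_KEY, failed, timeout=None)
```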
---
I believe we can use the replica database for all data extraction queries within this command. This approach offers several key advantages (a sketch follows this list):
- By offloading the extraction workload to the replica, we completely shield the production database from any performance impact. Long-running queries associated with data extraction won't slow down production operations.
- The replica database also provides a good environment for creating temporary indexes specifically designed to speed up these extraction queries. We used this strategy during the `make_aws_manifest_files` command execution and it worked well. By strategically leveraging temporary indexes on the replica, we can significantly speed up the data extraction process.
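To make that concrete, here is a rough sketch of what the two ideas could look like together. The `replica` database alias, import paths, table name, and index definition are all assumptions about the schema, not verified DDL:

```python
# Hypothetical sketch: route the extraction query to the replica and build a
# temporary index there first. The alias, import paths, table and index names
# are assumed; the real schema may differ.
from django.db import connections

from cl.search.models import SOURCES, OpinionCluster

TEMP_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS temp_cluster_docket_datefiled_idx
ON search_opinioncluster (docket_id, date_filed DESC);
"""


def create_temp_index() -> None:
    # Created on the replica only; drop it once the command finishes.
    with connections["replica"].cursor() as cursor:
        cursor.execute(TEMP_INDEX_SQL)


def latest_non_recap_cluster(court):
    # .using("replica") keeps this long-running read off the production DB.
    return (
        OpinionCluster.objects.using("replica")
        .filter(docket__court=court)
        .exclude(source=SOURCES.RECAP)
        .order_by("-date_filed")
        .first()
    )
```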
I've analyzed some of the database queries used within this command and identified opportunities to significantly improve their performance through custom indexes. Here's a specific example:
The following query retrieves the most recent non-RECAP cluster for each federal district court. Based on the development database, that translates to roughly 123 queries (we have 123 courts that meet the criteria in that db). Each query currently takes around 100 seconds in the development database, so the total execution time exceeds 3 hours for this step alone. To validate these findings, I asked Ramiro to run a single court's query against the production database and observed a similar execution time.
```python
cluster = (
    OpinionCluster.objects.filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
    .first()
)
```
After running `EXPLAIN ANALYZE` on the query, I identified that PostgreSQL is traversing one of the available indexes in a backwards direction, which appears to be the primary cause of the slow execution. Switching to an index that allows forward scans would improve query execution time.
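For reference, the same plan can be reproduced from a Django shell with `QuerySet.explain()`, which on PostgreSQL accepts the `ANALYZE` option. This assumes the same imports as the query above and a `court` instance already in scope:

```python
# Reproducing the plan from a Django shell; on PostgreSQL, explain() passes
# analyze=True through and prints the same output as a raw EXPLAIN ANALYZE.
qs = (
    OpinionCluster.objects.filter(docket__court=court)
    .exclude(source=SOURCES.RECAP)
    .order_by("-date_filed")
)
print(qs.explain(analyze=True))
```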
I apologize for posting the entire review in a single comment; GitHub only allows inline comments on areas with code changes.
Let me know what you think.
Co-authored-by: Eduardo Rosendo <eduardojra96@gmail.com>
Update ingest recap document to accept pk instead of recap document object.
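Passing a primary key instead of a model instance keeps the Celery message small and avoids serializing ORM objects across the broker. A hedged sketch of the pattern follows; the task name, decorator options, and import paths are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of a pk-based task: only an integer crosses the broker
# and the row is re-fetched inside the worker. Names and paths are assumed.
from cl.celery_init import app
from cl.search.models import RECAPDocument


@app.task(bind=True, max_retries=3, ignore_result=True)
def ingest_recap_document(self, recap_document_pk: int) -> None:
    try:
        rd = RECAPDocument.objects.get(pk=recap_document_pk)
    except RECAPDocument.DoesNotExist:
        # The row may not be committed yet if the task raced the transaction.
        raise self.retry(countdown=5)
    # The actual ingest/extraction work would run here using `rd`.
```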
Thank you @ERosendo for your thorough analysis -
Also - add selected replica db with addt'l index
LGTM
Suspect Issues: This pull request was deployed and Sentry observed the following issues:
Refactor recap into opinion using celery
Some minor tweaks and celerization of the recap into opinions code.
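As a rough illustration of what the dispatch side of that celerization can look like, the command might queue one task per record in bounded batches so the broker isn't flooded. This reuses the pk-based task sketched earlier; the batch size, queue name, and pacing are arbitrary choices, not the PR's actual values:

```python
# Hypothetical dispatch loop: fan document IDs out to the pk-based task in
# small batches. Batch size, queue name, and sleep interval are illustrative.
import time

from celery import group


def queue_documents(document_ids, batch_size=100):
    for start in range(0, len(document_ids), batch_size):
        batch = document_ids[start : start + batch_size]
        group(
            ingest_recap_document.s(pk) for pk in batch
        ).apply_async(queue="recap_import")
        time.sleep(1)  # brief pause between batches so the queue stays bounded
```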