
Continuously get the latest cases from the iquery page #4081

Open

mlissner opened this issue May 27, 2024 · 4 comments
@mlissner
Member

As we move towards really great alerts, we need to make sure we have every case as it comes out. The way to do this (freely and quickly) is to scrape the iquery page.

A few notes:

  1. This is hard because some cases are sealed and others don't exist yet.

  2. It might be better to use the Hidden API to check docket numbers, and then use the result from that to scrape iquery. Somehow we need to figure out what is sealed and what is not; this might be one way?

  3. In a perfect world, we would know about new cases instantly, but if we can know in:

    • 1 minute: Great
    • 5 minutes: OK
    • 15 minutes: Will disappoint some people
    • 30 minutes: Bummer
  4. We'll want this as a daemon that runs around the clock.

We also have a client currently that wants this done, so it's a great time to work on this!

@albertisfu
Contributor

Some notes from our talk today about this:

  • The daemon will exclusively perform forward probing.
    • The forward probing will be additive, starting from the bottom up to 256, and will include a jitter on every iteration.
      • For instance, the first iteration will be: 1, 2, 4, 6, 8, 16, 32, 64, 128, 256. If no high watermark is found in this iteration, a jitter will be added for the next one. The jitter will be random, up to 5% of 256, so it can vary from an addition of 1 to 13.
      • Let's say the jitter in the next iteration is 2. If it's applied to the starting value and then compounds as the pattern doubles, the probes will be 3, 6, 12, 24, 48, 96, 192, 384, 768, 1536. Following this approach, probe distances can grow significantly, which may not be as useful.
      • Therefore, the jitter should perhaps be applied as an addition to each number from the original probing pattern (1, 2, 4, 6, 8, 16, 32, 64, 128, 256). So if the jitter is 3, the sequence will be 1+3, 2+3, 4+3, 6+3, 8+3, 16+3, 32+3, 64+3, 128+3, 256+3.
        @mlissner What option did you have in mind for the jitter?
    • The way we're going to select a "hit" is as follows:
      Start probing dockets following the previously defined pattern. If we reach a hit, try the next one. If it's also a hit, try the next one. If the next one is not a hit, we'll choose the previous one. We'll do this up to 10 probes per iteration (256 without jitter). (A sketch of this probing and jitter logic follows this list.)
  • Once we find our "hit," we'll save it so it will trigger a post_signal (described below) and schedule tasks to scrape up to this new high watermark.
  • We'll listen for a post_save signal on every new saved docket, which will be ignored if the docket was created by our iquery page scraper.
  • The signal will store the pacer_case_id in Redis if it is greater than our previous high watermark.
  • The signal will schedule iquery scrape tasks from the previous watermark to the new one.
  • Tasks will be scheduled using a countdown that allows us to keep the per-court scraping rate under control. We'll start with one task per second per court, and the rate can be made configurable in settings.
  • The Celery visibility_timeout we have set is 6 hours, meaning we could schedule up to 21,600 one-second-spaced tasks at once before running into issues with Celery. It is unlikely we'll get this many dockets to scrape at once. However, to ensure it's never a problem, we can cap the number of tasks scheduled per signal at something lower. Remaining tasks can be scheduled by the next triggered signal.
  • For probing and iquery scraping, we'll need to take care of courts that have blocked us or are down. We will stop probing or scraping them for a period of time. We can check the response status code or the HTML message to determine whether the page is down or has blocked us, and then set a per-court wait time, configurable in settings, before checking that court again.
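
To make the probing and jitter discussion concrete, here's a minimal sketch assuming the additive-jitter option from the bullets above. The names (`PROBE_PATTERN`, `probe_offsets`, `find_new_watermark`, the `probe` callable) are illustrative, not existing CourtListener code:

```python
import random

# Assumed base probing offsets, taken from the pattern discussed above
# (10 probes relative to the current high watermark, capped at 256).
PROBE_PATTERN = [1, 2, 4, 6, 8, 16, 32, 64, 128, 256]
MAX_JITTER = round(256 * 0.05)  # "up to 5%", i.e. an addition of 1 to 13


def probe_offsets(jitter: int) -> list[int]:
    """Apply the jitter as an addition to every offset in the base pattern."""
    return [offset + jitter for offset in PROBE_PATTERN]


def find_new_watermark(court_id: str, watermark: int, probe) -> int | None:
    """Probe forward from a court's high watermark and return the last
    consecutive hit, or None if nothing was found.

    `probe(court_id, pacer_case_id)` is a hypothetical callable that fetches
    the iquery page and returns True when the case exists (a "hit").
    """
    jitter = random.randint(1, MAX_JITTER)
    latest_hit = None
    for offset in probe_offsets(jitter):
        pacer_case_id = watermark + offset
        if probe(court_id, pacer_case_id):
            latest_hit = pacer_case_id  # a hit, so try the next probe
        elif latest_hit is not None:
            break  # a miss right after a hit: keep the previous hit
    return latest_hit
```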

Did I miss something?

@mlissner
Member Author

I had a different idea about jitter, but I like yours better.

We'll listen for a post_save signal on every new saved docket, which will be ignored if the docket was created by our iquery page scraper.

It'll also be ignored if the docket was created by a lower hit from the probe. For example, if the probe hits on 8 and 16 but misses on 32, you'd save both the 8 and the 16, but only trigger the signal for the 16, not the 8.
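
A hedged sketch of how the post_save listener, the Redis watermark, and the countdown scheduling could fit together. The receiver name, the `scrape_iquery_page` task, the `from_iquery_scrape` flag, the cap value, and the Redis key layout are all illustrative assumptions, not CourtListener's actual code:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver
from redis import Redis

from cl.search.models import Docket

# `scrape_iquery_page` is assumed to be a Celery task defined elsewhere.
r = Redis()  # however the project normally builds its Redis connection


@receiver(post_save, sender=Docket)
def handle_new_docket(sender, instance, created, **kwargs):
    """Bump the court's high watermark on new dockets and schedule iquery
    scrapes for the gap, skipping dockets the scraper itself created."""
    if not created:
        return
    # Skip dockets created by the iquery scraper, including lower probe hits;
    # only the latest hit of a probe should end up triggering a sweep.
    if getattr(instance, "from_iquery_scrape", False):
        return

    court_id = instance.court_id
    new_id = int(instance.pacer_case_id)
    key = f"iquery:highest_known_pacer_case_id:{court_id}"
    old_id = int(r.get(key) or 0)
    if new_id <= old_id:
        return  # not a new high watermark; nothing to do
    r.set(key, new_id)

    # Space tasks one second apart via countdown, and cap the batch well under
    # what the 6h visibility_timeout allows; leftovers wait for the next signal.
    for delay, case_id in enumerate(range(old_id + 1, new_id + 1)):
        if delay >= 10_800:  # illustrative cap, half of 21,600
            break
        scrape_iquery_page.apply_async(args=(court_id, case_id), countdown=delay)
```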

Did I miss something?

The only other detail is that the daemon has a loop cycle that it uses to schedule probes. Currently we're imagining a five minute cycle, but a configuration for it is probably a good idea. We'll also want to avoid probing for five minutes in a court where we just did a sweep.
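
A rough sketch of that loop, assuming an illustrative setting name and a Redis cooldown key (neither is confirmed by this thread). The same cooldown key could also be set with a longer TTL when a court looks blocked or down, covering the backoff case described earlier:

```python
import time

from django.conf import settings
from redis import Redis

r = Redis()

# Illustrative default; the real value would live in CourtListener's settings.
PROBE_CYCLE_SECONDS = getattr(settings, "IQUERY_PROBE_CYCLE_SECONDS", 300)


def iquery_probe_daemon(court_ids: list[str]) -> None:
    """Run around the clock, probing each court once per cycle unless it is
    cooling down after a recent sweep or a block/outage."""
    while True:
        for court_id in court_ids:
            if r.exists(f"iquery:court_cooldown:{court_id}"):
                continue  # swept recently, or backing off; skip this cycle
            schedule_probe(court_id)  # hypothetical: enqueue the probing task
        time.sleep(PROBE_CYCLE_SECONDS)
```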

@albertisfu
Contributor

It'll also be ignored if the docket was created by a lower hit from the probe. For example, if the probe hits on 8 and 16 but misses on 32, you'd save both the 8 and the 16, but only trigger the signal for the 16, not the 8.

Oh, I see. I thought we were going to save only the latest hit when probing (16 in this case) so only a single signal is triggered. But I think it's a good idea to save all the hits found by the probe and only trigger the signal on the latest one.
Also, before scheduling the tasks, we'll need to check whether those pacer_case_ids are already in the DB to avoid re-scraping them.

The only other detail is that the daemon has a loop cycle that it uses to schedule probes. Currently we're imagining a five minute cycle, but a configuration for it is probably a good idea. We'll also want to avoid probing for five minutes in a court where we just did a sweep.

Got it. Yeah, here we can set a timestamp for the court in Redis when scheduling the tasks for the sweep. Do you think it's enough to wait 5 minutes since the latest sweep task for the court was scheduled, or will we need to somehow figure out when all the tasks for the court have completed and start the countdown from there?
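
If waiting five minutes from when the sweep was scheduled is enough, a TTL'd key set at scheduling time keeps this simple (the key name and TTL are illustrative, and `r` / `court_id` come from the surrounding scheduling code):

```python
# Mark the court as "recently swept" for 5 minutes when its sweep tasks are
# queued; Redis expires the key on its own, so the daemon only needs EXISTS.
r.set(f"iquery:court_cooldown:{court_id}", 1, ex=300)
```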

@mlissner
Member Author

Also, before scheduling the tasks, we'll need to check whether those pacer_case_ids are already in the DB to avoid re-scraping them

Sure, or it can happen in the task. I think that's probably better, so the save completes quickly.
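
A minimal sketch of doing that check inside the task, so the post_save path stays fast. The task name, decorator, import path for the Celery app, and the lookup fields are assumptions:

```python
from cl.celery_init import app  # assumed import path for the Celery app
from cl.search.models import Docket


@app.task(ignore_result=True)
def scrape_iquery_page(court_id: str, pacer_case_id: int) -> None:
    """Scrape one iquery page unless the case is already in the DB, so the
    dedup cost lands here instead of in the signal handler."""
    if Docket.objects.filter(
        court_id=court_id, pacer_case_id=str(pacer_case_id)
    ).exists():
        return  # already have it; avoid re-scraping
    ...  # fetch the iquery page and merge it into a new Docket
```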

Do you think it's enough to wait 5 minutes since the latest sweep task for the court was scheduled

I think so, yeah. If we're doing this correctly, we shouldn't have more than 256 items being scraped at a given time, and five minutes should be enough time to do that.
