Skip to content
develop
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
db
 
 
env
 
 
img
 
 
 
 
src
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Case Harvester

Case Harvester is a project designed to mine the Maryland Judiciary Case Search (MJCS) and build a near-complete database of Maryland court cases that can be queried and analyzed without the limitations of the MJCS interface. It is designed to leverage Amazon Web Services (AWS) for scalability and performance.

NOTE: Unless you are modifying Case Harvester for specific purposes, please do not run your own instance so that MJCS is spared unneccesary load. If you are a values-aligned researcher or journalist, you are welcome to access our public case database scraped from MJCS with defendant information redacted; reach out to us at dan@acab.enterprises.

There is also a web interface and set of APIs in the works for exploring the data scraped from the Maryland Judiciary Case Search. We are in need of web design help for that, so if you are interested in volunteering please get in touch.

Architecture

Case Harvester is split into three main components: spider, scraper, and parser. Each component is a part of a pipeline that finds, downloads, and parses case data from the MJCS. The following diagram shows at a high level how each of these components interact:

High level diagram

Spider

The spider component is responsible for discovering new case numbers. It does this by submitting search queries to the MJCS and iterating through the results. Because the MJCS only returns a maximum of 500 results, the search algorithm splits queries that return 500 results into a set of more narrowed queries which are then submitted. Each of these queries is then split again if more than 500 results are returned, and so forth, until the MJCS is exhaustively searched for case numbers. Each discovered case number is submitted to a PostgreSQL database, and then added to a queue for scraping:

Spider diagram

The Spider is launched using Elastic Container Service (ECS) Fargate tasks run at regularly scheduled intervals. These tasks run Case Harvester from a Docker image pulled from an Elastic Container Registry (ECR). Periodically the spider will save its state using a combination of DynamoDB and S3, which allows resuming failed or canceled spider runs.

Scraper

The scraper component downloads and stores the case details for every case number discovered by the spider. The full HTML for each case (example) is added to an S3 bucket. Version information is kept for each case, including a timestamp of when each version was downloaded, so changes to a case can be recorded and referenced.

Scraper diagram

The scraper is a continuously running ECS service that processes case numbers from the SQS scraper queue.

Parser

The parser component is a Lambda function that parses the fields of information in the HTML case details for each case, and stores that data in the PostgreSQL database. Each new item added to the scraper S3 bucket triggers a new parser Lambda invocation, which allows for significant scaling.

Parser diagram

Case details in the MJCS are formatted differently depending on the county and type of case (e.g. district vs circuit court, criminal vs civil, etc.), and whether it is in one of the new MDEC-compatible formats. MJCS assigns a code to each of these different case types. While the case details for all discovered cases are stored in S3, Case Harvester only has full parser support for the following types of cases (see issue #10):

  • CC: Circuit Court Civil Cases
  • DSCIVIL: District Court Civil Cases
  • DSCR: District Court Criminal Cases
  • DSK8: Baltimore City Criminal Cases
  • ODYCRIM: MDEC Criminal Cases
  • ODYTRAF: MDEC Traffic Cases
  • ODYCIVIL: MDEC Civil Cases

Each different parser breaks down the case details to a granular level and stores the data in a number of database tables. This schematic diagram illustrates how this data is represented in the database.

Schedule

Case Harvester essentially has two main tasks:

  1. Discover new case numbers, and
  2. Capture updates/changes to case details for known cases.

In order to find new case numbers, Case Harvester uses scheduled tasks to run the spider component according to the following schedule:

  • Every day, spider for cases filed within the last month
  • Every week, spider for cases filed within the last 6 months
  • Every month, spider for cases filed within the last year

The scraper component runs daily to capture any updated case details in MJCS. Case numbers are chosen for rescraping based on the case's age (since filing date), with newer cases being rescraped more frequently than older cases. For example, cases less than 3 months old are rescraped every day while a 4-year-old case will be rescraped every 2 weeks. These settings can be tuned with the RESCRAPE_COEFFICIENT, MAX_SCRAPE_AGE, and MAX_SCRAPE_AGE_INACTIVE configuration variables. When updates to a case are found by the scraper, a new version of the case details HTML is saved in S3 and the database is updated to reflect the most recent case details.

Installation

Case Harvester can be run or deployed from any workstation running Python 3, GNU Make, and jq. The required Python 3 modules are in requirements.txt and can be installed with pip3 install -r requirements.txt. Creating a Python virtual environment is recommended.

Next, configure AWS CLI with aws configure so that it can deploy Case Harvester using your account. Here you'll use an Access Key ID and Secret Access Key either for your root AWS account, or an IAM user or role that has sufficient permissions.

Deploy to AWS

Case Harvester uses Cloudformation stacks to deploy, configure, and connect all of the needed AWS resources. There are separate stacks for static resources (VPC, S3 bucket, RDS instance), spider, scraper, and parser. The first step is to set strong, unique passwords for the database users in secrets.json:

{
  "development":{
    "DatabaseMasterUsername":"root",
    "DatabaseMasterPassword":"badpassword",
    "DatabaseUsername":"db_user",
    "DatabasePassword":"badpassword",
    "DatabaseReadOnlyUsername":"ro_user",
    "DatabaseReadOnlyPassword":"badpassword"
  },
  "production":{
    "DatabaseMasterUsername":"root",
    "DatabaseMasterPassword":"badpassword",
    "DatabaseUsername":"db_user",
    "DatabasePassword":"badpassword",
    "DatabaseReadOnlyUsername":"ro_user",
    "DatabaseReadOnlyPassword":"badpassword"
  }
}

You can then deploy the cloudformation stacks to AWS by running:

make deploy

Once this is finished, you can intialize the database and configure Case Harvester to use the newly deployed resources by running:

make init

Note that the above commands deploy and initialize a development environment. To deploy to a production environment:

make deploy_production
make init_production

More make targets (such as deploying a specific stack or generating documentation) can be found by looking in the Makefile.

Usage

You can manually run the spider (or other components) on the command line by specifying a search time range and county:

./src/case_harvester.py spider --start-date 1/1/2000 --end-date 12/31/2000 --county 'BALTIMORE CITY'

By default, case_harvester.py runs in your development AWS environment (see Deploy to AWS). To run in your production environment, add the --environment production CLI flag:

./src/case_harvester.py spider --environment production -s 1/1/2000 -e 12/31/2000 --county 'BALTIMORE CITY'

Questions

For questions or more information, email dan@acab.enterprises.

About

AWS-based application for mining the Maryland Judiciary Case Search

Resources

License

Packages

No packages published

Languages