Case Harvester


Case Harvester is a project designed to mine the Maryland Judiciary Case Search (MJCS) and build a near-complete database of Maryland court cases that can be queried and analyzed without the limitations of the MJCS interface. It is designed to leverage Amazon Web Services (AWS) for scalability and performance.

Our database of cases (with criminal defendant names redacted) is available to the public and can be found at, which is built using our Case Explorer software. REST and GraphQL APIs are also available. If you would like to download tables from our database exported monthly, you can find that at

NOTE: Unless you are modifying Case Harvester for specific purposes, please do not run your own instance, so that MJCS is spared unnecessary load. Instead, use the options described above for viewing the data, or if you have an AWS account you are also welcome to clone our database directly.


Case Harvester is split into three main components: spider, scraper, and parser. Each component is part of a pipeline that finds, downloads, and parses case data from the MJCS. The following diagram shows at a high level how these components interact:

High level diagram


The spider component is responsible for discovering new case numbers. It does this by submitting search queries to the MJCS and iterating through the results. Because the MJCS returns at most 500 results per query, the search algorithm splits any query that returns 500 results into a set of narrower queries, which are then submitted. Each of these queries is split again if it also returns 500 results, and so forth, until the MJCS has been exhaustively searched for case numbers. Each discovered case number is submitted to a PostgreSQL database, and then added to a queue for scraping:

Spider diagram
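The splitting strategy can be sketched in a few lines of Python. This is a minimal illustration, not the spider's actual code: it assumes queries are narrowed by appending a letter to a name prefix, and `discover`, `fake_search`, `max_len`, and the generated data are all hypothetical stand-ins for the real MJCS search.

```python
import string
from typing import Callable, List, Set

MAX_RESULTS = 500  # result cap described above


def discover(prefix: str,
             search: Callable[[str], List[str]],
             found: Set[str],
             max_len: int = 4) -> None:
    """Collect case numbers, narrowing the query whenever the cap is hit."""
    results = search(prefix)
    found.update(results)
    # A capped response may be truncated, so split it into narrower queries.
    # max_len is an assumed guard against unbounded recursion.
    if len(results) >= MAX_RESULTS and len(prefix) < max_len:
        for letter in string.ascii_uppercase:
            discover(prefix + letter, search, found, max_len)


# Hypothetical stand-in for MJCS: 1,352 made-up names, one case number each.
NAMES = [a + b + c for a in "AB"
         for b in string.ascii_uppercase
         for c in string.ascii_uppercase]
CASES = {name: f"CASE{i:05d}" for i, name in enumerate(NAMES)}


def fake_search(prefix: str) -> List[str]:
    """Return matching case numbers, truncated at the cap like MJCS."""
    matches = [case for name, case in CASES.items() if name.startswith(prefix)]
    return matches[:MAX_RESULTS]


found: Set[str] = set()
discover("", fake_search, found)
print(len(found))
```

Even though no single query here can return more than 500 results, the recursion eventually collects every case number, because each capped query is replaced by queries small enough to fit under the cap.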

The spider is launched using Elastic Container Service (ECS) Fargate tasks that run at regularly scheduled intervals. These tasks run Case Harvester from a Docker image pulled from an Elastic Container Registry (ECR). The spider periodically saves its state using a combination of DynamoDB and S3, which allows failed or canceled spider runs to be resumed.


The scraper component downloads and stores the case details for every case number discovered by the spider. The full HTML for each case (example) is added to an S3 bucket. Version information is kept for each case, including a timestamp of when each version was downloaded, so changes to a case can be recorded and referenced.

Scraper diagram

The scraper is a continuously running ECS service that processes case numbers from the SQS scraper queue.
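As a rough sketch of this flow, the snippet below drains a queue of case numbers and stores a new timestamped version only when the downloaded HTML differs from the latest stored one. It is an in-memory illustration under stated assumptions: a `deque` stands in for the SQS queue, a dictionary stands in for the S3 bucket, and `save_if_changed`, `drain`, and `fetch` are hypothetical names rather than Case Harvester functions.

```python
from collections import deque
from datetime import datetime, timezone
from typing import Callable, Deque, Dict, List, Tuple

# case number -> list of (download timestamp, HTML); stand-in for the S3 bucket
Versions = Dict[str, List[Tuple[datetime, str]]]


def save_if_changed(store: Versions, case_number: str, html: str) -> bool:
    """Append a new version only when the HTML differs from the latest one."""
    history = store.setdefault(case_number, [])
    if history and history[-1][1] == html:
        return False
    history.append((datetime.now(timezone.utc), html))
    return True


def drain(queue: Deque[str],
          fetch: Callable[[str], str],
          store: Versions) -> int:
    """Process every queued case number; return how many versions were saved."""
    saved = 0
    while queue:
        case_number = queue.popleft()
        if save_if_changed(store, case_number, fetch(case_number)):
            saved += 1
    return saved


store: Versions = {}
pending = deque(["A1", "A2", "A1"])  # made-up case numbers
saved = drain(pending, lambda number: f"<html>{number}</html>", store)
print(saved, len(store["A1"]))
```

The second request for `A1` returns identical HTML, so no new version is recorded for it.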


The parser component is a Lambda function that parses the fields of information in the HTML case details for each case, and stores that data in the PostgreSQL database. Each new item added to the scraper S3 bucket triggers a new parser Lambda invocation, which allows for significant scaling.

Parser diagram
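A Lambda function triggered by S3 receives an event notification that lists the newly created objects. The sketch below shows how such a handler might pull the bucket name and object key out of the event. The event layout follows the standard S3 notification format; `objects_from_event`, `lambda_handler`, and the sample bucket and key are hypothetical and are not taken from the Case Harvester source.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def objects_from_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Hypothetical entry point, invoked once per S3 notification."""
    targets = objects_from_event(event)
    for bucket, key in targets:
        # A real parser would fetch the HTML from S3 here, parse it,
        # and write the extracted fields to PostgreSQL.
        print(f"would parse s3://{bucket}/{key}")
    return {"parsed": len(targets)}


# Made-up event for illustration only.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "cases/ABC%3D123"}}}
    ]
}
result = lambda_handler(sample_event, None)
```

Because each new object produces its own notification, many invocations can run in parallel without any coordination between them.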

Case details in the MJCS are formatted differently depending on the county and type of case (e.g. district vs circuit court, criminal vs civil, etc.), and whether it is in one of the new MDEC-compatible formats. MJCS assigns a code to each of these different case types. While the case details for all discovered cases are stored in S3, Case Harvester only has full parser support for the following types of cases (see issue #10):

  • ODYCRIM: MDEC Criminal Cases
  • ODYTRAF: MDEC Traffic Cases
  • ODYCIVIL: MDEC Civil Cases
  • ODYCVCIT: MDEC Civil Citations
  • DSCR: District Court Criminal Cases
  • DSCIVIL: District Court Civil Cases
  • DSCP: District Court Civil Citations
  • DSTRAF: District Court Traffic Cases
  • K: Circuit Court Criminal Cases
  • CC: Circuit Court Civil Cases
  • DV: Domestic Violence Cases
  • DSK8: Baltimore City Criminal Cases
  • PG: Prince George's County Circuit Court Criminal Cases
  • PGV: Prince George's County Circuit Court Civil Cases
  • MCCI: Montgomery County Civil Cases
  • MCCR: Montgomery County Criminal Cases

The following case types are still in need of parsers:

  • ODYCOA: Court of Appeals
  • ODYCOSA: Court of Special Appeals
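One way to organize per-type parsers is a registry keyed by the MJCS case type code, so that unsupported types fall through without being parsed. The following is a minimal sketch of that idea, not the project's actual structure: `register`, `parse_case`, `parse_dscr`, and the returned fields are hypothetical.

```python
from typing import Callable, Dict, Optional

Parser = Callable[[str], dict]
PARSERS: Dict[str, Parser] = {}


def register(case_type: str) -> Callable[[Parser], Parser]:
    """Decorator that registers a parser under an MJCS case type code."""
    def wrap(func: Parser) -> Parser:
        PARSERS[case_type] = func
        return func
    return wrap


@register("DSCR")
def parse_dscr(html: str) -> dict:
    # Placeholder: a real parser would extract fields from the HTML.
    return {"case_type": "DSCR", "length": len(html)}


def parse_case(case_type: str, html: str) -> Optional[dict]:
    """Parse if a parser exists; otherwise leave the stored HTML unparsed."""
    parser = PARSERS.get(case_type)
    return parser(html) if parser else None


print(parse_case("DSCR", "<html></html>"))
print(parse_case("ODYCOA", "<html></html>"))
```

With this layout, adding support for a new case type means writing one function and registering it under the corresponding code.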

Each different parser breaks down the case details to a granular level and stores the data in a number of database tables. This schematic diagram illustrates how this data is represented in the database.


Case Harvester essentially has two main tasks:

  1. Discover new case numbers, and
  2. Capture updates/changes to case details for known cases.

In order to find new case numbers, Case Harvester uses scheduled tasks to run the spider component according to the following schedule:

  • Every day, spider for cases filed within the last month
  • Every week, spider for cases filed within the last 6 months
  • Every month, spider for cases filed within the last year
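The schedule above amounts to three filing-date windows of different lengths. A minimal sketch of computing them follows. The day counts approximate "one month", "six months", and "one year" and are an assumption, as are the names `WINDOWS` and `spider_window`.

```python
from datetime import date, timedelta
from typing import Dict, Tuple

# Approximate window lengths in days; exact month arithmetic is an assumption.
WINDOWS: Dict[str, int] = {"daily": 30, "weekly": 182, "monthly": 365}


def spider_window(schedule: str, today: date) -> Tuple[date, date]:
    """Return the (start, end) filing-date range for a scheduled spider run."""
    return today - timedelta(days=WINDOWS[schedule]), today


start, end = spider_window("daily", date(2024, 3, 31))
print(start, end)
```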

The scraper component runs daily to capture any updated case details in MJCS. Case numbers are chosen for rescraping based on the case's age (since filing date), with newer cases being rescraped more frequently than older cases. For example, cases less than 3 months old are rescraped every day while a 4-year-old case will be rescraped every 2 weeks. These settings can be tuned with the RESCRAPE_COEFFICIENT, MAX_SCRAPE_AGE, and MAX_SCRAPE_AGE_INACTIVE configuration variables. When updates to a case are found by the scraper, a new version of the case details HTML is saved in S3 and the database is updated to reflect the most recent case details.
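To illustrate age-based rescraping, the sketch below uses an interval that grows linearly with case age and never drops below one day. The formula and the default coefficient are illustrative assumptions chosen only to reproduce the two examples above. They are not Case Harvester's actual calculation, and the real meaning of RESCRAPE_COEFFICIENT may differ.

```python
def rescrape_interval_days(age_days: int, coefficient: float = 14 / 1461) -> int:
    """Days between rescrapes: linear in case age, with a one-day minimum.

    The default coefficient is an assumed value that maps a 4-year-old
    case (1461 days) to a 14-day interval.
    """
    return max(1, round(age_days * coefficient))


def is_due(age_days: int, days_since_last_scrape: int) -> bool:
    """True when a case has waited at least its rescrape interval."""
    return days_since_last_scrape >= rescrape_interval_days(age_days)


print(rescrape_interval_days(60))    # a 2-month-old case
print(rescrape_interval_days(1461))  # a 4-year-old case
```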


Case Harvester can be run or deployed from any workstation running Python 3.8, GNU Make, and jq. The required Python modules are in requirements.txt and can be installed with pip3 install -r requirements.txt. Creating a Python virtual environment is recommended.

Next, configure AWS CLI with aws configure so that it can deploy Case Harvester using your account. Here you'll use an Access Key ID and Secret Access Key either for your root AWS account, or an IAM user or role that has sufficient permissions.

Deploy to AWS

Case Harvester uses CloudFormation stacks to deploy, configure, and connect all of the needed AWS resources. There are separate stacks for static resources (VPC, S3 bucket, RDS instance), spider, scraper, and parser. The first step is to set strong, unique passwords for the database users in secrets.json:


You can then deploy the CloudFormation stacks to AWS by running:

make deploy

Once this is finished, you can initialize the database and configure Case Harvester to use the newly deployed resources by running:

make init

Note that the above commands deploy and initialize a development environment. To deploy to a production environment:

make deploy_production
make init_production

More make targets (such as deploying a specific stack or generating documentation) can be found by looking in the Makefile.


You can manually run the spider (or other components) on the command line by specifying a search time range and county:

./src/ spider --start-date 1/1/2000 --end-date 12/31/2000 --county 'BALTIMORE CITY'

By default, this runs in your development AWS environment (see Deploy to AWS). To run in your production environment, add the --environment production CLI flag:

./src/ spider --environment production -s 1/1/2000 -e 12/31/2000 --county 'BALTIMORE CITY'
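For readers extending the command line, here is a sketch of how the flags shown above could be declared with Python's `argparse`. Only the flag names come from the commands above. The program name, the required/optional choices, the date format, and the functions `parse_date` and `build_parser` are assumptions, not the project's actual argument handling.

```python
import argparse
from datetime import datetime


def parse_date(text: str) -> datetime:
    """Parse dates written as month/day/year, e.g. 12/31/2000 (assumed format)."""
    return datetime.strptime(text, "%m/%d/%Y")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="harvester-sketch")  # hypothetical name
    commands = parser.add_subparsers(dest="command", required=True)
    spider = commands.add_parser("spider")
    spider.add_argument("--environment", default="development",
                        choices=["development", "production"])
    spider.add_argument("-s", "--start-date", type=parse_date, required=True)
    spider.add_argument("-e", "--end-date", type=parse_date, required=True)
    spider.add_argument("--county")
    return parser


args = build_parser().parse_args(
    ["spider", "--environment", "production",
     "-s", "1/1/2000", "-e", "12/31/2000", "--county", "BALTIMORE CITY"]
)
print(args.environment, args.start_date.date(), args.end_date.date(), args.county)
```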


For questions or more information, email

