
Handling an Outage if Things Are Down


We try to keep CourtListener up, but it's not always easy due to the huge spikes in traffic we sometimes get. What follows is the general process for fixing the site when it's down.

The general idea is to (1) figure out which component of the site is broken and then (2) fix it.

Figuring out what's broken

This part is pretty hard at the moment. First steps:

  1. Try the health check endpoint. It's easy and fast and often revealing. (If you don't remember its URL, see the curl sketch after this list.)

  2. Look at Sentry and see what kinds of errors it's reporting. It'll often give you a good lead about where to begin.
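
If the health check URL isn't at hand, even a plain curl against the site tells you whether the front end is answering at all, and how slowly (the homepage URL is the only assumption here):

curl -sS -o /dev/null -w '%{http_code} %{time_total}s\n' https://www.courtlistener.com/

A 502/504 or a multi-second response time usually means the trouble is behind the load balancer; a connection failure points at DNS, the CDN, or the load balancer itself.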

Usually, the problem is one of:

  1. Solr

    Solr is run from a large EC2 instance that's outside of k8s. You can check its status by adding your IP to the Security Group, SSH'ing into it…

     ssh -i .ssh/solr.pem ubuntu@ec2-35-91-65-155.us-west-2.compute.amazonaws.com
    

    …and then doing the standard checks listed below. To SSH into the server, you'll need its SSH key.

  2. Postgresql

    Postgresql is run in AWS RDS. If it's failing, it's almost always because somebody is doing a nasty API query of some kind (see section below on sorting that out). You can check its health via its logs (see below) and the RDS monitoring dashboard in the AWS console.

  3. Django

    Django is run as a horizontally scaled k8s deployment. Usually if it fails it's because:

    • It needs more memory or servers (check this in k9s).
    • A new pod deployment is broken. This will turn up in Sentry, or, if you look at the cl-python deployment in k8s, you'll see lots of crashing pods.
  4. Redis

    Redis is run via AWS Elasticache. Like RDS, it has a monitoring dashboard and logs in the AWS console.

  5. AWS SES

    This is what sends our emails. It has a quota. If the quota runs out, we will get lots of errors from Sentry. To check the status of SES, pull it up in the console, where it will tell you everything you need to know.
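
For the SES quota specifically, the AWS CLI can also tell you where you stand (this assumes your credentials and region are already configured for the production account):

aws ses get-send-quota        # Max24HourSend vs. SentLast24Hours
aws ses get-send-statistics   # recent sends, bounces, complaints, rejects

If SentLast24Hours is at or near Max24HourSend, the quota is the problem.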

Checking logs

Solr

Solr logs are shipped to AWS Cloudwatch. You can find and filter them there. Note that the caller GET parameter in the logs can be a useful clue to figure out what part of CourtListener made a specific call to Solr.

If you're in the server, you can also view the logs with:

sudo docker logs solr -f --since 1m

Elastic

Elastic is run as a k8s cluster. It uses fluent-bit to send its logs to AWS CloudWatch. To search the logs there, use Log Insights with a query like:

filter log_processed.log.level != "WARN" 
  and log not like /some-regex/
  | fields @timestamp, @message

Checking metrics while under high load

export ES_USER=elastic
export ES_PASS=
export ES_ENDPOINT=localhost:9200

Search stats:

curl -X GET "https://${ES_ENDPOINT}/_stats/search" -k -u ${ES_USER}:${ES_PASS}

Target a single index:

curl -X GET "https://${ES_ENDPOINT}/recap_vectors/_stats/search" -k -u ${ES_USER}:${ES_PASS}

Indexing stats:

curl -X GET "https://${ES_ENDPOINT}/recap_vectors/_stats/indexing" -k -u ${ES_USER}:${ES_PASS}

Retrieve all the tasks related to search that are queued:

curl -X GET "https://${ES_ENDPOINT}/_tasks?detailed=true&actions=*search" -k -u ${ES_USER}:${ES_PASS}
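
Another useful signal when search is struggling is the search thread pool: a deep queue or non-zero rejections means Elastic itself is saturated rather than the layer above it. Same endpoint and credentials as above:

curl -X GET "https://${ES_ENDPOINT}/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected" -k -u ${ES_USER}:${ES_PASS}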

Web logs

Web traffic can be logged at three levels:

  • At the CDN, CloudFront
  • At the Application Load Balancer (ALB)
  • At the docker pods or deployment, in k8s.

What we do is have CloudFront ship logs to S3. You can go into S3 and look at the logs directly, but it's horrible: They're split up, zipped, and it's a mega pain.

The better way to look at our web logs is to use AWS Athena, which has some queries saved in it for looking at the logs. Athena is sort of magic. It gives you a SQL API for files stored in S3, so you can literally query S3.

Go here:

https://us-west-2.console.aws.amazon.com/athena/home?region=us-west-2#/query-editor/history/9734dbf0-4ca5-44da-a944-d60f7d740dd0

That will get you a feel for the traffic. If you want to map IP addresses to users, use the recipe in the Special tricks section below.
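
If none of the saved queries fit, here's the general shape of one for finding the noisiest clients. This is only a sketch: the table name cloudfront_logs and the columns follow the standard AWS-documented schema for CloudFront access logs, so adjust both to match the actual Athena table, and swap in a recent date.

SELECT request_ip, uri, status, count(*) AS hits
FROM cloudfront_logs
WHERE "date" = DATE '2024-05-01'
GROUP BY request_ip, uri, status
ORDER BY hits DESC
LIMIT 50;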

Django

These logs don't tend to say much, but you can get them in k9s by selecting the deployment object type (: deployment), pressing l for logs, and then 0 to see the latest logs.
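
If you prefer kubectl to k9s, the same logs are available straight from the deployment (a sketch; add -n with the right namespace if the pods aren't in your current one):

kubectl logs deployment/cl-python --since=10m --tail=200   # prints recent logs from one pod in the deployment

The same command works for the celery-prefork deployment in the next section.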

Celery

Celery logs can be viewed from k9s by selecting the logs for the celery-prefork deployment. To do that:

  • Open k9s
  • Go to deployments (:deployment <enter>)
  • Use arrows to select "celery"
  • Press l for the logs.

Redis

Redis logs are available in the Elasticache console.

Postgresql

Postgresql logs are available in the RDS console.

AWS SES

This doesn't seem to have logs.

Fixing it

Solr

Possible solutions include:

  1. Restart solr:

     sudo docker container ls
     sudo docker restart 11c681511841
    
  2. Scale up the Solr instance via EC2.

  3. Add more swap space to handle memory exhaustion (see the swap recipe after this list).

  4. Figure out which queries are bogging it down and block the offending IPs or users (see below).
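
For step 3, a typical swap recipe on Ubuntu looks like this (the 8G size is just an example; add an /etc/fstab entry if the swap should survive a reboot):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h   # confirm the new swap shows up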

Django

  1. Scale the k8s cluster by adding more nodes to the group. This should fix memory or CPU exhaustion (see the sketch after this list for confirming which resource is exhausted).

  2. Revert to a known-good deployment by re-running the last good GitHub deployment.
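
Before (or instead of) adding nodes, it helps to confirm what's actually exhausted. A sketch with kubectl (add -n with the right namespace if needed; the replica count is an example, and more replicas only help if the nodes have headroom):

kubectl top nodes                                 # CPU or memory pressure, and on which nodes
kubectl top pods --sort-by=memory | head -20      # the hungriest pods
kubectl scale deployment/cl-python --replicas=8   # example count only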

Celery

Often, the problem is that Celery has a really long queue that's blowing up Redis. This is a bummer, and you can fix it by deleting the queue (del celery or del etl_tasks, say), but that indiscriminately blows up everything. A better way is to delete the bad tasks.

You can do that with the script below, which lets you fiddle with the task name and parameters:

import json
from django.conf import settings
from cl.lib.redis_utils import make_redis_interface

r = make_redis_interface("CELERY")
queue_name = "etl_tasks"
tasks_to_remove = ["cl.search.tasks.update_children_docs_by_query", "cl.search.tasks.es_save_document"]
related_instance = "ESRECAPDocument"
# Count all tasks in the queue
total_tasks = r.llen(queue_name)
chunk_size = 500
removed_tasks = 0
checked_tasks = 0

# Calculate the number of chunks based on total_tasks and chunk_size.
# Add an extra chunk to process remaining tasks.
chunks = (total_tasks // chunk_size) + 1
# Remove tasks from the queue in chunks.
for chunk in range(chunks):
    # Adjust the start index based on removed tasks.
    start_index = (chunk * chunk_size) - removed_tasks  
    end_index = start_index + chunk_size - 1
    
    tasks = r.lrange(queue_name, start_index, end_index)
    for task_data in tasks:
        task_json = json.loads(task_data)
        if task_json['headers']['task'] in tasks_to_remove and related_instance in task_json['headers']['argsrepr']:
            # Remove the task from the queue.
            r.lrem(queue_name, 1, task_data)
            removed_tasks+=1
        checked_tasks += 1
    print(f"Checked {checked_tasks} and removed {removed_tasks} tasks so far.")

print(f"Successfully removed {removed_tasks} tasks.")

Celery also keeps a couple of other keys in Redis for tasks that are scheduled, notably the unacked queue.

This script can be used to remove things from the unacked queue. We've used it once before, but by the time we did, the queue was clear, so count this script as untested:

import json
from django.conf import settings
from cl.lib.redis_utils import make_redis_interface

r = make_redis_interface("CELERY")
tasks_to_remove = ["cl.search.tasks.es_save_document"]
removed_tasks = 0
checked_tasks = 0
cursor = 0

while True:
    # Iterate over unacked_index
    cursor, items = r.zscan("unacked_index", cursor=cursor)
    for unack_key, score in items:
        task_value = r.hget("unacked", unack_key)
        task_json = json.loads(task_value)
        if task_json[0]['headers']['task'] in tasks_to_remove:
            # Remove the task from "unacked_index" and the "unacked" queue.
            r.hdel("unacked", unack_key)
            r.zrem("unacked_index", unack_key)
            removed_tasks += 1
        checked_tasks += 1
        print(f"Checked {checked_tasks} and removed {removed_tasks} unacked tasks so far.")

    if cursor == 0:
        break
print(f"Successfully removed {removed_tasks} unacked tasks.")

Redis

Usually, Redis's problem is that it runs out of memory. You can see this by loading its memory chart in AWS, and you can analyze it with a few commands:

redis-cli -h $REDIS_HOST --bigkeys

This will show you some meaningless progress info followed by a summary. Things to look for:

  • Are there any particularly huge keys in the summary? Once, we accidentally cached the sitemaps to Redis instead of the DB, and it was bad.
  • How big is the celery list? It should only have a few items, but sometimes it has a LOT. If it has a lot, Celery has fallen behind and needs more resources either directly (scale it up) or indirectly (the DB can't keep up with it, say). You can check the queue length directly with the commands after this list.
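
To answer that second question from the command line (queue names as used elsewhere on this page; add -n if the Celery broker isn't on DB 0):

redis-cli -h $REDIS_HOST llen celery
redis-cli -h $REDIS_HOST llen etl_tasks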

Sometimes, in an emergency, you can just delete the large keys with DEL iauploads. That worked in #1460.

Another useful command is:

redis-cli -h $REDIS_HOST INFO

That'll give you an overview of which Redis DBs are using the memory. A clue!

Finally, this is a handy way to delete a lot of keys:

redis-cli -h $REDIS_HOST -n 1 --scan --pattern ':1:mlt-cluster*' | xargs redis-cli -h $REDIS_HOST -n 1 del

Postgresql

  1. Figure out the cause of the problem, which is usually a slow or runaway query (see the query sketch after this list).

  2. Scale the disk so it's faster.

  3. Scale the instance so it's more powerful.
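
For step 1, the usual culprit is a handful of long-running queries. This sketch, run via psql or a Django dbshell as a user that can see pg_stat_activity, lists them longest-first:

SELECT pid, now() - query_start AS runtime, state, left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC
LIMIT 20;

SELECT pg_terminate_backend(pid) will kill an offender, but unless the source (usually an API client) is blocked, it will just come back.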

Special tricks

Who owns an IP address?

I'm not good at this, but one place to start is with the whois command:

whois xxx.xxx.xxx.xxx

Blocking an IP address

Add it to the blocklist IP Set in the Web Application Firewall.

Getting the username of an API request by IP

You can get the username associated with an IP address that's making API requests:

from cl.lib.redis_utils import make_redis_interface
def get_user_by_ip(r, date_str, ip_address):
    # Get the key for the specific day
    key = f"api:v3.d:{date_str}.ip_map"

    # Get the user_id associated with the IP address
    user_id = r.hget(key, ip_address)
    return user_id

r = make_redis_interface("STATS")
get_user_by_ip(r, '2023-06-01', 'x.x.x.x')

I think this has a bug: it only returns one user even if more than one user is behind the same IP.
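
Also note that the value that comes back is a user ID (possibly as bytes), not a username. To turn it into an account, continue in the same Django shell. A sketch, assuming the standard Django auth User model:

from django.contrib.auth.models import User

user_id = get_user_by_ip(r, "2023-06-01", "x.x.x.x")
if user_id:
    # int() accepts both str and bytes values from Redis
    print(User.objects.get(pk=int(user_id)).username)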

Blocking API access for a user

Two approaches:

  1. If they're abusing the RECAP APIs, simply yank their permission for those APIs.

  2. If they're abusing a different API, add their name to the throttle override section of the settings, push that to main, and deploy it via CI.

Dealing with a viral webpage

We use a CDN, but without caching. Put the viral page behind the CDN with a custom behavior that matches just that page and has caching enabled. If the viral load is very high, you may also see a halo effect of people loading the homepage. If so, put it behind the CDN in the same way. Use the short cache policy that factors in the sessionid cookie value.

Standard Checks

  1. Check general health of the server with htop. Are the CPUs pegged? Is there a lot of IO wait?

  2. Check memory with free -h.

  3. Check IO usage with sudo iotop.

  4. Check logs.