
API load testing setup #4327

Closed
8 of 11 tasks
lbeaufort opened this issue Apr 29, 2020 · 4 comments

lbeaufort commented Apr 29, 2020

Action item from #4314

Need to know what the system can handle before it "falls over", and what the load actually was on April 15, for example.

Completion criteria:

  • After Aurora migration, measure "fallover point" for number/complexity of requests
  • Put in tickets for findings

Technical steps

Prepare the "locusts"

Set up environment

  • Figure out what it's going to take to set up the DB environment to replicate prod now that we're using Aurora. How many clusters, PR to terraform, check with DB team on timing, etc.
  • Q: Stage and dev have 2 clusters of 4.8xlarge, prod has 3. Can we give stage and prod 3 each and dev one? Alternatively, can we spin up a 3rd replica for one day?
  • Per Rohan, currently:
    DEV (1 master, 1 replica)
    STG (1 master, 1 replica)
    PRD (1 master, 4 replicas)
    Changing this manually during the test is much easier than going through Terraform. Make sure to scale it back down after the test to minimize cost.
    Should test 2 vs. 4 clusters.
  • Confirm access to increase cluster count
  • Make sure everything actually works by running a sample test on stage as-is
  • Make a testing plan: 2 vs. 4 clusters, number of application instances and memory, CMS timeouts, API timeouts, Gunicorn workers (needs research). Application profiling? https://github.com/benfred/py-spy
  • Add downloads to locust tests? Will need to look at celery worker setups
  • First test production setup with 2 clusters
  • Test production setup with 4 clusters
  • Test more application memory
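The "prepare the locusts" step ultimately means replaying captured endpoint/query pairs against the API. A minimal stdlib-only sketch of turning one captured pair back into a request URL (the sample data, base URL default, and function name are illustrative, not the team's actual test config):

```python
import random
from urllib.parse import urlencode

# Illustrative sample of the endpoint/query lookup that the parsing
# script later in this thread produces -- not real captured traffic.
QUERIES = {
    "candidates/": [{"per_page": ["20"], "cycle": ["2018", "2020"]}],
    "schedules/schedule_a/": [{"contributor_state": ["VA"]}],
}


def pick_request(base="https://api.open.fec.gov/v1/"):
    """Choose a random captured endpoint/params pair and rebuild its URL."""
    endpoint = random.choice(sorted(QUERIES))
    params = random.choice(QUERIES[endpoint])
    # doseq=True expands list values into repeated key=value pairs,
    # which is how multi-value filters arrive in the query string.
    return base + endpoint + "?" + urlencode(params, doseq=True)
```

In a locust test the same pick-and-rebuild logic would sit inside a task, with the client issuing the GET.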

Communicate

@lbeaufort lbeaufort self-assigned this Apr 29, 2020
@lbeaufort lbeaufort added this to the Sprint 12.3 milestone Apr 29, 2020
@lbeaufort lbeaufort changed the title from "After Aurora migration, measure 'fallover point' for number/complexity of requests" to "API load testing setup" May 5, 2020
@lbeaufort

We should also let the team and cloud.gov know (ideally a week beforehand) which day we plan to test. Reference: https://cloud.gov/docs/compliance/pentest/


lbeaufort commented May 7, 2020

Parsing script:

"""Extract API query from Kibana log file"""
import csv
import json

with open("LB_API_RTR_requests_4-15-20.csv", "r") as file:
    reader = csv.reader(file, delimiter=',')
    # Endpoint/uery lookup
    queries = {}
    count = 0
    for row in reader:
        # Column 2 has the queries. Throw out some bad data
        if 'v1' in row[1]:
            # Get the endpoint- everything after the 'v1' to the first '?'
            endpoint = row[1].partition("v1/")[2].partition("?")[0].partition(" ")[0]
            if endpoint[-1] != "/":
                endpoint += "/"
            # Get the query param string - everything between the ? and the ' '
            query_parameters = row[1].partition("?")[2].partition(" ")[0]
            # Parse arguments
            query_dict = {}
            if "&" in query_parameters:
                # Clean up some double &&
                query_parameters = query_parameters.replace("&&", "&")
                # Split each query pair out into a list
                parameter_groups = query_parameters.split("&")
                # Make a dictionary of the parameters (this is how locust needs them)
                for result in parameter_groups:
                    # Split the parameters from the values
                    if "=" in result and "api_key" not in result:
                        key, value = result.split("=")
                        if not query_dict.get(key):
                            query_dict[key] = [value]
                        else:
                            query_dict[key].append(value)
            if query_dict:
                # Add to the endpoint/query lookup
                if not queries.get(endpoint):
                    queries[endpoint] = [query_dict]
                else:
                    if query_dict not in queries[endpoint]:
                        queries[endpoint].append(query_dict)

print(json.dumps(queries, indent=1))
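For a concrete sense of what the partition chains extract, here is the same logic applied to a single synthetic log line (the URL, key, and row format are illustrative, not real captured traffic):

```python
# A synthetic Kibana log column, shaped like the rows the script parses.
row = "GET https://api.open.fec.gov/v1/candidates/?per_page=20&cycle=2020&api_key=XYZ 200"

# Endpoint: everything after 'v1/' up to the first '?' or space.
endpoint = row.partition("v1/")[2].partition("?")[0].partition(" ")[0]

# Query string: everything between the '?' and the next space.
query_parameters = row.partition("?")[2].partition(" ")[0]

print(endpoint)          # candidates/
print(query_parameters)  # per_page=20&cycle=2020&api_key=XYZ
```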

@lbeaufort lbeaufort modified the milestones: Sprint 12.3, Sprint 12.4 May 19, 2020

lbeaufort commented May 26, 2020

  • Let the team know and set up a maintenance window in Pingdom, since this could cause downtime

In the console:

  • Update DB size (currently 4.8 in prod) - start with the writer.
  • Update the parameter group ("DB parameter group", NOT "DB cluster parameter group"). It usually needs to correspond with the instance size: fec-aurora-master for the writer and fec-aurora-replica-5 for the reader (we will clean these up later?)
  • Note: the reader(s) have enhanced monitoring enabled for autoscaling
  • Choose "apply immediately"
  • Repeat for reader instance
  • Add one instance (production currently at 3 instances)
    Steps: Select the cluster, Actions -> Add instance
  • Double check security groups
  • Set up autoscaling - we created a new policy that mirrored the policy for production docs
  • Run "warmup script" in stage for new reader - David will run this. this is a SQL package (pg_prewarm) that caches data in memory. How to run this? Can manually trigger with one statement one time after reboot, where does the script live, what machine runs it. DB130 and 029 on-prem servers. Could we run this with celery task? Currently on demand. David can share the script with the team. Best to run the readers individually. Best practice to run this from time to time - maybe after election? @dzhang-fec will document and share with the team.

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.DBInstance.Modifying.html
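pg_prewarm is invoked once per relation, so a warmup script boils down to one `SELECT pg_prewarm(...)` statement per hot table. A minimal sketch that only builds those statements (the table names are placeholders; the actual script and table list live with the DB team):

```python
# Hypothetical hot-table list -- the real warmup script is maintained by
# the DB team and currently run from the on-prem DB130/029 servers.
HOT_TABLES = ["ofec_sched_a_master", "ofec_candidate_detail_mv"]


def prewarm_statements(tables):
    """Build one pg_prewarm call per table; pg_prewarm loads the
    relation's pages into the buffer cache."""
    return [f"SELECT pg_prewarm('{t}');" for t in tables]
```

Whatever runs these (a celery task, a manual psql session) would simply execute each statement against the reader being warmed, one reader at a time.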

@lbeaufort

There were some issues with formatting the requests, so I also did some testing with the normal locust setup.
