
Baseline metrics - part 1: set up read replicas on stage #3618

Closed
4 of 5 tasks
patphongs opened this issue Mar 12, 2019 · 11 comments
Comments

@patphongs
Member

patphongs commented Mar 12, 2019

What we're after: We would like to start tracking our performance through the use of metrics. But before we start we need to take a baseline measurement.

Completion criteria:

  • Test against one read DB replica of stage. Increase the number of replicas based on performance testing of the schedule_a and schedule_b endpoints.

  • Take performance baseline of all the endpoints and document

  • Test slower query endpoints: schedule_a and schedule_b
    Receipts - 16


    /committee/{committee_id}/schedules/schedule_a/by_employer/
    /committee/{committee_id}/schedules/schedule_a/by_occupation/
    /committee/{committee_id}/schedules/schedule_a/by_size/
    /committee/{committee_id}/schedules/schedule_a/by_state/
    /committee/{committee_id}/schedules/schedule_a/by_zip/
    /schedules/schedule_a/
    /schedules/schedule_a/by_employer/
    /schedules/schedule_a/by_occupation/
    /schedules/schedule_a/by_size/
    /schedules/schedule_a/by_size/by_candidate/
    /schedules/schedule_a/by_state/
    /schedules/schedule_a/by_state/by_candidate/
    /schedules/schedule_a/by_state/totals/
    /schedules/schedule_a/by_zip/
    /schedules/schedule_a/efile/
    /schedules/schedule_a/{sub_id}/

    Disbursements - 9


    /committee/{committee_id}/schedules/schedule_b/by_purpose/
    /committee/{committee_id}/schedules/schedule_b/by_recipient/
    /committee/{committee_id}/schedules/schedule_b/by_recipient_id/
    /schedules/schedule_b/
    /schedules/schedule_b/by_purpose/
    /schedules/schedule_b/by_recipient/
    /schedules/schedule_b/by_recipient_id/
    /schedules/schedule_b/efile/
    /schedules/schedule_b/{sub_id}/

Updated the locust file and did the first round of testing.

  • [ ] Shut down replicas when we're done measuring the baseline
  • Ensure testing is completed within one sprint so that we can shut down the additional replicas when we're done and no longer need them
    Will run another two rounds of locust testing today and tomorrow and put things together, then generate a quick report based on that.

Locust testing results are uploaded here:
https://docs.google.com/spreadsheets/d/1SCrvrFEbCV57rtrXpClWrFg-lAeqyfjU4WDr6dBtW6Y/edit#gid=2097394488
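For the baseline write-up, the raw per-request timings exported by locust can be reduced to per-endpoint percentiles. A minimal stdlib-only sketch (the sample data below is made up for illustration, not actual stage results):

```python
import statistics
from collections import defaultdict
from math import ceil

def percentile(sorted_times, p):
    """Nearest-rank percentile (p in 0..100) on an ascending-sorted list."""
    k = max(0, ceil(p / 100 * len(sorted_times)) - 1)
    return sorted_times[k]

def summarize(samples):
    """Aggregate (endpoint, response_ms) samples into per-endpoint baseline stats."""
    by_endpoint = defaultdict(list)
    for endpoint, ms in samples:
        by_endpoint[endpoint].append(ms)
    report = {}
    for endpoint, times in by_endpoint.items():
        times.sort()
        report[endpoint] = {
            "count": len(times),
            "median_ms": statistics.median(times),
            "p95_ms": percentile(times, 95),
            "max_ms": times[-1],
        }
    return report

# Made-up samples standing in for a locust run's per-request timings
samples = [
    ("/schedules/schedule_a/", 850),
    ("/schedules/schedule_a/", 1200),
    ("/schedules/schedule_b/", 300),
    ("/schedules/schedule_b/", 450),
]
print(summarize(samples))
```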

  • Identify next steps and create followup ticket, including setting new performance goals and establishing a check-in process for regular tracking:

For schedule_a and schedule_b testing/benchmarking, we need to re-collect the data after the two-year restriction removal is deployed to stage/prod.

For regular check-ins, I think the slow-query analysis will do most of the important work here.

Other work done:

  1. Prepared a PR fixing the testing-data issue in the locust file; also added two endpoints, schedule_c/sub_id and schedule_d/sub_id, to endpoints_walk.py:
    Feature/update locust and endpoints #3698

  2. Python script for batch queries and query profiling (will demo).

  3. Script for batch API walks and timing (will demo).
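A batch API walk and timing script could look roughly like this stdlib-only sketch. The endpoint templates are taken from the lists above; the fetch function is injectable so the walk can be exercised without network access, and the committee ID and API key passed in are placeholders:

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://api-stage.open.fec.gov/v1"

# A few endpoint templates from the lists above; {committee_id} gets substituted.
ENDPOINTS = [
    "/schedules/schedule_a/",
    "/schedules/schedule_a/by_state/",
    "/committee/{committee_id}/schedules/schedule_b/by_purpose/",
]

def walk(endpoints, committee_id, api_key, fetch=urlopen):
    """Time one GET per endpoint; returns [(path, seconds)].

    `fetch` is injectable so the walk can be dry-run without hitting the API.
    """
    timings = []
    for template in endpoints:
        path = template.format(committee_id=committee_id)
        url = BASE + path + "?" + urlencode({"api_key": api_key, "per_page": 20})
        start = time.perf_counter()
        fetch(url)  # response body is discarded; we only measure latency
        timings.append((path, time.perf_counter() - start))
    return timings
```

Calling `walk(ENDPOINTS, "C00401224", "DEMO_KEY")` would record one latency per endpoint; locust remains the tool for concurrent load.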

Part 2: fecgov/fec-cms#2743 (We closed that issue.)

@patphongs patphongs changed the title API baseline metrics Baseline metrics Mar 12, 2019
@patphongs patphongs added this to the Sprint 8.4 milestone Mar 12, 2019
@JonellaCulmer JonellaCulmer changed the title Baseline metrics Baseline metrics - part 1 Mar 14, 2019
@lbeaufort lbeaufort changed the title Baseline metrics - part 1 Baseline metrics - part 1: set up read replicas on stage Mar 14, 2019
@pkfec pkfec modified the milestones: Sprint 8.4, Sprint 8.5 Mar 28, 2019
@qqss88
Contributor

qqss88 commented Apr 2, 2019

@pkfec Did we talk about group testing on this one, with each developer taking care of one endpoint? Or should each developer come up with some test cases for each endpoint, and I'll put things together and automate the metrics-collection process?

@qqss88
Contributor

qqss88 commented Apr 2, 2019

Since most of the slow queries are from schedule_a and schedule_b, I'll make some test cases for those endpoints first.

@pkfec
Contributor

pkfec commented Apr 2, 2019

@qqss88 The idea was to test each API endpoint after lifting the two-year restrictions. We talked about testing the endpoints by distributing them among the developers. Once your PR is reviewed and ready to be merged, in the same PR we can assign the endpoints and ping the developers to ask them to test.


@qqss88
Contributor

qqss88 commented Apr 4, 2019

Here is a link to a previous locust test for reference:
https://drive.google.com/drive/folders/1F4lFVVC6gFfN7rwhfNl41WvhUBkksimp?ths=true

@qqss88
Contributor

qqss88 commented Apr 10, 2019

Added sched_a_multi_cycle testing to the locust file, fixed some data issues in the existing locust file, and tested it against prod and stage. Will summarize the results later on.

@patphongs
Member Author

Moving to sprint 8.6 to do additional performance baselines in stage after API release cut today.

@patphongs patphongs modified the milestones: Sprint 8.5, Sprint 8.6 Apr 16, 2019
@pkfec
Contributor

pkfec commented May 1, 2019

@patphongs @qqss88 are you still doing performance baseline testing in stage? Do you have an estimate of how long it will take to wrap up the testing?

@PaulClark2
Contributor

@qqss88 I'd still like to see a concise summary of baseline and testing results.

@qqss88
Contributor

qqss88 commented May 1, 2019

For sched_a and sched_b queries, because of the amount of data we have, the main factors affecting query performance are (assuming server resource usage and DB caching are not factors here):

  1. how many two-year partitions are involved in the query
  2. whether the query takes advantage of an index scan (data skewness could be an issue)
  3. the size of the query result set returned for sorting and pagination (data skewness again)

Considering those factors, data skewness for a given argument value plays an important role when a query is submitted. I did some data-skewness analysis on the sched_a and sched_b arguments involved in the API; here is a quick summary:

Top skewed argument values in sched_a:

  • cmte_id = 'C00401224' (ActBlue) takes around 70% in the 2017-2018 cycle
  • cmte_tp = 'W' takes more than 70% in the 2017/2018 cycle
  • line_num = '11AI' takes around 99% in the 2017/2018 cycle
  • contribution state: skewness is not that bad; top 3 are CA, NY, and TX
  • top occupations: 'retired' and 'not employed' (combined > 50%)
  • zip code: skewness is not bad; top zip code is 20001
  • contribution cities: skewness not bad; top 3 cities: New York, Washington, Los Angeles
  • contributor names: skewness not bad; top 3 names: 'Gilman, Priscilla', 'Immasche, Sonia', 'Davis, Susan'
  • employers: skewed toward 4 values: not employed, none, retired, self employed
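Skewness figures like the ones above reduce to "what share of all rows does the top value hold." A minimal sketch of computing such shares from (value, row_count) pairs (the counts below are made up for illustration, not the real sched_a distribution):

```python
def top_value_shares(counts, top_n=3):
    """Given {value: row_count}, return the top_n values with their share of all rows."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(value, count / total) for value, count in ranked[:top_n]]

# Made-up line_num distribution illustrating heavy skew toward '11AI'
line_num_counts = {"11AI": 9_900_000, "11AII": 60_000, "12": 40_000}
print(top_value_shares(line_num_counts, top_n=1))
```

The input counts would come from a `GROUP BY` over each filterable column on the replica.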

As a result, most queries with a single top argument value plus the most recent 3 cycles load OK (from seconds to around 1 min). But these 3 queries do not load within the timeout window (502 error):

Top committee type 'W' not loading with cycles 14/16/18:
https://api-stage.open.fec.gov/v1/schedules/schedule_a?sort_hide_null=false&per_page=30&is_individual=true&two_year_transaction_period=2016&two_year_transaction_period=2018&two_year_transaction_period=2014&sort=-contribution_receipt_date&sort_nulls_last=false&api_key=DEMO_KEY&recipient_committee_type=W

Top line_num not loading with cycles 14/16/18:
https://api-stage.open.fec.gov/v1/schedules/schedule_a?sort_hide_null=false&per_page=30&is_individual=true&two_year_transaction_period=2016&two_year_transaction_period=2018&two_year_transaction_period=2014&sort=-contribution_receipt_date&sort_nulls_last=false&api_key=DEMO_KEY&line_number=11AI

'not employed' employer filter not loading:
https://api-stage.open.fec.gov/v1/schedules/schedule_a?sort_hide_null=false&per_page=30&is_individual=true&two_year_transaction_period=2018&two_year_transaction_period=2014&sort=-contribution_receipt_date&sort_nulls_last=false&api_key=DEMO_KEY&contributor_employer=not employed

sched_b:

sched_b's general performance is much better (I think mainly due to the smaller amount of data compared to sched_a). Here is a quick summary of data skewness:

  1. cmte_id: again, more than 90% is ActBlue
  2. cmte_tp: more than 90% is 'W'
  3. line_num: more than 90% is '23'
  4. disbursement purpose: more than 90% is 'other'
  5. city: around 60% is 'Washington'
  6. state: around 60% is from 'DC'
  7. recipient_name: skewness not bad; top value is 'Democratic Congressional Campaign Committee'

Performance is generally good; a top single argument value with the 14/16/18 cycles comes back within seconds. But I failed on one filter combination:

14/16/18 + line_num 23 + ActBlue + travel
https://api-stage.open.fec.gov/v1/schedules/schedule_b/?spender_committee_type=W&sort=-disbursement_date&line_number=23&per_page=20&sort_null_only=false&disbursement_purpose_category=other&sort_hide_null=false&api_key=DEMO_KEY&committee_id=C00401224&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&disbursement_description=travel
(slow-query analysis needed on this one; my current guess is that a non-index scan is involved)

This test is database-oriented; it is not a thorough test identifying all slow queries. Adding some server-side resource-consumption metrics would be better.
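One way to check the non-index-scan guess is to run the underlying query under `EXPLAIN (ANALYZE, BUFFERS)` on the replica. A small sketch of wrapping a query for plan inspection; the table and column names in the sample query are assumptions for illustration, not the actual schema:

```python
def explain_wrap(sql):
    """Prefix a query with EXPLAIN (ANALYZE, BUFFERS) for plan inspection."""
    return "EXPLAIN (ANALYZE, BUFFERS) " + sql.strip().rstrip(";") + ";"

# Hypothetical sched_b query mirroring the failing filter combination above;
# table/column names are placeholders, not the real schema.
query = """
SELECT *
FROM disclosure.sched_b
WHERE cmte_id = 'C00401224'
  AND line_num = '23'
  AND disb_desc ILIKE '%travel%'
ORDER BY disb_dt DESC
LIMIT 20;
"""
print(explain_wrap(query))
```

Running the wrapped statement through psql against the read replica would show whether the planner falls back to a sequential scan on the skewed filter.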

@PaulClark2
Contributor

thanks @qqss88


6 participants