
Research improving postgresql counts performance #3586

Closed
patphongs opened this issue Feb 27, 2019 · 6 comments

Comments

@patphongs (Member) commented Feb 27, 2019

What we're after:
Research has shown that PostgreSQL counts may be dragging down performance. By researching ways to improve count performance and speed, we may find a viable path to remove the two-year restriction from our website.

Observations:

  • Postgres versions may come into play. We are using 9.6. We need to see what cloud.gov is capable of.
  • Database caching or application caching may be a way to alleviate database burden. Can we increase those caches? This may affect replica lag time.

Completion criteria:

  • Compare and document API performance with and without counts
  • Research what's been written about postgres count performance
  • Check for compatible postgres version on govcloud
  • Upgrade postgres version (10.3?) and test results for counts performance
  • Identify next steps to making counts more efficient
  • Move to new implementation issue to improve counts time
@rjayasekera (Contributor) commented Mar 7, 2019

Helpful links:
https://dzone.com/articles/faster-postgresql-counting
https://www.citusdata.com/blog/2016/10/12/count-performance/

A note about count(1) vs count(*): one might think that count(1) would be faster because count(*) appears to consult the data for a whole row. However, the opposite is true. The star symbol is meaningless here, unlike its use in SELECT *. PostgreSQL parses the expression count(*) as a special case taking no arguments. (Historically the expression ought to have been defined as count().) On the other hand, count(1) takes an argument, and PostgreSQL has to check at every row to see that its argument, 1, is indeed still not NULL.
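
For anyone who wants to verify this locally, a minimal comparison (using a table referenced later in this thread; timings will vary with hardware and cache state):

-- Run each form under EXPLAIN ANALYZE and compare "Execution Time".
EXPLAIN ANALYZE SELECT count(*) FROM disclosure.fec_fitem_sched_a;
EXPLAIN ANALYZE SELECT count(1) FROM disclosure.fec_fitem_sched_a;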

@fecjjeng (Contributor) commented Mar 8, 2019

If the user searches for CMTE_ID = 'C00401224', the cmte_id index will not be used, since ActBlue represents more than 56% of the data in sched_a_2017_2018 (and one cannot force the PostgreSQL planner to use an index the way we can in Oracle). The cmte_id index is used when I query any of the other cmte_id values (for example, C00000935 or C00003418).
However, if one adds another criterion (such as a contbr_zip prefix), the query returns much faster, since the contbr_zip index will be used (its values are much more evenly distributed).

The same applies to entity_tp = 'IND': the index will not be used, since more than 92% of the sched_a_2017_2018 data is 'IND'. But if one chooses entity_tp = 'PTY', the index is used and the query returns much faster.

If users mostly query by contributor name, we might be OK, since I do not foresee any contributor name being as skewed in frequency as 'C00401224' or 'IND'.

More database power is still needed. When I tried 'C00000935' with the biggest 5 cycles in stg-replica1, it still took 3 minutes to return, even though the index was indeed used. I tried the same query in prd and had to kill it after 10 minutes.
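
A quick way to see the skew the planner is reacting to is to look at the column statistics (a sketch; it assumes ANALYZE has run recently so pg_stats is populated, and the exact table or partition name may differ):

-- Most common values and their frequencies for cmte_id; a single value with
-- a frequency near 0.56 explains why the planner avoids the index for it.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'disclosure'
  AND tablename = 'fec_fitem_sched_a'
  AND attname = 'cmte_id';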

@fecjjeng (Contributor) commented Mar 8, 2019

Example:
explain analyze
SELECT count(*)
FROM disclosure.fec_fitem_sched_a
WHERE disclosure.fec_fitem_sched_a.two_year_transaction_period = 2018

-- index not used in case of 'C00401224', but 'C00000935' will use this index
AND disclosure.fec_fitem_sched_a.cmte_id IN ('C00401224')

-- index not used in case of 'IND', but 'PTY' will use index
AND disclosure.fec_fitem_sched_a.entity_tp = 'IND'

-- index used
AND contbr_zip like '20854%'

-- index used
AND contributor_name_text @@ to_tsquery('Jeff');
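
One way to double-check that these are selectivity decisions rather than missing indexes is to discourage sequential scans for the session and re-run the plan (a diagnostic sketch only, not a fix; enable_seqscan is a standard planner setting):

-- With seq scans penalized, the planner will pick the index even for
-- 'C00401224'; if the index plan turns out slower, that confirms the
-- planner was right to avoid it in the first place.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT count(*)
FROM disclosure.fec_fitem_sched_a
WHERE disclosure.fec_fitem_sched_a.two_year_transaction_period = 2018
  AND disclosure.fec_fitem_sched_a.cmte_id IN ('C00401224');
RESET enable_seqscan;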

@rjayasekera (Contributor) commented

Currently we use the exact same query to get the number of records in the result set; in other words, we run the same query twice. But even if we did not get the count with that same query, we would still hit the same slowness when we run the query to fetch the details.

We cannot simply remove the counts; they are very useful to the user. Currently we have set a threshold of 500,000 rows: if the count is over that limit we return an estimate, otherwise we do an actual record count.
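
A sketch of that threshold pattern, based on the planner-estimate function from the Citus article linked above (the function name, threshold, and query here are illustrative, not our production code):

-- Pull the planner's row estimate out of plain EXPLAIN output.
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    rec  record;
    rows bigint;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

-- Only pay for the exact count when the estimate is under the threshold.
SELECT CASE
    WHEN count_estimate('SELECT 1 FROM disclosure.fec_fitem_sched_a') > 500000
    THEN count_estimate('SELECT 1 FROM disclosure.fec_fitem_sched_a')
    ELSE (SELECT count(*) FROM disclosure.fec_fitem_sched_a)
END AS row_count;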

We will open another issue to deal with slow queries, monitoring the logs to find which queries users are having difficulty with and dealing with them on a daily basis.

@rjayasekera (Contributor) commented

Closing this issue in favor of #3641

@rjayasekera (Contributor) commented

Row counts are much faster after the AWS RDS server upgrade.
Before:
select count(*) from disclosure.nml_sched_a;
+-----------+
|   count   |
|-----------|
| 435845431 |
+-----------+
SELECT 1
Time: 1520.199s (25 minutes)

After:
select count(*) from disclosure.nml_sched_a;
+-----------+
|   count   |
|-----------|
| 435845431 |
+-----------+
SELECT 1
Time: 343.040s (5 minutes)
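
For a whole-table count like the one above, the catalog estimate is effectively free if approximate numbers are acceptable (a sketch; reltuples is only as fresh as the last VACUUM/ANALYZE or autovacuum pass):

-- Planner's stored row estimate for the table, no scan required.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'disclosure.nml_sched_a'::regclass;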
