
Research improving postgresql counts performance #3586

Closed
patphongs opened this issue Feb 27, 2019 · 6 comments

Comments

@patphongs (Member) commented Feb 27, 2019

What we're after:
Research has shown that PostgreSQL counts may be dragging down performance. By researching ways to improve count performance and speed, we may find a viable path to remove the two-year restriction from our website.

Observations:

  • Postgres versions may come into play. We are using 9.6. We need to see what cloud.gov is capable of.
  • Database caching or application caching may be a way to alleviate database burden. Can we increase those caches? This may affect replica lag time.

Completion criteria:

  • Compare and document API performance with and without counts
  • Research what's been written about postgres count performance
  • Check for compatible postgres version on govcloud
  • Upgrade postgres version (10.3?) and test results for counts performance
  • Identify next steps to making counts more efficient
  • Move to new implementation issue to improve counts time
@rjayasekera (Contributor) commented Mar 7, 2019

Helpful links:
https://dzone.com/articles/faster-postgresql-counting
https://www.citusdata.com/blog/2016/10/12/count-performance/

A note about count(1) vs count(*): one might think that count(1) would be faster because count(*) appears to consult the data for a whole row. However, the opposite is true. The star symbol is meaningless here, unlike its use in SELECT *. PostgreSQL parses the expression count(*) as a special case taking no arguments. (Historically the expression ought to have been defined as count().) On the other hand, count(1) takes an argument, and PostgreSQL has to check at every row to see that its argument, 1, is indeed still not NULL.
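
For anyone who wants to verify this locally, a minimal comparison (using a table referenced later in this thread; timings will vary with hardware and cache state):

-- Run each form under EXPLAIN ANALYZE and compare "Execution Time".
EXPLAIN ANALYZE SELECT count(*) FROM disclosure.fec_fitem_sched_a;
EXPLAIN ANALYZE SELECT count(1) FROM disclosure.fec_fitem_sched_a;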

@fecjjeng (Contributor) commented Mar 8, 2019

If the user searches for CMTE_ID = 'C00401224', the cmte_id index will not be used, since ActBlue represents more than 56% of the data in sched_a_2017_2018 (and one cannot force the PostgreSQL planner to use an index the way we can in Oracle). The cmte_id index is used when I query any of the other cmte_id values (for example, C00000935 or C00003418).
However, if one adds another criterion (such as a contbr_zip prefix), the query returns much faster, since the contbr_zip index will be used (its values are much more evenly distributed).

The same applies to entity_tp = 'IND': the index will not be used, since more than 92% of the sched_a_2017_2018 data is 'IND'. But if one chooses entity_tp = 'PTY', the index is used and the query returns much faster.

If users mostly query by contributor name, we might be OK, since I do not foresee any contributor name being as skewed in frequency as 'C00401224' or 'IND'.

More database power is still needed. When I tried 'C00000935' with the biggest 5 cycles in stg-replica1, it still took 3 minutes to return, even though the index was indeed used. I tried the same query in prd and had to kill it after 10 minutes.
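
A quick way to see the skew the planner is reacting to is to look at the column statistics (a sketch; it assumes ANALYZE has run recently so pg_stats is populated, and the exact table or partition name may differ):

-- Most common values and their frequencies for cmte_id; a single value with
-- a frequency near 0.56 explains why the planner avoids the index for it.
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'disclosure'
  AND tablename = 'fec_fitem_sched_a'
  AND attname = 'cmte_id';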

@fecjjeng (Contributor) commented Mar 8, 2019

Example:
explain analyze
SELECT count(*)
FROM disclosure.fec_fitem_sched_a
WHERE disclosure.fec_fitem_sched_a.two_year_transaction_period = 2018

-- index not used in case of 'C00401224', but 'C00000935' will use this index
AND disclosure.fec_fitem_sched_a.cmte_id IN ('C00401224')

-- index not used in case of 'IND', but 'PTY' will use index
AND disclosure.fec_fitem_sched_a.entity_tp = 'IND'

-- index used
AND contbr_zip like '20854%'

-- index used
AND contributor_name_text @@ to_tsquery('Jeff');
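
One way to double-check that these are selectivity decisions rather than missing indexes is to discourage sequential scans for the session and re-run the plan (a diagnostic sketch only, not a fix; enable_seqscan is a standard planner setting):

-- With seq scans penalized, the planner will pick the index even for
-- 'C00401224'; if the index plan turns out slower, that confirms the
-- planner was right to avoid it in the first place.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT count(*)
FROM disclosure.fec_fitem_sched_a
WHERE disclosure.fec_fitem_sched_a.two_year_transaction_period = 2018
  AND disclosure.fec_fitem_sched_a.cmte_id IN ('C00401224');
RESET enable_seqscan;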

@rjayasekera (Contributor) commented

Currently we use the exact same query to get the number of records in the result set; in other words, we run the same query twice. But even if we did not get the count with that same query, we would still hit the same slowness when we run the query to fetch the details.

We cannot simply remove the counts; they are very useful to the user. Currently we have set a threshold of 500,000 rows: if the count is over that limit we return an estimate, otherwise we do an actual record count.
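
A sketch of that threshold pattern, based on the planner-estimate function from the Citus article linked above (the function name, threshold, and query here are illustrative, not our production code):

-- Pull the planner's row estimate out of plain EXPLAIN output.
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    rec  record;
    rows bigint;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

-- Only pay for the exact count when the estimate is under the threshold.
SELECT CASE
    WHEN count_estimate('SELECT 1 FROM disclosure.fec_fitem_sched_a') > 500000
    THEN count_estimate('SELECT 1 FROM disclosure.fec_fitem_sched_a')
    ELSE (SELECT count(*) FROM disclosure.fec_fitem_sched_a)
END AS row_count;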

We will open another issue to deal with slow queries, monitoring the logs to find which queries users are having difficulty with and dealing with them on a daily basis.

@rjayasekera (Contributor) commented

Closing this issue in favor of #3641

@rjayasekera (Contributor) commented

Row counts are much faster after the AWS RDS server upgrade.
Before:
select count(*) from disclosure.nml_sched_a;
+-----------+
|   count   |
|-----------|
| 435845431 |
+-----------+
SELECT 1
Time: 1520.199s (25 minutes)

After:
select count(*) from disclosure.nml_sched_a;
+-----------+
|   count   |
|-----------|
| 435845431 |
+-----------+
SELECT 1
Time: 343.040s (5 minutes)
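
For a whole-table count like the one above, the catalog estimate is effectively free if approximate numbers are acceptable (a sketch; reltuples is only as fresh as the last VACUUM/ANALYZE or autovacuum pass):

-- Planner's stored row estimate for the table, no scan required.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'disclosure.nml_sched_a'::regclass;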
