datastore dump + search performance #3523

Merged
merged 12 commits into master from 3523-datastore-search-performance on May 18, 2017

Conversation

4 participants
@wardi
Contributor

wardi commented Apr 7, 2017

I'm profiling dumping of datastore resources and the datastore_search API it uses, to identify the easiest improvements we can make.

Timings for python -m cProfile paster datastore dump on a datastore table with ~920k rows:

  • commit c113257
  • total time: 218.7 s - all percentages below relate to this
  • 68.8 % within datastore_search (datastore/logic/action)
    • 41.1 % within format_results (datastore/db)
      • 21.8 % convert (datastore/db)
      • 13.5 % fetchone (sqlalchemy)
    • 27.4 % within execute (sqlalchemy)
  • 27.7 % within writerow (datastore/writer)
    • 18.6 % writerow (unicodecsv)
    • 6.5 % _json_dump_nested (datastore/writer)

More than 40% of our time is spent in sqlalchemy fetchone + execute. Switching to the postgres COPY command instead of iterating over select results is promising. We can also skip intermediate total row calculations and see whether index-based offsets perform better for large tables.
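The COPY idea can be sketched like this (function names are mine, not the actual datastore code): instead of a SELECT followed by one fetchone() per row, postgres streams already-formatted rows to a file-like object in a single round trip.

```python
def build_copy_csv_sql(table, columns):
    """Build a COPY statement that has postgres emit CSV itself.
    Identifiers are double-quoted with embedded quotes doubled."""
    cols = ', '.join('"{0}"'.format(c.replace('"', '""')) for c in columns)
    return 'COPY (SELECT {0} FROM "{1}") TO STDOUT WITH (FORMAT csv)'.format(
        cols, table.replace('"', '""'))


def dump_csv(connection, table, columns, out_file):
    """Stream a whole table as CSV via a psycopg2 connection: no per-row
    fetchone() loop in python at all."""
    cursor = connection.cursor()
    cursor.copy_expert(build_copy_csv_sql(table, columns), out_file)
```

copy_expert writes directly to out_file, so memory use stays flat regardless of table size.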

Almost 30% of our time is spent formatting the results as CSV. If we lean on postgres to format the CSV for us, we could skip almost all of this work.

More than 20% of our time is spent converting results returned from postgres, using per-column type and python isinstance checks. We can investigate optimizations in the python code and use more efficient data libraries and conversions.
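The hot path being profiled here looks roughly like this (a simplified sketch, not the actual datastore/db convert code): every cell goes through an isinstance cascade before it can be serialized, and with ~920k rows that per-cell dispatch adds up.

```python
import datetime
import decimal


def convert(value):
    """Illustrative per-cell conversion: dispatch on python type to get a
    JSON/CSV-safe value. One cascade like this runs for every cell."""
    if value is None:
        return None
    if isinstance(value, (datetime.datetime, datetime.date)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return float(value)
    return value


row = (decimal.Decimal('1.5'), datetime.date(2017, 4, 7), u'text')
converted = [convert(v) for v in row]
```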

@wardi wardi self-assigned this Apr 4, 2017

Contributor

wardi commented Apr 7, 2017

With 856ae96 the time spent in datastore_search was significantly reduced:

| datastore_search | seconds for 920k records |
| --- | --- |
| before change | 150.4 |
| after change | 84.1 |

Action API users will see further benefit from JSON not needing to be re-encoded when sent to the client.

We're taking advantage of the new RawJSON support in simplejson 3.10 and the postgres row_to_json() function to render row JSON.

Overall time for dumping to CSV was only slightly improved because we're still converting from JSON -> python dictionaries -> CSV. We're also still looping over every row in python instead of processing a whole batch of rows at a time. Fixing those issues is next.
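The trick can be approximated with the stdlib (simplejson's RawJSON does this natively): postgres has already produced valid JSON per row via row_to_json(), so the response body can be assembled by splicing those strings rather than decoding and re-encoding them. The helper name here is hypothetical.

```python
import json


def render_records(raw_json_rows):
    """Splice pre-encoded JSON rows (e.g. output of row_to_json()) into a
    response body without ever parsing them in python."""
    return u'{"records": [%s]}' % u','.join(raw_json_rows)


body = render_records([u'{"id": 1}', u'{"id": 2}'])
```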

Contributor

wardi commented Apr 8, 2017

a553c1d hands off CSV, TSV and our JSON-lists-style dumps to postgres with a new records_format parameter. It also starts using COPY instead of looping over results. This code is running ~ 3.5x the speed of what we have now.

| datastore dump csv | seconds for 920k records |
| --- | --- |
| before this branch | 218.7 |
| after a553c1d | 62.8 |

This change is a little rough and I'd like to run separate profiles on each of the export formats because their code paths are quite different now. I also need to implement select filtering for the CSV and TSV output (not required for the dump command, but it should be supported by the API).

The new profile shows almost half our time being spent in queries not related to dumping the actual data, so there should be another opportunity for optimization there that will help all datastore_search calls.
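The core of the postgres-side rendering can be sketched like this (the helper name is mine; the fmt line mirrors this branch's approach): for records_format=lists each selected column is wrapped in to_json() so postgres emits JSON values directly, and python never touches individual cells.

```python
def select_fields(field_ids, records_format):
    """Build the select list for a datastore query. For 'lists' output,
    each column is rendered to JSON by postgres itself rather than
    converted row-by-row in python."""
    fmt = u'to_json({0})' if records_format == u'lists' else u'{0}'
    return u', '.join(fmt.format(u'"{0}"'.format(fid)) for fid in field_ids)
```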

Contributor

wardi commented Apr 8, 2017

Total calculation is unnecessary for dumping all rows, so d7ab29e disables that. This branch is now running at more than 6x the speed of master for dumping CSV files.

| datastore dump csv | seconds for 920k records |
| --- | --- |
| before this branch | 218.7 |
| after d7ab29e | 34.4 |

This time is dominated (>90%) by the postgres copy operation. We can explain and start optimizing the postgres query to go further, but after a 6x improvement I'm tempted to circle back around and collect some of the other performance numbers and make sure everything is still working properly.
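The saving comes from simply not issuing the count query when the caller doesn't need a total. A minimal sketch with hypothetical names (the real datastore_search code differs):

```python
def run_search(execute_sql, data_sql, count_sql, include_total):
    """Issue the count(*) query only when a total was requested.
    execute_sql is any callable that runs SQL and returns rows."""
    result = {u'records': execute_sql(data_sql)}
    if include_total:
        result[u'total'] = execute_sql(count_sql)[0][0]
    return result


# demo with a stub executor that records which queries actually run
executed = []


def stub(sql):
    executed.append(sql)
    return [[920000]]


no_total = run_search(stub, u'SELECT * FROM t', u'SELECT count(*) FROM t', False)
```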

@wardi wardi removed their assignment Apr 11, 2017

@amercader amercader requested review from amercader and removed request for amercader Apr 11, 2017

Member

amercader commented Apr 11, 2017

@benscott @aliceh75 This has the potential to affect IDataStore plugins (apart from providing significant performance improvements). As you were using it (IIRC) perhaps you want to give it a try with your instance to flag any potential issues?

Contributor

wardi commented Apr 14, 2017

More numbers, this time covering calls to datastore_search API with curl in a few different ways. I'm comparing efcfe97 (before include_total), bd361dc (before this PR) and f51610a. All numbers are the smallest "real" seconds from three runs of time curl .... Blank cells represent unsupported calls.

| parameters | A: efcfe97 | B: bd361dc | C: f51610a | A/B | A/C |
| --- | --- | --- | --- | --- | --- |
| distinct, 1 field | 1.002 s | 1.028 s | 0.952 s | 0.97 | 1.05 |
| distinct, 1 field, no total | | 0.650 s | 0.615 s | 1.54 | 1.63 |
| distinct, 10k | 2.746 s | 1.288 s | 0.684 s | 2.13 | 4.01 |
| distinct, 10k, no total | | 1.000 s | 0.371 s | 2.75 | 7.40 |
| 10k | 2.763 s | 1.259 s | 0.685 s | 2.19 | 4.03 |
| 10k, no total | | 0.969 s | 0.394 s | 2.85 | 7.01 |
| 10k, lists, no total | | | 0.401 s | | 6.89 |
| 10k, csv, no total | | | 0.158 s | | 17.49 |
| 100k | 11.883 s | 10.485 s | 3.420 s | 1.13 | 3.47 |
| 100k, no total | | 10.211 s | 3.106 s | 1.16 | 3.82 |
| 100k, lists, no total | | | 3.315 s | | 3.58 |
| 100k, csv, no total | | | 0.862 s | | 13.78 |

After this change, for 10k-row queries of datastore tables with ~1M rows, expect datastore_search to be ~4x as fast as CKAN 2.6 with no change to your client code, ~7x as fast when using include_total=false, and more than 17x as fast when using include_total=false and records_format=csv or tsv.
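For reference, the fast-path calls above can be built like this (the base URL and resource id are placeholders; the parameter names come from this PR):

```python
try:
    from urllib.parse import urlencode   # python 3
except ImportError:
    from urllib import urlencode         # python 2


def search_url(base, resource_id, limit, records_format=None,
               include_total=True):
    """Build a datastore_search API URL. Parameters are sorted so the
    query string is deterministic."""
    params = {'resource_id': resource_id, 'limit': limit}
    if not include_total:
        params['include_total'] = 'false'
    if records_format:
        params['records_format'] = records_format
    return '%s/api/3/action/datastore_search?%s' % (
        base, urlencode(sorted(params.items())))


url = search_url('http://localhost:5000', 'my-resource-id', 10000,
                 records_format='csv', include_total=False)
```

Something like `time curl "$url"` reproduces the timings in the table.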

Contributor

wardi commented Apr 16, 2017

Updated #3523 (comment) above with numbers for f51610a

Contributor

wardi commented Apr 21, 2017

My last change bumps up the page size for the dump controller from 10k to 32k rows, based on some timings that show it shaves 10-20s off our 920k row export: [chart: export time by page size]

Note that xml is the last transformation still being done in python (with cElementTree). If we generate our xml output in postgres it will have overall performance similar to the json format.

This work is complete. Here's a comparison of the before and after #3523 export times for my 920k row dataset in each format:
[chart: export times by format, before and after]
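The remaining python-side xml path looks roughly like this per-row element construction (a simplified sketch with illustrative tag names; the PR-era code used cElementTree, which is merged into ElementTree on python 3):

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def row_xml_bytes(row):
    """Serialize one record dict as a <row> element, one child element per
    column. Built per row in python, which is why xml lags the formats
    that postgres renders directly."""
    root = Element(u'row')
    for name, value in sorted(row.items()):
        SubElement(root, name).text = u'' if value is None else u'%s' % value
    return tostring(root)
```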

@TkTech TkTech added this to In Progress in 15 million datasets Apr 25, 2017

@wardi wardi referenced this pull request May 5, 2017

Closed

datastore upsert performance #3556

@amercader

This looks great @wardi. Definitely worth a blog post once the rest of the DataStore work is merged!

Only two minor questions before merging.

@@ -28,7 +28,7 @@
 boolean_validator = get_validator('boolean_validator')
 DUMP_FORMATS = 'csv', 'tsv', 'json', 'xml'
-PAGINATE_BY = 10000
+PAGINATE_BY = 32000

@amercader

amercader May 16, 2017

Member

Is this some magic number? Just curious

@wardi

wardi May 16, 2017

Contributor

It comes from these benchmarks: #3523 (comment). There's not much benefit to going higher, but 32k is a pretty big improvement over 10k, especially for csv.

:param records_format: the format for the records return value:
'objects' (default) list of {fieldname1: value1, ...} dicts,
'lists' list of [value1, value2, ...] lists,
'csv' string containing comma-separated values with no header,

@amercader

amercader May 16, 2017

Member

I can see how this (not returning headers) is useful to stream rows, e.g. to populate a widget as you scroll, but I would also expect the headers to be returned if I was using the API to get a subset of the data.

What are your views on this? Is it worth adding another parameter that allows you to return headers or not?

@TkTech

TkTech May 16, 2017

Contributor

It's probably worth adding; it's a pretty typical option for APIs returning CSV.

@wardi

wardi May 16, 2017

Contributor

We already have the dump controller that streams CSV with headers. If we want to add filtering to produce custom CSVs, I'd rather add it to that controller.

@wardi

wardi May 16, 2017

Contributor

Including a header row in this call would also 100% duplicate what's returned in the "fields" list.

@amercader

amercader May 18, 2017

Member

Sorry, I thought that when using records_format=csv/tsv the output was actually a CSV/TSV file, not that the records property contained a CSV string. It makes sense now.

@amercader amercader merged commit 59bf205 into master May 18, 2017

3 checks passed:

  • ci/circleci — Your tests passed on CircleCI!
  • continuous-integration/travis-ci/pr — The Travis CI build passed
  • continuous-integration/travis-ci/push — The Travis CI build passed

@amercader amercader deleted the 3523-datastore-search-performance branch May 18, 2017

records_format = data_dict.get(u'records_format')
json_values = records_format in (u'objects', u'lists')
for field_id in field_ids:
fmt = u'to_json({0})' if records_format == u'lists' else u'{0}'

@wardi

wardi Jun 30, 2017

Contributor

@amercader Looks like this change requires postgres 9.3. Should I rewrite it with the functions available in 9.2, or should we bump the postgres version requirement?

Member

amercader commented Jul 3, 2017

Is it going to be difficult to rewrite with 9.2 support? This is meant to be in CKAN 2.7, so perhaps it's a bit late to bump the Postgres requirement without warning.
