perf: Unbuffered cursors for large result sets #24365

Merged
merged 1 commit into from
Jan 16, 2024

Conversation

ankush
Member

@ankush ankush commented Jan 15, 2024

If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.
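The buffered-versus-streaming difference can be sketched with stdlib `sqlite3` (illustrative only; this PR targets MySQL/MariaDB, where the streaming behaviour comes from pymysql's `SSCursor`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer)")
conn.executemany("insert into t values (?)", [(i,) for i in range(5)])

# Buffered: fetchall() materializes the whole result set in memory at once.
buffered = conn.execute("select id from t order by id").fetchall()

# Unbuffered-style: iterating the cursor pulls one row at a time,
# so only the current row needs to be resident while processing.
streamed = [row[0] for row in conn.execute("select id from t order by id")]
```

Both produce the same rows; only the peak memory profile differs, which is exactly what the measurements below show.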

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
  1008    104.5 MiB    104.5 MiB           1    @profile
  1009                                          def test_run_memory_profile(self):
  1010    104.5 MiB      0.0 MiB           1            frappe.db.sql("select * from `tabGL Entry` limit 1")  # warmup
  1011    195.5 MiB     91.0 MiB       50001            for gl in frappe.db.sql("select * from `tabGL Entry` order by modified limit 50000", as_dict=True, as_iterator=True):
  1012    195.5 MiB      0.0 MiB       50000                    continue  # consume iterator
  1013
  1014    109.0 MiB    -86.5 MiB           1            pass # notice drop due to gc trigger

After:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
  1013    104.5 MiB    104.5 MiB           1    @profile
  1014                                          def test_reads(self):
  1015    105.0 MiB      0.0 MiB           2            with frappe.db.unbuffered_cursor():
  1016    104.5 MiB      0.0 MiB           1                    frappe.db.sql("select * from `tabGL Entry` limit 1")  # warmup
  1017    105.0 MiB      0.5 MiB       50002                    for gl in frappe.db.sql(
  1018    104.5 MiB      0.0 MiB           1                            "select * from `tabGL Entry` order by modified limit 50000", as_dict=True, as_iterator=True
  1019                                                          ):
  1020    105.0 MiB      0.0 MiB       50000                            continue  # just consume the iterator
  1021    105.0 MiB      0.0 MiB           1                    pass

Extends #19810
Closes #18826
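For reference, a minimal sketch of what `as_iterator=True` semantics look like (a hypothetical helper, not the actual frappe implementation): a generator converts and yields one row at a time, so no full Python list of results is ever materialized:

```python
def sql_as_iterator(cursor, columns, as_dict=False):
    """Yield rows one at a time, optionally converted to dicts.

    `cursor` can be any iterable of row tuples (e.g. a server-side
    cursor); nothing is kept alive beyond the current row.
    """
    for row in cursor:
        yield dict(zip(columns, row)) if as_dict else row

# Consumption looks the same as iterating a list, but memory stays flat:
rows = iter([(1, "INV-001"), (2, "INV-002")])
first = next(sql_as_iterator(rows, ("idx", "name"), as_dict=True))
```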

@ankush ankush marked this pull request as ready for review January 15, 2024 14:58
@ankush ankush requested review from a team and surajshetty3416 and removed request for a team January 15, 2024 14:58

codecov bot commented Jan 15, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (03b6d8a) 62.12% compared to head (ab96b60) 62.02%.
Report is 10 commits behind head on develop.

❗ Current head ab96b60 differs from pull request most recent head ff88fa0. Consider uploading reports for the commit ff88fa0 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #24365      +/-   ##
===========================================
- Coverage    62.12%   62.02%   -0.11%     
===========================================
  Files          786      786              
  Lines        74999    75139     +140     
  Branches      6422     6422              
===========================================
+ Hits         46596    46607      +11     
- Misses       24743    24872     +129     
  Partials      3660     3660              
Flag Coverage Δ
server 70.93% <87.50%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.
@ankush ankush merged commit a2525e5 into frappe:develop Jan 16, 2024
20 checks passed
@ankush ankush deleted the unbuffered_queries branch January 16, 2024 05:30
ankush added a commit that referenced this pull request Jan 16, 2024
If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.
ankush added a commit that referenced this pull request Jan 16, 2024
* feat: `frappe.db.sql` results as iterator

- Also avoid self.last_result that holds on to large result set reference.

(cherry picked from commit 588157d)

# Conflicts:
#	frappe/database/database.py

* perf: avoid duplicate copies of result set

When as_list, as_dict is done we hold on to original result set until
next query is performed. This can be HUGE for large queries.

(cherry picked from commit d5b2706)

* test: add perf test for references

(cherry picked from commit 03b6d8a)

* chore: conflict

* perf: Unbuffered cursors for large result sets (#24365)

If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.

---------

Co-authored-by: Ankush Menat <ankush@frappe.io>
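The "avoid duplicate copies of result set" commit above can be illustrated with a sketch (a hypothetical helper, not the frappe code itself): convert rows while releasing the raw tuples, instead of holding the raw list and the converted list alive at the same time:

```python
def rows_as_dicts(raw_rows, columns):
    # Pop from the raw list while building the converted list, so each
    # raw tuple becomes garbage-collectable as soon as it is converted,
    # rather than both full lists coexisting until the next query.
    result = []
    while raw_rows:
        result.append(dict(zip(columns, raw_rows.pop())))
    result.reverse()  # pop() consumed the list back-to-front
    return result
```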
frappe-pr-bot pushed a commit that referenced this pull request Jan 16, 2024
# [15.10.0](v15.9.0...v15.10.0) (2024-01-16)

### Bug Fixes

* add a check for `gpg` existing ([f0d65f1](f0d65f1))
* add empty space for notification mark read ([#24276](#24276)) ([e566f51](e566f51))
* check if autoname is promt before setting __newname ([9f08ab2](9f08ab2))
* collapse sidebar on picking workspace ([#24312](#24312)) ([#24314](#24314)) ([b3ef407](b3ef407))
* convert status field data to String before guessing the style ([#24226](#24226)) ([#24289](#24289)) ([1f5fb04](1f5fb04))
* don't add fallback for child table ([#24105](#24105)) ([1de3db8](1de3db8))
* Error when displaying dashboard with number card using average and sum functions ([#23883](#23883)) ([#24287](#24287)) ([5cc2281](5cc2281))
* Handle edge case while searching in current context ([460e1c2](460e1c2))
* include workspaces without domain restriction ([2f21a76](2f21a76))
* Make as_iterator work when there are no child queries ([55a26bf](55a26bf))
* **minor:** add optional chaining for this.$input ([#24340](#24340)) ([1302f08](1302f08))
* **minor:** check if markdown_preview exists ([#24336](#24336)) ([b512ad9](b512ad9))
* **minor:** increase rate limit for web form ([#24295](#24295)) ([#24297](#24297)) ([f1c139d](f1c139d))
* **minor:** return if no steps are defined. ([#24338](#24338)) ([373b0d4](373b0d4))
* misc ([#24303](#24303)) ([#24305](#24305)) ([3d515f2](3d515f2))
* mobile sidebar disappearing ([#24316](#24316)) ([#24342](#24342)) ([b21671b](b21671b))
* **mobile-ui:** tabs should scroll instead of stack ([#24309](#24309)) ([#24311](#24311)) ([fccf204](fccf204))
* **MultiCheck:** Use df.sort_options to enable/disable sort ([#24202](#24202)) ([#24291](#24291)) ([2a87904](2a87904))
* pass parent doctype on dashboard chart ([#24236](#24236)) ([#24238](#24238)) ([5a506dd](5a506dd))
* print perm check logs from DB query (backport [#24263](#24263)) ([#24268](#24268)) ([74eaaa5](74eaaa5))
* **response:** fixup non-ASCII character filenames ([9c6a58e](9c6a58e))
* sanitize html instead of escaping when creating/updating workspace ([#24284](#24284)) ([0be6579](0be6579))
* select field should not have debounce ([dc076e1](dc076e1))
* **sentry:** set scope for background jobs ([ed21f11](ed21f11))
* set correct recipient when reply to own email ([#24256](#24256)) ([#24260](#24260)) ([0b5923f](0b5923f))
* translate show all activity label ([#24363](#24363)) ([#24364](#24364)) ([4d2c3e5](4d2c3e5))
* **UX:** show status indicator in moblie view ([#24306](#24306)) ([#24308](#24308)) ([5940ce5](5940ce5))

### Features

* `frappe.db.sql` results `as_iterator` (backport [#19810](#19810)) ([#24346](#24346)) ([99a3a35](99a3a35)), closes [#24365](#24365)
* Skip locked rows while selecting ([#24298](#24298)) ([#24302](#24302)) ([09ef3d6](09ef3d6))
mergify bot added a commit that referenced this pull request Jan 28, 2024
* feat: `frappe.db.sql` results as iterator

- Also avoid self.last_result that holds on to large result set reference.

(cherry picked from commit 588157d)

# Conflicts:
#	frappe/database/database.py

* perf: avoid duplicate copies of result set

When as_list, as_dict is done we hold on to original result set until
next query is performed. This can be HUGE for large queries.

(cherry picked from commit d5b2706)

* test: add perf test for references

(cherry picked from commit 03b6d8a)

* chore: conflict

* perf: Unbuffered cursors for large result sets (#24365)

If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.

---------

Co-authored-by: Ankush Menat <ankush@frappe.io>
(cherry picked from commit 99a3a35)

# Conflicts:
#	frappe/database/database.py
#	frappe/database/mariadb/database.py
#	pyproject.toml
ankush added a commit that referenced this pull request Jan 29, 2024
#24346) (#24562)

* feat: `frappe.db.sql` results `as_iterator` (backport #19810) (#24346)

* feat: `frappe.db.sql` results as iterator

- Also avoid self.last_result that holds on to large result set reference.

(cherry picked from commit 588157d)

# Conflicts:
#	frappe/database/database.py

* perf: avoid duplicate copies of result set

When as_list, as_dict is done we hold on to original result set until
next query is performed. This can be HUGE for large queries.

(cherry picked from commit d5b2706)

* test: add perf test for references

(cherry picked from commit 03b6d8a)

* chore: conflict

* perf: Unbuffered cursors for large result sets (#24365)

If you're reading thousands of rows from MySQL, the default behaviour is to
read all of them into memory at once.

One of the use cases for reading large result sets is reporting, where a lot
of data is read and then processed in Python. Each row, however, is not used
again but still consumes memory until the entire function exits.

SSCursor (Server-Side Cursor) allows fetching one row at a time.

Note: This is slower than fetching everything at once AND risks connection
loss. So, don't use it as a crutch. If possible, rewrite the code so
processing is done in SQL.

---------

Co-authored-by: Ankush Menat <ankush@frappe.io>
(cherry picked from commit 99a3a35)

# Conflicts:
#	frappe/database/database.py
#	frappe/database/mariadb/database.py
#	pyproject.toml

* chore: conflicts

* chore: remove test for dead functionality

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Ankush Menat <ankush@frappe.io>
frappe-pr-bot pushed a commit that referenced this pull request Jan 30, 2024
# [14.64.0](v14.63.0...v14.64.0) (2024-01-30)

### Bug Fixes

* **Custom Field:** default fieldname in rename fieldname prompt ([#24492](#24492)) ([#24580](#24580)) ([7cdda1e](7cdda1e))
* **grid_row:** sort options based on selected data first, so as to maintain order ([b0e4b19](b0e4b19))
* ignore dead columns in user_settings ([#24572](#24572)) ([#24573](#24573)) ([5d2441d](5d2441d))
* improve translatability of search results ([#24498](#24498)) ([a74ba6c](a74ba6c))
* Missing traduction in the query popup ([051d622](051d622))
* **mobile:** scroll issue after workspace change ([#24555](#24555)) ([#24585](#24585)) ([7245292](7245292))
* Return empty result if no perm level access (backport [#24591](#24591)) ([#24592](#24592)) ([adcbeee](adcbeee))
* **search:** Fix URL encoding for search result ([#24558](#24558)) ([44ec1e3](44ec1e3))
* sentry minor fix ([#24588](#24588)) ([23f77ef](23f77ef))
* translatability ([#24553](#24553)) ([41d2fe2](41d2fe2))

### Features

* `frappe.db.sql` results `as_iterator` (backport [#19810](#19810)) (backport [#24346](#24346)) ([#24562](#24562)) ([7f3a12b](7f3a12b)), closes [#24365](#24365)

### Reverts

* Revert "fix(data_import): respect the value of show_failed_logs checkbox" ([3c7f494](3c7f494))
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 1, 2024
Development

Successfully merging this pull request may close these issues.

Iterator support for db.sql