Regression in json performance for local files #21450

@ariel-miculas

I ran some tests with ClickBench. Reading JSON from local files is about 8x to 12x slower per query after we merged #20823:

[ec2-user@ip-172-31-0-185 datafusion]$ ./benchmarks/bench.sh compare json-test-on-main test-json-improvement
Comparing json-test-on-main and test-json-improvement
--------------------
Benchmark clickbench_2.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃ json-test-on-main ┃ test-json-improvement ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │        2938.54 ms │           36468.92 ms │ 12.41x slower │
│ QQuery 1  │        4189.48 ms │           36706.26 ms │  8.76x slower │
│ QQuery 2  │        3021.24 ms │           36695.04 ms │ 12.15x slower │
│ QQuery 3  │              FAIL │                  FAIL │  incomparable │
│ QQuery 4  │        3518.24 ms │           37016.08 ms │ 10.52x slower │
│ QQuery 5  │        3138.41 ms │           37131.63 ms │ 11.83x slower │
│ QQuery 6  │              FAIL │                  FAIL │  incomparable │
│ QQuery 7  │        4191.68 ms │           36874.60 ms │  8.80x slower │
│ QQuery 8  │        4405.33 ms │           37054.97 ms │  8.41x slower │
│ QQuery 9  │        3473.41 ms │           37308.28 ms │ 10.74x slower │
│ QQuery 10 │        4351.06 ms │           36934.39 ms │  8.49x slower │
│ QQuery 11 │        3306.45 ms │           37101.39 ms │ 11.22x slower │
│ QQuery 12 │        3226.21 ms │           37235.60 ms │ 11.54x slower │
│ QQuery 13 │        3970.11 ms │           37244.27 ms │  9.38x slower │
│ QQuery 14 │        3246.59 ms │           37085.69 ms │ 11.42x slower │
│ QQuery 15 │        4563.53 ms │           37182.89 ms │  8.15x slower │
│ QQuery 16 │        4506.85 ms │           37391.07 ms │  8.30x slower │
│ QQuery 17 │        4377.16 ms │           37381.49 ms │  8.54x slower │
│ QQuery 18 │        3555.18 ms │           37603.25 ms │ 10.58x slower │
│ QQuery 19 │        4568.01 ms │           36996.50 ms │  8.10x slower │
│ QQuery 20 │        3193.87 ms │           37069.19 ms │ 11.61x slower │
│ QQuery 21 │        4415.33 ms │           37185.73 ms │  8.42x slower │
│ QQuery 22 │        3312.73 ms │           37190.81 ms │ 11.23x slower │
│ QQuery 23 │              FAIL │                  FAIL │  incomparable │
│ QQuery 24 │        4382.53 ms │           37093.81 ms │  8.46x slower │
│ QQuery 25 │        4339.69 ms │           37121.90 ms │  8.55x slower │
│ QQuery 26 │        4425.42 ms │           37106.02 ms │  8.38x slower │
│ QQuery 27 │        4505.30 ms │           37059.04 ms │  8.23x slower │
│ QQuery 28 │        3582.82 ms │           37409.12 ms │ 10.44x slower │
│ QQuery 29 │        4440.96 ms │           36868.93 ms │  8.30x slower │
│ QQuery 30 │        4675.71 ms │           37081.23 ms │  7.93x slower │
│ QQuery 31 │        4276.55 ms │           37165.64 ms │  8.69x slower │
│ QQuery 32 │        3615.42 ms │           37662.39 ms │ 10.42x slower │
│ QQuery 33 │        4446.09 ms │           37558.30 ms │  8.45x slower │
│ QQuery 34 │        4521.66 ms │           37647.72 ms │  8.33x slower │
│ QQuery 35 │        4321.41 ms │           37225.06 ms │  8.61x slower │
│ QQuery 36 │              FAIL │                  FAIL │  incomparable │
│ QQuery 37 │              FAIL │                  FAIL │  incomparable │
│ QQuery 38 │              FAIL │                  FAIL │  incomparable │
│ QQuery 39 │              FAIL │                  FAIL │  incomparable │
│ QQuery 40 │              FAIL │                  FAIL │  incomparable │
│ QQuery 41 │              FAIL │                  FAIL │  incomparable │
│ QQuery 42 │              FAIL │                  FAIL │  incomparable │
└───────────┴───────────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Total Time (json-test-on-main)       │  131002.95ms │
│ Total Time (test-json-improvement)   │ 1225857.24ms │
│ Average Time (json-test-on-main)     │    3969.79ms │
│ Average Time (test-json-improvement) │   37147.19ms │
│ Queries Faster                       │            0 │
│ Queries Slower                       │           33 │
│ Queries with No Change               │            0 │
│ Queries with Failure                 │           10 │
└──────────────────────────────────────┴──────────────┘

The issue is that the into_stream function of object_store's GetResult reads data in 8 KiB chunks for local files, so we need to either replace it with custom code or use a completely separate path for local files, as was done previously.
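To illustrate why the chunk size matters, here is a minimal std-only sketch that counts how many chunks a reader produces for a given buffer size. It is not object_store's implementation: `read_in_chunks`, `chunks_for`, and the 1 MiB `LARGE_CHUNK` value are hypothetical names and numbers picked for comparison, and an in-memory `Cursor` stands in for a local file.

```rust
use std::io::{self, Cursor, Read};

/// Chunk size the issue attributes to the local-file stream path.
const SMALL_CHUNK: usize = 8 * 1024;
/// Hypothetical larger chunk size, chosen only for comparison.
const LARGE_CHUNK: usize = 1024 * 1024;

/// Reads `reader` to the end using a fixed-size buffer and returns
/// (number of non-empty chunks produced, total bytes read).
fn read_in_chunks<R: Read>(mut reader: R, chunk_size: usize) -> io::Result<(usize, usize)> {
    let mut buf = vec![0u8; chunk_size];
    let mut chunks = 0;
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        chunks += 1;
        total += n;
    }
    Ok((chunks, total))
}

/// Convenience wrapper: reads `len` bytes from an in-memory buffer
/// (a stand-in for a local file) with the given chunk size.
fn chunks_for(len: usize, chunk_size: usize) -> (usize, usize) {
    let data = vec![b'x'; len];
    read_in_chunks(Cursor::new(data), chunk_size).expect("in-memory reads cannot fail")
}
```

For a 4 MiB input, `chunks_for(4 * 1024 * 1024, SMALL_CHUNK)` produces 512 chunks, while `chunks_for(4 * 1024 * 1024, LARGE_CHUNK)` produces 4. Each chunk in a real stream carries per-item overhead (a read call, an allocation, a poll of the stream), so a small chunk size multiplies that overhead across the whole file.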

Originally posted by @ariel-miculas in #20823 (comment)

Labels: regression (Something that used to work no longer does)
