dataset: fix dataset performance drops significantly on high-dimension data. #15355
The following issues existed before this commit:

1. If no `dimensions` are specified on `dataset`, series cannot really share one storage instance, because each series creates its own dimension names (like `['x', 'y', 'value']`, `['x', 'value', 'y']`) and the storage hash is based on those names, so the hashes cannot match.
2. Each time `setOption` updates a series (but does not change `dataset`), new data stack dimensions (and the corresponding chunks) keep being added to the shared data storage, so the chunks grow more and more.
3. When "unused dimension omit" happens, the index into `SeriesData['dimensions']` and `SeriesData['getDimension']` is no longer the dimension index that users know. But some places still use the dimension index to visit them (especially in visualMap and other similar cases where users can input a dimension index via option).
4. If users only specify `type` but no `name` in `dimensions`, there are bugs when "unused dimension omit" happens, because unused dimensions do not get an auto-generated dimension name from `createDimensions`, so they have no dimension name in storage and cannot be queried by dimension name.
5. If different series options specify their own `dimensions` but share one `dataset`, the `source` obtained from `sourceManager` is a different `source` instance in each series. Those `source` instances contain different `dimensionDefine` but reference the same data, so a data storage created based on that shared data can be reused across series, while the `dimensionDefine` must not be shared.

**This commit tries to resolve those issues in the following way:**

1. Do not save the "dimName -> dimIndex map" in data storage any more, because:
    1. Data storage does not actually need this map to read or write data.
    2. dimNames are usually created based on each series' own info (like `['x', 'y', 'value']`, `['x', 'value', 'y']`, ...) if not specified by the user, so they differ between series. But even if series have different generated dimension names, they can still share one storage, because essentially they visit the same dataset source by dimIndex.
2. Make `SeriesDimensionDefine` (that is, each item of `SeriesData['dimensionInfos']`) contain `storageDimensionIndex` to indicate its corresponding data store dimension index, and always use `storageDimensionIndex` rather than dimName to visit data storage. `storageDimensionIndex` is created in `createDimension`.
3. Create a new structure `SeriesDimensionRequest` for each series. It contains the info generated by `createDimension` (like the dimensionDefineList, whether dimensions are omitted, and the source for this series).
    - 3.1 `sourceManager` uses `seriesDimensionRequest` to find the shared storage, generating the storage dimensions and hash based on the `dimCount` and `dimensionDefineList` (which are created by `createDimension` and saved in `seriesDimensionRequest`).
    - 3.2 `dataStack` adds "data stack dimensions" to the `dimensionDefineList` in `seriesDimensionRequest`.
    - 3.3 `seriesData` uses `seriesDimensionRequest` to init its dimensions, and uses `seriesDimensionRequest` to query dimName by dimIndex from the source, or query dimIndex by dimName from the source for "omitted dimensions". If different series options specify their own `dimensions` but share one `dataset`, the `source` obtained from `sourceManager` is a different `source` instance in each series. Those `source` instances contain different `dimensionDefine` but reference the same "raw data". The data storage generated based on the "raw data" can be shared between series, but the `dimensionDefine` should not be shared. `SeriesDimensionRequest` encapsulates this mess and works with each series to query dimension name or index.
    - 3.4 `seriesDimensionRequest` avoids creating new data structures as much as possible and instead references shared data structures (like the `source` instance), so it does not cause memory issues.
4. Change the previous `storage.appendDimension` to `storage.ensureCalculationDimension` for data stack. That is, if its dimension has already been created, reuse it.
5. Remove the previous `canUse` method. Whether a storage can be shared by a series is now determined entirely by the hash. The hash is generated in two ways:
    - 5.1 For source format "arrayRows" (i.e., `[[12, 33], [55, 99], ...]`), dimension names do not need to be added to the hash, because this kind of data is actually visited by index. Even if two series use different dimension names (like 'x', 'y') for a single index, they can still share the storage.
    - 5.2 For source format "objectRows" (i.e., `[{a: 12, b: 33}, {b: 55, a: 99}, ...]`), the property names 'a', 'b' are added to the hash, because this kind of data is actually visited by property name.
    - 5.3 And as before, dimension type, ordinal meta id, source header and series layout are also added to the hash.
6. Make `DataStorage` methods immutable: `DataStorage['filterSelf']` -> `DataStorage['filter']`, `DataStorage['selectRange']`.

**PENDING:**

1. We should deprecate `dimName = seriesData.getDimension(dimLoose)` and `series.get(dimName)` and always use `dimIdx = seriesData.getDimensionIndex(dimLoose)` and `dataStorage.get(dimIdx)` instead. For example:

```js
// Previously
const val = data.get(data.getDimension(dim), dataIdx);
// Now
const val = data.getStorage().get(data.getDimensionIndex(dim), dataIdx);
```

`seriesData.getDimension(dimLoose)` has a feature that converts dimIdx to dimName, which is not essentially necessary (because dimIdx can be used to visit data directly), but this feature requires a "dimIdx -> dimName map" in `SeriesData` (why? because when some dimensions are omitted, we cannot use dimIdx on `SeriesData['dimensions']` directly).

2. Radar has a bug when using `series.encode`. This commit does not fix that bug but keeps it as is.
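To make point 5 above more concrete, here is a minimal sketch of how such a storage-sharing hash could be assembled. The helper name and the exact fields used are illustrative assumptions, not the actual echarts implementation:

```js
// Illustrative only: build a key deciding whether two series can share one
// parsed data storage (see point 5 above).
function makeStorageHash(source, dimensionDefineList) {
  const parts = [
    source.sourceFormat,    // e.g. 'arrayRows' or 'objectRows'
    source.sourceHeader,    // whether the first row/column is a header
    source.seriesLayoutBy   // 'column' or 'row'
  ];
  for (const dim of dimensionDefineList) {
    // For 'objectRows' values are read by property name, so the name matters;
    // for 'arrayRows' values are read purely by index, so series-specific
    // dimension names (like 'x' vs 'y') are left out of the hash.
    if (source.sourceFormat === 'objectRows') {
      parts.push(dim.name);
    }
    parts.push(dim.type || '', dim.ordinalMeta ? dim.ordinalMeta.uid : '');
  }
  return parts.join('$');
}
```

Two series whose keys match can reuse the same storage; any mismatch causes a new storage to be created for that series.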
Some changes have been made from 2bf188f to 39c7111.
Test case: `test/dataset-case.html`
Congratulations! Your PR has been merged. Thanks for your contribution! 👍
Brief Information
This pull request is in the type of:
What does this PR do?
This PR refactors the `List` module and aims to fix the terrible performance when many series share the same high-dimension dataset.

Result of test case on my MacBook:
Case from #11907 (dataset performance for large data), 1000-dimension dataset:
Before: About 45s to render.
After: About 300ms to render, taking only about 20MB of heap memory.
test/sample-compare.html: 1 million data points, 4-dimension dataset with LTTB sampling:
Before: About 40ms
After: About 25ms
Fixed issues
#11907
Detailed explanation
The major reason high-dimension data was so slow previously is that echarts processes all dimensions and recreates the storage from the dataset for each series. In most cases, this doesn't cost much. But in #11907, the data has very many dimensions (1000) and each dimension is distributed to its own line series. In that case, echarts needs to process the 1000 dimensions 1000 times, which puts a lot of pressure on the GC.
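For illustration, roughly the kind of option that triggers this situation (a simplified sketch with made-up data, not the exact case from the issue):

```js
// One shared dataset with 1000 value dimensions; every dimension becomes
// its own line series. Before this PR, each of the 1000 series re-parsed
// all 1000 dimensions of the same dataset.
const dimCount = 1000;
const rowCount = 100;
const source = [];
for (let i = 0; i < rowCount; i++) {
  const row = [i];                    // column 0: x value
  for (let j = 0; j < dimCount; j++) {
    row.push(Math.random() * 100);    // columns 1..1000: series values
  }
  source.push(row);
}

const option = {
  dataset: { source },
  xAxis: { type: 'value' },
  yAxis: {},
  series: Array.from({ length: dimCount }, (_, j) => ({
    type: 'line',
    encode: { x: 0, y: j + 1 }
  }))
};
```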
So the main purpose of this refactoring is quite clear: the dataset that is shared by all the series should only be processed once. To achieve this, we divide the original `List` module into two modules: a shared data storage (`DataStorage`) that holds the parsed data, and `List`, which keeps its existing interface to avoid breaking changes.
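As a rough sketch of the intended split (the class and method names below are simplified assumptions, not the actual source):

```js
// DataStorage: holds the parsed columns once; shared by every series that
// uses the same dataset and reads values purely by dimension index.
class DataStorage {
  constructor(chunks) {
    this._chunks = chunks;             // one (typed) array per dimension
  }
  get(storageDimIndex, dataIndex) {
    return this._chunks[storageDimIndex][dataIndex];
  }
}

// Per-series view (the List/SeriesData side): lightweight, mapping the
// series' own dimension names (e.g. 'x', 'y', 'value') to storage indices.
class SeriesDataView {
  constructor(storage, dimIndexByName) {
    this._storage = storage;
    this._dimIndexByName = dimIndexByName;
  }
  getDimensionIndex(dimName) {
    return this._dimIndexByName[dimName];
  }
  get(dimName, dataIndex) {
    return this._storage.get(this.getDimensionIndex(dimName), dataIndex);
  }
}
```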
It sounds quite straightforward and simple. So why didn't we do it before? Because we need to collect enough information before parsing the data, and it is hard to prepare the parsed data for sharing ahead of time. For example, we need to know which axis of a cartesian coordinate system is a category axis, so that we can parse the data of that dimension as category data instead of converting it to numbers. We also detect the type from the data itself, but we don't fully trust it: detecting every value is costly, so we can only detect part of them. Also, we can accept a number as a category, a string as a date, a string number ('123') as a number, and so on. So the easiest way is to parse the source when initializing each series, after all the necessary information has been prepared.
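A small illustration of why parsing cannot be finished before the series and axis information is known (the data values here are made up):

```js
// The same source column may need to be parsed differently depending on how
// the series and axes use it, so a storage prepared blindly ahead of time
// may not match what a series actually needs.
const source = [
  ['2021', 150],
  ['2022', 230]
];

// With a category x axis, '2021' should stay an ordinal (category) string.
const optionA = {
  dataset: { source },
  xAxis: { type: 'category' },
  yAxis: {},
  series: [{ type: 'bar' }]
};

// With a time x axis, '2021' should be parsed as a date instead.
const optionB = {
  dataset: { source },
  xAxis: { type: 'time' },
  yAxis: {},
  series: [{ type: 'line' }]
};
```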
This PR keeps the idea of preparing the storage ahead of time based on the information the dataset gives. If a series needs the source to be parsed differently, we discard the shared storage and create a new one for that series. Several factors can invalidate the shared storage in this way (for example, when a series requires a dimension to be parsed as a different type than the one the shared storage was built with).
This invalidation strategy ensures the behavior stays correct but may add extra cost in some cases. So there is a best practice to avoid it:
Always specify the name and type of dimensions in the dataset.
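For example, along these lines (the dimension names and data here are only illustrative):

```js
const option = {
  dataset: {
    // Declaring both name and type up front means the parsed storage does
    // not have to be invalidated and rebuilt for individual series.
    dimensions: [
      { name: 'date', type: 'ordinal' },
      { name: 'value', type: 'number' }
    ],
    source: [
      ['2021-06-01', 150],
      ['2021-06-02', 230]
    ]
  },
  xAxis: { type: 'category' },
  yAxis: {},
  series: [{ type: 'line' }]
};
```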