Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset: fix dataset performance drops signifcantly on high dimensions data. #15355

Merged
merged 71 commits into from
Aug 22, 2021

Conversation

pissang
Copy link
Contributor

@pissang pissang commented Jul 15, 2021

Brief Information

This pull request is in the type of:

  • bug fixing
  • new feature
  • others

What does this PR do?

This PR refactors the List module and aims to fix the terrible performance when many series are sharing the same high dimension dataset.

Result of test case on my MacBook

  1. Case from dataset performance for large data #11907. 1000 dimensions dataset :
    Before: About 45s to render.
    After: About 300ms to render and only takes 20MB heap memory.

  2. test/sample-compare.html. 1 million, 4 dimensions dataset with LTTB sampling
    Before: About 40ms
    After: About 25ms

Fixed issues

#11907

Details explain

The major reason why high dimension data is so slow previously is that echarts will process all dimensions and recreate storage from the dataset in each series. In most cases, it won't cost much. But in #11907, There is extra high dimension data and each dimension will be distributed to a line series. In this case, echarts need to process the 1000 dimensions 1000 times. It brings a lot of pressure to the GC.

So the main purpose of this refactoring is quite clear: This dataset which is shared by all the series should only be processed once. To achieve this purpose, we need to divide the original List module into two modules:

  • DataStorage: process and store the multi-dimension data. It can be shared by multiple series.
  • SeriesData: A data reading wrapper of the storage for each series. Also manages the relationship of data and graphic info(visual, elements). The interface is almost the same as the original List to avoid breaks.

It sounds quite straightforward and simple. So why we didn't do it before? Because we need to collect enough information before parsing data and it's hard to prepare the parsed data for sharing ahead. For example, we need to know which axis of a cartesian coordinate system is a category, so we can parse data of this dimension as category data instead of converting it to numbers. We also detect the type from data. But we don't trust it. Because detect every value is costly, we can only detect part of them. Also, we can accept a number as a category, a string as a date, a string number('123') as a number, etc. So the easiest way is parsing the source when initializing each series after all necessary information is prepared.

In this PR keep the idea of preparing the storage ahead based on the information dataset given. If series needs the source to be parsed differently. We discard the shared storage and create a new one for this series. There are several factors that will invalidate the shared storage:

  1. series.seriesLayoutBy is different.
  2. series.sourceHeader is different.
  3. series also specify dimensions and it's different from the dataset.
  4. dataset detects one dimension to be number but the series needs it to be ordinal.
  5. dataset detects one dimension to be ordinal but the series needs it to be number or time.

This invalidate strategy will ensure the behavior is correct but may have extra costs in some cases. So there is a best practice to avoid this:

Always specify the name and type of dimensions in the dataset.

pissang added 28 commits July 8, 2021 16:09
 fix dimSize param for typedarray to provider
to reduce the cost of source dimension parse
@echarts-bot
Copy link

echarts-bot bot commented Jul 15, 2021

Thanks for your contribution!
The community will review it ASAP. In the meanwhile, please checkout the coding standard and Wiki about How to make a pull request.

The pull request is marked to be PR: author is committer because you are a committer of this project.

pissang and others added 7 commits August 4, 2021 13:42
(1). If no dimensions specified on dataset, series can not really share one storage instance. (Because the each series will create its own dimension name (like ['x', 'y', 'value'], ['x', 'value', 'y'], and the storage hash is based on those names. So the hash can not match).
(2). Each time `setOption` update series (but not change `dataset`), new data stack dimensions (and corresponding chunks) will be keep added to the shared data storage, and then the chunks will be more and more.
(3). When "unused dimension omit" happen, the index of SeriesData['dimensions'] and SeriesData['getDimension'] is not the dimensionIndex that users known. But there are someplace still use dimensionIndex to visit them. (especially in visualMap and other similar cases that user can input dimension index via option).
(4). If user only specify type but no name in dimensions, their will be some bug when "unused dimension omit" happen. (Because unused dimensions will not auto-generate dimension name by `createDimensions` and so that it has no dimension name in storage, and can not be queried by dimension name).
(5). If different series option specify its own `dimensions` but share one `dataset`, the `source` get by `sourceManager` is different `source` instances in each series. Those `source` instances contain different `dimensionDefine` but reference the same data. And then a data storage created by based on s

**This commit try resolve those issues by this way:**
1. Do not save the "dimName->dimIndex map" in data storage any more. Because:
    1. In fact data storage do not need this map to read/write data.
    2. dimNames are usually created based on each series info (like ['x', 'y', 'value'], ['x', 'value', 'y'], ...) if not specified by user. So they are different between series. But even those series have different generated dimension names, they can still share one storage, because essentially they visit the same dataset source by dimIndex.
2. Make `SeriesDimensionDefine` (that is, each item of `SeriesData['dimensionInfos']`) contain `storageDimensionIndex` to indicate its corresponding data store dimension index. And alway user `storageDimensionIndex` to visit dat storage rather than dimName. `storageDimensionIndex` is created in `createDimension`.
3. create a new structure `SeriesDimensionRequest` for each series. It contains the info generated by `createDimension` (like dimensionDefineList, whether dimension omitted, source for this series).
    3.1 `sourceManager` use `seriesDimensionRequest` to find the shared storage by generate storage dimensions and hash based on the `dimCount` and `dimensionDefineList` (which are created by `createDimension` and saved in `seriesDimensionRequest`).
    3.2 `dataStack` add "data stack dimensions" to `dimensionDefineList` in `seriesDimensionRequest`.
    3.3 `seriesData` use `seriesDimensionRequest` to init its dimensions, and use `seriesDimensionRequest` query dimName by dimIndex from source, or query dimIndex by dimName from source for "omitted dimension". If different series option specify its own `dimensions` but share one `dataset`, the `source` get by `sourceManager` is different `source` instances in each series. Those `source` instances contain different `dimensionDefine` but reference the same "raw data". The data storage generated based on the "raw data" can be shared between series, but the `dimensionDefine` should not be shared. `SeriesDimensionRequest` encapsulate these mess and work each series to query dimension name or index.
    3.4 `seriesDimensionRequest` do not create new data structure as possible as it can, but reference shared data structure (like `source` instance). So it will not cost memory issue.
4. Change the previous `storage.appendDimension` to `storage.ensureCalculationDimension` for data stack. That is, if its dimension has been created, reuse it.
5. Remove previous `canUse` method. Whether a storage can be shared by series is all determined by hash. The hash is generated in two ways:
    5.1 For source format "arrayRows" (i.e., [[12, 33], [55, 99], ...]), dimension name do not need to be added to hash, because this kind of data actually visited by index. If two series have different dimension name (like 'x', 'y') for single index, they also can share the storage.
    5.2 For source format "objectRows" (i.e, [{a: 12, b: 33}, {b: 55, a: 99}, ...]), property name 'a', 'b' will be added to hash, because this kind of data actually visited by property name.
    5.3 And as before, dimension type, ordinal meta id, source header, series layout will also be added to hash.
6. Make `DataStorage` method immutable:
    `DataStorage['filterSelf']` -> `DataStorage['filter']`
    `DataStorage['selectRange']`

**PENDING:**
1. Should deprecate `dimName = seriesData.getDimension(dimLoose)` and `series.get(dimName)` but always use `dimIdx = seriesData.getDimensionIndex(dimLoose)` and `dataStorage.get(dimIdx)` instead. For examples:
```js
// Previously
const val = data.get(data.getDimension(dim), dataIdx);
// Now
const val = data.getStorage().get(data.getDimensionIndex(dim), dataIdx);
```
`seriesData.getDimension(dimLoose)` has a feature that convert dimIdx to dimName, which is not essentially necessary (because dimIdx can be used to visit data directly), but this feature require a "dimIdx->dimName map" in `SeriesData` (why? because when some dimensions are omitted, we can not use dimIdx on `SeriesData['dimensions']` directly).
2. Radar has bug when using `series.encode`. This commit do not fix the bug but keep as it is.
@100pah
Copy link
Member

100pah commented Aug 19, 2021

Some changes has been made from 2bf188f to 39c7111

There are these issues existing before this commit:

  1. If no dimensions specified on dataset, series can not really share one storage instance. (Because the each series will create its own dimension name (like ['x', 'y', 'value'], ['x', 'value', 'y'], and the storage hash is based on those names. So the hash can not match).
  2. Each time setOption update series (but not change dataset), new data stack dimensions (and corresponding chunks) will be keep added to the shared data storage, and then the chunks will be more and more.
  3. When "unused dimension omit" happen, the index of SeriesData['dimensions'] and SeriesData['getDimension'] is not the dimensionIndex that users known. But there are someplace still use dimensionIndex to visit them. (especially in visualMap and other similar cases that user can input dimension index via option).
  4. If user only specify only type but no name in dimensions, their will be some bug when "unused dimension omit" happen. (Because unused dimensions will not auto-generate dimension name by createDimensions and so that it has no dimension name in storage, and can not be queried by dimension name).

This commit try resolve those issues by this way:

  1. Do not save the "dimName->dimIndex map" in data storage any more. Because:
    1. In fact data storage do not need this map to read/write data.
    2. dimNames are usually created based on each series info (like ['x', 'y', 'value'], ['x', 'value', 'y'], ...) if not specified by user. So they are different between series. But even those series have different generated dimension names, they can still share one storage, because essentially they visit the same dataset source by dimIndex.
  2. Make SeriesDimensionDefine (that is, each item of SeriesData['dimensionInfos']) contain storageDimensionIndex to indicate its corresponding data store dimension index. And always use storageDimensionIndex to visit dat storage rather than dimName. storageDimensionIndex is created in createDimension.
  3. create a new structure SeriesDataSchema for each series. It contains the info generated by createDimension (like (1) dimensionDefineList, (2) whether dimension omitted, (3) source for this series, (4) ...).
    3.1. sourceManager use seriesDataSchema to find the shared storage by generate storage dimensions and hash based on the dimCount and dimensionDefineList (which are created by createDimension and saved in seriesDataSchema).
    3.2. dataStack add "data stack dimensions" to dimensionDefineList in seriesDataSchema.
    3.3. seriesData use seriesDataSchema to init its dimensions, and use seriesDataSchema query dimName by dimIndex from source, or query dimIndex by dimName from source (only when "dimension omit" happen). If different series option specify its own dimensions but share one dataset, the source get by sourceManager is different source instances in each series. Those source instances contain different dimensionDefine but reference the same "raw data". The data storage generated based on the "raw data" can be shared between series, but the dimensionDefine should not be shared. SeriesDataSchema encapsulate these mess and work each series to query dimension name or index.
    3.4. seriesDataSchema do not create new data structure as possible as it can, but reference shared data structure (like source instance). So it will not cost memory issue.
  4. Change the previous storage.appendDimension to storage.ensureCalculationDimension for data stack. That is, if its dimension has been created, reuse it.
  5. Remove previous canUse method. Whether a storage can be shared by series is all determined by hash. The hash is generated in two ways:
    5.1. For source format "arrayRows" (i.e., [[12, 33], [55, 99], ...]), dimension name do not need to be added to hash, because this kind of data actually visited by index. If two series have different dimension name (like 'x', 'y') for single index, they also can share the storage.
    5.2. For source format "objectRows" (i.e, [{a: 12, b: 33}, {b: 55, a: 99}, ...]), property name 'a', 'b' will be added to hash, because this kind of data actually visited by property name.
    5.3. And as before, dimension type, ordinal meta id, source header, series layout will also be added to hash.
  6. Make DataStorage method immutable:
    DataStorage['filterSelf'] -> DataStorage['filter']
    DataStorage['selectRange']

PENDING

  1. Should deprecate dimName = seriesData.getDimension(dimLoose) and series.get(dimName) but always use dimIdx = seriesData.getDimensionIndex(dimLoose) and dataStorage.get(dimIdx) instead. For examples:
// Previously
const val = data.get(data.getDimension(dim), dataIdx);
// Now
const val = data.getStorage().get(data.getDimensionIndex(dim), dataIdx);

seriesData.getDimension(dimLoose) has a feature that convert dimIdx to dimName, which is not essentially necessary (because dimIdx can be used to visit data directly), but this feature require a "dimIdx->dimName map" in SeriesData (why? because when some dimensions are omitted, we can not use dimIdx on SeriesData['dimensions'] directly).

  1. Radar has bug when using series.encode. This commit do not fix the bug but keep as it is.

Test case

<test/dataset-case.html>
<test/dataset-performance.html>

About SeriesDataSchema

Picture4

@100pah 100pah merged commit 4569dc1 into master Aug 22, 2021
@echarts-bot
Copy link

echarts-bot bot commented Aug 22, 2021

Congratulations! Your PR has been merged. Thanks for your contribution! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants