
[IOTDB-2078] Split large TsFile tool #4736

Merged
SteveYurongSu merged 17 commits into master from tsfile_split on Jan 19, 2022

Conversation

@samperson1997 (Contributor) commented Jan 10, 2022

Background

IoTDB will compact some large TsFiles in rel/0.12, which causes many problems for memory control and task management. We need a tool to split these large TsFiles.

Introduction

The split tool will:

  1. Split the file into N new files, each about 1 GB (configured by target_compaction_file_size=1073741824).
    ---- This makes sure the files will not be compacted after restarting in 0.13.

  2. Shrink chunks to chunk_point_num_lower_bound_in_compaction points each.
    (Notice: these two configurations were introduced in PR [IOTDB-2176] Limit target chunk size when performing inner space compaction #4698)

  3. Change file names: version (+1 ~ +N) and level (10).
    ---- This makes sure the files will not be compacted after restarting in 0.12.
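The renaming in step 3 can be sketched as follows. This is a minimal illustration, not the tool's code: the class and method names are hypothetical, and it assumes the 0.12 file name layout `{time}-{version}-{level}-{mergeCount}.tsfile`, split on IoTDBConstant.FILE_NAME_SEPARATOR ("-") as in the excerpt reviewed below.

```java
public class SplitNameSketch {
  // Stand-ins for IoTDBConstant.FILE_NAME_SEPARATOR and the tool's default level
  private static final String FILE_NAME_SEPARATOR = "-";
  private static final String DEFAULT_LEVEL_NUM = "10"; // level high enough that 0.12 skips it

  /** Name of the i-th (1-based) target file derived from the source TsFile name. */
  static String targetName(String sourceName, int i) {
    String[] parts = sourceName.split(FILE_NAME_SEPARATOR);
    // version becomes (source version + i), giving the N targets versions +1 ~ +N
    parts[parts.length - 3] = String.valueOf(Integer.parseInt(parts[parts.length - 3]) + i);
    // level is forced to 10 so 0.12 never selects the file for compaction
    parts[parts.length - 2] = DEFAULT_LEVEL_NUM;
    return String.join(FILE_NAME_SEPARATOR, parts);
  }

  public static void main(String[] args) {
    // first and third target files for a hypothetical source name
    System.out.println(targetName("1641800000000-1-0-0.tsfile", 1));
    System.out.println(targetName("1641800000000-1-0-0.tsfile", 3));
  }
}
```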

For example, here is a file with 5 devices and 10 points per chunk:

            POSITION|	CONTENT
            -------- 	-------
                   0|	[magic head] TsFile
                   6|	[version number] 3
|||||||||||||||||||||	[Chunk Group] of device_1, num of Chunks:5
                   7|	[Chunk Group Header]
                    |		[marker] 0
                    |		[deviceID] device_1
                  17|	[Chunk] of sensor_1, numOfPoints:10, time range:[1,10], tsDataType:INT64, 
                     	startTime: 1 endTime: 10 count: 10 [minValue:-2132873995738506399,maxValue:7316185511240794694,firstValue:495858743216872418,lastValue:7316185511240794694,sumValue:2.8991794269797573E19]
                    |		[chunk header] marker=5, measurementId=sensor_1, dataSize=111, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=111 cap=111]
                    |		[page]  CompressedSize:108, UncompressedSize:158
                 142|	[Chunk] of sensor_2, numOfPoints:10, time range:[1,10], tsDataType:INT64, 
                     	startTime: 1 endTime: 10 count: 10 [minValue:-7380203651378471462,maxValue:6278335328401351016,firstValue:-3492088478750410801,lastValue:-4075684766777050358,sumValue:-1.7012190516574462E19]
                    |		[chunk header] marker=5, measurementId=sensor_2, dataSize=111, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=111 cap=111]
                    |		[page]  CompressedSize:108, UncompressedSize:158
                 267|	[Chunk] of sensor_3, numOfPoints:10, time range:[1,10], tsDataType:INT64, 
                     	startTime: 1 endTime: 10 count: 10 [minValue:-8061058841690306722,maxValue:8747337672426803323,firstValue:-7090470927176588607,lastValue:3461791129425469107,sumValue:-9.599427033006014E18]
                    |		[chunk header] marker=5, measurementId=sensor_3, dataSize=112, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=112 cap=112]
                    |		[page]  CompressedSize:109, UncompressedSize:158
                 393|	[Chunk] of sensor_4, numOfPoints:10, time range:[1,10], tsDataType:INT64, 
                     	startTime: 1 endTime: 10 count: 10 [minValue:-7370899186004023873,maxValue:7887931722970317763,firstValue:350314169393044505,lastValue:1720583668582096001,sumValue:1.6486835235844387E19]
                    |		[chunk header] marker=5, measurementId=sensor_4, dataSize=111, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=111 cap=111]
                    |		[page]  CompressedSize:108, UncompressedSize:158
                 518|	[Chunk] of sensor_5, numOfPoints:10, time range:[1,10], tsDataType:INT64, 
                     	startTime: 1 endTime: 10 count: 10 [minValue:-4126359258295558513,maxValue:8570159718211508220,firstValue:6693073330649300527,lastValue:-1388643936107197122,sumValue:2.0423031386752184E19]
                    |		[chunk header] marker=5, measurementId=sensor_5, dataSize=111, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=111 cap=111]
                    |		[page]  CompressedSize:108, UncompressedSize:158
|||||||||||||||||||||	[Chunk Group] of device_1 ends
......
......

With the split tool, the file is split into new files. Here are some more detailed features:

  • There are 6 points in one chunk, according to the config (chunk_point_num_lower_bound_in_compaction=6).
  • Different sensors are located in different chunks (as in the example below):
            POSITION|	CONTENT
            -------- 	-------
                   0|	[magic head] TsFile
                   6|	[version number] 3
|||||||||||||||||||||	[Chunk Group] of device_1, num of Chunks:85
                   7|	[Chunk Group Header]
                    |		[marker] 0
                    |		[deviceID] device_1
                  17|	[Chunk] of sensor_1, numOfPoints:6, time range:[1,6], tsDataType:INT64, 
                     	startTime: 1 endTime: 6 count: 6 [minValue:-2132873995738506399,maxValue:4457162859305436130,firstValue:495858743216872418,lastValue:-2132873995738506399,sumValue:1.3268296367297372E19]
                    |		[chunk header] marker=5, measurementId=sensor_1, dataSize=87, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=87 cap=87]
                    |		[page]  CompressedSize:85, UncompressedSize:93
                 118|	[Chunk] of sensor_1, numOfPoints:6, time range:[7,12], tsDataType:INT64, 
                     	startTime: 7 endTime: 12 count: 6 [minValue:-7487702676116836276,maxValue:7316185511240794694,firstValue:1890241849737198692,lastValue:-7487702676116836276,sumValue:3.6431929898584658E18]
                    |		[chunk header] marker=5, measurementId=sensor_1, dataSize=87, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=87 cap=87]
                    |		[page]  CompressedSize:85, UncompressedSize:93
......
......
                1531|	[Chunk] of sensor_1, numOfPoints:6, time range:[91,96], tsDataType:INT64, 
                     	startTime: 91 endTime: 96 count: 6 [minValue:-6641858653183290081,maxValue:8259028081005285450,firstValue:6199046354625801717,lastValue:8259028081005285450,sumValue:1.0838419365197494E19]
                    |		[chunk header] marker=5, measurementId=sensor_1, dataSize=87, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=87 cap=87]
                    |		[page]  CompressedSize:85, UncompressedSize:93
                1632|	[Chunk] of sensor_1, numOfPoints:4, time range:[97,100], tsDataType:INT64, 
                     	startTime: 97 endTime: 100 count: 4 [minValue:-4270937283700548394,maxValue:4151444911820934153,firstValue:-4270937283700548394,lastValue:3023244406664712726,sumValue:6.182514990012971E18]
                    |		[chunk header] marker=5, measurementId=sensor_1, dataSize=59, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=59 cap=59]
                    |		[page]  CompressedSize:57, UncompressedSize:93
                1705|	[Chunk] of sensor_2, numOfPoints:6, time range:[1,6], tsDataType:INT64, 
                     	startTime: 1 endTime: 6 count: 6 [minValue:-5858870345010455298,maxValue:6278335328401351016,firstValue:-3492088478750410801,lastValue:-4198337637729517376,sumValue:-6.5295047110644613E18]
                    |		[chunk header] marker=5, measurementId=sensor_2, dataSize=87, serializedSize=14
                    |		[chunk] java.nio.HeapByteBuffer[pos=0 lim=87 cap=87]
                    |		[page]  CompressedSize:85, UncompressedSize:93
......
......

Notice:
Data from different devices may be written into the same output file so that the files won't be too small. In an experiment, a 28.6 GB file was split into 24 files, among which the smallest file is 1.04 GB and the largest is 1.74 GB.

Whether in 0.12 (where compaction is decided by the level in the file name) or in 0.13 (where compaction is decided by file size), none of these files will be compacted after restarting.
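The chunk repartitioning visible in the dumps above (each sensor's 100 points rewritten from 10-point chunks into 6-point chunks, with a trailing 4-point chunk) boils down to partitioning each sensor's point sequence. A minimal sketch, with a hypothetical `partition` helper standing in for the tool's rewrite logic:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPartitionSketch {
  /** Splits a sensor's points into chunks of at most chunkPointNum points each. */
  static <T> List<List<T>> partition(List<T> points, int chunkPointNum) {
    List<List<T>> chunks = new ArrayList<>();
    for (int i = 0; i < points.size(); i += chunkPointNum) {
      chunks.add(points.subList(i, Math.min(i + chunkPointNum, points.size())));
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<Integer> times = new ArrayList<>();
    for (int t = 1; t <= 100; t++) times.add(t); // 100 points, as in the example dump
    List<List<Integer>> chunks = partition(times, 6);
    // 16 full chunks of 6 points plus one remainder chunk of 4 points
    System.out.println(chunks.size() + " chunks, last has " + chunks.get(chunks.size() - 1).size());
    // prints "17 chunks, last has 4"
  }
}
```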

Usage

./TsFileSplitTool fileName

Limitation

The split tool does not currently support these scenarios:

  • TsFile with modification
  • TsFile with aligned timeseries

@samperson1997 added labels: 0.13 (will be released in 0.13), In Progress, Module - TsFile (data file format) — Jan 10, 2022
@samperson1997 samperson1997 marked this pull request as ready for review January 10, 2022 12:24
coveralls commented Jan 10, 2022

Coverage Status

Coverage increased (+0.08%) to 67.957% when pulling 9a1b3b3 on tsfile_split into a0ff477 on master.

@samperson1997 samperson1997 requested a review from HTHou January 12, 2022 05:49
@SteveYurongSu (Member) commented:
Please check the following cases before executing the tool:

  • TsFile with modification
  • TsFile with aligned timeseries

And it would be better for us to provide a shell script to execute the routine :D

String[] filePathSplit = filename.split(IoTDBConstant.FILE_NAME_SEPARATOR);
int versionIndex = Integer.parseInt(filePathSplit[filePathSplit.length - 3]) + 1;
// to avoid compaction after restarting. NOTICE: This will take effect only in
filePathSplit[filePathSplit.length - 2] = "10";
Member:

Making the "10" an input parameter would be better :D

Contributor Author:

Fixed. I also added a default level as a global param: private static final String defaultLevelNum = "10"

IoTDBDescriptor.getInstance().getConfig().getTargetCompactionFileSize();

/** Maximum index of plans executed within this TsFile. */
protected long maxPlanIndex = Long.MIN_VALUE;
Member:

TODO :)

Contributor Author:

Added

protected long maxPlanIndex = Long.MIN_VALUE;

/** Minimum index of plans executed within this TsFile. */
protected long minPlanIndex = Long.MAX_VALUE;
Member:

TODO :)

Contributor Author:

Added

for (int i = 0; i < filePathSplit.length; i++) {
  sb.append(filePathSplit[i]);
  if (i != filePathSplit.length - 1) {
    sb.append("-");
  }
}
Member:

"-"

Is it in Constant?

Contributor Author:

Fixed to IoTDBConstant.FILE_NAME_SEPARATOR

@samperson1997 samperson1997 removed the 0.13 will be released in 0.13 label Jan 18, 2022
sonarqubecloud commented:

SonarCloud Quality Gate failed.

Bug: A, 0 Bugs
Vulnerability: A, 0 Vulnerabilities
Security Hotspot: A, 0 Security Hotspots
Code Smell: A, 4 Code Smells

0.0% Coverage
6.3% Duplication

@samperson1997 (Contributor Author) commented:

> Please check the following cases before executing the tool:
>
>   • TsFile with modification
>   • TsFile with aligned timeseries
>
> And it's better for us to provide a shell to execute the routine :D

Exception detection, a shell script, and User Guide documents in Chinese and English were added in the latest commit. Really appreciate your detailed code review and suggestions!! :)

@samperson1997 samperson1997 added the 0.13 will be released in 0.13 label Jan 19, 2022

Labels

0.13 (will be released in 0.13) · Module - TsFile (data file format)

Projects

None yet


3 participants