Skip to content

Conversation

@kaori-seasons
Copy link

What problem does this PR solve?

Issue Number: Related to issue-50718

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaori-seasons
Copy link
Author

@BiteTheDDDDt Hi, do you have some free time to look at this PR lately?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds base64 encoding support for AggState data type serialization in CSV exports. The implementation ensures that binary aggregate state data can be safely transmitted in CSV format by encoding it as base64 strings.

  • Introduces DataTypeAggStateSerde class to handle AggState serialization with base64 encoding
  • Adds comprehensive test suites for base64 encoding/decoding and CSV serialization
  • Updates DataTypeAggState to use the new serialization wrapper

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
be/src/vec/data_types/serde/data_type_agg_state_serde.h Defines the DataTypeAggStateSerde class for base64 serialization of AggState data
be/src/vec/data_types/serde/data_type_agg_state_serde.cpp Implements base64 encoding logic for CSV and Hive text formats
be/src/vec/data_types/data_type_agg_state.h Updates get_serde() to return the new AggState serde wrapper
be/test/vec/runtime/agg_state_csv_test.cpp Adds unit tests for base64 encoding/decoding functionality
be/test/vec/data_types/serde/data_type_agg_state_serde_test.cpp Adds comprehensive tests for AggState serialization with various data types and edge cases
be/test/CMakeLists.txt Updates build configuration to include new test files
Comments suppressed due to low confidence (1)

be/test/vec/data_types/serde/data_type_agg_state_serde_test.cpp:1

  • The loop condition i >= len - 2 combined with unsigned decrement can cause undefined behavior. When len is less than 2, len - 2 underflows to a very large value (due to unsigned arithmetic), causing incorrect behavior. Consider using signed arithmetic or restructuring the loop to avoid this issue.
// Licensed to the Apache Software Foundation (ASF) under one

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@kaori-seasons
Copy link
Author

run buildall

Status DataTypeAggStateSerde::serialize_one_cell_to_json(const IColumn& column, int64_t row_num,
BufferWritable& bw,
FormatOptions& options) const {
// Check column type: AggState may be stored as ColumnString or ColumnFixedLengthObject
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可能有问题,agg state存储的column是根据内部agg function的get_serialized_type方法来的,不只有ColumnString 和 ColumnFixedLengthObject,比如bitmap的一些相关函数会用ColumnBitmap,一些复杂类型相关的函数会用ColumnArray,ColumnMap,ColumnStruct


// Get serialized string data
const auto& serialized_data = tmp_col->get_data_at(0);
_encode_to_base64(serialized_data.data, serialized_data.size, bw);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果这里直接改了编码解码方式,那旧版本导出的数据怎么在新版本导入呢?是否应该考虑做一个开关来控制这里的行为

}

private:
// Internal method: encode binary data to base64 and write to buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么只有编码做了base64,解码不用做base64吗?这样doris自己导出的数据自己还能导入吗,能否在这方面写一些regression test,类似regression-test/suites/nereids_p0/outfile/agg_state/ 目录下的

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

您好,我刚看到。上周末出差了, 这周花点时间处理

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants