[Enhancement](MaxCompute)Refactoring maxCompute catalog using Storage API. #40225
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
run buildall
clang-tidy review says "All clean, LGTM! 👍"
TPC-H: Total hot run time: 38016 ms
TPC-DS: Total hot run time: 193561 ms
ClickBench: Total hot run time: 32.45 s
private TableTunnel tunnel;
@SerializedName(value = "region")
private String region;

@SerializedName(value = "accessKey")
private String accessKey;
I think these two fields do not need to be persisted.
All properties should be persisted in CatalogProperty.
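One way to read the suggestion above is sketched below: the property map is the only persisted state, and `region` and `accessKey` become derived fields that are rebuilt from it. This is a minimal illustration, not the Doris implementation. The class `CatalogSketch`, the method `initLocalObjects`, and the key `mc.region` are hypothetical names; only `mc.access_key` appears in this PR's own text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the catalog class under review; not Doris code.
final class CatalogSketch {
    // The property map plays the role of CatalogProperty: it is the only
    // state that would be written out when the catalog is persisted.
    private final Map<String, String> properties;

    // Derived values. Marked transient (and without @SerializedName) so a
    // field-based serializer would skip them.
    private transient String region;
    private transient String accessKey;

    CatalogSketch(Map<String, String> properties) {
        this.properties = new HashMap<>(properties);
    }

    // Rebuild the derived fields from the persisted properties,
    // for example after the catalog is loaded again on restart.
    void initLocalObjects() {
        region = properties.get("mc.region");        // hypothetical key
        accessKey = properties.get("mc.access_key"); // key taken from the PR text
    }

    String getRegion() {
        return region;
    }

    String getAccessKey() {
        return accessKey;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("mc.region", "cn-beijing");
        props.put("mc.access_key", "xxx");
        CatalogSketch catalog = new CatalogSketch(props);
        catalog.initLocalObjects();
        System.out.println(catalog.getRegion() + " / " + catalog.getAccessKey());
    }
}
```

With this shape, changing a property only requires updating the map and calling `initLocalObjects()` again, so there is no second persisted copy that can drift out of sync.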
if (region.startsWith("oss-")) {
    // may use oss-cn-beijing, ensure compatible
    region = region.replace("oss-", "");
if (Strings.isNullOrEmpty(quota)) {
Is there a default value for QUOTA?
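For illustration, the two checks in the snippet above could be written as small helpers like the following. This is a sketch under assumptions, not the code in the PR: the class and method names are hypothetical, and the default `pay-as-you-go` is taken from the optional-parameter list in this PR's commit message rather than from the reviewed code. It also strips only a leading `oss-` prefix, whereas `String.replace` in the snippet removes every occurrence.

```java
// Hypothetical helpers mirroring the reviewed snippet; not Doris code.
final class RegionQuotaSketch {
    // Default quota name as listed in the PR's commit message.
    static final String DEFAULT_QUOTA = "pay-as-you-go";

    // Accept both "cn-beijing" and the older "oss-cn-beijing" spelling.
    // Only a leading "oss-" is removed.
    static String normalizeRegion(String region) {
        String prefix = "oss-";
        return region.startsWith(prefix) ? region.substring(prefix.length()) : region;
    }

    // Fall back to the default when the property is missing or empty.
    static String quotaOrDefault(String quota) {
        return (quota == null || quota.isEmpty()) ? DEFAULT_QUOTA : quota;
    }

    public static void main(String[] args) {
        System.out.println(normalizeRegion("oss-cn-beijing"));
        System.out.println(quotaOrDefault(null));
    }
}
```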
9b818be to c518e57
run buildall
TPC-H: Total hot run time: 38454 ms
TPC-DS: Total hot run time: 192576 ms
ClickBench: Total hot run time: 32.31 s
run buildall
TPC-H: Total hot run time: 38524 ms
TPC-DS: Total hot run time: 188053 ms
ClickBench: Total hot run time: 32.58 s
run buildall
TPC-H: Total hot run time: 38414 ms
TPC-DS: Total hot run time: 187765 ms
ClickBench: Total hot run time: 32.4 s
…pute catalogs from previous versions. (apache#41386) before pr apache#40225
… API. (apache#40225)

Refactoring maxCompute catalog using Storage API.
Storage API : https://help.aliyun.com/zh/maxcompute/user-guide/open-storage-sample-java-sdk?spm=a2c4g.11186623.0.i0

```
The following are required:
CREATE CATALOG mc PROPERTIES (
  "type" = "max_compute",
  "mc.default.project" = "xxx",
  "mc.access_key" = "xxx",
  "mc.secret_key" = "xxxx",
  "mc.endpoint" = "xxxx"
);

Optional parameters:
Configuration Item                    Default Value
"mc.quota" = "pay-as-you-go"
"mc.split_strategy" = "byte_size"     Split according to file size
"mc.split_byte_size" = "268435456"    You can set the file size of each split
"mc.split_strategy" = "row_count"     Split according to the number of rows of data
"mc.split_row_count" = "1048576"      You can set how many lines to read for each split
```
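To make the split parameters above concrete, here is a small arithmetic sketch of how the number of splits could follow from them. It is not the planner in this PR, which may delegate splitting to the MaxCompute SDK. The class and method names are hypothetical, and treating `byte_size` as the strategy used when `mc.split_strategy` is unset is an assumption based on the order of the list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the split options; not Doris code.
final class SplitPlanSketch {
    // Default values as listed in the commit message.
    static final String DEFAULT_SPLIT_BYTE_SIZE = "268435456"; // 256 MiB
    static final String DEFAULT_SPLIT_ROW_COUNT = "1048576";

    // Number of splits needed to cover a table of the given size.
    static long splitCount(Map<String, String> props, long totalBytes, long totalRows) {
        // Assumption: byte_size is the strategy when none is configured.
        String strategy = props.getOrDefault("mc.split_strategy", "byte_size");
        switch (strategy) {
            case "byte_size":
                return ceilDiv(totalBytes,
                        Long.parseLong(props.getOrDefault("mc.split_byte_size", DEFAULT_SPLIT_BYTE_SIZE)));
            case "row_count":
                return ceilDiv(totalRows,
                        Long.parseLong(props.getOrDefault("mc.split_row_count", DEFAULT_SPLIT_ROW_COUNT)));
            default:
                throw new IllegalArgumentException("unknown mc.split_strategy: " + strategy);
        }
    }

    // Ceiling division for non-negative totals and a positive unit.
    static long ceilDiv(long total, long unit) {
        if (unit <= 0) {
            throw new IllegalArgumentException("split size must be positive: " + unit);
        }
        return (total + unit - 1) / unit;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        long oneGiB = 1024L * 1024L * 1024L;
        // A 1 GiB table with the default 256 MiB split size.
        System.out.println(splitCount(props, oneGiB, 0L));
        props.put("mc.split_strategy", "row_count");
        System.out.println(splitCount(props, oneGiB, 3_000_000L));
    }
}
```

A smaller `mc.split_byte_size` or `mc.split_row_count` yields more, smaller splits, which trades higher scan parallelism against more per-split overhead.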
… maxcompute. (apache#40888) before pr apache#40225
## Proposed changes
Fixed a bug in reading from MaxCompute: if a batch contained null values, null values were always read out.
…pute catalogs from previous versions. (apache#41386) before pr apache#40225 (cherry picked from commit e8a1a16)
PR #40225 passes time zone info from BE to JNI and uses `_state->timezone_obj().name()` to get the time zone name. But during a rolling upgrade of BE, it may core dump like this:

```
*** SIGSEGV address not mapped to object (@0x610) received by PID 72661 (TID 73538 OR 0x7f2e898d1640) from PID 1552; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/common/signal_handler.h:421
 1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 2# JVM_handle_linux_signal in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
 4# 0x00007F3070D3E520 in /lib/x86_64-linux-gnu/libc.so.6
 5# cctz::time_zone::name[abi:cxx11]() const in /mnt/hdd01/ci/compatibility-deploy/be/lib/doris_be
 6# doris::vectorized::JniConnector::open(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/vec/exec/jni_connector.cpp:87
 7# doris::vectorized::AvroJNIReader::init_fetch_table_schema_reader() at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/vec/exec/format/avro/avro_jni_reader.cpp:119
 8# std::_Function_handler::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
 9# doris::WorkThreadPool::work_thread(int) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/util/work_thread_pool.hpp:159
10# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
11# start_thread at ./nptl/pthread_create.c:442
12# 0x00007F3070E22850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
172.20.50.206 last coredump sql: 2024-10-13 04:12:23,985 [query]
```

This PR uses another method, `_state->timezone()`, which just returns a string instead of reading and initializing the time zone info file, to avoid the potential core dump.
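On the receiving side, a time zone that arrives as a plain string can be resolved defensively. The sketch below shows one way a JNI-side Java reader might do that. It is an assumption-laden illustration, not code from either PR: the parameter key `time_zone`, the class name, and the fallback to UTC are all hypothetical.

```java
import java.time.DateTimeException;
import java.time.ZoneId;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader-side helper; not Doris code.
final class TimeZoneParamSketch {
    // Hypothetical name of the parameter that carries the time zone string.
    static final String TIME_ZONE_KEY = "time_zone";

    // Resolve the zone from the parameters, falling back to UTC when the
    // value is missing or cannot be parsed (the fallback is an assumption).
    static ZoneId resolve(Map<String, String> params) {
        String name = params.get(TIME_ZONE_KEY);
        if (name == null || name.isEmpty()) {
            return ZoneId.of("UTC");
        }
        try {
            // Accepts region IDs such as "Asia/Shanghai" and offsets such as "+08:00".
            return ZoneId.of(name);
        } catch (DateTimeException e) {
            return ZoneId.of("UTC");
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(TIME_ZONE_KEY, "Asia/Shanghai");
        System.out.println(resolve(params));
    }
}
```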
…2003) bp #41956
Since I refactored the MaxCompute code earlier, some of the properties used when creating the catalog changed, so I updated the corresponding documents here.
pr : apache/doris#40225

# Versions
- [x] dev
- [x] 3.0
- [x] 2.1
- [ ] 2.0

# Languages
- [x] Chinese
- [x] English
Proposed changes
Refactoring maxCompute catalog using Storage API.
Storage API : https://help.aliyun.com/zh/maxcompute/user-guide/open-storage-sample-java-sdk?spm=a2c4g.11186623.0.i0