Add support writing data to hdfs by executing INSERT INTO/CTAS. #5663
Conversation
velox/dwio/common/DataSink.cpp (Outdated)

```diff
@@ -59,6 +61,28 @@ void WriteFileDataSink::registerLocalFileFactory() {
   DataSink::registerFactory(localWriteFileSink);
 }

+#ifdef VELOX_ENABLE_HDFS3
+std::unique_ptr<DataSink> hdfsWriteFileSink(
```
Can this be defined in the HDFS API?
@majetideepak sure, let me do it in a separate pull request.
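For readers following along, the factory-registration pattern under discussion can be sketched roughly as follows. This is a minimal, hypothetical stand-in, not Velox's actual DataSink API: `DataSink`, `registerFactory`, `createSink`, and the `hdfsWriteFileSink` body here are simplified illustrations of scheme-based dispatch, where each registered factory claims only the paths whose scheme it understands.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Velox's factory mechanism (illustration only).
struct DataSink {
  explicit DataSink(std::string p) : path(std::move(p)) {}
  std::string path;
};

using SinkFactory =
    std::function<std::unique_ptr<DataSink>(const std::string&)>;

std::vector<SinkFactory>& factories() {
  static std::vector<SinkFactory> registry;
  return registry;
}

void registerFactory(SinkFactory factory) {
  factories().push_back(std::move(factory));
}

// A factory that claims only hdfs:// paths and declines everything else,
// mirroring the scheme-dispatch style of localWriteFileSink.
std::unique_ptr<DataSink> hdfsWriteFileSink(const std::string& path) {
  if (path.rfind("hdfs://", 0) == 0) {
    return std::make_unique<DataSink>(path);
  }
  return nullptr;
}

// Walks the registry and returns the first sink any factory produces.
std::unique_ptr<DataSink> createSink(const std::string& path) {
  for (auto& factory : factories()) {
    if (auto sink = factory(path)) {
      return sink;
    }
  }
  return nullptr;
}
```

Moving such a factory next to the HDFS code (as suggested above) keeps the common DataSink.cpp free of `#ifdef VELOX_ENABLE_HDFS3` blocks.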
@majetideepak When we added the native Parquet write function in Gluten, we found that the Velox Parquet writer only supports writing data to the local file system. We filed PR #5400 to support writing data to an HDFS path when calling the Velox Parquet writer. Can you also have a look when you have time? Thanks.
Hi @majetideepak Based on #5700 I refactored the code logic. Could you help me review it again?
```diff
@@ -19,7 +19,11 @@ if(VELOX_ENABLE_S3)
   target_sources(velox_s3fs PRIVATE S3FileSystem.cpp S3Util.cpp)

   target_include_directories(velox_s3fs PUBLIC ${AWSSDK_INCLUDE_DIRS})
-  target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES})
+  target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES} xsimd)
```
Why are S3-related changes needed?
@JkSelf Thank you for your reply. Because we included velox/dwio/common/DataBuffer.h in the velox/common/file/File.h file, and DataBuffer.h depends on xsimd and gtest.
Why does DataBuffer.h depend on xsimd and gtest? We need to fix this.
The S3 CMakeLists.txt cannot be left unchanged this time, because the HDFS CMakeLists.txt depends on it. If we remove xsimd, the following error occurs at compile time:

```
In file included from /Users/wypb/data/code/apache/temp/velox/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataSink.h:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataBuffer.h:22:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/buffer/Buffer.h:26:
/Users/wypb/data/code/apache/temp/velox/./velox/common/base/SimdUtil.h:24:10: fatal error: 'xsimd/xsimd.hpp' file not found
#include <xsimd/xsimd.hpp>
```

If we remove gtest, the following error occurs at compile time:

```
In file included from /Users/wypb/data/code/apache/temp/velox/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataSink.h:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataBuffer.h:22:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/buffer/Buffer.h:27:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/common/memory/Memory.h:38:
/Users/wypb/data/code/apache/temp/velox/./velox/common/base/GTestMacros.h:24:10: fatal error: 'gtest/gtest_prod.h' file not found
#include <gtest/gtest_prod.h>
         ^~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
```
Including velox/common/base/GTestMacros.h inside velox/common/memory/Memory.h seems unnecessary. I added a PR https://github.com/facebookincubator/velox/pull/5930/files to remove it. Let's see if something breaks.
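As a side note, a common way to keep a production header compilable without gtest is to guard the include and fall back to a no-op `FRIEND_TEST` definition. The following is an assumed pattern for illustration, not the actual contents of Velox's GTestMacros.h, and the `Counter` class is a made-up example:

```cpp
#include <cassert>

// Assumed pattern, not Velox's actual GTestMacros.h: fall back to a plain
// friend declaration when gtest_prod.h is unavailable, so production code
// never hard-depends on gtest being installed.
#if defined(__has_include) && __has_include(<gtest/gtest_prod.h>)
#include <gtest/gtest_prod.h>
#else
#define FRIEND_TEST(suite, name) friend class suite##_##name##_Test
#endif

class Counter {
 public:
  int value() const { return value_; }
  void increment() { ++value_; }

 private:
  // Expands to a friend declaration; harmless when the test class is absent.
  FRIEND_TEST(CounterTest, ReachesInternals);
  int value_ = 0;
};
```

With such a fallback, only test targets need to link gtest; headers like Memory.h compile cleanly either way.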
@majetideepak Deepak, could you also help review this PR? Thanks.
LGTM.
@JkSelf thank you for your review.
@majetideepak Deepak, do you have any further comment? |
@wypb left some high-level comments.
velox/common/file/CMakeLists.txt (Outdated)

```diff
@@ -17,6 +17,10 @@ include_directories(.)
 add_library(velox_file File.cpp FileSystems.cpp Utils.cpp)
 target_link_libraries(velox_file velox_common_base Folly::folly)

+if(NOT VELOX_DISABLE_GOOGLETEST)
```
Why this change? Why does velox_file depend on gtest?
I deleted the append(dwio::common::DataBuffer<char>& buffer) API, so this dependency can be deleted.
velox/common/file/File.h (Outdated)

```diff
@@ -143,6 +144,9 @@ class WriteFile {
   // Appends data to the end of the file.
   virtual void append(std::string_view data) = 0;

+  // Appends data to the end of the file.
+  virtual void append(dwio::common::DataBuffer<char>& buffer) = 0;
```
Why do we need this new API? Can we use the existing std::string_view API on top of DataBuffer?
I tested it and found that the std::string_view API is enough to write Parquet files, thank you.
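To illustrate the point, a buffer's bytes can be passed through an append(std::string_view) overload without adding a dedicated DataBuffer overload. A minimal sketch with hypothetical stand-ins (`CharBuffer` and this `WriteFile` are not the real Velox types):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-ins, not the real Velox types: CharBuffer plays the
// role of dwio::common::DataBuffer<char>, and WriteFile exposes only the
// existing append(std::string_view) overload.
struct CharBuffer {
  std::vector<char> data;
  const char* begin() const { return data.data(); }
  std::size_t size() const { return data.size(); }
};

struct WriteFile {
  std::string contents;
  void append(std::string_view chunk) {
    contents.append(chunk.data(), chunk.size());
  }
};

// No dedicated DataBuffer overload needed: view the buffer's bytes through
// std::string_view and reuse the existing API. The view is non-owning, so
// the buffer must stay alive for the duration of the call (it does here).
void appendBuffer(WriteFile& file, const CharBuffer& buffer) {
  file.append(std::string_view(buffer.begin(), buffer.size()));
}
```

This is why the extra virtual was dropped: callers adapt at the call site instead of widening the WriteFile interface.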
@wypb thanks for the fix!
Hi @kgpai Thanks for your review! Do you have any further comments?
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @kgpai I just synchronized the latest code to solve the problem of
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@wypb This should land soon.
@kgpai Thank you.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Add support writing data to hdfs by executing INSERT INTO/CTAS. (facebookincubator#5663)

Summary:
Currently INSERT INTO/CTAS only supports writing data to the local file system. If we run the following SQL, it will execute successfully, but no data can be queried from the table:

```
presto> create schema hive.tpch_parquet_1px with (location='hdfs://hdfsCluster/user/hive/warehouse/tpch_parquet_1px.db/');
CREATE SCHEMA
presto>
presto> create table hive.tpch_parquet_1px.lineitem with (format = 'PARQUET') as select * from tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows

Query 20230714_073619_00015_8z7nw, FINISHED, 11 nodes
Splits: 301 total, 301 done (100.00%)
2:41 [6M rows, 1.97GB] [37.3K rows/s, 12.5MB/s]

presto> select * from hive.tpch_parquet_1px.lineitem limit 10;
 l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct | l_shipmode | l_comment
------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+----------------+------------+-----------
(0 rows)

Query 20230714_073912_00016_8z7nw, FINISHED, 2 nodes
Splits: 1 total, 1 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
```

because the data is actually written to the worker's local file system.

![image](https://github.com/facebookincubator/velox/assets/5170878/6554d670-8389-4c4e-89b2-c978fea55d14)

With this PR we can write data to HDFS by executing INSERT INTO/CTAS.

Pull Request resolved: facebookincubator#5663
Reviewed By: Yuhta
Differential Revision: D48195142
Pulled By: kgpai
fbshipit-source-id: 1b52c59fe338a23d9006e261b3d5737534cf1efd