
Add support for writing data to HDFS by executing INSERT INTO/CTAS. #5663

Closed
wants to merge 1 commit into from

Conversation


@wypb wypb commented Jul 17, 2023

Currently, INSERT INTO/CTAS only supports writing data to the local file system. If we run the following SQL, the query executes successfully, but no data can be queried from the table:

presto> create schema hive.tpch_parquet_1px with (location='hdfs://hdfsCluster/user/hive/warehouse/tpch_parquet_1px.db/');
CREATE SCHEMA
presto>
presto> create table hive.tpch_parquet_1px.lineitem with (format = 'PARQUET') as select * from tpch.sf1.lineitem;
CREATE TABLE: 6001215 rows

Query 20230714_073619_00015_8z7nw, FINISHED, 11 nodes
Splits: 301 total, 301 done (100.00%)
2:41 [6M rows, 1.97GB] [37.3K rows/s, 12.5MB/s]

presto> select * from hive.tpch_parquet_1px.lineitem limit 10;
 l_orderkey | l_partkey | l_suppkey | l_linenumber | l_quantity | l_extendedprice | l_discount | l_tax | l_returnflag | l_linestatus | l_shipdate | l_commitdate | l_receiptdate | l_shipinstruct | l_shipmode | l_comment
------------+-----------+-----------+--------------+------------+-----------------+------------+-------+--------------+--------------+------------+--------------+---------------+----------------+------------+-----------
(0 rows)

Query 20230714_073912_00016_8z7nw, FINISHED, 2 nodes
Splits: 1 total, 1 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

This is because the data is actually written to the worker's local file system.

With this PR, we can write data to HDFS by executing INSERT INTO/CTAS.
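At a high level, the fix dispatches the write sink on the target path's scheme. The following is a minimal sketch of that idea with hypothetical class names (the actual Velox DataSink interface is richer and is not reproduced here):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Minimal stand-ins for a write-sink hierarchy; illustrative only,
// not the actual Velox classes.
struct DataSink {
  virtual ~DataSink() = default;
};
struct LocalWriteFileSink : DataSink {};
struct HdfsWriteFileSink : DataSink {};

// Pick a sink based on the target path's scheme. Before this PR, every
// write ended up on the worker's local file system regardless of scheme.
std::unique_ptr<DataSink> makeDataSink(const std::string& path) {
  if (path.rfind("hdfs://", 0) == 0) {
    return std::make_unique<HdfsWriteFileSink>();
  }
  return std::make_unique<LocalWriteFileSink>();
}
```

With this dispatch in place, a CTAS targeting an `hdfs://` location resolves to an HDFS-backed sink instead of silently falling through to the local filesystem.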

@netlify
netlify bot commented Jul 17, 2023: Deploy Preview for meta-velox canceled.
Latest commit: abc700b
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/64d41aca08673d00082fd151

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 17, 2023
@wypb wypb force-pushed the insert_into_hdfs branch 6 times, most recently from 2b5f452 to 95af4ff Compare July 17, 2023 08:32
@@ -59,6 +61,28 @@ void WriteFileDataSink::registerLocalFileFactory() {
DataSink::registerFactory(localWriteFileSink);
}

#ifdef VELOX_ENABLE_HDFS3
std::unique_ptr<DataSink> hdfsWriteFileSink(
Collaborator:

Can this be defined in the HDFS API?

@majetideepak (Collaborator):

@wypb We should abstract away optional checks to the storage adapter and HiveConnector initialize.
Can you refactor hdfs registration in a separate PR similar to S3 here #5469?
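The registration refactor suggested here can be sketched as a factory-registry pattern. The names below are hypothetical, modeled loosely on the idea of the S3 registration in #5469 rather than on the actual Velox code: each optional storage adapter exposes one registration entry point, and the connector probes the registered factories in turn, keeping the optional-adapter checks in one place.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Illustrative sink type; not the actual Velox class.
struct FileSink {
  virtual ~FileSink() = default;
};
struct HdfsFileSink : FileSink {};

using SinkFactory =
    std::function<std::unique_ptr<FileSink>(const std::string& path)>;

// Process-wide factory list that adapters append to.
std::vector<SinkFactory>& sinkFactories() {
  static std::vector<SinkFactory> factories;
  return factories;
}

// The adapter's single entry point; in a real build this would only be
// compiled/called when HDFS support is enabled.
void registerHdfsFileSink() {
  sinkFactories().push_back(
      [](const std::string& path) -> std::unique_ptr<FileSink> {
        if (path.rfind("hdfs://", 0) == 0) {
          return std::make_unique<HdfsFileSink>();
        }
        return nullptr;  // Not an HDFS path; let another factory handle it.
      });
}

// The connector asks each registered factory until one accepts the path.
std::unique_ptr<FileSink> createSink(const std::string& path) {
  for (auto& factory : sinkFactories()) {
    if (auto sink = factory(path)) {
      return sink;
    }
  }
  return nullptr;
}
```

The benefit of this shape is that the connector never needs `#ifdef` checks per adapter; disabled adapters simply never register a factory.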


wypb commented Jul 17, 2023

@majetideepak sure, let me do it in a separate pull request.


JkSelf commented Jul 18, 2023

@majetideepak When we added the native Parquet write function in Gluten, we found that the Velox Parquet writer only supports writing data to the local file system, so we filed PR #5400 to support writing to an HDFS path when calling the Velox Parquet writer. Can you also take a look when you have time? Thanks.

@wypb wypb closed this Jul 20, 2023
@wypb wypb force-pushed the insert_into_hdfs branch 2 times, most recently from 95af4ff to 40b1daf Compare July 20, 2023 01:42
@wypb wypb reopened this Jul 20, 2023
@wypb wypb force-pushed the insert_into_hdfs branch 2 times, most recently from 7bd1dc1 to c42244a Compare July 20, 2023 05:22

wypb commented Jul 20, 2023

Hi @majetideepak, based on #5700 I refactored the code logic; could you help review it again?

@majetideepak (Collaborator):

@JkSelf this PR seems to contain the HdfsFileSink implementation as well. Do we still need #5400?
Can you help review this PR as well? Thanks.


JkSelf commented Jul 21, 2023

@JkSelf this PR seems to contain the HdfsFileSink implementation as well. Do we still need #5400? Can you help review this PR as well? Thanks.

Yes, it seems #5400 is no longer needed. Thanks for your review.

@@ -19,7 +19,11 @@ if(VELOX_ENABLE_S3)
target_sources(velox_s3fs PRIVATE S3FileSystem.cpp S3Util.cpp)

target_include_directories(velox_s3fs PUBLIC ${AWSSDK_INCLUDE_DIRS})
target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES})
target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES} xsimd)
Contributor:

Why are the S3-related changes needed?

Contributor (Author):

@JkSelf Thank you for your reply.
It is because velox/common/file/File.h now includes velox/dwio/common/DataBuffer.h, and DataBuffer.h depends on xsimd and gtest.

Collaborator:

Why does DataBuffer.h depend on xsimd and gtest? We need to fix this.

wypb (Contributor, Author) commented Jul 27, 2023:

The S3 CMakeLists.txt change cannot be avoided this time because the HDFS CMakeLists.txt depends on it.
If we remove xsimd, the following error occurs at compile time:

In file included from /Users/wypb/data/code/apache/temp/velox/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataSink.h:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataBuffer.h:22:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/buffer/Buffer.h:26:
/Users/wypb/data/code/apache/temp/velox/./velox/common/base/SimdUtil.h:24:10: fatal error: 'xsimd/xsimd.hpp' file not found
#include <xsimd/xsimd.hpp>

If we remove gtest, the following error occurs at compile time:

In file included from /Users/wypb/data/code/apache/temp/velox/velox/connectors/hive/storage_adapters/hdfs/RegisterHdfsFileSystem.cpp:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataSink.h:23:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/dwio/common/DataBuffer.h:22:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/buffer/Buffer.h:27:
In file included from /Users/wypb/data/code/apache/temp/velox/./velox/common/memory/Memory.h:38:
/Users/wypb/data/code/apache/temp/velox/./velox/common/base/GTestMacros.h:24:10: fatal error: 'gtest/gtest_prod.h' file not found
#include <gtest/gtest_prod.h>
         ^~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.

Collaborator:

Including velox/common/base/GTestMacros.h inside velox/common/memory/Memory.h seems unnecessary.
I added PR https://github.com/facebookincubator/velox/pull/5930/files to remove it; let's see if anything breaks.


wypb commented Jul 25, 2023

@majetideepak Deepak, could you also help review this PR? Thanks.


JkSelf commented Jul 26, 2023

LGTM.


wypb commented Jul 26, 2023

@JkSelf thank you for your review.


JkSelf commented Jul 26, 2023

@majetideepak Deepak, do you have any further comments?

majetideepak (Collaborator) left a comment:

@wypb left some high-level comments.

@@ -17,6 +17,10 @@ include_directories(.)
add_library(velox_file File.cpp FileSystems.cpp Utils.cpp)
target_link_libraries(velox_file velox_common_base Folly::folly)

if(NOT VELOX_DISABLE_GOOGLETEST)
Collaborator:

Why this change? Why does velox_file depend on gtest?

Contributor (Author):

I deleted the append(dwio::common::DataBuffer<char>& buffer) API, so this dependency can be removed.

@@ -19,7 +19,11 @@ if(VELOX_ENABLE_S3)
target_sources(velox_s3fs PRIVATE S3FileSystem.cpp S3Util.cpp)

target_include_directories(velox_s3fs PUBLIC ${AWSSDK_INCLUDE_DIRS})
target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES})
target_link_libraries(velox_s3fs Folly::folly ${AWSSDK_LIBRARIES} xsimd)
Collaborator:

Why does DataBuffer.h depend on xsimd and gtest? We need to fix this.

@@ -143,6 +144,9 @@ class WriteFile {
// Appends data to the end of the file.
virtual void append(std::string_view data) = 0;

// Appends data to the end of the file.
virtual void append(dwio::common::DataBuffer<char>& buffer) = 0;
Collaborator:

Why do we need this new API? Can we use the existing std::string_view API on top of DataBuffer?

Contributor (Author):

I tested it and found that the std::string_view API is enough to write Parquet files. Thank you.
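The conclusion above can be sketched as follows, using illustrative classes rather than the actual Velox WriteFile: a contiguous char buffer can be viewed through the existing append(std::string_view) API with no copy, so no DataBuffer-specific overload is needed.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Illustrative stand-in for a write-file API; not the actual Velox class.
struct WriteFile {
  // The existing byte-oriented append API.
  void append(std::string_view data) {
    contents_.append(data.data(), data.size());
  }
  std::string contents_;  // Stored in memory here just for demonstration.
};

// A caller holding a contiguous char buffer (as a DataBuffer<char> would)
// can adapt it to string_view without copying the bytes.
void writeBuffer(WriteFile& file, const std::vector<char>& buffer) {
  file.append(std::string_view(buffer.data(), buffer.size()));
}
```

Since std::string_view is a non-owning (pointer, length) pair, the adaptation is free; the only requirement is that the buffer outlives the append call.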

majetideepak (Collaborator) left a comment:

@wypb thanks for the fix!

@majetideepak (Collaborator):

@kgpai, @Yuhta can one of you please help with merging this change? Thanks.


wypb commented Aug 8, 2023

Hi @kgpai, @Yuhta, could you please help merge this change? Thanks.


wypb commented Aug 9, 2023

Hi @kgpai, thanks for your review! Do you have any further comments?

@wypb wypb force-pushed the insert_into_hdfs branch 6 times, most recently from e6417fd to 0286836 Compare August 9, 2023 16:14
@facebook-github-bot:

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


wypb commented Aug 10, 2023

Hi @kgpai, I just synced the latest code to fix the MultiFragmentTest.exchangeStatsOnFailure test failure in the Linux build. Could you please review this again? Thank you.

@facebook-github-bot:

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


kgpai commented Aug 11, 2023

@wypb This should land soon.


wypb commented Aug 11, 2023

@kgpai Thank you.

@facebook-github-bot:

@kgpai merged this pull request in 81a4de5.

@conbench-facebook:

Conbench analyzed the 1 benchmark run on commit 81a4de5f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@wypb wypb deleted the insert_into_hdfs branch August 14, 2023 01:36
unigof pushed a commit to unigof/velox that referenced this pull request Aug 18, 2023
…bookincubator#5663)

Summary: same as the PR description above.

Pull Request resolved: facebookincubator#5663

Reviewed By: Yuhta

Differential Revision: D48195142

Pulled By: kgpai

fbshipit-source-id: 1b52c59fe338a23d9006e261b3d5737534cf1efd
ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
…bookincubator#5663) (same commit message as above)
Labels: CLA Signed, Merged

6 participants