
[Spark] Set up Python Protobuf codegen for Delta Connect #3125

Merged
merged 33 commits into from
May 29, 2024

Conversation

longvu-db
Contributor

@longvu-db longvu-db commented May 21, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Added the very first Protobuf messages for DeltaTable, Clone, and Scan.

This is the first PR for Delta Connect, which adds support for Delta's DeltaTable interface to Spark Connect. This is needed to support Delta table operations outside of SQL directly on Spark Connect clients.

This PR sets up the Python code generation for the Protobufs of Delta Connect. For this, I created a new Buf workspace and added a few initial Protobuf messages to confirm that everything works. This is the groundwork of the project, before we move on to setting up the server and client library.

What we are doing here is similar to Spark Connect's Protobuf development guide.

How was this patch tested?

Added check-delta-connect-codegen-python.py to the automated tests, making sure the generated Python Protobuf code stays in sync with the proto messages.
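A codegen-sync check of this kind generally regenerates the Protobuf code into a temporary directory and compares it byte-for-byte with the checked-in copy. A minimal sketch of the comparison step, with hypothetical paths and function name (not the actual script):

```python
import filecmp
from pathlib import Path

def generated_code_in_sync(checked_in: Path, regenerated: Path) -> bool:
    """Return True when the committed generated files match a fresh codegen run.

    Both directory arguments are hypothetical stand-ins, e.g. the committed
    python/delta/connect/proto files vs. a temporary buf output directory.
    """
    committed = sorted(p.name for p in checked_in.glob("*_pb2.py*"))
    fresh = sorted(p.name for p in regenerated.glob("*_pb2.py*"))
    if committed != fresh:
        # A generated file was added or removed since the last codegen run.
        return False
    # shallow=False forces a byte-for-byte content comparison.
    _, mismatch, errors = filecmp.cmpfiles(
        checked_in, regenerated, committed, shallow=False
    )
    return not mismatch and not errors
```

When a check like this fails in CI, the remedy is simply to rerun the codegen script and commit the regenerated files.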

Does this PR introduce any user-facing changes?

No.

@longvu-db longvu-db changed the title Delta python proto codegen [Spark] Delta python proto codegen May 21, 2024
@longvu-db longvu-db changed the title [Spark] Delta python proto codegen [Spark] Set up Python Protobuf codegen for Delta Connect May 21, 2024
@longvu-db
Contributor Author

The failed tests are in org.apache.spark.sql.delta.perf.OptimizeGeneratedColumnSuite, which seems unrelated to this PR.

@@ -0,0 +1,21 @@
#
Contributor

I suggest the directory name spark-connect, to capture that Connect is only for Spark.

Contributor

Everything under python/ is fine, because everything under python/ is already only for Spark.

Contributor Author

@longvu-db longvu-db May 21, 2024

@tdas Hmm, from an outside look, even though the content of the python folder is only for Spark, nothing in the naming suggests that it is for Spark. Or do we document somewhere that everything under the python folder is already only for Spark, like you said? Shall we also use python/delta/spark-connect instead of python/delta/connect?

Contributor

The entire content of python/delta is for Spark. I agree that ideally python/delta should live under spark/, but that's an existing problem that we will solve holistically.

It's not clear to me yet why all the Python files need to be under python/delta/connect/ and not directly under python/delta/. Can you tell me more?

Contributor

I don't have a strong opinion. For now we can keep it here, and we can move it when we have more clarity.

Contributor Author

@tdas Because all the files under python/delta/connect are for Delta Connect, we could also rename it to delta-connect.

However, I think the naming name1/name2 might be better than name1-name2: delta/connect instead of delta-connect, and spark/connect instead of spark-connect.

Contributor

@andreaschat-db andreaschat-db left a comment

Left a couple of comments.

dev/connect-gen-protos.sh — 3 resolved review threads
@@ -17,6 +17,6 @@
[pycodestyle]
Contributor

Is this file used for lint or something?

Contributor Author

Yes, I think so. 71606f4

# the one generated by proto.
for f in `find gen/proto/python/delta/connect -name "*.py*"`; do
# First fix the imports.
if [[ $f == *_pb2.py || $f == *_pb2_grpc.py ]]; then
Contributor

This is the only part where we modify the generated code. How do we know the replacements are correct? Perhaps add that to the "How was this patch tested?" section?

Contributor Author

I added a Python script to check whether the generated code is out of sync with the current proto messages.

In the future, when we set up the Delta Connect server and client, we will have tests for them, and if the generated code is wrong, it will break those Delta Connect tests.
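The kind of import rewrite the script applies to the generated files can be illustrated in Python. The exact replacement below (delta.connect → delta.connect.proto) is an assumed example of this style of fix-up, not necessarily the script's real substitution:

```python
import re

# Hypothetical example of the post-codegen import fix-up: protoc emits
# absolute imports rooted at the proto package ("from delta.connect import
# foo_pb2"), which are rewritten here to an assumed in-wheel package path.
_IMPORT_PATTERN = re.compile(
    r"^from delta\.connect import (\S+_pb2)", flags=re.MULTILINE
)

def fix_imports(source: str) -> str:
    # Only bare "delta.connect" imports match, so running the rewrite a
    # second time on an already-fixed file is a no-op.
    return _IMPORT_PATTERN.sub(r"from delta.connect.proto import \1", source)
```

Because the rewrite is idempotent, rerunning the codegen script over already-patched files cannot corrupt them; the sync-check script then confirms the committed output matches.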

Contributor

@andreaschat-db andreaschat-db left a comment

LGTM. Left one comment.

@@ -41,6 +41,10 @@ jobs:
sudo apt-get update
sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python-openssl git
sudo apt install libedit-dev
curl -LO https://github.com/bufbuild/buf/releases/download/v1.28.1/buf-Linux-x86_64.tar.gz
Contributor

Do we need to add this to the requirements?

Contributor Author

I think not, because requirements.txt is only for Python dependencies, used for the Python venv.

@longvu-db longvu-db requested a review from tdas May 28, 2024 08:11
Contributor

@tomvanbussel tomvanbussel left a comment

LGTM

@vkorukanti vkorukanti merged commit eb638bb into delta-io:master May 29, 2024
10 checks passed
5 participants