Skip to content

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Oct 25, 2024

What changes were proposed in this pull request?

This pr changes the process to dynamically generate StateMessage.java during build using the corresponding maven/sbt plugin.

Why are the changes needed?

Maintain the same generation way as the Java files corresponding to other pb files in Spark.

Does this PR introduce any user-facing change?

No

How was this patch tested?

image

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang LuciferYang marked this pull request as draft October 25, 2024 08:56
@LuciferYang LuciferYang changed the title Change to using a plugin to generate StateMessage.java Change to using maven/sbt plugin to generate StateMessage.java Oct 25, 2024
@LuciferYang LuciferYang changed the title Change to using maven/sbt plugin to generate StateMessage.java Change to using maven/sbt plugin to generate StateMessage.java Oct 25, 2024
@github-actions github-actions bot added the INFRA label Oct 25, 2024
@LuciferYang LuciferYang changed the title Change to using maven/sbt plugin to generate StateMessage.java [BUILD][SS] Change to using maven/sbt plugin to generate StateMessage.java Oct 25, 2024
npm install --save-dev
node --experimental-vm-modules node_modules/.bin/jest

maven-test:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will revert after test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LuciferYang LuciferYang changed the title [BUILD][SS] Change to using maven/sbt plugin to generate StateMessage.java [SPARK-50117][BUILD][SS] Change to using maven/sbt plugin to generate StateMessage.java Oct 25, 2024
This reverts commit 20d0be9.
@LuciferYang LuciferYang marked this pull request as ready for review October 25, 2024 13:09
@github-actions github-actions bot removed the INFRA label Oct 25, 2024
@LuciferYang
Copy link
Contributor Author

cc @dongjoon-hyun FYI

@dongjoon-hyun
Copy link
Member

Could you resolve the conflicts, @LuciferYang ?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LuciferYang
Copy link
Contributor Author

Could you resolve the conflicts, @LuciferYang ?

done

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM from my side. Thank you, @LuciferYang .

@HeartSaVioR
Copy link
Contributor

Thanks for working on this, @LuciferYang !
I don't have a full context of how PySpark folks do the generated code thing in Spark Connect, but we also have py file being generated from proto. Will we need to do the same for generated py file? Or would there be technical reason not to do this? cc. @HyukjinKwon

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Oct 26, 2024

Thank you for the question from @HeartSaVioR . This was also an issue I intended to discuss in SPARK-49821, and I apologize for forgetting to leave comments yesterday.

Regarding the spark-connect module, there is a script called dev/connect-gen-protos.sh. This script is used by developers to manually regenerate the corresponding _pb2.py and _pb2.pyi files and submit them when changes are made to the .proto files in the connect module. Personally, I think we should provide a similar way to automate this process in the future, but as my understanding of Python is very superficial, I think we need @HyukjinKwon to confirm if there is a more reasonable solution.

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Oct 27, 2024

@HeartSaVioR @dongjoon-hyun Can we merge this one first? I have created a new Jira ticket: SPARK-50139, to remind us that we should provide a tool to assist in regenerating StateMessage_pb2.py and StateMessage_pb2.pyi. We can complete this TODO after determining the most reasonable approach for PySpark.

@dongjoon-hyun
Copy link
Member

To @LuciferYang , I agree with you.

I believe we can handle that seperately in SPARK-50139 if needed. Or, we can do a follow-up PR if something is missing in this PR's context.

@dongjoon-hyun
Copy link
Member

If there is no further comments until Tomorrow (72 hours after #48654 (comment)), I believe you can merge this.

@LuciferYang
Copy link
Contributor Author

If there is no further comments until Tomorrow (72 hours after #48654 (comment)), I believe you can merge this.

Got it ~

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Oct 28, 2024

Would we want to also add StateMessage.java to gitignore so that we don't try to add the file again?

@LuciferYang
Copy link
Contributor Author

Would we want to also add StateMessage.java to gitignore so that we don't try to add the file again?

double checked:

  • Maven
build/mvn clean install -DskipTests -pl sql/core -am


[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 4.0.0-SNAPSHOT:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.451 s]
[INFO] Spark Project Tags ................................. SUCCESS [  5.497 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  5.819 s]
[INFO] Spark Project Common Utils ......................... SUCCESS [ 26.619 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 13.609 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 16.461 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 15.016 s]
[INFO] Spark Project Variant .............................. SUCCESS [  4.177 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 14.669 s]
[INFO] Spark Project Connect Shims ........................ SUCCESS [  7.762 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  7.345 s]
[INFO] Spark Project Core ................................. SUCCESS [02:26 min]
[INFO] Spark Project SQL API .............................. SUCCESS [ 40.995 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:04 min]
[INFO] Spark Project SQL .................................. SUCCESS [03:19 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11:33 min
[INFO] Finished at: 2024-10-29T08:28:02+08:00
[INFO] ------------------------------------------------------------------------


git status

On branch state_message-pb
Your branch is up to date with 'origin/state_message-pb'.

nothing to commit, working tree clean


find sql/core -name "*StateMessage.java"
sql/core/target/generated-sources/org/apache/spark/sql/execution/streaming/state/StateMessage.java
  • SBT
build/sbt clean "sql/compile"
[success] Total time: 109 s (01:49), completed 2024年10月29日 上午8:34:33

git status
On branch state_message-pb
Your branch is up to date with 'origin/state_message-pb'.

nothing to commit, working tree clean

find sql/core -name "*StateMessage.java"
sql/core/target/scala-2.13/src_managed/main/org/apache/spark/sql/execution/streaming/state/StateMessage.java

The StateMessage.java files generated by the maven/sbt plugin are all located in the sql/core/target directory. They are all visible on the classpath and have already been made invisible to git. I believe that under normal circumstances, they will not be accidentally added by git add, so there is no need to modify the .gitignore file any further.

@HeartSaVioR
Copy link
Contributor

Ah OK, so the generated file is placed in target, not source directory. Sounds good.

@HeartSaVioR
Copy link
Contributor

Please allow me to wait for a day to sync with folks working on this (TWS in PySpark). I've asked them to comment and they will come in US tz. If there is no comment in tomorrow US tz, I'm OK with the fix.

@LuciferYang
Copy link
Contributor Author

Please allow me to wait for a day to sync with folks working on this (TWS in PySpark). I've asked them to comment and they will come in US tz. If there is no comment in tomorrow US tz, I'm OK with the fix.

OK ~

@dongjoon-hyun
Copy link
Member

Thank you for the sync-up with them, @HeartSaVioR .

@bogao007
Copy link
Contributor

bogao007 commented Oct 29, 2024

@LuciferYang Thanks for working on this change! QQ: for the dev loop after this is merged, if we add changes to StateMessage.proto, would command build/sbt clean package be enough to automatically generate the java file and get picked up by the compiler?

@dongjoon-hyun
Copy link
Member

To @bogao007 , yes, exactly.

FYI, Apache Spark has many proto files (including this) and use them (except this) in that way.

$ find . -type f -name "*.proto" | grep -v test
./core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
./sql/core/src/main/java/org/apache/spark/sql/execution/streaming/StateMessage.proto
./sql/connect/common/src/main/protobuf/spark/connect/relations.proto
./sql/connect/common/src/main/protobuf/spark/connect/catalog.proto
./sql/connect/common/src/main/protobuf/spark/connect/base.proto
./sql/connect/common/src/main/protobuf/spark/connect/example_plugins.proto
./sql/connect/common/src/main/protobuf/spark/connect/types.proto
./sql/connect/common/src/main/protobuf/spark/connect/expressions.proto
./sql/connect/common/src/main/protobuf/spark/connect/commands.proto
./sql/connect/common/src/main/protobuf/spark/connect/common.proto

@LuciferYang
Copy link
Contributor Author

Thanks for working on this change! QQ: for the dev loop after this is merged, if we add changes to StateMessage.proto, would command build/sbt clean package be enough to automatically generate the java file and get picked up by the compiler?

@bogao007 Yes, it would be enough.

Thanks for your help! @dongjoon-hyun

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@LuciferYang
Copy link
Contributor Author

Merged into master for Spark 4.0. Thanks @dongjoon-hyun @HeartSaVioR and @bogao007

@dongjoon-hyun
Copy link
Member

Thank you all again!

@LuciferYang LuciferYang deleted the state_message-pb branch May 2, 2025 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants