
Conversation

@jiangzho (Contributor):

What changes were proposed in this pull request?

This PR adds a Java API library for the Spark Operator, with the ability to generate the YAML spec.

Why are the changes needed?

The Spark Operator API refers to the CustomResourceDefinition (https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/) that represents the spec for a Spark Application in k8s.

This module will be used by the operator controller and reconciler. It can also serve external services that access the k8s API server from Java.
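For illustration, here is a minimal sketch of how such an external Java service might create a resource with the fabric8 client (the SparkApplication class name and its package are assumptions based on this PR, not the exact final API surface):

```java
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Assumed import: the custom resource type generated by this module.
import org.apache.spark.kubernetes.operator.SparkApplication;

public class SubmitSparkApp {
  public static void main(String[] args) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      SparkApplication app = new SparkApplication();
      app.getMetadata().setName("spark-pi");
      // Register the CRD type with the client and create the resource.
      client.resources(SparkApplication.class)
          .inNamespace("default")
          .resource(app)
          .create();
    }
  }
}
```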

Does this PR introduce any user-facing change?

No changes to the Apache Spark core API. A Spark Operator API is proposed.

To generate the SparkApplication spec YAML, use

./gradlew :spark-operator-api:finalizeGeneratedCRD

(this requires yq to be installed for patching additional printer columns)

The generated YAML file will be located at

spark-operator-api/build/classes/java/main/META-INF/fabric8/sparkapplications.org.apache.spark-v1.yml

For more details, please also refer to spark-operator-docs/spark_application.md

How was this patch tested?

This was tested locally.

Was this patch authored or co-authored using generative AI tooling?

No.


fabric8Version=6.12.1
commonsLang3Version=3.14.0
commonsIOVersion=2.16.1
Contributor Author:

fabric8 client, commons library, and log4j versions are designed to be in line with the Apache Spark dependency versions.

@jiangzho force-pushed the api branch 2 times, most recently from be1fb18 to 7cecb54 on April 23, 2024 07:12
# limitations under the License.
#

group=org.apache.spark.kubernetes.operator
Member:

Shall we use `k8s` instead of `kubernetes`?

*
*/

package org.apache.spark.kubernetes.operator;
Member:

ditto. k8s.

import org.apache.spark.kubernetes.operator.status.BaseStatus;

public class BaseResource<
S,
Member:

indentation?

Contributor Author:

This is introduced by the Spotless GoogleJavaStyle plugin. I have not yet dug into the implementation, but this seems to be related to the guide's rule:

> When line-wrapping, each line after the first (each continuation line) is indented at least +4 from the original line.

... and it attempts a +4+4 here.

Spotless GoogleJavaStyle does not allow additional customization. If this line-wrapping style becomes a concern, we may switch to our own style XML.

public static final String LABEL_SPARK_ROLE_EXECUTOR_VALUE = "executor";
public static final String SPARK_CONF_SENTINEL_DUMMY_FIELD = "sentinel.dummy.number";

public static final String SENTINEL_LABEL = "spark.operator/sentinel";
Member:

Why does this have a new group?

public static final String SENTINEL_LABEL = "spark.operator/sentinel";

// Default state messages
public static final String DriverRequestedMessage = "Requested driver from resource scheduler. ";
Member:

May I ask why this PR adds a space at the end, `scheduler. "`?

public static final String DriverRequestedMessage = "Requested driver from resource scheduler. ";
public static final String DriverCompletedMessage = "Spark application completed successfully. ";
public static final String DriverTerminatedBeforeInitializationMessage =
"Driver container is terminated without SparkContext / SparkSession initialization. ";
Member:

This is worse because we have two spaces, initialization. ".

"The Spark application is running with less than minimal number of requested "
+ "executors. ";
public static final String ExecutorLaunchTimeoutMessage =
"The Spark application failed to get enough executors in the given time threshold. ";
Member:

Adding a space manually and repeatedly is fragile. Please handle this in the message printing logic.

@Builder
@JsonInclude(JsonInclude.Include.NON_NULL)
@JsonIgnoreProperties(ignoreUnknown = true)
public class ApplicationTimeoutConfig {
Member:

May I ask where these magic numbers came from?

Contributor Author:

Added a few comments here. These numbers are actually the default values that we recommended to previous production customers. We may add more documentation regarding the default values.

@Builder.Default protected Long driverStartTimeoutMillis = 300 * 1000L;
@Builder.Default protected Long sparkSessionStartTimeoutMillis = 300 * 1000L;
@Builder.Default protected Long executorStartTimeoutMillis = 300 * 1000L;
@Builder.Default protected Long forceTerminationGracePeriodMillis = 300 * 1000L;
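As a usage sketch, a caller could override a single default while keeping the rest (a minimal sketch assuming the Lombok builder implied by `@Builder` above; exact accessor names are assumptions):

```java
// Override only the driver start timeout; the other fields keep their
// 300s @Builder.Default values shown above.
ApplicationTimeoutConfig timeouts =
    ApplicationTimeoutConfig.builder()
        .driverStartTimeoutMillis(600 * 1000L) // wait up to 10 minutes
        .build();
```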
Member:

It would be great if there were a reference for the source of these original values.

* policy is set to 'Never'.
*/
@Builder.Default
protected ResourceRetentionPolicy resourceRetentionPolicy = ResourceRetentionPolicy.AlwaysDelete;
Member (@dongjoon-hyun, Apr 23, 2024):

Does this include the Driver Pod and the ConfigMap itself? It would be great if you could mention that explicitly in the above comment.

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
Member:

BTW, please double-check all license headers. These seem to be copied from a broken source. For example, we don't have this empty line.

public class ApplicationAttemptSummary extends BaseAttemptSummary {
// The state transition history for given attempt
// This is used when state history trimming is enabled
protected Map<Long, ApplicationState> stateTransitionHistory;
Member:

Just a question. Why do we use Map instead of List for this linear data?

Contributor Author:

This is for the sake of unique state identification. We attempt to assign a unique id to each ApplicationState, one that always increments across multiple attempts.

I also considered adding the state id inside ApplicationState instead of introducing a map of state id <-> state, but it ended up with many corner cases to achieve idempotency for state transitions.

It also helps with truncating the state transition history. Sometimes this history can get really long and cause large items in etcd. The map lets us avoid iterating over the full history each time we truncate it.
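A rough sketch of that truncation idea (assumed shape, not the PR's exact code):

```java
import java.util.TreeMap;

public class HistoryTrimmer {
  // With a sorted map keyed by an ever-increasing state id, trimming drops the
  // oldest entries without scanning the whole history.
  static void trim(TreeMap<Long, ApplicationState> history, int maxSize) {
    while (history.size() > maxSize) {
      history.pollFirstEntry(); // removes the smallest (oldest) state id
    }
  }
}
```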

Member:

Got it.

import java.util.Set;

public enum ApplicationStateSummary implements BaseStateSummary {
/** Spark application is submitted to the cluster but yet scheduled. */
Member:

nit. If you want to add `.` at the end, let's add it to all sentences.

/** A request has been made to start driver pod in the cluster */
DRIVER_REQUESTED,

/** Driver pod has reached running state */
Member:

To be clear, please use the matched K8s Pod Phase term exactly by using Running instead of running.

- /** Driver pod has reached running state */
+ /** Driver pod has reached `Running` phase */

/** Driver pod has reached running state */
DRIVER_STARTED,

/** Spark session is initialized */
Member:

Please describe what this means.

  • Specifically, this is irrelevant to Pod Liveness or Readiness.
  • What is Spark Session here?
  • What is required in K8s Spark Operator?

Member:

Specifically, I guess we need to mention the SparkContext or SparkSession concept together with this.

Contributor Author:

Updated the doc - yeah, this can be confusing. We'll name it in a way that reveals that this indeed means the driver is ready and can be exposed via a service.

/** Spark session is initialized */
DRIVER_READY,

/** Less that minimal required executor pods become ready during starting up */
Member:

I don't think this description is correct. At least, this is insufficient.

  • What is the meaning of Executor Pod Become Ready?
  • Executor Pod readiness doesn't imply that Driver JVM knows this executor.

Contributor Author:

Updated the description. You are absolutely right that executor pod state does not imply the executors' actual state from Spark's perspective. This is a 'best effort' from the operator side to observe app status without modifying the core.

We do have some ideas to optimize this in future versions, making the operator able to detect app status by:

  • connecting to the driver to get its registered executor information (instead of watching executor pods). We may use the existing Spark UI for this purpose - users should be able to opt in to this feature if they enable pod-to-pod communication between the operator and the driver.
  • having the driver update the CRD status, possibly via a listener, as sketched below.

These future enhancements may involve core / k8s module changes.
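For the listener idea, a hedged sketch of what a driver-side hook might look like (OperatorStatusListener and the reporting behavior are hypothetical; only the SparkListener API below is existing Spark):

```java
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerExecutorAdded;

// Hypothetical: a driver-side listener that could report executor registration
// back to the operator (e.g. by patching the CRD status). No such reporting
// API exists yet, so this sketch only logs the event.
public class OperatorStatusListener extends SparkListener {
  @Override
  public void onExecutorAdded(SparkListenerExecutorAdded event) {
    System.out.println("Executor registered with driver: " + event.executorId());
  }
}
```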

INITIALIZED_BELOW_THRESHOLD_EXECUTORS,

/** All required executor pods started */
RUNNING_HEALTHY,
Member:

This is the same. 'All required executor pods started' doesn't mean that the Spark Driver JVM knows all of these executors.

FAILED,

/**
* The job has failed because of a scheduler side issue. e.g. driver scheduled on node with
Member:

This could be misleading because we have many schedulers inside Spark too. This means K8s's pod scheduler or YuniKorn/Volcano's batch scheduler, right?


public BaseStatus(STATE initState, AS currentAttemptSummary) {
this.currentState = initState;
this.stateTransitionHistory = new TreeMap<>();
Member:

May I ask why we choose this implementation?

# See the License for the specific language governing permissions and
# limitations under the License.
#
#
Member:

nit. redundant empty line.


script_path=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
crd_path="${script_path}/../../../build/classes/java/main/META-INF/fabric8/sparkapplications.org.apache.spark-v1.yml"
yq -i '.spec.versions[0] += ({"additionalPrinterColumns": [{"jsonPath": ".status.currentState.currentStateSummary", "name": "Current State", "type": "string"}, {"jsonPath": ".metadata.creationTimestamp", "name": "Age", "type": "date"}]})' $crd_path
Member:

Although this yq requirement is mentioned before, this might be a headache later.

Contributor Author:

+1 - this is added as a workaround. Additional printer columns are "nice to have", so we did not make the build task invoke this as a mandatory step.

We may spend some time fixing the original issue and remove this workaround in a future version.
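One possible direction for removing the workaround - an assumption, since availability and nested-path support in the pinned crd-generator version would need verification - is fabric8's PrinterColumn annotation on the status field, so the generator emits the column itself:

```java
import io.fabric8.crd.generator.annotation.PrinterColumn;

// Hypothetical sketch: let the CRD generator emit additionalPrinterColumns
// instead of patching the generated YAML with yq afterwards.
public class ApplicationState {
  @PrinterColumn(name = "Current State")
  protected String currentStateSummary;
}
```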

@@ -0,0 +1,221 @@
## Spark Application API
Member:

Please remove all spark-operator-docs from this PR. We didn't start it yet, @jiangzho .

Member (@dongjoon-hyun) left a comment:

I finished the first round review, @jiangzho . Please address the review comments. Thank you.

package org.apache.spark.k8s.operator.status;

import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.SUBMITTED;
import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.SUCCEEDED;
Member:

Shall we use *?

Contributor Author:

Wildcard imports were disabled per the style guide - but we can fully qualify static imports.

Member:

Shy.. Too many lines.

Member:

Ack, if that's the way.

void testAppendNewState() {
ApplicationStatus applicationStatus = new ApplicationStatus();
ApplicationState newState =
new ApplicationState(ApplicationStateSummary.RUNNING_HEALTHY, "foo");
Member:

Like SUBMITTED and SUCCEEDED, we need to import this for consistency.

status1.terminateOrRestart(
noRetryConfig, ResourceRetainPolicy.Never, messageOverride, false);
Assertions.assertEquals(
ApplicationStateSummary.RESOURCE_RELEASED,
Member:

ditto. After importing *, let's use a simple form.

// without retry
ApplicationStatus status1 =
new ApplicationStatus().appendNewState(new ApplicationState(SUCCEEDED, "bar"));
ApplicationStatus updatedStatus11 =
Member (@dongjoon-hyun, Apr 25, 2024):

Shall we have a meaningful name instead of updatedStatus11? Does this mean Application 1's 2nd attempt?

}

@Test
void testTerminateOrRestart() {
Member:

It would be great if you split this long test method into smaller ones.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodTemplateSpec;
import io.fabric8.kubernetes.api.model.PodTemplateSpecBuilder;
import org.junit.jupiter.api.Assertions;
Member (@dongjoon-hyun, Apr 25, 2024):

Shall we do the following in all *Test.java?

- import org.junit.jupiter.api.Assertions;
+ import org.junit.jupiter.api.Assertions.*;

@JsonInclude(JsonInclude.Include.NON_NULL)
@JsonIgnoreProperties(ignoreUnknown = true)
public class AttemptInfo {
@Builder.Default protected final Long id = 0L;
Member:

If we don't want a complex one, let's change this Long object to AtomicLong.

java.util.concurrent.atomic.AtomicLong

Member:

We can use addAndGet instead of id + 1L.

Contributor Author:

Actually, attempt info and summary are designed to be immutable after creation. Using AtomicLong & addAndGet would mutate the underlying value and could cause unintentional updates.

In the most recent commit, I removed all setters on AttemptInfo and AttemptSummary; hope that reduces the confusion.
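A small sketch of that immutable flow (builder and getter names are assumed):

```java
// Each restart derives a fresh immutable AttemptInfo; the previous object is
// never mutated, unlike an AtomicLong counter would be.
AttemptInfo next = AttemptInfo.builder()
    .id(previous.getId() + 1L)
    .build();
```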

import org.apache.spark.k8s.operator.status.BaseStatus;

public class BaseResource<
S,
Member:

I guess 4-space would be enough because our default is 2-space.

Contributor Author:

ack. updated this format & turned off formatter for this class.
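For reference, the hand-formatted declaration looks roughly like this (the AS type parameter is inferred from BaseStatus<S, STATE, AS> in the snippet below, so this is an assumption about the full declaration):

```java
public class BaseResource<
    S,
    STATE extends BaseState<S>,
    AS extends BaseAttemptSummary,
    SPEC extends BaseSpec,
    STATUS extends BaseStatus<S, STATE, AS>>
    extends CustomResource<SPEC, STATUS> implements Namespaced {}
```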

STATE extends BaseState<S>,
SPEC extends BaseSpec,
STATUS extends BaseStatus<S, STATE, AS>>
extends CustomResource<SPEC, STATUS> implements Namespaced {}
Member:

Ditto. 2-space.

'org.apache.spark',
)
toggleOffOn()
targetExclude "**/BaseResource.java"
Member:

Oh. Got it. This is due to my request.. :(

* initExecutors: 5 maxExecutors: 10 sparkConf: spark.executor.instances: "10"
*
* <p>Spark would try to bring up 10 executors as defined in SparkConf. In addition, from SparkApp
* perspective, + If Spark app acquires less than 5 executors in given tine window
Member:

Does this + exist for styling?

* <p>Spark would try to bring up 10 executors as defined in SparkConf. In addition, from SparkApp
* perspective, + If Spark app acquires less than 5 executors in given tine window
* (.spec.applicationTolerations.applicationTimeoutConfig.executorStartTimeoutMillis) after
* submitted, it would be shut down proactively in order to avoid resource deadlock. + Spark app
Member:

ditto. +

* (.spec.applicationTolerations.applicationTimeoutConfig.executorStartTimeoutMillis) after
* submitted, it would be shut down proactively in order to avoid resource deadlock. + Spark app
* would be marked as 'RUNNING_WITH_PARTIAL_CAPACITY' if it loses executors after successfully start
* up. + Spark app would be marked as 'RunningHealthy' if it has at least min executors after
Member:

ditto. +

&& RunningHealthy.ordinal() > this.ordinal();
}

public boolean isTerminated() {
Member:

DriverEvicted is not a part of this?

Contributor Author:

It is not. It's 'stopping', like succeeded / failed, etc.

}

public boolean isTerminated() {
return ResourceReleased.equals(this) || TerminatedWithoutReleaseResources.equals(this);
Member:

I'm wondering if isTerminated considers the following.

  • ResourceRetainPolicy.Always
  • ResourceRetainPolicy.OnFailure

Contributor Author:

It does not - or rather, it does that at an earlier stage. The operator evaluates the retain policy on the 'stopping' states, and makes the app transition into one of the terminated states based on that.

}

public boolean isStopping() {
return RunningWithBelowThresholdExecutors.ordinal() < this.ordinal() && !isTerminated();
Member:

For my understanding: this means DriverStartTimedOut, ExecutorsStartTimedOut, and DriverReadyTimedOut are in isStopping?

Contributor Author:

Yes. In addition, all succeeded / failed / evicted states are considered 'stopping'. The operator proceeds to resource release and restart from these stopping states.

import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.DriverStartTimedOut;
import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.ExecutorsStartTimedOut;
import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.Failed;
import static org.apache.spark.k8s.operator.status.ApplicationStateSummary.Succeeded;
Member:

I suggested importing *, but is this the result of one of the static analyzers?

Contributor Author:

It's actually the Java style guide that discourages wildcards - #8 (comment) - Spotless therefore marks them as a violation.

Member (@dongjoon-hyun) left a comment:

+1, LGTM. Thank you so much for your efforts, @jiangzho .
Merged to main.

@dongjoon-hyun dongjoon-hyun mentioned this pull request Apr 26, 2024
@jiangzho jiangzho deleted the api branch July 23, 2024 23:04
jiangzho referenced this pull request in jiangzho/spark-kubernetes-operator Jul 17, 2025
Updated formatting for ModelUtils
dongjoon-hyun added a commit that referenced this pull request Oct 2, 2025
### What changes were proposed in this pull request?

This PR aims to use `log4j2` instead of `log4j` by
- Use `log4j-slf4j2-impl` instead of `log4j-slf4j-impl`
- Remove `log4j-1.2-api`

### Why are the changes needed?

Apache Spark main repository has been using `log4j-slf4j2-impl` instead of `log4j-slf4j-impl` in order to use `Log4J2`.

- apache/spark#37844

The Apache Spark K8s Operator repository seems to have mistakenly used `log4j-slf4j-impl` in the initial implementation.
- #8

### Does this PR introduce _any_ user-facing change?

No behavior change.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #363 from dongjoon-hyun/SPARK-53783.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>