CASSANALYTICS-168 Need the ability to broadcast and reconstruct subclasses on executors by skoppu22 · Pull Request #205 · apache/cassandra-analytics

skoppu22 · 2026-05-08T12:29:19Z

After the BulkWriterConfig broadcast refactor (f960685), the bulk writer's context, cluster, and config subclasses can no longer be instantiated on executors. For any job whose driver-side context was instantiated from a subclass, the executor silently instantiates the base-class implementations instead. This PR adds the ability to broadcast subclasses and reconstruct them on executors.
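A minimal sketch of the idea, not the PR's actual code: if the serializable object that gets broadcast owns the factory method, a subclass survives the trip to the executor and can rebuild its own context type. Every name here (`WriterContext`, `CustomWriterContext`, `BroadcastableConfig`, `CustomBroadcastableConfig`, `BroadcastSketch`) is a hypothetical stand-in, and plain Java serialization stands in for a Spark broadcast.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical driver/executor-side context; deliberately NOT serializable.
class WriterContext
{
    String describe()
    {
        return "base";
    }
}

// Hypothetical custom context a downstream project might provide.
class CustomWriterContext extends WriterContext
{
    @Override
    String describe()
    {
        return "custom";
    }
}

// Hypothetical broadcastable config: it is serializable and knows how to
// build the matching context, so the executor never hard-codes a class.
class BroadcastableConfig implements Serializable
{
    private static final long serialVersionUID = 1L;

    WriterContext toContext()
    {
        return new WriterContext();
    }
}

class CustomBroadcastableConfig extends BroadcastableConfig
{
    private static final long serialVersionUID = 1L;

    @Override
    WriterContext toContext()
    {
        return new CustomWriterContext();
    }
}

class BroadcastSketch
{
    // Serialize then deserialize, imitating shipping from driver to executor.
    static <T extends Serializable> T roundTrip(T value)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes))
            {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray())))
            {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        }
        catch (IOException | ClassNotFoundException e)
        {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args)
    {
        // "Executor" side: only the base type is known statically,
        // yet the subclass override is the one that runs.
        BroadcastableConfig onExecutor = roundTrip(new CustomBroadcastableConfig());
        System.out.println(onExecutor.toContext().describe()); // prints "custom"
    }
}
```

The failure described above corresponds to the executor calling `new WriterContext()` directly rather than delegating to the deserialized config.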

Circle CI link: https://app.circleci.com/pipelines/github/skoppu22/cassandra-analytics/116/workflows/6b53f2bd-017f-4dbe-917f-8d6794dfd24b

jmckenzie-dev · 2026-05-08T16:05:23Z

+
+        // Extract only broadcast-safe cluster metadata
+
+        // ClusterInfo has transient fields (CassandraContext, token mappings) that are not serializable


Currently the distinction between what is transient and what is not is implicit, derived from the verbiage in this method. How can we instead make it clear within ClusterInfo which fields are serializable and which are not? Brainstorming:

javadoc comments

usage of an @Serializable and @Serial interface (kind of overloading and using in a different way than the formal usage but would annotate the intent)

Adding our own @Immutable-style interface, or otherwise marking the fields final (or moving them toward being final) where appropriate

Having the serializability state of these fields denoted here in comments is brittle and runs a real risk of drift; changes in ClusterInfo could easily break these contracts in the future w/out another maintainer realizing it.

Removed these comments. In CassandraClusterInfo, I grouped the fields by serializability and added comments.
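One way to keep that contract next to the fields themselves is sketched below, under the assumption that plain Java serialization is in play: group the fields, and let the `transient` keyword (rather than a comment in a distant method) state what is not shipped. `ClusterInfoSketch`, `FakeConnection`, and `SerializationSketch` are hypothetical names for illustration only and do not reflect the real CassandraClusterInfo layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical non-serializable resource (think: a client or connection pool).
class FakeConnection
{
}

class ClusterInfoSketch implements Serializable
{
    private static final long serialVersionUID = 1L;

    // ---- Serializable state: safe to broadcast to executors ----
    private final String clusterId;
    private final String lowestCassandraVersion;

    // ---- Transient state: skipped by serialization, rebuilt lazily ----
    private transient FakeConnection connection;

    ClusterInfoSketch(String clusterId, String lowestCassandraVersion)
    {
        this.clusterId = clusterId;
        this.lowestCassandraVersion = lowestCassandraVersion;
    }

    String clusterId()
    {
        return clusterId;
    }

    String lowestCassandraVersion()
    {
        return lowestCassandraVersion;
    }

    boolean isConnected()
    {
        return connection != null;
    }

    // Lazily (re)creates the transient resource on whichever JVM asks for it.
    synchronized FakeConnection connection()
    {
        if (connection == null)
        {
            connection = new FakeConnection();
        }
        return connection;
    }
}

class SerializationSketch
{
    // Serialize then deserialize, imitating shipping from driver to executor.
    static <T extends Serializable> T roundTrip(T value)
    {
        try
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes))
            {
                out.writeObject(value);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray())))
            {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        }
        catch (IOException | ClassNotFoundException e)
        {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args)
    {
        ClusterInfoSketch driverSide = new ClusterInfoSketch("cluster1", "4.0.0");
        driverSide.connection(); // connect on the "driver"

        ClusterInfoSketch executorSide = roundTrip(driverSide);
        System.out.println(executorSide.clusterId());   // serializable field survives
        System.out.println(executorSide.isConnected()); // transient field was not shipped
    }
}
```

With this layout, adding a non-serializable field without `transient` fails at serialization time with a `NotSerializableException`, which is harder to miss than a stale comment.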

jmckenzie-dev · 2026-05-08T16:27:49Z

+    public BulkWriterContext toBulkWriterContext()
+    {
+        BulkSparkConf conf = getConf();
+        if (conf.isCoordinatedWriteConfigured())


stylistic nit: you could rewrite this as:

return conf.isCoordinatedWriteConfigured() ? new CassandraCoordinatedBulkWriterContext(this) : new CassandraBulkWriterContext(this);

Whether or not you think that's more clear is another story entirely. :)

jmckenzie-dev · 2026-05-08T16:28:44Z

+
+        // Extract only broadcast-safe cluster metadata
+
+        // ClusterInfo has transient fields (CassandraContext, token mappings) that are not serializable


Same as above - how can we make this more explicit near the source of the data and its serializability (and reflect the downstream expectation of that serializability) instead of having that information and expectation only reflected here?

jmckenzie-dev · 2026-05-08T16:29:30Z

+
+        BulkWriterContext context = customConfig.toBulkWriterContext();
+        assertThat(context).isNotNull();
+        // The OSS default would return CassandraBulkWriterContext or CassandraCoordinatedBulkWriterContext,


"The OSS default": this is the OSS project. Is this comment carried over from another context, and does it need refining here?

Reworded this

yifan-c · 2026-05-08T18:35:00Z

        return new CassandraClusterInfoGroup(clusterInfos);
    }

-    @VisibleForTesting // ONLY FOR TESTING


Why remove the annotation? I think it is still only used by test code.

This method is no longer test-only: custom IBroadcastableClusterInfo implementations that reconstruct cluster infos individually and wrap them in a group will override it.

yifan-c

Some nits. The patch LGTM

yifan-c · 2026-05-08T18:41:51Z

+    @Override
+    public BulkWriterConfig toBulkWriterConfigForBroadcasting(JavaSparkContext sparkContext)
+    {
+        CassandraClusterInfoGroup multiCluster = (CassandraClusterInfoGroup) cluster();


nit: call it clusterInfoGroup

yifan-c · 2026-05-12T06:30:19Z

+    @Test
+    void testReconstructJobInfoOnExecutorCanBeOverridden()
+    {
+        JobInfo expectedJobInfo = mock(JobInfo.class);
+        ClusterInfo mockCluster = mock(ClusterInfo.class);
+
+        IBroadcastableClusterInfo customBroadcastable = new IBroadcastableClusterInfo()
+        {
+            @Override
+            public Partitioner getPartitioner()
+            {
+                return Partitioner.Murmur3Partitioner;
+            }
+
+            @Override
+            public String getLowestCassandraVersion()
+            {
+                return "4.0.0";
+            }
+
+            @Nullable
+            @Override
+            public String clusterId()
+            {
+                return null;
+            }
+
+            @NotNull
+            @Override
+            public BulkSparkConf getConf()
+            {
+                return mock(BulkSparkConf.class);
+            }
+
+            @Override
+            public ClusterInfo reconstruct()
+            {
+                return mockCluster;
+            }
+        };
+
+        BulkSparkConf mockConf = mock(BulkSparkConf.class);
+        BroadcastableJobInfo mockJobInfo = mock(BroadcastableJobInfo.class);
+        when(mockJobInfo.getConf()).thenReturn(mockConf);
+        when(mockJobInfo.getRestoreJobIds()).thenReturn(MultiClusterContainer.ofSingle(UUID.randomUUID()));
+        BroadcastableSchemaInfo mockSchemaInfo = mock(BroadcastableSchemaInfo.class);
+
+        BulkWriterConfig config = new BulkWriterConfig(mockConf, 4, mockJobInfo, customBroadcastable, mockSchemaInfo, "4.0.0");
+
+        // Subclass that overrides reconstructJobInfoOnExecutor to return custom JobInfo
+        TestBulkWriterContext context = new TestBulkWriterContext(config)
+        {
+            @Override
+            protected JobInfo reconstructJobInfoOnExecutor(BroadcastableJobInfo jobInfo)
+            {
+                return expectedJobInfo;
+            }
+        };
+
+        assertThat(context.job()).isSameAs(expectedJobInfo);
+    }


customBroadcastable is unnecessary. You can simplify the test as below.

Suggested change

    @Test
    void testReconstructJobInfoOnExecutorCanBeOverridden()
    {
        JobInfo expectedJobInfo = mock(JobInfo.class);
        BulkSparkConf mockConf = mock(BulkSparkConf.class);
        BroadcastableJobInfo mockJobInfo = mock(BroadcastableJobInfo.class);
        BroadcastableSchemaInfo mockSchemaInfo = mock(BroadcastableSchemaInfo.class);

        BulkWriterConfig config = new BulkWriterConfig(mockConf, 4, mockJobInfo, mock(IBroadcastableClusterInfo.class), mockSchemaInfo, "4.0.0");

        // Subclass that overrides reconstructJobInfoOnExecutor to return custom JobInfo
        TestBulkWriterContext context = new TestBulkWriterContext(config)
        {
            @Override
            protected JobInfo reconstructJobInfoOnExecutor(BroadcastableJobInfo jobInfo)
            {
                return expectedJobInfo;
            }
        };

        assertThat(context.job()).isSameAs(expectedJobInfo);
    }

yifan-c · 2026-05-13T06:26:18Z

Committed using the legacy way as the PR is missing the update in changes.txt.

Commit is 652cc17

skoppu22 added 2 commits May 7, 2026 19:28

Enable extensibility for bulk writer broadcast/reconstruction

b037b60

remove import

3a6ca2a

jmckenzie-dev reviewed May 8, 2026

resolve comments

bbf1f42

skoppu22 commented May 8, 2026

Comment thread ...ion-tests/src/test/java/org/apache/cassandra/analytics/BulkReaderMultiDCConsistencyTest.java

add comment

190b988

jmckenzie-dev approved these changes May 8, 2026

yifan-c reviewed May 8, 2026

yifan-c approved these changes May 12, 2026

minor comments

1f8bc55

skoppu22 force-pushed the yif/config-context-extensible branch 3 times, most recently from 8132278 to 50090d7, on May 12, 2026 14:09

fix maven too many requests error

05811d5

skoppu22 force-pushed the yif/config-context-extensible branch from 50090d7 to 05811d5 on May 12, 2026 14:21

remove sleep

7352f66

yifan-c closed this May 13, 2026

