
CASSANALYTICS-36: Bulk Reader should dynamically size the Spark job b… #118

Merged
frankgh merged 7 commits into apache:trunk from frankgh:CASSANALYTICS-36
Jun 16, 2025

Conversation

@frankgh
Contributor

@frankgh commented on Jun 13, 2025

…ased on estimated table size

Patch by Francisco Guerrero; reviewed by TBD for CASSANALYTICS-36

@Test
void testDynamicSizingOption()
{
    Dataset<Row> data = bulkReaderDataFrame(tableForNullStaticColumn).option("SIZING", "dynamic").load();
Contributor

Using tableForNullStaticColumn here seems to imply that this test has something to do with the null static column table, which isn't really the case apart from the row.getString(0) below. Can we switch this to use one of the other tables?

Contributor Author

Yeah, good point. Let me make the change.

@frankgh force-pushed the CASSANALYTICS-36 branch from 2100474 to e7ece5d on June 16, 2025 at 17:48
this.table = table;
this.dc = datacenter;
this.maxPartitionSize = maxPartitionSize;
this.numCores = availableCores;
Contributor

Nit on my own patch: we should rename the field availableCores as well.

/**
* @deprecated replaced by {@link #ABORTED}
*/
@Deprecated
Contributor

Unrelated change; please revert.

dependencies {
shadow(group: 'org.slf4j', name: 'slf4j-api', version: "${project.slf4jApiVersion}")
api(project(':analytics-sidecar-vertx-client'))
implementation(project(':analytics-sidecar-vertx-client'))
Contributor

Unrelated; please revert. If it is a problem, we should fix it in another patch.

Comment on lines +85 to +86
* assume a maximum partition size of 2.5 GB. Also, assume that a consistency
* level of 2. The number of cores is calculated by the following formula:
Contributor

I do not think it is a "consistency level of 2"; it should be 3. minReplicas is the quorum of 3.

Contributor

Correction: I meant to say, "the consistency level is quorum/local_quorum in this example, and RF is 3". That leads to a minReplicas of 2.
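For readers following this thread, a minimal sketch of the replica arithmetic under discussion. The quorum formula is the standard Cassandra one; the class and method names (ReplicaMath, quorum, minReplicas) are illustrative, not identifiers from this patch:

```java
// Illustrative only: the replica math discussed above for RF = 3
// with QUORUM/LOCAL_QUORUM reads.
public class ReplicaMath
{
    // Quorum size for a given replication factor: floor(RF / 2) + 1.
    // For RF = 3 this is 2.
    static int quorum(int replicationFactor)
    {
        return replicationFactor / 2 + 1;
    }

    // Minimum replicas each token range must be read from so that a
    // quorum is still covered: RF - quorum(RF) + 1.
    // For RF = 3 and CL = QUORUM this is 2, matching the comment above.
    static int minReplicas(int replicationFactor)
    {
        return replicationFactor - quorum(replicationFactor) + 1;
    }

    public static void main(String[] args)
    {
        System.out.println(quorum(3));      // 2
        System.out.println(minReplicas(3)); // 2
    }
}
```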

Comment on lines +131 to +133
// normalize table size to the number of instances
LOGGER.debug("{}/{} instances were used to determine the table size {} for table {}.{}",
successCount, instancesSize, tableSizeSum, keyspace, table);
Contributor

How about logging at info level? It is logged only once per job, and the information seems useful for debugging.
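A hedged sketch of what the normalization step in the excerpt might look like. The names tableSizeSum, successCount, and instancesSize come from the log statement above; the scaling itself is an assumption about "normalize table size to the number of instances", not the patch's actual code:

```java
// Illustrative only: extrapolate a cluster-wide table size estimate when
// only successCount out of instancesSize instances reported a size.
public class TableSizeEstimate
{
    static long normalizeTableSize(long tableSizeSum, int successCount, int instancesSize)
    {
        if (successCount == 0)
        {
            // No instance reported a size; the caller would need to
            // fall back to a default sizing strategy.
            return 0L;
        }
        // Scale the observed sum up to the full instance count.
        return tableSizeSum / successCount * instancesSize;
    }

    public static void main(String[] args)
    {
        // 100 bytes observed across 2 of 4 instances -> estimate 200 bytes.
        System.out.println(normalizeTableSize(100L, 2, 4)); // 200
    }
}
```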

@frankgh force-pushed the CASSANALYTICS-36 branch from 7088f9e to 2a98fa0 on June 16, 2025 at 20:27
@frankgh merged commit 1664b2b into apache:trunk on Jun 16, 2025
@frankgh deleted the CASSANALYTICS-36 branch on June 16, 2025 at 21:30