This example serves as a starting point for developers to create batch jobs using the SDK. It provides a basic structure and configuration setup to quickly get started with batch processing tasks.
Create or edit your `.env` file in the project root folder and add your Spark environment URL and API authentication details:

```sh
CSPARK_BASE_URL="https://spark.my-env.coherent.global/my-tenant"
CSPARK_API_KEY="my-api-key"
```

These environment variables are used by the SDK to authenticate and connect to your Spark environment.
Batch inputs and options are specified in `config.py`. Modify this file to adjust your batch processing settings, including:
- Input CSV file location
- Chunk size
- Number of chunks
- Service URI
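As a minimal sketch, a `config.py` covering these settings might look like the following; the variable names and values below are illustrative assumptions, not the actual settings shipped with the example:

```python
# config.py -- illustrative sketch; the real setting names may differ.
from pathlib import Path

# Folder containing the input CSV files to process (assumed name).
INPUTS_DIR = Path("inputs")

# Number of rows sent to the Spark service per request (assumed value).
CHUNK_SIZE = 200

# How many chunks each input file is split into (assumed value).
NUM_CHUNKS = 10

# The Spark service to call, expressed as folder/service (placeholder).
SERVICE_URI = "my-folder/my-service"
```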
Run the complete end-to-end pipeline that includes batch processing and analysis:
```sh
poetry run python main.py
```

This will execute the following steps:
- Prepare Scenarios - Process input CSV files from the `inputs/` folder
- Batch Processing - Send data to Spark service and generate outputs
- Scenario Ranking - Aggregate DB and MB Top-Ups, rank scenarios, identify top 20% winners
- CTE0 Calculation - Calculate average Top-Ups across all scenarios for each year
- CTE80 Calculation - Calculate average Top-Ups across winner scenarios only for each year
Generated Output Files:
- `outputs/*_input.csv` and `*_output.csv` - Batch processing results
- `final/scenarios_ranking.csv` - Scenario rankings with winner designation
- `final/final_cte0.csv` - CTE0 (average across all scenarios)
- `final/final_cte80.csv` - CTE80 (average across winner scenarios)
If you already have batch processing outputs and just want to re-run the analysis:
```sh
poetry run python run_analysis.py
```

This will skip the batch processing (Steps 1-2) and only run:
- Scenario ranking
- CTE0 calculation
- CTE80 calculation
The script reads input CSV files, splits them into chunks, and processes each chunk asynchronously. The batch processing status is displayed in the console.
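The read-split-process flow can be sketched with `asyncio` as below. This is an assumed illustration, not the script's actual code: `process_chunk` is a hypothetical stand-in for the Spark batch call, and the real script's function names will differ.

```python
import asyncio
import csv

def split_into_chunks(rows, chunk_size):
    """Yield successive fixed-size chunks from a list of CSV rows."""
    for i in range(0, len(rows), chunk_size):
        yield rows[i:i + chunk_size]

async def process_chunk(index, chunk):
    """Hypothetical placeholder: a real version would submit the chunk
    to the Spark service and report its batch status to the console."""
    await asyncio.sleep(0)  # stand-in for network I/O
    print(f"chunk {index}: {len(chunk)} rows processed")

async def run_batch(csv_path, chunk_size=200):
    """Read one input CSV, split it, and process all chunks concurrently."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    tasks = [
        asyncio.create_task(process_chunk(i, chunk))
        for i, chunk in enumerate(split_into_chunks(rows, chunk_size))
    ]
    await asyncio.gather(*tasks)
```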
- Aggregates all DB and MB Top-Ups for each scenario across all permutations
- Ranks scenarios from most negative (best) to most positive (worst)
- Identifies the top 20% as "winners"
- Outputs:
final/scenarios_ranking.csv
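The ranking step above could be sketched with pandas as follows. The column names (`scenario`, `db_top_up`, `mb_top_up`) are assumptions about the output schema, not the actual names used by `aggregate.py`:

```python
import pandas as pd

def rank_scenarios(df: pd.DataFrame) -> pd.DataFrame:
    """Sum DB and MB Top-Ups per scenario, rank from most negative (best)
    to most positive (worst), and flag the top 20% as winners.

    Assumes columns 'scenario', 'db_top_up', 'mb_top_up' (hypothetical names).
    """
    totals = (
        df.assign(total_top_up=df["db_top_up"] + df["mb_top_up"])
          .groupby("scenario", as_index=False)["total_top_up"].sum()
          .sort_values("total_top_up")  # most negative first
          .reset_index(drop=True)
    )
    totals["rank"] = totals.index + 1
    cutoff = max(1, int(len(totals) * 0.2))  # at least one winner
    totals["winner"] = totals["rank"] <= cutoff
    return totals
```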
- Calculates the average of DB and MB Top-Ups for each year
- Uses data from ALL scenarios
- Outputs:
final/final_cte0.csv
- Calculates the average of DB and MB Top-Ups for each year
- Uses data from WINNER scenarios only (top 20%)
- Outputs:
final/final_cte80.csv
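Both CTE calculations are the same per-year average over different scenario sets, so they can be sketched as one pandas helper. As above, the column names are assumed, and this is an illustration rather than the actual `aggregate.py` logic:

```python
import pandas as pd

def yearly_average_top_ups(df: pd.DataFrame, winners=None) -> pd.DataFrame:
    """Average DB and MB Top-Ups per year.

    With winners=None this corresponds to CTE0 (all scenarios); passing
    the set of winning scenarios corresponds to CTE80. Column names
    ('scenario', 'year', 'db_top_up', 'mb_top_up') are assumptions.
    """
    if winners is not None:
        df = df[df["scenario"].isin(winners)]
    return df.groupby("year", as_index=False)[["db_top_up", "mb_top_up"]].mean()
```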
The pipeline can be customized by modifying:
- `config.py` - Batch settings and directory paths
- `main.py` - Pipeline orchestration
- `aggregate.py` - Analysis logic and calculations