The dataset files used in the analysis are too large to include directly in the repository.
Please run the following commands inside the VM, under the /home/bfshen/parallel_programming/project-3-brianshen03 directory.
```
pip install --user gdown
python3 -m gdown 1YQN_Yw-i-0XqkDpv-u2_cbDV0W1jroGH -O proj3_data.zip
unzip proj3_data.zip
```

This will populate `/data/similar_sizes/` and `/data/varying_sizes/`.
To run the full benchmark suite on the Peanut cluster, navigate to the benchmark directory and run:
```
sbatch proj3_benchmark.sh
```

This will:
- Execute all three modes (`s`, `parfiles`, `ws`)
- Run experiments on both `data/similar_sizes` and `data/varying_sizes`
- Generate `benchmark/timing_data.csv`
After benchmarks complete, generate the speedup plots from inside the benchmark directory:

```
python3 plot_speedup.py timing_data.csv
```

This produces:
- `speedup_similar_sizes.png`
- `speedup_varying_sizes.png`
To execute a demonstration run on the Peanut cluster and view full program logs, navigate to the benchmark directory and run:

```
sbatch run_demo.sh
```

This will run:
- 3 modes (`s`, `parfiles`, `ws`)
- On both datasets (`similar_sizes`, `varying_sizes`)
- For a total of 6 runs

Output logs are written to `benchmark/slurm/out/`.
From the main directory, you can run the program directly.
```
go run main.go ../data/similar_sizes s
go run main.go ../data/similar_sizes parfiles 4
go run main.go ../data/similar_sizes ws 4
go run main.go ../data/varying_sizes s
go run main.go ../data/varying_sizes parfiles 4
go run main.go ../data/varying_sizes ws 4
```

This project implements a parallel backtesting system for evaluating a grid of quantitative trading strategies on historical limit order book data. The system reads CSV files containing market data (bid/ask prices and sizes), simulates the execution of multiple strategies over each file, and computes performance statistics such as profit and loss (PnL), number of trades, and final inventory.
The core objective of the program is to efficiently process large collections of historical market data files and evaluate many parameterized strategies across them. Because the computation for each file and strategy is independent, the workload presents easy opportunities for parallelization.
Given:
- A directory containing multiple CSV files of historical market data.
- A set of trading strategies defined by parameter grids.
The program must:
- Parse each CSV file into a sequence of market events.
- Simulate each trading strategy over each file.
- Maintain state per strategy (inventory, cash, PnL, trades).
- Aggregate results across all files.
- Rank strategies by average PnL and report the top and bottom performers.
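The per-strategy state described above can be sketched as a small Go type. This is illustrative only: the names (`Event`, `StrategyState`, `applyEvent`) and the toy spread-threshold decision rule are assumptions, not the project's actual identifiers or strategies.

```go
// Sketch of per-strategy bookkeeping: inventory, cash, and trade count
// are updated on every market event; PnL marks inventory to mid.
// All names and the decision rule are hypothetical.
package main

import "fmt"

// Event is one row of the limit order book CSV: best bid and ask.
type Event struct {
	Bid, Ask float64
}

// StrategyState tracks one parameterized strategy's running state.
type StrategyState struct {
	Inventory int
	Cash      float64
	Trades    int
}

// applyEvent is a toy rule: buy one unit when the spread is tight,
// sell one unit when inventory is positive and the spread is wide.
func (s *StrategyState) applyEvent(ev Event, maxSpread float64) {
	spread := ev.Ask - ev.Bid
	switch {
	case spread <= maxSpread:
		s.Inventory++
		s.Cash -= ev.Ask // buy at the ask
		s.Trades++
	case s.Inventory > 0:
		s.Inventory--
		s.Cash += ev.Bid // sell at the bid
		s.Trades++
	}
}

// PnL marks any remaining inventory to the mid price of the last event.
func (s *StrategyState) PnL(last Event) float64 {
	mid := (last.Bid + last.Ask) / 2
	return s.Cash + float64(s.Inventory)*mid
}

func main() {
	events := []Event{{Bid: 99.9, Ask: 100.0}, {Bid: 99.0, Ask: 100.0}}
	var s StrategyState
	for _, ev := range events {
		s.applyEvent(ev, 0.5)
	}
	// Buys at 100.0, sells at 99.0: pnl = -1.00
	fmt.Printf("inventory=%d trades=%d pnl=%.2f\n", s.Inventory, s.Trades, s.PnL(events[1]))
}
```

In the real program this state is kept per strategy per file, which is what makes file-level tasks fully independent.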
Each file represents a portion of a trading day, and strategies are evaluated independently on each file. The workload size depends on:
- The number of files.
- The size (number of rows) of each file.
- The number of strategies in the parameter grid.
When file sizes vary significantly (as in varying_sizes), some tasks take much longer than others, introducing load imbalance that makes the system a good candidate for studying parallel scheduling and work stealing.
Input:
- A directory containing CSV files of historical bid/ask market data.
- A selected execution mode: `s` (sequential), `parfiles` (parallel across files), or `ws` (work stealing).
- Optional thread count for parallel modes.
Output:
- Per-file strategy results (PnL, trades, inventory).
- Overall aggregated strategy rankings.
- Total execution time of the program (used for performance analysis).
The first parallel implementation distributes work statically across files. Each worker thread is assigned a subset of CSV files, and processes all strategies for those files independently. After all threads complete their assigned files, results are aggregated.
This approach follows a Bulk Synchronous Parallel (BSP) pattern:
- Superstep 1: Threads process assigned files independently.
- Barrier: All threads synchronize.
- Superstep 2: Aggregation of results.
This model is appropriate because:
- Files are independent units of work.
- There are no cross-thread dependencies during computation.
- Synchronization is only required after local results are complete.
However, static assignment can lead to load imbalance if file sizes differ significantly. Threads assigned larger files may become stragglers, limiting overall speedup.
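The static BSP structure can be sketched as follows. This is a minimal stand-in, not the project's code: "files" are faked as row counts and "processing" as summing them, but the shape (per-worker local results, a `WaitGroup` barrier, then sequential aggregation) matches the superstep pattern described above.

```go
// Minimal BSP sketch: static round-robin file assignment, a barrier,
// then sequential aggregation. File processing is faked as summing
// row counts; the real code runs every strategy over each file.
package main

import (
	"fmt"
	"sync"
)

// runStatic assigns files round-robin: worker id takes files
// id, id+numThreads, id+2*numThreads, ... Each worker writes only its
// own slot of partial, so the compute superstep needs no locking.
func runStatic(fileRows []int, numThreads int) int {
	partial := make([]int, numThreads)
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := id; i < len(fileRows); i += numThreads {
				partial[id] += fileRows[i] // "process" the file
			}
		}(t)
	}
	wg.Wait() // barrier: all local results complete

	// Superstep 2: aggregate local results sequentially.
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	rows := []int{100, 200, 300, 400, 500, 600} // stand-in CSV row counts
	fmt.Println("total rows processed:", runStatic(rows, 3)) // 2100
}
```

If one entry in `fileRows` were much larger than the rest, the worker holding it would finish last while the others sit idle at the barrier, which is exactly the straggler problem noted above.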
The second parallel implementation uses a lock-free task queue with work stealing. Instead of statically assigning files to threads:
- Each file is treated as a task.
- Worker threads maintain local deques.
- When a thread finishes its local tasks, it attempts to steal work from other threads.
This dynamic scheduling improves load balancing, especially when file sizes vary. Threads that finish early do not remain idle; they help complete remaining work.
This approach is particularly appropriate for the varying_sizes dataset, where task durations are unpredictable. Work stealing improves throughput by reducing idle time and minimizing straggler effects.
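The scheduling loop can be sketched as below. This is a simplified stand-in: a mutex per deque replaces the project's lock-free deque, and idle workers spin rather than park, but the structure (pop from the own deque's tail, steal from a victim's head, terminate when no tasks remain) is the point.

```go
// Simplified work-stealing sketch. Each worker owns a deque of file
// tasks, pops from its own tail (LIFO), and steals from another
// worker's head (FIFO) when it runs dry. Mutex-protected deques stand
// in for the lock-free implementation described in the text.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type deque struct {
	mu    sync.Mutex
	tasks []int // each task is one file's row count
}

func (d *deque) popTail() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	n := len(d.tasks)
	if n == 0 {
		return 0, false
	}
	t := d.tasks[n-1]
	d.tasks = d.tasks[:n-1]
	return t, true
}

func (d *deque) stealHead() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return 0, false
	}
	t := d.tasks[0]
	d.tasks = d.tasks[1:]
	return t, true
}

// runStealing distributes tasks round-robin, then lets workers pop
// locally and steal when idle. Workers spin until every task has been
// processed; a real scheduler would back off or park instead.
func runStealing(fileRows []int, numThreads int) int64 {
	deques := make([]*deque, numThreads)
	for i := range deques {
		deques[i] = &deque{}
	}
	for i, rows := range fileRows {
		d := deques[i%numThreads]
		d.tasks = append(d.tasks, rows)
	}

	remaining := int64(len(fileRows))
	var processed int64
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for atomic.LoadInt64(&remaining) > 0 {
				rows, ok := deques[id].popTail()
				for v := 0; v < numThreads && !ok; v++ {
					if v != id { // local deque empty: try a victim
						rows, ok = deques[v].stealHead()
					}
				}
				if ok {
					atomic.AddInt64(&processed, int64(rows)) // "process" file
					atomic.AddInt64(&remaining, -1)
				}
			}
		}(t)
	}
	wg.Wait()
	return processed
}

func main() {
	rows := []int{5000, 10, 10, 10, 10, 10} // one big file: imbalance
	fmt.Println("rows processed:", runStealing(rows, 3)) // 5050
}
```

With the skewed input above, whichever worker draws the 5000-row file keeps it, while the others drain their own deques and then steal the stragglers, rather than idling as in the static scheme.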
- The workload is embarrassingly parallel at the file level.
- Tasks are independent and require no communication during execution.
- BSP provides a simple and low-overhead baseline.
- Work stealing addresses load imbalance introduced by uneven task sizes.
One of the main challenges was designing a parallel structure that maintained correctness while avoiding excessive synchronization. Aggregating shared results required careful coordination to prevent race conditions, especially in the work-stealing implementation. Another difficulty was handling load imbalance when file sizes differed significantly, which exposed the limitations of static task assignment. Through this assignment, I aimed to better understand how scheduling strategies affect scalability, and how synchronization and task granularity influence real-world speedup.
The work-stealing task queue improved performance on the varying_sizes dataset, where file sizes ranged from 500k to 5M lines and per-file processing times therefore differed substantially. In the static parallel implementation, threads assigned larger files became bottlenecks, limiting speedup. Work stealing reduced idle time by allowing threads to dynamically redistribute remaining tasks, improving load balance and overall throughput. On the similar_sizes dataset, the performance difference between the two implementations was negligible because the workload was already well balanced.
In the sequential program, the primary hotspot is the nested loop that processes every event in every file for every strategy. This dominates total runtime, as each CSV file may contain millions of rows and each strategy evaluates every event. This computation is embarrassingly parallel at the file level and was successfully parallelized in both parfiles and ws.
The main bottlenecks are:
- CSV file reading and parsing (I/O bound and inherently sequential per file).
- Final aggregation and sorting of strategy summaries.
While file-level computation was parallelized, CSV parsing within a single file remains sequential. Therefore, even in the parallel versions, some sequential work limits overall scalability.
Sequential time: 26.10s
At 12 threads:
- `parfiles`: 2.89s → speedup ≈ 9.0×
- `ws`: 2.89s → speedup ≈ 9.0×
Speedup scales well up to 6 threads, but flattens between 6–8 threads before improving again at 12. This indicates:
- Diminishing returns from limited parallel work per file.
- Overhead from synchronization and scheduling.
- Remaining sequential components (I/O + aggregation).
Because file sizes are similar, static load distribution is already balanced. Therefore, work stealing provides minimal additional benefit.
Sequential time: 41.35s
At 12 threads:
- `parfiles`: 12.64s → speedup ≈ 3.3×
- `ws`: 12.85s → speedup ≈ 3.2×
At 2 threads:
- `parfiles`: 31.16s → speedup ≈ 1.33×
- `ws`: 20.62s → speedup ≈ 2.0×
Here, speedup is significantly lower overall due to load imbalance and sequential I/O within large files. The static parfiles implementation suffers when large files are assigned unevenly. Work stealing improves performance at low thread counts by redistributing long-running tasks, but overall gains are limited by:
- Sequential parsing inside large files.
- Synchronization and task queue overhead.
- Amdahl’s Law constraints.
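The Amdahl's Law constraint can be made concrete with a back-of-envelope calculation from the measured numbers: given speedup S on p threads, the implied serial fraction f solves S = 1 / (f + (1 - f)/p). Note that this f lumps true serial work (parsing, aggregation) together with load imbalance, so it is an upper bound on the genuinely sequential portion.

```go
// Back out the implied serial fraction from measured speedups.
// From S = 1/(f + (1-f)/p): f = (1/S - 1/p) / (1 - 1/p).
package main

import "fmt"

func serialFraction(speedup, p float64) float64 {
	return (1/speedup - 1/p) / (1 - 1/p)
}

func main() {
	// Measured 12-thread times from the benchmarks above.
	fmt.Printf("similar_sizes: f ≈ %.3f\n", serialFraction(26.10/2.89, 12))
	fmt.Printf("varying_sizes: f ≈ %.3f\n", serialFraction(41.35/12.64, 12))
}
```

This yields roughly f ≈ 0.03 for similar_sizes versus f ≈ 0.24 for varying_sizes, consistent with the claim that large-file parsing and imbalance, not scheduling overhead alone, cap the speedup on the uneven dataset.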
On similar_sizes, both implementations achieve nearly identical speedups because the workload is naturally balanced. Static partitioning is sufficient, and work stealing adds little benefit.
On varying_sizes, work stealing improves load balance, particularly at lower thread counts (e.g., 2 threads: 2.0× vs 1.33× speedup). However, as thread count increases, the difference narrows because:
- The largest files dominate total runtime.
- Parsing remains sequential within each file.
- Overhead from task management reduces marginal gains.
Overall:
- `parfiles` performs well when tasks are uniform.
- `ws` is more robust under irregular workloads.
- Neither implementation achieves linear speedup due to unavoidable sequential components and synchronization overhead.

