The dataset files used in the analysis are too large to include directly in the repository.
Please run the following commands inside the VM, under the /home/bfshen/parallel_programming/project-3-brianshen03 directory.
```
pip install --user gdown
python3 -m gdown 1YQN_Yw-i-0XqkDpv-u2_cbDV0W1jroGH -O proj3_data.zip
unzip proj3_data.zip
```

This will populate `/data/similar_sizes/` and `/data/varying_sizes/`.
To run the full benchmark suite on the Peanut cluster, navigate to the benchmark directory and run:
```
sbatch proj3_benchmark.sh
```

This will:
- Execute all three modes (`s`, `parfiles`, `ws`)
- Run experiments on both `data/similar_sizes` and `data/varying_sizes`
- Generate `benchmark/timing_data.csv`
After benchmarks complete, generate the speedup plots from inside the benchmark directory:

```
python3 plot_speedup.py timing_data.csv
```

This produces:
- `speedup_similar_sizes.png`
- `speedup_varying_sizes.png`
To execute a demonstration run on the Peanut cluster and view full program logs, navigate to the benchmark directory and run:

```
sbatch run_demo.sh
```

This will run:
- 3 modes (`s`, `parfiles`, `ws`)
- On both datasets (`similar_sizes`, `varying_sizes`)
- For a total of 6 runs

Output logs are written to `benchmark/slurm/out/`.
From the main directory, you can run the program directly.
```
go run main.go ../data/similar_sizes s
go run main.go ../data/similar_sizes parfiles 4
go run main.go ../data/similar_sizes ws 4
go run main.go ../data/varying_sizes s
go run main.go ../data/varying_sizes parfiles 4
go run main.go ../data/varying_sizes ws 4
```

This project implements a parallel backtesting system for evaluating a grid of quantitative trading strategies on historical limit order book data. The system reads CSV files containing market data (bid/ask prices and sizes), simulates the execution of multiple strategies over each file, and computes performance statistics such as profit and loss (PnL), number of trades, and final inventory.
The core objective of the program is to efficiently process large collections of historical market data files and evaluate many parameterized strategies across them. Because the computation for each file and strategy is independent, the workload presents easy opportunities for parallelization.
Given:
- A directory containing multiple CSV files of historical market data.
- A set of trading strategies defined by parameter grids.
The program must:
- Parse each CSV file into a sequence of market events.
- Simulate each trading strategy over each file.
- Maintain state per strategy (inventory, cash, PnL, trades).
- Aggregate results across all files.
- Rank strategies by average PnL and report the top and bottom performers.
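The per-strategy state described above can be sketched as a small Go type. This is illustrative only: the names (`Event`, `StrategyState`, `applyEvent`) and the toy spread-threshold decision rule are assumptions, not the project's actual identifiers or strategies.

```go
// Sketch of per-strategy bookkeeping: inventory, cash, and trade count
// are updated on every market event; PnL marks inventory to mid.
// All names and the decision rule are hypothetical.
package main

import "fmt"

// Event is one row of the limit order book CSV: best bid and ask.
type Event struct {
	Bid, Ask float64
}

// StrategyState tracks one parameterized strategy's running state.
type StrategyState struct {
	Inventory int
	Cash      float64
	Trades    int
}

// applyEvent is a toy rule: buy one unit when the spread is tight,
// sell one unit when inventory is positive and the spread is wide.
func (s *StrategyState) applyEvent(ev Event, maxSpread float64) {
	spread := ev.Ask - ev.Bid
	switch {
	case spread <= maxSpread:
		s.Inventory++
		s.Cash -= ev.Ask // buy at the ask
		s.Trades++
	case s.Inventory > 0:
		s.Inventory--
		s.Cash += ev.Bid // sell at the bid
		s.Trades++
	}
}

// PnL marks any remaining inventory to the mid price of the last event.
func (s *StrategyState) PnL(last Event) float64 {
	mid := (last.Bid + last.Ask) / 2
	return s.Cash + float64(s.Inventory)*mid
}

func main() {
	events := []Event{{Bid: 99.9, Ask: 100.0}, {Bid: 99.0, Ask: 100.0}}
	var s StrategyState
	for _, ev := range events {
		s.applyEvent(ev, 0.5)
	}
	// Buys at 100.0, sells at 99.0: pnl = -1.00
	fmt.Printf("inventory=%d trades=%d pnl=%.2f\n", s.Inventory, s.Trades, s.PnL(events[1]))
}
```

In the real program this state is kept per strategy per file, which is what makes file-level tasks fully independent.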
Each file represents a portion of a trading day, and strategies are evaluated independently on each file. The workload size depends on:
- The number of files.
- The size (number of rows) of each file.
- The number of strategies in the parameter grid.
When file sizes vary significantly (as in varying_sizes), some tasks take much longer than others, introducing load imbalance that makes the system a good candidate for studying parallel scheduling and work stealing.
Input:
- A directory containing CSV files of historical bid/ask market data.
- A selected execution mode: `s` (sequential), `parfiles` (parallel across files), or `ws` (work stealing).
- Optional thread count for parallel modes.
Output:
- Per-file strategy results (PnL, trades, inventory).
- Overall aggregated strategy rankings.
- Total execution time of the program (used for performance analysis).
The first parallel implementation distributes work statically across files. Each worker thread is assigned a subset of CSV files, and processes all strategies for those files independently. After all threads complete their assigned files, results are aggregated.
This approach follows a Bulk Synchronous Parallel (BSP) pattern:
- Superstep 1: Threads process assigned files independently.
- Barrier: All threads synchronize.
- Superstep 2: Aggregation of results.
This model is appropriate because:
- Files are independent units of work.
- There are no cross-thread dependencies during computation.
- Synchronization is only required after local results are complete.
However, static assignment can lead to load imbalance if file sizes differ significantly. Threads assigned larger files may become stragglers, limiting overall speedup.
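The static BSP structure can be sketched as follows. This is a minimal stand-in, not the project's code: "files" are faked as row counts and "processing" as summing them, but the shape (per-worker local results, a `WaitGroup` barrier, then sequential aggregation) matches the superstep pattern described above.

```go
// Minimal BSP sketch: static round-robin file assignment, a barrier,
// then sequential aggregation. File processing is faked as summing
// row counts; the real code runs every strategy over each file.
package main

import (
	"fmt"
	"sync"
)

// runStatic assigns files round-robin: worker id takes files
// id, id+numThreads, id+2*numThreads, ... Each worker writes only its
// own slot of partial, so the compute superstep needs no locking.
func runStatic(fileRows []int, numThreads int) int {
	partial := make([]int, numThreads)
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := id; i < len(fileRows); i += numThreads {
				partial[id] += fileRows[i] // "process" the file
			}
		}(t)
	}
	wg.Wait() // barrier: all local results complete

	// Superstep 2: aggregate local results sequentially.
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	rows := []int{100, 200, 300, 400, 500, 600} // stand-in CSV row counts
	fmt.Println("total rows processed:", runStatic(rows, 3)) // 2100
}
```

If one entry in `fileRows` were much larger than the rest, the worker holding it would finish last while the others sit idle at the barrier, which is exactly the straggler problem noted above.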
The second parallel implementation uses a lock-free task queue with work stealing. Instead of statically assigning files to threads:
- Each file is treated as a task.
- Worker threads maintain local deques.
- When a thread finishes its local tasks, it attempts to steal work from other threads.
This dynamic scheduling improves load balancing, especially when file sizes vary. Threads that finish early do not remain idle; they help complete remaining work.
This approach is particularly appropriate for the varying_sizes dataset, where task durations are unpredictable. Work stealing improves throughput by reducing idle time and minimizing straggler effects.
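The scheduling loop can be sketched as below. This is a simplified stand-in: a mutex per deque replaces the project's lock-free deque, and idle workers spin rather than park, but the structure (pop from the own deque's tail, steal from a victim's head, terminate when no tasks remain) is the point.

```go
// Simplified work-stealing sketch. Each worker owns a deque of file
// tasks, pops from its own tail (LIFO), and steals from another
// worker's head (FIFO) when it runs dry. Mutex-protected deques stand
// in for the lock-free implementation described in the text.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type deque struct {
	mu    sync.Mutex
	tasks []int // each task is one file's row count
}

func (d *deque) popTail() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	n := len(d.tasks)
	if n == 0 {
		return 0, false
	}
	t := d.tasks[n-1]
	d.tasks = d.tasks[:n-1]
	return t, true
}

func (d *deque) stealHead() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return 0, false
	}
	t := d.tasks[0]
	d.tasks = d.tasks[1:]
	return t, true
}

// runStealing distributes tasks round-robin, then lets workers pop
// locally and steal when idle. Workers spin until every task has been
// processed; a real scheduler would back off or park instead.
func runStealing(fileRows []int, numThreads int) int64 {
	deques := make([]*deque, numThreads)
	for i := range deques {
		deques[i] = &deque{}
	}
	for i, rows := range fileRows {
		d := deques[i%numThreads]
		d.tasks = append(d.tasks, rows)
	}

	remaining := int64(len(fileRows))
	var processed int64
	var wg sync.WaitGroup
	for t := 0; t < numThreads; t++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for atomic.LoadInt64(&remaining) > 0 {
				rows, ok := deques[id].popTail()
				for v := 0; v < numThreads && !ok; v++ {
					if v != id { // local deque empty: try a victim
						rows, ok = deques[v].stealHead()
					}
				}
				if ok {
					atomic.AddInt64(&processed, int64(rows)) // "process" file
					atomic.AddInt64(&remaining, -1)
				}
			}
		}(t)
	}
	wg.Wait()
	return processed
}

func main() {
	rows := []int{5000, 10, 10, 10, 10, 10} // one big file: imbalance
	fmt.Println("rows processed:", runStealing(rows, 3)) // 5050
}
```

With the skewed input above, whichever worker draws the 5000-row file keeps it, while the others drain their own deques and then steal the stragglers, rather than idling as in the static scheme.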
- The workload is embarrassingly parallel at the file level.
- Tasks are independent and require no communication during execution.
- BSP provides a simple and low-overhead baseline.
- Work stealing addresses load imbalance introduced by uneven task sizes.
One of the main challenges was designing a parallel structure that maintained correctness while avoiding excessive synchronization. Aggregating shared results required careful coordination to prevent race conditions, especially in the work-stealing implementation. Another difficulty was handling load imbalance when file sizes differed significantly, which exposed the limitations of static task assignment. Through this assignment, I aimed to better understand how scheduling strategies affect scalability, and how synchronization and task granularity influence real-world speedup.
The work-stealing task queue improved performance on the varying_sizes dataset, where file sizes ranged from 500k to 5M lines and per-file processing times therefore differed substantially. In the static parallel implementation, threads assigned larger files became bottlenecks, limiting speedup. Work stealing reduced idle time by allowing threads to dynamically redistribute remaining tasks, improving load balance and overall throughput. On the similar_sizes dataset, the performance difference between the two implementations was negligible because the workload was already well balanced.
In the sequential program, the primary hotspot is the nested loop that processes every event in every file for every strategy. This dominates total runtime, as each CSV file may contain millions of rows and each strategy evaluates every event. This computation is embarrassingly parallel at the file level and was successfully parallelized in both parfiles and ws.
The main bottlenecks are:
- CSV file reading and parsing (I/O bound and inherently sequential per file).
- Final aggregation and sorting of strategy summaries.
While file-level computation was parallelized, CSV parsing within a single file remains sequential. Therefore, even in the parallel versions, some sequential work limits overall scalability.
Sequential time: 26.10s
At 12 threads:
- `parfiles`: 2.89s → speedup ≈ 9.0×
- `ws`: 2.89s → speedup ≈ 9.0×
Speedup scales well up to 6 threads, but flattens between 6–8 threads before improving again at 12. This indicates:
- Diminishing returns from limited parallel work per file.
- Overhead from synchronization and scheduling.
- Remaining sequential components (I/O + aggregation).
Because file sizes are similar, static load distribution is already balanced. Therefore, work stealing provides minimal additional benefit.
Sequential time: 41.35s
At 12 threads:
- `parfiles`: 12.64s → speedup ≈ 3.3×
- `ws`: 12.85s → speedup ≈ 3.2×
At 2 threads:
- `parfiles`: 31.16s → speedup ≈ 1.33×
- `ws`: 20.62s → speedup ≈ 2.0×
Here, speedup is significantly lower overall due to load imbalance and sequential I/O within large files. The static parfiles implementation suffers when large files are assigned unevenly. Work stealing improves performance at low thread counts by redistributing long-running tasks, but overall gains are limited by:
- Sequential parsing inside large files.
- Synchronization and task queue overhead.
- Amdahl’s Law constraints.
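The Amdahl's Law constraint can be made concrete with a back-of-envelope calculation from the measured numbers: given speedup S on p threads, the implied serial fraction f solves S = 1 / (f + (1 - f)/p). Note that this f lumps true serial work (parsing, aggregation) together with load imbalance, so it is an upper bound on the genuinely sequential portion.

```go
// Back out the implied serial fraction from measured speedups.
// From S = 1/(f + (1-f)/p): f = (1/S - 1/p) / (1 - 1/p).
package main

import "fmt"

func serialFraction(speedup, p float64) float64 {
	return (1/speedup - 1/p) / (1 - 1/p)
}

func main() {
	// Measured 12-thread times from the benchmarks above.
	fmt.Printf("similar_sizes: f ≈ %.3f\n", serialFraction(26.10/2.89, 12))
	fmt.Printf("varying_sizes: f ≈ %.3f\n", serialFraction(41.35/12.64, 12))
}
```

This yields roughly f ≈ 0.03 for similar_sizes versus f ≈ 0.24 for varying_sizes, consistent with the claim that large-file parsing and imbalance, not scheduling overhead alone, cap the speedup on the uneven dataset.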
On similar_sizes, both implementations achieve nearly identical speedups because the workload is naturally balanced. Static partitioning is sufficient, and work stealing adds little benefit.
On varying_sizes, work stealing improves load balance, particularly at lower thread counts (e.g., 2 threads: 2.0× vs 1.33× speedup). However, as thread count increases, the difference narrows because:
- The largest files dominate total runtime.
- Parsing remains sequential within each file.
- Overhead from task management reduces marginal gains.
Overall:
- `parfiles` performs well when tasks are uniform.
- `ws` is more robust under irregular workloads.
- Neither implementation achieves linear speedup due to unavoidable sequential components and synchronization overhead.

