Quantum Computation: Implementation of Quantum Gates on Linux Clusters

Dissertation source and accompanying quantum-circuit simulators.

Bachelor's thesis, Istanbul Bilgi University. Originally submitted 2004 under the supervision of Bülent Özel; revised 2026 to modernise the build, rewrite the algorithm chapters (QFT, Grover, Shor), and add three new reference implementations of the library API described in the thesis.

qucomp.pdf — the revised dissertation (2026).
qucomp-original.pdf — the original submission (2004), preserved verbatim.
source code/ — LaTeX sources for the dissertation.
implementation/ — four sibling implementations of the simulator API, each parallelised through a different model. See the comparison table below.

The thesis is a textbook-style tour: §1–§7 cover the physics, the mathematical preliminaries, models of computation, and the quantum algorithms (Deutsch–Jozsa, QFT, Grover, Shor). §8 motivates a distribution strategy for state-vector simulators. §9–§11 develop the three landmark algorithms in detail. §12 specifies a small library API (qreg, gates, measurement, QFT, Grover, Shor). The implementation/ directory turns §8 and §12 into runnable code.

Implementations

Directory	Language	Parallelism	Status	Tests	Max qubits
`implementation/original/`	C99 + LAM/OpenMPI	MPI ranks, dense $2^n\times 2^n$ operator	2004 artifact, preserved	`make check` (NP=4)	3 (hard-coded demo)
`implementation/c/`	C11 + OpenMPI 5.x	MPI ranks, in-place sparse gates	feature-complete	75 cases × NP=1,2,4 (+NP=8 for `test-large`)	60
`implementation/go/`	Go 1.21	goroutines + `sync.WaitGroup`	feature-complete	70 cases, race-clean	26
`implementation/python/`	Python 3.12+ + PyTorch	tensor ops on CUDA / ROCm / MPS / CPU	feature-complete	414 default + 1 gated Shor-21	memory-bound; Shor-15/21 tested

All three new implementations share the same public surface — a qreg register type, the named gates (H, X, Y, Z, S, T, phase, Rx/Ry/Rz), controlled and multi-controlled variants, the QFT and its inverse, Grover with a user-supplied phase oracle, and end-to-end Shor factoring with continued-fraction post-processing. The differences are in how the work is parallelised, not what the library does.

`implementation/original/` — 2004 dense-matrix MPI

The original code that accompanied the 2004 dissertation, modernised only at the build wrapper (LAM-MPI is end-of-life; the source now compiles under OpenMPI 5.x). The bundled main() is a single hard-coded demo: apply $H\otimes H\otimes H$ to $|010\rangle$ and print the resulting amplitudes from each worker rank.

This implementation is a proof of concept for the chosen distribution strategy, not a working library. It builds the operator as a dense $2^n\times 2^n$ matrix on rank 0, broadcasts the input state to every rank, and partitions operator rows across workers. There is no result gather, no chaining of gates, no measurement, and no algorithm. The master rank never computes; the first $8/N$ output rows are silently dropped. Non-power-of-two rank counts deadlock.

It is preserved exactly because §8 of the revised thesis critiques this strategy on its own merits — $2^{2n+4}$ bytes of operator memory makes it unreachable at any cluster size beyond a handful of qubits — and the sparse-gate alternative in /c, /go, and /python is the response.

`implementation/c/` — sparse-gate MPI

The reference implementation of §8's chosen strategy. State is a flat amplitude array distributed across MPI ranks by the top $\log_2(\text{NP})$ qubits; gates that touch a global qubit perform a pairwise MPI_Sendrecv against the partner rank; modular exponentiation uses MPI_Alltoallv to bucket the permutation.

Same binary scales from mpirun -n 1 to a multi-node cluster without source changes. Tested at NP = 1, 2, 4 on every push and NP = 1, 2, 4, 8 in the test-large job. Shor-15 factors reliably within 8 attempts; the gated Shor-21 test exercises the 16-qubit period-finding circuit at NP up to 8.

`implementation/go/` — goroutine parallelism

The same library API expressed in idiomatic Go. State is shared memory; the parallel dispatcher (parallelOverPairs) spawns per-call goroutines and joins them with a sync.WaitGroup. No persistent worker pool and no Destroy() — the GC reclaims the register when it goes out of scope.

Programmer-error panics use Go panic with qubit: ... prefix; construction errors return error. The package is race-clean under go test -race. Cap at 26 qubits (1 GiB amplitude slice, 2 GiB modular-exp peak); the difference vs /c's 60 is an allocation bound, not a shift-overflow bound.

`implementation/python/` — PyTorch tensor ops

The same library API, written as pure-Python on top of PyTorch. State is a flat (2**n,) complex tensor; gates reshape into the (2,)*n view and tensordot / einsum the unitary against the target axis — no per-amplitude loops. The "parallel dispatch" is PyTorch's kernel scheduler on whatever device is available: NVIDIA CUDA, AMD ROCm, Apple Metal (MPS), or CPU fallback.

Both function-style (apply_h(q, 0)) and method-style (q.apply_h(0)) call shapes are first-class. The CLI is exposed as qubit-demo via [project.scripts]. Tested under ruff check, mypy --strict, and pytest on CPU + MPS locally; CPU-only on Linux CI.

Quickstart

Each implementation has its own quickstart in its README. The shortest paths:

# Sparse-gate MPI (requires OpenMPI 5.x)
cd implementation/c
make && make test
mpirun -n 4 build/bin/qubit --algo qft

# Go (requires Go 1.21)
cd implementation/go
make test
make demo ALGO=grover

# Python (requires uv)
cd implementation/python
make sync && make check
uv run qubit-demo --algo shor

# 2004 artifact (requires OpenMPI; the bundled binary is 2004 i386 ELF)
cd implementation/original
make && make run NP=4

Cross-implementation parity

Each implementation has an assessment.md that maps the thesis claims (§8 strategy, §9 QFT, §10 Grover, §11 Shor, §12 API) to its own source files. The same algorithm-level guarantees hold across all three new implementations:

Claim	`/c`	`/go`	`/python`
In-place single-qubit gate, $O(2^n)$	`matrix.c::apply_u_local`	`qubit/gates_single.go::ApplyU`	`qubit/gates_single.py::apply_u`
Controlled gates via 4×4 block-diagonal	`matrix.c::apply_cu`	`qubit/gates_controlled.go::ApplyCU`	`qubit/gates_controlled.py::apply_cu`
Modular exp as in-place permutation	`shor.c::apply_modular_exp` (+ `MPI_Alltoallv`)	`qubit/shor.go::ApplyModularExp`	`qubit/shor.py::apply_modular_exp` (`torch.gather`)
QFT with bit-reversal swaps	`qft.c::apply_qft`	`qubit/qft.go::ApplyQFT`	`qubit/qft.py::apply_qft`
QFT inverse (true inverse, not 3× forward)	`qft.c::apply_qft_inverse`	`qubit/qft.go::ApplyQFTInverse`	`qubit/qft.py::apply_qft_inverse`
Continued-fraction recovery	`standart.c::continued_fraction`	`qubit/standart.go::ContinuedFraction`	`qubit/standart.py::continued_fraction`
Factor $N=15$ reliably	`tests/test_shor.c::test_shor_factor_15`	`qubit/shor_test.go::TestShorFactor15`	`tests/test_shor.py::test_shor_factor_15_seeded`
Shor-21 period of 2 mod 21 (gated)	`tests/test_shor.c` (gated by `RUN_SHOR_21=1`)	`qubit/shor_test.go::TestShorPeriodA2Mod21`	`tests/test_shor.py::test_shor_period_a2_mod21_gated`

The Python standart.py and Go standart.go deliberately mirror /c's misspelling of "standard" — visual cue that the three files are siblings.

Why three new implementations?

Each parallelism model exposes a different trade-off:

MPI (/c) is the natural fit when you actually have a cluster and want a single source path that scales from a laptop to a multi-node job. Communication is explicit; the partition is visible in the code. Cost: every gate that touches a global qubit incurs a round trip.
Goroutines (/go) is the natural fit on a single shared-memory machine — no marshalling, no rank coordination, no Destroy(). Cost: no path to multiple machines without re-introducing MPI's problem space.
PyTorch (/python) is the natural fit when you want GPU offload without writing CUDA. The simulator never sees the device; PyTorch dispatches. Cost: dtype constraints on MPS, and the per-tensor allocation pattern is wasteful at very large $n$.

The library API is the same in all three. Reading any one of the assessment.md files alongside qucomp.pdf should make the mapping obvious.

Out of scope

Across all three new implementations:

Density matrices and mixed states.
Noise models.
Tensor-network or stabilizer-formalism shortcuts.
Distributed execution across machines for /go and /python (would re-introduce MPI's problem space; /c covers it).
Prime-power detection in shor_factor (classical pre-check that belongs in the caller).

Layout

implementation/
  original/    # 2004 dense-matrix MPI artifact
  c/           # sparse-gate MPI, modern C11
  go/          # goroutines, Go 1.21
  python/      # PyTorch tensor ops, Python 3.12+
source code/   # LaTeX sources for qucomp.pdf
docs/
  superpowers/
    specs/     # per-implementation design documents
    plans/     # phase-by-phase implementation plans
.github/
  workflows/
    ci.yml     # one workflow covers all four implementations
qucomp.pdf            # revised dissertation (2026)
qucomp-original.pdf   # 2004 submission, preserved

Licence and citation

The dissertation text and the 2004 implementation are reproduced with permission of the author. The 2026 implementations (implementation/c, implementation/go, implementation/python) are released alongside the revised thesis. Cite as:

Karaduman, A. Quantum Computation: Implementation of Quantum Gates on Linux Clusters. Bachelor's dissertation, Istanbul Bilgi University. Originally submitted 2004; revised 2026.

Name		Name	Last commit message	Last commit date
Latest commit History 128 Commits
.github/workflows		.github/workflows
docs/superpowers		docs/superpowers
implementation		implementation
source code		source code
.gitignore		.gitignore
README.md		README.md
build.sh		build.sh
qubit.tar.bz2		qubit.tar.bz2
qucomp-original.pdf		qucomp-original.pdf
qucomp.pdf		qucomp.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Quantum Computation: Implementation of Quantum Gates on Linux Clusters

Implementations

`implementation/original/` — 2004 dense-matrix MPI

`implementation/c/` — sparse-gate MPI

`implementation/go/` — goroutine parallelism

`implementation/python/` — PyTorch tensor ops

Quickstart

Cross-implementation parity

Why three new implementations?

Out of scope

Layout

Licence and citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Quantum Computation: Implementation of Quantum Gates on Linux Clusters

Implementations

implementation/original/ — 2004 dense-matrix MPI

implementation/c/ — sparse-gate MPI

implementation/go/ — goroutine parallelism

implementation/python/ — PyTorch tensor ops

Quickstart

Cross-implementation parity

Why three new implementations?

Out of scope

Layout

Licence and citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`implementation/original/` — 2004 dense-matrix MPI

`implementation/c/` — sparse-gate MPI

`implementation/go/` — goroutine parallelism

`implementation/python/` — PyTorch tensor ops

Packages