feat: add ClustalW .aln format support for sequence alignments #880

Open
haoyu-haoyu wants to merge 2 commits into biotite-dev:main from haoyu-haoyu:feat/clustalw-aln-format
Conversation

@haoyu-haoyu
Summary

Add support for reading and writing ClustalW .aln alignment files, following the existing FASTA alignment I/O pattern.

New files

| File | Description |
| --- | --- |
| src/biotite/sequence/io/clustal/__init__.py | Package exports |
| src/biotite/sequence/io/clustal/file.py | ClustalFile class (TextFile + MutableMapping) |
| src/biotite/sequence/io/clustal/convert.py | get_alignment() / set_alignment() conversion functions |
| tests/sequence/data/clustal.aln | Single-block test data |
| tests/sequence/data/clustal_multi.aln | Multi-block test data |
| tests/sequence/test_clustal.py | 9 test cases |

Implementation details

ClustalFile

  • Extends TextFile and MutableMapping (same pattern as FastaFile)
  • Parses CLUSTAL header, sequence blocks, and consensus lines
  • Dict-like interface: clustal_file["seq1"] returns gapped sequence string
  • Handles multi-block alignments by concatenating segments per sequence name
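
For readers unfamiliar with the format, the parsing steps above can be sketched in plain Python. This is not the code from this PR: `parse_clustal` is a hypothetical standalone helper, it raises `ValueError` where the PR uses `InvalidFileError`, and it assumes that consensus lines always begin with whitespace (as they do in standard ClustalW output, where the name column is padded).

```python
def parse_clustal(text):
    """Return {name: gapped sequence} from ClustalW-formatted text.

    Illustrative sketch only; not the ClustalFile implementation.
    """
    lines = text.splitlines()
    # The first non-blank line must be the header starting with "CLUSTAL"
    header_index = next(
        (i for i, line in enumerate(lines) if line.strip()), None
    )
    if header_index is None:
        raise ValueError("File is empty")
    if not lines[header_index].startswith("CLUSTAL"):
        raise ValueError("Missing CLUSTAL header")

    entries = {}
    for line in lines[header_index + 1:]:
        # Blank lines separate blocks; consensus lines (made of '*', ':',
        # '.' and spaces) are assumed to start with whitespace
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # An optional trailing residue count (third column) is ignored
        name, segment = fields[0], fields[1]
        # Segments with the same name in later blocks are concatenated
        entries[name] = entries.get(name, "") + segment
    return entries


example = (
    "CLUSTAL W (1.83) multiple sequence alignment\n"
    "\n"
    "seq1  ACGT-A\n"
    "seq2  AC-TTA\n"
    "      ** * *\n"
    "\n"
    "seq1  GG\n"
    "seq2  GC\n"
    "      * \n"
)
parsed = parse_clustal(example)
```

Because Python dicts preserve insertion order, the sequences come back in the order they first appear in the file, which matters when the names are later paired with alignment rows.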

Conversion functions

  • get_alignment(clustal_file, seq_type=None) — auto-detects nucleotide/protein
  • set_alignment(clustal_file, alignment, seq_names, line_length=60) — formats into blocks
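
The block formatting controlled by `line_length` can be illustrated with a minimal sketch. `format_clustal` is a hypothetical helper that works on already-gapped strings rather than on a Biotite `Alignment`; the header text and the two-space name padding are assumptions, not necessarily what `set_alignment()` writes.

```python
def format_clustal(names, gapped_seqs, line_length=60):
    """Format gapped sequences into ClustalW-style blocks.

    Illustrative sketch only; not the set_alignment() implementation.
    """
    if line_length <= 0:
        raise ValueError("line_length must be positive")
    if not names or len(names) != len(gapped_seqs):
        raise ValueError("Need one name per sequence and at least one sequence")
    if len({len(seq) for seq in gapped_seqs}) != 1:
        raise ValueError("All gapped sequences must have the same length")

    # Pad names to a common width so the sequence columns line up
    name_width = max(len(name) for name in names) + 2
    # Header text is an assumption; parsers typically only check "CLUSTAL"
    lines = ["CLUSTAL W multiple sequence alignment", ""]
    total_length = len(gapped_seqs[0])
    for start in range(0, total_length, line_length):
        lines.append("")  # blank line before each block
        for name, seq in zip(names, gapped_seqs):
            segment = seq[start:start + line_length]
            lines.append(f"{name:<{name_width}}{segment}")
    return "\n".join(lines) + "\n"


formatted = format_clustal(
    ["seq1", "seq2"], ["ACGT-AGG", "AC-TTAGC"], line_length=6
)
```

With `line_length=6`, the eight-column alignment above is split into one block of six columns and one of two, which is the multi-block layout the reader has to concatenate again.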

Edge cases handled

  • Empty file → InvalidFileError
  • Missing CLUSTAL header → InvalidFileError
  • Single sequence → clear ValueError (alignments need ≥ 2 sequences)
  • line_length=0 → ValueError
  • Reusing a ClustalFile object → entries cleared before writing
  • Consensus lines (with *, :, .) correctly skipped during parsing

Tests

9 tests covering: file reading, multi-block parsing, dict interface, alignment conversion, round-trip consistency (object-level and file I/O-level), name count mismatch error, and explicit sequence type parameter.

Closes #774

- Validate sequence count >= 2 in get_alignment() with clear error
- Validate line_length > 0 in set_alignment()
- Clear existing entries before writing in set_alignment() to prevent
  stale data when reusing a ClustalFile object
@padix-key
Member

Thanks for adding the parser 👍. I'll have a look 👀

@codspeed-hq
codspeed-hq bot commented Apr 8, 2026

Merging this PR will not alter performance

✅ 98 untouched benchmarks


Comparing haoyu-haoyu:feat/clustalw-aln-format (83b69f9) with main (00d5b98)

Development

Successfully merging this pull request may close these issues.

Support ClustalW .aln file formats for sequence alignments