Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple alignment feature #19

Merged
merged 24 commits into from
Jun 3, 2024
Merged

Multiple alignment feature #19

merged 24 commits into from
Jun 3, 2024

Conversation

nandsra21
Copy link
Collaborator

No description provided.

nandsra21 and others added 16 commits June 3, 2024 10:19
Adds sorted alignments for HA and NA and a new (failing) test for
identical PCA output regardless of the input order of sequences in
multiple alignments.
Fixes a bug where PCA would produce incorrect output when input
alignments were not sorted already by sorting alignments by sequence
name before building the PCA matrix for each alignment.
Fixes a bug where we tried to calculate distances for PCA when these are
not needed. This change fixes the preceding functional test that checks
for an error when alignments don't have the same sequence names.
Makes the alignment input to pathogen-embed optional, but adds logic to
require the alignment when running PCA or t-SNE with custom error
messages for both cases. Adds functional tests for this behavior.
Sorts each distance matrix input alphabetically by sequence name (rows
and columns) on load to ensure that the sum of all matrices correctly
represents the same pairs of sequences. Adds an assertion to confirm
that sequence names are identical for each distance matrix input. While
we're improving this function, we also create an index of the sequence
names for the final distance matrix data frame, so we don't have to set
it later from alignment names. Adds a functional test to confirm that
this behavior works as expected.
Adds/updates tests to ensure that distance matrices provided by the user
or created on the fly from alignments are sorted and have the same
samples. Refactors the logic for calculating distances from an alignment
into its own function, so we don't have to copy the same code for
pathogen-distance and pathogen-embed.
Adds a test and logic to error when distance matrix inputs are not
square.
Adds tests for t-SNE behavior with multiple inputs when inputs are
formatted correctly and incorrectly. Adds a check for matching record
names in alignment and distance inputs to t-SNE when both are provided,
to make the new failure mode test pass.
Explicitly sets the threading layer for Numba to a simple implementation
that should always be available. UMAP uses Numba internally and, on some
systems, the Numba threading layer can get set to an OpenMP (OMP)
backend. Since Numba internally calls OMP with a function that is
deprecated, OMP emits a deprecation warning. This warning is especially
useless since we force UMAP to run with a single CPU. The main effect of
this warning was to break our functional tests.

For more details, see this Numba issue comment on GitHub:
numba/numba#5275 (comment)
@huddlej huddlej force-pushed the multiple-alignment-feature branch from 051f8c1 to 9904998 Compare June 3, 2024 17:27
Fixes a bug where the genetic distances plotted in the pairwise distance
figure were not guaranteed to be integers. Adds figure outputs to a
functional test which surfaced this issue originally.
Adds example commands to make an embedding from multiple inputs. This
example comes after the quickstart which uses a single input.
Updates figures in the quickstart to match the new test data for HA and
adds a figure for the HA/NA embedding clusters so readers can compare
the results with the HA-only clusters.
@huddlej
Copy link
Contributor

huddlej commented Jun 3, 2024

@nandsra21 I finished adding tests, refining the code, and updating the documentation for this new feature. The README now includes an example tutorial for multiple input alignments. I'm going to test this out a bit with some other flu data, but I think this is nearly ready to merge and release.

@huddlej huddlej merged commit 900653f into main Jun 3, 2024
4 checks passed
@huddlej huddlej deleted the multiple-alignment-feature branch June 3, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

embed: Support multiple input files for alignments and distance matrices
2 participants