Process HA and NA alignments separately #122

Closed

huddlej (Collaborator) commented Aug 7, 2024

Description

Replaces the logic that ran distance calculations and embeddings on either an HA alignment or a concatenated HA/NA alignment with logic that gets distances and embeddings from the HA and NA alignments separately. This works because pathogen-embed accepts multiple input values for its alignment and distance matrix arguments.

The benefit of this change is that the pathogen-distance command can calculate distances that ignore leading and trailing gaps in each gene's alignment; in the concatenated alignment, those gaps would otherwise be counted. Because we did not calculate indel distances for the HA/NA analysis, this change to the workflow should only affect the PCA embeddings. And because the new simplex encoding of PCA inputs effectively ignores gaps, even the PCA embeddings should be minimally affected.

The most important aspect of this change, however, is that it demonstrates how we recommend using these tools for this kind of reassortment analysis.
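
To illustrate the gap issue, here is a minimal Python sketch. It is not the pathogen-distance implementation: the function names, the toy sequences, and the per-position gap counting are all assumptions made for illustration. It shows how a gap that is terminal within one gene's alignment becomes an interior gap once the HA and NA alignments are concatenated, so a distance that skips terminal gaps gives different answers in the two setups.

```python
def interior_span(seq, gap="-"):
    """Return (start, end) of the region between leading and trailing gaps."""
    start = len(seq) - len(seq.lstrip(gap))
    end = len(seq.rstrip(gap))
    return start, end


def distance_ignoring_terminal_gaps(a, b, gap="-"):
    """Toy gap-aware distance (hypothetical, for illustration only).

    Counts mismatched columns, including base-vs-gap columns, but only in
    the region where neither sequence is inside a leading or trailing gap.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    a_start, a_end = interior_span(a, gap)
    b_start, b_end = interior_span(b, gap)
    start, end = max(a_start, b_start), min(a_end, b_end)
    return sum(1 for x, y in zip(a[start:end], b[start:end]) if x != y)


# Made-up aligned sequences: strain A is missing the end of HA and the
# start of NA, as can happen with incomplete sequencing coverage.
ha_a, ha_b = "ATGCC--", "ATGCCTA"
na_a, na_b = "--GATTC", "AAGATTC"

# Per-gene distances: the missing ends are terminal gaps and are skipped.
separate = (
    distance_ignoring_terminal_gaps(ha_a, ha_b)
    + distance_ignoring_terminal_gaps(na_a, na_b)
)

# Concatenated alignment: the same gaps now sit in the middle of the
# sequence, so they are treated as interior gaps and counted.
concatenated = distance_ignoring_terminal_gaps(ha_a + na_a, ha_b + na_b)

print(separate, concatenated)
```

With these toy inputs, the per-gene distances sum to 0 while the concatenated alignment gives 4, even though the observed bases are identical.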

Related issues

Closes #121

huddlej (Collaborator, Author) commented Aug 7, 2024

Actually, we can't support this change yet now that we include genetic distance clusters in the analysis. The pathogen-cluster command does not yet accept multiple distance matrix inputs, which we would need for HA/NA genetic distance clusters. I'm closing this to focus on more critical revisions, but we could revisit this concept later. I created an issue in pathogen-embed about this.

Successfully merging this pull request may close these issues.

Run HA/NA analysis with separate distance matrices for HA and NA