Context
To produce embeddings for multiple gene segments, such as HA and NA for influenza H3N2, we currently concatenate the per-gene alignments into a single alignment file and then calculate the distance matrix from that concatenated alignment. This concatenation step is extra work for the user that the pathogen-embed command could easily perform itself.
Description
Ideally, users could provide multiple input files for both alignments and distance matrices to the pathogen-embed command. Users could then precalculate a distance matrix per gene segment and let the embed command sum the distance matrices internally. The interface might look like this:
# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
--alignment h3n2_ha_alignment.fasta \
--output h3n2_ha_distances.csv
# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
--alignment h3n2_na_alignment.fasta \
--output h3n2_na_distances.csv
# Run MDS on the HA and NA distances.
pathogen-embed \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_mds.csv \
mds
# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne
# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne
# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
pathogen-embed \
--alignment h3n2_ha_alignment.fasta \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne
This approach allows each distance matrix to be produced in parallel, for example in a Snakemake workflow, which will speed up a computationally expensive part of the analysis.
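As a rough illustration of that parallelism outside of Snakemake, the sketch below uses Python threads instead of workflow rules. It only relies on the `--alignment` and `--output` flags shown above; the helper names `build_distance_command` and `run_in_parallel` are hypothetical and not part of pathogen-embed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_distance_command(alignment, output):
    """Build the pathogen-distance call for one gene segment."""
    return ["pathogen-distance", "--alignment", alignment, "--output", output]


def run_in_parallel(commands, runner=subprocess.run, max_workers=2):
    """Run each command in its own worker thread and return results in input order.

    `runner` is injectable (an assumption of this sketch) so the scheduling
    logic can be exercised without pathogen-distance installed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(runner, commands))


# One command per segment, following the file names used in the examples above.
segments = ["ha", "na"]
commands = [
    build_distance_command(
        f"h3n2_{segment}_alignment.fasta",
        f"h3n2_{segment}_distances.csv",
    )
    for segment in segments
]
```

Calling `run_in_parallel(commands)` would then launch both distance calculations at once. In a Snakemake workflow the same effect comes from one rule per segment, with the scheduler running independent jobs concurrently.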
Possible solution
To support this new functionality, the pathogen-embed command needs to:
1. accept one or more arguments to --alignment and --distance-matrix
2. load all given alignment files and, if more than one file is given, concatenate the alignments before running embeddings
3. load all given distance matrix files and, if more than one file is given, sum the distances from all matrices into a single distance matrix before running embeddings
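The first and third steps could be sketched roughly as follows. This is not the actual pathogen-embed implementation: the parser omits the subcommands and other options, and distance matrices are represented as plain nested dictionaries (`{name: {name: distance}}`) rather than whatever structure the command uses internally. The function names are hypothetical.

```python
import argparse


def build_parser():
    """Hypothetical parser in which both inputs accept one or more paths."""
    parser = argparse.ArgumentParser(prog="pathogen-embed")
    parser.add_argument("--alignment", nargs="+", default=[])
    parser.add_argument("--distance-matrix", nargs="+", default=[])
    return parser


def sum_distance_matrices(matrices):
    """Sum distance matrices given as {name: {name: distance}} dictionaries.

    Entries are matched by sequence name, not by position, so the matrices
    may list their sequences in different orders.
    """
    names = sorted(matrices[0])
    for matrix in matrices[1:]:
        if sorted(matrix) != names:
            raise ValueError("All distance matrices must contain the same sequence names.")
    return {
        row: {col: sum(matrix[row][col] for matrix in matrices) for col in names}
        for row in names
    }
```

Matching by name rather than by row order matters here because each segment's matrix is produced by a separate command and may not list sequences in the same order.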
In the case where the user only provides alignments and the embedding requires a distance matrix, the command's current logic remains unchanged and operates on the concatenated alignment it produces from step 2 above.
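A small sketch of why concatenating alignments and summing per-segment distance matrices can be interchangeable: for a plain Hamming distance (a count of mismatched positions), the distance on the concatenated sequences equals the sum of the per-segment distances. This assumes a simple mismatch count; pathogen-distance may treat gaps or ambiguous bases differently. The helper names and toy sequences are illustrative only.

```python
def concatenate_alignments(alignments):
    """Join per-segment alignments ({name: sequence}) into one sequence per name."""
    names = sorted(alignments[0])
    for alignment in alignments[1:]:
        if sorted(alignment) != names:
            raise ValueError("All alignments must contain the same sequence names.")
    return {name: "".join(alignment[name] for alignment in alignments) for name in names}


def hamming(left, right):
    """Count mismatched positions between two equal-length aligned sequences."""
    if len(left) != len(right):
        raise ValueError("Aligned sequences must have the same length.")
    return sum(a != b for a, b in zip(left, right))


# Toy per-segment alignments for two strains.
ha = {"A": "ACGT", "B": "ACGA"}
na = {"A": "GGTT", "B": "GCTA"}

combined = concatenate_alignments([ha, na])
per_segment_total = hamming(ha["A"], ha["B"]) + hamming(na["A"], na["B"])
concatenated_distance = hamming(combined["A"], combined["B"])
```

With these toy sequences, `per_segment_total` and `concatenated_distance` are the same value, which is the property that lets the command either concatenate alignments or sum precomputed matrices.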
It should be possible for the user to provide a single alignment file to use for PCA initialization of t-SNE, for example, and also provide multiple distance matrices to use for the embedding.
# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
--alignment h3n2_ha_alignment.fasta \
--output h3n2_ha_distances.csv
# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
--alignment h3n2_na_alignment.fasta \
--output h3n2_na_distances.csv
# Run MDS on the HA and NA distances.
# Change: must add an alignment
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_mds.csv \
mds
# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne
# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne
# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
# Change: same number of alignments as distance matrices
pathogen-embed \
--alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
--distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
--output-dataframe h3n2_ha_na_t-sne.csv \
t-sne