embed: Support multiple input files for alignments and distance matrices #10

huddlej · 2024-02-08T20:03:05Z

Context

To produce embeddings for multiple gene segments like HA and NA for influenza H3N2, we currently concatenate the alignments for each gene to create a single alignment file and then calculate the distance matrix from that concatenated alignment. This concatenation step is extra work for the user that the pathogen-embed command could easily perform itself.

Description

Ideally, users could provide multiple input files for both alignments and distance matrices to the pathogen-embed command. In this way, users could precalculate a distance matrix per gene segment and let the embed command sum the distance matrices internally. The interface might look like this:

# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
pathogen-embed \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

This approach allows each distance matrix to be produced in parallel, for example in a Snakemake workflow, which will speed up a computationally expensive part of the analysis.

Possible solution

To support this new functionality, the pathogen-embed command needs to:

1. accept one or more arguments to --alignment and --distance-matrix
2. load all given alignment files and, if more than one file is given, concatenate the alignments before running embeddings
3. load all given distance matrix files and, if more than one file is given, sum the distances from all matrices into a single distance matrix before running embeddings

In the case where the user only provides alignments and the embedding requires a distance matrix, the command's current logic remains unchanged and operates on the concatenated alignment it produces from step 2 above.
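As a rough illustration of why the concatenation and summation steps are interchangeable, here is a minimal sketch in plain Python. It assumes a simple Hamming distance and represents alignments as dictionaries of strain name to sequence; the function names are hypothetical and this is not the pathogen-embed implementation, whose distance calculation may treat gaps or indels differently. It also assumes that strains missing from any input are dropped, which the real command may handle another way.

```python
from itertools import combinations


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(alignment):
    """Build a symmetric dict-of-dicts distance matrix from {strain: sequence}."""
    matrix = {name: {name: 0} for name in alignment}
    for a, b in combinations(alignment, 2):
        matrix[a][b] = matrix[b][a] = hamming(alignment[a], alignment[b])
    return matrix


def concatenate_alignments(alignments):
    """Join each strain's sequences in the order the alignments are given.

    Assumption: strains missing from any alignment are dropped.
    """
    shared = set.intersection(*(set(alignment) for alignment in alignments))
    return {
        name: "".join(alignment[name] for alignment in alignments)
        for name in sorted(shared)
    }


def sum_distance_matrices(matrices):
    """Add matrices entry by entry, matching rows and columns by strain name."""
    shared = sorted(set.intersection(*(set(matrix) for matrix in matrices)))
    return {
        a: {b: sum(matrix[a][b] for matrix in matrices) for b in shared}
        for a in shared
    }


# Toy HA and NA alignments; the strain order differs on purpose.
ha = {"strain1": "ACGT", "strain2": "ACGA", "strain3": "TCGA"}
na = {"strain3": "GGC", "strain1": "GGA", "strain2": "GTA"}

summed = sum_distance_matrices([distance_matrix(ha), distance_matrix(na)])
from_concatenation = distance_matrix(concatenate_alignments([ha, na]))

# Under a plain Hamming distance, both routes give the same matrix.
print(summed == from_concatenation)
```

Matching entries by strain name rather than by row position matters here, because per-segment files are not guaranteed to list strains in the same order.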

It should be possible for the user to provide a single alignment file to use for PCA initialization of t-SNE, for example, and also provide multiple distance matrices to use for the embedding.
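One way the argument handling could look is sketched below with Python's argparse. The parser is hypothetical: the option names come from the examples in this issue, but the structure (a subcommand per method, `nargs="+"` on both inputs) is an assumption and not the actual pathogen-embed source.

```python
import argparse


def build_parser():
    """Hypothetical parser for illustration; not the pathogen-embed source."""
    parser = argparse.ArgumentParser(prog="pathogen-embed")
    # nargs="+" accepts one or more file names and always yields a list.
    parser.add_argument("--alignment", nargs="+", default=[],
                        help="one or more aligned FASTA files")
    parser.add_argument("--distance-matrix", nargs="+", default=[],
                        help="one or more distance matrix CSV files")
    parser.add_argument("--output-dataframe",
                        help="CSV file to write the embedding to")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for method in ("mds", "t-sne"):
        subparsers.add_parser(method)
    return parser


# One alignment for PCA initialization, two distance matrices for the embedding.
args = build_parser().parse_args([
    "--alignment", "h3n2_ha_alignment.fasta",
    "--distance-matrix", "h3n2_ha_distances.csv", "h3n2_na_distances.csv",
    "--output-dataframe", "h3n2_ha_na_t-sne.csv",
    "t-sne",
])
print(args.alignment, args.distance_matrix, args.command)
```

Because the two options are parsed independently, the number of alignments does not have to match the number of distance matrices in this sketch. Note that a `nargs="+"` option consumes every following non-option argument, so the method name needs to come after a different option such as `--output-dataframe`, as it does in the examples above.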


nandsra21 · 2024-04-09T00:06:05Z

Working Implementation

# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
# Change: must add an alignment
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
# Change: same number of alignments as distance matrices
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

huddlej added the enhancement New feature or request label Feb 8, 2024

nandsra21 self-assigned this Mar 29, 2024

nandsra21 linked a pull request Apr 10, 2024 that will close this issue

Multiple alignment feature #19

Merged

huddlej closed this as completed in #19 Jun 3, 2024
