Test t-SNE success/failure with multiple inputs
Adds tests for t-SNE behavior with multiple inputs when inputs are
formatted correctly and incorrectly. Adds a check for matching record
names in alignment and distance inputs to t-SNE when both are provided,
to make the new failure mode test pass.
huddlej committed Jun 3, 2024
1 parent 93a8195 commit f5e4eae
Showing 3 changed files with 61 additions and 0 deletions.
11 changes: 11 additions & 0 deletions src/pathogen_embed/pathogen_embed.py
Expand Up @@ -309,6 +309,17 @@ def embed(args):
second_df = pd.DataFrame(numbers)
second_df.columns = ["Site " + str(k) for k in range(genomes_df.shape[1],genomes_df.shape[1] + len(numbers[0]))]
genomes_df = pd.concat([genomes_df, second_df], axis=1)

# If we're using PCA to initialize the t-SNE embedding, confirm that the
# input alignments used for PCA have the same sequence names as the
# input distance matrices.
if (
    args.command == "t-sne" and
    not np.array_equal(distance_matrix.index.values, np.array(sequence_names))
):
    print("ERROR: The sequence names for the distance matrix inputs do not match the names in the alignment inputs.", file=sys.stderr)
    sys.exit(1)

#performing PCA on my pandas dataframe
pca = PCA(
n_components=n_components,
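
The check added in this hunk relies on `np.array_equal`, which compares arrays position by position. The following is a minimal sketch, not the project's code: the helper name `names_match` and the toy record names are hypothetical, and a plain NumPy array stands in for `distance_matrix.index.values`. It illustrates that a missing record and a reordered record are both treated as a mismatch.

```python
import numpy as np


def names_match(distance_names, sequence_names):
    """Return True only if both inputs list the same names in the same order."""
    # Hypothetical helper mirroring the comparison used in the hunk above.
    return np.array_equal(np.asarray(distance_names), np.array(sequence_names))


# Toy record names (assumed for illustration only).
alignment_names = ["A", "B", "C"]

identical = names_match(np.array(["A", "B", "C"]), alignment_names)
missing_record = names_match(np.array(["A", "C"]), alignment_names)
reordered = names_match(np.array(["B", "A", "C"]), alignment_names)

print(identical, missing_record, reordered)
```

Because the comparison is order-sensitive, inputs that contain the same records in a different order would also trigger the error path.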
@@ -0,0 +1,27 @@
Get a distance matrix from an H3N2 HA alignment that has been sorted by sequence name.

  $ pathogen-distance \
  >   --alignment $TESTDIR/data/h3n2_ha_alignment.sorted.fasta \
  >   --output ha_distances_complete.csv

Get a distance matrix from an H3N2 NA alignment that has been sorted by sequence name.

  $ pathogen-distance \
  >   --alignment $TESTDIR/data/h3n2_na_alignment.sorted.fasta \
  >   --output na_distances_complete.csv

Remove the second row and second column from the HA and NA distance matrix files, which drops one record from each matrix.
The distance matrices then no longer match the alignments, although the two alignments still match each other and so do the two distance matrices.

  $ cut -f 1,3- -d "," ha_distances_complete.csv | sed 2d > ha_distances.csv
  $ cut -f 1,3- -d "," na_distances_complete.csv | sed 2d > na_distances.csv

Run pathogen-embed with t-SNE on distances from H3N2 HA and H3N2 NA alignments.

  $ pathogen-embed \
  >   --alignment $TESTDIR/data/h3n2_ha_alignment.sorted.fasta $TESTDIR/data/h3n2_na_alignment.sorted.fasta \
  >   --distance-matrix ha_distances.csv na_distances.csv \
  >   --output-dataframe embed_t-sne.csv \
  >   t-sne
  ERROR: The sequence names for the distance matrix inputs do not match the names in the alignment inputs.
  [1]
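
To make the effect of the `cut -f 1,3- -d "," ... | sed 2d` step in this test concrete, here is a rough Python sketch of the same text transformation. It assumes a distance matrix CSV with a header row and record names in the first column; that layout, the toy values, and the function name `drop_second_row_and_column` are assumptions for illustration, not taken from `pathogen-distance`.

```python
def drop_second_row_and_column(csv_text):
    """Emulate `cut -f 1,3- -d ","` followed by `sed 2d` on CSV text."""
    lines = csv_text.splitlines()
    # cut -f 1,3- -d ",": keep field 1 and fields 3 onward on every line.
    trimmed = []
    for line in lines:
        fields = line.split(",")
        trimmed.append(",".join(fields[:1] + fields[2:]))
    # sed 2d: delete the second line of the stream.
    del trimmed[1]
    return "\n".join(trimmed) + "\n"


# Toy 3x3 distance matrix (assumed layout: empty top-left cell, names as index).
complete = ",A,B,C\nA,0,1,2\nB,1,0,3\nC,2,3,0\n"

reduced = drop_second_row_and_column(complete)
remaining_names = [line.split(",")[0] for line in reduced.splitlines()[1:]]

print(reduced)
print(remaining_names)
```

Under that assumed layout, the second column and second line of the file both belong to the first listed record, so the reduced matrix stays square but no longer covers every sequence in the alignment.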
23 changes: 23 additions & 0 deletions tests/pathogen-embed-t-sne-multiple-distances-and-alignments.t
@@ -0,0 +1,23 @@
Get a distance matrix from an H3N2 HA alignment.

  $ pathogen-distance \
  >   --alignment $TESTDIR/data/h3n2_ha_alignment.fasta \
  >   --output ha_distances.csv

Get a distance matrix from an H3N2 NA alignment.

  $ pathogen-distance \
  >   --alignment $TESTDIR/data/h3n2_na_alignment.fasta \
  >   --output na_distances.csv

Run pathogen-embed with t-SNE on distances from H3N2 HA and H3N2 NA alignments.

  $ pathogen-embed \
  >   --alignment $TESTDIR/data/h3n2_ha_alignment.fasta $TESTDIR/data/h3n2_na_alignment.fasta \
  >   --distance-matrix ha_distances.csv na_distances.csv \
  >   --output-dataframe embed_t-sne.csv \
  >   t-sne

There should be one record in the embedding per input sequence in the alignment.

  $ [[ $(sed 1d embed_t-sne.csv | wc -l) == $(grep "^>" $TESTDIR/data/h3n2_ha_alignment.fasta | wc -l) ]]
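
The final assertion in this test compares the number of data rows in the embedding with the number of FASTA header lines in the HA alignment. A small Python sketch of the same comparison follows; the function names, the toy sequences, and the embedding column names are hypothetical and are not taken from `pathogen-embed` output.

```python
def count_fasta_records(fasta_text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def count_data_rows(csv_text):
    """Count CSV lines after the header, like `sed 1d file | wc -l`."""
    return max(len(csv_text.splitlines()) - 1, 0)


# Toy inputs (assumed for illustration only).
alignment = ">s1\nACGT\n>s2\nACGA\n"
embedding = "strain,x,y\ns1,0.1,0.2\ns2,0.3,0.4\n"

n_records = count_fasta_records(alignment)
n_rows = count_data_rows(embedding)

print(n_records == n_rows)
```

As in the shell version, this only checks that the counts agree; it does not confirm that the same names appear in both files.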
