# Species Identification with Split Kmer Analysis



## Overview:
<hr>

1. Species Labelling Problem
2. Sourmash
    * Methods
    * Limitations
    * Results
3. SKA
    * Methods
    * Limitations
    * Results
4. Comparing Sourmash against SKA
5. QA  
    * Reference genomes 
    * Maca
6. Future Directions

## Species Labelling
<hr>

### **Problem:**
   * Species labelling is done in the field and by eye, which makes it easy to mislabel species.
   * We need a way to verify whether mosquito species are being identified accurately and consistently.
   * As most mosquito species do not have reference genomes, we need to identify species in a reference-free way.
   * There are over 3000 species of mosquitoes, which makes differentiating species by eye increasingly problematic.

![Mosquito%20Image%20Comparison%20%28Real%29.png](attachment:Mosquito%20Image%20Comparison%20%28Real%29.png)




### Sorting Mosquitoes on Site:
<hr>

![Frame%203.png](attachment:Frame%203.png)

### **Solution:**

   * Cluster labeled samples by comparing transcripts instead.
       * To cluster samples we need a distance metric for comparing samples.
       * For this analysis we looked into two separate libraries, both focused on comparing short DNA segments (kmers) across samples: [Sourmash](https://sourmash.readthedocs.io/en/latest/) and [SKA](https://github.com/simonrharris/SKA) (split kmer analysis). They provide the following distance metrics:
           * Jaccard Index (Sourmash)
           * SNP Distance (SKA)
       


## Sourmash:
<hr>

   * Sourmash works by taking a random sample of kmers (substrings of length k from a transcript) and hashing them to integers. The set of hashed kmers for each transcript makes up a hash sketch (or MinHash).
   * Similarity is calculated as the Jaccard index between sample MinHashes, scored on a scale from 0 to 1.
   
![jaccard%20resized.png](attachment:jaccard%20resized.png)
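To make the sketching idea concrete, here is a minimal pure-Python sketch of a kmer-based Jaccard comparison. It does not use Sourmash itself: the function names, the SHA-1 based hash, and the toy sequences are all assumptions made for illustration, and Sourmash's own hashing and sketch parameters differ.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (illustrative hash, not Sourmash's)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k, num_hashes):
    """Keep only the num_hashes smallest kmer hashes (a bottom sketch)."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return set(hashes[:num_hashes])


def exact_jaccard(seq_a, seq_b, k):
    """Jaccard index over the full kmer sets: |A & B| / |A | B|."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimated_jaccard(sketch_a, sketch_b, num_hashes):
    """Estimate the Jaccard index from two bottom sketches.

    Take the num_hashes smallest hashes of the merged sketches and count
    how many of them appear in both samples.
    """
    merged = sorted(sketch_a | sketch_b)[:num_hashes]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in sketch_a and h in sketch_b)
    return shared / len(merged)


# Toy sequences (made up): they differ at a single base.
seq_a = "ACGTACGTGA"
seq_b = "ACGTACGTCA"
print(exact_jaccard(seq_a, seq_b, k=3))
print(estimated_jaccard(sketch(seq_a, 3, 100), sketch(seq_b, 3, 100), 100))
```

With a sketch size larger than the number of distinct kmers, the estimate matches the exact value; smaller sketches trade accuracy for speed and memory. Note that a single differing base already removes several shared kmers, which lowers the score.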

### Results:

![Sourmash%20Results%20%283%29.png](attachment:Sourmash%20Results%20%283%29.png)

### **Issues**:
   * Samples are clustering by species, but substantial normalization was needed to show separation.

## SKA:
<hr>

  * SKA compares samples using pairs of kmers that are separated by a single base. This gives more granular detail on what differs between two transcripts, because SNPs (single nucleotide polymorphisms) can be observed directly.
  * The split kmer approach is more robust: instead of treating two kmers that differ by one base as different, SKA marks them as matching, with a SNP separating the two transcripts.
  * As there is no hashing involved, comparing two samples with SKA is more computationally expensive than comparing them with Sourmash. (The SKA documentation states that it was designed for prokaryotic and other small haploid genomes.)
  
  
![SKA%20figure%20%281%29.png](attachment:SKA%20figure%20%281%29.png)

![SNP%20Distances%20%282%29.png](attachment:SNP%20Distances%20%282%29.png)
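The split kmer comparison can be sketched in a few lines of Python. This is an illustration of the idea only, not SKA's implementation: the function names and toy sequences are assumptions, and the sketch ignores reverse complements and other details that the real tool handles.

```python
def split_kmers(seq, k):
    """Map each (left arm, right arm) pair to the base between the arms.

    Every split kmer spans 2*k + 1 bases: k bases, one middle base, k bases.
    Arm pairs seen with more than one middle base are dropped as ambiguous
    (a simplification made for this sketch).
    """
    found = {}
    ambiguous = set()
    span = 2 * k + 1
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        arms = (window[:k], window[k + 1:])
        middle = window[k]
        if arms in found and found[arms] != middle:
            ambiguous.add(arms)
        found[arms] = middle
    for arms in ambiguous:
        del found[arms]
    return found


def compare(sample_a, sample_b):
    """Return (matches, snps) over the split kmers shared by both samples."""
    shared = sample_a.keys() & sample_b.keys()
    snps = sum(1 for arms in shared if sample_a[arms] != sample_b[arms])
    return len(shared) - snps, snps


# Toy sequences (made up): they differ at one base, A vs T at index 7.
seq_a = "ACGTTGCATGCCAGT"
seq_b = "ACGTTGCTTGCCAGT"
print(compare(split_kmers(seq_a, 3), split_kmers(seq_b, 3)))  # (2, 1)
```

Here the single differing base is reported as one SNP between otherwise matching arms, rather than as a set of unrelated kmers.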

## Comparing Methods:
<hr>

![Sourmash%20Comparison.png](attachment:Sourmash%20Comparison.png)

* Point out the major difference between the methods: kmers are identified as non-matching when transcripts differ by only a few SNPs.
* Point out that we need to go much higher up in the clustering hierarchy to get separation between clusters.

![Hierarchical%20Clustering%20%284%29.png](attachment:Hierarchical%20Clustering%20%284%29.png)
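As a rough illustration of what "going higher up" in the hierarchy means, the sketch below cuts a single-linkage clustering at different heights. The linkage method used for the figures is not stated here, so single linkage is an assumption, and the sample names and distances are invented purely for the example.

```python
def single_linkage_clusters(names, dist, threshold):
    """Cluster samples by cutting a single-linkage hierarchy at `threshold`.

    Two samples end up in the same cluster when they are connected by a
    chain of pairwise distances that are all <= threshold.
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[(a, b)] <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical samples and made-up pairwise distances.
names = ["s1", "s2", "s3", "s4"]
dist = {
    ("s1", "s2"): 5, ("s3", "s4"): 8,
    ("s1", "s3"): 120, ("s1", "s4"): 130,
    ("s2", "s3"): 125, ("s2", "s4"): 118,
}
for cut in (1, 10, 200):
    print(cut, single_linkage_clusters(names, dist, cut))
```

A low cut leaves every sample on its own, a moderate cut recovers the two groups, and a very high cut merges everything. A distance metric that separates species well lets the groups appear at a lower cut.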

## QA:
<hr>

* To validate what we were seeing in SKA, we spiked in some reference genomes.
* We also looked at separate tissue types within the same species (with Maca data) to verify that SNP distances were zero (or very close to zero).

![SKA%20with%20Reference%20%283%29.png](attachment:SKA%20with%20Reference%20%283%29.png)

## Reference Data Spike In Hierarchical Clustering:
<hr>

![Hierarchical%20with%20Reference.png](attachment:Hierarchical%20with%20Reference.png)

## Maca Data Heatmap 
<hr>

![maca_QA.png](attachment:maca_QA.png)

## Future Directions:
<hr>

* Further QA analysis
* Data Storage and metadata linkages
* Handling/adding new data

## Acknowledgments

Thank you to the Skeeters group and the Data Science Team at CZB.

Special Thanks:
* Josh Batson
* Lucy Li
* Olga Botvinnik

