# Species Identification with Split Kmer Analysis



## Overview:
<hr>

1. Species Labelling Problem
2. Sourmash
    * Methods
    * Limitations
    * Results
3. SKA
    * Methods
    * Limitations
    * Results
4. Comparing Sourmash against SKA
5. QA  
    * Reference genomes 
    * Maca
6. Future Directions

## Species Labelling
<hr>

### **Problem:**
   * Species labelling is done in the field and by eye, so it is easy to mislabel species.
   * We need a way to verify whether mosquito species are being identified accurately and consistently.
   * As most mosquito species do not have reference genomes, we need to identify species in a reference-free way.


![Frame%203.png](attachment:Frame%203.png)


### **Solution :**

   * A possible solution is to cluster these samples.
   * With the clustering information, we can see where each species label clusters and detect samples that cluster more closely with a different label (meaning that sample is a candidate for being mislabeled).
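
As a rough illustration of the idea above, the sketch below flags any sample whose nearest neighbour carries a different label. The function name, the sample names, the labels, and the distances are all hypothetical toy values, not data from this analysis, and a nearest-neighbour check is only one simple stand-in for inspecting clusters.

```python
def flag_possible_mislabels(labels, distances):
    """Return (sample, own_label, neighbour_label) for samples whose
    nearest neighbour has a different label.

    labels:    dict mapping sample name -> field-assigned species label
    distances: dict mapping frozenset({a, b}) -> distance between a and b
    """
    flagged = []
    for sample in labels:
        others = [s for s in labels if s != sample]
        # Closest other sample according to the pairwise distances.
        nearest = min(others, key=lambda s: distances[frozenset((sample, s))])
        if labels[nearest] != labels[sample]:
            flagged.append((sample, labels[sample], labels[nearest]))
    return flagged


# Hypothetical toy data: s3 is labelled species_B but sits next to species_A samples.
labels = {"s1": "species_A", "s2": "species_A",
          "s3": "species_B", "s4": "species_B", "s5": "species_B"}
pairs = {
    ("s1", "s2"): 0.10, ("s1", "s3"): 0.20, ("s2", "s3"): 0.15,
    ("s4", "s5"): 0.10, ("s1", "s4"): 0.90, ("s1", "s5"): 0.90,
    ("s2", "s4"): 0.90, ("s2", "s5"): 0.85, ("s3", "s4"): 0.80,
    ("s3", "s5"): 0.80,
}
distances = {frozenset(p): d for p, d in pairs.items()}

print(flag_possible_mislabels(labels, distances))
```

A flagged sample is only a candidate for review; it may also reflect a genuinely unusual specimen or noisy data.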


### **Comparing Transcripts:**


* To cluster species, we need to compare each sample's transcripts against every other sample's to get pairwise distances, which show which samples are most closely related and share the most transcript content.
* For this analysis we looked into two separate libraries that both focus on comparing short DNA segments (kmers) across samples: [Sourmash](https://sourmash.readthedocs.io/en/latest/) and [SKA](https://github.com/simonrharris/SKA) (split kmer analysis).

## Sourmash:
<hr>

### Method:

   * Sourmash hashes kmers (substrings of length k from a transcript) to integers and keeps a small, pseudo-random subsample of those hashes. This set of hashed kmers is the transcript's sketch (or MinHash).
   * Similarity is calculated by taking the Jaccard index between sample sketches. Similarity is scored on a scale from 0 to 1.
   * Because these sketches are much smaller than the transcripts, samples can be compared cheaply and similarity scores can be generated across a whole pool of samples.
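
To make the sketch-and-compare idea concrete, here is a minimal pure-Python illustration of a bottom-k MinHash and a Jaccard estimate. It does not use the sourmash API and is not how sourmash is implemented internally; the function names, the SHA-1 hash, `k=5`, and `size=20` are assumptions chosen for the example (real analyses use much larger values).

```python
import hashlib


def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def sketch(seq, k=5, size=20):
    """Bottom-k sketch: keep only the `size` smallest kmer hashes."""
    return set(sorted(hash_kmer(km) for km in kmers(seq, k))[:size])


def jaccard_estimate(sketch_a, sketch_b, size=20):
    """Estimate the Jaccard index from two bottom-k sketches."""
    # The `size` smallest hashes of the union act as a sample of the union.
    union_bottom = sorted(sketch_a | sketch_b)[:size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in sketch_a and h in sketch_b)
    return shared / len(union_bottom)


a = sketch("ACGTTGCAAGCTTAGGCTAACGT")
b = sketch("ACGTTGCAAGCTTAGGCTAACGA")
print(jaccard_estimate(a, a))  # identical sketches score 1.0
print(jaccard_estimate(a, b))  # a value between 0 and 1
```

Note that a single base change alters every kmer that overlaps it, so all of those kmers simply count as non-matching in this kind of comparison.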

### Results:


![Sourmash%20Results%20%281%29.png](attachment:Sourmash%20Results%20%281%29.png)


### **Issues**:
   * Samples are clustering by species, but a large amount of normalization was needed to show separation.

## SKA:
<hr>

### Method:

  * SKA compares samples using pairs of kmers that are separated by a single base. This gives more granular detail on what differs between two transcripts by exposing SNPs (single nucleotide polymorphisms).
  * The split kmer approach is more robust: instead of treating two kmers that differ by one base as different, SKA marks them as matching, with a SNP separating the two transcripts.
  * As there is no hashing involved, comparing two samples with SKA is more computationally expensive than comparing them with Sourmash. (The library states that it was designed for prokaryotic and other small haploid genomes.)
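
The split kmer idea can be sketched in a few lines of Python. This is an illustration only, not SKA's implementation: the function names and the flank length of 3 are assumptions, and reverse complements are ignored for brevity. Each window is stored as a pair of flanking kmers keyed to the middle base, so two samples that share the flanks but differ in the middle base are counted as a SNP.

```python
def split_kmers(seq, flank):
    """Map (left_flank, right_flank) -> middle base for each window
    of length 2 * flank + 1 in seq."""
    table = {}
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        middle = window[flank]
        if key in table and table[key] != middle:
            # Same flanks seen with different middle bases: mark ambiguous.
            table[key] = None
        else:
            table[key] = middle
    return table


def compare(seq_a, seq_b, flank=3):
    """Count shared split kmers with the same middle base (matches)
    and with a different middle base (SNPs)."""
    table_a = split_kmers(seq_a, flank)
    table_b = split_kmers(seq_b, flank)
    matches = snps = 0
    for key in table_a.keys() & table_b.keys():
        if table_a[key] is None or table_b[key] is None:
            continue  # skip ambiguous split kmers
        if table_a[key] == table_b[key]:
            matches += 1
        else:
            snps += 1
    return matches, snps


reference = "ACGTTGCAAGCT"
variant = "ACGTTGGAAGCT"  # one base substituted (C -> G at index 6)
print(compare(reference, reference))  # (6, 0)
print(compare(reference, variant))    # (0, 1)
```

In the second comparison, the only split kmer whose flanks avoid the substituted base is the one centred on it, so it is reported as a single SNP.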

### Results:

![SNP%20Distances.png](attachment:SNP%20Distances.png)

## Comparing Methods:
<hr>


![Sourmash%20Comparison.png](attachment:Sourmash%20Comparison.png)

* Point out the major difference: Sourmash identifies kmers as non-matching when transcripts differ by only a few SNPs, whereas SKA still matches them and records the SNP.
* Point out that we need to go much higher up in the clustering to get separation between clusters.

![Hierarchical%20Clustering%20%283%29.png](attachment:Hierarchical%20Clustering%20%283%29.png)
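
To show what "going higher up" in a hierarchical clustering means, the sketch below cuts a single-linkage clustering at different heights and lists the resulting clusters. The choice of single linkage, the function name, and the toy distances are assumptions for illustration; they are not the linkage method or data behind the figure above.

```python
def clusters_at_height(samples, distances, height):
    """Single-linkage clusters obtained by cutting at `height`:
    samples end up together if a chain of pairs with distance <= height
    connects them.

    distances: dict mapping (a, b) tuples -> distance
    """
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), d in distances.items():
        if d <= height:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy distances between four samples.
samples = ["a", "b", "c", "d"]
distances = {
    ("a", "b"): 0.10, ("c", "d"): 0.20,
    ("a", "c"): 0.60, ("a", "d"): 0.70,
    ("b", "c"): 0.65, ("b", "d"): 0.70,
}

for height in (0.05, 0.3, 0.8):
    print(height, clusters_at_height(samples, distances, height))
```

A low cut leaves every sample in its own cluster, and raising the cut merges samples into progressively larger groups.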

## QA
<hr>

* To validate what we were seeing in SKA, we spiked in some reference genomes.
* We also looked at separate tissue types within the same species (with maca data) to verify that SNP distances are zero.

### Reference Data Spike In

![SKA%20with%20Reference%20%281%29.png](attachment:SKA%20with%20Reference%20%281%29.png)

### Maca Data

![maca_QA.png](attachment:maca_QA.png)

## Future Directions:
<hr>

* Further QA analysis