# Aligning pink loci that failed HWE to the chinook and rainbow trout genomes (O. tshawytscha and O. mykiss)

#### Bowtie2 version 2.3.4

### Background 
Although we removed every locus that did not adhere to HWE in at least 50% of our 18 populations, we wanted to see whether aligning those loci to a genome would reveal any interesting patterns. We are especially interested in inversions, because loci out of HWE have proved to be an indicator of inversions in other, similar salmon data sets (personal communication, Garrett McKinney). The original choice for alignment was the rainbow trout genome, recommended by Garrett, but since chinook is more closely related to pink salmon, I wanted to try aligning to the chinook genome as well. Garrett may have recommended the trout genome because it is the most complete, but I'm not sure. 

### Simplified steps:

    1. Convert the list of pink loci that failed the HWE filtration step to a FASTA file.
    2. Retrieve the O. mykiss and O. tshawytscha genomes.
    3. Using Bowtie2, create an index from the rainbow trout genome and align the pink loci that failed the HWE filter to it; do the same for the chinook genome.  

### Pink Data: 
The pink loci are a list of 1,481 loci that were removed from the data set because they did not meet the Hardy-Weinberg equilibrium (HWE) expectation (p > 0.05) in at least 50% of the populations (9 of the 18 populations). 
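
As a rough illustration of the filter described above, the sketch below flags a locus when its HWE test falls at or below 0.05 in at least half of the populations. The p-values and the function name `fails_hwe_filter` are hypothetical, and reading the 0.05 threshold as a p-value cutoff is an assumption; this is not the filtering code that was actually run.

```python
def fails_hwe_filter(p_values, alpha=0.05, min_fraction=0.5):
    """Return True if the locus is out of HWE in at least min_fraction of populations."""
    n_failed = sum(1 for p in p_values if p <= alpha)
    return n_failed >= min_fraction * len(p_values)

# hypothetical p-values for two loci across 18 populations
locus_a = [0.01] * 9 + [0.50] * 9   # out of HWE in 9 of 18 populations
locus_b = [0.01] * 8 + [0.50] * 10  # out of HWE in 8 of 18 populations

print(fails_hwe_filter(locus_a))  # 9 of 18 reaches the 50% cutoff
print(fails_hwe_filter(locus_b))  # 8 of 18 is below the cutoff
```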


### Rainbow trout genome: 
There are two rainbow trout genomes, and I aligned to both. 

  The older one is from 2014 and came from a paper published in Nature Communications: 
  
    Berthelot, C. et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat. Commun. 5:3657 doi: 10.1038/ncomms4657 (2014).

    Accession codes: Genome, transcriptome and miRNA sequence data for Oncorhynchus mykiss have been deposited in GenBank/EMBL/DDBJ sequence read archive (SRA) under the accession codes ERP003734, ERP003742 and SRP032774. The genome assembly has been deposited in the European Nucleotide Archive under the accession code CCAF000000000 and the project PRJEB4421.

   The newer genome is also listed on GenBank and was submitted in 2016:
   
    Gao, G. 2016. A New and Improved Rainbow Trout (Oncorhynchus mykiss) Reference Genome Assembly. International Conference on Integrative Salmonid Biology. 1: 40. 

    The Oncorhynchus mykiss whole genome shotgun (WGS) project has the project accession MSJN00000000.  This version of the project (01) has the accession number MSJN01000000, and consists of sequences MSJN01000001-MSJN01139799.


### Chinook genome:   
  
There are two copies of the chinook genome as well, and I believe they are the same. One is on our lab network drive (we got it directly from the authors), and the other is listed on GenBank. 

   From GenBank: 
   
    Chinook Salmon Genome and Transcriptome:  Submitted (01-NOV-2017) Marine Ecosystems and Aquaculture Division, Science Branch, Fisheries and Oceans Canada, 4160 Marine Drive, West Vancouver, BC V7V 1N6, Canada.  Christensen,K.A., Leong,J.S., Sakhrani,D., Minkley,D.R., Withler,R.E., Rondeau,E.B., Koop,B.F., Devlin,R.H.

    The Oncorhynchus tshawytscha whole genome shotgun (WGS) project has the project accession PEKY00000000.  This version of the project (01) has the accession number PEKY01000000, and consists of sequences PEKY01000001-PEKY01015945.

   From our network drive (I need more information about it): 
   
     The file is called Otsh_ver1.0_renamed.fasta; it is 1,735,490 KB and was last modified 11/07/2017.

## Step 1. Create a FASTA file for the loci that failed HWE. 
To create a FASTA file, use the list of loci that failed the HWE test in the filtering stage and the catalog file that has the sequences for each of the loci. I brought them both into Excel and used VLOOKUP to get the sequence for each tag that contains the SNP. 
Save the sheet as tab-delimited text, then use the following Python script to convert it to FASTA format. Be sure to hard-code the appropriate column numbers for your sheet in the script. 

In [None]:
## convert_catalog_tags_to_FASTA.py
## python plan for converting table s2 to FASTA format
## feb 18 2015
# carolyn tarpey (& garrett)

## this needs two arguments:
# the first is the name of the tab-delimited text file (exported from Excel) that needs converting
# the second is the name of the output file you want (the FASTA file)

import sys

# open both files; "with" closes them even if an error occurs
with open(sys.argv[1], "r") as excel_file, open(sys.argv[2], "w") as FASTA:
    for line in excel_file:  # read one line of the exported sheet at a time
        columns = line.rstrip("\r\n").split("\t")  # drop the line ending, then split on tabs
        if len(columns) < 4:
            continue  # skip blank or incomplete lines
        header = ">" + columns[0]  # ">" followed by the column with the locus name
        seq = columns[3]  # the column with the sequence
        print(header)
        print(seq)
        FASTA.write(header + "\n")  # write the FASTA header line
        FASTA.write(seq + "\n")  # write the sequence on the next line
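
The Excel lookup in Step 1 could also be scripted. The sketch below is a minimal illustration under assumptions, not the procedure actually used: it assumes the catalog has been reduced to (locus name, sequence) pairs and that the failed loci are a plain list of names. `catalog_rows`, `failed_loci`, and `lookup_to_fasta` are hypothetical names, and the sequences are made up.

```python
def lookup_to_fasta(failed_loci, catalog_rows):
    """Build FASTA text for the failed loci by looking up each sequence in the catalog."""
    catalog = dict(catalog_rows)  # locus name -> sequence, like a VLOOKUP table
    records = []
    for locus in failed_loci:
        if locus in catalog:  # loci missing from the catalog are skipped
            records.append(">" + locus + "\n" + catalog[locus] + "\n")
    return "".join(records)

# made-up example data
catalog_rows = [("locus_1", "ACGTACGT"), ("locus_2", "TTGGCCAA"), ("locus_3", "GGGGCCCC")]
failed_loci = ["locus_3", "locus_1"]

fasta_text = lookup_to_fasta(failed_loci, catalog_rows)
print(fasta_text)
```

Counting the `>` lines in the output is a quick way to confirm that the number of FASTA records matches the number of loci expected.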

## Step 2. Retrieve the genomes from GenBank
This one is really easy. Go to the GenBank website and search for the species names. 
Download any genomes in FASTA format, then unzip them into the folder that contains Bowtie2. 
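
If a download is gzip-compressed (a `.gz` file) and no unzip tool is handy, it can be decompressed with Python's standard library. This is only a sketch; `gunzip_file` is a hypothetical helper and the file names in the example call are placeholders, not the actual download names.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress a gzip file to dst_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

# example call (placeholder file names):
# gunzip_file("genome.fna.gz", "genome.fna")
```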

## Step 3. Use Bowtie2 to align the failed HWE pink loci to each of the genomes
First create an index out of each of the genomes, then align the FASTA file with the failed HWE loci to the index for each genome. Here is the code I used to create the indexes and align the HWE loci to each of the genomes: 

In [None]:
# these commands have to be run in the folder that holds the bowtie2 executables and the fasta files listed;
# they cannot be run from this notebook. I put them in a .bat file and ran it overnight

bowtie2-build ./CCAF01.fasta trout_genome
bowtie2 -f --local -x ./trout_genome -U ./loci_failed_HWE_FASTA.txt -S ./pink_HWE_to_trout.SAM

bowtie2-build ./Otsh_ver1_0_renamed.fasta chinook_genome_garrett
bowtie2 -f --local -x ./chinook_genome_garrett -U ./loci_failed_HWE_FASTA.txt -S ./pink_HWE_to_chinook_garrett.SAM

bowtie2-build ./GCF_002163495.1_Omyk_1.0_genomic.fna trout_genome_new
bowtie2 -f --local -x ./trout_genome_new -U ./loci_failed_HWE_FASTA.txt -S ./pink_HWE_to_trout_new.SAM

bowtie2-build ./GCA_002831465.1_CHI06_genomic.fna chinook_genome_new
bowtie2 -f --local -x ./chinook_genome_new -U ./loci_failed_HWE_FASTA.txt -S ./pink_HWE_to_chinook_new.SAM
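
The four build-and-align pairs above follow one pattern, so they can be generated from a table of genome files and index names. The sketch below only builds the command lists, using the same `bowtie2-build` and `bowtie2 -f --local -x ... -U ... -S ...` arguments as above. Actually running them is left commented out, and the helper name `bowtie2_commands` is hypothetical.

```python
import subprocess

READS = "./loci_failed_HWE_FASTA.txt"

# (genome FASTA, index name, output SAM) taken from the commands above
GENOMES = [
    ("./CCAF01.fasta", "trout_genome", "./pink_HWE_to_trout.SAM"),
    ("./Otsh_ver1_0_renamed.fasta", "chinook_genome_garrett",
     "./pink_HWE_to_chinook_garrett.SAM"),
    ("./GCF_002163495.1_Omyk_1.0_genomic.fna", "trout_genome_new",
     "./pink_HWE_to_trout_new.SAM"),
    ("./GCA_002831465.1_CHI06_genomic.fna", "chinook_genome_new",
     "./pink_HWE_to_chinook_new.SAM"),
]

def bowtie2_commands(genome_fasta, index_name, sam_out, reads=READS):
    """Return the index-building and alignment commands as argument lists."""
    build = ["bowtie2-build", genome_fasta, index_name]
    align = ["bowtie2", "-f", "--local", "-x", "./" + index_name,
             "-U", reads, "-S", sam_out]
    return build, align

for genome_fasta, index_name, sam_out in GENOMES:
    build, align = bowtie2_commands(genome_fasta, index_name, sam_out)
    print(" ".join(build))
    print(" ".join(align))
    # to run for real (requires Bowtie2 on the PATH):
    # subprocess.run(build, check=True)
    # subprocess.run(align, check=True)
```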


The HWE failed loci had different alignment rates to each of the four genomes, detailed here:

    Rainbow trout: 
       
       Old:
        1482 reads; of these:
          1482 (100.00%) were unpaired; of these:
          248 (16.73%) aligned 0 times
          229 (15.45%) aligned exactly 1 time
          1005 (67.81%) aligned >1 times
        83.27% overall alignment rate
         
        New:   
        1482 reads; of these:
          1482 (100.00%) were unpaired; of these:
          41 (2.77%) aligned 0 times
          105 (7.09%) aligned exactly 1 time
          1336 (90.15%) aligned >1 times
        97.23% overall alignment rate

    Chinook: 
       
       Old:
        1482 reads; of these:
          1482 (100.00%) were unpaired; of these:
          255 (17.21%) aligned 0 times
          291 (19.64%) aligned exactly 1 time
          936 (63.16%) aligned >1 times
        82.79% overall alignment rate
        
        New:  
        1482 reads; of these:
          1482 (100.00%) were unpaired; of these:
          79 (5.33%) aligned 0 times
          201 (13.56%) aligned exactly 1 time
          1202 (81.11%) aligned >1 times
        94.67% overall alignment rate
        
Both of the new genome versions aligned well to the pink loci. Because chinook salmon is more closely related to pink salmon than rainbow trout is, I'll use the alignments to the new chinook genome going forward.
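
A quick consistency check on these summaries: the three alignment categories should add up to the total number of reads, and the overall rate should equal the aligned fraction. The sketch below parses one summary (the old rainbow trout numbers copied from above) with regular expressions. `parse_bowtie2_summary` is a hypothetical helper, and the patterns assume the unpaired-read summary layout shown here.

```python
import re

def parse_bowtie2_summary(text):
    """Extract read counts from a Bowtie2 unpaired-read alignment summary."""
    total = int(re.search(r"(\d+) reads; of these:", text).group(1))
    zero = int(re.search(r"(\d+) \([\d.]+%\) aligned 0 times", text).group(1))
    once = int(re.search(r"(\d+) \([\d.]+%\) aligned exactly 1 time", text).group(1))
    multi = int(re.search(r"(\d+) \([\d.]+%\) aligned >1 times", text).group(1))
    return {"total": total, "zero": zero, "once": once, "multi": multi}

summary = """1482 reads; of these:
  1482 (100.00%) were unpaired; of these:
  248 (16.73%) aligned 0 times
  229 (15.45%) aligned exactly 1 time
  1005 (67.81%) aligned >1 times
83.27% overall alignment rate"""

counts = parse_bowtie2_summary(summary)
aligned = counts["once"] + counts["multi"]
rate = 100.0 * aligned / counts["total"]

print(counts["zero"] + aligned == counts["total"])  # do the categories sum to the total?
print(round(rate, 2))  # overall alignment rate recomputed from the counts
```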


# Results

## Pink HWE failed loci to the chinook genome

In [None]:
###Pysam needs to be run with anaconda. This notebook works on Ryan's Ubuntu
###using jupyter anaconda

In [None]:
%cd "/home/ipseg/Desktop/pink_to_chinook/NewLGalignment"

In [None]:
import pysam

In [None]:
#import pysam
import os.path
import numpy as np
import pandas as pd
from IPython.core.pylabtools import figsize
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)
sns.set_style("white")
%matplotlib inline

In [None]:
raw_sam_file = '/home/ipseg/Desktop/pink_to_chinook/NewLGalignment/pink_to_chinook.SAM'
filtered_sam_file = '/home/ipseg/Desktop/pink_to_chinook/NewLGalignment/pink_to_chinook_filtered.SAM.txt'

In [None]:
def get_aligns(sam_file = raw_sam_file):
    return(pysam.AlignmentFile(sam_file, "r").fetch())

### Mapping quality

In [None]:
mpqs = [read.mapping_quality for read in get_aligns()]

In [None]:
len(mpqs)

In [None]:
plt.hist(mpqs, bins = 20)
plt.xlabel('Mapping quality')
plt.ylabel('count')
plt.title('')
plt.show()

plt.hist(mpqs, bins = 20)
plt.xlabel('Mapping quality')
plt.ylabel('count')
plt.title('')
plt.xlim(2)
plt.ylim(0, 1500)
plt.show()

In [None]:
###Strand Bias

In [None]:
flags  = [read.flag for read in get_aligns()]
figsize(5,5)
plt.hist(flags, bins = 16)
plt.show()
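
The raw flag values are easier to interpret once decoded. In the SAM format, bit 0x4 marks an unmapped read and bit 0x10 marks a read aligned to the reverse strand. The sketch below tallies strands from a list of flags; the example flag values are made up and `strand_counts` is a hypothetical helper.

```python
from collections import Counter

def strand_counts(flags):
    """Tally unmapped, forward, and reverse-strand alignments from SAM flags."""
    counts = Counter()
    for flag in flags:
        if flag & 0x4:        # read is unmapped
            counts["unmapped"] += 1
        elif flag & 0x10:     # read aligned to the reverse strand
            counts["reverse"] += 1
        else:                 # mapped, forward strand
            counts["forward"] += 1
    return counts

example_flags = [0, 16, 4, 0, 16, 16]  # made-up values
print(strand_counts(example_flags))
```

With the real data, the same function could be applied to the `flags` list built above.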

In [None]:
###Alignment length

In [None]:
qal =  [read.query_alignment_length for read in get_aligns()]
plt.hist(qal)
plt.show()

In [None]:
###Edit Distances

In [None]:
edit_distances = list()
mq = list()
for read in get_aligns():
    try:
        # XM is Bowtie2's mismatch count; unaligned reads lack the tag and raise KeyError
        edit_distances.append(int(read.get_tag('XM')))
        mq.append(int(read.mapping_quality))
    except KeyError:
        pass

In [None]:
plt.hist(edit_distances)
plt.show()
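
Bowtie2 writes two related optional fields: `XM:i` is the number of mismatches in the alignment, while `NM:i` is the edit distance (mismatches plus inserted and deleted bases). The sketch below pulls the integer tags out of the optional fields of a SAM line without pysam. The SAM line is made up and `sam_int_tags` is a hypothetical helper.

```python
def sam_int_tags(sam_line):
    """Return the integer-typed optional fields of a SAM line as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        name, kind, value = field.split(":", 2)
        if kind == "i":
            tags[name] = int(value)
    return tags

# made-up alignment record: 8 bases, one mismatch, no gaps
line = ("locus_1\t0\tchr1\t100\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
        "\tAS:i:14\tXM:i:1\tNM:i:1\tMD:Z:3A4")
tags = sam_int_tags(line)
print(tags["XM"], tags["NM"])
```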

In [None]:
##Edit Distance vs. Mapping Quality

In [None]:
rr = pd.DataFrame({'ed' : edit_distances, 'mq' : mq })

In [None]:
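# bivariate density of mismatches (ed) against mapping quality (mq);
# recent seaborn versions take x/y column names and use fill= instead of shade=
sns.kdeplot(data=rr, x="ed", y="mq", cmap="Blues", fill=True)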
plt.show()

In [None]:
##let's only keep the alignments with: (query_alignment_length>=94) AND (mapping_quality>=30)

In [None]:
with pysam.AlignmentFile(raw_sam_file, "r") as INFILE:
    with pysam.AlignmentFile(filtered_sam_file, "wh", template=INFILE) as OUTFILE:
        for aln in INFILE:
            if (aln.query_alignment_length >= 94) and (aln.mapping_quality >= 30):
                OUTFILE.write(aln)
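
For reference, the two criteria in this filter can also be computed from the raw SAM columns. pysam's `query_alignment_length` is the aligned part of the read excluding soft-clipped bases, which corresponds to summing the CIGAR operations M, I, = and X. The sketch below applies the same kind of filter to plain text lines, with the thresholds passed explicitly. The records are made up, and `aligned_query_length` and `keep_alignment` are hypothetical helpers.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def aligned_query_length(cigar):
    """Number of query bases in the alignment, excluding soft/hard clips."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MI=X")

def keep_alignment(sam_line, min_length, min_mapq):
    """Apply the length and mapping-quality criteria to one SAM record."""
    fields = sam_line.split("\t")
    mapq, cigar = int(fields[4]), fields[5]
    if cigar == "*":  # unaligned reads have no CIGAR string
        return False
    return aligned_query_length(cigar) >= min_length and mapq >= min_mapq

# made-up records: a full-length hit, a soft-clipped hit, and a low-MAPQ hit
records = [
    "locus_1\t0\tchr1\t100\t44\t94M\t*\t0\t0\t*\t*",
    "locus_2\t16\tchr2\t500\t44\t10S84M\t*\t0\t0\t*\t*",
    "locus_3\t0\tchr3\t900\t1\t94M\t*\t0\t0\t*\t*",
]
kept = [r.split("\t")[0] for r in records
        if keep_alignment(r, min_length=94, min_mapq=30)]
print(kept)
```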

In [None]:
filtered_SAM = pd.read_csv(filtered_sam_file, sep = '\t', comment='@', engine='python', 
            names = ['QNAME','FLAG','RNAME','POS','MAPQ','CIGAR','RNEXT','PNEXT','ISIZE','SEQ','QUAL',
                     'TAG1', 'TAG2', 'TAG3', 'TAG4', 'TAG5', 'TAG6', 'TAG7','TAG8', 'TAG9','TAG10'])
filtered_SAM.head()

In [None]:
filtered_SAM.drop(['POS', 'RNEXT', 'PNEXT', 'ISIZE', 'SEQ','QUAL'], axis=1, inplace=True)
filtered_SAM['RNAME'] = [str(xx) for xx in filtered_SAM['RNAME'] ]
filtered_SAM.head()