# Extracting N2 3' UTR regions from other genomes

# Blasting N2 3' UTR regions to CB4856

### 1) Download the CB4856 genome from the WormBase FTP site:  
ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA275000/sequence/genomic/

I believe PRJNA275000 is the CB4856 assembly. I chose the unmasked WS268 assembly because we think WS268 matches the release used by CENDR:
`c_elegans.PRJNA275000.WS268.genomic.fa.gz`

### 2) Convert the genome sequence into a local BLAST database. Requires BLAST+ to be installed.

In [4]:
gunzip ../Genome_seqs/c_elegans.PRJNA275000.WS268.genomic.fa.gz
makeblastdb -in ../Genome_seqs/c_elegans.PRJNA275000.WS268.genomic.fa -out "../Genome_seqs/CB4856_WS268_genomic" -taxid 6239 -dbtype nucl



Building a new DB, current time: 05/06/2019 13:55:24
New DB name:   /home/ksilliman/Projects/PD_RNAworms/Genome_seqs/CB4856_WS268_genomic
New DB title:  ../Genome_seqs/c_elegans.PRJNA275000.WS268.genomic.fa
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 7 sequences in 0.809663 seconds.
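
As a quick sanity check on the `added 7 sequences` line above, one could count the header lines in the unzipped FASTA file and compare. This is only a minimal sketch; `count_fasta_records` is a hypothetical helper name, not part of BLAST+.

```python
def count_fasta_records(path):
    """Count records in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example call (path taken from the makeblastdb command above):
# count_fasta_records("../Genome_seqs/c_elegans.PRJNA275000.WS268.genomic.fa")
```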


### 3) Download the 3' UTR sequences from ParaSite through WormBase, as described on the [WormBase FAQ page](https://www.wormbase.org/about/Frequently_asked_questions).

In [33]:
# Set the variable name for the ParaSite download and the filtered fasta file
# Change these anytime you download a new file 
UTR = "martquery_0506213417_153.txt"
UTRf = "UTR3_3417_WS269_filt.fa"

In [18]:
%expand
# Unzip 3'UTR download
gunzip {UTR}.gz

### 4) For some reason, the download includes a lot of sequences that say "Sequence unavailable". These need to be removed.

In [19]:
%expand
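# Remove all records whose sequence is "Sequence unavailable". Print the number of records kept.
IN = open("{UTR}","r")
OUT = open("{UTRf}","w")

n = 0
seq = ""
header = ""
for line in IN:
    if line.startswith(">"):
        # New header: write out the previous record if it had a real sequence
        if seq != "":
            OUT.write(header+seq+"\n")
            n += 1
            seq = ""
        header = line
    else:
        if "unavailable" not in line:
            seq = seq + line.strip()
        else:
            seq = ""
# Write the final record, which no later header triggers
if seq != "":
    OUT.write(header+seq+"\n")
    n += 1
IN.close()
OUT.close()
print(n)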

25978


In [20]:
%expand
head {UTRf}

>WBGene00003525|17486871|17486981|4R79.1a.1
TCTTCTATCTGGTGTTATTTATTTTGTTGCTTATTGTTCCATGACGTGTGTATAATGTAATTCTGAAAGCCAATTTTTTCATTTTTTGAAAATATTTATATAATTTATACT
>WBGene00004098|10383977|10384063|AC3.4.1
ACATCGAATGCGTAACTTTGACATCAGTTCTCTGTATATATGACACAATTTTCTCATTTTTTTCACAATAAATAATAATAATGCTTG
>WBGene00007071|10393357|10393504|AC3.5a.2
ATGAATTTCCATACAATGACAAAAACTATTAGTGACAGATAACATAAACACTTGATTTTATTTATTAATGTGAAACCGGTCAGAGTTCATAATTTTTGTTGTAACTTGTGTTTGCCTCAACATTGAATAAAATGTTTATAAATCGGAC
>WBGene00007072|10397735|10397817|AC3.7.1
TTTTAAAAAGTTTTATTTGCTATCAATTTGTATCTCTTGTTGATTTAATTCATATTTGAGCCTTAATAAACTGTCTAATCTGC
>WBGene00000024|10380340|10380425|AC3.3.1
ACATCGAATGCGTAACTTTGACATCAGTTCTCTGTATATATGACACAATTTTTTTTCTTTTTTTTCACAATAACATTGCTTGAAAT
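
The headers above appear to follow the pattern `gene ID|start|end|transcript ID`. The field meanings are an assumption here, based on the observation that `end - start + 1` equals the sequence length for the records shown. A minimal sketch for splitting a header into named fields (`parse_utr_header`, `UtrHeader`, and the field names are hypothetical):

```python
from collections import namedtuple

# Field names are illustrative assumptions, not ParaSite terminology
UtrHeader = namedtuple("UtrHeader", ["gene_id", "start", "end", "transcript_id"])


def parse_utr_header(header):
    """Split a header like '>WBGene00004098|10383977|10384063|AC3.4.1' into fields."""
    gene_id, start, end, transcript_id = header.strip().lstrip(">").split("|")
    return UtrHeader(gene_id, int(start), int(end), transcript_id)


rec = parse_utr_header(">WBGene00004098|10383977|10384063|AC3.4.1")
# Assuming 1-based inclusive coordinates, the span length is end - start + 1
print(rec.transcript_id, rec.end - rec.start + 1)  # prints: AC3.4.1 87
```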


### 5) Blast N2 3' UTR sequences against the CB4856 genome.
Uses an e-value threshold of 1e-3, requires each HSP to cover at least 80% of the query sequence, and keeps at most 3 HSPs per query-subject pair (`-max_hsps 3`).

In [34]:
%expand
blastn -query {UTRf} -task megablast -outfmt 10 -word_size 11 \
-db ../Genome_seqs/CB4856_WS268_genomic -out 3UTR_blast.out \
-num_threads 2 -max_hsps 3 -evalue 1e-3 -qcov_hsp_perc 80
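
For reference, the 80% filter applies per HSP: the aligned stretch of the query, divided by the full query length, has to reach 80%. The sketch below shows that arithmetic under the assumption of 1-based inclusive query positions. `query_coverage` is a hypothetical helper and the formula is an interpretation of `-qcov_hsp_perc`, not code taken from BLAST+.

```python
def query_coverage(qstart, qend, query_length):
    """Percent of the query covered by one HSP (1-based inclusive positions)."""
    return 100.0 * (qend - qstart + 1) / query_length


# An HSP spanning query positions 1-73 of an 87-nt query
cov = query_coverage(1, 73, 87)
print(round(cov, 1), cov >= 80)  # prints: 83.9 True
```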

In [38]:
# Look at output of blast
# Columns are: Fasta header (query),
# aligned chromosome,
# percent identity,
# alignment length,
# # of mismatches,
# # of gap openings,
# alignment start pos in query,
# alignment end pos in query,
# alignment start pos in target,
# alignment end pos in target,
# evalue,
# bit score
head -n 30 3UTR_blast.out

WBGene00003525|17486871|17486981|4R79.1a.1,IV,100.00,111,0,0,1,111,17178578,17178468,8e-53,206
WBGene00004098|10383977|10384063|AC3.4.1,V,100.00,87,0,0,1,87,10068540,10068626,1e-39,161
WBGene00004098|10383977|10384063|AC3.4.1,V,91.78,73,4,1,1,73,10064988,10064918,3e-21,100
WBGene00007071|10393357|10393504|AC3.5a.2,V,100.00,148,0,0,1,148,10077922,10078069,3e-73,274
WBGene00007072|10397735|10397817|AC3.7.1,V,100.00,83,0,0,1,83,10082300,10082382,2e-37,154
WBGene00000024|10380340|10380425|AC3.3.1,V,94.52,73,2,1,1,73,10068540,10068610,1e-24,111
WBGene00004964|10385685|10385711|AC3.10.1,V,100.00,27,0,0,1,27,10070248,10070274,3e-07,51.0
WBGene00007063|4664|4717|2L52.1a.1,II,100.00,54,0,0,1,54,4663,4716,1e-21,100
WBGene00007073|10401223|10401307|AC3.8.1,V,98.82,85,1,0,1,85,10085789,10085873,7e-37,152
WBGene00007070|10374208|10374245|AC3.2.1,V,100.00,38,0,0,1,38,10058907,10058944,6e-13,71.3
WBGene00007071|10393357|10393504|AC3.5a.1,V,100.00,148,0,0,1,148,10077922,10078069,3e-73,274
WBGene000070

Some N2 3' UTR loci align with high confidence to multiple regions of the CB4856 genome. For example, transcript AC3.4.1 aligns to two locations: one alignment is a 100% match (e-value 1e-39) and the other has 4 mismatches and one gap (e-value 3e-21). When I look at the polymorphism viewer on WormBase for AC3.4.1, I see that CB4856 has a single SNP in the 3' UTR region, so neither of these BLAST results accurately reflects that.
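
To list every query that behaves like AC3.4.1, one option is to group the BLAST output by query ID and keep the queries with more than one row. This is a minimal sketch, assuming the default 12-column `-outfmt 10` layout; `multi_hit_queries` is a hypothetical helper and the sample rows are copied from the output above.

```python
import csv
from collections import defaultdict

# Default column order for -outfmt 10
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def multi_hit_queries(lines):
    """Return {query ID: [hit rows]} for queries with more than one alignment."""
    hits = defaultdict(list)
    for row in csv.DictReader(lines, fieldnames=COLUMNS):
        hits[row["qseqid"]].append(row)
    return {q: rows for q, rows in hits.items() if len(rows) > 1}


# Three rows copied from the BLAST output above
sample = [
    "WBGene00003525|17486871|17486981|4R79.1a.1,IV,100.00,111,0,0,1,111,17178578,17178468,8e-53,206",
    "WBGene00004098|10383977|10384063|AC3.4.1,V,100.00,87,0,0,1,87,10068540,10068626,1e-39,161",
    "WBGene00004098|10383977|10384063|AC3.4.1,V,91.78,73,4,1,1,73,10064988,10064918,3e-21,100",
]
for query, rows in multi_hit_queries(sample).items():
    print(query, [(r["sstart"], r["send"], r["pident"]) for r in rows])
```

On the full file, the same function can be given an open file handle, for example `multi_hit_queries(open("3UTR_blast.out"))`.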

# Using CENDR data