# **Sequence Analysis**
## Fundamentals of Computational Biology
## *by Prof. Javier C. Alvarez*

![intro](ClusterHenry.gif)

## **Index**
1. [Introduction](#Introduction)
2. [Sanger Sequencing method](#2.-Principio-del-Secuenciamiento-Sanger)
    1. [Surfing in raw data](#A.-Tipología-de-datos-Sanger)
    2. [Phred Phrap Score](#B.-Phred&Phrap-Score)
    3. [Trimming software](#C.-Programas-para-manipular-secuencias-Sanger)
    4. [Scripting](#D.-Scripting)
    5. Exercise with Sanger sequences

## Introduction
Evolution of sequencing costs over time
![moore](moore.png)

Amount of data stored in GenBank
![genbank](genbank.png)

<a id="2.-Principio-del-Secuenciamiento-Sanger"></a>
## 2. Principle of Sanger Sequencing
### Dideoxy method
![dideoxy](dideoxy.png)


### Sanger method (1975)
![sanger](sanger1975.png)

<a id="A.-Tipología-de-datos-Sanger"></a>
### A. Types of Sanger data
Typically, when you order a Sanger sequencing service, you send a tube with the amplicon and the two primers used (forward and reverse). In return you receive four files per primer:<br>
sample1_fw.txt -> Sequence in FASTA format <br>
sample1_fw.ab1 -> Binary file with the raw data (the chromatogram traces); this is the file of interest <br>
sample1_fw.phd1 -> phred output with the base calls and their quality values <br>
sample1_fw.pdf -> Printable version of the results <br>
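
Because the `.ab1` file is binary, a quick way to confirm that a file really is a trace file is to look at its first bytes. The sketch below assumes the ABIF layout, in which a file starts with the ASCII signature `ABIF` followed by a 16-bit big-endian version number; the function name `abif_version` and the synthetic header are illustrative, not part of any library.

```python
import struct


def abif_version(data):
    """Return the ABIF version number, or None if the signature is missing."""
    if len(data) < 6 or data[:4] != b"ABIF":
        return None
    # Assumption: the version is a big-endian signed 16-bit integer at bytes 4-5
    (version,) = struct.unpack(">h", data[4:6])
    return version


# Synthetic header built for illustration only. With a real file one would
# pass the first six bytes instead, e.g. open(path, "rb").read(6).
fake_header = b"ABIF" + struct.pack(">h", 101)
print(abif_version(fake_header))
print(abif_version(b">S6_27F"))  # a FASTA header is not an ABIF signature
```

A check like this can run before handing the file to a full parser, so that a FASTA or PDF file passed by mistake is reported early.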

In [13]:
%%bash
cat raw_data/Sanger/S6_27F.txt

>170407-040_A07_S6_27F.ab1	1271
CGACAAATTAAAAGCGTACTAGACATGCAGTCGAACGAACTCTGGTATTG
ATTGGTGCTTGCATCATGATTTACATTTGAGAGAGTGGCGAACTGGTGAG
TAACACGTGGGAAACCTGCCCAGAAGCGGGGGATAACACCTGGAAACAGA
TGCTAATACCGCATAACAACTTGGACCGCATGGTCCGAGCTTGAAAGATG
GCTTCGGCTATCACTTTTGGATGGTCCCGCGGCGTATTAGCTAGATGGTG
GGGTAACGGCTCACCATGGCAATGATACGTAGCCGACCTGAGAGGGTAAT
CGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAG
TAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGA
GTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACATAT
CTGAGAGTAACTGTTCAGGTATTGACGGTATTTAACCAGAAAGCCACGGC
TAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCG
GATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGA
AAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGT
GCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATAT
ATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCT
GAGGCTCGAAAGTATGGGTAGCAAACAGGATTAGATACCCTGGTAGTCCA
TACCGTAAACGATGAATGCTAAGTGTTGGAGGGTTTCCGCCCTTCAGTGC
TGCAGCTAACGCATTAAGCATTCCGCCTGGGGAGTACGGCCGCAAGGCTG
AAACTCAAAGGAATTGACGGGGGCCCGCACA
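
The FASTA layout shown above (a header line starting with `>` followed by the sequence wrapped over several lines) is simple enough to read by hand. The following is a minimal sketch of such a reader; `parse_fasta` and the shortened example record are hypothetical, and in practice a library parser such as Biopython's `SeqIO` is the safer choice.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # one wrapped chunk of the sequence
    return {name: "".join(chunks) for name, chunks in records.items()}


# Hypothetical record, shortened for illustration
example = ">read1 demo\nCGACAAATTA\nAAAGCGTACT\n"
parsed = parse_fasta(example)
for name, sequence in parsed.items():
    print(name, len(sequence), sequence)
```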

In [5]:
%%bash
less raw_data/Sanger/S6_27F.ab1

"raw_data/Sanger/S6_27F.ab1" may be a binary file.  See it anyway? 

### B. Phred&Phrap Score
The phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base.

The quality value is a log-transformed error probability, specifically

$$Q = -10 \log_{10}(P_e)$$

where $Q$ and $P_e$ are, respectively, the quality value and the error probability of a particular base call.
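
The relation between the quality value and the error probability can be checked numerically. This is a small sketch of the formula in both directions; the function names are illustrative and not taken from phred.

```python
import math


def phred_to_error_prob(q):
    """Error probability Pe for a quality value Q: Pe = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality value Q for an error probability Pe: Q = -10 * log10(Pe)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: Pe={p:g}, accuracy={100 * (1 - p):g}%")
```

Each step of 10 in the quality value divides the error probability by 10.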

The phred quality values have been thoroughly tested for both accuracy and power to discriminate between correct and incorrect base-calls.

Phred can use the quality values to perform sequence trimming.<br>
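
To illustrate how quality values can drive trimming, the sketch below keeps the stretch of the read whose summed score `cutoff - Pe` is highest (a Mott-style maximum-scoring segment). It is an illustration under stated assumptions, not necessarily the procedure phred itself applies: the function name `mott_trim`, the default `cutoff` of 0.05, and the toy read are all hypothetical.

```python
def mott_trim(qualities, cutoff=0.05):
    """Return (start, end) of the best-scoring slice; end is exclusive."""
    best_score, best_start, best_end = 0.0, 0, 0
    score, start = 0.0, 0
    for i, q in enumerate(qualities):
        # Bases with Pe below the cutoff add to the score, worse bases subtract
        score += cutoff - 10 ** (-q / 10)
        if score <= 0:
            # The running segment is no longer worth keeping: restart after i
            score, start = 0.0, i + 1
        elif score > best_score:
            best_score, best_start, best_end = score, start, i + 1
    return best_start, best_end


# Toy read: low-quality ends around a high-quality core
seq = "NNACGTNN"
quals = [5, 5, 30, 30, 30, 30, 5, 5]
start, end = mott_trim(quals)
print(start, end, seq[start:end])
```

If no base is better than the cutoff, the function returns `(0, 0)`, meaning that the whole read is discarded.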

![phred](./phred.png)

Here q is the phred value and P is the probability that a base call is wrong:<br>
phred value = 20 => 1 wrong base in 100 (99% accuracy)<br>
phred value = 30 => 1 wrong base in 1000 (99.9% accuracy)<br>

More info:<br>
http://www.phrap.org/phredphrapconsed.html

![chroma](chroma_sanger.png)

<a id="C.-Programas-para-manipular-secuencias-Sanger"></a>
### C. Software for handling Sanger sequences
* BioEdit. GUI. Open license. Works well on Windows (http://www.mbio.ncsu.edu/BioEdit/page2.html)
* Geneious. GUI. Paid, 15-day trial. (https://www.geneious.com)
* Artemis. GUI. Open. Java. (https://www.sanger.ac.uk/science/tools/artemis)
* Phred/Phrap. Console, scriptable. Open. Difficult to install on Linux (http://www.phrap.org/phredphrapconsed.html)
* Custom scripts ([BioPerl](http://bioperl.org/howtos/SeqIO_HOWTO.html), [Biopython](http://biopython.org/wiki/SeqIO), [sangerseqR](https://bioconductor.org/packages/release/bioc/html/sangerseqR.html))

### D. Scripting

In [6]:
#To install the libraries with the tools for Sanger sequence analysis
#(BiocManager replaces the retired biocLite installer):
#install.packages("BiocManager")
#BiocManager::install(c("sangerseqR", "seqinr"))
library("sangerseqR")
#Read the forward trace file
secuenciaF <- readsangerseq("raw_data/Sanger/S6_27F.ab1")
print(secuenciaF)

Number of datapoints: 16196
Number of basecalls: 1271

Primary Basecalls: CGACAAATTAAAAGCGTACTAGACATGCAGTCGAACGAACTCTGGTATTGATTGGTGCTTGCATCATGATTTACATTTGAGAGAGTGGCGAACTGGTGAGTAACACGTGGGAAACCTGCCCAGAAGCGGGGGATAACACCTGGAAACAGATGCTAATACCGCATAACAACTTGGACCGCATGGTCCGAGCTTGAAAGATGGCTTCGGCTATCACTTTTGGATGGTCCCGCGGCGTATTAGCTAGATGGTGGGGTAACGGCTCACCATGGCAATGATACGTAGCCGACCTGAGAGGGTAATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACATATCTGAGAGTAACTGTTCAGGTATTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGTATGGGTAGCAAACAGGATTAGATACCCTGGTAGTCCATACCGTAAACGATGAATGCTAAGTGTTGGAGGGTTTCCGCCCTTCAGTGCTGCAGCTAACGCATTAAGCATTCCGCCTGGGGAGTACGGCCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCC

In [7]:
#Base calling: ratio is the signal/noise cutoff, relative to the tallest peak in each base-call window
seqcalls <- makeBaseCalls(secuenciaF, ratio = 0.33)
chromatogram(seqcalls, width = 100, height = 2, trim5 = 70, trim3 = 100,
             showcalls = "both", showtrim = TRUE, filename = "seqF.pdf")

Chromatogram saved to seqF.pdf in the current working directory
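
The `trim5` and `trim3` arguments above mark a fixed number of bases at each end of the read. The same fixed-length trimming can be sketched on a plain string; `trim_fixed` is a hypothetical helper, and the all-`A` read is a stand-in with the same number of base calls (1271) as the forward read.

```python
def trim_fixed(sequence, trim5=0, trim3=0):
    """Drop trim5 bases from the 5' end and trim3 bases from the 3' end."""
    if trim5 < 0 or trim3 < 0:
        raise ValueError("trim lengths must be non-negative")
    end = len(sequence) - trim3
    # max() keeps the slice empty (not reversed) when the trims overlap
    return sequence[trim5:max(end, trim5)]


# Stand-in read with 1271 bases, trimmed with the same values as above
read = "A" * 1271
kept = trim_fixed(read, trim5=70, trim3=100)
print(len(kept))
```

Fixed trimming is simple but ignores the actual quality of each base, so the values usually come from inspecting the chromatogram first.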

In [2]:
from IPython.display import IFrame
IFrame("seqF.pdf", width=600, height=300)

In [13]:
#Extract the primary sequence as a character string
Forward <- primarySeq(seqcalls, string=TRUE)
print(Forward[1])

[1] "TGAGGGGCAGCGGACTAGTCATGCAGTCGTACGATCTCTGGTATTGATTGGTGCTTGCATCATGATTTACATTTCAGTGAGTGGCGAACTGGTGAGTAACACGTGGGAAACCTGCCCAGTAGCGGGGGATAACACCTGGAAACAGATGCTAATACCGCATAACAACTTGGACCGCATGGTCCGAGCTTGAAAGATGGCTTCGGCTATCACTTTTGGATGGTCCCGCGGCGTATTAGCTAGATGGTGGGGTAACGGCTCACCATGGCAATGATACGTAGCCGACCTGAGAGGGTAATCGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAACTCTGTTGTTAAAGAAGAACATATCTGAGAGTAACTGTTCAGGTATTGACGGTATTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAATGCGGCTGTCTGGTCTGTCACTGACGCTGACGCTCGAAAGTATGGGTAGCAAACAGATTAGATACCCTGATAGTCCATACCGTACACGATGAGTGCTGAGTGTTGCACGGATCCGCCATTCAGTGCTGCAGCTCACGCAGTAAGCACTCCGTCTGAGGAGTACAGTCGCACCGCTGAAACTCAGAAGAGGTGACAGCGGTTCGCACTAGCAGTAGAGCATGTGGCTCCACTCGTAGCTACGCGTAGTATCGTAGCATGTCATGACATACTATGCGTA

In [1]:
#Required library
from Bio import SeqIO
#To install Biopython: %pip install biopython
#Variables that will hold the forward and reverse reads
seqF = None
seqR = None
#Parse the .ab1 trace files (binary, so they are opened in "rb" mode)
with open("raw_data/Sanger/S6_27F.ab1", "rb") as handle:
    for record in SeqIO.parse(handle, "abi"):
        print(record.id + " procesada!")
        seqF = record

with open("raw_data/Sanger/S6_907R.ab1", "rb") as handle:
    for record2 in SeqIO.parse(handle, "abi"):
        print(record2.id + " procesada!")
        seqR = record2
        #Reverse complement of the reverse read. Note that rc is not written below;
        #use it instead of seqR if both reads are needed in the same orientation
        rc = seqR.reverse_complement(id=record2.id)

sequences = [seqF, seqR]
#Write the output file. It can hold several records (multi-FASTA)
SeqIO.write(sequences, "S6.fasta", "fasta")
#SeqIO.read only accepts single-record files; for a multi-FASTA like this one use SeqIO.parse
#list(SeqIO.parse("S6.fasta", "fasta"))

S6_27F procesada!
S6_907R procesada!


2
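
The reverse primer reads the opposite strand, so the reverse read has to be reverse-complemented before it can be compared base by base with the forward read. The cell above delegates this to Biopython's `reverse_complement`; the sketch below shows the same idea on plain strings, with `COMPLEMENT` and the short example sequence chosen only for illustration (ambiguity codes other than `N` are not handled).

```python
# Translation table for the four bases plus N, in upper and lower case
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


example = "ATGCCN"
print(reverse_complement(example))
# Applying the operation twice gives back the original sequence
print(reverse_complement(reverse_complement(example)) == example)
```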

In [34]:
%%bash
cat S6.fasta

>S6_27F
CGACAAATTAAAAGCGTACTAGACATGCAGTCGAACGAACTCTGGTATTGATTGGTGCTT
GCATCATGATTTACATTTGAGAGAGTGGCGAACTGGTGAGTAACACGTGGGAAACCTGCC
CAGAAGCGGGGGATAACACCTGGAAACAGATGCTAATACCGCATAACAACTTGGACCGCA
TGGTCCGAGCTTGAAAGATGGCTTCGGCTATCACTTTTGGATGGTCCCGCGGCGTATTAG
CTAGATGGTGGGGTAACGGCTCACCATGGCAATGATACGTAGCCGACCTGAGAGGGTAAT
CGGCCACATTGGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTAGGGAATCT
TCCACAATGGACGAAAGTCTGATGGAGCAACGCCGCGTGAGTGAAGAAGGGTTTCGGCTC
GTAAAACTCTGTTGTTAAAGAAGAACATATCTGAGAGTAACTGTTCAGGTATTGACGGTA
TTTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAA
GCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGA
AAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGG
ACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCG
AAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGTATGGGTAGCAAACAGGA
TTAGATACCCTGGTAGTCCATACCGTAAACGATGAATGCTAAGTGTTGGAGGGTTTCCGC
CCTTCAGTGCTGCAGCTAACGCATTAAGCATTCCGCCTGGGGAGTACGGCCGCAAGGCTG
AAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAG
CTACGCGAAGAACCTT