# Week 3 - Bioinformatics Data Skills Notes
## Working with Sequence Data (p339-354)

## FASTA Format

FASTA format: each sequence entry consists of a description line beginning with '>', followed by the sequence on one or more lines.

e.g.

    >gene_00284728 length=231;type=dna
    GAGAACTGATTCTGTTACCGCAGGGCATTCGGATGTGCTAAGGTAGTAATCCATTATAAGTAACATGCGCGGAATATCCGGAGGTCATAGTCGTAATGCATAATTATTCCCTCCCTCAGAAGGACTCCCTTGCGAGACGCCAATACCAAAGACTTTCGTAGCTGGAACGATTGGACGGCCCAACCGGGGGGAGTCGGCTATACGTCTGATTGCTACGCCTGGACTTCTCTT

Subsequent sequences can be listed in the same file.

Naming convention: >[identifier][space][comment]
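
The layout above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the function names `parse_fasta` and `_split_header` are made up for illustration, and the second record in the example (`gene_2`) is hypothetical. It assumes the identifier ends at the first space and that a sequence may be wrapped over several lines.

```python
def _split_header(header):
    """Split 'identifier comment' at the first space (comment may be empty)."""
    identifier, _, comment = header.partition(" ")
    return identifier, comment


def parse_fasta(lines):
    """Yield (identifier, comment, sequence) tuples from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield _split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _split_header(header) + ("".join(chunks),)


# Hypothetical two-record input; the first sequence is wrapped over two lines.
example = """>gene_00284728 length=231;type=dna
GAGAACTGATTC
TGTTACCGCAGG
>gene_2
ACGT
"""
records = list(parse_fasta(example.splitlines()))
for identifier, comment, sequence in records:
    print(identifier, comment, len(sequence))
```

Joining the wrapped lines before yielding means downstream code always sees one string per sequence, whatever the line width of the file.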

## FASTQ Format

FASTQ format: each record has four parts:
1. The description line, beginning with '@'. This contains the record identifier and other information.
2. Sequence data, which can be on one or many lines.
3. A line beginning with '+', following the sequence line(s), which indicates the end of the sequence.
4. The quality data, which can also be on one or many lines, but must be the same length as the sequence. Each numeric base quality is encoded as an ASCII character.

e.g.

    @AZ1:233:B390NACCC:2:1203:7689:2153
    GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT
    +
    CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIIGIGIJIJIIIIJIJJIJIIH
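
The four parts listed above can be read as follows. This is a minimal sketch under a simplifying assumption: every record occupies exactly four lines (sequence and quality each on a single line), so multi-line records are not handled. The name `parse_fastq` is made up for illustration.

```python
from itertools import islice


def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from four-line FASTQ records."""
    # Assumption: blank lines carry no data and can be skipped.
    it = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        record = list(islice(it, 4))
        if not record:
            return  # end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


example = """@AZ1:233:B390NACCC:2:1203:7689:2153
GTTGTTCTTGATGAGCCATGAGGAAGGCATGCCAAATTAAAATACTGGTGCGAATTTAAT
+
CCFFFFHHHHHJJJJJEIFJIJIJJJIJIJJJJCDGHIIIGIGIJIJIIIIJIJJIJIIH
"""
reads = list(parse_fastq(example.splitlines()))
print(reads[0][0], len(reads[0][1]))
```

Checking that the sequence and quality strings have equal length catches truncated or corrupted records early.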

## Nucleotide Codes

- A
- T
- C
- G
- N = A/T/C/G
- Y = C/T
- R = A/G
- S = G/C
- W = A/T
- K = G/T
- M = A/C
- B = C/G/T
- D = A/G/T
- H = A/C/T
- V = A/C/G
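
One practical use of these codes is searching for a degenerate motif. The sketch below stores the table above as a dictionary and expands a motif into a regular expression. The names `IUPAC` and `iupac_to_regex`, and the motif `GAYR`, are chosen for illustration only.

```python
import re

# Ambiguity codes from the list above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "T": "T", "C": "C", "G": "G",
    "N": "ACGT", "Y": "CT", "R": "AG", "S": "GC", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def iupac_to_regex(motif):
    """Expand a motif with ambiguity codes into a regex character-class pattern."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


pattern = re.compile(iupac_to_regex("GAYR"))  # hypothetical motif
print(pattern.pattern)
print(bool(pattern.search("TTGATACC")))
```

Each ambiguity code becomes a character class, so `Y` matches either C or T at that position.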

## Base Qualities
Quality is encoded as a string of ASCII characters, each representing an integer between 0 and 127. In Python, the built-in `ord()` decodes each character of a quality string (stored here in `qual`) to its integer value.

    qual = "JJJJGJJIHHHFDFCC"
    [ord(b) for b in qual]
    # [74, 74, 74, 74, 71, 74, 74, 73, 72, 72, 72, 70, 68, 70, 67, 67]

### Converting between quality scores
Sanger, Illumina (versions 1.8 onward)

    ASCII character range = 33–126
    Offset = 33
    Quality score type = PHRED
    Quality score range = 0–93

Solexa, early Illumina (before 1.3)

    ASCII character range = 59–126
    Offset = 64
    Quality score type = Solexa
    Quality score range = -5 to 62

Illumina (versions 1.3–1.7)

    ASCII character range = 64–126
    Offset = 64
    Quality score type = PHRED
    Quality score range = 0–62

To convert Sanger-encoded qualities to PHRED quality scores, subtract the offset of 33 from each ASCII value. A PHRED score Q then converts to an error probability P = 10^(-Q/10).

    phred = [ord(b) - 33 for b in qual]
    phred
    # [41, 41, 41, 41, 38, 41, 41, 40, 39, 39, 39, 37, 35, 37, 34, 34]

    [10**(-q/10) for q in phred]
    # [7.943282347242822e-05,
    #  7.943282347242822e-05,
    #  7.943282347242822e-05,
    #  7.943282347242822e-05,
    #  0.00015848931924611142,
    #  7.943282347242822e-05,
    #  7.943282347242822e-05,
    #  0.0001,
    #  0.00012589254117941674,
    #  0.00012589254117941674,
    #  0.00012589254117941674,
    #  0.00019952623149688788,
    #  0.00031622776601683794,
    #  0.00019952623149688788,
    #  0.00039810717055349735,
    #  0.00039810717055349735]
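
The same steps generalize to the other offsets. The sketch below wraps them in small helper functions. The function names are made up for illustration, and `reencode` assumes both encodings hold PHRED-type scores (for example Illumina 1.3–1.7 to Sanger), so it should not be applied to Solexa-type scores without an additional score conversion.

```python
def decode_qualities(qual, offset=33):
    """Convert an ASCII quality string to integer scores using the given offset."""
    return [ord(c) - offset for c in qual]


def error_probabilities(phred_scores):
    """Convert PHRED scores Q to error probabilities P = 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in phred_scores]


def reencode(qual, from_offset=64, to_offset=33):
    """Shift a PHRED quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in qual)


sanger_qual = "JJJJGJJIHHHFDFCC"
scores = decode_qualities(sanger_qual, offset=33)
probs = error_probabilities(scores)
print(scores)
print(max(probs))  # worst (highest) per-base error probability

# A hypothetical offset-64 quality string moved to offset 33.
print(reencode("hhhh", from_offset=64, to_offset=33))
```

Keeping the offset as a parameter makes the encoding explicit at every call site, since decoding with the wrong offset produces plausible-looking but wrong scores.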

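    !sickle se -f untreated1_chr4.fq -t sanger -o untreated1_chr4_sickle.fq

    ****Error: Could not open input file 'untreated1_chr4.fq'.

The error means sickle could not open `untreated1_chr4.fq` at the given path, so the command has to be run where that file is accessible.

Both sickle and seqtk are used to trim low-quality bases.

sickle (`se` = single-end mode)

    -f = input file
    -t = quality type (here `sanger`)
    -o = output file

seqtk trimfq

samtools is used for indexing FASTA files (`samtools faidx`).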