# Week 2: Data Cleaning & Detective Work

### #coralcrew > Holmes + Watson

In [28]:
import Bio
from Bio.Seq import Seq

You've just seen some common data errors -- anomalies in raw data that might trip up the data analysis process. Luckily, there are many ways to correct for these errors. We'll be designing solutions to the errors we encountered during our detection exercise. 

We'll be prototyping our detection functions with randomly generated nucleotide sequences, then giving them a test run with some nucleotide sequence from _Acanthopathes thyoides_, a black coral that I've been studying. Eventually, our data-cleaning processes will become part of a larger pipeline we'll implement to take our data from its raw form to graphics and analysis.

### to recap, some common errors in raw data: 

* Missing nucleotide bases

* Bizarre IUPAC codes

* Stop codon in middle of sequence

* Sequence length not a multiple of 3

* Low complexity region
 

### Missing nucleotide bases
Blank spaces aren't going to tell us anything useful about our genome -- except, maybe, that our sequencer wasn't able to read some portion of our specimen. We'll be writing a function that recognizes these blank spaces and alerts us to their presence and location. 
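
As a point of comparison for the exercise, here is one minimal sketch of the idea. The name `find_missing` and the choice of gap characters (`-`, `.`, and a literal space) are assumptions made for illustration, not part of the exercise. Note that `N` is an ambiguity code rather than a gap, so this sketch does not count it as missing.

```python
def find_missing(input_seq, gap_chars="-. "):
    """Return the 0-based positions of gap characters in a sequence."""
    # str() lets this accept a plain string or a Biopython Seq object
    seq = str(input_seq)
    positions = [i for i, base in enumerate(seq) if base in gap_chars]
    if positions:
        print(f"Warning: {len(positions)} missing base(s) at positions {positions}")
    return positions


gaps = find_missing("ATGCTA-GTN--TAG")
```

Returning the positions as a list, rather than only printing them, makes it easier to reuse the result later in a pipeline.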

In [None]:
my_seq = Seq('ATGCTA-GTN--TAG') #feel free to change this sequence as needed!


def missing_seq(input_seq):
    # your code below!
    pass  # placeholder so the cell runs; replace with your code
    
    
    

In [20]:
%%html
<style>
table {float:left}
</style>

### Bizarre IUPAC codes

| IUPAC nucleotide code | Base |
| :--- | :--- |
| A | Adenine |
| T | Thymine |
| C | Cytosine |
| G | Guanine |
| R | A or G |
| Y | C or T |
| S | G or C |
| W | A or T |
| K | G or T |
| M | A or C |
| B | C or G or T |
| D | A or G or T |
| H | A or C or T |
| V | A or C or G |
| N | any base |
| . or - | gap |
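
One way to use the table above is to sort every character of a sequence into three groups: plain bases (A, T, C, G), valid but ambiguous IUPAC codes, and characters that are not IUPAC codes at all. The sketch below is only an illustration; the function name `check_iupac` and its return format are assumptions.

```python
# Characters taken from the IUPAC table above
IUPAC_CODES = set("ATCGRYSWKMBDHVN.-")
UNAMBIGUOUS = set("ATCG")


def check_iupac(input_seq):
    """Return (ambiguous, invalid) lists of (position, character) pairs."""
    seq = str(input_seq).upper()
    ambiguous = []  # valid IUPAC codes that are not a single definite base
    invalid = []    # characters that do not appear in the IUPAC table
    for i, char in enumerate(seq):
        if char not in IUPAC_CODES:
            invalid.append((i, char))
        elif char not in UNAMBIGUOUS:
            ambiguous.append((i, char))
    return ambiguous, invalid


ambiguous, invalid = check_iupac("ATGRNX-")
```

Keeping the two lists separate lets you decide later whether ambiguous codes should be tolerated while invalid characters are rejected outright.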


### Stop codon in middle of sequence

Uh oh -- what does an early stop mean? There are a couple of likely explanations:

* We're not starting our translation at the right base
* We've included extraneous bases in our sequence, and the gene does, in fact, terminate before the end of our sequence

Give yourself a random string to experiment on -- say, ATCTAAGTACTAGCT -- and see if you can code up a function that will return an error message when it encounters any of the three standard stop codons: 'TAA', 'TAG', or 'TGA'.

In [29]:
my_seq = Seq('ATCTAAGTACTAGCT')

def check_stop(input_seq):
    ## your code here!
    pass  # placeholder so the cell runs; replace with your code


check_stop(my_seq)
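
For comparison, here is one possible sketch. It assumes the sequence is read from its first base and treats a stop in the final complete codon as expected, so only earlier stops are reported. The name `find_internal_stops` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def find_internal_stops(input_seq):
    """Return (position, codon) pairs for in-frame stops before the last codon."""
    seq = str(input_seq).upper()
    n_codons = len(seq) // 3  # complete codons only; leftover bases are ignored
    stops = []
    # skip the final complete codon, where a stop is normal
    for k in range(n_codons - 1):
        codon = seq[3 * k:3 * k + 3]
        if codon in STOP_CODONS:
            stops.append((3 * k, codon))
    if stops:
        print(f"Warning: early stop codon(s) found: {stops}")
    return stops


early_stops = find_internal_stops("ATCTAAGTACTAGCT")
```

Stepping through the sequence three bases at a time matters here: the test string also contains 'TAG' starting at position 10, but that one is out of frame, so a plain substring search would report it while this sketch does not.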

### Sequence length not a multiple of 3

Remember the warning Biopython raised when you attempted to translate a nucleotide sequence whose length wasn't a multiple of 3? It did return a limited number of amino acids, but then stopped when it fell off the sequence. Below, write a function that 1) alerts us when our sequence length is not a multiple of three, and 2) pads the end of such sequences with hyphens ('-') so that the length of the string plus hyphens is a multiple of three. For example, 'ATGA' has length 4; your function should print a warning and return the string 'ATGA--', which has length 6.

In [26]:
my_seq = Seq('ATAGTCTAGCTAG')

def check_seqlen(input_seq):
    # your code here!
    pass  # placeholder so the cell runs; replace with your code

check_seqlen(my_seq)
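
One possible sketch of the padding step follows. The name `pad_to_codons` and the `pad_char` parameter are assumptions for illustration, and the function returns a plain string rather than a `Seq` object.

```python
def pad_to_codons(input_seq, pad_char="-"):
    """Pad the end of a sequence so its length is a multiple of three."""
    seq = str(input_seq)
    remainder = len(seq) % 3
    if remainder == 0:
        return seq  # already a whole number of codons
    n_pad = 3 - remainder
    print(f"Warning: length {len(seq)} is not a multiple of 3; "
          f"adding {n_pad} '{pad_char}'")
    return seq + pad_char * n_pad


padded = pad_to_codons("ATGA")
```

If you need a `Seq` back, wrapping the result as `Seq(padded)` should work.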

### Adjusting reading frames

As an aside, another way to handle sequences whose length isn't a multiple of three is to change your reading frame -- there are technically three ways to read any nucleotide sequence in a given direction, if you think about it. You could start at the first, second, or third base, with each giving a different translation to amino acids.

For this exercise, we won't actually translate our nucleotides. Write a function that tests all three reading frames of any nucleotide sequence you give it and returns the frame ('first', 'second', or 'third') that gives the cleanest read -- i.e., no stop codons in the middle of the sequence. Your function can accept a string or, if you're feeling adventurous, read the sequence from a file first.


In [None]:
my_seq = 'design your own test cases!'

def check_frame(input_seq):
    # your code here!
    pass  # placeholder so the cell runs; replace with your code

    
check_frame(my_seq)
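
Here is one sketch of the frame comparison, under a few assumptions: "cleanest" is taken to mean the fewest stop codons before the final complete codon, ties go to the earliest frame, and the name `best_frame` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons in the standard genetic code


def best_frame(input_seq):
    """Return (frame_name, counts) where counts maps each frame to its
    number of stop codons before the final complete codon."""
    seq = str(input_seq).upper()
    names = ["first", "second", "third"]
    counts = {}
    for offset, name in enumerate(names):
        frame = seq[offset:]  # drop 0, 1, or 2 leading bases
        n_codons = len(frame) // 3
        codons = [frame[3 * k:3 * k + 3] for k in range(n_codons)]
        # ignore the last codon, where a stop is expected
        counts[name] = sum(codon in STOP_CODONS for codon in codons[:-1])
    # min() returns the first minimum, so ties go to the earliest frame
    best = min(names, key=lambda name: counts[name])
    return best, counts


frame, stop_counts = best_frame("ATCTAAGTACTAGCT")
```

Returning the per-frame counts alongside the winner makes ties visible, which is worth knowing before you trust the chosen frame.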

### Low complexity regions

Regions with low complexity -- high incidences of just one or two nucleotide bases -- are extremely suspect. Usually, they're 'junk' regions that don't actually translate to meaningful amino acids. As such, bioinformaticians like to identify and flag these regions. We'll be writing a prototype function that scans a sequence and alerts us to regions with unusually high percentages of certain bases.

The algorithms generally used for this process are very complex -- here's how we'll design our function to mimic these algos: 

* determine a window size appropriate to the length of the sequence we're analyzing
* scan our sequence using this window
* as we're scanning, update the percentage breakdown of each type of nucleotide base
* flag the start and stop locations of regions with unacceptably high percentages of certain bases

With Gabrielle, you wrote a function that returns the percentage composition of a nucleotide sequence. How could you repurpose that here? 

In [None]:
my_seq = 'ATGATCGATCGATGCTAGTAGATAAAAATAAGACTAAC'

def check_complex(input_seq):
    # your code here!
    pass  # placeholder so the cell runs; replace with your code
    
    
check_complex(my_seq)
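
The four steps listed above can be sketched as a simple sliding window. This is only a rough prototype, not one of the production algorithms: the name `flag_low_complexity`, the default window of 10 bases, and the 70% threshold are all assumptions chosen for illustration, and each window is recounted from scratch rather than updated incrementally.

```python
def flag_low_complexity(input_seq, window=10, threshold=0.7):
    """Return (start, stop, base, fraction) for every window in which a
    single base makes up at least `threshold` of the window."""
    seq = str(input_seq).upper()
    flagged = []
    # slide the window along the sequence one base at a time
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        for base in "ATCG":
            fraction = chunk.count(base) / window
            if fraction >= threshold:
                # stop is exclusive, matching Python slice notation
                flagged.append((start, start + window, base, fraction))
    return flagged


regions = flag_low_complexity("ATGATCGATCGATGCTAGTAGATAAAAATAAGACTAAC")
```

On the test sequence this flags the A-rich stretch near the end. Overlapping windows are reported separately, so a natural next step is to merge neighboring hits into one region.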