## Learning objectives


1. Understanding assert statements

2. Parsing FASTA files

3. Writing a Python class

4. Creating an iterator

5. Writing a k-mer counter

---




## Defensive programming


One of the first things to keep in mind when writing scripts that you will reuse, or that other people will use, is that people make mistakes, and your program should handle them in some fashion. One of the easiest ways to do this is to check along the way that everything is what the program expects before the code that depends on it gets executed. To do this, we'll use the `assert` statement.


`assert` checks whether a condition is true. If it is false, Python raises an `AssertionError`, which stops the program unless the error is caught. You can also supply a string after the condition to be printed along with the error.




In [0]:
def test(a):
    assert a > 5, 'Value is too small'
    print('Value is big enough')

test(10)
test(2)

This is particularly useful for reading files to make sure that the file is formatted as expected.
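
As a small illustration of this kind of format check, here is a minimal sketch. The helper name `parse_row` and the three-column, tab-separated layout are hypothetical, chosen only for this example; the `try`/`except` shows one way to see the error message without stopping the program.

```python
# parse_row is a hypothetical helper used only for illustration
def parse_row(line, n_fields=3):
    # Split a tab-separated line into its fields
    fields = line.rstrip('\r\n').split('\t')
    # Stop early if the line does not have the expected number of fields
    assert len(fields) == n_fields, 'Expected {} fields, found {}'.format(n_fields, len(fields))
    return fields

print(parse_row('chr1\t100\t200\n'))

try:
    parse_row('chr1\t100\n')
except AssertionError as err:
    # The string given to assert becomes the error message
    print('Check failed:', err)
```

Keep in mind that Python skips `assert` statements when it is run with the `-O` flag, so they are best suited to sanity checks rather than to checks the program cannot do without.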


## Parsing a FASTA sequence


First, let's make sure that everyone knows what a FASTA file is and looks like. A FASTA file is a text file containing one or more sequences of nucleotides or amino acids. Each sequence has an associated name.


<br><div style="background: #EEE"> \>sequence 1<br> AGATCTCCCTGAGAGAAGAGCTCTCTCTCGA<br> TCTCGGATTACGTAGGCTAGAGAGAGAGCTA<br> TTCAA<br> \>sequence 2<br> GATCTCGGGATAAAAAAACTGGGATCTGATC<br> ATCTAAAGAGAG </div><br>


So, each sequence starts with a '>' followed by the sequence name. Then, all subsequent lines until the next sequence or the end of the file contain the sequence, broken up with a uniform number of characters per line.
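
To make the layout concrete, the sketch below goes the other way and builds FASTA text from a name and a sequence. The function name `format_fasta` and the default width of 31 characters (the line length used in the example above) are assumptions made for illustration.

```python
# format_fasta is a hypothetical helper, not part of the lesson's code
def format_fasta(name, sequence, width=31):
    # The header line is '>' followed by the sequence name
    lines = ['>' + name]
    # Break the sequence into chunks of at most `width` characters
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return '\n'.join(lines)

print(format_fasta('sequence 2', 'GATCTCGGGATAAAAAAACTGGGATCTGATCATCTAAAGAGAG'))
```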


Let's write a function to read a single sequence from a FASTA file.




In [0]:
# Let's have it accept an open file object
# That way it can be passed a file or standard input

def single_FASTAReader(file):
    # Get the first line, which should contain the sequence name
    line = file.readline()

    # Let's make sure the file looks like a FASTA file
    assert line.startswith('>'), "Not a FASTA file"

    # Get the sequence name
    seq_id = line[1:].rstrip('\r\n')

    # Create a list to hold the lines of the sequence
    sequence = []

    # Get the next line
    line = file.readline()

    # Keep reading lines until we run out
    while line:
        # Check if we've reached a new sequence (in a multi-sequence file)
        if line.startswith('>'):
            break

        # Add next chunk of sequence
        sequence.append(line.strip())

        # Get the next line
        line = file.readline()
    return (seq_id, ''.join(sequence))

Now we need to test whether our function works. We'll use the file 'subset.fa' for this. You can see how this file was generated in the script prep.sh. It contains a subset of Drosophila melanogaster RNA transcripts from a male embryo. So, does the function work?




In [0]:
name, seq = single_FASTAReader(open('subset.fa'))
print(name, seq)

Finally, let's put it into a script and see if we can run it on standard input data.
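
One way such a script could look is sketched below. The names `read_first_record` and `main`, and the script name `read_fasta.py`, are hypothetical; the reader is a condensed version of the function above, and an in-memory file stands in for standard input so the sketch runs anywhere.

```python
import io
import sys

def read_first_record(stream):
    # Condensed version of the reader above; works on any open file-like object
    header = stream.readline()
    assert header.startswith('>'), 'Not a FASTA file'
    seq_id = header[1:].rstrip('\r\n')
    chunks = []
    for line in stream:
        # Stop at the next header in a multi-sequence file
        if line.startswith('>'):
            break
        chunks.append(line.strip())
    return seq_id, ''.join(chunks)

def main(stream=sys.stdin):
    # Report the name and length of the first sequence
    name, seq = read_first_record(stream)
    print(name, len(seq))

# In a script run as `python read_fasta.py < subset.fa` (hypothetical name),
# you would call main() with no argument so that it reads sys.stdin.
# Here a StringIO object plays the role of standard input.
main(io.StringIO('>sequence 1\nAGATC\nTTCAA\n>sequence 2\nGATC\n'))
```

Because the function only needs an object with `readline` and line iteration, the same code works with `open('subset.fa')`, `sys.stdin`, or an in-memory file.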


## Parsing a FASTA file


The next step is to expand the code to handle multiple sequences. This should be a simple matter of adding some new code to the `if` statement that detects a new sequence header.




In [0]:
def FASTAReader(file):
    # Get the first line, which should contain the sequence name
    line = file.readline()

    # Let's make sure the file looks like a FASTA file
    assert line.startswith('>'), "Not a FASTA file"
    
    # Get the sequence name
    seq_id = line[1:].rstrip('\r\n')

    # Create a list to hold the lines of the current sequence
    sequence = []

    # Get the next line
    line = file.readline()

    # Add a list to hold all of the sequences in
    sequences = []

    # Keep reading lines until we run out
    while line:
        # Check if we've reached a new sequence (in a multi-sequence file)
        if line.startswith('>'):
            # Add previous sequence to list
            sequences.append((seq_id, ''.join(sequence)))
            
            # Record new sequence name and reset sequence
            seq_id = line[1:].rstrip('\r\n')
            sequence = []
        else:
            # Add next chunk of sequence
            sequence.append(line.strip())
        
        # Get the next line
        line = file.readline()
    # Add the last sequence to sequences
    sequences.append((seq_id, ''.join(sequence)))

    return sequences

And let's see whether it works.




In [0]:
seqs = FASTAReader(open('subset.fa'))
print(len(seqs))
print(seqs[0])
print(seqs[1])

## Python Classes


One concept we haven't talked about yet is the Python class. A class defines a type of object that bundles together its own variables and methods. That should sound familiar, as we've already seen classes in the form of all of the Python data types.


Let's create our own class. A class needs a declaration and, in nearly every case, an initialization method called `__init__`, which runs whenever a new instance is created.




In [0]:
class OurClass(object):
    def __init__(self):
        print('created')

instance = OurClass()

Now we've defined our class and created an instance of it. Classes have lots of special methods that can be defined. Special methods start and end with a double underscore, for example `__init__`. We can define any number of methods in a class. Every instance method takes `self` as its first argument, which gives us access to all of the instance's internal variables and methods.
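
As a small illustration of special methods beyond `__init__`, the sketch below defines `__len__` and `__repr__`, which Python calls for `len()` and when printing an object. The class name `DNASequence` and its attributes are hypothetical, made up for this example.

```python
# DNASequence is a hypothetical class used only for illustration
class DNASequence(object):
    def __init__(self, name, bases):
        self.name = name
        self.bases = bases

    def __len__(self):
        # Called by len(instance)
        return len(self.bases)

    def __repr__(self):
        # Called when the instance is printed or shown in the interpreter
        return 'DNASequence({!r}, {} bp)'.format(self.name, len(self))

s = DNASequence('sequence 1', 'AGATCTCC')
print(len(s))
print(s)
```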




In [0]:
class Rect(object):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height

R = Rect(5, 10)
print(R.area())

We've talked about iterators before. Let's see how to create one. There are two special methods we need to define. The first is `__iter__`, which is called when iteration begins (for example at the start of a `for` loop) and returns the iterator object, here the instance itself. The second is `__next__`, which returns the next value each time one is requested and raises `StopIteration` when there are none left.
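
To see these two methods at work, the sketch below steps through a built-in list iterator by hand using the built-in functions `iter()` and `next()`, which call `__iter__` and `__next__` behind the scenes. This is roughly what a `for` loop does for us.

```python
numbers = iter([10, 20, 30])  # iter() calls the list's __iter__ method

# Each call to next() calls the iterator's __next__ method
print(next(numbers))
print(next(numbers))
print(next(numbers))

try:
    # The iterator is now exhausted, so __next__ raises StopIteration
    next(numbers)
except StopIteration:
    print('No more items')
```

A `for` loop catches the `StopIteration` itself and simply ends, which is why we never see the error in normal use.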




In [0]:
class Iterator(object):
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop
        self.current = start - 1

    def __iter__(self):
        return self

    def __next__(self):
        self.current += 1
        if self.current >= self.stop:
            raise StopIteration
        return self.current

I = Iterator(0, 10)
for i in I:
    print(i)

## FASTA iterator


An iterator seems ideally suited to our FASTA parser, since we want to access one sequence at a time from the file. Currently, the whole FASTA file is read in at once. In the case of mammalian genomes, that can be a large amount of data. An iterator only reads one sequence at a time, so only that sequence needs to be held in memory. With large datasets, this may be important.




In [0]:
class FASTAReader(object):

    def __init__(self, file):
        self.last_id = None
        self.file = file
        self.eof = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.eof:
            raise StopIteration
        # check if this is the first sequence from the file
        if self.last_id is None:
            # First line
            line = self.file.readline()
            # Verify that this is a FASTA file
            assert line.startswith(">"), "Not a FASTA file"
            # Get the sequence ID
            seq_id = line[1:].rstrip("\r\n")
        else:
            # Get ID from previous round
            seq_id = self.last_id

        sequence = []
        while True:
            line = self.file.readline()
            # Check if we've reached the end of the file
            if line == "":
                self.eof = True
                break
            # If this is not a header line, it belongs to the current sequence
            elif not line.startswith(">"):
                sequence.append(line.strip())
            # We've reached the next sequence ID
            else:
                self.last_id = line[1:].rstrip("\r\n")
                break
        
        sequence = "".join(sequence)
        return seq_id, sequence

reader = FASTAReader(open('subset.fa'))
for seq_id, seq in reader:
    print(seq_id, seq)

## K-mer counting


Now that we can read in FASTA files, let's do something with them. K-mers, subsequences of a fixed length k, are used in a wide variety of bioinformatic algorithms and analysis methods, such as sequence alignment and metagenomic species parsing. So let's start with the simple task of loading the sequences in our FASTA file and counting all of the k-mers that occur in them for some value of `k`.
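
Before working with a whole file, it may help to count k-mers by hand on a toy example. The sketch below uses a made-up sequence, `GATTACA`, and `k = 3`; a sequence of length n contains n - k + 1 k-mers, so here we expect 7 - 3 + 1 = 5.

```python
sequence = 'GATTACA'  # toy sequence chosen for illustration
k = 3

counts = {}
# Slide a window of width k along the sequence, one position at a time
for i in range(len(sequence) - k + 1):
    kmer = sequence[i:i + k]
    # Start from 0 the first time a k-mer is seen
    counts[kmer] = counts.get(kmer, 0) + 1

print(counts)
print(sum(counts.values()))
```

The windows are GAT, ATT, TTA, TAC, and ACA, each seen once.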




In [0]:
reader = FASTAReader(open('subset.fa'))
kmers = {}

k = 11

for seq_id, sequence in reader:
    # A sequence of length n has n - k + 1 k-mers, so include the final window
    for i in range(0, len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        kmers.setdefault(kmer, 0)
        kmers[kmer] += 1

for key in kmers:
    print(key, kmers[key])

## Importing functions and classes from other scripts


Much like we can load Python modules using the `import` statement, we can import classes and functions from any script that is in the same folder as the code we're executing. This means that you can easily reuse code that you've previously written, such as the FASTA reader.
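
The sketch below shows the idea in a self-contained way. The module name `fasta_tools` and the function `gc_content` are hypothetical. So that the example runs anywhere, it writes the module to a temporary folder and adds that folder to `sys.path`; for a script sitting in the same folder as your code, that step is unnecessary and the `from ... import ...` line alone is enough.

```python
import importlib
import pathlib
import sys
import tempfile

# Create a stand-in for a script you wrote earlier (hypothetical fasta_tools.py)
folder = pathlib.Path(tempfile.mkdtemp())
(folder / 'fasta_tools.py').write_text(
    "def gc_content(sequence):\n"
    "    return (sequence.count('G') + sequence.count('C')) / len(sequence)\n"
)

# Make the temporary folder searchable, mimicking "the same folder as our code"
sys.path.insert(0, str(folder))
importlib.invalidate_caches()

# Import a single function from the script by its file name (without .py)
from fasta_tools import gc_content

print(gc_content('GATC'))
```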


