# Handout 5  

Bruce Schultz

### Exercise 1

##### Part a  

**Write a function single_fasta_sequence(filename) that reads a sequence given in
FASTA format from a file containing a single sequence.**

In [6]:
def single_fasta_sequence(file):
    """
    :param file: Open file object containing a single FASTA formatted sequence
    :return: Tuple containing the header in position 0 and sequence in position 1
    """
    content = file.read().splitlines()
    header = content[0][1:]  # Remove the '>' character
    sequence = ''.join(content[1:])  # Join sequence lines together
    return (header, sequence)

In [9]:
with open('ecoli-genome.fna', 'r') as f:
    hd, seq = single_fasta_sequence(f)
print(seq[:10])

AGCTTTTCAT


##### Part b  

**Write a function fasta_list(filename) that reads all sequences from a FASTA file and returns a list of tuples, each tuple containing the header as the first and the sequence as the second element.**

In [10]:
def fasta_list(file):
    """
    Same as single_fasta_sequence, but works on a file containing multiple FASTA sequences
    :param file: Open file object containing one or more FASTA formatted sequences
    :return: List of tuples, each containing the header in position 0 and sequence in position 1
    """
    content = file.read().split('>')[1:]
    genes = []
    for gene in content:
        gene = gene.splitlines()  # Split the lines
        header = gene[0]
        sequence = ''.join(gene[1:])  # splitlines() already removed the newlines
        genes.append((header, sequence))
    return genes

In [12]:
with open('ecoli-genes.ffn', 'r') as f:
    genes = fasta_list(f)
print(genes[0][:50])

('gi|556503834|ref|NC_000913.3|:190-255 Escherichia coli str. K-12 substr. MG1655, complete genome', 'ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA')


##### Part c  

**Write a function fasta_sequences(filename) that can be used in the manner described above using a self written generator using the yield keyword.**

In [19]:
def fasta_sequences(fasta_file):
    with open(fasta_file, 'r') as file:
        for gene in fasta_list(file):
            yield gene

In [22]:
name, seq = max(fasta_sequences("ecoli-genes.ffn"), key=lambda x: len(x[1]))
max_length = len(seq)
print("The longest gene is", name, "and contains", max_length, "nucleobases.")

The longest gene is gi|556503834|ref|NC_000913.3|:2044938-2052014 Escherichia coli str. K-12 substr. MG1655, complete genome and contains 7077 nucleobases.
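
The generator above still calls `fasta_list`, which reads the whole file into memory before the first tuple is yielded. A minimal sketch of a line-by-line variant follows. The name `iter_fasta` is hypothetical and not part of the exercise, and the function is assumed to receive any iterable of lines (an open file or, as here, an in-memory buffer with made-up records).

```python
import io


def iter_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines
        if line.startswith('>'):
            if header is not None:
                # A new header closes the previous record
                yield header, ''.join(chunks)
            header = line[1:]  # Remove the '>' character
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        # Emit the last record, which has no following header
        yield header, ''.join(chunks)


# Made-up two-record input, not taken from the E. coli files
example = io.StringIO(">gene1 demo\nATGC\nGGTA\n>gene2 demo\nTTAA\n")
records = list(iter_fasta(example))
print(records)
```

Because only one record is held at a time, this form pairs naturally with `max(..., key=...)` on large files.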


##### Part d  

**Test the functions written in (b) and (c) on ecoli-proteome.faa . Read the Fasta file
and print the header and the length of the shortest and longest amino acid sequence.**

In [28]:
# Lists
with open('ecoli-proteome.faa', 'r') as f:
    proteins = fasta_list(f)
name, seq = max(proteins, key=lambda x: len(x[1]))
max_length = len(seq)
print("Lists:\nThe longest protein is", name, "and contains", max_length, "amino acids.\n")

# Generators
name, seq = max(fasta_sequences("ecoli-proteome.faa"), key=lambda x: len(x[1]))
max_length = len(seq)
print("Generator:\nThe longest protein is", name, "and contains", max_length, "amino acids.")

Lists:
The longest protein is gi|145698281|ref|NP_416485.4| putative adhesin [Escherichia coli str. K-12 substr. MG1655] and contains 2358 amino acids.

Generator:
The longest protein is gi|145698281|ref|NP_416485.4| putative adhesin [Escherichia coli str. K-12 substr. MG1655] and contains 2358 amino acids.
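
Part d also asks for the shortest sequence, which the cell above does not print. One possible way to get both extremes is sketched below, using `min` and `max` with the same length key. The helper name `shortest_and_longest` and the three demo records are made up for illustration; no result for `ecoli-proteome.faa` is implied.

```python
def sequence_length(record):
    """Length of the sequence in a (header, sequence) tuple."""
    return len(record[1])


def shortest_and_longest(records):
    """Return the records with the shortest and the longest sequence."""
    records = list(records)  # A generator can only be consumed once
    return min(records, key=sequence_length), max(records, key=sequence_length)


# Made-up records for illustration only
demo = [("protA", "MKT"), ("protB", "MKTAYIAKQR"), ("protC", "MA")]
shortest, longest = shortest_and_longest(demo)
print("Shortest:", shortest[0], "with", len(shortest[1]), "amino acids")
print("Longest:", longest[0], "with", len(longest[1]), "amino acids")
```

The same helper should accept the output of either `fasta_list` or `fasta_sequences`, since both produce `(header, sequence)` tuples.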


##### Part e  

**Now write a function write_fasta(outfile,header,sequence) that writes the sequence
with header to the opened file outfile in FASTA format.**

In [29]:
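def write_fasta(outfile, header, sequence):
    """
    :param outfile: File object opened for writing
    :param header: Header text without the leading '>'
    :param sequence: Sequence string to be wrapped over several lines
    """
    outfile.write(">" + header + '\n')
    # 69 characters plus the newline keep every written line within 70 characters
    for i in range(0, len(sequence), 69):
        outfile.write(sequence[i:i+69] + '\n')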
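
A quick way to check the line wrapping is to write into an in-memory buffer instead of a real file. The sketch below repeats the function so that it runs on its own; the header and the 200-base sequence are made up.

```python
import io


def write_fasta(outfile, header, sequence):
    # Same logic as the function above, repeated so the snippet is self-contained
    outfile.write(">" + header + "\n")
    for i in range(0, len(sequence), 69):
        outfile.write(sequence[i:i + 69] + "\n")


buffer = io.StringIO()
write_fasta(buffer, "demo sequence", "ACGT" * 50)  # 200 made-up bases
lines = buffer.getvalue().splitlines()
print(lines[0])
print([len(line) for line in lines[1:]])
```

Joining the sequence lines again should give back the original string, which makes this a convenient round-trip test.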

##### Part f  

**Put the definitions of the previous functions in a single file named fastatools.py .**

### Exercise 2

**Write a script cdna.py that takes two command line parameters. The first is an input file
containing a DNA sequence in FASTA format. Read the sequence and generate the cDNA
(complementary DNA strand) and write it to the file given as second parameter.**

In [33]:
run cdna.py ecoli-genome.fna ecoli-genome-comp.fna
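
The script `cdna.py` itself is not reproduced in this handout. The core step might look roughly like the sketch below. It assumes that the sequence contains only A, C, G, T and N, and it offers both the plain complement and the reverse complement, since the exercise text does not say which orientation is wanted. The function names are hypothetical, and the command line handling and file I/O are left out.

```python
# Translation table mapping each base to its complement (N stays N)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement(sequence):
    """Return the complementary strand in the same left-to-right order."""
    return sequence.translate(COMPLEMENT)


def reverse_complement(sequence):
    """Return the complementary strand read in the opposite direction."""
    return complement(sequence)[::-1]


# Made-up example sequence
print(complement("ATGCCN"))
print(reverse_complement("ATGCCN"))
```

In a full script, the input would presumably be read with `single_fasta_sequence` and the result written with `write_fasta`.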

### Exercise 3

##### Part a  

**Write a script orf_finder.py that takes two command line parameters. The first is an
input file containing a DNA sequence. The second is an output file that should contain
all the (longest) open reading frames found in DNA (from both strands).**

In [34]:
run orf_finder.py ecoli-genome-sample.fna ecoli-genome-orfs.ffn
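
The script `orf_finder.py` is likewise not shown here. The following is only a sketch of one possible search. It assumes that an ORF starts at `ATG`, ends at the first in-frame stop codon (`TAA`, `TAG` or `TGA`), and that "longest" means taking the first `ATG` after the previous stop in each frame. The function names are hypothetical, and the example sequences are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orfs_in_strand(sequence):
    """Yield the longest ORF (ATG through stop codon) for each in-frame stop."""
    for frame in range(3):
        start = None  # Position of the first ATG since the previous stop
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                yield sequence[start:i + 3]
                start = None


def find_orfs(sequence):
    """Return ORFs from the given strand and from its reverse complement."""
    sequence = sequence.upper()
    reverse = sequence.translate(COMPLEMENT)[::-1]
    return list(orfs_in_strand(sequence)) + list(orfs_in_strand(reverse))


# Made-up sequences for illustration
print(find_orfs("CCATGAAATGCCCTAGGG"))
print(find_orfs("TTATTTCAT"))
```

An ORF that runs off the end of the sequence without a stop codon is dropped in this sketch; whether such fragments should be reported is a design decision the exercise leaves open.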