# CMM262 Programming Practice Problems in Python, Part 3

**Author:** Michelle Franc Ragsac (mragsac@eng.ucsd.edu)

This notebook contains more exercises that incorporate basic concepts of programming, specifically for the Python programming language, but with a bioinformatics twist! 

These exercises are quite involved: they require you to read files from our `ref/` folder and convert their contents into Python data structures.

> * Many of these exercises (and more!) can be found through the Rosalind "Bioinformatics Stronghold" resource: http://rosalind.info/problems/list-view/?location=bioinformatics-stronghold

---

### [Calculating Protein Mass](http://rosalind.info/problems/prtm/)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A protein string, <code>p</code>, of at most 1000 aa
    <br><b><i>Return</i></b>: The total weight of <code>p</code></p>
    <p><i>Hint:</i> We have a monoisotopic mass table that you can reference: <code>ref/monoisotopic_masses.txt</code>!</p>
</div>

#### Sample Dataset

```
SKADYEK
```

#### Sample Output

```
821.392
```

In [1]:
# 1. Initialize the protein string that we want to evaluate
p = "SKADYEK"

################################################################################
# 2. Import the monoisotopic mass table and convert it to a dictionary
mass_table_filename = "ref/monoisotopic_masses.txt"

masses = {}
with open(mass_table_filename, "r") as file:
    for line in file:
        amino_acid,mass = line.strip().split()
        masses[amino_acid.upper()] = float(mass)
        
################################################################################
# 3. Calculate the mass of the protein string
total_mass = sum([masses[aa] for aa in p])
total_mass = round(total_mass, 3) # round the answer to three decimal places

# Share the output
print(f"The total weight of the protein string [ {p} ] is : {total_mass}")

The total weight of the protein string [ SKADYEK ] is : 821.392

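The same calculation can also be wrapped in a reusable function. The sketch below is illustrative rather than part of the original solution: the helper name `protein_mass` is hypothetical, and the mass table is hard-coded for only the six residues that appear in the sample string (values from the standard monoisotopic mass table) instead of being read from `ref/monoisotopic_masses.txt`.

```python
# Monoisotopic residue masses (in daltons) for the residues in the sample only.
# Assumption: the complete table would normally come from ref/monoisotopic_masses.txt
SAMPLE_MASSES = {
    "A": 71.03711,
    "D": 115.02694,
    "E": 129.04259,
    "K": 128.09496,
    "S": 87.03203,
    "Y": 163.06333,
}

def protein_mass(protein, masses):
    """Return the summed residue mass of a protein string (hypothetical helper)."""
    # Fail with a readable message if the string contains a residue we can't look up
    unknown = set(protein) - set(masses)
    if unknown:
        raise KeyError(f"No mass listed for residue(s): {sorted(unknown)}")
    return sum(masses[aa] for aa in protein)

print(round(protein_mass("SKADYEK", SAMPLE_MASSES), 3))
```

Checking for unknown residues up front is a design choice: it reports every unexpected character at once, rather than stopping at the first failed dictionary lookup.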

---

### [Translating RNA into Protein](http://rosalind.info/problems/prot/)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: An RNA string, <code>s</code>, corresponding to a strand of mRNA (of length at most 10 kbp)
    <br><b><i>Return</i></b>: The protein string encoded by <code>s</code></p>
    <p><i>Hint:</i> We have an RNA codon table that you can reference: <code>ref/rna_codon_table.txt</code>!</p>
</div>

#### Sample Dataset

```
AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
```

#### Sample Output

```
MAMAPRTEINSTRING
```

In [2]:
# 1. Initialize the RNA string that we want to convert to a protein string
s = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"

################################################################################
# 2. Import the RNA codon table and convert it to a dictionary 
rna_codon_table_filename = "ref/rna_codon_table.txt"

codon_table = {}
with open(rna_codon_table_filename, "r") as file:
    for line in file:
        fields = line.strip().split()
        # Each row alternates codon, amino acid, codon, amino acid, ...
        # so even positions hold codons and odd positions hold amino acids
        for codon,aa in zip(fields[0::2], fields[1::2]):
            codon_table[codon.upper()] = aa.upper() # set uniform case
            
################################################################################
# 3. Determine the codons we need to evaluate in the RNA string and then
#    figure out what the results are! 
# Only keep complete codons; a trailing fragment of one or two bases can't be translated
rna_codons = [s[i:i+3].upper() for i in range(0, len(s) - 2, 3)]

# Translate every codon, then keep only the portion before the first STOP codon
# (this also covers a STOP codon that appears earlier than expected)
protein_string = "".join([codon_table[codon] for codon in rna_codons])
protein_string = protein_string.split("STOP")[0]

# Share the output
print("The resulting output for this exercise:\n")
print(f"RNA String:\t{s}\nProtein String:\t{protein_string}")

The resulting output for this exercise:

RNA String:	AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA
Protein String:	MAMAPRTEINSTRING

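An alternative to translating every codon and then splitting on `STOP` is to stop translating as soon as a stop codon is reached. The sketch below illustrates that idea under a few assumptions: the helper name `translate` is hypothetical, and the codon table is built inline from the standard genetic code (with `*` marking stop codons) rather than read from `ref/rna_codon_table.txt`.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position;
# "*" marks the three stop codons (UAA, UAG, UGA)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}

def translate(rna):
    """Translate an RNA string, stopping at the first stop codon (hypothetical helper)."""
    protein = []
    # Stepping to len(rna) - 2 skips any incomplete codon at the end
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i+3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))
```

For the sample dataset this prints `MAMAPRTEINSTRING`, matching the expected output.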

---

### [Consensus and Profile](http://rosalind.info/problems/cons/)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A collection of DNA strings of equal length (at most 1 kbp each) in FASTA format
    <br><b><i>Return</i></b>: The consensus string and profile matrix for the collection</p>
</div>

#### Sample Dataset

```
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
```

#### Sample Output

```
ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6
```

In [3]:
# 1. Gather the location of the FASTA file we want to evaluate
input_file = 'ref/rosalind_cons.txt'

################################################################################
# 2. Gather the FASTA sequences within the file by iterating through each
#    line in the file and compiling all of the sequences under each header line
fasta = {} 

current_header = ""
sequence       = ""
with open(input_file,'r') as file: 
    for line in file:
        line = line.strip() # removes leading/trailing whitespace (e.g., the newline)
        
        if line.startswith(">"): # determine if we're evaluating a header line
            if current_header:
                # Store the sequence we've compiled under the previous header 
                # (This doesn't apply to the first header since there was 
                # nothing that came before it!)
                fasta[current_header] = sequence.upper() # uniform case
                
            # Reset the state for tracking what sequence and header we're evaluating
            current_header = line 
            sequence       = ""
        else: 
            # Compile all of the lines that come after the header line we're 
            # currently evaluating (stored under current_header)
            sequence += line

    # Add the last entry for the FASTA dictionary before we close the file
    # (skipped if the file contained no header lines at all)
    if current_header:
        fasta[current_header] = sequence.upper()

################################################################################
# 3. Tally the type of nucleotides we have at each sequence position 
profile = { 'A':[], 'C':[], 'G':[], 'T':[] }

for header in fasta:
    sequence_to_evaluate = fasta[header]
    if len(profile['A']) == 0: # initialize the profile matrix
        profile['A'] = [0] * len(sequence_to_evaluate)
        profile['C'] = [0] * len(sequence_to_evaluate)
        profile['G'] = [0] * len(sequence_to_evaluate)
        profile['T'] = [0] * len(sequence_to_evaluate)
        
    # Go through each position in the sequence and tally the number of times 
    # we see a particular nucleotide at that position
    for i in range(len(sequence_to_evaluate)):
        profile[sequence_to_evaluate[i]][i] += 1

################################################################################
# 4. Determine the consensus sequence
consensus_sequence = ""

characters = "ACGT"
for i in range(len(profile['A'])):
    # Determine the number of A's, C's, G's, and T's we see at the current 
    # position we're evaluating, then determine which base we see the most 
    # (if two bases tie, the first one in A, C, G, T order is chosen)
    values = [profile['A'][i], profile['C'][i], profile['G'][i], profile['T'][i]]
    consensus_sequence += characters[values.index(max(values))]

################################################################################
# 5. Show the result in the format specified by Rosalind
print(consensus_sequence)
for base in profile: 
    print(base + ":", " ".join([str(count) for count in profile[base]]))

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

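Because all of the sequences have the same length, the profile matrix can also be built column by column. The sketch below is one possible alternative, not the approach used in the cell above: it hard-codes the sample sequences (instead of reading a FASTA file) and uses `zip(*sequences)` to walk through the alignment one position at a time.

```python
# Sample sequences from the exercise, hard-coded for illustration
sequences = [
    "ATCCAGCT",
    "GGGCAACT",
    "ATGGATCT",
    "AAGCAACC",
    "TTGGAACT",
    "ATGCCATT",
    "ATGGCACT",
]
BASES = "ACGT"

# zip(*sequences) yields one tuple per position, holding that position's
# base from every sequence; count each base within every column
profile = {
    base: [column.count(base) for column in zip(*sequences)]
    for base in BASES
}

# At each position, pick the base with the highest count; max() returns the
# first maximum, so ties are resolved in A, C, G, T order
consensus = "".join(
    max(BASES, key=lambda base: profile[base][i])
    for i in range(len(sequences[0]))
)

print(consensus)
for base in BASES:
    print(base + ":", " ".join(str(count) for count in profile[base]))
```

For the sample dataset, this prints the consensus string `ATGCAACT` followed by the four profile rows listed in the sample output.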

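Since several of these exercises read FASTA input, the parsing step could be factored into a small function. The following is a minimal sketch under stated assumptions: `parse_fasta` is a hypothetical helper name, it accepts any iterable of lines (an open file works too), and, unlike the cell above, it drops the leading `>` from each header. The example input is made up for illustration, with the first record split across two lines.

```python
def parse_fasta(lines):
    """Return a {header: sequence} dictionary from FASTA lines (hypothetical helper)."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue # ignore blank lines
        if line.startswith(">"):
            header = line[1:] # drop the ">" marker
            records[header] = []
        elif header is not None:
            # Sequences may span several lines, so collect the pieces
            records[header].append(line.upper())
    # Join the collected pieces once per record
    return {name: "".join(pieces) for name, pieces in records.items()}

# Illustrative input: the first record is split across two lines
example = """>Rosalind_1
ATCC
AGCT
>Rosalind_2
GGGCAACT
"""

print(parse_fasta(example.splitlines()))
```

Collecting the pieces in a list and joining them at the end avoids repeated string concatenation, which can matter for long sequences.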
---

### [Open Reading Frames](http://rosalind.info/problems/orf/)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A DNA string, <code>s</code>, of at most 1 kbp in FASTA format
    <br><b><i>Return</i></b>: Every distinct candidate protein string that can be translated from ORFs of <code>s</code>. Strings can be returned in any order.</p>
</div>

#### Sample Dataset

```
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
```

#### Sample Output

```
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE
```

In [4]:
# 1. Gather the location of the file we want to evaluate then gather the sequence
input_dna_filename = "ref/rosalind_orf.txt"

dna_sequence = ""
with open(input_dna_filename, "r") as file:
    for line in file:
        # Skip over the header line within our FASTA file 
        if line.startswith(">"): 
            continue
        dna_sequence += line.strip().upper() # gather all lines in the FASTA 

################################################################################
# 2. Determine the reverse complement of the DNA sequence 
base_complements = {"A":"T", "T":"A", "C":"G", "G":"C"}
revcomp_dna_sequence = ''.join([base_complements[base] for base in dna_sequence][::-1])

################################################################################
# 3. Import the RNA codon table and convert it to a dictionary 
rna_codon_table_filename = "ref/rna_codon_table.txt"

codon_table = {}
with open(rna_codon_table_filename, "r") as file:
    for line in file:
        fields = line.strip().split()
        # Each row alternates codon, amino acid, codon, amino acid, ...
        # so even positions hold codons and odd positions hold amino acids
        for codon,aa in zip(fields[0::2], fields[1::2]):
            codon_table[codon.upper()] = aa.upper() # set uniform case
            
################################################################################
# 4. Determine all sets of codons that we can see from each reading frame
#    for the sequence in the forward and reverse orientation

protein_sequences = set()
# Evaluate the forward orientation first, then the reverse orientation
for strand in [dna_sequence, revcomp_dna_sequence]:
    for reading_frame in [0,1,2]: 
        start_codon_found = False
        current_sequence = ""
        # Stop at len(strand) - 2 so that only complete codons are evaluated
        for i in range(reading_frame, len(strand) - 2, 3):
            codon = strand[i:i+3].replace("T","U")
            aa_translation = codon_table[codon]
            if aa_translation == "M":
                start_codon_found = True
                current_sequence += aa_translation
            elif aa_translation == "STOP":
                protein_sequences.add(current_sequence)
                start_codon_found = False
                current_sequence = ""
            elif start_codon_found:
                current_sequence += aa_translation
            
################################################################################
# 5. Expand each translated stretch into every candidate protein string: a 
#    stretch with several M's contains one candidate per M, and each candidate
#    runs to the same STOP codon
final_protein_sequences = []
for sequence in protein_sequences:
    if len(sequence) == 0:
        continue
    # Determine every position where a protein string could start
    positions_of_m = [i for i, ltr in enumerate(sequence) if ltr == "M"] 
    for position in positions_of_m:
        candidate = sequence[position:]
        if candidate not in final_protein_sequences: # keep only distinct strings
            final_protein_sequences.append(candidate)

################################################################################
# 6. Show the result in the format specified by Rosalind
for sequence in final_protein_sequences:
    print(sequence)

MLLGSFRLIPKETLIQVAGSSPCNLS
MGMTPRLGLESLLE
MTPRLGLESLLE
M

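A different way to approach this exercise is to start a translation at every `ATG` on both strands and keep it only if a stop codon is reached in that frame. The sketch below illustrates that idea and is not the method used in the cell above. The names `reverse_complement` and `candidate_proteins` are hypothetical, and the codon table is built inline from the standard genetic code (with `*` marking stop codons) rather than read from `ref/rna_codon_table.txt`.

```python
from itertools import product

# Standard genetic code over DNA codons, in T, C, A, G order; "*" marks stops
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

def candidate_proteins(dna):
    """Return the set of distinct protein strings from ORFs on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start+3] != "ATG":
                continue
            protein = []
            # Translate codon by codon from this start codon
            for i in range(start, len(strand) - 2, 3):
                aa = DNA_CODON_TABLE[strand[i:i+3]]
                if aa == "*":
                    # Only keep the protein if a stop codon closes the ORF
                    proteins.add("".join(protein))
                    break
                protein.append(aa)
    return proteins

sample = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTT"
          "TTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG")
for protein in sorted(candidate_proteins(sample)):
    print(protein)
```

Using a set means duplicates are dropped automatically, and sorting the results before printing makes the output order reproducible between runs.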