# CMM262 Programming Practice Problems in Python, Part 2

**Author:** Michelle Franc Ragsac (mragsac@eng.ucsd.edu)

This notebook contains additional introductory exercises on basic programming concepts in Python, with a bioinformatics twist!

> * Adapted from BISB Bootcamp 2020 Day 4, Module 6: Pair Programming materials: https://github.com/mragsac/BISB-Bootcamp-2020/tree/master/day4/module6_pair-programming
> * Many of these exercises (and more!) can be found through the Rosalind "Bioinformatics Stronghold" resource: http://rosalind.info/problems/list-view/?location=bioinformatics-stronghold

---

### [Counting DNA Nucleotides](http://rosalind.info/problems/dna/)

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A DNA sequence string, <code>s</code>, with a maximum length of 1,000 nucleotides 
    <br><b><i>Return</i></b>: Four integers (separated by spaces) counting the respective number of times that the bases <code>A</code>, <code>T</code>, <code>C</code>, and <code>G</code> occur in <code>s</code></p>
</div>
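
As a minimal sketch of the task stated above, the built-in `str.count` method can tally each base directly. The function name `count_nucleotides` is a hypothetical helper introduced only for illustration, and the output order (`A`, `T`, `C`, `G`) follows the exercise statement.

```python
def count_nucleotides(s):
    """Return the counts of A, T, C, and G in s, in that order (case-insensitive)."""
    s = s.upper()
    # str.count scans the string once per base, which is fine for short sequences
    return tuple(s.count(base) for base in "ATCG")

counts = count_nucleotides("GGGCCGTTGGT")
print(*counts)  # 0 3 2 6
```

This version makes four passes over the string instead of one. That keeps the code short, but a single-pass tally may be preferable for very long sequences.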

In [None]:
# 1. Start by initializing the DNA string that we want to evaluate
s = "GGGCCGTTGGT"

# 2. Generate a dictionary to hold the tallies of each nucleotide
d = {"A":0, "T":0, "C":0, "G":0}

# 3. Go through the string and tally the number of times a nucleotide appears
for nt in s: 
    d[nt.upper()] += 1
    
# 4. Return the four integers with the tally for each base
print(f'{d["A"]} {d["T"]} {d["C"]} {d["G"]}')

---

### [Compute the Hamming Distance Between Two Strings](http://rosalind.info/problems/ba1g/)

We say that position $i$ in k-mers $p_1 \ldots p_k$ and $q_1 \ldots q_k$ is a mismatch if $p_i \neq q_i$. For example, `CGAAT` and `CGGAC` have two mismatches. 

The number of mismatches between strings $p$ and $q$ is called the **Hamming Distance** between these strings and is denoted `HammingDistance(p, q)`.

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: Two DNA Strings, <code>p</code> and <code>q</code>
    <br><b><i>Return</i></b>: An integer value representing the Hamming Distance of the two DNA strings</p>
</div>
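
The definition above can be sketched as a small function. The name `hamming_distance` is a hypothetical helper, and the length check is an added assumption: the definition compares two k-mers of the same length, while `zip` silently stops at the end of the shorter string.

```python
def hamming_distance(p, q):
    """Count the positions at which equal-length strings p and q differ."""
    if len(p) != len(q):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    # Each comparison yields True (1) or False (0), so sum() counts the mismatches
    return sum(p_i != q_i for p_i, q_i in zip(p, q))

print(hamming_distance("CGAAT", "CGGAC"))  # 2
```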

In [None]:
# 1. Start by initializing the two DNA strings we want to evaluate
p = "GGGCCGTTGGT"
q = "GGACCGTTGAC"

# 2. Initialize a variable to hold the number of mismatches
num_mismatches = 0

# 3. Go through each pair of nucleotides to determine if they are the same 
for p_i, q_i in zip(p,q):
    if p_i != q_i: 
        num_mismatches += 1
        
# 4. Return the result
print(f"The Hamming Distance between the two DNA strings is num_mismatches={num_mismatches}")
print(f"\np = {p}\nq = {q}")

---

### [Complementing a Strand of DNA](http://rosalind.info/problems/revc/)

In DNA strings, symbols `A` and `T` are complements of each other, as are `C` and `G`. The reverse complement of a DNA string $s$ is the string $s_c$ formed by reversing the symbols of `s`, then taking the complement of each symbol (e.g., the reverse complement of `GTCA` is `TGAC`).

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A DNA string, <code>s</code>, with a maximum length of 1,000 nucleotides
    <br><b><i>Return</i></b>: The reverse complement of <code>s</code></p>
</div>
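
One possible alternative to a dictionary lookup is a translation table built with `str.maketrans`, sketched here. The names `COMPLEMENT` and `reverse_complement` are hypothetical. Note that `str.translate` leaves any character not in the table (for example `N` or lowercase letters) unchanged rather than raising an error.

```python
# Translation table mapping each base to its complement: A<->T, C<->G
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(s):
    """Complement every base in s, then reverse the result."""
    return s.translate(COMPLEMENT)[::-1]

print(reverse_complement("GTCA"))  # TGAC
```

Complementing and then reversing gives the same string as reversing and then complementing, so the order of the two steps does not matter.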

In [None]:
# 1. Start by initializing the DNA string that we want to determine the reverse complement of 
s = "GGGCCGTTGGT"

# 2. Create a dictionary to hold the nucleotide pairings
d = {"A":"T", "T":"A", "C":"G", "G":"C"}

# 3. Generate a list holding the complement of each nucleotide in our DNA string
s_c = [d[nt] for nt in s]

# 4. Reverse the string to produce the reverse complement strand and join the characters together
s_c = ''.join(s_c[::-1])

# 5. Return the result
print(f"For the DNA string\t\ts   =\t{s}")
print(f"The reverse complement is\ts_c =\t{s_c}")

---

### [Find the Most Frequent "Words" in a String](http://rosalind.info/problems/ba1b/)

**k-mers** are contiguous substrings of length `k`. We say that *Pattern* is a most frequent k-mer in *Text* if it maximizes `Count(Text, Pattern)`, the number of times *Pattern* occurs in *Text*, among all k-mers. For example, `ACTAT` is a most frequent 5-mer in `ACAACTATGCATCACTATCGGGAACTATCCT`, and `ATA` is a most frequent 3-mer of `CGATATATCCATAG`.

<div class="alert alert-block alert-success">
    <p><b>Exercise:</b></p>
    <p><b><i>Given</i></b>: A DNA string, <code>Text</code>, and an integer, <code>k</code>
    <br><b><i>Return</i></b>: All most frequent k-mers in <code>Text</code> in any order</p>
</div>
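
As a compact sketch of the definition above, `collections.Counter` can tally every k-mer, after which we keep the ones that reach the maximum count. The function name `frequent_words` is a hypothetical helper, and returning an empty list when `Text` is shorter than `k` is an assumption rather than part of the exercise.

```python
from collections import Counter

def frequent_words(text, k):
    """Return every k-mer that occurs the maximum number of times in text."""
    # Slide a window of length k across text and tally each k-mer
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text is shorter than k, so there are no k-mers
        return []
    max_count = max(counts.values())
    return [kmer for kmer, count in counts.items() if count == max_count]

print(frequent_words("ACAACTATGCATCACTATCGGGAACTATCCT", 5))  # ['ACTAT']
print(frequent_words("CGATATATCCATAG", 3))                   # ['ATA']
```

The function returns a list because several k-mers can tie for the highest count.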

In [None]:
# 1. Start by defining the DNA sequence we would like to evaluate
Text = "ACAACTATGCATCACTATCGGGAACTATCCT"

# 2. Define the k-mer length we want to evaluate
k = 5

# 3. Generate all k-mers within our DNA string
all_kmers = [Text[i:i+k] for i in range(len(Text)-k+1)]

# 4. Initialize a dictionary to tally the number of times we see each k-mer in our text
kmer_tallies = {}

# 5. Go through each k-mer within our text and determine the number of times it appears
for kmer in all_kmers:
    if kmer not in kmer_tallies:
        kmer_tallies[kmer] = 0
    kmer_tallies[kmer] += 1
    
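# 6. Find the highest tally, then return every k-mer that reaches it
max_count = max(kmer_tallies.values())
most_frequent_kmers = [kmer for kmer, count in kmer_tallies.items() if count == max_count]
most_frequent_kmers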