
# Lab 2: STR Algorithms
v.1
### Data Science for Biology
Developed by Sarp Dora Kurtoglu. <br> Adapted from and inspired by Harvard CS50x 2021, with permission from David Malan. <br>
See https://cs50.harvard.edu/x/2021/psets/6/dna/. <br>
The video at that link gives another introduction to this topic.

In [None]:
from datascience import Table
a, b, c = ... # answer key

## Background

#### Short Tandem Repeats (STRs)
Short tandem repeats (STRs) are short sequences of DNA that repeat *consecutively* at specific locations on the human genome. The number of times a given STR repeats varies greatly from person to person, so STR counts can be used to tell individuals apart.

For example, in the sequence below, the STR with the sequence "agat" repeats 6 times:

dna = aatgtc<mark>agatagatagatagatagatagat</mark>ccgttga
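
As a quick sanity check of the count above, the highlighted region should equal "agat" written six times in a row. The sketch below uses only the example sequence from this section; the names `repeat_unit` and `region` are ours, chosen for illustration:

```python
dna = "aatgtcagatagatagatagatagatagatccgttga"
repeat_unit = "agat"

# Multiplying a string by an integer repeats it, so this is "agat" written 6 times.
region = repeat_unit * 6

print(region in dna)           # True: a run of 6 consecutive copies is present
print(repeat_unit * 7 in dna)  # False: there is no run of 7 consecutive copies
```

This only confirms one specific count; it does not search for the longest run, which is what the rest of the lab builds toward.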

#### STR Analysis in Forensics
Each person has a fixed number of repeats of each STR at specific locations on their chromosomes, and these numbers vary between individuals in the population. Law enforcement agencies, including the FBI, use STR analysis for computational forensics and DNA fingerprinting. The FBI uses 20 CODIS (Combined DNA Index System) core loci for its forensic analyses. In this lab, we will take a simplified approach and explore only a few STR sequences in an unknown individual's DNA sequence.


If you would like to learn more about how the FBI uses STRs in forensic analyses, you can find more details here: https://www.fbi.gov/services/laboratory/biometric-analysis/codis/codis-and-ndis-fact-sheet

## Python Review

### While Loops
##### **Syntax:** while [condition]:
            
A while loop keeps running the lines of code under it as long as its condition is true. Once the condition becomes false, the code under the while loop stops running.

To count how many t nucleotides are present in the first 30 nucleotides of a DNA sequence, we can use a while loop. See the example below:

In [None]:
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

index = 0
t_count = 0
dna_len = 30
while index < dna_len:
    if dna[index] == "t":
        t_count += 1 # every time we see a t, we increase our count by 1
    index += 1 # we have to increase our index by 1 every time, or the while loop will run forever

print("Number of t nucleotides is", t_count)


Number of t nucleotides is 7
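
One caveat with the loop above: if the sequence were shorter than `dna_len`, `dna[index]` would raise an `IndexError`. Here is a minimal sketch of one way to guard against that, assuming we simply want to stop at the end of the string. The short sequence `short_dna` is made up for illustration:

```python
short_dna = "gtagct"  # hypothetical sequence, shorter than 30 nucleotides

index = 0
t_count = 0
dna_len = 30
# Stop at whichever comes first: 30 nucleotides or the end of the string.
while index < dna_len and index < len(short_dna):
    if short_dna[index] == "t":
        t_count += 1
    index += 1

print("Number of t nucleotides is", t_count)
```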


### For Loops
##### **Syntax:** for [index] in [iterable]:

A for loop iterates through each element of an iterable object (such as a list or a string) in order. For example, a for loop can iterate through each nucleotide in a DNA sequence and count how many t nucleotides are present in the whole sequence. See the example below:

In [None]:
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

t_count = 0
for nuc in dna: # every time the for loop runs, nuc takes on the next character in the dna string
    if nuc == "t":
        t_count += 1
print(t_count)


18


To count the t nucleotides in only the first 30 nucleotides of the DNA, we need to run the for loop a fixed number of times. See the example below:

In [None]:
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

t_count = 0
for index in range(30): # range(x) yields the integers from 0 to x-1
    if dna[index] == "t":
        t_count += 1
print(t_count)

7
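
To see exactly which integers `range` produces, you can convert it to a list. In Python 3, `range` is a lazy sequence rather than a list, but it yields the same integers. A small sketch (the variable names are ours):

```python
first_five = list(range(5))    # 0 up to, but not including, 5
evens = list(range(0, 10, 2))  # start, stop, step

print(first_five)  # [0, 1, 2, 3, 4]
print(evens)       # [0, 2, 4, 6, 8]
```

The three-argument form `range(start, stop, step)` is handy whenever you want to move through a sequence in jumps larger than 1.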


You can also accomplish this by using <mark> break</mark>. Once the program encounters break in a loop, it leaves that loop immediately:

In [None]:
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

t_count = 0
index = 0
for nuc in dna:
    if index == 30:
        break # break exits the loop it is called in
    if nuc == "t":
        t_count += 1
    index += 1 # we need to increment it to keep track of how many times the for loop ran

print(t_count)

7
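
The manual `index` counter in the example above can also be handled by Python's built-in `enumerate`, which pairs each element with its position. This is a sketch of the same computation, not a required style:

```python
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

t_count = 0
for index, nuc in enumerate(dna):  # enumerate yields (position, character) pairs
    if index == 30:
        break  # stop once the first 30 nucleotides have been examined
    if nuc == "t":
        t_count += 1

print(t_count)  # 7
```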


Lastly, the most concise way to tackle the same problem with a loop is to slice the string, as follows:

In [None]:
t_count = 0
for nuc in dna[0:30]: # dna[0:30] is the first 30 characters of dna; nuc takes on each of them in turn
    if nuc == "t":
        t_count += 1

print(t_count)

7
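
For counting a single character, Python strings also offer the built-in method `str.count`, which removes the need for an explicit loop. A minimal sketch on the same sequence:

```python
dna = "gtagctagctacattatagctgagcggccgtcgattgctagtgctagcatgcgttagctagcatc"

# Slice the first 30 nucleotides, then count the occurrences of "t" in that slice.
t_count = dna[:30].count("t")

print(t_count)  # 7
```

Note that `str.count` counts non-overlapping occurrences anywhere in the string, so it does not tell you whether repeats are *consecutive*. That is why the STR questions later in this lab still need loops.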


### Nested Loops
We can also combine multiple loops together, in what is called nested loops, to parse through non-linear structures. In nested loops, one loop is inside another one, so at each iteration of the outer loop, the inner loop runs to completion. For example, to iterate through each cell of a 4x4 matrix, we would need to use a nested loop, as shown below:

In [None]:

matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
matrix # This is a 4x4 matrix.
# You can imagine each inner list as a column. The first row would be the first elements of each inner list: [0, 4, 8, 12]
# [0, 4, 8, 12 ]
# [1, 5, 9, 13 ]
# [2, 6, 10, 14]
# [3, 7, 11, 15]

for i in range(len(matrix)): # i iterates over the column indices
    for j in range(len(matrix[i])): # j iterates over the row indices within column i
        print(matrix[i][j]) # prints each column completely, in order
    print("\n")


0
1
2
3


4
5
6
7


8
9
10
11


12
13
14
15




We could also set up our nested loops differently to accomplish the same thing:

In [None]:
# [0, 4, 8, 12 ]
# [1, 5, 9, 13 ]
# [2, 6, 10, 14]
# [3, 7, 11, 15]

for i in range(len(matrix)): # i iterates over the column indices
    for j in matrix[i]: # iterates through the column; j stands for each element of a particular column (matrix[i])
        print(j) # prints each column completely, in order
    print("\n")

0
1
2
3


4
5
6
7


8
9
10
11


12
13
14
15
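
If you want both the position and the value of every cell, `enumerate` works in nested loops as well. The sketch below records each visit as an `(i, j, value)` tuple; the list name `visited` is ours and is used only for illustration:

```python
matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

visited = []
for i, column in enumerate(matrix):     # i is the position of the inner list
    for j, value in enumerate(column):  # j is the position inside that inner list
        visited.append((i, j, value))

print(len(visited))  # 16, one entry per cell
print(visited[:5])   # [(0, 0, 0), (0, 1, 1), (0, 2, 2), (0, 3, 3), (1, 0, 4)]
```

Seeing the index pairs makes it clear that the inner loop finishes completely before the outer loop moves on.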




### Exercise
In the above examples, we printed the elements of each column one per line, but we lost the matrix layout. Now, print out the matrix from the above examples in the shape of a 4x4 matrix, retaining the columns and rows:

In [None]:
# Exercise

# Your output should exactly look like this matrix:
# [0, 4, 8, 12 ]
# [1, 5, 9, 13 ]
# [2, 6, 10, 14]
# [3, 7, 11, 15]


### Example with Nested Loops
Let's work through a harder example with nested loops. Say we want to calculate the length of the longest stretch of a DNA sequence that contains no Thymine nucleotide.

One way to do this is to start at position 0 in the DNA sequence <mark> seq</mark>, look at each nucleotide one at a time, and count the nucleotides until we encounter the first “t”. Then we do the same thing starting at position 1, then at position 2, and so on.

So, starting at each position, we calculate the length of the stretch of DNA without Thymine. Then we take the highest value, which is the length of the longest section of the DNA sequence without a Thymine. We can do this as shown below:

In [None]:
seq = "atatatagagcacgagcagcatcgacatcagcagcgacgacagcacgacagaggagcagcacccggcgcgcgagcagcattagcagcatc"

max_len = 0 # our maximum length without a Thymine; starts at 0 and is updated throughout (named max_len so it does not shadow Python's built-in max)
for index in range(len(seq)): # index takes on the position of each nucleotide in the sequence
    cur_len = 0 # counts the run without a Thymine that starts at position index
    for nuc in seq[index:]: # seq[index:] drops the nucleotides at positions smaller than index
        if nuc == "t":
            break # once we see a Thymine, leave the inner loop
        cur_len += 1 # increase the count as long as we don't see a Thymine
    if cur_len > max_len: # compare after the inner loop, so a run that reaches the end of seq is counted too
        max_len = cur_len
max_len

51
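
As a cross-check on the nested-loop result, the same length can be found by splitting the sequence at every "t" and keeping the length of the longest piece. This is a sketch of an alternative technique, not the approach this lab asks you to practice; the names `longest` and `piece` are ours:

```python
seq = "atatatagagcacgagcagcatcgacatcagcagcgacgacagcacgacagaggagcagcacccggcgcgcgagcagcattagcagcatc"

longest = 0
for piece in seq.split("t"):  # the pieces are the stretches between "t" characters
    if len(piece) > longest:
        longest = len(piece)

print(longest)  # 51
```

A single pass over the pieces is enough here because every Thymine-free stretch is exactly one of the pieces produced by `split`.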

## Trying to Identify a Match
You are a forensics analyst at the FBI. One day, your coworkers bring you a DNA sample from a nearby crime scene. Your task is to find out whether this DNA sample has any matches in your DNA databases!

We have acquired the following DNA sequence <mark> sample_dna</mark> from the crime scene:

In [None]:
sample_dna = 'GGGGGACGGTTTGTCTCACGCCTGTTGGTACCCTGAGTCCCCCACAATACCACAACCGTCGATCTTGAACGGTACCCTCATAGCGATAGAGACCCTTGCGGCTGAATTGCAATTACTCCAGTCACTTCCTGCCGGTGACCTGTCATGTATGAACAGGAGTCTCCTACTGGAGTAGTCACTTTTTGTAGACACAACTCGTGACTCACGGCGGGCGTGACTCCGCCTTATCACTTGGTTTGTATAGGGCCACTGGAGCCTGCGCTTGAGTCTTCGCTTGTAGCAGTAACGTTCCGGGAACACCCGTTGGCTATTCGGCGCCCGCGAGGTGCAGCGTAGCATTTTCGCGCCTCCAACGTCATATTAACATCCGAAAGGACCTATCTGGAACCAGACATCTTCGTTGCGTTCTATATATCCGCCCCACGTGGATGGATGTCGGGCTCGTTGCCAATAGAGCCTTATTAGCGTTTTCTGGACAATCGCAAGCAGGTCTGAGATGACTGCACCTATTTTCATGGGCACGAGTCTCTTGTCGGCACCTACGACTTATGAACCCGAACATGCATTGTTCCCGAGCACAGCGGACACGCAGTCGCCTTCCCTTACGCCATAGTCTGGGCCCCCACGGACTATCCCTCTAAGCACCTCGTCCGCTGACTATATTTCGAGTGAGTAACCGCTCCACGTGATGCCCCGGTTTGACAACATTACGCAGAGTCTCACAGCCATTGCTTCCGCCATGACATTCCGCTGGCTTCAACATAAATACACCCTACACAGAGACGGTTCTTCGGTCCGAAGTTCGGAGCTGCTTAGGTTGCAAGCAAAACCCTTAGGTCGCCAACCGTGTAGTTAGTCTCTGGGGGCGGGATGGGTCCTAGCCAGAATGATGATTGAAGAACCGTTCAGCAATTTTTGTTAGCGACTTTCCAGACGCGGTACCATGCGGTAGTACCGAAGGAGTACCCAAAGGTGGGCTCCGTTTTAAAAAAGGTAGCTGGGGCACCGGGAGCATCACGAGTTGGGTCGACTGTGAGAGTTTAGCACCAGACTTAGCGCTGTACCTACGACCTGGCTGTTGCACACGTAATACCGAAGTATAAATCACCTGTTAAACACGATGGACTAGTGACTTGTAGCGGTCTCCGCGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACCGTAAGAAGATAACTAACACCCCTGATTGTACTATGATAAGGTGATATCTAACCATCAATCAATTGGAAACATTATAAATATAAGTCTTTGTGGTCGCTATCCTTGACTCGCAATGATCAAGGACACGGCTGCCTGATGTGGTCCGCCCTTATTTTTAACTCGAGGGACGGCCAAGGGGGTATATGCGACAGCCACCAAACCATCCTGACTCGCAAATCCCAACCATTAATGATGCTTAATTTGTTGGACGATAGGTAAATATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCGACTGGAAGTCCTTCAGCACGACCCGAGATCATGGCCAGATATCTCCCGGCTACGACCAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGCGAGCGTGATTATAGACGACCATCCACTTCACTTGTGTTTCTGTAGGTCATCAGATCACGTCATCACCGAGGAATAGCTGTGACAATCCGTTAAACTATGGACGCGAAGGGCGAGATGTTATCAGAACCTTA
CCGTTGCCGCGAGCAGGAGCGTACTACAGGCGGGGGCGCACGTGCCTCTAGTCCGGCTGTGAAAAAGTTGGTCCAGCACTACGTTCTGCCGCCCACAAGTGGAGTACGCCAGTGGGGGTGATGCTGCATTCTGCGCTCCAATGGTTACGTCGGTCCAGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTGACCACTATATCAAGGTGTCTTTCGGTCCAGGCTAGGAGGAACTTACCTCGCGGGAATAAGATCCCCAGTCCTTTTTAGCTCGACATTCGTGTTTTCTTGGGGACAGTAACTCATCATCCAAGGTGTAAAAGTATTCTAAGGTGCTCTCTTCTATGATCACACACTGCATCCGCTACTCAACATAGCTCGCGGGCCAACAACGTAGCGAAAGGTCGATTCTCCATACAAAAATATCATACTAGCCGGTTAACTAAACGGATTACCTCTTCCGCCGGAACATCTGAATCCATTACTACCGTGTGGTGCATGTTCACATCTTCGCGGTACATGCAGGAACAGTATGTAATGCTCTCGCGAGCCCTATGTGGCCAGGCTAAATGGGTTTACTGCTCAGCTAGTCCCGCCGAAATGACCATCAAATTGCCCGCTCACCCTTATCGCAGACTTCTACGAGAACGAGGTCCCGTTGGGATCAGAGGGAACCCCCAATTGTCAACGCTCCGTGGGTAGGAAAGATCACACCATGTTATCGAAAACATGGGTCCTCTGTATGTCGTAAAGGACCCAGCGAAAAAATTTATTAACGTGGATGGTTACCAGCTTGTTTTAAGAGCACGTCGACCGACACCAAGTCAATCTATTCCCCTCCATATCACGTTGCGAAGAGGTGCAGAGGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGAGGGCATCACACGGCGGTATACGCATTCAGACCTACACGATGTAAACGCCCAGTTAGGCGCAGATTATCACTGTACACCGACGTTCACCGCTGATAGGGTACTACCTCTCTAGAAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCGTTCCTAGTTGCTACCCAACTTGCCTCAATCGGTAGCACCTGAGCACTACGTGTAAACCTAGTAATGTTATGTACCGTGTGGCCGGAGTCTCAGCGACCAGAGAGTGCGAATTGCTCGAATATAGCCTCCGCCAAGTGATATCATAGTCCGGGTAACCAAGTAGTCTCTGTCCGACTAGGCCGCAAGCTTAGTCTAATCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGTCTAGACCAACCTAAGTCCTTC
GCGTGGGCTTTGACGGTCGGAAAAGGCAGGCCTGCCTTTCTCAATGTTGGATACAGTACCACACATCACTCCGCCGTGTTAGAGCGTAGGGAGACGTTGAGAACCGAACTACCGTGTCGCACATCATGATTTGCTTAATCAATTATTCGGTTACGGTATTCACGCGGACTCCGACGTATAGAATATAAGGTTTACACATGCGAAGAGCCCTGTGTATAGTTGCCCGCATCGAGGGTAACTATGTAAGCTATTGTGTGTGAGTAGTTAGCTACTAACGTGAGCCACCTTAAAACGCTTTGCAGTGAGTTAGGACCTTCGGGGTCCGTGCCACCTTACGTTCGGCACTTCTCTACTTCCTGGCCACCAGCGTTCAGTCGACGCAGGGTAACCGTTCGCGCCACCAAGATAGAAAAGCGGGTCAGGGAGATAATGTACTATTTACTCTATACTACATCACAGGATAATGGTAGAAGCGTCTGAAATCCACCATCGCTCTCCAAATGGGTGGTCGTTAGTGCATTGAGGACTGATCATGCGGCCGCTTTGACGAGCAGGGAGATCTTATGCTGTATCAGGATATGTCGGAGCTTATCCAGCGGCCACGCCGAAGGGTCACCCCATCAACCAGTCGGCCATGTATGTATCCCACCGAACCTGTGATAGTCACTGATTCCGGACTAAGACAGCCGAGTATACAACAGTTTCGCCACATGACGCAGGGGACTATGTCGACCGATACGACTTCCAGTATCTTCTGTACCTACAGATTCCACTGCATAACTGTTAATAGCAGAAGAGAGTCTGCATACGATGGCTAATTAACGGGTCTACCGAACTGCGTCCCTGGCAGATCTTGCTAGCACTCCGCTGCAAGCACACTGAGGCTTAAGCTATTATACCATCTTGCTCTGGTTGCAGACAGGTATGATTCACTCTGTACCTCCGGATCTTTAGCCTTGCATTTACGTCCGGCCGTAACGCAGAAAATTATCTACATGTCTGTCAGTCTCGCGGCCTGCACCGAGTGCTCGTCGACGCTCTGGCCAGGGCGAAGATCAGATTCCGTTTGACTTAAGGGTAGTTAGTAGGCGGCCAACGTCTGGTGAAAGGGCTGTTAGACTTGTAGACTTCCGTCATTGGGTCTAATACTGCTATTGATCTATGCCTGCCAATAACACGATGATTAGCTGTATAGTGGGTAATCAGGGAACGTTTACGTTCCTCAAGAAAGAACTCAAAGCCCCAGATAGAAAATACAGACTACCCAACTTGCGGGTTCCACTATATGTTCACCCGTTACTTGCTGCAAACCGAGATCACCCAGCATTTCGACCACGAGCCTTAGATCCAAATGCCACACACCAGTTTAGCGATGAATGTGTATAACGGGTTCTTACAGAGAACTACCAATCCGTAGCCTACGACACCACGGACTCTAATCACCAGGAAGATACAGAACTTGTGGCTGTCTACTGACCGTACGGGTTTGGCCAAAATGAAGCCGCCTAATGGTGTCACCGAGACAGTGTCTCGCTTACCAAGTCCGAGGCCCCATCACGAAGCGTCTGTAGATAAGGATCCGTTCCGGCGCCTGGGACCCGATCGAGAGTTCAGCTGCTGAAAGGGGTGGCGTCGTATGCATCCTTACCTACATCAGTTACGTGCCTCCACCGTTCACAGGGTGGGAGCCTTGACTCTTGCGGAGGGCGGACTGCTTTTAGCGCCTACAGGTCGTACTGATTTAACAAAGCAGAAGCGCTGCCGGGGTCTCATATGGACTATATAAGCGACACACCTTGGACCGCAACTCGAAGACTTCTGGCCGCTAGTGCCCAGGATGTAAACCGAACTACGCCAAGTTGGGAAGAGAAAATCCATCTGACTGCACGGTTCTGTGGAAGATAATGGCCCTAACTTTGCTCTATTAATAAACTATTGACGCCCAAAAAGAACATGTACGGACTGACTGGTACTTCAATTAGCCACTGGTAGTAGTCCTC
GATTCAAGGAATCTGTGACAGTCGTACCGCTATATTAATAAATCGCTAAGTGCCAAAGGTCCGACCGTACTGACTGCTGATGTACTTGATTCACGGGATGTTAATAGGAGCTTTGCAATGTGTGCTTAAGTCCAGTCATCTACGCGCTTGCGGCTTGATCCCAGAGTATAACACATAAATCAGACTGTAGCG'

### Import the Data
You have a database containing information about possible suspects and information about their genomes, specifically their STR counts. Let's import this database:

The table above shows, for each individual, the longest run of consecutive repeats of each STR in their DNA sequence. The first row holds the STR names, and every row after it corresponds to one person. Each column corresponds to a specific STR and its counts in the individuals' genomes.

### Getting the STR Counts from the Sample DNA
To match the sample DNA to an individual in the database, we first need to calculate the STR counts (the longest run of consecutive repeats of each STR) in the sample sequence. We will build the algorithm step by step.

We recommend watching this walkthrough video to get an idea of the overall goal of our algorithm: https://www.youtube.com/watch?v=j84b_EgntcQ. Please ignore minutes 4:03-5:15, because we will instead turn the CSV database into a table using the datascience package.

Note: The questions in this lab build on each other, eventually leading you to the final solution. The final solution is quite long, which is why the questions tackle the problem step by step.


##### Question  1
To find the longest run of consecutive STRs, we need to look at the nucleotides of the sequence one at a time and compare against a specific STR sequence to see whether that nucleotide is the start of a run of consecutive STR copies. For this, first design a loop that parses through the <mark> sample_dna</mark> sequence:

In [None]:
# Question 1

##### Question 2
Next, let's gather all of the possible STR types together. Create a list called <mark> str_list</mark> that contains all of the STR types (as strings). Hint: The first row of the database tells you which STRs (short DNA sequences) to look for in the given DNA, so you can add each STR name from that row to a list. You can use <mark> list.append()</mark> to add items to the list. Do not hardcode str_list; use the database provided:

In [None]:
# Question 2
str_list = []

##### Question 3
Write a function called <mark> str_match(i, str, dna)</mark> that takes a start position integer <mark> i</mark>, an STR sequence <mark>str</mark> (as a string), and a DNA sequence string <mark> dna</mark> as inputs. This function will be useful later on and will help us simplify our code. The function should check whether the DNA, starting at position i, matches the chosen str. It should return True if it matches and False otherwise. Hint: Use dna[i : j] to take a substring and compare it with str. You can use len() to get the length of the str sequence and determine the position j where the substring ends:
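
To illustrate the slicing hint without giving away the function itself, here is a small sketch of how `dna[i : j]` behaves, including what happens when the slice runs past the end of the string. The sequence is the one used in the test case for this question; the variable names `inside` and `past_end` are ours:

```python
dna = "gtcaagt"

inside = dna[3:7]    # the characters at positions 3, 4, 5 and 6
past_end = dna[5:9]  # runs past the end: Python returns what is there, with no error

print(inside)         # aagt
print(past_end)       # gt
print(len(past_end))  # 2
```

Because a slice that runs off the end is simply shorter, comparing it with a longer string gives False rather than raising an error.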

In [None]:
# Question 3, fill in the ellipsis

# Output: True or False
def str_match(i, str, dna):
  # Your Code Here
  ...
  return ...



In [None]:
# Test Case for Question 3

def test_q3():
  output1 = str_match(3, 'aagt', 'gtcaagt')
  output2 = str_match(2, 'cgta', 'cgtac')
  if (output1 == True) & (output2 == False):
    print('Test Case Passed!')
  else:
    print('Test Case Failed')

test_q3()

Test Case Passed!


##### Question 4
Now, combine the str_match() function with the for loop from Question 1. If str_match finds the str at a particular nucleotide of the DNA sequence, your code should keep checking whether the following pieces of DNA, each of the same length as the str, also match the str sequence, until one finally does not match. Hint: You will need to use a nested loop.
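
As a hint about moving through DNA in STR-sized jumps, the sketch below shows how `range(start, stop, step)` visits the start of each chunk. It is a toy illustration, not a solution: the sequence `toy_dna`, the starting position, and the 4-letter unit are all made up for this example:

```python
toy_dna = "ccagatagatagatcc"  # hypothetical sequence
unit_len = 4                  # length of the hypothetical STR "agat"
start = 2                     # position where the first copy begins

chunks = []
# Visit positions 2, 6, 10, 14: the start of each 4-letter chunk from position 2 onward.
for pos in range(start, len(toy_dna), unit_len):
    chunks.append(toy_dna[pos:pos + unit_len])

print(chunks)  # ['agat', 'agat', 'agat', 'cc']
```

Here the first three chunks equal the STR and the fourth does not, which is the point where a run of consecutive matches would end.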

In [None]:
# Question 4
# For now, only check for the first str from the str_list for simplicity as provided below.
# str = str_list[0] #Uncomment this line of code once you have built str_list

for i in ... :
  for j in ... :


# Assign your answer to this variable
matches = ...

In [None]:
# DO NOT MODIFY THE CONTENTS OF THIS CELL
assert matches == a

##### Question 5
Next, introduce a way to count the maximum number of consecutive STRs matched. Again, only do it on one type of str for simplicity. Your code should look at each nucleotide in sample_dna, calculate the longest run of the str starting from that nucleotide, and after having checked each nucleotide, return the maximum number of consecutive STRs.

In [None]:
# Question 5



# Assign your answer to this variable
max_consec_str = ...

In [None]:
# DO NOT MODIFY THE CONTENTS OF THIS CELL
assert max_consec_str == b

##### Question 6
Turn what you have in Question 5 into a function called <mark> strCounting(dna, str_list)</mark>. The inputs are the DNA sequence and the str_list that you built previously. The function should print the STR counts for each of the STRs. When printing, print the STR sequence and the STR count together. The only difference between this function and the loop from Question 5 is that the function should calculate the maximum number of consecutive STRs for *each* STR in the str_list. Hint: Use a for loop to parse through str_list and apply your code from Question 5 to each different str.

In [None]:
# Question 6, fill in the ellipsis

# Output: Print the STR counts for each of the STRs
def strCounting(dna, str_list):
  # Your code here

  for str in ... :  # Calculate the max number of consecutive STRs for each STR
    ...
    print(...)

In [None]:
# DO NOT MODIFY THE CONTENTS OF THIS CELL
assert strCounting() == c

### Build a STR Table for the Sample DNA
##### Question 7
Now that we know the STR counts in the DNA sequence, change the function you created in Question 6 so that it adds every STR count and the STR sequence into a table called <mark> sample_dna_strs</mark> (should not print the STR counts anymore). The table should be in the same structure as the database provided in the beginning. It should have 2 rows only (one of them being all of the STR titles and the other the STR counts for each of the STR titles).

In [None]:
# Question 7, fill in the ellipsis

sample_dna_strs = Table()     # Create a table that will store STR counts

# Output: A table containing 2 rows, one containing the STR titles and one containing STR counts
def strCounting_table(dna, str_list):
  # Your code here
  sample_dna_strs.append(...)   # append the STR titles

  for str in ... :
    ...
    sample_dna_strs.append(...)     # add the max STR counts into the table

  return sample_dna_strs     # Return the table


In [None]:
# DO NOT MODIFY THE CONTENTS OF THIS CELL
sample_dna_strs.head()

### Who Is the Match?
##### Question 8
Having found all of the STR counts in the sample DNA sequence, let's now find out who it matches! Write a function called <mark> match(sample_dna_strs, database)</mark>, which takes two tables and prints out the name from the database whose STR counts match the sample DNA's. If no match is found, the function should print "No match found!" Hint: The <mark> match()</mark> function should take <mark> sample_dna_strs</mark> and <mark> database</mark> as inputs:

In [None]:
# Question 8
def match(sample_dna_strs, database):
    print("")

##### Question 9
Finally, use the match() function you have just created to find out who is the match!

In [None]:
# Question 9

Congrats! You have finished LAB 2!