# **Welcome to Week 2!** 
This week we'll use the building blocks you learned last week and in the advanced section of Intro to Python to get you closer to designing your own CRISPR reagents. 


---
# Comparing two DNA sequences 
Last week, we were working with a segment of the `rol-6` gene. We're going to work with this sequence again this week. We've copied the sequence here; store it as a string in the code cell *below*. 

`GCACATTCCAATTGTAGacaaattca`

In [None]:
# Store your DNA sequence here

A grad student in Bob Johnston's lab was sequencing the genome of one of their worm strains. The student noticed that their worms had a slightly different DNA sequence in the `rol-6` gene region. In place of the sequence above, their worms had the sequence `GCTTATTGGAATTGTAGagtaattca`. We'd like to try to quantify how different their sequence is from our sequence. Store Bob's new sequence as its own variable in the code cell below so we can compare. 

In [None]:
# Store Bob's sequence here 

We want to know at which positions our two sequences differ. (In this case we can just look by eye, but you'll often be dealing with sequences thousands of nucleotides long that don't even fit on a single page, and in those cases you'll want an automated process.) 

Go through both sequences and check whether they're the same at each position. Have your code print the position of each nucleotide where your sequence is not the same as Bob's. We only care about the nucleotide itself, so it doesn't matter whether it's coding or non-coding (but remember, to Python, `a` is not the same as `A`).
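If you're not sure how to get started, here is a minimal sketch of a position-by-position comparison. It uses two short, made-up strings (`seq_a` and `seq_b` are hypothetical names, not the `rol-6` sequences) and assumes both strings are the same length. Note that Python counts positions starting from 0.

```python
# Two short, made-up sequences of equal length (hypothetical example data)
seq_a = "ACGTac"
seq_b = "ACCTaG"

n_mismatches = 0
for position in range(len(seq_a)):
    # Compare in uppercase so coding vs. non-coding case is ignored
    if seq_a[position].upper() != seq_b[position].upper():
        print(position)
        n_mismatches += 1
```

Converting both characters with `.upper()` before comparing is one way to ignore case; converting both with `.lower()` would work just as well.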

In [None]:
# Write code to print the position where the nucleotide in Bob's sequence is different from the nucleotide in yours 

This works, but the output is a little bit ugly. Last week, you learned how to concatenate (combine) strings together. We can use this concept here to make the output nicer.

Copy your code from directly above into the code block below. Update your code so that instead of just printing the positions of the bases that are different, it prints an informative message for each (e.g., "The sequences differ at position 2" rather than just the number 2). 

#@title <font color='green'>Click here for a hint</font>
#Hint: Remember, you can't concatenate strings and integers together. You'll need to find some way to convert the integers to strings.
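
The conversion the hint describes can be sketched as follows: `str()` turns an integer into a string so that it can be joined to other strings with `+`. The variable names and the value `2` are made up for illustration.

```python
position = 2  # hypothetical example value

# str() converts the integer to a string so it can be concatenated
message = "The sequences differ at position " + str(position)
print(message)
```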

In [None]:
# Update your comparison code here; instead of just printing the position number, print an informative message

You'd like to share the positions that differ with Bob and his grad student. Specifically, you'd like to send Bob a **list** of the positions where your sequences differ.

Again, copy your code from above into the code cell below. Update your code so that instead of printing out the position of each difference on its own line, it instead generates a list containing the positions of each nucleotide that is different between the sequences. When you're done, print the list.
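
The general pattern of building up a list inside a loop might look like the sketch below. It uses an unrelated, made-up list of numbers (`numbers` and `odd_positions` are hypothetical names) so you can adapt the idea to your sequences: start with an empty list, then `.append()` a position each time a condition is met.

```python
numbers = [4, 8, 15, 16, 23, 42]  # hypothetical example data

# Start with an empty list and add to it inside the loop
odd_positions = []
for position in range(len(numbers)):
    if numbers[position] % 2 == 1:
        odd_positions.append(position)

# Print once, after the loop has finished
print(odd_positions)
```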

In [None]:
# Paste your comparison code here, updating it to generate a list of positions where the sequences differ 

---
# Identifying coding and noncoding regions of a gene sequence  

Genes have both coding regions (exons) and non-coding regions (introns and UTRs). It is sometimes useful to separate the coding and non-coding regions from some gene sequence (not only when designing CRISPR reagents, but also for a number of other applications). WormBase — the database we are using to get the sequence of the gene we'll modify with CRISPR — has a useful format for denoting coding and non-coding regions. In WormBase, coding nucleotides are reported as uppercase characters, and non-coding nucleotides are reported as lowercase characters. Here, we'll use this format to count the number of coding and non-coding nucleotides in our `rol-6` sequence.

In the code cell below, count the number of coding and non-coding nucleotides in the `rol-6` sequence, and store these numbers in variables. Finally, print out the variables. There are three hints for you if you need them, but try it on your own first. 
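
If you get stuck even after the hints, here is one possible sketch of the counting pattern, run on a short made-up sequence rather than `rol-6`. It assumes the sequence contains only letters and uses the variable names suggested in Hint 3; `example_seq` is a hypothetical name.

```python
example_seq = "ATGccgTA"  # made-up sequence, not rol-6

nucleotides_coding = 0
nucleotides_noncoding = 0

for position in range(len(example_seq)):
    nucleotide = example_seq[position]
    # An uppercase letter is unchanged by .upper(), so it is coding
    if nucleotide == nucleotide.upper():
        nucleotides_coding += 1
    else:
        nucleotides_noncoding += 1

print(nucleotides_coding)
print(nucleotides_noncoding)
```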

In [None]:
#@title <font color='green'>Click here for Hint 1</font>
%%capture

'''
Hint 1: You can check whether your nucleotide is in a coding region by checking whether the nucleotide is the same when the method `.upper()` is applied.
'''

In [None]:
#@title <font color='green'>Click here for Hint 2</font>
%%capture

'''
Hint 2: The `range()` function might be useful to you here.
'''

In [None]:
#@title <font color='green'>Click here for Hint 3</font>
%%capture

'''
Hint 3: It may help to start by defining the variables `nucleotides_coding` and `nucleotides_noncoding` and then use a loop and a conditional to increase each based on the input sequence. 
'''

In [None]:
# Write code to count coding and noncoding nucleotides in your sequence 

In [None]:
# Use the `print()` function to check your work (since the sequence is short enough for you to count the nucleotides)

---
# Iterating through a dictionary

Recall that a `for` loop allows you to interact with each element in a data structure. Above, we looped through strings, but we can also loop through lists and dictionaries.

Last week, we created a dictionary that stored several _C. elegans_ genes and the phenotypes that each gene is associated with. Here, we'll be working with a similar dictionary. Below is the same set of genes, along with the number of coding nucleotides in each gene (we're assuming no alternative splicing). These numbers are _very_ made up.

*   `lin` (48)
*   `unc` (36)
*   `sur` (96)
*   `bli` (9)
*   `ste` (27) 
*   `let` (27)
*   `rol` (18) 
*   `egl` (36)

In the code cell below, store these values in a dictionary and give it a variable name.

In [None]:
# Store your dictionary of genes and their numbers of coding nucleotides here 

We have the number of coding nucleotides for each gene, but now we want to find the **length of each gene in amino acids**. Recall that one codon is three nucleotides and codes for one amino acid (e.g., `CTT` codes for Leucine). 

Using your dictionary above, create a new dictionary that encodes, for each gene, the length of the gene in amino acids.
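
As a rough sketch of the idea, the loop below walks through a small dictionary and builds a second one from it. The gene names and numbers are hypothetical placeholders, not the genes listed above, and the sketch assumes every count is a multiple of three.

```python
# Hypothetical genes and coding-nucleotide counts
coding_nucleotides = {"geneA": 30, "geneB": 12}

amino_acid_lengths = {}
for gene, n_nucleotides in coding_nucleotides.items():
    # Three nucleotides per codon, one amino acid per codon
    amino_acid_lengths[gene] = n_nucleotides // 3

print(amino_acid_lengths)
```

Here `//` is integer division, so the result is a whole number rather than a float like `10.0`.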

In [None]:
# Convert number of nucleotides to number of amino acids

# Use a print statement to check your work 

---
# "Inverting" a Dictionary
## Using your codon table 

The codon table you made last week allows you to look up a codon (`key`) and find the amino acid (`value`) it codes for. In the cell below, copy your codon table from last week's homework and store it as a variable. 

In [None]:
# Store your original codon table here

Sometimes it's also useful to be able to look up which codons encode a given amino acid (we'll actually need something like this when we're designing a repair template for our CRISPR experiment).

Our existing codon table can't do this. If you tried to look up the codons that encode Serine using something like `print(codon_table['S'])`, it would give you an error (a `KeyError`), because `'S'` is a value in the table, not a key. A dictionary lets us look up a value from its key, but not the other way around.

We instead need to make a dictionary that allows us to map from some given amino acid to a list of each of the codons (DNA) that encode that amino acid.
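
To make the goal concrete, here is a small sketch of what inverting looks like on a tiny, partial table. `mini_table` is a hypothetical three-codon subset used only for illustration; your full codon table will have many more entries.

```python
# Tiny, partial codon table (hypothetical subset for illustration)
mini_table = {"TCT": "S", "TCC": "S", "ATG": "M"}

inverted = {}
for codon, amino_acid in mini_table.items():
    # First time we see this amino acid: start an empty list for it
    if amino_acid not in inverted:
        inverted[amino_acid] = []
    inverted[amino_acid].append(codon)

print(inverted)
```

Each value in `inverted` is a list because several codons can encode the same amino acid.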

In the code cell below, write code to "invert" your **codon-to-amino acid** dictionary from above to an **amino acid-to-codon** dictionary. 

There is a hint below, if you think you need it.

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Your new dictionary should have amino acids (or STOP) as keys
and a list or set of codons that encode that amino acid as values.

Create an empty output dictionary. Loop through the original dictionary,
adding each amino acid to the output dictionary if it does not already appear 
there, or adding to the associated list/set of codons if it is already
in the output dictionary.
'''

In [None]:
# Write code to invert your codon table here 

Use this new dictionary to print out all codons that encode Serine (`'S'`).

In [None]:
# What codons encode Serine? 

---
# Translating a gene from DNA to amino acids

## Practice building functions 

Think about all the built-in functions in Python you've used so far. These are extremely handy and you'll continue to rely on them through the rest of this module.


In [None]:
# List some of the built-in functions here as a comment, 
# along with some notes about what they do and why you might want to use them

But sometimes you need a function that does something more specific, or something more complicated... basically a custom function that **you** build to do exactly what you want for your data. 

The general syntax to build a function in Python is below.

<font color="red">**NOTE:**</font> You can include other functions (either built-in or functions you've previously defined) in a new function you're creating. 

```
# Define the function by naming it. In parentheses, list any inputs that the function will use

def my_function(input1, input2):

  # Indented by one tab, you'll have code that manipulates those inputs
  # in some way. This can be any mathematical operation, counting, a
  # for loop, an if statement, or anything else. Here, we'll loop through
  # input1 and input2 and add each value to an output variable

  tot_sum = 0

  for num in input1:
    tot_sum += num

  for num in input2:
    tot_sum += num

  # The function can then return some value or data structure you are
  # interested in
  
  return tot_sum

# Outside of the function, we unindent, and we have the opportunity to
# run the function and store the output to a variable
```

In the code cell below, build a function named `multiplication` that returns the product of two input numbers (either floats or integers).

In [None]:
# Build your `multiplication` function here 

To call a function, you write the name of the function and then in parentheses write the inputs you want to pass to it. You can assign the value being returned to its own variable. 
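
As a quick illustration of defining and then calling a function, here is a sketch using a different, made-up function (`addition` and `my_sum` are hypothetical names, not part of the exercise):

```python
def addition(number1, number2):
    # Return the sum of the two inputs
    return number1 + number2

# Call the function and store the returned value in a variable
my_sum = addition(2, 3)
print(my_sum)
```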

In the cell below, call your `multiplication` function on any two numbers and assign the output to a variable called `my_product`. 

In [None]:
# Call your `multiplication` function here and assign its output to a variable called `my_product`

## Building a function to extract your coding sequence 
Now that we have some practice building functions, let's build one that can extract the coding sequence of an input gene sequence. In the code cell below, write a function that will take some WormBase-formatted gene sequence (including coding and non-coding regions) as an input, and will **return** just the coding sequence of the gene.
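
To show the general shape without giving the answer away, the sketch below does the opposite job: it keeps only the lowercase (non-coding) letters of a made-up sequence. The function name `extract_noncoding` and the example sequence are hypothetical; your function will need the mirror-image check.

```python
def extract_noncoding(sequence):
    # Build up a new string containing only the lowercase letters
    noncoding = ""
    for nucleotide in sequence:
        if nucleotide.islower():
            noncoding += nucleotide
    return noncoding

example_seq = "ATGccgTAa"  # made-up sequence
noncoding_regions = extract_noncoding(example_seq)
print(noncoding_regions)
```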

There is a hint below, if you think you need it.

In [None]:
# Define your function here 

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Remember that in the WormBase format (which your sequence is in), capital letters are coding and lowercase letters are noncoding. 
Can your function identify which letters are uppercase? And then store those in something that is returned? 
'''

Awesome! Let's test your new function out. Here's a new, extended gene sequence: `GCACATTCCAATTGTAGacaaattcaGTCGtgaacCTCGATatt`. In the code cell below, store this sequence as a variable. Then, **call** your function on this sequence and store the output to a new variable. Finally, print the stored value.

In [None]:
# Define the variable that contains your gene sequence

# Call your function on your sequence and assign the output to a variable `coding_regions`

# Print `coding_regions`

## Building a function to translate your gene sequence 

Now that you know how to subset a gene sequence to just the nucleotides in the coding region, you are curious what the amino acid sequence of this gene is.

In the code cell below, build a function that will take as input your whole gene sequence (coding and non-coding) and anything else it needs, and will return the amino acid sequence it codes for. The first step in your function should be using your _previous_ function to extract the coding sequence from the input sequence.

Once you define your function, run it on the extended gene sequence, store the output, and then print the stored output.
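
One piece of this task is walking through a coding sequence three letters at a time and looking up each codon. A minimal sketch of just that step is below. It assumes a sequence that is already all coding and whose length is a multiple of three, and it uses a tiny, hypothetical `mini_codon_table` rather than your full codon table.

```python
# Tiny, partial codon table (hypothetical subset for illustration)
mini_codon_table = {"ATG": "M", "TCC": "S", "TGT": "C", "TAG": "STOP"}

coding_sequence = "ATGTCCTGTTAG"  # made-up, already coding-only

amino_acids = []
# Step through the sequence in jumps of 3 (one codon at a time)
for start in range(0, len(coding_sequence), 3):
    codon = coding_sequence[start:start + 3]
    amino_acids.append(mini_codon_table[codon])

print(amino_acids)
```

In your own function, this step would come after extracting the coding sequence, and the codon table would be passed in as one of the inputs.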

There are two hints below, if you think you need them.

In [None]:
#@title <font color='green'>Click here for Hint 1</font>
%%capture

'''
You'll need the codon table to be able to convert from codons to amino 
acids (the original, not the inverted one)
'''

In [None]:
#@title <font color='green'>Click here for Hint 2</font>
%%capture

'''
You'll likely want to use the `range()` function here to loop through the
coding sequence. Try setting the step size to something other than 1
that makes a little more sense in this context... 
'''

In [None]:
# Build a function that returns the amino acid chain that a sequence codes for 



In [None]:
# Call your function on your sequence and assign the output to a variable `amino_acid_seq`

# Print `amino_acid_seq`