# **Welcome to Week 2 of CRISPR Design**

Before you get started, copy all of the cells below and paste them in at the bottom of the colab you completed for last week. You'll be doing the second half of this assignment in that notebook as well.

Also, if you did not manage to finish last week's assignment, no worries, but you will need some of the values you calculated from last week's assignment for this week. As such, we've given them to you: copy some or all of the code cell below if you need to.

In [None]:
if insert_end == 'N':
    insert_pos = 2016
    PAM_pos = 2015
    PAM_strand = '+'

elif insert_end == 'C':
    insert_pos = 7597
    PAM_pos = 7605
    PAM_strand = '-'

---
# Designing the Repair Template

Congratulations! You've successfully found a PAM close to the insert position, and you've found the sequence of the crRNA that corresponds to that PAM. Now comes the hard part: designing the repair template that corresponds to your chosen PAM and crRNA. When designing a repair template, there are a few things to remember:
1. We need to prevent Cas9 from editing our target gene again after we've successfully added our tag. To do so, we need to modify either the sequence of the PAM or the sequence of the crRNA-binding region in our repair template. If we can modify the sequence of the PAM, that's generally preferable
2. We need to insert the sequence of the tag at the correct position
3. We need to include 60 bp of homology on either side of any modifications we make (including inserting the tag)

Let's think about how we're going to do that. We essentially want to:
1. Check if we can modify the PAM in our target sequence. If we can, determine the modified sequence and keep track of where it occurs
2. If we can't modify the PAM, modify the crRNA-binding region. Keep track of where in the target gene our modifications occur
3. Based on the locations of modifications within the target gene, generate the repair template with 60 bp homology and including the tag

---
## First things first

The tag codes for a GFP protein, but also has an *unc-119* gene within one of the GFP introns. For this assignment, you don't need to understand why we've structured the tag this way, just make sure the whole tag sequence is inserted into the target gene sequence at the correct location.

In [None]:
tag = 'GCGGGCAGCGGTGGCAGTGGAGGTACCGGCGGAAGCGGTATGAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGTGATGTTAATGGGCACAAATTTTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGAAAACTACCTGTTCCATGGgtaagtttaaacatatatatactaactaaccctgattatttaaattttcagCCAACACTTGTCACTACTTTCTGTTATGGTGTTCAATGCTTCTCGAGATACCCAGATCATATGAAACGGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTTCAAAGATGACGGGAACTACAAGACACgtaagtttaaacagttcggtactaactaaccatacatatttaaattttcagGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAATTGGAATACAACTATAACTCACACAATGTATACATCATGGCAGACAAACAAAAGAATGGAATCAAAGTTgtaagtttaaacatgattttactaacataacttcgtataatgtatgctatacgaagttatttctagacattctctaatgaaaaaatctttcagttgaaattgaaaatgagttaaagttggagtttttattgaaaacagatttccgtgtgattagtgtttttagcgagtgtgacaggacagcgaaaaaatatagaaacaaggggggaactgaaaagcttaggaatgcattgaacatgagaaggggaaggggaaggaacaaactagacaggaattattggaatttaatcacatttggagttttttttctattcgacagaataattatccagaacatttttgtattaaatatttatgcatcatatgagtagtcggctttgttgtgcatgacgagtttgttatcgacgaaatagaagctgtcagaacgagtctcgtttggattgttgatcatgtcgtccactgaaaaagagattagtcttttgaattgtactttttagaataatgactcactgagttgttgagagagttgagggaactcatagatatgttcacagttgtttcgtgaattcggaatacagaatccgaattcaaagtcaaaacacttcagaaggcgatctttgaagaagtgacgttcgatcattcggaaatgatggattgggatgtctcctactttgaattccacagttgctccgaccgttttgagtttcaaaaagtttggagcaaatctataacgcacgtatcttgccgactcctgtggcgactcatcattctcctgatcgttctccggtttggcgatctcgaaaagcacttgctcagtgtccagatcacggatttggaacttggtgaactcgatgttatagatgttcgcagatggggagcataagaatcctaaatttatgttttaaactgaaatccaaagggagcaagataccttgagtgattcccggaagtgctaaaacgtcgttcggagtgatttgagctttcttcgcaagctccgattccgttgtgattccttgttcggtgcttggtggtggccgtggcatctggaaatatggaaaagttcaacaaaaagaaaagagaaaagaatgaaatcggatatcaagagttagttgagcggtttctctagttttctgagtctcacctgcgacgggaaggtcgccgagccgggtggaatcgattgttgttgctcggctttcatatcggtttggttggaagcggctgaaaacggaaagaagtggaagaaggaaaagagtgtggtgtgacaggaaaatggtaattagagggtgccaaataaccagctatattttgtttttttttgaaaacatttttaaaaagaaaaatacgataatgatatcagatggatttc
cggaaaactggtatgaaaaatttcaacctttttgagtacatgtaatcaaaatacactttgtaaattatcatttttattgaaactccaccatttttctatttataacgctaataatttgaaaaagaaacctgttgcgaaccgcggggtgaatcccaaaaacgaatgcgttttggtggagtgattgattcgaatcgaagaagaaaaagaagaagacgtggaatagagagctcactcttaaccgagcagcacacaccgacagaaaaaaaaatgaaatgaatgagggtcttcttcttcttcttcttcgaatgattgacagaaatgggaaaaagaggaagattgagaagggaaaaaggaaggagaaaagaagcagaagaagacgtcagagaggagaggaacgagcggaaaagcagcgggcgcaagtcatagaagtagcagagctggggagaagaagacactatccaagaaaggaatgacgagagagtatgcaaaggggtatagggtgcagacagaataggaacagaataacagatgatgagccaagaagagttgaaaagggcgatgaatttgtcatgtaacttaatttgggtcaatttgagcatgatgaattgaaatcatcccttgttgggagttaataaccggtttgttatcagaaaccctgtaatagaagggcgccctaactttgagccaattcatcccggtttctgtcaaatatatcaaaaagtggtcaactgacaaattgtttttgatattataataaacattttatccgttaacaattttcgaatactttttacaaggacttggataaattggctcaaagataacttcgtataatgtatgctatacgaagttattaactaatctgatttaaattttcagAACTTCAAAATTAGACACAACATTGAAGATGGAAGCGTTCAACTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCGAAAGATCCCAACGAAAAGAGAGACCACATGGTCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAAGGT'

---
## Modifying the target gene

So how do we check if we can modify a certain region in the target sequence (i.e., the PAM)?

Perhaps counterintuitively, the easiest way to check if we are able to modify the PAM sequence without disrupting the target gene is to modify the target gene _as much as possible_ and then check if the PAM was modified. If we modify the target gene sequence as much as possible and the PAM **is** modified, then we can modify the PAM. If we modify the sequence as much as possible and the PAM **is not** modified, then we cannot.

We can then use this modified sequence later when we are building the repair template.


### As much as possible...

How do we modify a DNA sequence "as much as possible"? For the non-coding regions, we should just be able to switch out each of the bases with another base. However, for the coding regions, we want to ensure that we're not changing the amino acid sequence of the gene; we only want to make synonymous changes.

For the coding regions, for each codon, we can try to find the synonymous codon that is **most different** from the original codon. This is a little confusing, so let's consider a specific example:<br><br>

The codon <font color='green'>**TCT**</font> encodes the amino acid <font color='green'>**Serine**</font>. Let's consider the other codons that encode Serine: <font color='blue'>**AGC**</font>, <font color='blue'>**TCG**</font>, <font color='blue'>**TCC**</font>, <font color='blue'>**AGT**</font>, and <font color='blue'>**TCA**</font>.<br><br>

<font color='blue'>**TCA**</font>, <font color='blue'>**TCC**</font>, and <font color='blue'>**TCG**</font> are each <font color='red'>one</font> base different from our initial codon <font color='green'>**TCT**</font>. <font color='blue'>**AGT**</font> is <font color='red'>two</font> bases different, and <font color='blue'>**AGC**</font> is <font color='red'>three</font> bases different. As such, when modifying the coding sequence of our target gene _as much as possible_, we will change all instances of the codon <font color='green'>**TCT**</font> to <font color='blue'>**AGC**</font>.
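
The distances in this example can be checked directly. The short sketch below counts mismatched positions between `TCT` and each of the other Serine codons; the variable names are only for illustration.

```python
def hamming_dist(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

original = 'TCT'
serine_synonyms = ['AGC', 'AGT', 'TCA', 'TCC', 'TCG']

# Distance from the original codon to each synonymous codon
distances = {codon: hamming_dist(original, codon) for codon in serine_synonyms}

for codon, dist in distances.items():
    print(codon, dist)
# AGC 3
# AGT 2
# TCA 1
# TCC 1
# TCG 1
```

Only `AGC` differs from `TCT` at all three positions, so it is the most different synonymous codon here.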

Below is some code that creates a dictionary: `most_diff_syn_codon`. This dictionary encodes, for each possible codon, the synonymous codon that is most different from it. You're welcome to comb through the code to try to figure out what it is doing, but that isn't required for this assignment (especially because it utilizes an advanced feature called a `lambda function`). 

Regardless, the dictionary will work just like any other dictionary you've seen. Run the code, and in the cell below, look up the key `'TCT'` in this dictionary and print the associated value, just to ensure you understand how the dictionary works.

In [None]:
def hamming_dist(str1, str2):
  dist = 0
  for i in range(len(str1)):
    if str1[i] != str2[i]:
      dist += 1
  return dist


codon_table = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

AA_to_codon = {}

for codon, AA in codon_table.items():
  AA_to_codon.setdefault(AA, set())
  AA_to_codon[AA].add(codon)

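most_diff_syn_codon = {}
for AA in AA_to_codon:
  for codon in AA_to_codon[AA]:
    # All codons for this amino acid (includes `codon` itself, at distance 0)
    other_codons = AA_to_codon[AA]
    # Sort from least to most different; ties fall in arbitrary set order
    sorted_codons = sorted(other_codons, key=lambda x: hamming_dist(x, codon))
    # The last entry is a synonymous codon at the maximum distance
    most_diff_syn_codon[codon] = sorted_codons[-1]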

In [None]:
# print the value associated with "TCT" in most_diff_syn_codon

Now we have a way to find the **most different** synonymous codon for each codon in a sequence. In the code cell below, write a function that takes some gene sequence as an input and returns the **most modified** version of that gene. This must do two things:
1. For non-coding regions, change every base
2. For coding regions, each codon must be swapped out with the most different synonymous codon

There is a hint below, if you think you need it.
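
If you'd like to compare your approach against something concrete, here is one possible sketch of the two steps. It is not the only way to do this. It rebuilds the standard codon table compactly so that it runs on its own, breaks ties alphabetically (the notebook's dictionary may pick a different tied codon), and uses a made-up rule for non-coding bases (`NONCODING_SWAP`). It also assumes the coding sequence length is a multiple of 3 and that the sequence contains only a, c, g, and t.

```python
from itertools import product

# Standard genetic code built compactly ('_' marks a stop codon)
BASES = 'TCAG'
AMINO_ACIDS = ('FFLLSSSSYY__CC_W' 'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR' 'VVVVAAAADDEEGGGG')
codon_table = {''.join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def hamming_dist(a, b):
    return sum(x != y for x, y in zip(a, b))

# Most different synonymous codon; ties are broken alphabetically (assumption)
most_diff_syn_codon = {}
for codon, aa in codon_table.items():
    synonyms = sorted(c for c in codon_table if codon_table[c] == aa)
    most_diff_syn_codon[codon] = max(synonyms, key=lambda c: hamming_dist(c, codon))

# Hypothetical rule for non-coding bases: any mapping that changes every base works
NONCODING_SWAP = {'a': 't', 't': 'a', 'c': 'g', 'g': 'c'}

def modify_sequence(input_seq):
    # Step 1: pull out the coding (uppercase) bases and swap each codon
    coding_seq = ''.join(b for b in input_seq if b.isupper())
    mod_coding_seq = ''.join(most_diff_syn_codon[coding_seq[i:i + 3]]
                             for i in range(0, len(coding_seq), 3))
    # Step 2: weave the modified coding bases back between modified non-coding bases
    output_seq = ''
    coding_pos = 0
    for base in input_seq:
        if base.isupper():
            output_seq += mod_coding_seq[coding_pos]
            coding_pos += 1
        else:
            output_seq += NONCODING_SWAP[base]
    return output_seq

def translate(seq):
    coding = ''.join(b for b in seq if b.isupper())
    return ''.join(codon_table[coding[i:i + 3]] for i in range(0, len(coding), 3))

test_seq = 'aggatATGGAATaaggttaGCTTGtgccat'
modified = modify_sequence(test_seq)
print(modified)
print(translate(test_seq) == translate(modified))  # True
```

Comparing the translations before and after is a quick sanity check that only synonymous changes were made.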

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
def modify_sequence(input_seq):

  # This part of the function has not been given to you in this hint, but
  # here's what it should do: 
  # Pull out the coding sequence and modify it as much as possible
  # It may be helpful to look at your code from Intro Part 2 where you
  # translated an input gene sequence
  # Let's say the modified sequence is stored in the variable `mod_coding_seq`

  coding_pos = 0

  output_seq = ''

  for i in range(len(input_seq)):
    if input_seq[i].upper() == input_seq[i]:
      output_seq += mod_coding_seq[coding_pos]
      coding_pos += 1
    else:
      # This part of the function also has not been given to you. If the
      # base at index i in the input sequence is non-coding, do this:
      # Append a base different from the one at input_seq[i] to output_seq
  
  return output_seq
'''

In [1]:
# Define a function that will take an input gene sequence (in WormBase format)
# and modify it as much as possible without changing the translated amino
# acid sequence

In the code cell below, check that your function works by running it on the test sequence provided and printing the output.

In [None]:
test_seq = 'aggatATGGAATaaggttaGCTTGtgccat'

# Check that your function works on this sequence

You should now have a function that takes some input gene sequence and returns the most modified version of that gene that still generates the same amino acid sequence. In the code cell below, run this function on your target gene sequence and store the output in a variable.

In [None]:
# Run your function on the target sequence and store the output in a variable

---
## Checking if the PAM Can Be Modified

Now, we want to check if the PAM sequence we're using was sufficiently modified in this "fully modified sequence". You'll need to use both the position of the PAM in the target sequence and its strandedness to determine whether this is the case. Keep in mind, it is not sufficient to modify just the N in <font color='red'>**N**</font>**GG**, as doing so will not actually disrupt the PAM. You will need to modify at least one of the two Gs.
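
As a rough sketch of the idea, the function below compares only the two relevant PAM bases between the original and modified sequences. The function name and toy sequences are invented for illustration. It assumes the PAM position is the index of the first PAM base as read on the forward strand, which matches the convention in the solution cell further down: for a `+` PAM (NGG) the Gs are the last two bases, and for a `-` PAM (CCN on the forward strand) the Cs are the first two.

```python
def pam_is_modified(target_seq, modified_seq, pam_pos, pam_strand):
    """Return True if at least one of the two relevant PAM bases changed."""
    # '+' strand: N G G -> relevant bases at pam_pos + 1 and pam_pos + 2
    # '-' strand: C C N -> relevant bases at pam_pos and pam_pos + 1
    first = pam_pos + 1 if pam_strand == '+' else pam_pos
    return any(target_seq[i].upper() != modified_seq[i].upper()
               for i in range(first, first + 2))

# Toy example 1: reverse-strand PAM 'CCG' at index 2; both Cs change
print(pam_is_modified('ATCCGA', 'ATAAGG', 2, '-'))  # True

# Toy example 2: forward-strand PAM 'TGG' at index 2; only the N changes
print(pam_is_modified('AATGGC', 'AAAGGC', 2, '+'))  # False
```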

In the code cell below, check whether the PAM sequence was sufficiently modified, and store this as a boolean in a variable.

In [None]:
# Check if the PAM was modified in your fully modified sequence
# This will differ based on whether your PAM is on the forward or reverse strand
# Store this as a boolean in a variable

Great! You should now know whether the PAM can be modified or not. To ensure that your code ran correctly, print the boolean value you stored in the code cell above.

Double check that the output makes sense given the target gene sequence and your choice of terminus. As a reminder, you should check that the output makes sense both when inserting at the N-terminus and when inserting at the C-terminus. 

<font color="red">**BIG NOTE:**</font> For this example gene sequence, when inserting at the N-terminus, you **SHOULD NOT** be able to modify the PAM, and when inserting at the C-terminus, you **SHOULD** be able to modify the PAM. This isn't always going to be the case, but it is for the sequence we've chosen here.

In [None]:
# Print whether the PAM was modified or not

We should now know whether or not we are able to modify our PAM sequence when building our repair template. But this is not enough information just yet. Remember: when designing our repair template, we need to include 60bp of homology on either end of any modifications we make. This means **we need to keep track of where we're modifying our sequence**.

This is actually easier said than done. Let's consider the case when we can modify our PAM. When we modify our PAM, it's possible we're modifying 0, 1, or 2 codons. If we're modifying a codon, we may be modifying more bases than are included in the PAM. Consider the following example:

Our PAM is on the reverse strand and falls between two codons, encoding Isoleucine (I) and Arginine (R).

**<center>...AT<font color='red'>CCG</font>A...</center>**

When we modify this as much as possible, the sequence becomes **...ATAAGG...**

It is not only the sequence "CC" that is modified. Rather, parts of both codons are modified. The modified region is not only from the position of the first C to the second C; it is the region from the first C to the last A (which becomes a G).
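
To see this concretely, we can list the positions that differ between the two six-base sequences from the example. This is just an illustration on the toy sequence, not part of the assignment code.

```python
original = 'ATCCGA'
modified = 'ATAAGG'

# Indices at which the two sequences differ
changed = [i for i, (o, m) in enumerate(zip(original, modified)) if o != m]
print(changed)  # [2, 3, 5]

# The modified region runs from the first changed base to one past the last
region_start, region_end = changed[0], changed[-1] + 1
print(region_start, region_end)  # 2 6
print(original[region_start:region_end], '->', modified[region_start:region_end])
# CCGA -> AAGG
```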

To be able to account for this nuance, we'll need to be able to keep track of which codon each base is part of, and which "frame" each base is in.

In the code cell below, write a function that takes some gene sequence as input and returns a list of the same length that encodes the frame at each position. It should follow these rules:

1. If the base is the first base in a codon, its frame should be encoded as `0`
2. If the base is the second base in a codon, its frame should be encoded as `1`
3. If the base is the third base in a codon, its frame should be encoded as `2`
4. Finally, if the base is in a non-coding region, its frame should be encoded as `-1`

There is a hint below, if you think you need it.
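
One way the rules above could be implemented is sketched here. The function name `get_frames` is just a placeholder, and the sketch assumes WormBase-style input in which coding bases are uppercase and everything else is lowercase.

```python
def get_frames(input_seq):
    """Return a list giving the codon frame (0, 1, 2) or -1 for each base."""
    frames = []
    coding_pos = 0  # number of coding bases seen so far
    for base in input_seq:
        if base.isupper():
            # Position within the current codon
            frames.append(coding_pos % 3)
            coding_pos += 1
        else:
            # Non-coding base (UTR or intron)
            frames.append(-1)
    return frames

test_seq = 'aggatATGGAATaaggttaGCTTGtgccat'
print(get_frames(test_seq))
```

Note that the frame counter carries across the lowercase stretch in the middle of the test sequence, so the first uppercase base after it continues the codon that was started before it.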

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Keep track of which frame you're in for the coding sequence as you move
through the gene sequence (starting at 0). The modulus (%) operator may
be useful to you here.
'''

In [None]:
# Define a function that determines the frame of each base in an input gene
# sequence (in WormBase format). Return this as a list.

In the code cell below, check that your function works by running it on the test sequence provided and printing the output.

In [None]:
test_seq = 'aggatATGGAATaaggttaGCTTGtgccat'

# Check that your function works on this sequence

Once your function works, run it on the target gene sequence and store the output frame list in a variable to use later.

In [None]:
# Run your function on the target sequence and store the output in a variable

## Modifying the PAM

Let's assume that we *could* modify the PAM (as should be the case if we're inserting the tag at the C-terminus). We now need to track where the modifications to the PAM actually are. When doing this, there is an assumption we're going to use that will make our lives easier. The assumption is this: our PAM, and any codons that it is part of, will not overlap an intronic region. It may overlap one of the UTRs, but it will not overlap an intron.

This is likely a fair assumption, because our PAM should be relatively close to the beginning or end of the coding sequence; so this assumption will hold unless the first or last exon is incredibly short. It should be noted that this assumption _can_ be violated. Genes do exist with quite small exons, and in the future, you would need to modify your code to handle that. But the gene we're giving you for this assignment does not violate this assumption, and so we're going to move forward.

As such, there are four possibilities regarding the two relevant bases of the PAM:
1. They are both contained within a single codon
2. They are split between two separate codons (as in the example above)
3. One exists within a codon, and the other is within a UTR
4. Both are within a UTR

In the code cell below, determine the start and end positions of the region in the target gene that is modified when we modify the PAM. **We'll assume we can modify the PAM as much as possible**. The end position should be 1 greater than the position of the last modified base (we're going to want to use it to index the target sequence later). Store both of these values in variables.

<font color="red">**NOTE:**</font> This code should only run _if_ the PAM is able to be modified, as determined above.

There is a hint below if you want to try doing this part yourself, BUT we've also given you the solution, as this part of the code is fairly convoluted.
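
Whichever of the four cases applies, the last step is the same: within some candidate region, find the first and last bases that actually changed. Below is a small sketch of that narrowing step on a toy sequence. The helper name `modified_span` is hypothetical, and the end position is exclusive so that it can be used directly in a slice.

```python
def modified_span(target_seq, modified_seq, check_start, check_end):
    """Return (start, end) of the changed bases within [check_start, check_end).

    `end` is one past the last changed base. Returns None if nothing changed.
    """
    changed = [i for i in range(check_start, check_end)
               if target_seq[i].upper() != modified_seq[i].upper()]
    if not changed:
        return None
    return changed[0], changed[-1] + 1

# Toy example: the codons ATC and CGA occupy indices 2-7
target = 'GGATCCGATT'
modified = 'GGATAAGGTT'

span = modified_span(target, modified, 2, 8)
print(span)  # (4, 8)
start, end = span
print(target[start:end], '->', modified[start:end])  # CCGA -> AAGG
```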

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Try a set of four conditionals checking which of the four possibilities
from above the chosen PAM falls into. (These are in the same order as above)
1. The frame of the first of the two relevant positions in the PAM is 0 or 1
2. The frame of the first relevant position is 2, and the frame of the second
   relevant position is 0
3. The frame of the first relevant position is 2, and the frame of the second
   relevant position is -1
4. The frames of the two relevant positions are both -1

For each of these conditions, determine the range of positions at which
modifications could have been made (e.g. for the case in which the relevant
bases overlap two codons, the region from the beginning of the first codon
to the end of the second codon), and then narrow down the position of
the first modified base and the last.
'''

In [None]:
#@title <font color='green'>Click here for the **SOLUTION**</font>
%%capture

# This solution assumes that the PAM start position (index) is stored in the
# variable `PAM_pos` and that the PAM strand is either "+" (forward) or
# "-" (reverse) and stored in the variable `PAM_strand`. It also assumes you
# have a variable `PAM_modified` that stores a boolean based on whether you
# are able to modify the PAM (True) or not (False), along with the variables
# `target_seq`, `fully_modified_seq`, and `frame_positions` (the frame list).

'''
if PAM_modified:
  
  if PAM_strand == '+':
    PAM_nuc_start = PAM_pos + 1
  else:
    PAM_nuc_start = PAM_pos

  if frame_positions[PAM_nuc_start] == -1 and frame_positions[PAM_nuc_start+1] == -1:
    check_region_start = PAM_nuc_start
    check_region_end = PAM_nuc_start + 2

  elif frame_positions[PAM_nuc_start] == 2 and frame_positions[PAM_nuc_start+1] == -1:
    if fully_modified_seq[PAM_nuc_start].upper() != target_seq[PAM_nuc_start].upper():
      check_region_start = PAM_nuc_start - 2
    else:
      check_region_start = PAM_nuc_start + 1
    check_region_end = PAM_nuc_start + 2

  elif frame_positions[PAM_nuc_start] == 2 and frame_positions[PAM_nuc_start+1] == 0:
    if fully_modified_seq[PAM_nuc_start] != target_seq[PAM_nuc_start]:
      check_region_start = PAM_nuc_start - 2
    else:
      check_region_start = PAM_nuc_start + 1
    if fully_modified_seq[PAM_nuc_start+1] != target_seq[PAM_nuc_start+1]:
      check_region_end = PAM_nuc_start + 4
    else:
      check_region_end = PAM_nuc_start + 1

  elif frame_positions[PAM_nuc_start] == 0 or frame_positions[PAM_nuc_start] == 1:
    check_region_start = PAM_nuc_start - frame_positions[PAM_nuc_start]
    check_region_end = check_region_start + 3

  PAM_modified_region_start = check_region_end
  PAM_modified_region_end = check_region_start

  for i in range(check_region_start, check_region_end):
    if fully_modified_seq[i].upper() != target_seq[i].upper():
      PAM_modified_region_start = min(PAM_modified_region_start, i)
      PAM_modified_region_end = max(PAM_modified_region_end, i+1)
'''

In [None]:
# If able to modify the PAM, find which of the 4 above scenarios it falls under
# and determine which positions are modified 

---
## Modifying the crRNA-binding Region

At this point, we've determined whether we are able to modify the PAM, and if so, we've recorded the range of positions in the target sequence that will be modified. But what if we were not able to modify the PAM sequence? (For the sequence we're using here, you shouldn't be able to modify the PAM if you are inserting the tag at the N-terminus.) If that's the case, we will instead need to modify the crRNA-binding region.

As before, we'll want to determine the range of the positions in the target sequence that will be modified. This will actually look very similar to what we did for the PAM sequence. We'll make the same assumption here that we did with the PAM sequence: the crRNA-binding region will not overlap any intronic regions.

In the code cell below, determine the start and end positions of the region in the target sequence that is changed when we modify the crRNA-binding region. As before, the end position should be 1 greater than the position of the last modified base. If the start or end of the crRNA-binding region overlaps _part_ of a codon, assume you'll modify that codon if you can.

<font color="red">**NOTE:**</font> This code should only run if the PAM _is not_ modified.

There is a hint below, if you think you need it.
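
As a rough illustration of the "overlaps part of a codon" rule, the sketch below widens a region so that it covers whole codons at both ends. The function name and the toy frame list are made up for this example. It relies on the same assumption as the text: no intron falls inside or right next to the region.

```python
def expand_to_codon_boundaries(frames, start, end):
    """Widen the half-open region [start, end) to cover whole codons.

    `frames` is the per-base frame list (0, 1, 2, or -1 for non-coding).
    """
    # If the first base is the 2nd or 3rd base of a codon, back up to its start
    if frames[start] > 0:
        start -= frames[start]
    # If the last base is the 1st or 2nd base of a codon, extend to its end
    last = end - 1
    if frames[last] in (0, 1):
        end += 2 - frames[last]
    return start, end

# Frames for the toy sequence 'ttATGGAACGTTGCtt'
frames = [-1, -1, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, -1, -1]

print(expand_to_codon_boundaries(frames, 3, 9))  # (2, 11)
print(expand_to_codon_boundaries(frames, 0, 5))  # (0, 5)
```

After widening the region, you would still narrow it down to the first and last bases that actually differ between the target and the fully modified sequence.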

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Check the frame of the first and last base in the crRNA-binding region. If
either the first or last base falls in the middle of a codon, update the
region to check for modifications as necessary.

As with the PAM, narrow down the first and last base within this region that
are actually modified.
'''

In [None]:
# If you are NOT able to modify the PAM, find the start and end position of the
# region around the crRNA-binding region that is modified. Remember, if the ends
# of the crRNA-binding region overlap codons, we'll assume we're going to try
# to modify those codons as well.

---
## Generating the Repair Template

Congratulations! You now have all the pieces you need to put together your repair template:
1. The sequence of the tag
2. The insert position
3. The positions of all other modifications to be made (either to the PAM or the crRNA-binding region)

Recall that the repair template needs to have 60 bp of homology on either end of any modifications. An easy place to start is determining the start and end position of the repair template in the target sequence. **Remember, inserting the tag also counts as a modification**.

In the code cell below, determine the start and end position (in the target sequence) of the repair template. The start should be 60bp upstream of any modifications (including inserting the tag), and the end should be 60bp downstream of any modifications. Store both of these positions in variables.

<font color="red">**NOTE:**</font> These positions will depend on whether the PAM was modified or the crRNA-binding region was modified.
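
The arithmetic here can be sketched as follows. The function name and the modified-region coordinates in the examples are hypothetical (your own values will come from the earlier cells). The sketch treats the tag insertion as a zero-width change at `insert_pos` and does not check whether the result runs off either end of the target sequence.

```python
HOMOLOGY_LEN = 60

def repair_template_bounds(insert_pos, mod_start, mod_end, homology=HOMOLOGY_LEN):
    """Return (start, end) of the target-sequence slice used in the template.

    mod_start/mod_end bound the PAM or crRNA modifications (end exclusive).
    """
    first_change = min(insert_pos, mod_start)
    last_change_end = max(insert_pos, mod_end)
    return first_change - homology, last_change_end + homology

# Hypothetical C-terminus case: modifications fall after the insert position
print(repair_template_bounds(7597, 7605, 7607))  # (7537, 7667)

# Hypothetical N-terminus case: modifications fall before the insert position
print(repair_template_bounds(2016, 1995, 2014))  # (1935, 2076)
```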


In [None]:
# Find the start position and end position of the portion of the 
# target sequence that will be included in the repair template and
# store these positions in variables. Remember that we include 60bp
# of homology in the repair template

Now that you have the start and end position of the repair template in the target sequence, we can finally put the repair template together.

In the code cell below, construct the repair template using the following pieces of information:
1. The start and end position of the repair template in the target sequence
2. The start and end position of the modifications made to either the PAM or the crRNA-binding region
3. The fully modified sequence created above
4. The insert position
5. The sequence of the tag

Print your complete repair template, and you're done!

There is a hint below, if you think you need it.
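
Here is a minimal sketch of how the five pieces could be combined, shown on a toy sequence with 4 bp of "homology" rather than 60. The function name and all of the toy values are invented for illustration, and the sketch assumes the tag goes immediately before the base at `insert_pos`.

```python
def build_repair_template(target_seq, modified_seq, tag,
                          template_start, template_end,
                          mod_start, mod_end, insert_pos):
    """Assemble a repair template from slices of the two sequences plus the tag."""
    # Original sequence everywhere except the modified region
    untagged = (target_seq[template_start:mod_start]
                + modified_seq[mod_start:mod_end]
                + target_seq[mod_end:template_end])
    # Convert the insert position to an index within the template
    offset = insert_pos - template_start
    return untagged[:offset] + tag + untagged[offset:]

# Toy inputs: bases 6-7 are modified (CC -> TT), tag goes in at index 10
target = 'AAAACCCCGGGGTTTT'
modified = 'AAAACCTTGGGGTTTT'
tag = 'xxx'

template = build_repair_template(target, modified, tag,
                                 template_start=2, template_end=14,
                                 mod_start=6, mod_end=8, insert_pos=10)
print(template)  # AACCTTGGxxxGGTT
```

Building the untagged template first keeps the indexing simple, since every slice still uses positions from the original target sequence.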

In [None]:
#@title <font color='green'>Click here for a hint</font>
%%capture

'''
Create an initial repair template without the tag first (i.e. one that
comprises just the modifications to the PAM or crRNA-binding region), and
then add the tag afterwards.
'''

In [None]:
# Find the sequence of the repair template and store it in a variable

# Print the sequence of the repair template

---
# Congratulations!

And that's it! You're done! You've written a program in Python that takes some target gene and tag of interest, and creates the crRNA and repair template necessary to add that tag to the target gene. As long as your target gene is in WormBase format (such that coding sequence is uppercase and non-coding sequence is lowercase), this program should be able to create CRISPR reagents for any gene and tag. I don't know about you, but I think that's pretty cool.

There are, of course, some caveats. First, remember the assumption we made about the crRNA-binding region and the PAM: we assumed they did not overlap any intronic regions. Why did we make this assumption? Imagine a codon that is split by an intron. If we need to modify this codon when modifying either the PAM or the crRNA-binding region, it's possible the modified region could stretch across the intron, making our repair template huge. This assumption is, of course, not always going to be true. It's possible that one of the first or last exons could be sufficiently small that either of these regions could overlap an intron. Think about how you could modify your code to deal with a situation like this.

There are some other assumptions we made, whether we knew it or not. We assumed we could modify non-coding sequences however we wanted. What if we accidentally modified a splice site such that it was no longer functional? What if we modified the ribosome binding site? Non-coding regions are not necessarily functionless. Think about how you could modify your program to incorporate knowledge about regulatory regions like this.

If you want to run your code with different target gene/tag combos, just change the parameters you set at the beginning. You can change the sequence of the target gene, the sequence of the tag, or the insert terminus (either N or C). Instead of running each code cell one at a time, you can run your whole notebook sequentially by pressing `Ctrl + F9` (or selecting "Run All" under the Runtime tab at the top of the page).

Regardless of these caveats, it's still incredibly impressive what you've put together here, and we hope you had fun figuring it out! If you did have fun, we certainly encourage you to continue building these skills further. If you've gotten the coding bug, there are a lot of ways to keep it up. JHU has a number of courses that either teach you programming or have you learn it in application to another subject. If you want to talk about other ways to learn coding, or to expand or modify this code here, please get in touch - we'd be so happy to hear from you (Dylan Taylor - dtaylo95@jhu.edu; Sara Carioscia - saracarioscia@jhu.edu).
