<a href="https://colab.research.google.com/github/csbfx/apex/blob/main/APEX_Biology_Activity_Module_Bioinformatics_Sickle_Cell_Anemia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Bioinformatics with Python Programming: A Case Study on Sickle Cell Anemia
#### Created by Wendy Lee, Akiko Balitactac, Inika Bhatia, Valerie Carr, Morris Jones, and Jennifer Avena.
#### Last updated: August 8, 2023
#### Licensed under CC BY-NC-SA

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/RPLP0_90_ClustalW_aln.gif" width=500><br>
Image credit: Miguel Andrade CC BY-SA 3.0

### Learning objectives:
  1. Using computing, determine the RNA and protein sequence from the DNA sequence of a gene, and compare gene sequences.
  2. Using computing, locate a specific region within the sequence of a gene.
  3. Predict the effects of mutations on protein function.


## A few notes before you start
* **BEFORE YOU START, MAKE A COPY!** This file is view only, meaning that you can't edit it.
    * To generate an editable copy, select the icon <img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/opencolab.png" valign="bottom"> at the top of the page. This opens a new browser tab with the file in Google Colab. At the top of the notebook, click the <img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/copytodrive.png" valign="bottom" border=1px> icon. This opens a new tab containing your own copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter` or hit the play button.   
* As you progress through the exercises in this module, run every code cell in order.  Many functions and variables are defined in the beginning exercises in the module and thus need to be run in order for later code in the module to work.

# INTRODUCTION: SICKLE CELL ANEMIA CASE STUDY

Today, you'll practice Python programming for bioinformatics using a case study for sickle cell anemia.

**Sickle Cell Anemia**
Sickle cell anemia is a disease that arises from a dysfunctional version of hemoglobin. Hemoglobin is a protein in red blood cells that carries oxygen from the lungs to other tissues in the body.  Each hemoglobin protein consists of four subunits (i.e., four polypeptide chains): two alpha-globin subunits and two beta-globin subunits.  Sickle cell anemia is caused by a mutation in the beta-globin gene that converts the twentieth nucleotide in the coding sequence from an A to a T, resulting in a change in the amino acid sequence from a glutamic acid to a valine. This leads to a dysfunctional beta-globin protein subunit and thus a dysfunctional hemoglobin protein. After hemoglobin delivers oxygen to the cells in the body, dysfunctional hemoglobin polymerizes, which causes the red blood cells to change shape to a crescent or “sickle”. The sickle shape of these red blood cells can block blood flow, thus reducing oxygen delivery to the body. These sickle cells also have a short lifespan, which can decrease the overall level of red blood cells in the body and thus the level of oxygen delivered to the cells in the body.  People diagnosed with this disease suffer from a range of symptoms, such as decreased energy (due to decreased oxygen delivery to cells) and swelling of the hands and feet (due to restricted blood flow). Sickle cell anemia is inherited in an autosomal recessive fashion, meaning that an individual has sickle cell anemia only if both of their copies of the beta-globin gene carry the mutation.


You can watch a short video summary of sickle cell anemia here:  https://vimeo.com/458992516.

<img src='https://www.sjsu.edu/people/wendy.lee/pics/apex/Hemoglobin_Structures.jpeg' width=600px>


The image above shows the hemoglobin protein, which consists of two alpha-globin and two beta-globin subunits. Image credit: Adapted from [Berg, Tymoczko, & Stryer, 2002.](https://www.ncbi.nlm.nih.gov/books/NBK22550/)

References: [NHLBI](https://www.nhlbi.nih.gov/health-topics/sickle-cell-disease); [U.S. National Library of Medicine](https://medlineplus.gov/genetics/gene/hbb/); [OMIM](https://www.omim.org/entry/603903?search=sickle%20cell%20anemia&highlight=%28anaemia%7Canemia%29%20cell%20sickle); [Berg, Tymoczko, & Stryer, 2002](https://www.ncbi.nlm.nih.gov/books/NBK22550/)


### **Case Study Background**
Kevin is 15 years old. When he was younger, his parents noticed that he often had swollen hands and feet, frequently had infections, and often had low energy. One day, while out with his family, Kevin had excruciating pain in his knees and hip. Concerned for their child's health, Kevin's parents rushed him to the hospital. While talking to the doctor, Kevin and his parents discussed his symptoms and also mentioned that he often has unexplained episodes of severe pain, which can prevent him from doing daily activities.  While discussing their family history with the doctor, the parents mentioned a history of pulmonary hypertension, anemia, and strokes.

Upon a physical examination, the doctor only noticed a bit of swelling in Kevin’s hands.  Due to the family’s medical history and the patient’s reported symptoms, the doctor diagnosed Kevin with sickle cell anemia but also ordered genetic testing to confirm whether Kevin has a mutation in his beta-globin gene.

Kevin also has a brother, Will.  While Will does not display any symptoms, he also received genetic testing, since the disease may run in the family.

Kevin's family pedigree:

<img src='https://www.sjsu.edu/people/wendy.lee/pics/apex/Pedigree.jpg' width=500px>


The family pedigree above indicates that Kevin displays symptoms of sickle cell anemia, but the other family members in this pedigree do not.


----
# Exercise 1 - Analyze Gene Sequences

**Examining The Central Dogma: Analyzing Gene Sequences using Python Programming Language**
In this first exercise, you will use several String methods in Python to analyze the content of genomic sequences. In computer programming, a String is a sequence of text or characters. String Methods are functions built to perform a specific task for a String.

Genetic materials consist of DNA, some of which is transcribed into RNA sequences and translated into proteins.  This process is known as the Central Dogma of Biology, as illustrated in the two images below.

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/centraldogma.jpeg" width="380"><br>
Image credit: [Ille, Alexander M., Hannah Lamont, and Michael B. Mathews. "The Central Dogma revisited: Insights from protein synthesis, CRISPR, and beyond." Wiley Interdisciplinary Reviews: RNA 13.5 (2022): e1718.](https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wrna.1718#pane-pcw-references)







DNA is double-stranded. The protein coding information is stored in the coding strand of DNA.

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/open_reading_frame.jpg">Image credit: [National Human Genome Research Institute (NHGRI)](https://www.genome.gov/genetics-glossary/Open-Reading-Frame)




## Transcription: DNA -> RNA
In transcription, DNA is transcribed into RNA.  The non-coding (i.e., template) strand of DNA serves as a template for the cell to make an RNA strand that is complementary and antiparallel. The coding strand DNA sequence is highly similar to the RNA sequence. The only difference is that all the "T"s in the DNA are "U"s in RNA. For example, the DNA coding strand `ATGGACGATGAGCGATAG` will be transcribed to RNA as `AUGGACGAUGAGCGAUAG`.
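The relationship between the two DNA strands and the RNA can be sketched in a few lines of Python. This is only an illustration of the base-pairing rules described above, not a model of how the cell carries out transcription. The names `DNA_PAIRS`, `RNA_PAIRS`, `template_strand`, and `transcribe_from_template` are hypothetical and were chosen for this example.

```python
# Hypothetical helpers for illustration only.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}  # DNA base -> DNA partner
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}  # template base -> RNA partner

def template_strand(coding):
    """Return the template strand (written 5' to 3') for a coding strand."""
    # Reverse the strand because the two strands are antiparallel,
    # and swap each base for its partner.
    return "".join(DNA_PAIRS[base] for base in reversed(coding))

def transcribe_from_template(template):
    """Return the RNA (written 5' to 3') that pairs with a template strand."""
    return "".join(RNA_PAIRS[base] for base in reversed(template))

coding = "ATGGACGATGAGCGATAG"
template = template_strand(coding)
rna = transcribe_from_template(template)

print("Coding strand:  ", coding)
print("Template strand:", template)
print("RNA:            ", rna)
# The RNA built from the template is the coding strand with every T written as U.
print(rna == "AUGGACGAUGAGCGAUAG")
```

Because the result always equals the coding strand with "T" swapped for "U", the rest of this module works directly from the coding strand.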

The computer can simulate the transcription. We will do this today with part of the DNA sequence for the beta-globin gene for Kevin and his brother Will, who is asymptomatic.  For the DNA sequences for both Kevin and Will, you are provided with a region of the *coding sequence*.  A coding sequence (abbreviated CDS) for a gene is the sequence of DNA that begins with the start codon and ends with the stop codon that will be coded into protein, so it contains only DNA sequence for exons, not introns.  The coding sequence is written to resemble part of the *coding strand* sequence.

Below is an example of the coding strand of a small piece of Kevin's DNA, written as a string of letters. For each coding strand of DNA written in today's activity, the 5' will be located on the left side of the sequence, and 3' will be located on the right side of the sequence. Kevin's DNA sequence is assigned to a variable called "kevin_cds" to specify that this DNA is part of the coding sequence.  In this exercise, you will determine Kevin's RNA sequence (which will be assigned to a variable called "kevin_rna") based on the DNA coding strand using the String method [`replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace).  In this example, we will replace the coding strand T's with RNA strand U's using the command replace("original","new"). After performing this function, we want to ensure that the DNA coding strand sequence actually got replaced by the RNA sequence. To do this, we use the [`print()`](https://www.w3schools.com/python/ref_func_print.asp) function which prints the output (in this case, the string of Kevin's RNA sequence) or specified message.  

<a name='transcription'></a>
To run the code, you can use one of two methods: (1) Place the cursor on the boxes with code below, and hold down `shift`, and press `enter` or (2) select the run button (the right facing arrow).  Please run the code below.


In [None]:
kevin_cds = "ATGGTGCATCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
kevin_rna = kevin_cds.replace("T", "U")
print("Kevin's rna:", kevin_rna)

Now, practice writing your own code to identify the RNA sequence for Will.  Will's coding strand DNA sequence is assigned to a variable called "will_cds" below.  Remember to run the code once you're done!

In [None]:
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
# indicate your command here to replace coding strand T's with RNA U's

# indicate a command here to display (i.e., "print") the RNA sequence


## Translation: RNA -> Protein

The RNA transcript contains the sequence read by the ribosome to produce the protein (amino acid sequence). Proteins consist of amino acids bound together in a sequence. Each amino acid is encoded by a group of 3 nucleotides (letters).  These groups of 3 letters are called codons; each codon codes for one of 20 amino acids or for the stop of translation (i.e., a stop codon).
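As a small illustration of reading a sequence in groups of three, the sketch below splits the example RNA from the Transcription section into codons. The variable names `rna` and `codons` are chosen for this example only.

```python
rna = "AUGGACGAUGAGCGAUAG"

# Step through the sequence three letters at a time.
codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
print(codons)
# ['AUG', 'GAC', 'GAU', 'GAG', 'CGA', 'UAG']
```

The first codon is the start codon `AUG` and the last is the stop codon `UAG`.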

The codon table (in terms of RNA) is shown in the picture below.

Recall that a coding sequence for a gene begins with the start codon `AUG` and ends with a stop codon that terminates the translation process. There are three stop codons: `UAA`, `UAG`, `UGA`. Since DNA and RNA are almost identical except for "T"s and "U"s, often, the coding strand of the DNA coding sequence is used directly to determine the amino acid sequence for the protein using computational methods.

<a name='codons'></a>

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/codons.png" width=800>


The codon and amino acid data are stored in a Python dictionary. A Python dictionary can be thought of as a "container" holding pairs of items. The first item of one of these pairs is the `key`, and the second item is the `value`, creating `key:value` pairs.  In the dictionary below, the codon is used as the `key` and the corresponding amino acid is stored as the `value`.  Note that the stop of translation would be represented by an asterisk (*) in the output from this code. <a name='translation'></a>
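If dictionaries are new to you, here is a minimal sketch of how `key:value` lookup works. The dictionary `mini_code` is a deliberately tiny, partial table made up for this example; it is not the full genetic code.

```python
# A deliberately tiny, partial codon table (DNA codons as keys).
mini_code = {"ATG": "M", "GAG": "E", "GTG": "V", "TAA": "*"}

# Look up the value stored under the key "ATG".
print(mini_code["ATG"])

# Check whether a key exists before looking it up.
print("CCC" in mini_code)

# .get() returns a fallback value instead of raising a KeyError for a missing key.
print(mini_code.get("CCC", "?"))
```

Asking for a key that is not in the dictionary with square brackets (for example `mini_code["CCC"]`) raises a `KeyError`, which is why a complete table of all 64 codons is used for translation.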

Click on the play button in the code cell below to see the RNA to protein mapping for a small section of the coding strand of Will's beta-globin gene, and then you will have a chance to practice on your own to translate Kevin's sequence.  Remember that both of these sequences provided are part of the coding sequence, meaning that they only contain DNA sequence that will "code" for the protein sequence.

In [None]:
# Translate DNA sequence to amino acid sequence

def translate(seq):
    """Translate a DNA sequence to an amino acid sequence."""

    # the following is a Python dictionary that stores
    # the codons and amino acids as key:value pairs.
    geneticode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }

    length = len(seq)

    # Save the amino acid sequence in a list called protein
    protein = []
    for pos in range(0,length-2,3):
        codon = seq[pos:pos+3]
        # Get the appropriate amino acid from the dictionary
        aa = geneticode[codon]
        protein.append(aa)
        if aa == "*": # when we see a stop codon "*"
            return "".join(protein) # return the protein sequence
    # return the protein sequence when we finish processing all the codons
    return "".join(protein)


# Main program
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
will_protein = translate(will_cds)
print("Will's protein sequence: " + will_protein)

Now, practice writing your own code to identify the protein sequence for Kevin. Kevin's coding strand DNA sequence from part of the beta-globin gene is assigned to a variable called "kevin_cds" below.  Remember to run the code once you're done!

In [None]:
kevin_cds = "ATGGTGCATCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
#indicate your command here to translate the coding sequence

#indicate a command here to display (i.e., "print") the protein sequence


## Check-in Questions

Instructions: Edit this text cell to respond to the following questions:

The sequences above were part of the coding sequence, meaning only nucleotides that will be coded into protein.  Will the coding sequence consist of sequence from exons, or introns, or both?

*   Your answer here:

The sequences above began with the start codon.  Why is it important to know where the start codon is located? How does this help us when we determine the protein sequence for a gene?

*   Your answer here:



----
# Exercise 2 - Analyze DNA sequences: Identify the first start codon "ATG" in the DNA sequence

The biological translational machinery needs to know where to start and stop constructing a protein. The codon AUG (or ATG on the DNA coding strand), which normally codes for Met, is also used as the start codon. We can use the Python String method, [`find()`](https://docs.python.org/3/library/stdtypes.html#str.find), to search for "ATG" within the string that stores the DNA sequence. If "ATG" is present in the string, `find()` returns the position (also known as the index) of the first occurrence of "ATG" in the string; otherwise, it returns -1. Notice that the first index or position of a string is 0. Example: given the string "ABCD", the index of C is 2, since the index of A is 0 in Python.
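The short sketch below shows this behavior on toy strings. The strings `text` and `repeated` are made up for this example and are not sequences from the case study.

```python
text = "ABCD"
print(text.find("C"))   # 2
print(text.find("Z"))   # -1, because "Z" is not in the string

# find() reports only the first match, even if the pattern occurs again later.
repeated = "GGATGCCATG"
print(repeated.find("ATG"))     # 2

# An optional second argument tells find() where to start searching.
print(repeated.find("ATG", 3))  # 7
```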

In Exercise 1 with Kevin and Will's DNA coding sequence, scientists had already identified the start codon, so you were given the coding sequence of DNA, beginning with the start codon.  However, their cousin Molly also received genetic testing, and for her DNA, you are given a sequence that does not start with ATG, so you need to identify the location of the start codon.  Run the cell below to find the position of "ATG" in the given DNA sequence for Molly.

<a name='atg'></a>

In [None]:
molly_dna = "AACAGACACCATGGTGCATCTGACTCCTGAAGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
pos = molly_dna.find("ATG") # Finds the index of the first "ATG" in the above dna sequence
print("Molly's ATG is at position", pos)

We can also print Molly's DNA sequence and print "***" beneath it to mark the position of "ATG".  Run the cell below to do so.

In [None]:
print(molly_dna)
print(" "*pos + "***") # print pos spaces followed by "***" so the marker lines up under "ATG"

<a name='cds'></a>
Now that we have specified the position of the start codon, we can use this information to create a new variable called "molly_cds" that only contains the coding sequence (CDS), the sequence that is coded into protein. To do this, we can use square brackets in the code to specify that this new "molly_cds" variable should contain the region of the "molly_dna" variable from the start of the ATG (which is defined by the variable "pos" based on the code above) until the end of the sequence (which is indicated by leaving the position after the colon empty).  Run the cell below to do so.
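Square-bracket slicing can be tried out on a short toy string first. In the sketch below, `letters` and `start` are made-up names for this example.

```python
letters = "ABCDEF"
start = 2

print(letters[start:])    # CDEF: from index 2 to the end
print(letters[:start])    # AB: from the beginning up to, but not including, index 2
print(letters[start:5])   # CDE: from index 2 up to, but not including, index 5
```

Leaving out the number after the colon means "keep going to the end of the string", which is the form used for Molly's sequence.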

In [None]:
molly_cds = molly_dna[pos:]
print("Molly's coding sequence: " + molly_cds)

## Check-in Questions

Instructions: Edit this text cell to respond to the following questions:

At what position was the start codon in Molly's DNA?

*   Your answer here:

When examining genetic information, what other useful features could you identify in a sequence using the `find()` method?

*   Your answer here:

----
# Exercise 3 - Pairwise sequence comparisons <a name='pair'></a>


## Exercise 3A - Pairwise Comparison: Will's Sequence to Known Wild-type Sequence
Since Will shows no symptoms, let's determine whether his beta-globin gene coding sequence matches that of the wild-type sequence.  To do so, let's compare Will's sequence to the human wild-type (i.e., typical) sequence of the beta-globin gene.

Here is the wild-type sequence for a portion of the coding sequence of the beta-globin gene (also called hemoglobin subunit beta or HBB): ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG

Insert the sequence into the code cell below, between the quotation marks, in the empty variable "wildtype_cds."

In [None]:
# Comparing wild-type HBB coding sequence to Will's DNA sequence
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
wildtype_cds = "" # insert wild-type coding sequence here

# We will print a "." symbol between Will's dna and the wild-type sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(will_cds)): # going through the sequence base by base
    if will_cds[i]==wildtype_cds[i]: # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(will_cds,symbol,wildtype_cds))

## Exercise 3A Question

Does Will's sequence match the wild-type sequence?

*   Your answer here:


## Pairwise Comparisons between Family Members
Now that we have identified that Will's beta-globin DNA sequence matches the wild-type sequence, we will perform two pairwise sequence comparisons: Will's DNA and protein sequences will be compared first to Kevin's and then to Molly's.

## Exercise 3B - Pairwise Comparison 1: Will and Kevin
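The comparison loop used throughout this exercise can also be wrapped in a reusable function. The sketch below is one possible way to do that, assuming both sequences have the same length; the function name `match_line` and the toy sequences `first` and `second` are hypothetical and not part of the module's own code.

```python
def match_line(seq_a, seq_b):
    """Return a string with "|" where two sequences agree and "." where they differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    # zip() pairs up the characters at the same position in both sequences.
    return "".join("|" if a == b else "." for a, b in zip(seq_a, seq_b))

# Toy sequences, made up for this example.
first = "ATGCCA"
second = "ATGTCA"

print(first)
print(match_line(first, second))
print(second)
```

Checking the lengths first gives a clear error message; without that check, comparing sequences of unequal length position by position would either stop early or fail with an `IndexError`.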

First, consider Will and Kevin's DNA. A region of the coding sequence is provided below.  After sequencing Kevin's beta-globin gene, doctors find that there is a mutation. Can you spot the change below? Run the cell below to find out.

It's easier and less error-prone to let the computer do the comparison, especially when you have very long sequences to compare.

In [None]:
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
kevin_cds = "ATGGTGCATCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"

# We will print a "." symbol between Will and Kevin's DNA at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(will_cds)): # going through the sequence base by base
    if will_cds[i]==kevin_cds[i]: # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(will_cds,symbol,kevin_cds))

To see the impact on Will's and Kevin's final protein sequences, identify the protein encoded by each sequence.

We can combine what we have learned from the previous two exercises to translate the protein sequences to observe the impact of the single-nucleotide change. We will be translating the DNA sequences instead of the RNA sequences because the translation lookup dictionary uses DNA codons.
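Before translating, it can help to see which codon a changed nucleotide falls in. Since codons are groups of three, integer division of the index by 3 gives the codon number (counting from 0). The sketch below applies this to the two sequences from the cell above; the variable names `differences`, `codon_number`, and `codon_start` are chosen for this example.

```python
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
kevin_cds = "ATGGTGCATCTGACTCCTGTGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"

differences = []
for i in range(len(will_cds)):
    if will_cds[i] != kevin_cds[i]:
        codon_number = i // 3            # which codon (counting from 0) holds index i
        codon_start = codon_number * 3   # index of that codon's first base
        will_codon = will_cds[codon_start:codon_start + 3]
        kevin_codon = kevin_cds[codon_start:codon_start + 3]
        differences.append((i, codon_number, will_codon, kevin_codon))
        print("Index", i, "is in codon", codon_number, ":", will_codon, "->", kevin_codon)
```

Index 19 is the twentieth nucleotide because Python counts from 0. Likewise, codon number 6 is the seventh codon, and since the first codon is the start codon, it is the seventh amino acid in the translated sequence.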

**Note**: The code cell below uses the translate() function defined in Exercise 1 above. Make sure you run the code cell under Translation in Exercise 1 before running the code cell below.

In [None]:
# First, use the translate() function to translate both Will and Kevin's DNA sequences. Then, compare their two sequences and
# insert a symbol wherever an amino acid is different, just like the previous code cell.
will_protein = translate(will_cds)
kevin_protein = translate(kevin_cds)

# symbol holds the symbols that indicate whether the amino acids are the same or different between the two sequences.
symbol = " "*17 # create a string with 17 spaces
for i in range(len(will_protein)): # going through the sequence amino acid by amino acid
    if will_protein[i]==kevin_protein[i]: # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("Will's protein:  {}\n{}\nKevin's protein: {}".format(will_protein,symbol,kevin_protein))


## Exercise 3B Questions

Instructions: Edit this text cell to respond to the following questions:

For Will and Kevin, are their DNA sequences the same?

*   Your answer here:

What type of mutation does Kevin have?

*   Your answer here:

What effect does this mutation have on Kevin's protein sequence?  

*   Your answer here:

What effect does this mutation have on Kevin's protein function?  

*   Your answer here:

Explain why this mutation can lead to the symptoms of sickle cell anemia Kevin displays.

*   Your answer here:

## Exercise 3C - Pairwise Comparison 2: Will and Molly

Let's now repeat this process, but this time compare Will's DNA and protein sequences to his cousin Molly's sequences (starting at the start codon).  For DNA, a region of the coding sequence is provided below.  

In [None]:
will_cds = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"
molly_cds = "ATGGTGCATCTGACTCCTGAAGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"

# We will print a "." symbol between Will and Molly's dna at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(will_cds)): # going through the sequence base by base
    if will_cds[i]==molly_cds[i]: # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(will_cds,symbol,molly_cds))

# Now, translate both of their DNA sequences
will_protein = translate(will_cds)
molly_protein = translate(molly_cds)

symbol = " "*17 # create a string with 17 spaces
for i in range(len(will_protein)): # going through the sequence amino acid by amino acid
    if will_protein[i]==molly_protein[i]: # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nWill's  protein: {}\n{}\nMolly's protein: {}".format(will_protein,symbol,molly_protein))

## Exercise 3C Questions

Instructions: Edit this text cell to respond to the following questions:


What type of mutation does Molly have?  

*   Your answer here:

What effect does this mutation have on Molly's protein sequence?  

*   Your answer here:

What effect does this mutation have on Molly's protein function?  

*   Your answer here:

Explain why Molly's beta-globin gene has a mutation, but she does not show symptoms of sickle cell anemia.

*   Your answer here:

## Exercise 3 Check-in Question


Why does no one else in Kevin's immediate family, including his parents, display the same symptoms as Kevin?

*   Your answer here:


----
# Exercise 4: Apply your Programming Skills!

Now that you have explored how we can use Python programming to analyze sequences, you can apply your bioinformatics analysis skills in this assignment.

Lucia and Carlos are young siblings who recently received genetic testing to examine whether any mutations are present in their beta-globin genes.  Below are the DNA sequences from the coding strand of part of the beta-globin gene from Lucia and Carlos.  Each DNA sequence has been assigned a variable and is written as a string of letters, as seen below.


In [None]:
lucia_dna = "ACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"

carlos_dna = "ACAGACACCATGGTGCATCTGACTCCTGAGGAGAGGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAG"

Complete all of the following analyses using Python Programming in this Google Colab Module.  Show all of your work below using code cells to demonstrate your code and text cells to summarize the findings.  Note, if relevant, you may copy the code from previous example code cells to complete the following tasks. (Note that initial code cells have been created for you below each step.)





STEP 1: The sequences provided contain nucleotides both upstream and downstream of the start codon.  Identify the location of the first start codon for each sequence, and print the result. (*Hint: Refer back to [code](#atg) in Exercise 2 for finding the start codon*)

In [None]:
# STEP 1 - identify the start codon



STEP 2: Now that you’ve identified the location of the start codon, create new variables that contain only the sequence including and downstream of the start codon, in other words, the coding sequence; this region is all sequence that is coded into protein.  Use the new variable names “lucia_cds” and "carlos_cds" to indicate that these contain only the coding sequence for the gene. (*Hint: Refer back to [code](#cds) in Exercise 2 to help create these new coding sequence variables.*) Print these new variables. **You will use this coding sequence for the remaining steps.**

In [None]:
# STEP 2 - create new coding sequence variables using start codon positions from STEP 1 (HINT: Refer to Exercise 2 to create these new variables)



*BEFORE YOU CONTINUE*: Take a moment to check that the coding sequences you printed above begin with the start codon.  Remember to use these coding sequences for the remaining steps below.
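If you would like the computer to do this check for you, one option is the String method `startswith()`. The sketch below uses a made-up sequence, `example_cds`, rather than the sequences from this exercise; you would substitute your own variables.

```python
# example_cds is a made-up sequence, not one of the sequences from this module.
example_cds = "ATGCCCTAA"
print(example_cds.startswith("ATG"))      # True

# Two extra bases come before ATG here, so the check fails.
print("CCATGCCCTAA".startswith("ATG"))    # False
```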

STEP 3: Transcribe each DNA **coding sequence** (use your variables “lucia_cds” and "carlos_cds") to its corresponding mRNA sequence and print the result. (*Hint: Refer back to [code](#transcription) in Exercise 1 to help transcribe the CDS sequence.*)

In [None]:
# STEP 3 - transcribe DNA (use your variables “lucia_cds” and "carlos_cds") to mRNA



STEP 4: Translate each DNA **coding sequence** (use your variables “lucia_cds” and "carlos_cds") to its corresponding amino acid sequence and print the result (name your new protein sequences: "lucia_protein" and "carlos_protein"). (*Hint: Refer back to [code](#translation) in Exercise 1 to help translate the CDS sequence.*)

In [None]:
# STEP 4 - translate DNA (use your variables “lucia_cds” and "carlos_cds") to amino acid sequence



*BEFORE YOU CONTINUE*: Take a moment to check that the amino acid sequences you printed above begin with the amino acid that is coded by the start codon (you can refer to the [codon table](#codons) in Exercise 1).

STEP 5: Look at the provided code below that compares each beta-globin gene DNA coding sequence and protein sequence to the wild-type version provided in Exercise 3A. For this step, you do NOT need to write new code, as we have done this for you; instead, wherever you see "XXXX", replace it with the appropriate sequence or variable name from your previous code cells, as described below.

In [None]:
# STEP 5 - pairwise comparison of Lucia's DNA (CDS) to wild-type sequence
# (For this code cell, edit the first line of code only; the remaining code is already complete)


wildtype_dna = "XXXX" # enter the wildtype DNA coding sequence you previously found from exercise 3A here



# Comparison of Lucia's DNA (CDS) to wildtype DNA

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(lucia_cds)): # going through the sequence base by base
    if lucia_cds[i]==wildtype_dna[i]: # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(lucia_cds,symbol,wildtype_dna))

# Comparison of Lucia's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence

symbol = " "*18 # create a string with 18 spaces
for i in range(len(lucia_protein)): # going through the sequence amino acid by amino acid
    if lucia_protein[i]==wildtype_protein[i]: # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nLucia's protein:  {}\n{}\nWildtype protein: {}".format(lucia_protein,symbol,wildtype_protein))

In [None]:
# STEP 5 - pairwise comparison of Carlos' DNA (CDS) to wild-type sequence

# Comparison of Carlos' DNA to wildtype DNA - fill in the incomplete code; anytime you see "XXXX", replace it with the
# appropriate variable name for Carlos


symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(XXXX_cds)): # going through the sequence base by base
    if XXXX_cds[i]==wildtype_dna[i]: # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(carlos_cds,symbol,wildtype_dna))

# Comparison of Carlos' protein to wildtype protein

symbol = " "*18 # create a string with 18 spaces
for i in range(len(XXXX_protein)): # going through the sequence amino acid by amino acid
    if XXXX_protein[i]==wildtype_protein[i]: # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nCarlos' protein:  {}\n{}\nWildtype protein: {}".format(carlos_protein,symbol,wildtype_protein))

## Exercise 4 Check-in Questions
Instructions: Edit this text cell to respond to the following questions:

Are Lucia’s and Carlos’ sequences the same or different from the wild type sequence?

*   Your Answer Here:

What effect, if any, do you predict this will have on protein function?

*   Your Answer Here:

What effect, if any, do you predict this will have on Lucia's or Carlos' phenotype?

*   Your Answer Here: