<a href="https://colab.research.google.com/github/csbfx/apex/blob/main/APEX_Biology_Activity_Module_Bioinformatics_Cancer_Genetics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Bioinformatics with Python Programming: A Case Study on BRCA1's Role in Breast Cancer
#### Created by Wendy Lee, Najelie Crivelli, Michelle Jin, Ravneet Kaur, Akiko Balitactac, Inika Bhatia, Valerie Carr, Morris Jones, and Jennifer Avena.
#### Last updated: August 8, 2023
#### Licensed under CC BY-NC-SA

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/RPLP0_90_ClustalW_aln.gif" width=500><br>
Image credit: Miguel Andrade CC BY-SA 3.0

### Learning objectives:
  1. Using computing, determine the RNA and protein sequence from the DNA sequence of a gene, and compare gene sequences.
  2. Predict the effects of mutations on protein function.


## A few notes before you start
* This file is view-only, meaning that you can't edit it.
    * To create an editable copy, select the <img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/opencolab.png" valign="bottom"> icon at the top of the page. This opens the file in Google Colab in a new browser tab. At the top of the notebook, click the <img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/copytodrive.png" valign="bottom" border=1px> icon. This opens another tab containing your own copy.
    * If you want to refer back to your copy in the future, you can find it in Google Drive in a folder called `Colab Notebooks`.
* To run a cell, use `shift` + `enter` or hit the play button.   
* As you progress through the exercises in this module, run every code cell in order.  Many functions and variables are defined in the beginning exercises in the module and thus need to be run in order for later code in the module to work.

# INTRODUCTION: The Cell Cycle's and BRCA1 Gene's Role in Breast Cancer


**Cell Cycle**

The cell cycle consists of interphase (which includes the G1, S, and G2 phases) and cell division (M phase).  During interphase, cells prepare for cell division through growth (G1 and G2 phases) and DNA replication (S phase).  Cell division then occurs through mitosis, in which one original cell produces two daughter cells.  To protect the integrity of cells during the cell cycle, several checkpoints exist, including the G1/S, G2/M, and M checkpoints.  These checkpoints verify that certain conditions are met before the cell proceeds through the cell cycle; if the conditions are not met, the cell cycle pauses.  At the G1/S checkpoint, before entering S phase, the cell checks that cell growth is appropriate and that the DNA is not damaged.  At the G2/M checkpoint, before entering M phase, the cell checks that DNA replication occurred appropriately and that DNA integrity is maintained.  At the M checkpoint, before M phase is completed, the cell checks spindle assembly and the attachment of chromosomes to the spindle.  Because these checkpoints maintain the integrity of cells throughout the cell cycle, mutations in the genes that control them may contribute to cancer.

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/cell_cycle.jpg" width=500><br>
Image source: [NIH NHGRI](https://www.genome.gov/genetics-glossary/Cell-Cycle)






**Breast Cancer and the BRCA1 Gene**

BRCA1 (breast cancer associated gene 1) is a human tumor suppressor gene that plays a role in DNA repair and regulation of the cell cycle.  Mutations in the BRCA1 gene can affect cell cycle checkpoints, including the G1/S, G2/M, and M checkpoints, and increase the risk of breast and ovarian cancer in females and breast and prostate cancer in males.

Cancer is a disease that occurs when an accumulation of mutations causes aberrant cell division, such that a cell continues to divide when and/or where it typically would not.  About 10% of breast cancers are familial, meaning that a predisposition to cancer arises from an inherited mutation in a gene such as BRCA1.  Individuals who inherit one mutated allele of BRCA1 have an increased probability (~50-80%) of developing breast cancer; if a mutation in the other BRCA1 allele occurs during their lifetime, this, along with mutations in other genes, may lead to a cancerous state.

Genetic testing for this gene is available, and individuals with a family history of breast cancer may choose to receive this testing to provide them information about their risk of breast cancer and to inform their preventative actions and/or treatment.



References: [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1390683/); [NCBI](https://www.ncbi.nlm.nih.gov/gene/672); [NIH](https://www.genome.gov/genetics-glossary/Cell-Cycle#:~:text=A%20cell%20cycle%20is%20a,mitosis%2C%20and%20completes%20its%20division.); [JBC](https://www.jbc.org/article/S0021-9258(18)31037-8/fulltext); [OMIM](https://www.omim.org/entry/113705?search=brca1&highlight=brca1); [CDC](https://www.cdc.gov/genomics/disease/breast_ovarian_cancer/genes_hboc.htm); [Essentials of Genetics](https://www.pearson.com/en-us/subject-catalog/p/Klug-Modified-Mastering-Genetics-with-Pearson-e-Text-Standalone-Access-Card-for-Essentials-of-Genetics-10th-Edition/P200000006980/9780135588789); [Molecular Biology of the Cell](https://www.ncbi.nlm.nih.gov/books/NBK26824/)

# **Case Study Background**

Emily is a 46-year-old woman who noticed a lump in her left breast. Despite not experiencing any other symptoms, she went to the doctor's office to have the lump evaluated. While discussing her family history with the doctor, she indicated that there was a history of breast cancer and other unknown cancers. Due to her family’s medical history and the presence of the lump in her breast, the doctor ordered a breast ultrasound and biopsy.  Based on her test results, she was diagnosed with early-stage (stage IA) breast cancer in the left breast.

To learn more about the cause of Emily's breast cancer, genetic testing was performed on cells from her cancerous tumor.  Emily's sister Annie, who does not currently display any symptoms of cancer, received genetic testing to learn more about her risk of breast cancer. In this case study, we will examine the results of sequencing of the BRCA1 gene.


----
# Exercise 1 - Analyze BRCA1 Gene Sequences

**Examining The Central Dogma: Analyzing Gene Sequences using Python Programming Language**
In this first exercise, you will use several String methods in Python to analyze the content of genomic sequences. In computer programming, a String is a sequence of text or characters. String Methods are functions built to perform a specific task for a String.
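
As a minimal illustration of what a String and a few String methods look like in Python, consider the sketch below. The variable `example_dna` and its value are made up for this illustration and are not part of the case study.

```python
# A short, made-up DNA string used only for illustration
example_dna = "atggacgat"

# len() reports how many characters (bases) the string contains
print(len(example_dna))

# upper() returns a new string in upper case; the original is unchanged
print(example_dna.upper())

# count() reports how many times a character appears in the string
print(example_dna.count("a"))

# Square brackets pull out one character by position, counting from 0
print(example_dna[0])
```

Running this should display `9`, `ATGGACGAT`, `3`, and `a`, one per line.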

Genetic materials consist of DNA, some of which is transcribed into RNA sequences and translated into proteins.  This process is known as the Central Dogma of Biology, as illustrated in the two images below.

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/centraldogma.jpeg" width="380"><br>
Image credit: [Ille, Alexander M., Hannah Lamont, and Michael B. Mathews. "The Central Dogma revisited: Insights from protein synthesis, CRISPR, and beyond." Wiley Interdisciplinary Reviews: RNA 13.5 (2022): e1718.](https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/wrna.1718#pane-pcw-references)







DNA is double-stranded. The protein coding information is stored in the coding strand of DNA.

<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/open_reading_frame.jpg"><br>
Image credit: [National Human Genome Research Institute (NHGRI)](https://www.genome.gov/genetics-glossary/Open-Reading-Frame)




## Transcription: DNA -> RNA
In transcription, DNA is transcribed into RNA.  The non-coding (i.e., template) strand of DNA serves as a template for the cell to make an RNA strand that is complementary and antiparallel. The coding strand DNA sequence is highly similar to the RNA sequence. The only difference is that all the "T"s in the DNA are "U"s in RNA. For example, the DNA coding strand `ATGGACGATGAGCGATAG` will be transcribed to RNA as `AUGGACGAUGAGCGAUAG`.
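
To see why the RNA ends up matching the coding strand, it can help to start from the template strand instead. The sketch below is an illustration only: the names `template_dna` and `pairs` are made up here, and the template sequence was chosen so that it pairs with the example coding strand above.

```python
# Made-up template strand, written 5' -> 3' (illustration only)
template_dna = "CTATCGCTCATCGTCCAT"

# Base-pairing rules: A pairs with T, and C pairs with G
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Read the template backwards (antiparallel) and swap in each partner base
# (complementary) to recover the coding strand, written 5' -> 3'
coding_dna = "".join(pairs[base] for base in reversed(template_dna))

# The RNA has the same sequence as the coding strand, with U in place of T
rna = coding_dna.replace("T", "U")

print("Coding strand:", coding_dna)
print("RNA:          ", rna)
```

The two printed sequences should be `ATGGACGATGAGCGATAG` and `AUGGACGAUGAGCGAUAG`, which is why the rest of this module works directly from the coding strand.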

The computer can simulate the transcription process. We will do this today with part of the DNA sequence of the BRCA1 gene from Emily, who was diagnosed with breast cancer, and from her sister Annie.  For the remainder of this case study, testing results for Emily refer only to testing of her cancer cells.  For both Emily and Annie, you are provided with part of the sequence of *exon 4* in the BRCA1 gene.   Exons are regions of a gene that code for protein, and mutations in BRCA1 exons, including exon 4, have been associated with cancer.   The sequence below is written to resemble part of the *coding strand* sequence.

Below is a part of the exon 4 sequence of Emily's coding strand of DNA, written as a string of letters. For each strand of DNA written in today's activity, the 5' end is on the left side of the sequence and the 3' end is on the right side. Emily's DNA sequence for exon 4 is assigned to a variable called "emily_dna". In this exercise, you will determine Emily's RNA sequence (which will be assigned to a variable called "emily_rna") from the DNA coding strand using the String method [`replace()`](https://docs.python.org/3/library/stdtypes.html#str.replace). Here, we replace the coding strand T's with RNA U's using `replace("original", "new")`, which returns a new string and leaves the original string unchanged. To confirm that the T's were actually replaced, we use the [`print()`](https://www.w3schools.com/python/ref_func_print.asp) function, which displays its output (in this case, the string of Emily's RNA sequence) or a specified message.

To run the code, you can use one of two methods: (1) place the cursor in the code cell below, hold down `shift`, and press `enter`, or (2) select the run button (the right-facing arrow).  Please run the code below.


<a name='transcription'></a>

In [None]:
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
emily_rna = emily_dna.replace("T", "U")
print("Emily's rna:", emily_rna)

Now, practice writing your own code to identify the RNA sequence for Annie.  Annie's coding strand DNA sequence is assigned to a variable called "annie_dna" below.  Remember to run the code once you're done!

In [None]:
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
# indicate your command here to replace coding strand T's with RNA U's

# indicate a command here to display (i.e., "print") the RNA sequence


## Translation: RNA -> Protein

The RNA transcript contains the sequence read by the ribosome to produce the protein (amino acid sequence). Proteins consist of amino acids bound together in a sequence. Each amino acid is encoded by a group of 3 nucleotides called a codon.  Each codon codes for one of 20 amino acids or for the stop of translation (i.e., a stop codon).

The codon table (in terms of RNA) is shown in the picture below.

Recall that a coding sequence for a gene begins with the start codon `AUG` and ends with a stop codon that terminates the translation process. There are three stop codons: `UAA`, `UAG`, `UGA`. Since DNA and RNA are almost identical except for "T"s and "U"s, often, the coding strand of the DNA coding sequence is used directly to determine the amino acid sequence for the protein using computational methods.


<img src="https://www.sjsu.edu/people/wendy.lee/pics/apex/codons.png" width=800>
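
As a rough sketch of how a sequence is read codon by codon, the example RNA from the transcription section can be split into groups of three. The names `example_rna`, `codons`, and `stop_codons` are made up for this illustration.

```python
# Example RNA from the transcription section
example_rna = "AUGGACGAUGAGCGAUAG"

# Step through the string three letters at a time to collect the codons
codons = [example_rna[i:i + 3] for i in range(0, len(example_rna) - 2, 3)]
print(codons)

# The three stop codons, stored as a set for quick membership checks
stop_codons = {"UAA", "UAG", "UGA"}

print("Starts with AUG:", codons[0] == "AUG")
print("Ends with a stop codon:", codons[-1] in stop_codons)
```

Each codon in the printed list can be looked up in the codon table above to find its amino acid.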


The codon and amino acid data are stored in a Python dictionary. A Python dictionary can be thought of as a "container" holding pairs of items. The first item of one of these pairs is the `key`, and the second item is the `value`, creating `key:value` pairs.  In the dictionary below, the codon is used as the `key` and the corresponding amino acid is stored as the `value`.  Note that the stop of translation would be represented by an asterisk (*) in the output from this code.
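
Before looking at the full dictionary, here is a minimal sketch of how `key:value` lookups behave. The dictionary `mini_code` is a made-up four-entry example, not the complete genetic code.

```python
# A tiny dictionary with four codon:amino acid pairs (not the full genetic code)
mini_code = {"ATG": "M", "GAC": "D", "GAT": "D", "TAG": "*"}

# Look up a value by putting its key in square brackets
print(mini_code["ATG"])

# Two different keys may hold the same value
print(mini_code["GAC"] == mini_code["GAT"])

# "in" checks whether a key is present before you try to look it up
print("CCC" in mini_code)
```

This should display `M`, `True`, and `False`. Looking up a key that is not in the dictionary, such as `mini_code["CCC"]`, raises a `KeyError`.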

Click on the play button in the code cell below to see the RNA to protein mapping for part of the exon 4 region of the coding strand of Emily's BRCA1 gene, and then you will have a chance to practice on your own to translate Annie's sequence.

<a name='translation'></a>

In [None]:
# Translate DNA sequence to amino acid sequence

def translate(seq):
    """Translate a DNA sequence to an amino acid sequence."""

    # the following is a Python dictionary that stores
    # the codons and amino acids as key:value pairs.
    geneticode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }

    length = len(seq)

    # Save the amino acid sequence in a list called protein
    protein = []
    for pos in range(0,length-2,3):
        codon = seq[pos:pos+3].upper()
        # Get the appropriate amino acid from the dictionary
        aa = geneticode[codon]
        protein.append(aa)
        if aa == "*": # when we see a stop codon "*"
            return "".join(protein) # return the protein sequence
    # return the protein sequence when we finish processing all the codons
    return "".join(protein)


# Main program
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
emily_protein = translate(emily_dna)
print("Emily's protein sequence: " + emily_protein)
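
The loop in `translate()` uses `range(0, length-2, 3)` so that it only starts a codon where three full bases remain. The sketch below illustrates this with a made-up 8-base sequence; the names `seq` and `starts` are hypothetical.

```python
# Made-up sequence of 8 bases: two full codons plus 2 extra bases
seq = "ATGGACGA"

# Start positions visited by the loop: every third index with 3 bases available
starts = list(range(0, len(seq) - 2, 3))
print(starts)

# The codons read at those positions
print([seq[pos:pos + 3] for pos in starts])

# Bases left over after the last full codon; these are not translated
print(seq[starts[-1] + 3:])
```

This should display `[0, 3]`, `['ATG', 'GAC']`, and `GA`. The same thing happens with the 77-base sequences in this module: 25 codons are translated and the last 2 bases are ignored.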

Now, practice writing your own code to identify the protein sequence for Annie. Annie's coding strand DNA sequence of exon 4 in the BRCA1 gene is assigned to a variable called "annie_dna" below.  Remember to run the code once you're done!

In [None]:
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
#indicate your command here to translate the DNA sequence

#indicate a command here to display (i.e., "print") the protein sequence


## Exercise 1 Check-in Questions

Instructions: Edit this text cell to respond to the following questions:

Why do you predict we are focusing on the sequence of an exon, instead of an intron?

*   Your answer here:

Predict what a mutation in an exon of BRCA1 may have on the function of the BRCA1 protein.

*   Your answer here:



----
# Exercise 2 - Pairwise sequence comparisons


## Exercise 2A - Pairwise Comparison: Emily's Sequence with Known Wild Type Sequence
Since Emily shows symptoms and was diagnosed with breast cancer, let's compare Emily's BRCA1 sequence to the wild type sequence of BRCA1 to determine whether a mutation may exist in Emily's BRCA1 gene.

Here is the sequence for a portion of the wild-type (i.e., typical) sequence of exon 4 of the BRCA1 gene:

TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG

Insert the sequence into the code cell below, between the quotation marks, in the empty variable "wildtype_dna".  You can then run the code cell to compare a portion of Emily's exon 4 sequence to that of the wild type sequence.

In [None]:
# Comparing NCBI wild type DNA sequence to Emily's DNA sequence
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "" # insert BRCA1 portion of exon 4 sequence here

# We will print a "." symbol between Emily's DNA and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(emily_dna)): # going through the sequence base by base
    if emily_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(emily_dna.upper(),symbol,wildtype_dna.upper()))


# Comparison of Emily's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
emily_protein = translate(emily_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(emily_protein)): # going through the sequence amino acid by amino acid
    if emily_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nEmily's protein:  {}\n{}\nWildtype protein: {}".format(emily_protein.upper(), symbol, wildtype_protein.upper()))


*After you have run the code cell above, if you would like to check your answer, you can run the cell below by hitting the play button.*

In [None]:
#@title Check your answers

# Comparing NCBI wild type DNA sequence to Emily's DNA sequence
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG" # insert retrieved NCBI BRCA1 portion of exon 4 sequence here

# We will print a "." symbol between Emily's DNA and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(emily_dna)): # going through the sequence base by base
   if emily_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
       symbol += "|" # add | if the nucleotides between the two sequences are the same
   else:
       symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(emily_dna.upper(),symbol,wildtype_dna.upper()))



# Comparison of Emily's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
emily_protein = translate(emily_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(emily_protein)): # going through the sequence amino acid by amino acid
   if emily_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
       symbol += "|" # add | if the amino acid between the two sequences are the same
   else:
       symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nEmily's protein:  {}\n{}\nWildtype protein: {}".format(emily_protein.upper(), symbol, wildtype_protein.upper()))
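
The `|` and `.` lines make differences visible, but for longer sequences it can help to list the differing positions directly. The following is a minimal sketch that assumes two sequences of equal length. The function name `find_differences` and the two short sequences are hypothetical and are not part of the module.

```python
def find_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch, counting from 1."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Sequences must be the same length to compare them.")
    differences = []
    # zip() pairs up the characters; enumerate() numbers the pairs from 1
    for position, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a != b:
            differences.append((position, a, b))
    return differences


# Made-up sequences that differ at the fifth base
print(find_differences("ATGGACGAT", "ATGGTCGAT"))
```

This should display `[(5, 'A', 'T')]`. The length check also gives a clearer message than an `IndexError` if one of the sequences was left empty by mistake.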


## Exercise 2A Check-in Questions
Instructions: Edit this text cell to respond to the following questions.

Does Emily's BRCA1 gene contain a mutation?  If so, what type of mutation is this and what effect would it have on the BRCA1 protein?

*   Your answer here:

What may be the likely cause of Emily's diagnosed breast cancer?

* Your answer here:

What medical and/or ethical questions might Emily and her doctor consider before sharing the results with her family?

* Your answer here:

## Exercise 2B - Pairwise Comparison: Annie's Sequence with Known Wild Type Sequence
Annie currently shows no symptoms of breast cancer but receives genetic testing to help identify the potential risk of breast cancer. Let's compare Annie's BRCA1 sequence to the wild type sequence of BRCA1 obtained from the NCBI Nucleotide database.  We'll specifically focus on a portion of exon 4 again.

Insert the same portion of exon 4 wild type BRCA1 sequence from Exercise 2A (above) into the code cell below, between the quotation marks, in the empty variable "wildtype_dna".  You can then run the code cell to compare a portion of Annie's exon 4 sequence to that of the wild type sequence.

In [None]:
# Comparing NCBI wild-type DNA sequence to Annie's DNA sequence
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "" # insert retrieved NCBI DNA sequence here (specifically exon 4)

# We will print a "." symbol between Annie's dna and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(annie_dna)): # going through the sequence base by base
    if annie_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(annie_dna.upper(),symbol,wildtype_dna.upper()))


# Comparison of Annie's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
annie_protein = translate(annie_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(annie_protein)): # going through the sequence amino acid by amino acid
    if annie_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nAnnie's protein:  {}\n{}\nWildtype protein: {}".format(annie_protein.upper(), symbol, wildtype_protein.upper()))



*After you have run the code cell above, if you would like to check your answer, you can run the cell below by hitting the play button.*

In [None]:
#@title Check your answers
# Comparing NCBI wild-type DNA sequence to Annie's DNA sequence
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG" # insert retrieved NCBI DNA sequence here (specifically exon 4)

# We will print a "." symbol between Annie's dna and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(annie_dna)): # going through the sequence base by base
   if annie_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
       symbol += "|" # add | if the nucleotides between the two sequences are the same
   else:
       symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(annie_dna.upper(),symbol,wildtype_dna.upper()))


# Comparison of Annie's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
annie_protein = translate(annie_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(annie_protein)): # going through the sequence amino acid by amino acid
   if annie_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
       symbol += "|" # add | if the amino acid between the two sequences are the same
   else:
       symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nAnnie's protein:  {}\n{}\nWildtype protein: {}".format(annie_protein.upper(), symbol, wildtype_protein.upper()))
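
When a single base differs, the type of mutation depends on what happens to the affected codon. The sketch below illustrates that reasoning for a change that starts from an amino acid codon. The function `classify_change` and the four-entry `mini_code` table are hypothetical helpers for illustration and cover only the codons used here.

```python
# Illustrative subset of the genetic code (DNA codons), not the full table
mini_code = {"GAA": "E", "GAG": "E", "GTA": "V", "TAA": "*"}


def classify_change(original_codon, mutated_codon):
    """Label a single-codon substitution by its effect on the amino acid."""
    before = mini_code[original_codon]
    after = mini_code[mutated_codon]
    if before == after:
        return "silent"      # the amino acid stays the same
    if after == "*":
        return "nonsense"    # the codon becomes a stop codon
    return "missense"        # the codon now codes for a different amino acid


print(classify_change("GAA", "GAG"))
print(classify_change("GAA", "GTA"))
print(classify_change("GAA", "TAA"))
```

This should display `silent`, `missense`, and `nonsense`, in that order.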


## Exercise 2B Check-in Questions

Instructions: Edit this text cell to respond to the following questions.


Does Annie's BRCA1 gene contain a mutation?  If so, what type of mutation is this and what effect would it have on the BRCA1 protein?

*   Your answer here:


Based on these results, what might you suggest to Annie regarding her risk of developing breast cancer?  Explain.

*   Your answer here:

----
# Exercise 3: Apply your Programming Skills!

Now that you have had the chance to explore how we can use Python programming to analyze sequences, you have the chance to apply your skills in bioinformatics analysis in this assignment.

Emily and Annie have a younger cousin named Olive who does not show any signs of cancer. However, she also received genetic testing. She has recently received the results of her genetic testing and wants to know whether any mutations are present in her BRCA1 gene. The results from Olive's DNA testing show that she has two different versions (i.e., alleles) of BRCA1.  Below are Olive's two DNA sequences from the coding strands of a portion of exon 4 of the BRCA1 gene written as strings of letters, which have been assigned to the variables "olive1_dna" and "olive2_dna".


In [None]:
olive1_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
olive2_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"

Complete all of the following analyses using Python Programming in this Google Colab Module.  Show all of your work below using code cells to demonstrate your code and text cells to summarize the findings.  Note, if relevant, you may copy the code from previous example code cells to complete the following tasks. (Note that initial code cells have been created for you below each step.)





STEP 1: Transcribe the "olive1_dna" and "olive2_dna" sequences to their corresponding mRNA sequences and print the result (name your new mRNA sequences: "olive1_rna" and "olive2_rna", respectively). (*Hint: Refer back to [code](#transcription) in Exercise 1 to help transcribe the DNA.*)

In [None]:
# STEP 1 - transcribe DNA (use your variables "olive1_dna" and "olive2_dna") to mRNA



STEP 2: Translate the "olive1_dna" and "olive2_dna" sequences to their corresponding amino acid sequences and print the result (name your new protein sequences: "olive1_protein" and "olive2_protein", respectively). (*Hint: Refer back to [code](#translation) in Exercise 1 to help translate the DNA sequence.*)

In [None]:
# STEP 2 - translate DNA (use your variables "olive1_dna" and "olive2_dna") to amino acid sequences



STEP 3: Look at the provided code below that compares Olive's BRCA1 DNA sequences and protein sequences to the wild type version that you previously found from NCBI. For this step, you do NOT need to write new code, as we have done this for you; instead, any instance in which you see "XXXX", replace it with the appropriate sequence or variable names from your previous code cells, as described below.

In [None]:
# STEP 3 - pairwise comparison of Olive's DNA to NCBI wildtype sequence
# (For this code cell, edit only the places marked "XXXX"; the remaining code is already complete)

# STEP 3 - PART I
wildtype_dna = "XXXX" # enter the full wildtype portion of the exon 4 BRCA1 DNA sequence from Exercise 2A here (it must be the same length as Olive's sequences)


# STEP 3 - PART II
# Comparison of Olive1's DNA to wildtype DNA

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(olive1_dna)): # going through the sequence base by base
    if olive1_dna[i].upper() ==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(olive1_dna.upper(), symbol, wildtype_dna.upper()))

# Comparison of Olive1's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence

symbol = " "*18 # create a string with 18 spaces
for i in range(len(olive1_protein)): # going through the sequence amino acid by amino acid
    if olive1_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nOlive1's protein: {}\n{}\nWildtype protein: {}".format(olive1_protein.upper(), symbol, wildtype_protein.upper()))


# STEP 3 - PART III
# Comparison of Olive2's DNA to wildtype DNA
# Fill in the incomplete code; anytime you see "XXXX", replace it with the appropriate variable name for olive2

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(XXXX_dna)): # going through the sequence base by base
    if XXXX_dna[i].upper() ==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(olive2_dna.upper(), symbol, wildtype_dna.upper()))

# Comparison of Olive2's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence

symbol = " "*18 # create a string with 18 spaces
for i in range(len(XXXX_protein)): # going through the sequence amino acid by amino acid
    if XXXX_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nOlive2's protein: {}\n{}\nWildtype protein: {}".format(olive2_protein.upper(), symbol, wildtype_protein.upper()))


## Exercise 3 Check-in Questions
Instructions: Edit this text cell to respond to the following questions.


Does Olive's BRCA1 gene contain a mutation?  If so, which allele(s) contain this mutation, and what type of mutation is this? What effect will this have on the BRCA1 protein?

*   Your answer here:


Based on these results, what might you suggest to Olive regarding her risk of developing breast cancer?  Explain.

*   Your answer here: