<a href="https://colab.research.google.com/github/csbfx/apex/blob/main/Answer_Key_APEX_Bioinformatics_Cancer_Genetics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Bioinformatics with Python Programming: A Case Study on BRCA1's Role in Breast Cancer: KEY
#### Created by Wendy Lee, Najelie Crivelli, Michelle Jin, Ravneet Kaur, Akiko Balitactac, Inika Bhatia, Valerie Carr, Morris Jones, and Jennifer Avena.
#### Last updated: June 30, 2023
#### Licensed under CC BY-NC-SA


### Learning objectives:
  1. Using computing, determine the RNA and protein sequence from the DNA sequence of a gene, and compare gene sequences.
  2. Predict the effects of mutations on protein function.


## Transcription: DNA -> RNA





In [None]:
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
emily_rna = emily_dna.replace("T", "U")
print("Emily's rna:", emily_rna)

Emily's rna: UUUUGCAUGCUGAAACUUCUCAACCAGAAGAAAGGGCCUUCACAGGGUCCUUUAUGUAAGAAUGAUAUAACCAAAAG


In [None]:
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
# indicate your command here to replace coding strand T's with RNA U's
annie_rna = annie_dna.replace("T", "U")
# indicate a command here to display (i.e., "print") the RNA sequence
print("Annie's rna:", annie_rna)


Annie's rna: UUUUGCAUGCUGAAACUUCUCAACCAGAAGAAAGGGCCUUCACAGUGUCCUUUAUGUAAGAAUGAUAUAACCAAAAG
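The cells above treat each sequence as the coding strand, so transcription reduces to swapping T for U. As a minimal sketch, the same step can be wrapped in a function, with a second helper for the case where only the template strand is available. The names `transcribe_coding` and `transcribe_template` are illustrative assumptions and are not part of the original notebook.

```python
# Illustrative helpers; the function names are assumptions, not from the notebook.

def transcribe_coding(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence written 5'->3'."""
    return coding_dna.upper().replace("T", "U")


# Maps each DNA base on the template strand to the RNA base that pairs with it.
_TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe_template(template_dna):
    """Return the mRNA (5'->3') for a template-strand DNA sequence written 5'->3'."""
    # Reverse so the result reads 5'->3', then pair each base with its RNA partner.
    return template_dna.upper()[::-1].translate(_TEMPLATE_TO_RNA)


print(transcribe_coding("ATGCTG"))    # AUGCUG
print(transcribe_template("CAGCAT"))  # AUGCUG
```

Both calls give the same mRNA because `CAGCAT` is the reverse complement of `ATGCTG`.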


## Translation: RNA -> Protein



In [None]:
# Translate DNA sequence to amino acid sequence

def translate(seq):
    """Translate a DNA sequence to an amino acid sequence."""

    # the following is a Python dictionary that stores
    # the codons and amino acids as key:value pairs.
    geneticode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }

    length = len(seq)

    # Save the amino acid sequence in a list called protein
    protein = []
    for pos in range(0,length-2,3):
        codon = seq[pos:pos+3].upper()
        # Get the appropriate amino acid from the dictionary
        aa = geneticode[codon]
        protein.append(aa)
        if aa == "*": # when we see a stop codon "*"
            return "".join(protein) # return the protein sequence
    # return the protein sequence when we finish processing all the codons
    return "".join(protein)


# Main program
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
emily_protein = translate(emily_dna)
print("Emily's protein sequence: " + emily_protein)

Emily's protein sequence: FCMLKLLNQKKGPSQGPLCKNDITK


In [None]:
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
#indicate your command here to translate the DNA sequence
annie_protein = translate(annie_dna)
#indicate a command here to display (i.e., "print") the protein sequence
print("Annie's protein sequence: " + annie_protein)

Annie's protein sequence: FCMLKLLNQKKGPSQCPLCKNDITK


## Exercise 1 Check-in Questions

Instructions: Edit this text cell to respond to the following questions:

Why do you predict we are focusing on the sequence of an exon, instead of an intron?

*   Your answer here: Exons contain the region of the gene that is expressed as protein, while introns do not.  Thus, mutations in exons have the potential to affect the protein sequence and function.

Predict what effect a mutation in an exon of BRCA1 may have on the function of the BRCA1 protein.

*   Your answer here: The effect depends on the type of mutation.  A missense mutation changes one amino acid in the protein and thus may alter protein function.  A nonsense mutation introduces a premature stop codon, producing a shorter protein that likely has impaired function.  A silent mutation does not change the amino acid sequence and thus is unlikely to affect function.  Deletions and insertions may also affect protein function.
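The mutation types named in the answer above can be told apart by comparing the amino acids encoded by the original and the changed codon. The following is a minimal sketch under that assumption; the function name `classify_codon_change` and the compact way of building the codon table are illustrative choices, not part of the original notebook. It covers single-codon substitutions only, not insertions or deletions.

```python
from itertools import product

# Standard genetic code, built compactly: codons are enumerated with bases in
# T, C, A, G order and paired with the matching one-letter amino acid codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Label a single-codon substitution (hypothetical helper)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    if ref_codon == alt_codon:
        return "none"
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"      # same amino acid (or stop) despite a base change
    if alt_aa == "*":
        return "nonsense"    # new premature stop codon
    if ref_aa == "*":
        return "stop-loss"   # original stop codon now encodes an amino acid
    return "missense"        # one amino acid replaced by another


print(classify_codon_change("TGT", "TGC"))  # silent
print(classify_codon_change("TGT", "GGT"))  # missense
print(classify_codon_change("TGT", "TGA"))  # nonsense
```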



----
# Exercise 2 - Pairwise sequence comparisons


## Exercise 2A - Pairwise Comparison: Emily's Sequence with Known Wild Type Sequence


In [None]:
# Comparing NCBI wild type DNA sequence to Emily's DNA sequence
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"

# We will print a "." symbol between Emily's DNA and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(emily_dna)): # going through the sequence base by base
    if emily_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(emily_dna.upper(),symbol,wildtype_dna.upper()))


# Comparison of Emily's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
emily_protein = translate(emily_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(emily_protein)): # going through the sequence amino acid by amino acid
    if emily_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nEmily's protein:  {}\n{}\nWildtype protein: {}".format(emily_protein.upper(), symbol, wildtype_protein.upper()))


TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG
|||||||||||||||||||||||||||||||||||||||||||||.|||||||||||||||||||||||||||||||
TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG

Emily's protein:  FCMLKLLNQKKGPSQGPLCKNDITK
                  |||||||||||||||.|||||||||
Wildtype protein: FCMLKLLNQKKGPSQCPLCKNDITK
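The comparison loop above is repeated for DNA and for protein. As a sketch, it could be wrapped in one reusable helper. The names `match_line` and `show_comparison` are assumptions introduced here for illustration. Unlike the loop above, this version raises an error when the two sequences differ in length (for example, after a nonsense mutation shortens a protein) instead of failing partway through.

```python
def match_line(seq_a, seq_b):
    """Return '|' where two equal-length sequences agree and '.' where they differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length to compare position by position")
    return "".join(
        "|" if a == b else "."
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


def show_comparison(label_a, seq_a, label_b, seq_b):
    """Print two labelled sequences with the match line aligned between them."""
    width = max(len(label_a), len(label_b)) + 2  # room for ':' and one space
    print((label_a + ":").ljust(width) + seq_a.upper())
    print(" " * width + match_line(seq_a, seq_b))
    print((label_b + ":").ljust(width) + seq_b.upper())


# Short excerpt around the differing base, for illustration only.
show_comparison("Emily", "CAGGGTCCT", "Wildtype", "CAGTGTCCT")
```

Computing the padding from the labels avoids the hard-coded 18 spaces, so the same helper works for any pair of names.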


## Exercise 2A Check-in Questions
Instructions: Edit this text cell to respond to the following questions.

Does Emily's BRCA1 gene contain a mutation?  If so, what type of mutation is this and what effect would it have on the BRCA1 protein?

*   Your answer here: Yes, it contains a substitution, specifically a missense mutation: one nucleotide changes from T to G, which changes the amino acid from cysteine (C) to glycine (G).
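To report where the substitution sits rather than reading it off the `|` and `.` line, the mismatches can be listed with their positions. This is a minimal sketch; `find_substitutions` is a hypothetical helper, and positions and codon numbers are 1-based and relative to the start of this exon snippet (assumed to start in frame, as `translate` does), not to the full BRCA1 gene.

```python
def find_substitutions(reference, sample):
    """List (position, codon_number, reference_base, sample_base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    changes = []
    pairs = zip(reference.upper(), sample.upper())
    for index, (ref_base, sample_base) in enumerate(pairs):
        if ref_base != sample_base:
            # index is 0-based; report 1-based position and codon number
            changes.append((index + 1, index // 3 + 1, ref_base, sample_base))
    return changes


wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
emily_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"

print(find_substitutions(wildtype_dna, emily_dna))  # [(46, 16, 'T', 'G')]
```

The single mismatch falls in codon 16, which matches the 16th amino acid flagged in the protein comparison.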

What may be the likely cause of Emily's diagnosed breast cancer?

* Your answer here: This missense mutation likely leads to an alteration in the function of the BRCA1 protein, so that it no longer functions in the cell cycle as it typically would, potentially leading to excessive cell division.  Since she has only this mutated sequence of DNA in her cancer cells, both copies of her BRCA1 gene may contain this same mutation or her cancer cells may have only 1 copy of the BRCA1 gene (i.e., loss of heterozygosity). This, along with mutations in other genes, could then lead to cancer.

What medical and/or ethical questions might Emily and her doctor consider before sharing the results with her family?

* Your answer here: It would be useful to test Emily's non-cancer cells to determine whether the BRCA1 mutation detected in her cancer cells is present in all of her cells, not just the tumor.  Since breast cancer is present in her family history, this allele may be present in other members of her family.  If other members of the family do carry this allele (i.e., have one mutated allele and one wild-type allele), they may have an increased likelihood of developing cancer.  Knowledge of their carrier status could inform whether they choose to take preventative measures, such as a mastectomy (surgical removal of the breasts).

## Exercise 2B - Pairwise Comparison: Annie's Sequence with Known Wild Type Sequence


In [None]:
# Comparing NCBI wild-type DNA sequence to Annie's DNA sequence
annie_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG" # insert retrieved NCBI DNA sequence here (specifically exon 4)

# We will print a "." symbol between Annie's dna and NCBI's sequence at the position where the bases are different
# If the bases are the same between the two sequences, we will put a "|" symbol in between the two sequences.

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(annie_dna)): # going through the sequence base by base
    if annie_dna[i].upper()==wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(annie_dna.upper(),symbol,wildtype_dna.upper()))


# Comparison of Annie's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence
annie_protein = translate(annie_dna)

symbol = " "*18 # create a string with 18 spaces
for i in range(len(annie_protein)): # going through the sequence amino acid by amino acid
    if annie_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nAnnie's protein:  {}\n{}\nWildtype protein: {}".format(annie_protein.upper(), symbol, wildtype_protein.upper()))



TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG

Annie's protein:  FCMLKLLNQKKGPSQCPLCKNDITK
                  |||||||||||||||||||||||||
Wildtype protein: FCMLKLLNQKKGPSQCPLCKNDITK


## Exercise 2B Check-in Questions

Instructions: Edit this text cell to respond to the following questions.


Does Annie's BRCA1 gene contain a mutation?  If so, what type of mutation is this and what effect would it have on the BRCA1 protein?

*   Your answer here: No, there is no mutation present in this region of the gene examined.


Based on these results, what might you suggest to Annie regarding her risk of developing breast cancer?  Explain.

*   Your answer here: Annie does not have the same mutation as her sister, so she does not have an inherited predisposition to breast cancer due to the mutation that exists in her sister's BRCA1 gene.  However, we could look at the remainder of the BRCA1 gene to see if other mutations exist.  There is a possibility that Annie could develop cancer due to somatic mutations (those that are not inherited) that occur over her lifetime.

----
# Exercise 3: Apply your Programming Skills!



In [None]:
olive1_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
olive2_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"

STEP 1: Transcribe the “olive1_dna” and “olive2_dna” sequences to their corresponding mRNA sequences and print the result (name your new mRNA sequences: "olive1_rna" and "olive2_rna", respectively).

In [None]:
# STEP 1 - transcribe DNA (use your variables “olive1_dna” and "olive2_dna") to mRNA
olive1_rna = olive1_dna.replace("T", "U")
olive2_rna = olive2_dna.replace("T", "U")
print("Olive1's rna:", olive1_rna)
print("Olive2's rna:", olive2_rna)


Olive1's rna: UUUUGCAUGCUGAAACUUCUCAACCAGAAGAAAGGGCCUUCACAGUGUCCUUUAUGUAAGAAUGAUAUAACCAAAAG
Olive2's rna: UUUUGCAUGCUGAAACUUCUCAACCAGAAGAAAGGGCCUUCACAGGGUCCUUUAUGUAAGAAUGAUAUAACCAAAAG


STEP 2: Translate the “olive1_dna” and “olive2_dna” sequences to their corresponding amino acid sequences and print the result (name your new protein sequences: "olive1_protein" and "olive2_protein", respectively).

In [None]:
# STEP 2 - translate DNA (use your variables “olive1_dna” and "olive2_dna") to amino acid sequences
olive1_protein = translate(olive1_dna)
olive2_protein = translate(olive2_dna)
print("Olive1's protein:", olive1_protein)
print("Olive2's protein:", olive2_protein)


Olive1's protein: FCMLKLLNQKKGPSQCPLCKNDITK
Olive2's protein: FCMLKLLNQKKGPSQGPLCKNDITK


STEP 3: Look at the provided code below that compares Olive's BRCA1 DNA sequences and protein sequences to the wild type version that you previously found from NCBI. For this step, you do NOT need to write new code, as we have done this for you; instead, any instance in which you see "XXXX", replace it with the appropriate sequence or variable names from your previous code cells, as described below.

In [None]:
# STEP 3 - pairwise comparison of Olive's DNA to NCBI wildtype sequence
# (For this code cell, edit the first line of code only; the remaining code is already complete)

# STEP 3 - PART I
wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG" # enter here the wildtype portion of the BRCA1 exon 4 DNA sequence (EXCLUDING the first nucleotide) that you previously found from NCBI in Exercise 2A

# STEP 3 - PART II
# Comparison of Olive1's DNA to wildtype DNA

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(olive1_dna)): # going through the sequence base by base
    if olive1_dna[i].upper() == wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(olive1_dna.upper(), symbol, wildtype_dna.upper()))

# Comparison of Olive1's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence

symbol = " "*18 # create a string with 18 spaces
for i in range(len(olive1_protein)): # going through the sequence amino acid by amino acid
    if olive1_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nOlive1's protein: {}\n{}\nWildtype protein: {}".format(olive1_protein.upper(), symbol, wildtype_protein.upper()))


# STEP 3 - PART III
# Comparison of Olive2's DNA to wildtype DNA
# Fill in the incomplete code; anytime you see "XXXX", replace it with the appropriate variable name for olive2

symbol = "" # create a new empty string to hold the symbols to indicate if the bases are identical or not
for i in range(len(olive2_dna)): # going through the sequence base by base
    if olive2_dna[i].upper() == wildtype_dna[i].upper(): # check to see if the bases are the same
        symbol += "|" # add | if the nucleotides between the two sequences are the same
    else:
        symbol += "." # add . if the nucleotides between the two sequences are not the same

print("{}\n{}\n{}".format(olive2_dna.upper(), symbol, wildtype_dna.upper()))

# Comparison of Olive2's protein to wildtype protein

wildtype_protein = translate(wildtype_dna) # we need to translate the wildtype DNA to its corresponding protein sequence

symbol = " "*18 # create a string with 18 spaces
for i in range(len(olive2_protein)): # going through the sequence amino acid by amino acid
    if olive2_protein[i].upper()==wildtype_protein[i].upper(): # check to see if the amino acids are the same
        symbol += "|" # add | if the amino acid between the two sequences are the same
    else:
        symbol += "." # add . if the amino acid between the two sequences are not the same
print("\nOlive2's protein: {}\n{}\nWildtype protein: {}".format(olive2_protein.upper(), symbol, wildtype_protein.upper()))

TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG

Olive1's protein: FCMLKLLNQKKGPSQCPLCKNDITK
                  |||||||||||||||||||||||||
Wildtype protein: FCMLKLLNQKKGPSQCPLCKNDITK
TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG
|||||||||||||||||||||||||||||||||||||||||||||.|||||||||||||||||||||||||||||||
TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG

Olive2's protein: FCMLKLLNQKKGPSQGPLCKNDITK
                  |||||||||||||||.|||||||||
Wildtype protein: FCMLKLLNQKKGPSQCPLCKNDITK


## Exercise 3 Check-in Questions
Instructions: Edit this text cell to respond to the following questions.


Does Olive's BRCA1 gene contain a mutation?  If so, which allele(s) contain this mutation, and what type of mutation is this? What effect will this have on the BRCA1 protein?

*   Your answer here: Olive has two different alleles of the BRCA1 gene: one wild-type copy and one mutated copy.  The mutated allele carries the same missense mutation as Emily's.  This missense mutation likely alters the function of the BRCA1 protein, so that it no longer functions in the cell cycle checkpoints as it typically would. However, Olive still has a wild-type copy of the gene, which provides BRCA1 protein with the appropriate function in the cell cycle.
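The reasoning in the answer above (compare each of Olive's alleles to the wild type, then count how many differ) can be sketched in code. The function name `genotype` and its return labels are assumptions made for illustration, and the result only describes the sequenced region, not the whole gene.

```python
def genotype(allele_1, allele_2, wildtype):
    """Summarize how two alleles compare to a wild-type reference (hypothetical helper)."""
    a1, a2, wt = allele_1.upper(), allele_2.upper(), wildtype.upper()
    variant_count = sum(allele != wt for allele in (a1, a2))
    if variant_count == 0:
        return "homozygous wild-type"
    if variant_count == 1:
        return "heterozygous (one wild-type allele, one variant allele)"
    # Both alleles differ from wild type; check whether they carry the same change.
    return "homozygous variant" if a1 == a2 else "two different variant alleles"


wildtype_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
olive1_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAAAG"
olive2_dna = "TTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGGGTCCTTTATGTAAGAATGATATAACCAAAAG"

print(genotype(olive1_dna, olive2_dna, wildtype_dna))
```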


Based on these results, what might you suggest to Olive regarding her risk of developing breast cancer?  Explain.

*   Your answer here:  Since Olive is a carrier (i.e., has one mutated allele and one wild-type allele) of the mutant BRCA1 allele that was also seen in Emily's cancerous cells, she has an increased likelihood of developing cancer.  If the remaining wild-type copy of the gene is mutated, this, along with other mutations, can contribute to a cancerous state. She has the option to consider whether she would like to take preventative measures.