<a href="https://colab.research.google.com/github/daisysong76/bioinformatics-research/blob/main/Lab01_c146_v03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [DS4Bio] Lab 1: Python Coding of Genetic Coding
### Data Science for Biology
**Notebook developed by:** *Sarp Dora Kurtoglu*<br>
**Notebook updated by:** *Skye Pickett, Zcjanin Ollesca, Xiaomei Song, Diego Sotomayor, Evie Currington*

### Learning Outcomes

In this notebook, you will learn about:
* Central Dogma
* Manipulating genetic sequences represented as strings
* Transcribing strands into mRNA
* Translating mRNA into amino acid sequences
* *Optional:* Alternative codon tables


## Table of Contents
1. [Central Dogma](#1.-Central-Dogma)
1. [Transcription](#2.-Transcription)
1. [Translation](#3.-Translation)
1. [EXTRA/OPTIONAL: Alternative Codon Tables](#4.-EXTRA/OPTIONAL:-Alternative-Codon-Tables)
1. [Conclusion](#5.-Conclusion)
***

### Helpful Data Science Resources
Here are some resources you can check out while doing this notebook!

- [Reference Sheet for the datascience Module](http://data8.org/sp22/python-reference.html)<br>(This is extremely helpful whenever you need a cheatsheet!)
- [Documentation for the datascience Module](http://data8.org/datascience/index.html)

### Peer Consulting

If you find yourself having trouble with any content in this notebook, Data Peer Consultants are an excellent resource! Click [here](https://dlab.berkeley.edu/training/frontdesk-info) to locate live help.

Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley.

---

To prepare our notebook environment, run the following cell which imports the necessary packages. It will print `All necessary packages have been imported.` below the cell when it's completed importing.

In [None]:
# Run this cell
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sp
plt.style.use('fivethirtyeight')
from IPython.display import Image
print("All necessary packages have been imported.")


## 1. Central Dogma

The **central dogma of molecular biology** is a ***theory*** stating that genetic information flows in one direction, from DNA --> RNA --> protein, or RNA directly to protein. The conversion from DNA to RNA is called **transcription** and the conversion from RNA to protein is called **translation**.
***
### 1.1 Coding Strand of GFP DNA

In this lab, we will look at the DNA sequence of green fluorescent protein (GFP). GFP is a fluorescent protein that helps biological researchers visualize structures in laboratory settings.

The coding strand of GFP DNA (5' -> 3') is provided below ([source](https://www.ncbi.nlm.nih.gov/nuccore/L29345.1)). In the code cell, we assign the string containing the coding strand of GFP DNA to the variable `coding_strand`.

In [None]:
coding_strand = "tacacacgaataaaagataacaaagatgagtaaaggagaagaacttttcactggagttgtcccaattcttgttgaattagatggcgatgttaatgggcaaaaattctctgtcagtggagagggtgaaggtgatgcaacatacggaaaacttacccttaaatttatttgcactactgggaagctacctgttccatggccaacacttgtcactactttctcttatggtgttcaatgcttttcaagatacccagatcatatgaaacagcatgactttttcaagagtgccatgcccgaaggttatgtacaggaaagaactatattttacaaagatgacgggaactacaagacacgtgctgaagtcaagtttgaaggtgatacccttgttaatagaatcgagttaaaaggtattgattttaaagaagatggaaacattcttggacacaaaatggaatacaactataactcacataatgtatacatcatggcagacaaaccaaagaatggaatcaaagttaacttcaaaattagacacaacattaaagatggaagcgttcaattagcagaccattatcaacaaaatactccaattggcgatggccctgtccttttaccagacaaccattacctgtccacacaatctgccctttccaaagatcccaacgaaaagagagatcacatgatccttcttgagtttgtaacagctgctgggattacacatggcatggatgaactatacaaataaatgtccagacttccaattgacactaaagtgtccgaacaattactaaattctcagggttcctggttaaattcaggctgagactttatttatatatttatagattcattaaaattttatgaataatttattgatgttattaataggggctattttcttattaaataggctactggagtgtat"
coding_strand

By convention, the coding strand is provided instead of the template strand because its sequence matches the mRNA, except that thymine (T) appears in place of uracil (U).

It is much easier to look only at the coding strand and predict the amino acid sequence that it will code for. However, obtaining the mRNA from the coding strand is not very challenging in terms of coding: **all we need to do is change all of the thymines to uracils.**
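
As a minimal sketch of that idea, the snippet below uses a short made-up sequence (`toy_coding`, not part of the GFP data) and Python's built-in `str.replace()` to swap every thymine for a uracil.

```python
# Toy coding strand (5'->3'); a made-up sequence for illustration only.
toy_coding = "atggcttaa"

# The mRNA matches the coding strand with every thymine (t) replaced by uracil (u).
toy_mrna = toy_coding.replace("t", "u")
print(toy_mrna)  # auggcuuaa
```
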
***
### 1.2 Template Strand of GFP DNA

So, we will instead start with the **template strand (3' -> 5')** assigned to the variable `template_strand`.

In [None]:
template_strand = "atgtgtgcttattttctattgtttctactcatttcctcttcttgaaaagtgacctcaacagggttaagaacaacttaatctaccgctacaattacccgtttttaagagacagtcacctctcccacttccactacgttgtatgccttttgaatgggaatttaaataaacgtgatgacccttcgatggacaaggtaccggttgtgaacagtgatgaaagagaataccacaagttacgaaaagttctatgggtctagtatactttgtcgtactgaaaaagttctcacggtacgggcttccaatacatgtcctttcttgatataaaatgtttctactgcccttgatgttctgtgcacgacttcagttcaaacttccactatgggaacaattatcttagctcaattttccataactaaaatttcttctacctttgtaagaacctgtgttttaccttatgttgatattgagtgtattacatatgtagtaccgtctgtttggtttcttaccttagtttcaattgaagttttaatctgtgttgtaatttctaccttcgcaagttaatcgtctggtaatagttgttttatgaggttaaccgctaccgggacaggaaaatggtctgttggtaatggacaggtgtgttagacgggaaaggtttctagggttgcttttctctctagtgtactaggaagaactcaaacattgtcgacgaccctaatgtgtaccgtacctacttgatatgtttatttacaggtctgaaggttaactgtgatttcacaggcttgttaatgatttaagagtcccaaggaccaatttaagtccgactctgaaataaatatataaatatctaagtaattttaaaatacttattaaataactacaataattatccccgataaaagaataatttatccgatgacctcacata"
template_strand

<font color = #d14d0f>**QUESTION 1**:</font> **How many nucleotides long is the GFP template strand?**
>*Hint:<br>- The built-in function `len(__)` could be useful.*
<br> - Use the code cell below.

In [None]:
# Question 1 -- YOUR CODE HERE
len(template_strand)


<font color = #d14d0f>**QUESTION 2**:</font> **What is the percentage of adenine (a) nucleotides in the *double-stranded* GFP DNA? How about guanine? Thymine? Cytosine? Show your work and code below:**
>*Hint:<br>- Think about iteration and how we can use it to go through the specific variable we want.<br>*
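
As an alternative to a manual loop, the sketch below counts bases with the built-in `str.count()` method. The two short strands (`toy_coding`, `toy_template`) are made up for illustration and are assumed to be exact complements of each other.

```python
# Toy double-stranded fragment: a made-up coding strand and its complement.
toy_coding = "atgcgt"
toy_template = "tacgca"

# Pool both strands, since the question asks about double-stranded DNA.
both_strands = toy_coding + toy_template

# str.count() returns how many times a character occurs in the string.
percentages = {
    base: both_strands.count(base) / len(both_strands) * 100
    for base in "atgc"
}
print(percentages)
```

Because every `a` on one strand pairs with a `t` on the other, the pooled percentages for `a` and `t` come out equal, and likewise for `g` and `c`.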

In [None]:
# Question 2 -- YOUR CODE HERE

#Let's create variables that will be the number of the specific element we will look for
adenosine = 0
thymine = 0
guanine = 0
cytosine = 0

#Let's iterate through both strands of the double-stranded DNA using a for loop and conditionals
double_stranded = coding_strand + template_strand
for character in double_stranded:
  if character == 'a':
    adenosine += 1
  elif character == 't':
    thymine += 1
  elif character == 'g':
    guanine += 1
  elif character == 'c':
    cytosine += 1

#Let's get the total number of nucleotides to calculate the percent
length = len(double_stranded)

#Let's calculate the percentage of our integers
percentage_adenosine = adenosine / length * 100
percentage_thymine = thymine / length * 100
percentage_guanine = guanine / length * 100
percentage_cytosine = cytosine / length * 100

percentage_adenosine, percentage_thymine, percentage_guanine, percentage_cytosine


#Hint: The percentages for adenine and thymine, and the percentages for guanine and cytosine, should be equal due to complementary base pairing.

***
## 2. Transcription

<font color = #d14d0f>**QUESTION 3**:</font> **Now, we would like to transcribe the GFP template strand into mRNA. First, let's create a dictionary called `base_pairs` that maps each of the 4 DNA bases to its complementary RNA base. This will be useful when transcribing our gene's DNA sequence into mRNA. Be aware that RNA contains the uracil (U) nucleotide instead of thymine (T).**
>*Hint:<br>- Use the dictionary in the cell below, which includes the first two entries for you.<br> - Use the image below to reference for conceptual understanding.*

<img src="images/base_pairs.jpg" alt="Base Pairs mRNA"/>

In [None]:
# Question 3 -- YOUR CODE HERE
#The first two entries are provided for you.
base_pairs = {
    "a":"u",
    "t":"a",
    "g":"c",
    "c":"g"
}

<font color = #d14d0f>**QUESTION 4**:</font> **Now, write a function `transcribe()` that takes a string `seq` (5'->3') as input and outputs the mRNA (5'->3') that the input DNA codes for. Be aware that transcription uses the provided DNA as the template. So, the mRNA returned by `transcribe()` should be the reverse complement of the DNA:**
>*Hint:<br>- The dictionary we created in Question 3 can be useful, especially if we use the method `get()`.
<br>- The function is named `transcribe` and has only one argument, `seq`.*
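
To see why the reversal matters, here is a small sketch on a made-up template strand. The names `toy_pairs` and `toy_template` are illustrative only; `toy_pairs` simply mirrors the DNA-to-RNA pairs from Question 3.

```python
# Hypothetical pairing table, mirroring the DNA -> RNA pairs from Question 3.
toy_pairs = {"a": "u", "t": "a", "g": "c", "c": "g"}

# Made-up template strand written 5'->3'.
toy_template = "ttacat"

# Step 1: complement each base; this intermediate result reads 3'->5'.
complement = "".join(toy_pairs.get(base) for base in toy_template)

# Step 2: reverse it so the mRNA reads 5'->3'.
toy_mrna = complement[::-1]
print(complement, toy_mrna)  # aaugua auguaa
```

Only after the reversal does the start codon `aug` appear at the 5' end of the toy mRNA.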


In [None]:
# Question 4 -- YOUR CODE HERE
def transcribe(seq):
  # Complement each base of the 5'->3' template strand, then reverse the result
  # so that the returned mRNA also reads 5'->3' (the reverse complement).
  complement = ""
  for element in seq:
    new_letter = base_pairs.get(element)
    complement += new_letter
  return complement[::-1]


<font color = #d14d0f>**QUESTION 5**:</font> **Now let's put to use the `transcribe` function we just defined in Question 4. We want to obtain the mRNA sequence for the GFP DNA. Make sure it is in the 5'->3' direction. Save the final mRNA sequence in a string called `mrna`.**

In [None]:
# Question 5 -- YOUR CODE HERE
# template_strand is stored 3'->5', so reverse it to 5'->3' before transcribing.
mrna = transcribe(template_strand[::-1])
mrna

<font color = #d14d0f>**QUESTION 6**:</font> **Looking at `mrna` from Question 5, where is the first start codon, and where does the open reading frame (ORF) starting with it end? Print the reading frame portion of the mRNA (including the start and stop codons). Save the final mRNA reading frame as a string under the variable `mrna_frame`. Show your work and code below.**
>*Hint:<br>- Remember that a string needs to be in quotes (ie, `"dna"` and `'dna'` are strings but `dna` is not).*
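
One possible approach, sketched here on a made-up mRNA (`toy_mrna`) rather than the GFP sequence, is to locate the first `aug` and then step through the sequence three bases at a time until an in-frame stop codon appears. The variable names are illustrative assumptions.

```python
# Made-up mRNA (5'->3'), lower case like the mrna string in this lab.
toy_mrna = "ccaugaaaggguaacc"
stop_codons = ("uaa", "uag", "uga")

# str.find() returns the index of the first "aug", or -1 if there is none.
start = toy_mrna.find("aug")

# Walk codon by codon from the start codon until an in-frame stop codon.
# If no stop codon is found, the frame simply runs to the last full codon.
toy_frame = ""
if start != -1:
    for i in range(start, len(toy_mrna) - 2, 3):
        codon = toy_mrna[i:i + 3]
        toy_frame += codon
        if codon in stop_codons:
            break
print(start, toy_frame)  # 2 augaaaggguaa
```

Note that the `uga` overlapping the start codon (positions 3 to 5) is ignored because it is not in frame with `aug`.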



In [None]:
# Question 6 -- YOUR CODE HERE
mrna_frame = ...
mrna_frame

***
## 3. Translation

**Finally, for the last part of the central dogma, we will translate our mRNA into an amino acid sequence!**

You are provided the standard codon conversion table below. The keys of this dictionary are the 3-nucleotide-long codons and the corresponding values are the amino acids they code for.
>*Hint:<br>- Reminder: in a dictionary, each key appears before a colon and its value appears after the colon. A dictionary is a collection of key-value pairs.<br> For example, `example_dict = {"<key_1>": "<value_1>", "<key_2>": "<value_2>"}` is a dictionary with 2 pairs.*

Be sure to **run the cell below** so the `codon_table` variable gets defined.

In [None]:
codon_table = {
    'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
    'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
    'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
    'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
    'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
    'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
    'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
    'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
    'UAC':'Y', 'UAU':'Y', 'UAA':'_', 'UAG':'_',
    'UGC':'C', 'UGU':'C', 'UGA':'_', 'UGG':'W'}


<font color = #d14d0f>**QUESTION 7**:</font> **Using the variable `mrna_frame` from Question 6, create a list called `codons` which contains every codon in the reading frame *in order*. Finally, print `codons`.**<br> **Important: The mRNA nucleotides were all lower case, but the codons in `codon_table` are all upper case. You will need to capitalize all of the nucleotides before using `codon_table`.**
>*Hint:<br>- You can use `string.upper()` to convert each character of the string to upper case.*
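
A minimal sketch of this step on a made-up reading frame (`toy_frame`, not the GFP frame) slices the upper-cased string in steps of three:

```python
# Made-up reading frame in lower case, for illustration only.
toy_frame = "augaaaggguaa"

# Convert to upper case so the codons match the keys of the codon table.
upper_frame = toy_frame.upper()

# Slice the string into consecutive, non-overlapping groups of three bases.
toy_codons = [upper_frame[i:i + 3] for i in range(0, len(upper_frame), 3)]
print(toy_codons)  # ['AUG', 'AAA', 'GGG', 'UAA']
```
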


In [None]:
# Question 7 -- YOUR CODE HERE
codons = [...]

print(codons)



<font color = #d14d0f>**QUESTION 8**:</font> **Using the `codon_table` dictionary defined for you above Question 7, write a function `translate()` that takes in a (5'->3') string, `mrna`, and a codon `table` as inputs, and outputs the corresponding sequence of amino acids after translation in string form. Your final string should contain *only* amino acids; it should not include any `_` (underscore) characters produced by stop codons.**
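
The general pattern can be sketched with a tiny made-up table. Both `toy_table` and `toy_translate` are hypothetical names used only for this illustration, and stopping at the first stop codon is one reasonable reading of the requirement, not the only one.

```python
# Tiny made-up subset of the standard codon table, for illustration only.
toy_table = {"AUG": "M", "AAA": "K", "GGG": "G", "UAA": "_"}

def toy_translate(mrna, table):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = ""
    mrna = mrna.upper()
    # Step through complete codons only; leftover bases at the end are ignored.
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = table[mrna[i:i + 3]]
        if amino_acid == "_":
            break  # stop codon: end translation without adding "_"
        protein += amino_acid
    return protein

print(toy_translate("augaaaggguaa", toy_table))  # MKG
```
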

In [None]:
# Question 8 -- YOUR CODE HERE
def translate(mrna, table):
  ...
  return ...


<font color = #d14d0f>**QUESTION 9**:</font> **Obtain the amino acid sequence from the GFP mRNA. Save the final amino acid sequence as a string to the variable `aa_seq`.**



In [None]:
# Question 9 -- YOUR CODE HERE
aa_seq = ...
aa_seq


***
## 4. EXTRA/OPTIONAL: Alternative Codon Tables

Contrary to what is taught in many biology courses, the codon conversion table is not universal. We now know that the genetic code has evolved over time, which has led to differences in the codon tables used by different species and organelles. So, we need to pay attention to where our genetic sequence comes from and pick the correct codon table. For example, in some ciliated protozoa, the standard stop codons UAA and UAG code for glutamine.




<font color = #d14d0f>**QUESTION 10**:</font> **Let's imagine that the GFP DNA we have is actually read using the yeast mitochondrial genetic code. Below, the differences between the standard codon table and the yeast mitochondrial codon table are provided. The rest of the question continues in the cell below the code cell.**


In [None]:
 #Codon      Mit.Yeast       Standard

 # AUA        Met  M         Ile  I
 # CUU        Thr  T         Leu  L
 # CUC        Thr  T         Leu  L
 # CUA        Thr  T         Leu  L
 # CUG        Thr  T         Leu  L
 # UGA        Trp  W         Ter  _

#Ter stands for termination
#This information is adapted from https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=tgencodes#SG3

**Directly alter our previous codon table called `codon_table` to reflect the differences in the alternative codon table and save it as a dictionary under the variable `alt_table`. Use the cell below to create `alt_table`.**
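
One way to do this without overwriting the original table, sketched here on a made-up four-entry table (`toy_standard`), is to copy the dictionary first and then update the copy. The names `toy_standard` and `toy_alt` are illustrative assumptions; the changed entries are taken from the comparison above.

```python
# Made-up four-entry excerpt of a standard table, for illustration only.
toy_standard = {"AUA": "I", "CUU": "L", "UGA": "_", "AUG": "M"}

# dict.copy() makes a shallow copy, so edits do not touch toy_standard.
toy_alt = toy_standard.copy()

# dict.update() overwrites the listed keys and leaves the others unchanged.
toy_alt.update({"AUA": "M", "CUU": "T", "UGA": "W"})
print(toy_alt)
```

Keeping the original dictionary intact means the standard and alternative translations can still be compared afterwards.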

In [None]:
# Question 10 -- YOUR CODE HERE
alt_table = ...




<font color = #d14d0f>**QUESTION 11**:</font> **Using the new alternative codon conversion table you have just created for Question 10, obtain the amino acid sequence from the GFP mRNA reading frame that you had chosen.**


In [None]:
# Question 11 -- YOUR CODE HERE




<font color = #d14d0f>**QUESTION 12**:</font> **Find the first start codon in the GFP mRNA sequence (`mrna`).**
>*Hint:<br>- To find the location of the first start codon in the mRNA sequence, you can use the string method `find()`. It returns the index of the first occurrence of a given substring, or `-1` if the substring is not found.*
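
For instance, on a made-up sequence (`toy_mrna`, used only for illustration), `str.find()` behaves as follows:

```python
# Made-up mRNA containing two "aug" codons.
toy_mrna = "ccaugaaaugg"

# find() reports only the first occurrence, using zero-based indexing.
first_start = toy_mrna.find("aug")

# A substring that never occurs yields -1 rather than raising an error.
missing = toy_mrna.find("ttt")
print(first_start, missing)  # 2 -1
```
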


In [None]:
# Question 12 -- YOUR CODE HERE

# Use the find() function to locate the first occurrence of the start codon in the mRNA sequence.

# Print the index of the first start codon. Remember that Python indices are zero-based, so the first base is at index 0.



<font color = #d14d0f>**QUESTION 13**:</font> **Find the frequency of each dinucleotide (pair of adjacent nucleotides) in the GFP mRNA sequence (`mrna` variable).**
>*Hint:<br>- You can create a dictionary to count the occurrences of each dinucleotide. Then, you can iterate through the sequence and update the counts accordingly.*
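
The counting pattern can be sketched on a made-up sequence. This sketch assumes that pairs are read as overlapping two-base windows (the question does not specify this), and it uses `collections.Counter` from the standard library in place of a hand-built dictionary; `toy_mrna` is an illustrative name.

```python
from collections import Counter

# Made-up mRNA, for illustration only.
toy_mrna = "auauaug"

# Assumption: pairs are overlapping windows of two adjacent bases.
pairs = [toy_mrna[i:i + 2] for i in range(len(toy_mrna) - 1)]

# Counter is a dictionary subclass that tallies how often each item occurs.
counts = Counter(pairs)
total = sum(counts.values())

# Frequency = count of a pair divided by the total number of pairs.
frequencies = {pair: count / total for pair, count in counts.items()}

# most_common(1) returns a list with the single highest-count (pair, count) tuple.
most_common_pair, most_common_count = counts.most_common(1)[0]
print(frequencies, most_common_pair)
```
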


In [None]:
# Question 13 -- YOUR CODE HERE

# Convert the GFP mRNA sequence (mrna) into a list of dinucleotides.

# Create a dictionary to store the count of each dinucleotide.

# Calculate the frequency (count / total dinucleotide count) of each dinucleotide.

# Identify the dinucleotide with the highest frequency and print it.


***
### Congratulations! You have finished LAB 1!

***
## 5. Conclusion
Over the course of this notebook, you:
* Learned about coding and template DNA strands
* Identified nucleotides in GFP DNA
* Coded using if statements and/or for loops to create percentages
* Created dictionaries for pair matching
* Created functions to transcribe and translate DNA sequences


***