## Lab 4: Coding region and other mRNA features

##### Q0: What's your name?

The purpose of this exercise is to retrieve your gene's mRNA sequence from GenBank and perform some preliminary analysis. You'll parse your gene using the "Features" annotations of its GenBank record.

##### Q1: What gene have you chosen for your final project?

Query GenBank

In [24]:
from Bio import Entrez # we'll import a set of code written for common biological applications
from Bio import SeqIO

In [25]:
GI_id = "NM_000321" # enter the accession number (or GI number) for the mRNA sequence of your gene ***as a string***

Documentation for efetch: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

Code adapted from Biopython Tutorial, 5.3.1  Parsing GenBank records from the net

In [26]:
Entrez.email = "" # let NCBI know who you are
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=GI_id) # db = database to query, rettype = retrieval type, retmode = retrieval mode 
seq_record = SeqIO.read(handle, "gb") # read the contents "fetched" from NCBI
handle.close()

#print the contents of "seq_record"
print(seq_record)

#print the type of data in "seq_record"
type(seq_record)

ID: NM_000321.2
Name: NM_000321
Description: Homo sapiens RB transcriptional corepressor 1 (RB1), mRNA.
Number of features: 67
/source=Homo sapiens (human)
/structured_comment=defaultdict(<class 'dict'>, {'Evidence-Data': {'RNAseq introns': 'single sample supports all introns', 'Transcript exon combination': 'BC040540.1, SRR1660809.155247.1'}})
/organism=Homo sapiens
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AL392048.9, BC040540.1 and
L41870.1.
This sequence is a reference standard in the RefSeqGene project.
On Jun 9, 2006 this sequence version replaced gi:4506434.
Summary: The protein encoded by this gene is a negative regulator
of the cell cycle and was the first tumor suppressor gene found.
The encoded protein also stabilizes constitutive heterochromatin to
maintain the overall chromatin structure. The active,
hypophosphorylated form of the protein binds transcription factor
E2F1. Defects in this gene are a cause of

Bio.SeqRecord.SeqRecord

#### Q2: How many "features" does your gene have?

In [27]:
#print the number of features
# tip: in Jupyter, ?seq_record (or ?name for any object or function) shows its documentation
print(len(seq_record.features))

67


### Features
Open the following documentation from INSDC, a collaborative effort from DDBJ/EMBL/GenBank databases to develop standards for feature annotation practices, and read about what features are and how to read this information.  http://www.insdc.org/documents/feature_table.html 

Understand the following info (or shared descriptors) for a feature: Type, Location and Qualifiers.
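
As a rough illustration of those three descriptors, the sketch below builds a single feature by hand with Biopython's SeqFeature and FeatureLocation classes instead of fetching one from NCBI. The coordinates and the "gene" qualifier are made-up toy values (assumptions for illustration), not taken from any real record.

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Toy feature: the coordinates and qualifier values are illustrative assumptions.
toy_feature = SeqFeature(
    FeatureLocation(10, 40, strand=+1),  # Location: where the feature sits on the sequence
    type="CDS",                          # Type: what kind of feature this is
    qualifiers={"gene": ["toyGene"]},    # Qualifiers: extra key/value annotations
)

print("type:      ", toy_feature.type)
print("location:  ", toy_feature.location)
print("qualifiers:", dict(toy_feature.qualifiers))
```

Every feature in a GenBank record parsed by Biopython exposes the same three attributes, so whatever works on this toy object should also work on the features of your own gene.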

Next, open up a web browser and navigate to NCBI Nucleotide to perform an interactive search for your gene. When you click to see the record for your gene, a Genbank record is displayed. Scroll down until you see the "Features" section.

**Q3: What type of information is found here?**
 

Locate the feature called "CDS". 

**Q4: What does "CDS" mean?**

Next we'll loop over the features of your gene's genbank record and print each feature.

In [28]:
for f in seq_record.features: # for each feature "f" in this record's features
    print(f) # display the feature; the for loop moves on to the next feature automatically

type: source
location: [0:4772](+)
qualifiers:
    Key: chromosome, Value: ['13']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['13q14.2']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']

type: gene
location: [0:4772](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:5925', 'HGNC:HGNC:9884', 'MIM:614041']
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: note, Value: ['RB transcriptional corepressor 1']

type: exon
location: [0:303](+)
qualifiers:
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: inference, Value: ['alignment:Splign:1.39.8']

type: misc_feature
location: [64:67](+)
qualifiers:
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: note, Value: ['upstream in-frame stop codon']

type: CDS
location: [166:2953](+)
qualifiers:
    Key: codon_start, 

**Q5: What "type" of features are returned? You know about the "CDS" feature - are there other features you're unfamiliar with? If so look up those feature types in the URL above (INSDC) and make note below of the type of information each contains.**
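
When a record has dozens of features, a tally of the feature types can make it easier to see which ones you need to look up. The sketch below is one possible way to do that with collections.Counter. The helper name count_feature_types and the stand-in objects are hypothetical; in the notebook you would pass seq_record.features instead of the toy list.

```python
from collections import Counter
from types import SimpleNamespace

def count_feature_types(features):
    """Tally how many times each feature type occurs."""
    return Counter(f.type for f in features)

# Stand-in objects that only carry a .type attribute (hypothetical data);
# with a real record, use count_feature_types(seq_record.features).
toy_features = [
    SimpleNamespace(type=t) for t in ["source", "gene", "exon", "exon", "CDS"]
]

for feature_type, n in count_feature_types(toy_features).most_common():
    print(feature_type, n)
```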

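Notice how each of the features has some shared descriptors: type, location, and qualifiers. For a given feature you can access the information in these descriptors directly.
For example, to loop over all the features in a record and print each feature together with the Python type of the feature object, I would use a loop like the one below. Note that type(f) reports the Python class (Bio.SeqFeature.SeqFeature), which is the same for every feature; the feature's own type, such as "CDS" or "exon", is stored in f.type.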

In [29]:
# for each feature in this record's features
for f in seq_record.features:
    # print out the feature followed by the Python type of the feature object
    print("feature=\n", f, "type(feature):", type(f), "\n")

feature=
 type: source
location: [0:4772](+)
qualifiers:
    Key: chromosome, Value: ['13']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['13q14.2']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']
 type(feature): <class 'Bio.SeqFeature.SeqFeature'> 

feature=
 type: gene
location: [0:4772](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:5925', 'HGNC:HGNC:9884', 'MIM:614041']
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: note, Value: ['RB transcriptional corepressor 1']
 type(feature): <class 'Bio.SeqFeature.SeqFeature'> 

feature=
 type: exon
location: [0:303](+)
qualifiers:
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: inference, Value: ['alignment:Splign:1.39.8']
 type(feature): <class 'Bio.SeqFeature.SeqFeature'> 

feature=
 type: misc_feature
location: [64:67](+)
qualifiers:
    Key: gene, Value: ['RB1']

If I wanted to print out the location of each feature I might use this loop:

In [30]:
for f in seq_record.features: # for each feature in this record's features
    # print out the feature followed by its location
    print("feature =\n", f, "location(feature) =", f.location, "\n")

feature =
 type: source
location: [0:4772](+)
qualifiers:
    Key: chromosome, Value: ['13']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['13q14.2']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']
 location(feature) = [0:4772](+) 

feature =
 type: gene
location: [0:4772](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:5925', 'HGNC:HGNC:9884', 'MIM:614041']
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: note, Value: ['RB transcriptional corepressor 1']
 location(feature) = [0:4772](+) 

feature =
 type: exon
location: [0:303](+)
qualifiers:
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110; PPP1R130; pRb; RB']
    Key: inference, Value: ['alignment:Splign:1.39.8']
 location(feature) = [0:303](+) 

feature =
 type: misc_feature
location: [64:67](+)
qualifiers:
    Key: gene, Value: ['RB1']
    Key: gene_synonym, Value: ['OSRC; p105-Rb; pp110;

Using the above two loops as a guide - write a for loop to print the location if the feature type is CDS. Since we're using an mRNA sequence, there should be only one CDS region in this Genbank record. **If there is more than one CDS feature listed in the Genbank record, see your TA before continuing on with the exercise.**

In [31]:
for f in seq_record.features: # for each feature in this record's features
    # ask if the feature is a CDS
    if f.type == "CDS":
        # if so print out the location of the feature
        print("location =", f.location, "\n")

location = [166:2953](+) 
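
Since the rest of the exercise assumes exactly one CDS, it can help to have the code check that assumption rather than relying on reading the printed output. The following is a minimal sketch of such a guard. The function name get_single_cds and the stand-in feature objects are hypothetical; with a real record you would call it on seq_record.features.

```python
from types import SimpleNamespace

def get_single_cds(features):
    """Return the only CDS feature, or raise if there are zero or several."""
    cds_features = [f for f in features if f.type == "CDS"]
    if len(cds_features) != 1:
        raise ValueError(
            f"expected exactly 1 CDS feature, found {len(cds_features)}"
        )
    return cds_features[0]

# Stand-in features with made-up locations (hypothetical data).
toy = [
    SimpleNamespace(type="gene", location="[0:50](+)"),
    SimpleNamespace(type="CDS", location="[10:40](+)"),
]
print("location =", get_single_cds(toy).location)
```

If the function raises a ValueError for your gene, that is the situation in which you should see your TA before continuing.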



**Q6: Compare the location of the CDS returned by Biopython to the location of the CDS according to the NCBI Nucleotide record for your gene. Are they the same? Explain any differences you see.**  

Hint: Read the Biopython cookbook (http://biopython.org/DIST/docs/tutorial/Tutorial.html) section 4.3.3,  Sequence described by a feature or location.

#### **Q7: Does the CDS region start at the first nucleotide of your mRNA sequence? Is the last nucleotide of the CDS region the last nucleotide of your mRNA sequence?   What is the region of the mRNA before the CDS called? What is the region of the mRNA following the CDS called?**

The location descriptor has some additional characteristics which are very useful. Suppose we want to retrieve the nucleotide sequence of the CDS and assign it to a variable called CDS_seq. Notice that the CDS feature does not actually contain the nucleotide sequence of the CDS. Since the location of the CDS is defined relative to the sequence of the original record, we can use the information in "location" to return the slice of the original sequence corresponding to the CDS. Because we fetched an mRNA sequence, the exons are already spliced together and the CDS is one contiguous region, so we don't have to piece exons together ourselves.

But how do we retrieve the beginning and ending positions from \[###:###](+)? We'll see in later labs that we can use regular expressions to pull pieces of information out of a string without knowing beforehand where in the string that information sits. But we don't have to do that here.

For example, if f.location is \[297:369](+)  
f.location.start will return 297.  
f.location.end will return 369.
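
Biopython stores these positions in Python's slice convention: the start is a 0-based index and the end is one past the last base. The sketch below uses the \[297:369](+) example from the text, built by hand with FeatureLocation, to show how the same region would be written in 1-based, inclusive numbering such as that of a GenBank flat file. Treat it as an illustration of the two conventions, not as part of the required solution.

```python
from Bio.SeqFeature import FeatureLocation

# The example location from the text, constructed by hand.
loc = FeatureLocation(297, 369, strand=+1)

start = int(loc.start)  # 0-based index of the first base
end = int(loc.end)      # index one past the last base (Python slice style)

print("Python slice:     ", f"[{start}:{end}]")
print("length:           ", end - start)
print("1-based inclusive:", f"{start + 1}..{end}")
```

Converting with int() is optional for slicing, but it gives plain integers that are easy to print or do arithmetic on.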

Modify the for loop you wrote above (the one which prints the location if the feature type is CDS) - to assign the starting position of the CDS to the variable cds_start and to assign the ending position of the CDS to the variable cds_end.

In [32]:
for f in seq_record.features: # for each feature in this record's features
    if f.type == "CDS": # ask if the feature is a CDS
        cds_start = f.location.start # if so, assign the start location to the variable cds_start
        cds_end = f.location.end # if so, assign the end location to the variable cds_end       
        #print the value of cds_start
        print("cds_start =", cds_start)
        #print the value of cds_end
        print("cds_end =", cds_end)

cds_start = 166
cds_end = 2953


Use the variables cds_start and cds_end to slice the mRNA sequence of your gene into three pieces. The nucleotide sequence is found in seq_record.seq

3 pieces:
a. The 5' untranslated region. Call this variable "fiveUTR_seq". 
mRNA sequence from index 0 up to, but not including, the first nucleotide of the CDS region.
b. The coding sequence. Call this variable "CDS_seq"
This is the entire CDS region which includes the stop codon (TAA, TAG, or TGA).
c. The 3' UTR sequence. Call this variable "threeUTR_seq".
mRNA sequence from the first nucleotide after the CDS region to the end of the mRNA sequence.



In [33]:
for f in seq_record.features: # for each feature in this record's features
    if f.type == "CDS": # ask if the feature is a CDS
        cds_start = f.location.start
        cds_end = f.location.end        
#assign the nucleotide sequence of your mRNA (seq_record.seq) to a variable mRNA_seq
mRNA_seq = seq_record.seq

#print the length of the mRNA sequence
print("mRNA sequence length =", len(mRNA_seq))

# slice mRNA_seq to return the portion of the mRNA sequence before the beginning of the coding sequence, assign this sequence to a variable fiveUTR_seq
#print the length of the 5' UTR sequence
fiveUTR_seq = mRNA_seq[:cds_start]
print("Length of 5 UTR sequence =" , len(fiveUTR_seq))

# slice mRNA_seq to return the portion of the mRNA sequence that is the coding sequence, assign this sequence to a variable CDS_seq
#print the length of the coding sequence
CDS_seq = mRNA_seq[cds_start:cds_end]
print("CDS_seq length=", len(CDS_seq))

# slice mRNA_seq to return the portion of the mRNA sequence after the end of the coding sequence, assign this sequence to a variable threeUTR_seq
#print the length of the 3' UTR sequence
threeUTR_seq = mRNA_seq[cds_end:]
print("threeUTR_seq length =", len(threeUTR_seq))

#check that the sum of the lengths of the 5'UTR, the coding region, and the 3'UTR are equal to the total length of the mRNA
assert len(fiveUTR_seq) + len(CDS_seq) + len(threeUTR_seq) == len(mRNA_seq), "Oooops counting mismatch, try again!"


mRNA sequence length = 4772
Length of 5 UTR sequence = 166
CDS_seq length= 2787
threeUTR_seq length = 1819
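
Beyond checking that the three lengths add up, a few quick checks on the coding sequence itself can catch an off-by-one slice. A complete CDS is normally a whole number of codons (here 2787 = 3 × 929), typically begins with ATG, and ends with one of the stop codons listed above. The sketch below shows one way to test this; the function name check_cds and the toy sequence are hypothetical, and in the notebook you would call check_cds(CDS_seq).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(cds):
    """Return a dict of simple sanity checks for a coding sequence."""
    cds = str(cds).upper()  # also accepts a Biopython Seq object
    return {
        "length_multiple_of_3": len(cds) % 3 == 0,
        "starts_with_ATG": cds.startswith("ATG"),
        "ends_with_stop": cds[-3:] in STOP_CODONS,
    }

toy_cds = "ATGAAACCCTAG"  # made-up 12-nt example, not a real gene
print(check_cds(toy_cds))
```

Some annotated coding regions are partial or use a non-ATG start codon, so a failed check is a reason to look more closely at the record, not proof that your slice is wrong.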


Now that you've identified the sequences corresponding to different parts of your gene, let's save these sequences as FASTA files so that you can access them later.

FASTA file format:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

In [34]:
print(GI_id)

# make a name for the file where you'll save your gene's 5'UTR sequence
five_UTR_filename = GI_id + "_5UTR_seq.fasta" 
# make a name for the file where you'll save your gene's CDS
CDS_filename = GI_id + "_CDS_seq.fasta" 
# make a name for the file where you'll save your gene's 3'UTR sequence
three_UTR_filename = GI_id + "_3UTR_seq.fasta" 

# open a text file to write to ("w") and call it fivefile
with open(five_UTR_filename, "w") as fivefile: 
    # create a fasta header describing the sequence that will follow
    header = ">" + GI_id + "_5UTR" + '\n' 
    # write the header to file
    fivefile.write(header) 
    # note that the write function only accepts strings, so we need to convert the Biopython object containing the sequence to a string
    fivefile.write(str(fiveUTR_seq))
    
# open a text file to write to ("w") and call it cdsfile
with open(CDS_filename, "w") as cdsfile : 
    # create a fasta header describing the sequence that will follow
    header = ">" + GI_id + "_CDS" + '\n'
    # write the header to file
    cdsfile.write(header) 
    # note that the write function only accepts strings, so we need to convert the Biopython object containing the sequence to a string
    cdsfile.write(str(CDS_seq))
    
# open a text file to write to ("w") and call it threefile
with open(three_UTR_filename, "w") as threefile: 
    # create a fasta header describing the sequence that will follow
    header = ">" + GI_id + "_3UTR" + '\n'
    # write the header to file
    threefile.write(header) 
    # note that the write function only accepts strings, so we need to convert the Biopython object containing the sequence to a string
    threefile.write(str(threeUTR_seq))    

NM_000321
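
The cell above writes each sequence as one long line, which is valid FASTA. The FASTA example shown earlier wraps the sequence at a fixed width instead, and some tools or human readers prefer that layout. The helper below is a minimal sketch of how the wrapped form could be produced; format_fasta is a hypothetical name, and the default width of 70 is an assumption chosen to match that example.

```python
def format_fasta(name, sequence, width=70):
    """Return a FASTA-formatted string with the sequence wrapped at `width`."""
    sequence = str(sequence)  # also accepts a Biopython Seq object
    lines = [">" + name]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"

toy_seq = "ACGT" * 40  # made-up 160-nt sequence for illustration
print(format_fasta("toy_5UTR", toy_seq), end="")
```

In the notebook, the returned string could be passed to a single write call, for example fivefile.write(format_fasta(GI_id + "_5UTR", fiveUTR_seq)).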


That's all folks!  
Please save your notebook and upload the notebook and your .fasta files to Blackboard. Be sure to save ALL of your files as you will be using this notebook for your final project and you'll use the .fasta files generated during this lab to complete next week's lab exercise.