## Lab 4: Coding region and other mRNA features

##### Q0: What's your name?

The purpose of this exercise is to retrieve your gene's mRNA sequence from GenBank and perform some preliminary analysis. You'll parse your gene using the "Features" annotations of its GenBank record.

##### Q1: What gene have you chosen for your final project?

### Query GenBank

In [1]:
from Bio import Entrez # we'll import a set of code written for common biological applications
from Bio import SeqIO

In [2]:
GI_id = "NM_000088.3" # enter the accession # or GeneID for the mRNA sequence of your gene ***as a string***

Documentation for efetch: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

Code adapted from Biopython Tutorial, 5.3.1  Parsing GenBank records from the net

In [3]:
Entrez.email = "Andrew.JudellHalf001@umb.edu" # let NCBI know who you are
handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=GI_id) # db = database to query, rettype = retrieval type, retmode = retrieval mode 
seq_record = SeqIO.read(handle, "gb") # read the contents "fetched" from NCBI
handle.close() # close the connection now that the record has been read
print seq_record # print the contents of "seq_record"
print type(seq_record) # print the type of data in "seq_record"

ID: NM_000088.3
Name: NM_000088
Description: Homo sapiens collagen, type I, alpha 1 (COL1A1), mRNA.
Number of features: 89
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from AC015909.14 and Z74615.1.
This sequence is a reference standard in the RefSeqGene project.
On Jul 14, 2006 this sequence version replaced gi:14719826.
Summary: This gene encodes the pro-alpha1 chains of type I collagen
whose triple helix comprises two alpha1 chains and one alpha2
chain. Type I is a fibril-forming collagen found in most connective
tissues and is abundant in bone, cornea, dermis and tendon.
Mutations in this gene are associated with osteogenesis imperfecta
types I-IV, Ehlers-Danlos syndrome type VIIA, Ehlers-Danlos
syndrome Classical type, Caffey Disease and idiopathic
osteoporosis. Reciprocal translocations between chromosomes 17 and
22, where this gene and the gene for platelet-derived growth factor
beta are located, are associated with a p

#### Q2: How many "features" does your gene have?

In [None]:
# print the number of features

### Features
Open the following documentation from INSDC, a collaboration among the DDBJ, EMBL, and GenBank databases that develops standards for feature annotation, and read about what features are and how to interpret them: http://www.insdc.org/documents/feature_table.html

Make sure you understand the three descriptors shared by every feature: Type, Location, and Qualifiers.

Next, open a web browser and navigate to NCBI Nucleotide to perform an interactive search for your gene. When you click through to the record for your gene, a GenBank record is displayed. Scroll down until you see the "Features" section.

**Q3: What type of information is found here?**
 

Locate the feature called "CDS". 

**Q4: What does "CDS" mean?**

Next we'll loop over the features of your gene's GenBank record and print each feature.

In [4]:
for f in seq_record.features: # for each feature "f" in this record's features
    print f # display the feature
    print type(f) # display the type of object that holds the feature

type: source
location: [0:5927](+)
qualifiers:
    Key: chromosome, Value: ['17']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['17q21.33']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']

<class 'Bio.SeqFeature.SeqFeature'>
type: gene
location: [0:5927](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:1277', 'HGNC:HGNC:2197', 'MIM:120150']
    Key: gene, Value: ['COL1A1']
    Key: gene_synonym, Value: ['EDSC; OI1; OI2; OI3; OI4']
    Key: note, Value: ['collagen, type I, alpha 1']

<class 'Bio.SeqFeature.SeqFeature'>
type: exon
location: [0:229](+)
qualifiers:
    Key: gene, Value: ['COL1A1']
    Key: gene_synonym, Value: ['EDSC; OI1; OI2; OI3; OI4']
    Key: inference, Value: ['alignment:Splign:1.39.8']

<class 'Bio.SeqFeature.SeqFeature'>
type: misc_feature
location: [114:117](+)
qualifiers:
    Key: gene, Value: ['COL1A1']
    Key: gene_synonym, Value: ['EDSC; OI1; OI2; OI3; OI4']
    Key: note, Value: ['upstream in-frame stop codon']



**Q5: What "type" of features are returned? You know about the "CDS" feature - are there other features you're unfamiliar with? If so look up those feature types in the URL above (INSDC) and make note below of the type of information each contains.**

Notice how each of the features has some shared descriptors: type, location, and qualifiers. For a given feature you can access the information in these descriptors directly.   
For example, if I wanted to loop over all the features in a record and print out the type of each feature I would use a loop like this:

In [6]:
for f in seq_record.features: # for each feature in this record's features
    print f.type # print out the type of the feature

source
gene
exon
misc_feature
CDS
sig_peptide
proprotein
misc_feature
mat_peptide
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
misc_feature
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
STS
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
exon
STS
exon
STS
exon
exon
exon
exon
exon
exon
STS
STS
STS
exon
STS
STS
STS
exon
STS
exon
STS
regulatory
regulatory
polyA_site
STS
regulatory
polyA_site
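
With 89 features in this record, the printed list is long, and a tally by type is easier to scan. The sketch below shows one way to build such a tally. It assumes a plain list of type strings as a stand-in for `[f.type for f in seq_record.features]`; the short `feature_types` list is a made-up sample for illustration, not the full list from the record.

```python
from collections import Counter

# Hypothetical stand-in for [f.type for f in seq_record.features];
# a short made-up sample, not the full list from the record.
feature_types = ["source", "gene", "exon", "misc_feature", "CDS",
                 "exon", "exon", "STS", "exon", "polyA_site"]

type_counts = Counter(feature_types)  # maps each type to how often it occurs

# most_common() returns (type, count) pairs, largest count first
for feature_type, count in type_counts.most_common():
    print("%s: %d" % (feature_type, count))
```

In the notebook, swapping the sample list for `[f.type for f in seq_record.features]` should give the tally for your own gene. A count of 1 for `CDS` is what the rest of this exercise expects.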


If I wanted to print out the location of each feature I might use this loop:

In [9]:
for f in seq_record.features: # for each feature in this record's features
    print f.type, f.location # print out the type of the feature followed by its location

source [0:5927](+)
gene [0:5927](+)
exon [0:229](+)
misc_feature [114:117](+)
CDS [126:4521](+)
sig_peptide [126:192](+)
proprotein [192:4518](+)
misc_feature [606:612](+)
mat_peptide [609:3780](+)
misc_feature [609:660](+)
misc_feature [609:612](+)
misc_feature [633:636](+)
misc_feature [660:3702](+)
misc_feature [918:921](+)
misc_feature [2358:2367](+)
misc_feature [2982:2988](+)
misc_feature [3402:3411](+)
misc_feature [3447:3450](+)
misc_feature [3615:3618](+)
misc_feature [3702:3780](+)
misc_feature [3747:3750](+)
misc_feature [3777:3783](+)
exon [229:424](+)
exon [424:459](+)
exon [459:495](+)
exon [495:597](+)
exon [597:669](+)
exon [669:714](+)
exon [714:768](+)
exon [768:822](+)
exon [822:876](+)
exon [876:930](+)
exon [930:984](+)
exon [984:1029](+)
exon [1029:1083](+)
exon [1083:1128](+)
exon [1128:1182](+)
exon [1182:1281](+)
exon [1281:1326](+)
exon [1326:1425](+)
exon [1425:1479](+)
exon [1479:1587](+)
exon [1587:1641](+)
exon [1641:1740](+)
exon [1740:1794](+)
STS [1742:

Using the two loops above as a guide, write a for loop that prints the location only if the feature type is CDS. Since we're using an mRNA sequence, there should be only one CDS region in this GenBank record. **If there is more than one CDS feature listed in the GenBank record, see your TA before continuing with the exercise.**

In [10]:
for f in seq_record.features: # for each feature in this record's features
    if f.type == "CDS": # ask if the feature is a CDS
        print f.type, f.location # if so print out the location of the feature

CDS [126:4521](+)


**Q6: Compare the location of the CDS returned by Biopython to the location of the CDS according to the NCBI Nucleotide record for your gene. Are they the same? Explain any differences you see.**  

Hint: Read the Biopython cookbook (http://biopython.org/DIST/docs/tutorial/Tutorial.html) section 4.3.3,  Sequence described by a feature or location.

#### **Q7: Does the CDS region start at the first nucleotide of your mRNA sequence? Is the last nucleotide of the CDS region the last nucleotide of your mRNA sequence?   What is the region of the mRNA before the CDS called? What is the region of the mRNA following the CDS called?**

The location descriptor has some additional characteristics which are very useful. Suppose we want to retrieve the nucleotide sequence of the CDS and assign it to a variable called CDS_seq. Notice that the CDS feature does not actually contain the nucleotide sequence of the CDS. Since the location of the CDS is defined relative to the sequence of the original record, we can use the information in "location" to return the slice of the original sequence corresponding to the CDS. Because we queried an mRNA record, the CDS is a single contiguous region, so we don't have to piece exons together; this has already been done for us.

But how do we retrieve the beginning and ending positions from \[###:###](+)? We'll see in later labs that we can use regular expressions to pull pieces of information out of a string without knowing beforehand where the information sits within the string. But we don't have to do that here.

For example, if f.location is \[297:369](+)  
f.location.start will return 297.  
f.location.end will return 369.
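
These positions follow Python's slicing convention: `start` is the index of the first position included and `end` is the index just past the last one, so the two values can be passed straight to a slice. A minimal sketch, using plain integers and a made-up sequence as stand-ins for `int(f.location.start)`, `int(f.location.end)` and `seq_record.seq`:

```python
# Stand-ins (assumptions for illustration): in the notebook these would be
# int(f.location.start), int(f.location.end) and seq_record.seq.
start = 297
end = 369
toy_seq = "ACGT" * 100  # made-up 400-nt sequence

piece = toy_seq[start:end]  # includes index 297, stops before index 369

print(len(piece))                  # length of the slice
print(len(piece) == end - start)   # the length equals end - start
print(piece[0] == toy_seq[start])  # first base of the slice sits at index start
```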

Modify the for loop you wrote above (the one which prints the location if the feature type is CDS) - to assign the starting position of the CDS to the variable cds_start and to assign the ending position of the CDS to the variable cds_end.

In [14]:
for f in seq_record.features: # for each feature in this record's features
    if f.type == "CDS": # ask if the feature is a CDS
        cds_start = int(f.location.start) # if so, assign the start location to the variable cds_start
        cds_end = int(f.location.end) # if so, assign the end location to the variable cds_end
        print cds_start # print the value of cds_start
        print cds_end # print the value of cds_end


126
4521


Use the variables cds_start and cds_end to slice the mRNA sequence of your gene into three pieces. The nucleotide sequence is found in seq_record.seq.

3 pieces:
a. The 5' untranslated region (5' UTR). Call this variable "fiveUTR_seq".
This is the mRNA sequence from index 0 up to, but not including, the first nucleotide of the CDS region.
b. The coding sequence. Call this variable "CDS_seq".
This is the entire CDS region, which includes the stop codon (TAA, TAG, or TGA).
c. The 3' untranslated region (3' UTR). Call this variable "threeUTR_seq".
This is the mRNA sequence from the end of the CDS region to the end of the mRNA sequence.
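
The three slices share their boundaries: the 5' UTR stops where the CDS starts, and the 3' UTR starts where the CDS stops. A minimal sketch on a made-up 30-nt sequence, where `toy_mRNA`, `toy_start` and `toy_end` are hypothetical stand-ins for `seq_record.seq`, `cds_start` and `cds_end`:

```python
# Made-up example data (not a real gene): 6-nt 5' UTR, 15-nt CDS, 9-nt 3' UTR
toy_mRNA = "GGCACC" + "ATGGCTTTCAAATAA" + "GCTGCTAAA"
toy_start = 6    # index of the A in the start codon ATG
toy_end = 21     # index just past the last base of the stop codon

five = toy_mRNA[:toy_start]          # everything before the CDS
cds = toy_mRNA[toy_start:toy_end]    # start codon through stop codon
three = toy_mRNA[toy_end:]           # everything after the CDS

print(five)
print(cds)
print(three)

# Sanity checks on the slices
assert cds.startswith("ATG")              # CDS begins with the start codon
assert cds[-3:] in ("TAA", "TAG", "TGA")  # CDS ends with a stop codon
assert len(cds) % 3 == 0                  # CDS is a whole number of codons
assert five + cds + three == toy_mRNA     # the pieces tile the mRNA exactly
```

The same checks could be run on your own slices as a quick way to catch an off-by-one error at either boundary.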



In [47]:
mRNA_seq = seq_record.seq # assign the nucleotide sequence of your mRNA (seq_record.seq) to a variable mRNA_seq
print "sequence length", len(mRNA_seq) # print the length of the mRNA sequence

# slice mRNA_seq to return the portion of the mRNA sequence before the beginning of the coding sequence
fiveUTR_seq = mRNA_seq[:cds_start]
print "5' untranslated region", len(fiveUTR_seq) # print the length of the 5' UTR sequence
print fiveUTR_seq

# slice mRNA_seq to return the portion of the mRNA sequence that is the coding sequence
CDS_seq = mRNA_seq[cds_start:cds_end]
print "coding sequence length", len(CDS_seq) # print the length of the coding sequence
print CDS_seq

# slice mRNA_seq to return the portion of the mRNA sequence after the end of the coding sequence
threeUTR_seq = mRNA_seq[cds_end:]
print "3' UTR sequence", len(threeUTR_seq) # print the length of the 3' UTR sequence
print threeUTR_seq

# check that the lengths of the 5' UTR, the coding region, and the 3' UTR sum to the total length of the mRNA
assert len(fiveUTR_seq) + len(CDS_seq) + len(threeUTR_seq) == len(mRNA_seq), "Oooops counting mismatch, try again!"


sequence length 5927
5' untranslated region 126
TCGTCGGAGCAGACGGGAGTTTCTCCTCGGGGTCGGAGCAGGAGGCACGCGGAGTGTGAGGCCACGCATGAGCGGACGCTAACCCCCTCCCCAGCCACAAAGAGTCTACATGTCTAGGGTCTAGAC
coding sequence length 4395
ATGTTCAGCTTTGTGGACCTCCGGCTCCTGCTCCTCTTAGCGGCCACCGCCCTCCTGACGCACGGCCAAGAGGAAGGCCAAGTCGAGGGCCAAGACGAAGACATCCCACCAATCACCTGCGTACAGAACGGCCTCAGGTACCATGACCGAGACGTGTGGAAACCCGAGCCCTGCCGGATCTGCGTCTGCGACAACGGCAAGGTGTTGTGCGATGACGTGATCTGTGACGAGACCAAGAACTGCCCCGGCGCCGAAGTCCCCGAGGGCGAGTGCTGTCCCGTCTGCCCCGACGGCTCAGAGTCACCCACCGACCAAGAAACCACCGGCGTCGAGGGACCCAAGGGAGACACTGGCCCCCGAGGCCCAAGGGGACCCGCAGGCCCCCCTGGCCGAGATGGCATCCCTGGACAGCCTGGACTTCCCGGACCCCCCGGACCCCCCGGACCTCCCGGACCCCCTGGCCTCGGAGGAAACTTTGCTCCCCAGCTGTCTTATGGCTATGATGAGAAATCAACCGGAGGAATTTCCGTGCCTGGCCCCATGGGTCCCTCTGGTCCTCGTGGTCTCCCTGGCCCCCCTGGTGCACCTGGTCCCCAAGGCTTCCAAGGTCCCCCTGGTGAGCCTGGCGAGCCTGGAGCTTCAGGTCCCATGGGTCCCCGAGGTCCCCCAGGTCCCCCTGGAAAGAATGGAGATGATGGGGAAGCTGGAAAACCTGGTCGTCCTGGTGAGCGTGGGCCTCCTGGGCCTCAGGGTGCTCGAGGATTGCCCGGAACAGCTGGCCTCCCTGGAATGAAGGG

Now that you've identified the sequences corresponding to different parts of your gene, let's save these sequences as FASTA files so that you can access them later.

FASTA file format:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...
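
As the example shows, a FASTA record is a single header line starting with `>`, followed by the sequence, which is conventionally wrapped to a fixed width (70 characters per line in the example above). A minimal sketch of a helper that produces this layout follows; `format_fasta` and its `width` parameter are hypothetical names chosen for illustration, not part of Biopython.

```python
def format_fasta(name, sequence, width=70):
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    sequence = str(sequence)  # also accepts objects that convert to a string
    lines = [">" + name]
    # step through the sequence in chunks of `width` characters
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"  # end the record with a newline


record_text = format_fasta("toy_5UTR", "ACGT" * 40)  # made-up 160-nt sequence
print(record_text)
```

The returned text could be passed to a file's `write` method in one call. Biopython's `SeqIO.write` can also write FASTA files directly from record objects, which may be more convenient once you are handling many sequences.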

In [16]:
five_UTR_filename = GI_id + "_5UTR_seq.fasta" # make a name for the file where you'll save your gene's 5'UTR sequence
CDS_filename = GI_id + "_CDS_seq.fasta" # make a name for the file where you'll save your gene's CDS
three_UTR_filename = GI_id + "_3UTR_seq.fasta" # make a name for the file where you'll save your gene's 3'UTR sequence

with open(five_UTR_filename, "w") as fivefile: # open a text file to write to ("w") and call it fivefile
    header = ">" + GI_id + "_5UTR" + '\n' # create a fasta header describing the sequence that will follow
    fivefile.write(header) # write the header to file
    fivefile.write(str(fiveUTR_seq)) # note that the write function only accepts strings, so we need to convert the Biopython object containing the sequence to a string
    print fivefile
with open(CDS_filename, "w") as cdsfile: # open a text file to write to ("w") and call it cdsfile
    header = ">" + GI_id + "_CDS" + '\n' # create a fasta header describing the sequence that will follow
    cdsfile.write(header) # write the header to file
    cdsfile.write(str(CDS_seq)) # convert the sequence to a string before writing it
    print cdsfile
with open(three_UTR_filename, "w") as threefile: # open a text file to write to ("w") and call it threefile
    header = ">" + GI_id + "_3UTR" + '\n' # create a fasta header describing the sequence that will follow
    threefile.write(header) # write the header to file
    threefile.write(str(threeUTR_seq)) # convert the sequence to a string before writing it
    print threefile


<open file 'NM_000088.3_5UTR_seq.fasta', mode 'w' at 0x0000000002B58270>
<open file 'NM_000088.3_CDS_seq.fasta', mode 'w' at 0x0000000002B58390>
<open file 'NM_000088.3_3UTR_seq.fasta', mode 'w' at 0x0000000002B58300>
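
Before uploading, it may be worth confirming that each file holds what you expect. A minimal sketch that reads a single-record FASTA file back with plain Python follows; `read_fasta` is a hypothetical helper, and the temporary file stands in for one of the `.fasta` files written above.

```python
import os
import tempfile


def read_fasta(path):
    """Return (header, sequence) for a FASTA file holding a single record."""
    with open(path) as fasta_file:
        lines = fasta_file.read().splitlines()
    header = lines[0][1:]          # drop the leading '>'
    sequence = "".join(lines[1:])  # rejoin the sequence if it spans several lines
    return header, sequence


# Write a made-up record to a temporary file, then read it back
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_CDS_seq.fasta")
with open(toy_path, "w") as out:
    out.write(">toy_CDS\n")
    out.write("ATGGCTTTCAAATAA")

header, sequence = read_fasta(toy_path)
print(header)
print(len(sequence))
```

Pointing `read_fasta` at your own files and comparing `len(sequence)` with the lengths printed earlier in the notebook would be one simple consistency check.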


That's all folks!  
Please save your notebook and upload the notebook and your .fasta files to Blackboard. Be sure to save ALL of your files as you will be using this notebook for your final project and you'll use the .fasta files generated during this lab to complete next week's lab exercise.