import function

In [1]:
from sequence import *

DNA

In [2]:
seq1 = Sequence('AAAAAAAAGGGG')
print(seq1.alphabet)

('A', 'C', 'G', 'T')


RNA

In [3]:
seq2 = Sequence('AAAAAAAAGGUG')
print(seq2.alphabet)

('A', 'C', 'G', 'U')


Standard protein alphabet

In [4]:
seq3 = Sequence('AWAAAAAAGGVG')
print(seq3.alphabet)

('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')


However, you can of course override the 'guess'. In the examples above, seq1 was guessed to be a DNA sequence, although its symbols are equally consistent with an RNA sequence. Python lets a parameter take a default value that the caller can override (you will have seen how this is done for __init__). Let's override the Sequence class's guess.

In [5]:
seq4 = Sequence('AAAAAAAAGGGG', RNA_Alphabet)
print(seq4.alphabet)

('A', 'C', 'G', 'U')


In [6]:
seq5 = Sequence('AAAAAAAAGGGG', Alphabet('AGU'))
print(seq5.alphabet)
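
The default-and-override behaviour used in the cells above can be sketched without the course library. This is only an illustration: `MiniSequence` and `guess_alphabet` are hypothetical names, and the real Sequence class in sequence.py may guess differently. The sketch assumes the guess tries DNA, then RNA, then protein, which is at least consistent with the outputs shown above.

```python
# Illustrative stand-in only; not the implementation in sequence.py.
DNA = ('A', 'C', 'G', 'T')
RNA = ('A', 'C', 'G', 'U')
PROTEIN = ('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
           'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')


def guess_alphabet(symbols):
    """Return the first candidate alphabet that contains every symbol."""
    for candidate in (DNA, RNA, PROTEIN):  # assumed order of preference
        if set(symbols) <= set(candidate):
            return candidate
    raise RuntimeError('No known alphabet matches: ' + symbols)


class MiniSequence:
    """Hypothetical miniature of a sequence class with an optional alphabet."""

    def __init__(self, symbols, alphabet=None):
        # alphabet=None is the default; passing a value overrides the guess
        self.symbols = symbols
        self.alphabet = guess_alphabet(symbols) if alphabet is None else alphabet


print(MiniSequence('AAAAAAAAGGGG').alphabet)       # guessed
print(MiniSequence('AAAAAAAAGGGG', RNA).alphabet)  # overridden by the caller
```

Using `None` as the default and guessing inside `__init__` keeps the signature simple: callers who do not care get a sensible alphabet, and callers who do care pass their own.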

Of course, once you take responsibility for the alphabet, an error may result if the alphabet does not cover the symbols the sequence actually contains. Let's override the default guess with a new alphabet not found in sym.py. What happens if the new alphabet does not reflect the sequence?

In [7]:
seq6 = Sequence('AAAAAAAAGGGG', Alphabet('ATU'))

RuntimeError: Invalid symbol: G in sequence 
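
The error above comes from a symbol that is missing from the supplied alphabet. A check of this kind can be sketched with a set difference. The helper names `invalid_symbols` and `check_sequence` are hypothetical and are not part of sequence.py; the sketch only mimics the message shown above.

```python
def invalid_symbols(symbols, alphabet):
    """Return the sorted symbols of `symbols` that are missing from `alphabet`."""
    return sorted(set(symbols) - set(alphabet))


def check_sequence(symbols, alphabet):
    """Raise RuntimeError if any symbol falls outside the alphabet."""
    bad = invalid_symbols(symbols, alphabet)
    if bad:
        # Report the first offending symbol, in the style of the message above
        raise RuntimeError('Invalid symbol: ' + bad[0] + ' in sequence')


print(invalid_symbols('AAAAAAAAGGGG', 'ATU'))

try:
    check_sequence('AAAAAAAAGGGG', 'ATU')
except RuntimeError as err:
    print('Caught:', err)
```

Running such a check before constructing a sequence lets you report every unexpected symbol at once instead of stopping at the first.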

In [11]:
print(Bool_Alphabet)

('F', 'T')


In [8]:
seqs = readFastaFile('mystery1.fa')
len(seqs)

27

In [9]:
long_seqs = []
for seq in seqs:
    print(seq.name, len(seq), len(seq.alphabet))
    if len(seq) > 250:
        long_seqs.append(seq)

Q9H9L7 192 20
P41223 144 20
Q13352 177 20
Q9UFW8 167 20
P23528 166 20
P21291 193 20
P50461 194 20
P60981 165 20
P07992 297 20
P51858 240 20
Q0VD86 236 20
P61244 160 20
O60682 206 20
Q01658 176 20
Q9HAN9 279 20
P06748 294 20
P52945 283 20
Q6MZT1 257 20
P61571 104 20
Q8NHV9 184 20
Q96EU6 259 20
Q14493 270 20
Q9NS25 103 20
Q8IZU3 236 20
Q9P016 225 20
Q96B42 140 20
O60688 119 20


In [10]:
writeFastaFile('mystery1_long.fa', long_seqs)
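
To see what reading, filtering and writing FASTA data involves, here is a minimal stand-alone sketch. It is not the implementation of readFastaFile or writeFastaFile; `parse_fasta` and `format_fasta` are hypothetical helpers, the records are made-up toy data, and the length threshold is lowered to 5 so that the toy example has something to filter.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if name is not None:
                records.append((name, ''.join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ''), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, ''.join(chunks)))
    return records


def format_fasta(records, width=60):
    """Format (name, sequence) tuples as FASTA text, wrapping sequence lines."""
    lines = []
    for name, seq in records:
        lines.append('>' + name)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return '\n'.join(lines) + '\n'


# Made-up toy records, for illustration only
toy_text = """>toy1 made-up record
ACDEFGHIKL
MNPQ
>toy2
ACD
"""
records = parse_fasta(toy_text)
long_records = [(name, seq) for name, seq in records if len(seq) > 5]
print(format_fasta(long_records), end='')
```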

In [13]:
seqs = readFastaFile('mystery2.fa')
len(seqs)

35

### Exercises part 2 : Sequence databases

There are many biological sequence databases available online, some of which contain massive amounts of data (more than you would want to store on your own hard disk). NCBI (http://www.ncbi.nlm.nih.gov) is one authoritative source for nucleotide sequence data, and we referred to it in Practical 0. NCBI also stores protein data, but the richest source of protein data is Uniprot (uniprot.org). Uniprot naturally links with complementary sources of information, including the so-called Gene Ontology.

We will often find ourselves wanting to search for proteins (and their sequences) in Uniprot, and the exercise below will help us develop strategies for doing that in an automated manner with Python, without a web browser.

Let's first go to the Uniprot web site, and find 'RNS1_ARATH' Ribonuclease 1 in Arabidopsis thaliana. The information is broken down into sections. Study the information and focus on its sequence annotation. Uniprot says that it contains a so-called Signal peptide at positions 1-22.

Now, let's try retrieving the sequence of this single protein, given its identifier (e.g. RNS1_ARATH), from Uniprot using sequence.py:

In [14]:
rns1 = getSequence('RNS1_ARATH', 'uniprot')
print(rns1)

UNIPROT:RNS1_ARATH: MKILLASLCLISLLVILPSVFSASSSSEDFDFFYFVQQWPGSYCDTQKKCCYPNSGKPAADFGIHGLWPNYKDGTYPSNCDASKPFDSSTISDLLTSMKKSWPTLACPSGSGEAFWEHEWEKHGTCSESVIDQHEYFQTALNLKQKTNLLGALTKAGINPDGKSYSLESIRDSIKESIGFTPWVECNRDGSGNSQLYQVYLCVDRSGSGLIECPVFPHGKCGAEIEFPSF
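
Uniprot annotates the signal peptide at positions 1-22, and those positions are 1-based and inclusive, whereas Python slices are 0-based and exclude the end index. The sketch below shows the conversion on plain strings. The `region` helper is a hypothetical name, and the string is only the first 40 residues of the sequence printed above, not the full protein.

```python
# First 40 residues of the RNS1_ARATH sequence printed above (a prefix only)
rns1_prefix = 'MKILLASLCLISLLVILPSVFSASSSSEDFDFFYFVQQWP'


def region(seq, start, end):
    """Return residues start..end, where both positions are 1-based and inclusive."""
    return seq[start - 1:end]


signal_peptide = region(rns1_prefix, 1, 22)
remainder = rns1_prefix[22:]  # everything after the annotated signal peptide

print(signal_peptide)
print(len(signal_peptide))
print(remainder)
```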


In [16]:
for item in seqs:
    print(item.name)

NP_078820.2
NP_001171939.1
NP_001171938.1
NP_002929.1
NP_056981.2
NP_004662.2
NP_775956.1
NP_775298.1
NP_006258.3
NP_006090.2
NP_057250.1
NP_919237.1
NP_919236.1
NP_919235.1
NP_003336.1
NP_002874.1
NP_872619.1
NP_006348.1
NP_004497.1
NP_203123.1
NP_001225.1
NP_116036.1
NP_001135758.1
NP_078900.1
NP_005180.1
NP_005137.1
NP_055925.2
NP_001230023.1
NP_001129036.1
NP_003843.3
NP_872590.1
NP_002583.1
NP_057353.1
XP_747553.1
XP_750661.1


In [17]:
spat = searchSequences('"signal+peptide"+AND+organism:3702+AND+length:[700+TO+*]')

In [26]:
len(spat)

1852
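
The query string passed to searchSequences above is built from three terms joined by AND, with '+' standing in for each space. Here organism:3702 refers to the NCBI taxonomy identifier of Arabidopsis thaliana, and the length range asks for proteins of at least 700 residues. A small helper can assemble such strings and reduce typing errors; `build_query` is a hypothetical name and is not part of sequence.py.

```python
def build_query(*terms):
    """Join search terms with AND, using '+' wherever the query contains a space."""
    return '+AND+'.join(term.replace(' ', '+') for term in terms)


query = build_query('"signal peptide"', 'organism:3702', 'length:[700 TO *]')
print(query)
```

The result is the same string as the one typed by hand in the cell above, so it could be passed to searchSequences unchanged.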

Visit the Uniprot website online and use 'Advanced Search' techniques to find all proteins belonging to 'Lipid metabolism' that contain a signal peptide. 

In [2]:
# Repeat the search, this time for lipid metabolism
spat = searchSequences('"lipid+metabolism"+AND+organism:3702+AND+length:[700+TO+*]')

In [3]:
len(spat)

153

In [5]:
seqs = []
for identifier in spat:
    seqs.append(getSequence(identifier))

In [10]:
writeFastaFile('lipmet_at.fa', seqs)

Triacylglycerols (TAGs) are an important reserve of carbon and energy in eukaryotes. Triacylglycerol (TAG) lipases have been thoroughly characterized in mammals and microorganisms. By contrast, very little is known about plant TAG lipases.

We expect TAG lipases to have a signal peptide and to be involved in lipid metabolism. Hence, any triacylglycerol lipase should appear in both of the FASTA files retrieved in Exercise 4. Let's check whether there are any TAG lipases in A. thaliana and, if so, how many there are.

In [45]:
sigseqs = searchSequences('"signal+peptide"+AND+organism:3702+AND+length:[700+TO+*]')
lipseqs = searchSequences('"lipid+metabolism"+AND+organism:3702+AND+length:[700+TO+*]')

In [46]:
ids1 = set(sigseqs)
ids2 = set(lipseqs)

# Set intersection keeps only the identifiers found in both searches
common_ids = list(ids1.intersection(ids2))
print(common_ids)
print(len(common_ids))

['F4HQM3', 'A0A384KP75', 'Q304B9', 'A0A5S9YFA9', 'A0A178UFI4', 'A0A654GCH1', 'F4KHQ8']


### Introducing the Alignment class

In a Clustal file, each sequence occupies its own row so that aligned positions line up vertically. If the sequences are too long for one line, the remaining columns continue in a further block after a blank line.
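
The block layout described above can be reassembled by appending each row fragment to the text already collected for that sequence name. The sketch below is a simplified illustration, not the implementation of readClustalFile. `parse_clustal_blocks` is a hypothetical helper, the header line is illustrative, and the rows are the first 20 columns of two sequences from p450.aln, split into two blocks by hand.

```python
def parse_clustal_blocks(text):
    """Concatenate the rows of each block into one aligned string per sequence name."""
    rows = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and conservation lines (which start with a space)
        if not line.strip() or line.startswith('CLUSTAL') or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, fragment = parts[0], parts[1]
        rows[name] = rows.get(name, '') + fragment
    return rows


# Illustrative header; rows hand-split into two blocks of 10 columns
sample = """CLUSTAL W multiple sequence alignment

C2F1_HUMAN|P24903 GGTKTVSTTL
C2F2_MOUSE|P33267 GGTETVGTTL
                  *** ** ***

C2F1_HUMAN|P24903 HHAFLALMKY
C2F2_MOUSE|P33267 RHAFLILMKY
"""

for name, aligned in parse_clustal_blocks(sample).items():
    print(name, aligned)
```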

In [49]:
from sequence import *
aln = readClustalFile('p450.aln', Protein_Alphabet)
print(aln)

C2F1_HUMAN|P24903 GGTKTVSTTLHHAFLALMKYPKVQARVQEEIDAVIHEVQRFADIMPFSAGRRLCCLGELLARMELF
C2F2_MOUSE|P33267 GGTETVGTTLRHAFLILMKYPKVQARVQEEIDAVIHEVQRFADVMPFSAGRRLCCLGEPLARMELF
C2F3_CAPHI|O18809 GGTETVGTTLRHAFRLLMKYPEVQVRVQEEIDAVIHEVQRFADIMPFSAGRRLCCLGEALARMELF
C2F4_RAT|O35293   GGTETVGTTLRHAFLILMKYPKVQARVQEEIDAVIHEVQRFADVMPFSAGRRLCCLGEPLARMELF
C4D1_DROME|P33269 EGHDTTSSALMFFFYNIATHPEAQKKCFEEIRLCVKETLRMYPSIPFSAGPRNCCIGQKFAMLEIK
C4D1_DROSI|O16805 EGHDTTSSALMFFFYNIATHPEAQKKCFEEIRLCVKETLRMYPSIPFSAGPRNCCIGQKFAMLEIK
C4D2_DROME|Q27589 EGHDTTTSAISFCLYEISRHPEVQQRLQQEIRNVIKESLRLHPPIPFSAGPRNCCIGQKFAMLEMK
C4DA_DROMT|O18596 EGHDTTSSGITFFFYNIALYPECQRKCVEEIVLCIKETLRMYPSIPFSAGPRNCCIGQKFAMLEIK
C4E5_DROMT|O44221 EGHDTTTSGVAFAGYILSRFPEEQRKLYEEQQLFIKEAQRVYPSVPFSAGPRNCCIGQKFALLELK
C6B1_PAPPO|Q04552 AGYETSATTMTYLFYELAKNPDIQDKLIAEIDKVFDETLRKYPVLPFSAGPRNCCLGMRFAKWQSE
C6B2_HELAM|Q27664 AGYETSATTMAYLTYQLALNPDIQNKLIAEIDKVFDETLRMYSILPFGLGQRNCCIGMRFGRLQSL
C6B3_PAPPO|Q27756 AGYETSATTMTYLFYELAKNPDIQDKLIAEIDRVFDETLRKYPVLPF
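
One way to inspect an alignment like the one printed above is to count the residues in each column. The sketch below uses only the first 10 columns of the first four rows (C2F1_HUMAN, C2F2_MOUSE, C2F3_CAPHI, C2F4_RAT) as plain strings. `column_counts` is a hypothetical helper and does not use the Alignment class.

```python
from collections import Counter

# First 10 columns of the first four rows of the alignment printed above
rows = ['GGTKTVSTTL', 'GGTETVGTTL', 'GGTETVGTTL', 'GGTETVGTTL']


def column_counts(rows):
    """Return one Counter of residues per alignment column."""
    return [Counter(column) for column in zip(*rows)]


counts = column_counts(rows)

# Most frequent residue in each column
consensus = ''.join(c.most_common(1)[0][0] for c in counts)

# 1-based positions of the columns in which every row has the same residue
conserved = [i + 1 for i, c in enumerate(counts) if len(c) == 1]

print(consensus)
print(conserved)
```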

Glancing over the alignment, you will observe that each sequence occupies its own row and that columns tend to contain the same or similar amino acids, a result of evolutionary conservation. To make this data easier to inspect, it is common to colour amino acids according to their physico-chemical properties. There's a method for writing the alignment to an HTML file:

In [50]:
aln.writeHTML('p450.html')

'<html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">\n<title>Sequence Alignment</title>\n</head><body><pre>\n                           1         2         3         4         5         6     66\n                           0         0         0         0         0         0     \nC2F1_HUMAN|P24903 <font style="BACKGROUND-COLOR: green">G</font><font style="BACKGROUND-COLOR: green">G</font><font style="BACKGROUND-COLOR: #66bbff">T</font><font style="BACKGROUND-COLOR: red">K</font><font style="BACKGROUND-COLOR: #66bbff">T</font><font style="BACKGROUND-COLOR: green">V</font><font style="BACKGROUND-COLOR: #66bbff">S</font><font style="BACKGROUND-COLOR: #66bbff">T</font><font style="BACKGROUND-COLOR: #66bbff">T</font><font style="BACKGROUND-COLOR: green">L</font><font style="BACKGROUND-COLOR: red">H</font><font style="BACKGROUND-COLOR: red">H</font><font style="BACKGROUND-COLOR: green">A</font><font style="BACKGROUND-COLOR: green">F</font><font style="BACKGRO
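
The HTML above wraps each residue in a font tag whose background colour depends on the amino acid. The sketch below reproduces that idea for a single row. It is not the writeHTML implementation: `colour_row` is a hypothetical helper, and the colour table contains only the residue colours visible in the truncated output above, so it is incomplete.

```python
# Colours read off the output above; other residues are deliberately left out
OBSERVED_COLOURS = {
    'G': 'green', 'V': 'green', 'L': 'green', 'A': 'green', 'F': 'green',
    'T': '#66bbff', 'S': '#66bbff',
    'K': 'red', 'H': 'red',
}


def colour_row(residues, colours=OBSERVED_COLOURS):
    """Wrap each residue with a known colour in a font tag; leave others unchanged."""
    parts = []
    for residue in residues:
        colour = colours.get(residue)
        if colour is None:
            parts.append(residue)  # gaps and unlisted residues stay uncoloured
        else:
            parts.append('<font style="BACKGROUND-COLOR: %s">%s</font>'
                         % (colour, residue))
    return ''.join(parts)


print(colour_row('GGTK'))
```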

In [2]:
from sequence import *
aln = readClustalFile('gpcr.aln', Protein_Alphabet)
print(aln)
aln.writeHTML('gpcr.html')

AG2R_HUMAN  -------------------------------------MILNSSTEDG------------------------------------IKRIQDDCPKAGRHNYIFVMIPTLYSIIFVVGIFGNSLVVIVIYFYMKLK--TVASVFLLNLALADLCFLLTLPLWAVYTAMEYRWP-----FGNYLCKIASASVSFNLYASVFLLTCLSIDRYLAIVHPMKSRLRRTMLVAKVTCIIIWLLAGLASLPAIIHRNVFFIENTN----ITVCAFHYESQ----------NSTLPIGLGLTKNILGFLFPFLIILTSYTLIWKALKKAY------EI----------------------------------------------------------------QKNKPRNDDIFKIIMAIVLFFFFSWIPHQIFTFLDVLIQLGIIRD-CRIADIVDTAMPITICIAYFNNCLNPLFYGFLGKKFKRYFLQLLKYIPP--------------KAKSHSNLSTKMSTLSYRPSDNVSSSTKKPAPCFEVE----------------------------------
BKRB1_HUMAN ---------------------------------MASSWPPLELQSSN-------------------QSQLFPQNATAC------------DNAPEAWDLLHRVLPTFIISICFFGLLGNLFVLLVFLL-PRRQ-LNVAEIYLANLAASDLVFVLGLPFWAENIWNQ----FNWPFG-ALLCRVINGVIKANLFISIFLVVAISQDRYRVLVHPMASRRQQRRRQARVTCVLIWVVGGLLSIPTFLLR--SIQAVPD--LNITACILLLPHE-----------AAWHFARIVELNILGFLLPLAAIVFFNYHILASLRTREEVSRT-------------------------------------------------------------------RCGGRKDSKTT

'<html><head><meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">\n<title>Sequence Alignment</title>\n</head><body><pre>\n                     1         2         3         4         5         6         7         8         9         1         1         1         1         1         1         1         1         1         1         2         2         2         2         2         2         2         2         2         2         3         3         3         3         3         3         3         3         3         3         4         4         4         4         4         4         4         4         4         4         5         5         5         5         5         5         5         5 572\n                     0         0         0         0         0         0         0         0         0         0         1         2         3         4         5         6         7         8         9         0         1         2         3         4         5        