**<center style="color:#5F018B;font-size:24pt;weight:bold">SCIE2100 Practical 1 (Week 3):</center>**

**<center style="color:#5F018B;font-size:20pt;weight:bold">Sequences, alphabets, databases, and homology</center>**

------------------------------------------------------------------------------------------------------------------------------
**<SPAN STYLE="color:#5F018B;font-size:18pt">Aim</SPAN>**

**After reading and completing this practical, you should:**
* Be familiar with some of the Python modules that will be used for sequence analysis from now on, in this course
* Know a few more things about classes in Python 
* Know how to access data both in the form of files and web resources and be aware of some of the issues involved
* Have some experience in writing ‘scripts’

------------------------------------------------------------------------------------------------------------------------------

**<SPAN STYLE="color:#5F018B;font-size:18pt">Outline</SPAN>**

[**<span style="color:black">Part 1: More work with classes</span>**](#part1)
* Reviewing the FASTA file format
* An extended `Sequence` class and its friends
* Exercises 1-2

[**<span style="color:black">Part 2: Sequence databases</span>**](#part2)
* Online databases
* Retrieving database entries with Python
* Exercises 3-5

[**<span style="color:black">Part 3: Sequence alignments</span>**](#part3)
* Multiple sequence alignments
* Exercise 6

------------------------------------------------------------------------------------------------------------------------------

**<SPAN STYLE="color:#5F018B;font-size:18pt;weight:bold">Introduction</SPAN>**

### Answer submission

All your answers should be submitted through [Coder Quiz](http://bioinf.scmb.uq.edu.au:81/coderquiz/) before **<span style="color:blue"> Friday 19th of March, 2021 at 12PM. </span>**

To receive full marks for this practical, you must answer ALL questions. Coder Quiz can check the correctness of your answers (where applicable) before you submit, so you can use it as you progress through the practical to determine whether each answer is correct. Coder Quiz will also warn you if your submission contains an incorrect answer. You can submit incorrect answers and still receive full marks if the answer shows a reasonable attempt at answering the question.

If you leave the answer to a question blank, your submission will be marked incomplete and receive 0 marks. If you cannot answer a question, it is sufficient to provide an explanation of why the question confused you, the approach you used to attempt to solve the question, or how you think the question should have been answered. Demonstrating that you attempted the question will gain full marks, whereas a blank field will receive no marks. Coder Quiz will warn you prior to submission if you have left any fields blank. If you have not completed a question, simply enter an explanation into its field.

Coder Quiz may ask you to provide code for an answer. In this case, save the relevant section of code as a Python file and upload this file to Coder Quiz. The code is for visual inspection of your attempt and does not need to run (i.e. you do not need to include import statements), but it should be commented and understandable by a tutor.

You may submit as many attempts to Coder Quiz as you like; the most recent submission before the due date will be the attempt that is assessed. On the 'View Submissions' page you may view your latest submission or all of your attempts to ensure that you have answered every question.

Coder Quiz does not save or retrieve partial attempts, so we recommend storing your work and answers in a separate file and using Coder Quiz to validate and submit once you are finished.

Coder Quiz submissions will be validated by a tutor prior to release of marks.

------------------------------------------------------------------------------------------------------------------------------
<a id="part1"></a>
**<SPAN STYLE="color:#5F018B;font-size:18pt;weight:bold">Exercises part 1 : More work with classes</SPAN>**

### Reviewing the FASTA file format

In *Practical 0* we learned that the most common format for permanently storing sequence data, and the one we will predominantly use, is the FASTA format. A FASTA file can store one or more nucleotide or protein sequences, and looks like:

```
>SequenceA
AGCTCCGCGATATACCATAAAACGTA
>SequenceB, with some more information
ACGCTAGCTAGCTGCGCGCTATATATGCGCATAGATCGTCGCG
AGTCGCTCGTAGCTAGTAGTCG
```

Each sequence in the file begins with a line starting with a `>` followed by the sequence’s name and possibly other unstructured information. The sequence itself starts on the following line, and ends when a line starts with a `>`, or the file ends. 
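To make these rules concrete, here is a minimal sketch of a FASTA reader written in plain Python. It is not the `readFastaFile` function from `sequence.py`; the name `parse_fasta` and its return format (a list of header/sequence pairs) are assumptions made for illustration only.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        records.append((header, ''.join(chunks)))  # end of file closes the last record
    return records


example = """>SequenceA
AGCTCCGCGATATACCATAAAACGTA
>SequenceB, with some more information
ACGCTAGCTAGCTGCGCGCTATATATGCGCATAGATCGTCGCG
AGTCGCTCGTAGCTAGTAGTCG
"""

for header, seq in parse_fasta(example):
    print(header, len(seq))
```

Note how the two sequence lines of `SequenceB` are joined into a single string, because the record only ends at the next `>` or at the end of the file.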

### An extended `Sequence` class and its friends

The module `sequence.py` defines a few important classes and imports a few from other modules. Most of the classes are defined in more limited forms in `guide.py` as well, including the `Sequence` class that we previously worked with in Practical 0. In the `sequence.py` module we now have a new `Sequence` class with more functionality than the one we used previously. For example, the new `Sequence` class does not require you to specify the alphabet (e.g. `RNA_Alphabet`) when creating an object; if you leave it out, the class guesses the alphabet from the symbols in the sequence.

Fire up a Python interpreter (e.g. IDLE or Spyder) and import `sequence.py` to test it out for yourself with the examples below:

``` python
>>> from sequence import *
>>> seq1 = Sequence('AAAAAAAAGGGG')
>>> print(seq1.alphabet)
('A', 'C', 'G', 'T')
>>> seq2 = Sequence('AAAAAAAAGGUG')
>>> print(seq2.alphabet)
('A', 'C', 'G', 'U')
>>> seq3 = Sequence('AWAAAAAAGGVG')
>>> print(seq3.alphabet)
('A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y')
```

However, you can of course override the 'guess'. In the above examples, `seq1` was guessed to be a DNA sequence although based on its sequence it could also be an RNA sequence. Python allows parameters to be set 'by default' and be overridden (you would have seen how this is done for `__init__`). Let's override the `Sequence` class's guess.

```python
>>> seq4 = Sequence('AAAAAAAAGGGG', RNA_Alphabet)
>>> print(seq4.alphabet)
('A', 'C', 'G', 'U')
```
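The idea of a default that can be overridden can be sketched with a toy class. This is not the code in `sequence.py`: the class `ToySequence`, the plain-string alphabets and the order in which candidates are tried are all assumptions chosen to mirror the behaviour seen above.

```python
DNA = 'ACGT'
RNA = 'ACGU'
PROTEIN = 'ACDEFGHIKLMNPQRSTVWY'


class ToySequence:
    def __init__(self, symbols, alphabet=None):
        # alphabet=None means "the caller did not choose, so guess"
        if alphabet is None:
            for candidate in (DNA, RNA, PROTEIN):
                if all(sym in candidate for sym in symbols):
                    alphabet = candidate  # first alphabet that covers every symbol
                    break
        self.symbols = symbols
        self.alphabet = alphabet  # stays None if no candidate fits


guessed = ToySequence('AAAAAAAAGGGG')          # no alphabet given: DNA is tried first
overridden = ToySequence('AAAAAAAAGGGG', RNA)  # explicit argument replaces the guess
print(guessed.alphabet, overridden.alphabet)
```

Because DNA is tried before RNA in this sketch, a sequence made only of `A` and `G` is reported as DNA unless the caller says otherwise.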

Of course, once you start taking responsibility for what the alphabet is, it may lead to an error if the alphabet does not cover the symbols the sequence actually contains. Let's override the default guess by introducing a new alphabet not found in `sym.py`. What happens if the new alphabet does not reflect the sequence?

```python
>>> seq5 = Sequence('AAAAAAAAGGGG', Alphabet('AGU'))
>>> print(seq5.alphabet)
('A', 'G', 'U')
>>> seq6 = Sequence('AAAAAAAAGGGG', Alphabet('ATU'))
Traceback (most recent call last):
...
RuntimeError: Invalid symbol: G in sequence
```
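A check of this kind can be sketched in a few lines. The function name `check_symbols` is hypothetical and the real check inside `Sequence` may be organised differently; only the error message is copied from the traceback above.

```python
def check_symbols(symbols, alphabet):
    """Raise RuntimeError on the first symbol that is not in the alphabet."""
    for sym in symbols:
        if sym not in alphabet:
            raise RuntimeError('Invalid symbol: ' + sym + ' in sequence')


message = None
check_symbols('AAAAAAAAGGGG', 'AGU')      # every symbol is covered, so nothing happens
try:
    check_symbols('AAAAAAAAGGGG', 'ATU')  # G is not in this alphabet
except RuntimeError as err:
    message = str(err)
    print(message)
```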

------------------------------------------------------------------------------------------------------------------------------

**Exercise 1.** 

When you import the module `sequence.py`, it automatically imports the `sym.py` module as well. In `sym.py`, the `Alphabet` class is defined and several standard alphabets are instantiated from it.

**<span style="color:blue">Submit: List the standard alphabets defined in sym.py (e.g. Bool_Alphabet). Please ensure your spelling is correct and your answers are separated by a comma (,).</span>**

------------------------------------------------------------------------------------------------------------------------------

**Exercise 2.** 

A number of functions are defined in `sequence.py` that operate on lists of sequences, including `readFastaFile`, which we used in Practical 0. You should have a FASTA file `mystery1.fa` in your directory. Study the example below, which uses this file.

We know what `len(seqs)` does, because `seqs` is a `list`. But what does `len(seq)` do? `seq` is not a `list`, or a `set`… it is a `Sequence`. In fact, when we write classes we can decide what the built-in `len` function should return for our objects by defining the special `__len__` method. Look it up.
 
```python
>>> seqs = readFastaFile('mystery1.fa')
>>> len(seqs)
27
>>> long_seqs = []
>>> for seq in seqs:
...    print(seq.name, len(seq), len(seq.alphabet))
...    if len(seq) > 250:
...        long_seqs.append(seq)

Q9H9L7 192 20
P41223 144 20
Q13352 177 20
...
>>> writeFastaFile('mystery1_long.fa', long_seqs)
```
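As a minimal illustration of how `len` finds its answer, the toy class below defines `__len__` itself. `ToyProtein` is a made-up class, not the `Sequence` class from `sequence.py`, and it deliberately shows only this one special method.

```python
class ToyProtein:
    def __init__(self, name, residues):
        self.name = name
        self.residues = residues

    def __len__(self):
        # Called by the built-in len(); here the length is the number of residues
        return len(self.residues)


prot = ToyProtein('example', 'MKTAYIAKQR')
print(len(prot))  # Python translates this into a call to prot.__len__()
```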

**<span style="color:blue">Submit: List the 'special' methods from the `Sequence` class in the first field. Just enter the method name for each; for example, for `__len__(self)` you would enter `__len__`. In the second field, describe an example of each method's use in text, not code.</span>**

------------------------------------------------------------------------------------------------------------------------------
<a id="part2"></a>
**<SPAN STYLE="color:#5F018B;font-size:18pt;weight:bold">Exercises part 2 : Sequence databases</SPAN>**

### Online databases

There are many biological sequence databases available online, some of which contain massive amounts of data (and you would not want to store all of that data on your own hard disk).

NCBI ([http://www.ncbi.nlm.nih.gov](http://www.ncbi.nlm.nih.gov)) is one authoritative source for nucleotide sequence data and we referred to this in *Practical 0*. NCBI also stores data for proteins but the richest source of protein data is available from Uniprot ([uniprot.org](https://www.uniprot.org)). Uniprot naturally links with complementary sources of information, including the so-called Gene Ontology.

### Retrieving database entries with Python

We will often find ourselves wanting to search for proteins (and their sequences) in Uniprot, and the exercise below will help us develop strategies for doing that in an automated manner with Python, without a web browser.

Let's first go to the Uniprot web site, and find 'RNS1_ARATH', Ribonuclease 1 in *Arabidopsis thaliana*. The information is broken down into sections. Study the information and focus on its sequence annotation. Uniprot says that it contains a so-called signal peptide at positions 1-22.

Now, let's try retrieving the sequence of this single gene, given its identifier (e.g. RNS1_ARATH), from Uniprot using `sequence.py`:

```python
>>> rns1 = getSequence('RNS1_ARATH', 'uniprot')
>>> print(rns1)
```

To better understand what is contained in the rns1 variable, can you use code to calculate how many Serines (S) it has?

*Hint: Read `sequence.py`, find `getSequence`, understand what it returns and what methods you can apply to the `rns1` object.*
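Counting a symbol is straightforward once you have the residues as a plain string. The sketch below uses a made-up string rather than RNS1_ARATH; how you get from the `rns1` object to its residues depends on the `Sequence` class, so that step is left for you to find in `sequence.py`.

```python
from collections import Counter

residues = 'MSSKASLLSA'           # made-up residues, not the real RNS1_ARATH sequence

print(residues.count('S'))        # count a single symbol
composition = Counter(residues)   # counts for every symbol at once
print(composition['S'], composition['L'])
```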

------------------------------------------------------------------------------------------------------------------------------

**Exercise 3 : Exploring database identifiers** 

Create a new Python file, say `mystery.py`. Import `sequence.py`. Using the sequence module, write code that loads the FASTA file called `mystery2.fa` which contains protein sequences from online databases. Online databases have different identifiers for their respective sequences. For example, in Ensembl, a human gene may have an identifier like ENSG00000162599. Iterate through the sequences in `mystery2.fa`, and print out their names.
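One way to summarise identifier types is to count how often each two-letter prefix occurs. The names below are hypothetical placeholders, not the ones in `mystery2.fa`; in your own script the list would come from the names of the sequences you loaded.

```python
from collections import Counter

# Hypothetical names, not the ones in mystery2.fa
names = ['AB_0001', 'AB_0002', 'XY12345', 'AB_0003', 'XY67890']

# Slice the first two characters of each name and tally them
prefix_counts = Counter(name[:2] for name in names)
for prefix, count in prefix_counts.items():
    print(prefix, count)
```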

**<span style="color:blue">Submit: How many different types of identifiers are associated with the sequences in mystery2.fa? (Submit the first two letters common to the identifiers). Which databases do these identifiers map to?</span>**

------------------------------------------------------------------------------------------------------------------------------

**Exercise 4 : Retrieving many sequences online and with Python** 

Now let's compare two different approaches to downloading a large number of sequences. You will first download the sequences online using a webtool, and then you will use Python to programmatically download the sequences.

Let's first retrieve the sequences online using a webtool. Visit the [Uniprot](http://www.uniprot.org/) website online and use 'Advanced Search' techniques to find all proteins belonging to *Arabidopsis thaliana* that contain a signal peptide. For the purpose of completing this practical without taking too long, restrict the length of the protein to 700 amino acids or more. Download your matches from Uniprot as a FASTA file called `sigpep_at.fa`. 

*Hint: Advanced searches were introduced in Practical 0.*

Now, let's write Python code to programmatically perform another search for *Arabidopsis thaliana* proteins in the Uniprot database, but this time searching for proteins involved in 'Lipid metabolism' (keyword search). If you're unsure of the search syntax, an example of the signal peptide advanced search you did using the webtool is below.

``` python
>>> spat = searchSequences('"signal+peptide"+AND+organism:3702+AND+length:[700+TO+*]')
```

Cryptic? This string is the result of the advanced search on Uniprot. It forms part of the URL used by Uniprot search and should also appear in the search bar of the Uniprot webpage. So, play with Uniprot search and note the search string in your web browser. Except for replacing all spaces with ‘+’, you can ‘cut-and-paste’ that string into your Python code.
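The space-to-plus step can be done in Python rather than by hand. The readable query below is simply the example string with its `+` signs turned back into spaces; the variable names are illustrative.

```python
# Query as you might read it in the Uniprot search bar
readable = '"signal peptide" AND organism:3702 AND length:[700 TO *]'

# Replace every space with '+' to get the form used in the example above
query = readable.replace(' ', '+')
print(query)
```

A plain `str.replace` is used here on purpose: a general URL-encoding helper such as `urllib.parse.quote_plus` would also escape the quotes, colons and brackets, which would no longer match the example string.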

The function `searchSequences()` returns a list of identifiers which meet our search criteria and **not** the actual sequence information. We can loop over the result of `searchSequences` and use another function, `getSequence()`, to create a new list of Sequence objects (which will contain the actual sequence) from our original list of identifiers. Save the results from your search in a FASTA file called `lipmet_at.fa` using the `writeFastaFile` function demonstrated in Exercises part 1.

As getting sequence data using the `getSequence()` function in `sequence.py` can take a while when you have many sequences, you might want to make sure the code works before downloading the entire set of sequences. For example, you could limit your initial search to larger proteins (> 700 amino acids) or first try it out on a subset of ten sequences from your list.
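The "try a small subset first" idea can be sketched as a loop with an optional limit. Everything here is a stand-in: `fetch_all` is a hypothetical helper and the dictionary lookup replaces the slow download. In the practical, the `fetch` argument could be something like `lambda i: getSequence(i, 'uniprot')`, based on the call shown earlier.

```python
def fetch_all(ids, fetch, limit=None):
    """Call fetch(identifier) for each ID, optionally only for the first `limit` IDs."""
    chosen = ids if limit is None else ids[:limit]
    results = []
    for identifier in chosen:
        results.append(fetch(identifier))
    return results


# Stand-in for a slow download: looks IDs up in a small local dictionary
fake_database = {'ID%02d' % i: 'M' * i for i in range(1, 31)}
ids = sorted(fake_database)

trial = fetch_all(ids, fake_database.get, limit=10)  # quick check on ten entries
print(len(trial), 'of', len(ids))
```

Once the trial run behaves as expected, calling the helper without `limit` processes the whole list.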


**<span style="color:blue">Submit: How many entries are in each of your fasta files (sigpep_at.fa and lipmet_at.fa)? Submit the code you used to identify these sequences with comments describing the process.</span>**

*Hint: You do not actually have to download the sequences to know how many would be in the FASTA files (e.g. the number of IDs in `spat` will equal the number of entries in `sigpep_at.fa`).*

------------------------------------------------------------------------------------------------------------------------------

**Exercise 5.** 

Triacylglycerols (TAGs) are an important reserve of carbon and energy in Eukaryotes. Triacylglycerol (TAG) lipases have been thoroughly characterized in mammals and microorganisms. By contrast, very little is known about plant TAG lipases. 

We expect TAG lipases to have a **signal peptide** and to be involved in **lipid metabolism**. Hence, we would expect any triacylglycerol lipase to be in both of the FASTA files retrieved in *Exercise 4*. Let's check if there are any TAG lipases in *A. thaliana* and, if so, how many there are.

One approach for answering this would be to store their IDs in **sets** by extracting them from the FASTA files. Then, you could find the IDs that are found in both sets using an intersection method for sets of strings. Sets are a Python data structure you should familiarise yourself with! Set theory is useful in programming as it can allow easy comparison of data. See below for an example:

```python
>>> seqs1 = readFastaFile('sigpep_at.fa')
>>> seqs2 = readFastaFile('lipmet_at.fa')
>>> ids1 = []
>>> ids2 = []
>>> # ADD 1-2 LINES OF CODE HERE: to add the ids from sigpep to ids1
>>> # ADD ANOTHER 1-2 LINES OF CODE HERE: to add the ids from lipmet to ids2
>>> ids1 = set(ids1)
>>> ids2 = set(ids2)
>>> common_ids = list(ids1.intersection(ids2))  # the clever method
>>> print(common_ids)
>>> for seq in seqs1:
...     # ADD 1-2 LINES OF CODE HERE: to print the details of identical IDs
```
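If sets are new to you, this small self-contained sketch shows the intersection on its own, using made-up IDs rather than real Uniprot entries.

```python
signal_ids = {'ID_A', 'ID_B', 'ID_C'}   # hypothetical IDs
lipid_ids = {'ID_B', 'ID_C', 'ID_D'}

common = signal_ids.intersection(lipid_ids)  # same result as signal_ids & lipid_ids
only_signal = signal_ids - lipid_ids         # in the first set but not the second
print(sorted(common), sorted(only_signal))
```

Sets ignore order and duplicates, which is why the results are sorted before printing.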

**<span style="color:blue">Submit: How many TAG lipases did you find? Submit your code with comments describing the process.</span>**
    
*Hint: If your internet connection was too slow to download the sequences above, then instead of using the FASTA files, you can directly use the IDs returned by `searchSequences()`.*

------------------------------------------------------------------------------------------------------------------------------
<a id="part3"></a>
**<SPAN STYLE="color:#5F018B;font-size:18pt;weight:bold">Exercises part 3 : Sequence alignments</SPAN>**


### Multiple sequence alignments

Aligning sequences to one another lets us see where sequences are similar or different. Similarity in sequences may indicate that the sequences have the same structure, perform the same function, or that they have evolved from the same ancestor.

The `Alignment` class stores an alignment of sequences. To get a feel for the data, load either the `p450.aln` or the `gpcr.aln` file, and print the alignment.


```python
>>> from sequence import *
>>> aln = readClustalFile('p450.aln', Protein_Alphabet)
>>> print(aln)
Q27499  AGMETTSNTLNWALLYVLRNPEVRQKVYEELDATINESQRLANLVPFSIGKRQCCPGEGLAKMELL
Q16872  AGTETTSTTLRYALLLLLKHPEVTAKVQEEIEAVVHEVQRYIDLMPFSAGKRICCVGEALAGMELF
...
```

Glancing over the alignment you will observe that each sequence occupies its own row, and that columns tend to contain the same or similar amino acids — a result of evolutionary conservation. To facilitate the inspection of this data, it is common to colour amino acids according to their physico-chemical properties. There’s a method for writing the alignment to an HTML file:

```python
>>> aln.writeHTML('p450.html')
```

You can view this file in a web browser. (Open file; or simply double-click in the file explorer.) Here’s a snippet:
![p450 alignment snippet](p450.png)
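To make the idea of conserved columns concrete, the sketch below scores each column of a toy alignment by the fraction of rows that share its most common symbol. It works on plain strings and made-up data; it does not use the `Alignment` class, and the function name `column_conservation` is an assumption for illustration.

```python
from collections import Counter

# Toy alignment: equal-length rows, '-' marks a gap (made-up data)
rows = ['AGMETT', 'AGTETT', 'AG-DTS']


def column_conservation(rows):
    """For each column, return the fraction of rows sharing the most common symbol."""
    fractions = []
    for column in zip(*rows):  # zip(*rows) walks the alignment column by column
        most_common_count = Counter(column).most_common(1)[0][1]
        fractions.append(most_common_count / len(rows))
    return fractions


print(column_conservation(rows))
```

In this simple version a gap is counted like any other symbol; a more careful score would treat gaps separately.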

------------------------------------------------------------------------------------------------------------------------------

**Exercise 6 : Investigating an alignment of GPCRs** 

Load `gpcr.aln` and write it to an HTML file to view as was shown above. The file contains the full sequences of 23 so-called G protein coupled receptors (GPCRs). All GPCRs contain seven transmembrane domains (i.e. sections of 15-25 consecutive residues that embed in the lipid membrane).

Investigate and describe the meanings of the colours used in the alignments by default. (`writeHTML` uses a colour 'annotation' that is defined for `Protein_Alphabet` in `sym.py`, halfway down the file.) Find out which amino acids are considered hydrophobic and then colour all of them blue by changing the colour annotation in `sym.py`. Colour the other residues white. 

In the following diagram of what a GPCR protein looks like we can see that it bends and crosses over the lipid membrane seven times (as indicated by the numbers in the diagram).

<img src="gpcr.png" alt="GPCR diagram" style="width: 400px;"/>

If you're not familiar with the hydrophobic properties of a lipid membrane, look them up. Based on this knowledge, even without knowing exactly which amino acids make up this protein, we can expect each stretch of the sequence that crosses the membrane to contain mostly hydrophobic amino acids.

On your alignment, identify the boundaries of the transmembrane domains.
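The exercise asks you to find the boundaries by eye, but the same reasoning can be sketched in code as a sliding window that reports where most residues are hydrophobic. The hydrophobic set below is only one possible choice and is an assumption, not the definition used in `sym.py`; the sequence is made up, and the window of 5 is kept short for the toy data (real transmembrane domains span 15-25 residues).

```python
HYDROPHOBIC = set('AILMFVW')  # assumption: one possible choice, not the course's definition


def hydrophobic_windows(residues, window=5, threshold=0.8):
    """Return 0-based start indices of windows whose hydrophobic fraction >= threshold."""
    starts = []
    for start in range(len(residues) - window + 1):
        segment = residues[start:start + window]
        fraction = sum(1 for r in segment if r in HYDROPHOBIC) / window
        if fraction >= threshold:
            starts.append(start)
    return starts


toy = 'KDESLLIVAFLKDERS'  # made-up sequence with a hydrophobic stretch in the middle
print(hydrophobic_windows(toy))
```

Runs of consecutive start positions mark a candidate membrane-spanning stretch; the first and last window in a run give rough boundaries.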


**<span style="color:blue">Submit: Describe the physico-chemical properties represented by each default colour used in the alignment.</span>**

**<span style="color:blue">Submit: Show your own 'hydrophobic' colour scheme (as a list of affected amino acids) along with a screen capture of your GPCR alignment.</span>**

**<span style="color:blue">Submit: Provide the rough boundaries of the fifth transmembrane domain.</span>**

------------------------------------------------------------------------------------------------------------------------------

**<SPAN STYLE="color:#5F018B;font-size:18pt;weight:bold">Assessment</SPAN>**

### Be sure to provide your responses to the items marked with Submit through Coder Quiz by the due date!

------------------------------------------------------------------------------------------------------------------------------
