# Labs - Biopython and data formats

## Outline

- Managing dependencies in Python with environments
- Biopython 
    - Sequences (parsing, representation, manipulation)
    - Structures (parsing, representation, manipulation)

### 1. Python environments

- handles issues with dependency versions
- ensures reproducibility
- does not clutter users' global site-packages directory

`python3 -m venv venv/       # Creates an environment called venv/`

`source venv/bin/activate`

`pip install biopython`

`pip freeze > requirements.txt`

`deactivate                  # Leaves the environment`

On a different machine, the environment can be replicated by creating a new environment and running

`pip install -r requirements.txt`
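To confirm from within Python that the interpreter is actually running inside a virtual environment, one can compare `sys.prefix` with `sys.base_prefix`. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("packages are installed under:", sys.prefix)
```

If this prints `False` after `source venv/bin/activate`, the `python` on your path is probably not the one from the environment.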

### 2. Biopython

Biopython is a library consisting of tools for both sequence and structure bioinformatics. Among other things it enables parsing, handling and storing molecular data present in common formats such as FASTA, PDB or mmCIF.

Install biopython using `pip install biopython`

The functionality is divided into packages; the full list is available in the [docs](https://biopython.org/docs/latest/api/Bio.html).

Main sequence and structure packages:
 - [Bio.Seq](https://biopython.org/docs/latest/api/Bio.Seq.html)
 - [Bio.Align](https://biopython.org/docs/latest/api/Bio.Align.html) 
 - [Bio.SeqIO](https://biopython.org/docs/latest/api/Bio.SeqIO.html)
 - [Bio.PDB](https://biopython.org/docs/latest/api/Bio.PDB.html) 
 

 
#### Sequences 
 
 Loading a sequence from a string: 

In [None]:
from Bio.Seq import Seq
seq = Seq("AGTACACTG")
print(seq)

This creates a [sequence object](https://biopython.org/docs/latest/api/Bio.Seq.html) with a couple of fancy methods, especially when it comes to nucleotide sequences, such as `reverse_complement` or `translate`.

In [None]:
print(seq)
print(seq.translate())
print(seq.reverse_complement())
print(seq.reverse_complement().transcribe())
print(seq.reverse_complement().translate())
print(seq.reverse_complement().transcribe().translate())
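As a rough illustration of what `reverse_complement` and `transcribe` compute, here is a plain-Python sketch for unambiguous DNA (A, C, G, T only). The helper functions are hypothetical stand-ins, not Biopython code; the real `Seq` methods also handle ambiguity codes and lower-case letters, which this sketch does not.

```python
# Complement mapping for the four unambiguous DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.replace("T", "U")


dna = "AGTACACTG"
print(reverse_complement(dna))              # CAGTGTACT
print(transcribe(reverse_complement(dna)))  # CAGUGUACU
```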

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(coding_dna.translate())
print(coding_dna.translate(to_stop=True))
print(coding_dna.translate(table=2))
print(coding_dna.translate(table=2, to_stop=True))

Notice, in the example above we used different genetic tables. Check [NCBI genetic codes](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) for details.
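To see where the differences come from, the effect of `table` and `to_stop` can be sketched in plain Python. The sketch assumes NCBI's layout, in which each genetic code is a 64-letter string ordered by codons over the bases `TCAG`. Only tables 1 (standard) and 2 (vertebrate mitochondrial) are included, and the `translate` function below is a hypothetical stand-in, not Biopython's implementation.

```python
from itertools import product

BASES = "TCAG"
# All 64 codons in NCBI order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

# Amino acids for the 64 codons, as listed on the NCBI genetic codes page.
TABLES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}


def translate(dna: str, table: int = 1, to_stop: bool = False) -> str:
    """Translate DNA codon by codon; optionally stop at the first stop codon."""
    codon_to_aa = dict(zip(CODONS, TABLES[table]))
    protein = []
    # Ignore a trailing partial codon, if any.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = codon_to_aa[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))                         # MAIVMGR*KGAR*
print(translate(coding_dna, to_stop=True))           # MAIVMGR
print(translate(coding_dna, table=2))                # MAIVMGRWKGAR*
print(translate(coding_dna, table=2, to_stop=True))  # MAIVMGRWKGAR
```

The only codon in this sequence that differs between the two tables is `TGA`: a stop codon in table 1 and tryptophan (`W`) in table 2.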

To list all the methods, run, e.g., one of the following:

In [None]:
print(dir(seq))
help(seq)  # help() prints the documentation itself and returns None

Subscripting methods are available as well.

In [None]:
print(seq[3])
print(seq[3:5])
print(seq[::-1])

If needed, the `Seq` object can be converted into a string.

In [None]:
print(str(seq))
print(str(seq).translate({65: 88}))
print(str(seq).replace('A', 'X'))

To parse sequences from a file, you can use [Bio.SeqIO](https://biopython.org/docs/latest/api/Bio.SeqIO.html). [Here](https://biopython.org/wiki/SeqIO#file-formats) is the list of supported formats. The format name is passed as the second argument of the `parse` function.

In [None]:
from Bio import SeqIO

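# SeqIO.parse returns an iterator that can be consumed only once,
# so store the records in a list before looping over them.
sars2_seq_recs = list(SeqIO.parse("R1A-B_SARS2.fasta", "fasta"))
for seq_record in sars2_seq_recs:
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))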
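Keep in mind that a parser iterator can be traversed only once: after a full pass, a second `list(...)` call on the same iterator yields nothing. The plain-Python sketch below mimics this with a lazy FASTA reader. The function `parse_fasta` and the toy records are made up for illustration and are far simpler than what `SeqIO.parse` does.

```python
from io import StringIO


def parse_fasta(handle):
    """Lazily yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The identifier is the first whitespace-delimited word of the header.
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


fasta_text = ">rec1 toy record\nACGT\nACGT\n>rec2 another toy record\nTTGA\n"
records_it = parse_fasta(StringIO(fasta_text))
first_pass = list(records_it)   # consumes the generator
second_pass = list(records_it)  # nothing left to read
print(first_pass)   # [('rec1', 'ACGTACGT'), ('rec2', 'TTGA')]
print(second_pass)  # []
```

If the records are needed more than once, materialize them with `list(...)` right away, or call the parser again.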

The result is an iterator of [SeqRecord](https://biopython.org/docs/latest/api/Bio.SeqRecord.html) objects. Other attributes of `SeqRecord`, such as features or annotations, are more relevant for richer formats, such as GenBank. The underlying gene for the two isoforms (R1A_SARS2/P0DTC1 and R1AB_SARS2/P0DTD1) is ORF1ab, and the two isoforms arise through ribosomal slippage during translation (see, e.g., [here](https://www.science.org/doi/full/10.1126/science.abf3546)). Both products, R1A_SARS2 and R1AB_SARS2, are polyproteins and are encoded by the same [gene](https://www.ncbi.nlm.nih.gov/gene/43740578). Let's explore this.

In [None]:
gb_rec = list(SeqIO.parse("NC_045512.gb", "genbank"))[0]
print(gb_rec.id)

In [None]:
print(gb_rec.annotations)
print(gb_rec.features)

Let's obtain all CDS (coding sequence) features.

In [None]:
cds = [seq_feature for seq_feature in gb_rec.features if seq_feature.type == 'CDS']

In [None]:
cds

In [None]:
print(dir(cds[1]))

In [None]:
import json
print(json.dumps(cds[1].qualifiers, indent=3))

In [None]:
cds[1].extract(gb_rec.seq).translate()

Now, let's extract the DNA sequence for the polyprotein 1ab and translate it.

In [None]:
aa_seq = cds[0].extract(gb_rec.seq).translate()
print(aa_seq[:10])
print(gb_rec.seq[265:].translate()[:10])
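GenBank locations are written as 1-based, inclusive intervals, whereas Python slices are 0-based and exclude the end. A feature starting at position 266 therefore begins at Python index 265, which is presumably why the cell above slices from 265. A `join(...)` location concatenates several such intervals, and `extract` takes care of that. The toy sketch below shows the coordinate conversion; the helper `extract_join` and the sequence are made up for illustration and are not part of Biopython.

```python
def extract_join(seq, parts):
    """Concatenate 1-based, inclusive (start, end) intervals of seq."""
    # start - 1 converts to a 0-based index; end stays as is because
    # Python slices exclude the end position.
    return "".join(seq[start - 1:end] for start, end in parts)


toy = "AAATGGCCTAAGG"
print(extract_join(toy, [(3, 8)]))           # ATGGCC
print(extract_join(toy, [(3, 8), (8, 11)]))  # ATGGCCCTAA
```

In the second call, position 8 is included in both intervals, so that nucleotide is read twice. This is how a -1 ribosomal frameshift can be expressed in a `join` location.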

To write a sequence into a file, use `SeqIO.write`.

In [None]:
from Bio.SeqRecord import SeqRecord
SeqIO.write([gb_rec, SeqRecord(aa_seq, id="id", description="aa")], "fasta_from_gb.fasta", "fasta")

### ---- Begin Exercise ----

- Obtain the protein sequence for polyprotein 1ab. Check with UniProt that it matches (just by eyeballing).
- Obtain the protein sequence for polyprotein 1a.
- Obtain the protein sequences for all the proteins in polyprotein 1a and list them together with their names.

The result should look something like:

```
['YP_009725297.1']: MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGG
['YP_009725298.1']: AYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGG
...
```


### ---- End Exercise ----

#### Structures
Structure processing is handled by the [Bio.PDB](https://biopython.org/docs/latest/api/Bio.PDB.html) package.

To read a structure from a PDB file, use the `PDBParser`. We will be using the 3C-like protease, which is one of the processed proteins present in the ORF1a polyprotein discussed above. One of its structures is [7ALH](https://www.ebi.ac.uk/pdbe/entry/pdb/7alh). To see all the structures, I suggest checking out the PDBe-KB page for [P0DTD1](https://www.ebi.ac.uk/pdbe/pdbe-kb/proteins/P0DTD1).

In [None]:
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure("7alh", "7alh.ent")

As the PDB file format is considered deprecated, one should use mmCIF files instead. They are parsed in the same way as PDB files.

In [None]:
from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
structure = parser.get_structure("7alh", "7alh.cif")

To retrieve the individual mmCIF dictionary fields, one can use the `MMCIF2Dict` class.

In [None]:
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
mmcif_dict = MMCIF2Dict("7alh.cif")
print(mmcif_dict["_citation.title"])

The structure record has the structure->model->chain->residue->atom (SMCRA) architecture.

![SMCRA](https://biopython.org/docs/latest/_images/smcra.png)

Each of the levels in the hierarchy is represented by a submodule in Bio.PDB, namely [Bio.PDB.Structure](https://biopython.org/docs/latest/api/Bio.PDB.Structure.html), [Bio.PDB.Model](https://biopython.org/docs/latest/api/Bio.PDB.Model.html), [Bio.PDB.Chain](https://biopython.org/docs/latest/api/Bio.PDB.Chain.html), [Bio.PDB.Residue](https://biopython.org/docs/latest/api/Bio.PDB.Residue.html) and [Bio.PDB.Atom](https://biopython.org/docs/latest/api/Bio.PDB.Atom.html). For details regarding IDs, check the [section on ID](https://biopython.org/docs/1.75/api/Bio.PDB.Entity.html#Bio.PDB.Entity.Entity.get_full_id) of the Entity class, which is the superclass of the Structure/Model/Chain/Residue classes.

In [None]:
print(structure.get_list())

In [None]:
print('---------- MODEL INFO ----------')

model = structure[0]
print(f"Full ID: {model.get_full_id()}\nID: {model.get_id()}")
print(model.get_list())

In [None]:
print('---------- CHAIN INFO ----------')
chain = model['A']
print(f"Full ID: {chain.get_full_id()}\nID: {chain.get_id()}")
print(chain.get_list())

In [None]:
print('---------- RESIDUE INFO ----------')
res = chain[(' ',1,' ')]
print(f"Full ID: {res.get_full_id()}\nID: {res.get_id()}")
print(res.get_resname())
res = chain[1]
print(res.get_resname())

In the cell above, notice that the residue ID is a triplet. The first position is the hetero flag (`'W'` for water, `'H_'` followed by the residue name, e.g. `'H_GLC'`, for other hetero residues, and `' '` for standard amino acids and nucleic acids), the second is the residue number, and the last is the insertion code.
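To make the layout of the residue ID concrete, the sketch below classifies a few ID tuples by their first field. Both the helper `residue_kind` and the example IDs are made up for illustration. In Bio.PDB, water carries the flag `'W'` and other hetero residues carry `'H_'` followed by the residue name.

```python
def residue_kind(res_id):
    """Classify a Bio.PDB-style residue id tuple by its hetero flag."""
    hetero_flag, seq_number, insertion_code = res_id
    if hetero_flag == " ":
        return "standard"
    if hetero_flag == "W":
        return "water"
    if hetero_flag.startswith("H_"):
        # Everything after the 'H_' prefix is the residue (ligand) name.
        return "hetero:" + hetero_flag[2:]
    return "unknown"


# Hypothetical residue ids, in the (hetero flag, number, insertion code) layout.
example_ids = [(" ", 1, " "), ("W", 501, " "), ("H_GLC", 401, " "), (" ", 52, "A")]
for rid in example_ids:
    print(rid, "->", residue_kind(rid))
```

With a parsed structure, the same tuple is what `residue.get_id()` returns.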

In [None]:
print('---------- ATOM INFO ----------')
atom=res['CA']
print(f"Full ID: {atom.get_full_id()}\nID: {atom.get_id()}")
print(f"{atom.get_name()}\n{atom.get_id()}\n{atom.get_coord()}\n{atom.get_fullname()}")
print(atom.get_vector())
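The coordinates returned by `get_coord()` are x, y, z positions in angstroms, so the distance between two atoms is the Euclidean norm of their difference. Bio.PDB also overloads the `-` operator for atoms to return this distance. The sketch below shows the calculation with the standard library only; the coordinates are made up and merely stand in for the output of `get_coord()`.

```python
import math

# Hypothetical coordinates (in angstroms) standing in for atom.get_coord() output.
ca_1 = (10.0, 4.0, -2.0)
ca_2 = (13.0, 8.0, -2.0)

# Euclidean distance between the two points.
distance = math.dist(ca_1, ca_2)
print(f"{distance:.2f}")  # 5.00
```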


To download a file from the PDB, one can use the `PDBList` class.

In [None]:
from Bio.PDB.PDBList import PDBList
pdbl = PDBList()
pbl_7lkr=pdbl.retrieve_pdb_file("7LKR", file_format="mmCif", pdir=".")

In [None]:
from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
structure = parser.get_structure("7lkr", "7lkr.cif")

### ---- Begin Exercise ----

- Iterate over all atoms of the structure
- List all water residues (the first field of the residue id is 'W')
- How many water molecules are in the record?
- How many heteroatoms are there in the record (residues whose first id field starts with 'H_')?
- Find a structure in the PDB with at least one ligand (other than water) and write code that lists all the ligands. (All such ligands can be found in the `HETNAM` records of PDB files and in the `_chem_comp.id` records of mmCIF files.)

### ---- End Exercise ----