# Labs - Biopython and data formats

## Outline

- Managing dependencies in Python with environments
- Biopython 
    - Sequences (parsing, representation, manipulation)
    - Structures (parsing, representation, manipulation)

### 1. Python environments

- resolves conflicts between dependency versions
- ensures reproducibility
- does not clutter users' global site-packages directory

`python3 -m venv venv/       # Creates an environment called venv/`

`source venv/bin/activate`

`pip install biopython`

`pip freeze > requirements.txt`

`deactivate                  # Leaves the environment`

On a different machine, the environment can be replicated by creating a new environment and running

`pip install -r requirements.txt`
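
Whether the interpreter is actually running inside the activated environment can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for this illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints `False` after `source venv/bin/activate`, the shell is most likely still picking up a different `python` executable.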

### 2. Biopython

Biopython is a library consisting of tools for both sequence and structure bioinformatics. Among other things it enables parsing, handling and storing molecular data present in common formats such as FASTA, PDB or mmCIF.

Install Biopython with `pip install biopython`.

Its functionality is divided into packages, which are listed in the [docs](https://biopython.org/docs/latest/api/Bio.html).

Main sequence and structure packages:
 - [Bio.Seq](https://biopython.org/docs/latest/api/Bio.Seq.html)
 - [Bio.Align](https://biopython.org/docs/latest/api/Bio.Align.html) 
 - [Bio.SeqIO](https://biopython.org/docs/latest/api/Bio.SeqIO.html)
 - [Bio.PDB](https://biopython.org/docs/latest/api/Bio.PDB.html) 
 

 
#### Sequences 
 
 Loading a sequence from a string: 

In [9]:
from Bio.Seq import Seq
seq = Seq("AGTACACTG")
print(seq)

AGTACACTG


This creates a [sequence object](https://biopython.org/docs/latest/api/Bio.Seq.html) with a number of convenient methods, especially for nucleotide sequences, such as `reverse_complement` or `translate`.

In [2]:
print(seq.translate())
print(seq)
print(seq.reverse_complement())
print(seq.reverse_complement().transcribe())
print(seq.reverse_complement().translate())
print(seq.reverse_complement().transcribe().translate())

STL
AGTACACTG
CAGTGTACT
CAGUGUACU
QCT
QCT
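
To make the output above less opaque, here is a plain-Python sketch of what `reverse_complement` computes for an unambiguous, upper-case DNA string. It is only an illustration (the helper name `reverse_complement_plain` is hypothetical), not Biopython's implementation, which also handles ambiguity codes and RNA.

```python
# Map each base to its complement; only A, C, G and T are handled here.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement_plain(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


result = reverse_complement_plain("AGTACACTG")
print(result)  # CAGTGTACT, matching seq.reverse_complement() above
```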


In [3]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(coding_dna.translate())
print(coding_dna.translate(to_stop=True))
print(coding_dna.translate(table=2))
print(coding_dna.translate(table=2, to_stop=True))

MAIVMGR*KGAR*
MAIVMGR
MAIVMGRWKGAR*
MAIVMGRWKGAR


Notice, in the example above we used different genetic tables. Check [NCBI genetic codes](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) for details.
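
The difference between the two tables can also be inspected directly in Biopython. The sketch below assumes the `Bio.Data.CodonTable` module and compares table 1 (standard) with table 2 (vertebrate mitochondrial); it shows why `TGA` is a stop codon in the first translation above and `W` in the second.

```python
from Bio.Data import CodonTable

# Tables are indexed by their NCBI genetic code ID.
standard = CodonTable.unambiguous_dna_by_id[1]
vert_mito = CodonTable.unambiguous_dna_by_id[2]

print(standard.names[0], "stop codons:", standard.stop_codons)
print(vert_mito.names[0], "stop codons:", vert_mito.stop_codons)

# forward_table maps sense codons to amino acids; stop codons are not included.
print("TGA in table 2 codes for:", vert_mito.forward_table["TGA"])
print("TGA is a sense codon in table 1:", "TGA" in standard.forward_table)
```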

To list all the methods, run, e.g., one of the following:

In [5]:
print(dir(seq))
help(seq)  # help() prints the documentation itself and returns None, so it is not wrapped in print()

['__abstractmethods__', '__add__', '__array_ufunc__', '__bytes__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__imul__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_data', 'back_transcribe', 'complement', 'complement_rna', 'count', 'count_overlap', 'defined', 'defined_ranges', 'endswith', 'find', 'index', 'islower', 'isupper', 'join', 'lower', 'lstrip', 'replace', 'reverse_complement', 'reverse_complement_rna', 'rfind', 'rindex', 'rsplit', 'rstrip', 'split', 'startswith', 'strip', 'transcribe', 'translate', 'ungap', 'upper']
Help on Seq in module Bio.Seq object:

class Seq(_SeqAbstractBaseClass)
 |  Seq(data, length=N

Indexing and slicing work the same way as for Python strings.

In [6]:
print(seq[3])
print(seq[3:5])
print(seq[::-1])

A
AC
GTCACATGA


If needed, the `Seq` object can be converted into a string.

In [7]:
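print(str(seq))
# On a plain str, translate() maps characters (here 'A', code point 65, to 'X', code point 88); it is not biological translation
print(str(seq).translate({65: 88}))
print(str(seq).replace('A', 'X'))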

AGTACACTG
XGTXCXCTG
XGTXCXCTG


To parse sequences from a file, you can use [Bio.SeqIO](https://biopython.org/docs/latest/api/Bio.SeqIO.html). [Here](https://biopython.org/wiki/SeqIO#file-formats) is the list of supported formats. The format name is passed as the second argument of the `parse` function.

In [10]:
from Bio import SeqIO

# SeqIO.parse returns a single-use iterator, so the records are stored in a list first
sars2_seq_recs = list(SeqIO.parse("R1A-B_SARS2.fasta", "fasta"))
for seq_record in sars2_seq_recs:
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

sp|P0DTD1|R1AB_SARS2
Seq('MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLV...VNN')
7096
sp|P0DTC1|R1A_SARS2
Seq('MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLV...FAV')
4405
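
One thing to keep in mind is that the object returned by `SeqIO.parse` can be consumed only once. The following sketch illustrates this on a made-up two-record FASTA text (the record names `toy1` and `toy2` are hypothetical) read from an in-memory handle instead of a file.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical FASTA content, used only for illustration.
fasta_text = ">toy1 first toy record\nMESLV\n>toy2 second toy record\nMESLVPG\n"

records_it = SeqIO.parse(StringIO(fasta_text), "fasta")
first_pass = [rec.id for rec in records_it]  # consumes the iterator
second_pass = list(records_it)               # nothing is left to read

# Materialise the records in a list when they are needed more than once.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
lengths = {rec.id: len(rec) for rec in records}

print(first_pass)
print(second_pass)
print(lengths)
```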


`SeqIO.parse` returns an iterator of [SeqRecord](https://biopython.org/docs/latest/api/Bio.SeqRecord.html) objects. Other `SeqRecord` attributes, such as features or annotations, are more relevant for richer formats such as GenBank. Both R1A_SARS2 (P0DTC1) and R1AB_SARS2 (P0DTD1) are polyproteins encoded by the same [gene](https://www.ncbi.nlm.nih.gov/gene/43740578), ORF1ab; the two isoforms arise from ribosomal slippage (frameshifting) during translation (see, e.g., [here](https://www.science.org/doi/full/10.1126/science.abf3546)). Let's explore this.

In [11]:
gb_rec = list(SeqIO.parse("NC_045512.gb", "genbank"))[0]
print(gb_rec.id)

NC_045512.2


In [14]:
print(gb_rec.annotations)
print(gb_rec.features)

{'molecule_type': 'ss-RNA', 'topology': 'linear', 'data_file_division': 'VRL', 'date': '18-JUL-2020', 'accessions': ['NC_045512'], 'sequence_version': 2, 'keywords': ['RefSeq'], 'source': 'Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)', 'organism': 'Severe acute respiratory syndrome coronavirus 2', 'taxonomy': ['Viruses', 'Riboviria', 'Orthornavirae', 'Pisuviricota', 'Pisoniviricetes', 'Nidovirales', 'Cornidovirineae', 'Coronaviridae', 'Orthocoronavirinae', 'Betacoronavirus', 'Sarbecovirus'], 'references': [Reference(title='A new coronavirus associated with human respiratory disease in China', ...), Reference(title='Programmed ribosomal frameshifting in decoding the SARS-CoV genome', ...), Reference(title='The structure of a rigorously conserved RNA element within the SARS virus genome', ...), Reference(title="A phylogenetically conserved hairpin-type 3' untranslated region pseudoknot functions in coronavirus RNA replication", ...), Reference(title='Direct Submission', .

Let's obtain all CDS (coding sequence) features.

In [41]:
cds = [seq_feature for seq_feature in gb_rec.features if seq_feature.type == 'CDS']

In [16]:
cds

[SeqFeature(CompoundLocation([SimpleLocation(ExactPosition(265), ExactPosition(13468), strand=1), SimpleLocation(ExactPosition(13467), ExactPosition(21555), strand=1)], 'join'), type='CDS', location_operator='join', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(265), ExactPosition(13483), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(21562), ExactPosition(25384), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(25392), ExactPosition(26220), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(26244), ExactPosition(26472), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(26522), ExactPosition(27191), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(27201), ExactPosition(27387), strand=1), type='CDS', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(27393), ExactPosition(27759), strand=1), type='CDS', qualifier

In [36]:
print(dir(cds[1]))

['__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_flip', '_get_location_operator', '_get_ref', '_get_ref_db', '_get_strand', '_set_location_operator', '_set_ref', '_set_ref_db', '_set_strand', '_shift', 'extract', 'id', 'location', 'location_operator', 'qualifiers', 'ref', 'ref_db', 'strand', 'translate', 'type']


In [42]:
import json
print(json.dumps(cds[1].qualifiers, indent=3))

{
   "gene": [
      "ORF1ab"
   ],
   "locus_tag": [
      "GU280_gp01"
   ],
   "note": [
      "pp1a"
   ],
   "codon_start": [
      "1"
   ],
   "product": [
      "ORF1a polyprotein"
   ],
   "protein_id": [
      "YP_009725295.1"
   ],
   "db_xref": [
      "GeneID:43740578"
   ],
   "translation": [
      "MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNK

In [17]:
cds[1].extract(gb_rec.seq).translate()

Seq('MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLV...AV*')

Now, let's extract the coding sequence of polyprotein 1ab and translate it.

In [37]:
aa_seq = cds[0].extract(gb_rec.seq).translate()
print(aa_seq[:10])
print(gb_rec.seq[265:].translate()[:10])

MESLVPGFNE
MESLVPGFNE
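
The first CDS listed above has a `CompoundLocation` whose two parts overlap by one nucleotide, which is how the frameshift is encoded in the annotation. The sketch below reproduces the idea on a made-up 16-nucleotide sequence with hypothetical coordinates. It assumes Biopython 1.80 or newer, where the class is called `SimpleLocation` (older releases call it `FeatureLocation`).

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, SeqFeature, SimpleLocation

# Toy sequence and coordinates, invented for this illustration.
toy_genome = Seq("CCATGGCCAAATAGGG")

# A plain CDS spanning positions 2..13 (0-based, end-exclusive like Python slices).
simple_cds = SeqFeature(SimpleLocation(2, 14, strand=1), type="CDS")

# A joined CDS whose second part starts one base before the first part ends,
# mimicking a -1 frameshift: the base at index 7 is read twice.
joined_cds = SeqFeature(
    CompoundLocation([SimpleLocation(2, 8, strand=1), SimpleLocation(7, 13, strand=1)]),
    type="CDS",
)

simple_nt = simple_cds.extract(toy_genome)
joined_nt = joined_cds.extract(toy_genome)

print(simple_nt, simple_nt.translate())  # ATGGCCAAATAG MAK*
print(joined_nt, joined_nt.translate())  # ATGGCCCAAATA MAQI
```

Because `extract` concatenates the parts in order, the translated product changes from the point of the join onwards.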




To write a sequence into a file, use `SeqIO.write`.

In [38]:
from Bio.SeqRecord import SeqRecord
SeqIO.write([gb_rec, SeqRecord(aa_seq, id="id", description="aa")], "fasta_from_gb.fasta", "fasta")

2
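
The number printed above is the return value of `SeqIO.write`, i.e. the count of records written. A small round-trip sketch with made-up records, writing to an in-memory handle rather than to a file, shows what ends up in the FASTA header line.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records, used only for illustration.
records = [
    SeqRecord(Seq("ATGGCC"), id="toy_dna", description="hypothetical nucleotide record"),
    SeqRecord(Seq("MA"), id="toy_aa", description="hypothetical protein record"),
]

handle = StringIO()
count = SeqIO.write(records, handle, "fasta")  # returns the number of records written
fasta_text = handle.getvalue()
print(count)
print(fasta_text)

# Reading the text back gives the same sequences.
round_trip = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
print([str(rec.seq) for rec in round_trip])
```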

### ---- Begin Exercise ----

- Obtain the protein sequence for polyprotein 1ab. Check against UniProt that it matches (eyeballing is enough).
- Obtain the protein sequence for polyprotein 1a.
- Obtain the protein sequences for all the proteins and list them together with their names.

### ---- End Exercise ----

#### Structures
Structure processing is handled by the [Bio.PDB](https://biopython.org/docs/latest/api/Bio.PDB.html) package.

To read a structure from a PDB file, use the `PDBParser`. We will be using the 3C-like proteinase, which is one of the processed proteins contained in the ORF1a polyprotein discussed above. One of its structures is [7ALH](https://www.ebi.ac.uk/pdbe/entry/pdb/7alh). To see all the structures, check the PDBe-KB page for [P0DTD1](https://www.ebi.ac.uk/pdbe/pdbe-kb/proteins/P0DTD1).

In [43]:
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser(PERMISSIVE=1)
structure = parser.get_structure("7alh", "7alh.ent")

As the PDB format is considered a legacy format, mmCIF files should be preferred. They are parsed in the same way as PDB files.

In [44]:
from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
structure = parser.get_structure("7alh", "7alh.cif")

To retrieve the individual CIF dictionary fields, one can use the `MMCIF2Dict` class.

In [45]:
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
mmcif_dict = MMCIF2Dict("7alh.cif")
print(mmcif_dict["_citation.title"])

['Crystal structure of the main protease (3CLpro/Mpro) of SARS-CoV-2 at 1.65A resolution (spacegroup C2).']
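
Note that the value above is a list even though the field has a single value. The sketch below parses a tiny, made-up CIF text (not a valid PDB entry) from an in-memory handle to show how single items and `loop_` columns are both returned as lists of strings.

```python
from io import StringIO

from Bio.PDB.MMCIF2Dict import MMCIF2Dict

# Minimal hypothetical CIF content, used only for illustration.
cif_text = """data_toy
_citation.title 'A hypothetical title'
loop_
_atom_type.symbol
C
N
O
"""

toy_dict = MMCIF2Dict(StringIO(cif_text))

print(toy_dict["data_"])              # name of the data block
print(toy_dict["_citation.title"])    # single item, still a list
print(toy_dict["_atom_type.symbol"])  # loop column, one entry per row
```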


The structure record follows the structure->model->chain->residue->atom (SMCRA) architecture.

![SMCRA](http://biopython.org/DIST/docs/tutorial/images/smcra.png)

Each level of the hierarchy is represented by a submodule of Bio.PDB, namely [Bio.PDB.Structure](https://biopython.org/docs/latest/api/Bio.PDB.Structure.html), [Bio.PDB.Model](https://biopython.org/docs/latest/api/Bio.PDB.Model.html), [Bio.PDB.Chain](https://biopython.org/docs/latest/api/Bio.PDB.Chain.html), [Bio.PDB.Residue](https://biopython.org/docs/latest/api/Bio.PDB.Residue.html) and [Bio.PDB.Atom](https://biopython.org/docs/latest/api/Bio.PDB.Atom.html). For details regarding IDs, check the [section on ID](https://biopython.org/docs/1.75/api/Bio.PDB.Entity.html#Bio.PDB.Entity.Entity.get_full_id) of the Entity class, which is the superclass of the Structure/Model/Chain/Residue classes (`Atom` is not an `Entity` subclass, but it offers analogous `get_id` and `get_full_id` methods).

In [47]:
print(structure.get_list())

print('---------- MODEL INFO ----------')

model = structure[0]
print(f"Full ID: {model.get_full_id()}\nID: {model.get_id()}")
print(model.get_list())

print('---------- CHAIN INFO ----------')
chain = model['A']
print(f"Full ID: {chain.get_full_id()}\nID: {chain.get_id()}")
print(chain.get_list())

# print('---------- RESIDUE INFO ----------')
# res = chain[(' ',1,' ')]
# print(f"Full ID: {res.get_full_id()}\nID: {res.get_id()}")
# print(res.get_resname())
# res = chain[1]
# print(res.get_resname())

# print(res.get_list())
# print('---------- ATOM INFO ----------')
# atom=res['CA']
# print(f"Full ID: {atom.get_full_id()}\nID: {atom.get_id()}")
# print(f"{atom.get_name()}\n{atom.get_id()}\n{atom.get_coord()}\n{atom.get_fullname()}")
# print(atom.get_vector())


[<Model id=0>]
---------- MODEL INFO ----------
Full ID: ('7alh', 0)
ID: 0
[<Chain id=A>]
---------- CHAIN INFO ----------
Full ID: ('7alh', 0, 'A')
ID: A
[<Residue SER het=  resseq=1 icode= >, <Residue GLY het=  resseq=2 icode= >, <Residue PHE het=  resseq=3 icode= >, <Residue ARG het=  resseq=4 icode= >, <Residue LYS het=  resseq=5 icode= >, <Residue MET het=  resseq=6 icode= >, <Residue ALA het=  resseq=7 icode= >, <Residue PHE het=  resseq=8 icode= >, <Residue PRO het=  resseq=9 icode= >, <Residue SER het=  resseq=10 icode= >, <Residue GLY het=  resseq=11 icode= >, <Residue LYS het=  resseq=12 icode= >, <Residue VAL het=  resseq=13 icode= >, <Residue GLU het=  resseq=14 icode= >, <Residue GLY het=  resseq=15 icode= >, <Residue CYS het=  resseq=16 icode= >, <Residue MET het=  resseq=17 icode= >, <Residue VAL het=  resseq=18 icode= >, <Residue GLN het=  resseq=19 icode= >, <Residue VAL het=  resseq=20 icode= >, <Residue THR het=  resseq=21 icode= >, <Residue CYS het=  resseq=22 icode

In the script above, notice that the residue ID is a triplet: the first element is the hetero flag ('W' for water, 'H_' followed by the residue name for other hetero residues, and ' ' for standard amino and nucleic acids), the second is the residue sequence number and the third is the insertion code.
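
To see the ID triplet and the hierarchy without any input file, the following sketch assembles a tiny artificial structure by hand. All names, coordinates and residues are invented for the illustration, the helper `make_residue` is hypothetical, and real structures should of course come from a parser.

```python
import numpy as np

from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure


def make_residue(res_id, resname, atom_name, element, serial):
    """Create a one-atom residue; the coordinates are arbitrary placeholders."""
    residue = Residue(res_id, resname, "    ")
    coord = np.zeros(3, dtype="f")
    residue.add(Atom(atom_name, coord, 20.0, 1.0, " ", f" {atom_name:<3}", serial, element=element))
    return residue


structure = Structure("toy")
model = Model(0)
chain = Chain("A")
structure.add(model)
model.add(chain)
chain.add(make_residue((" ", 1, " "), "SER", "CA", "C", 1))      # standard residue
chain.add(make_residue(("W", 2, " "), "HOH", "O", "O", 2))       # water
chain.add(make_residue(("H_GOL", 3, " "), "GOL", "C1", "C", 3))  # other hetero residue

for residue in chain:
    hetero_flag, resseq, icode = residue.get_id()
    print(residue.get_resname(), repr(hetero_flag), resseq, repr(icode))

# An integer index is shorthand for the ID (' ', number, ' '), so it only
# finds standard residues; hetero residues need the full triplet.
atom = structure[0]["A"][1]["CA"]
print(atom.get_full_id())
```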

To download a file from the PDB, one can use the `PDBList` class.

In [19]:
from Bio.PDB.PDBList import PDBList
pdbl = PDBList()
pbl_7lkr = pdbl.retrieve_pdb_file("7LKR", file_format="mmCif", pdir=".")

Structure exists: '.\7lkr.cif' 


In [20]:
from Bio.PDB.MMCIFParser import MMCIFParser
parser = MMCIFParser()
structure = parser.get_structure("7lkr", "7lkr.cif")



### ---- Begin Exercise ----

- Iterate over all atoms of the structure.
- List all water residues (the first field of the residue ID is 'W').
- How many water molecules are in the record?
- How many hetero residues are there in the record (the first field of the residue ID starts with 'H_')?
- Find a structure in the PDB with at least one ligand (other than water) and write code that lists all the ligands. (All such ligands can be found in the `HETNAM` sections of PDB files and in the `_chem_comp.id` records of mmCIF files.)

### ---- End Exercise ----