# 0. Download structural ensemble from RCSB PDB 

## What is the goal of this notebook?


You can use this notebook to download all the structures deposited in the PDB that match a specific 'keyword'. For example, in this tutorial, we will download all the structures that contain the keyword 'triosephosphate isomerase' in their metadata. 

You can also download structures based on the sequence of the molecule of interest. 

You can choose to use only one of the methods, or you can do both if you want to make sure that you obtain all possible matching structures deposited in [RCSB PDB](https://www.rcsb.org/). 



## Import libraries, including PDBClean

In [1]:
import numpy as np
from PDBClean import pdbclean_io, pdbutils

## Define and create working directory (where you want to save your structures)

In [2]:
PROJDIR="./TIM/"

With check_project we can check if our project directory (PROJDIR) exists, and if it doesn't, we create it. 

In [3]:
pdbclean_io.check_project(projdir=PROJDIR)
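
For readers curious about what such a check amounts to, the sketch below reproduces the idea with the standard library only. `ensure_dir` is a hypothetical helper written for this illustration; it is not part of PDBClean, and the real `check_project` may differ in its details (for example, in what it prints).

```python
import os
import tempfile

def ensure_dir(path):
    """Create `path` if it does not exist; return True if it was created."""
    if os.path.isdir(path):
        return False
    os.makedirs(path, exist_ok=True)
    return True

# Work in a throwaway location so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "TIM")
    created_first = ensure_dir(demo_dir)   # directory is missing, so it is created
    created_again = ensure_dir(demo_dir)   # directory already exists, nothing to do
    print(created_first, created_again)
```

Calling the helper twice is harmless, which is the property that makes this kind of check safe to leave at the top of a notebook that gets re-run.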

## Check if the file with all sequences from PDB already exists. 

'seqfile' is the file where the sequences of all the structures currently deposited in [RCSB PDB](https://www.rcsb.org/) will be stored.
If you haven't downloaded them yet, or if you want to refresh the list, keep:
> update=True

Otherwise, change it to:
> update=False

and make sure 'seqfile' points to the path where you have stored the sequences. 
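
If you would rather not edit the flag by hand, one option is to derive it from whether the file is already on disk. The sketch below assumes a hypothetical helper, `needs_download`, that is not part of PDBClean; it only illustrates the decision described above.

```python
import os
import tempfile

def needs_download(seqfile, update):
    """Return True when the sequence file should be (re)downloaded.

    Download if the user asked for a refresh, or if no local copy exists.
    """
    return update or not os.path.exists(seqfile)

# Demonstrate with a temporary file rather than the real sequence file.
with tempfile.TemporaryDirectory() as tmp:
    demo_seqfile = os.path.join(tmp, "seqres.txt")
    missing_no_update = needs_download(demo_seqfile, update=False)
    with open(demo_seqfile, "w") as fh:
        fh.write(">demo\nAAAA\n")          # placeholder content
    present_no_update = needs_download(demo_seqfile, update=False)
    present_update = needs_download(demo_seqfile, update=True)
    print(missing_no_update, present_no_update, present_update)
```

In the notebook, this would correspond to setting `update = needs_download(seqfile, update=False)` after 'seqfile' has been defined.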


In [4]:
update=True            
seqfile=PROJDIR+'seqres.txt' 

## Retrieve reference sequence(s) from keyword

Change the keyword to any protein (or any other biomolecule) of interest. 


In [5]:
keyword='triosephosphate isomerase' 

Run the next cell. 

The next cell will download all the sequences currently deposited in [RCSB PDB](https://www.rcsb.org/).

It will use the given 'keyword' to search for all the matching sequences, based on the metadata (mode='metadata'). \
Results will be stored in 'ref_sequences' and 'ref_metadata', and printed to screen. 

Note: Depending on your internet connection, it may take a few minutes to download the sequences. But you only need \
to do this step once. 

In [6]:
# Search the sequence file by keyword; `update` is the flag set in cell [4]
ref_sequences, ref_metadata = pdbutils.retrieve_sequence_from_PDB(keyword, mode='metadata', update=update, seqfile=seqfile)

print('{0} sequences were identified as potential hits! \n'.format(len(ref_sequences)))

# Print each header followed by its sequence
for meta, seq in zip(ref_metadata, ref_sequences):
    print('{0} {1}'.format(meta, seq))

wrote ./TIM/seqres.txt from ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt
572 sequences were identified as potential hits! 

>1ag1_O mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE
 MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ

>1ag1_T mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE
 MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ

>1aw1_A mol:protein length:256  TRIOSEPHOSPHATE ISOMERASE
 MRHPVVMGNWKLNGSKEMVVDLLNGLNAELEGVTGVDVAVAPPALFVDLAERTLTEAGSAIILGAQNTDLNNSGAFTGDMSPAMLKEFGATHIIIGHSERREYHAESDEFVAKKFAFLKENGLTPVLCIGESDAQNEAGETMAVCARQLDAVINTQGVEALEGAIIAYEPIWAIGTGKAATAED
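
To make the metadata search less of a black box, here is a minimal sketch of a keyword search over records laid out like the output above (a '>' header line followed by one sequence line). `search_headers` is a hypothetical function, not the PDBClean implementation; the case-insensitive match is an assumption suggested by the lowercase keyword matching uppercase headers. The second record is invented for the example, and the first sequence is shortened.

```python
def search_headers(lines, keyword):
    """Return (header, sequence) pairs whose header contains `keyword`.

    Assumes records alternate between a '>' header line and a sequence line.
    Matching is case-insensitive.
    """
    hits = []
    key = keyword.lower()
    for header, sequence in zip(lines[0::2], lines[1::2]):
        if header.startswith(">") and key in header.lower():
            hits.append((header.strip(), sequence.strip()))
    return hits

demo_lines = [
    ">1ag1_O mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE",
    "MSKPQPIAAANWKCNGSQQ",          # shortened for the example
    ">0xyz_A mol:protein length:10  MADE-UP EXAMPLE PROTEIN",
    "ACDEFGHIKL",                   # invented record, not a real PDB entry
]

hits = search_headers(demo_lines, "triosephosphate isomerase")
for header, sequence in hits:
    print(header, sequence)
```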

## Retrieve all sequences that match the reference sequence(s)

We will use the sequences we retrieved in our previous search (mode='metadata'), to run new searches based on sequence (mode='sequence'). 

We will print to screen all the sequences that are retrieved. 

Notice that in this step, 'update=False', since we downloaded the RCSB PDB sequences in the previous step. 
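
Several reference sequences can be identical (in the output above, chains 1ag1_O and 1ag1_T share the same sequence), so the same search may be repeated many times. One possible refinement, sketched here with toy strings standing in for 'ref_sequences', is to search only once per unique sequence. `unique_in_order` is a hypothetical helper and not part of PDBClean.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the order of first appearance."""
    return list(dict.fromkeys(items))

# Toy stand-ins for ref_sequences; real entries are full-length sequences.
demo_refs = ["MSKPQP", "MSKPQP", "MRHPVV"]
unique_refs = unique_in_order(demo_refs)
print(unique_refs)
```

Looping over the deduplicated list instead of 'ref_sequences[1:]' should return the same set of matches while running fewer searches, assuming each search depends only on the query sequence.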


In [7]:
sequences, metadata = pdbutils.retrieve_sequence_from_PDB(ref_sequences[0], mode='sequence', update=False, seqfile=seqfile)

for seq in ref_sequences[1:]:
    newseq, newmet = pdbutils.retrieve_sequence_from_PDB(seq, mode='sequence', update=False, seqfile=seqfile)
    sequences = np.append(sequences, newseq)
    metadata  = np.append(metadata, newmet)

print('{0} sequences were retrieved! \n'.format(len(sequences)))

for iseq in np.arange(len(sequences)):
    print('{0} {1}'.format(metadata[iseq], sequences[iseq]))

4189 sequences were retrieved! 

>1ag1_O mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE
 MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ

>1ag1_T mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE
 MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ

>1iig_A mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE
 MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKFVIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASGFMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATPQQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKPEFVDIIKATQ

>1iig_B mol:protein length:250  TR

## Download PDB files

We first create a new directory ('raw_bank') in which to store the downloaded structures, using the function check_project. \
Notice that the option 'level' lets you name the directory. 

'download_pdb_from_metadata' downloads the structures from [RCSB PDB](https://www.rcsb.org/). We are using the metadata \
we populated in the previous two steps. 


In [8]:
pdbclean_io.check_project(projdir=PROJDIR, level='raw_bank')

In [9]:
pdbutils.download_pdb_from_metadata(metadata, projdir=PROJDIR)

wrote ./TIM//raw_bank/1ag1.cif from https://files.rcsb.org/download/1AG1.cif
wrote ./TIM//raw_bank/1aw1.cif from https://files.rcsb.org/download/1AW1.cif
wrote ./TIM//raw_bank/1aw2.cif from https://files.rcsb.org/download/1AW2.cif
wrote ./TIM//raw_bank/1b9b.cif from https://files.rcsb.org/download/1B9B.cif
wrote ./TIM//raw_bank/1btm.cif from https://files.rcsb.org/download/1BTM.cif
wrote ./TIM//raw_bank/1ci1.cif from https://files.rcsb.org/download/1CI1.cif
wrote ./TIM//raw_bank/1dkw.cif from https://files.rcsb.org/download/1DKW.cif
wrote ./TIM//raw_bank/1hg3.cif from https://files.rcsb.org/download/1HG3.cif
wrote ./TIM//raw_bank/1hti.cif from https://files.rcsb.org/download/1HTI.cif
wrote ./TIM//raw_bank/1i45.cif from https://files.rcsb.org/download/1I45.cif
wrote ./TIM//raw_bank/1if2.cif from https://files.rcsb.org/download/1IF2.cif
wrote ./TIM//raw_bank/1iig.cif from https://files.rcsb.org/download/1IIG.cif
wrote ./TIM//raw_bank/1iih.cif from https://files.rcsb.org/download/1IIH.cif

wrote ./TIM//raw_bank/3gvg.cif from https://files.rcsb.org/download/3GVG.cif
wrote ./TIM//raw_bank/3krs.cif from https://files.rcsb.org/download/3KRS.cif
wrote ./TIM//raw_bank/3kxq.cif from https://files.rcsb.org/download/3KXQ.cif
wrote ./TIM//raw_bank/3m9y.cif from https://files.rcsb.org/download/3M9Y.cif
wrote ./TIM//raw_bank/3pf3.cif from https://files.rcsb.org/download/3PF3.cif
wrote ./TIM//raw_bank/3psv.cif from https://files.rcsb.org/download/3PSV.cif
wrote ./TIM//raw_bank/3psw.cif from https://files.rcsb.org/download/3PSW.cif
wrote ./TIM//raw_bank/3pvf.cif from https://files.rcsb.org/download/3PVF.cif
wrote ./TIM//raw_bank/3pwa.cif from https://files.rcsb.org/download/3PWA.cif
wrote ./TIM//raw_bank/3py2.cif from https://files.rcsb.org/download/3PY2.cif
wrote ./TIM//raw_bank/3qsr.cif from https://files.rcsb.org/download/3QSR.cif
wrote ./TIM//raw_bank/3qst.cif from https://files.rcsb.org/download/3QST.cif
wrote ./TIM//raw_bank/3s6d.cif from https://files.rcsb.org/download/3S6D.cif

wrote ./TIM//raw_bank/6ooi.cif from https://files.rcsb.org/download/6OOI.cif
wrote ./TIM//raw_bank/6r8h.cif from https://files.rcsb.org/download/6R8H.cif
wrote ./TIM//raw_bank/6tim.cif from https://files.rcsb.org/download/6TIM.cif
wrote ./TIM//raw_bank/6up1.cif from https://files.rcsb.org/download/6UP1.cif
wrote ./TIM//raw_bank/6up5.cif from https://files.rcsb.org/download/6UP5.cif
wrote ./TIM//raw_bank/6up8.cif from https://files.rcsb.org/download/6UP8.cif
wrote ./TIM//raw_bank/6upf.cif from https://files.rcsb.org/download/6UPF.cif
wrote ./TIM//raw_bank/6w4u.cif from https://files.rcsb.org/download/6W4U.cif
wrote ./TIM//raw_bank/7abx.cif from https://files.rcsb.org/download/7ABX.cif
wrote ./TIM//raw_bank/7az3.cif from https://files.rcsb.org/download/7AZ3.cif
wrote ./TIM//raw_bank/7az4.cif from https://files.rcsb.org/download/7AZ4.cif
wrote ./TIM//raw_bank/7az9.cif from https://files.rcsb.org/download/7AZ9.cif
wrote ./TIM//raw_bank/7aza.cif from https://files.rcsb.org/download/7AZA.cif
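
The log above shows one file per PDB entry, even though the metadata lists one header per chain. The sketch below illustrates how entry IDs and download URLs could be derived from headers of that form. `pdb_ids_from_metadata` and `cif_url` are hypothetical helpers, not PDBClean functions; the URL pattern is copied from the log lines above.

```python
def pdb_ids_from_metadata(headers):
    """Return unique lowercase PDB IDs from headers like '>1ag1_O mol:protein ...'."""
    ids = []
    for header in headers:
        parts = header.lstrip(">").split()
        if not parts:
            continue                          # skip empty or malformed headers
        pdb_id = parts[0].split("_")[0].lower()   # '1ag1_O' -> '1ag1'
        if pdb_id not in ids:
            ids.append(pdb_id)
    return ids

def cif_url(pdb_id):
    """Build the mmCIF download URL, following the pattern seen in the log."""
    return "https://files.rcsb.org/download/{0}.cif".format(pdb_id.upper())

demo_metadata = [
    ">1ag1_O mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE",
    ">1ag1_T mol:protein length:250  TRIOSEPHOSPHATE ISOMERASE",
    ">1aw1_A mol:protein length:256  TRIOSEPHOSPHATE ISOMERASE",
]

demo_ids = pdb_ids_from_metadata(demo_metadata)
demo_urls = [cif_url(pdb_id) for pdb_id in demo_ids]
for pdb_id, url in zip(demo_ids, demo_urls):
    print(pdb_id, url)
```

A list like this can be handy for checking how many files to expect in 'raw_bank' before starting a long download.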