# Creating Datasets from the PDB

Graphein provides a utility for curating and splitting datasets from the [RCSB PDB](https://www.rcsb.org/).


Initialising a `PDBManager` downloads the PDB metadata, which we can then use to make complex selections of protein structures.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb)

In [27]:
from graphein.datasets import PDBManager

manager = PDBManager(root_dir=".")

The manager wraps two dataframes:

* `manager.df` - the working selection.
* `manager.source` - a clean copy of the original metadata.

In [29]:
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction
...,...,...,...,...,...,...,...,...,...,...,...,...,...
773981,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction
773982,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction
773983,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction
773984,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction


In [30]:
manager.source

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type
0,100d_A,100d,A,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction
1,100d_B,100d,B,10,na,DNA/RNA (5'-R(*CP*)-D(*CP*GP*GP*CP*GP*CP*CP*GP...,CCGGCGCCGG,,[SPM],,1.90,1994-12-05,diffraction
2,101d_A,101d,A,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction
3,101d_B,101d,B,12,na,DNA (5'-D(*CP*GP*CP*GP*AP*AP*TP*TP*(CBR)P*GP*C...,CGCGAATTCGCG,,"[CBR, MG, NT]",,2.25,1994-12-14,diffraction
4,101m_A,101m,A,154,protein,MYOGLOBIN,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDR...,,"[HEM, NBN, SO4]",Physeter catodon,2.07,1997-12-13,diffraction
...,...,...,...,...,...,...,...,...,...,...,...,...,...
773981,9xia_A,9xia,A,388,protein,XYLOSE ISOMERASE,MNYQPTPEDRFTFGLWTVGWQGRDPFGDATRRALDPVESVQRLAEL...,,"[DFR, MN]",Streptomyces rubiginosus,1.90,1990-10-11,diffraction
773982,9xim_A,9xim,A,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction
773983,9xim_B,9xim,B,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction
773984,9xim_C,9xim,C,393,protein,D-XYLOSE ISOMERASE,SVQATREDKFSFGLWTVGWQARDAFGDATRTALDPVEAVHKLAEIG...,,"[MN, XLS]",Actinoplanes missouriensis,2.40,1992-04-03,diffraction


## Selection Properties

In [34]:
print("Num chains: ", manager.get_num_chains())
print("Num unique pdbs: ", manager.get_num_unique_pdbs())
print("Longest chain: ", manager.get_longest_chain())
print("Shortest chain: ", manager.get_shortest_chain())
print("Best Resolution: ", manager.get_best_resolution())
print("Worst Resolution: ", manager.get_worst_resolution())
print("Experiment Types: ", manager.get_experiment_types())
print("Molecule Types: ", manager.get_molecule_types())

Num chains:  773986
Num unique pdbs:  201763
Longest chain:  16181
Shortest chain:  1
Best Resolution:  -1.0
Worst Resolution:  70.0
Experiment Types:  ['diffraction' 'NMR' 'other' 'EM']
Molecule Types:  ['na' 'protein']
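
These summaries are simple aggregations over the metadata table. The sketch below reproduces a few of them in plain Python on rows copied from the table above; it is not Graphein's implementation. It also treats non-positive resolutions as missing, on the assumption (not confirmed by the source) that the `-1.0` reported as the best resolution is a placeholder for entries without a resolution. The row with id `xxxx_A` is hypothetical and only there to illustrate that case.

```python
# Rows copied from the table above, plus one hypothetical entry ("xxxx_A")
# whose resolution of -1.0 stands in for a missing value.
rows = [
    {"id": "100d_A", "pdb": "100d", "length": 10, "resolution": 1.90},
    {"id": "100d_B", "pdb": "100d", "length": 10, "resolution": 1.90},
    {"id": "101m_A", "pdb": "101m", "length": 154, "resolution": 2.07},
    {"id": "9xim_A", "pdb": "9xim", "length": 393, "resolution": 2.40},
    {"id": "xxxx_A", "pdb": "xxxx", "length": 20, "resolution": -1.0},
]

num_chains = len(rows)
num_unique_pdbs = len({row["pdb"] for row in rows})
longest_chain = max(row["length"] for row in rows)
shortest_chain = min(row["length"] for row in rows)

# Lower values mean better resolution, so the naive "best" is the minimum.
naive_best = min(row["resolution"] for row in rows)

# Assumption: non-positive values are placeholders, so skip them.
valid = [row["resolution"] for row in rows if row["resolution"] > 0]
best_resolution = min(valid)
worst_resolution = max(valid)

print(num_chains, num_unique_pdbs, longest_chain, shortest_chain)
print(naive_best, best_resolution, worst_resolution)
```

If the placeholder assumption holds, excluding those entries before asking for the best resolution gives a more meaningful figure.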


# Making Selections

Selection functions return a `pd.DataFrame`. Each one also takes an `update: bool` argument that controls whether `manager.df` is updated in place:

In [35]:
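print("Number of chains: ", len(manager.df))

# Without update=True the filtered frame is returned and manager.df is left untouched
print(len(manager.resolution_better_than_or_equal_to(2.0)))
print(len(manager.df))

# With update=True the filter is applied to manager.df in place
manager.resolution_better_than_or_equal_to(2.0, update=True)
print(len(manager.df))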

Number of chains:  773986
201310
773986
201310


To discard the current selection and restore `manager.df` to the full metadata, call `reset()`:

In [36]:
manager.reset()
print(len(manager.df))

773986
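
The interplay between the working selection, the untouched source copy, the `update` flag and `reset()` can be illustrated with a toy class. This is a minimal sketch, not Graphein's implementation: `ToyManager` is a hypothetical stand-in that stores rows as plain dictionaries instead of a dataframe.

```python
from dataclasses import dataclass, field


@dataclass
class ToyManager:
    """Hypothetical stand-in: `source` is never modified, `df` is the selection."""

    source: list
    df: list = field(init=False)

    def __post_init__(self):
        # Start with a working selection that contains everything.
        self.df = list(self.source)

    def resolution_better_than_or_equal_to(self, cutoff, update=False):
        # Lower resolution values are better, so "better than or equal to" is <=.
        selected = [row for row in self.df if row["resolution"] <= cutoff]
        if update:
            self.df = selected
        return selected

    def reset(self):
        # Throw away the selection and start again from the source copy.
        self.df = list(self.source)


toy = ToyManager([
    {"id": "100d_A", "resolution": 1.90},
    {"id": "101d_A", "resolution": 2.25},
    {"id": "101m_A", "resolution": 2.07},
])

preview = toy.resolution_better_than_or_equal_to(2.0)
size_after_preview = len(toy.df)   # selection unchanged

toy.resolution_better_than_or_equal_to(2.0, update=True)
size_after_update = len(toy.df)    # selection narrowed

toy.reset()
size_after_reset = len(toy.df)     # selection restored

print(len(preview), size_after_preview, size_after_update, size_after_reset)
```

Returning the filtered rows without touching the selection makes it cheap to check how many entries a filter would keep before committing to it.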


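Filters can be combined by calling them one after another with `update=True`. The example below keeps protein chains of exactly five residues, solved by diffraction at a resolution of 1.5 Å or better, whose sequences contain no non-standard characters: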

In [37]:
manager.length_shorter_than(6, update=True)
manager.length_longer_than(4, update=True)
manager.molecule_type("protein", update=True)
manager.resolution_better_than_or_equal_to(1.5, update=True)
manager.experiment_type("diffraction", update=True)
manager.remove_non_standard_alphabet_sequences(update=True)
manager.df

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type
19388,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction
19389,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction
19390,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,"[3FG, DAL, DVC, FGA, GHP, MLU, OMY, OMZ, PGR]",,0.99,2000-12-29,diffraction
50603,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,[PTR],Rous sarcoma virus,1.50,1992-08-18,diffraction
50815,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,"[MOH, SO4]",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction
...,...,...,...,...,...,...,...,...,...,...,...,...,...
653391,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,"[ACE, BTB, GOL, NH2, SO4, WZG]",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction
682510,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,"[CMC, GOL, P6G]",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction
694155,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,"[CA, MG]",Enterococcus faecalis,1.40,2021-09-30,diffraction
744831,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction
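
Conceptually, the chained calls above narrow the selection one condition at a time. The sketch below expresses the same conditions as plain Python predicates over a few rows. It rests on some assumptions: the length filters are treated as strict inequalities (consistent with every result having length 5), "standard alphabet" is taken to mean the 20 canonical amino acids, and the rows `xxxx_A` and `yyyy_A` are hypothetical entries added to exercise the filters.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

rows = [
    # Rows taken from the outputs above.
    {"id": "100d_A", "length": 10, "molecule_type": "na", "sequence": "CCGGCGCCGG",
     "resolution": 1.90, "experiment_type": "diffraction"},
    {"id": "1sha_B", "length": 5, "molecule_type": "protein", "sequence": "YVPML",
     "resolution": 1.50, "experiment_type": "diffraction"},
    {"id": "1skg_B", "length": 5, "molecule_type": "protein", "sequence": "VAFRS",
     "resolution": 1.21, "experiment_type": "diffraction"},
    # Hypothetical rows: a non-standard residue ("X") and a resolution that is too coarse.
    {"id": "xxxx_A", "length": 5, "molecule_type": "protein", "sequence": "AXKAA",
     "resolution": 1.10, "experiment_type": "diffraction"},
    {"id": "yyyy_A", "length": 5, "molecule_type": "protein", "sequence": "GAGAA",
     "resolution": 2.00, "experiment_type": "diffraction"},
]

# One predicate per call in the example, in the same order.
filters = [
    lambda row: row["length"] < 6,
    lambda row: row["length"] > 4,
    lambda row: row["molecule_type"] == "protein",
    lambda row: row["resolution"] <= 1.5,
    lambda row: row["experiment_type"] == "diffraction",
    lambda row: set(row["sequence"]) <= STANDARD_AMINO_ACIDS,
]

selection = rows
for keep in filters:
    # Each step filters the result of the previous one, like update=True.
    selection = [row for row in selection if keep(row)]

selected_ids = [row["id"] for row in selection]
print(selected_ids)
```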


# I/O

We can write a selection to a CSV or FASTA file, or download the relevant PDB files and write them to disk:

## CSV

In [38]:
import pandas as pd

manager.to_csv("tmp/test.csv")

sel = pd.read_csv("tmp/test.csv")
sel

Unnamed: 0,id,pdb,chain,length,molecule_type,name,sequence,split,ligands,source,resolution,deposition_date,experiment_type
0,1hhz_D,1hhz,D,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction
1,1hhz_E,1hhz,E,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction
2,1hhz_F,1hhz,F,5,protein,CELL WALL PEPTIDE,AEKAA,,"['3FG', 'DAL', 'DVC', 'FGA', 'GHP', 'MLU', 'OM...",,0.99,2000-12-29,diffraction
3,1sha_B,1sha,B,5,protein,PHOSPHOPEPTIDE A,YVPML,,['PTR'],Rous sarcoma virus,1.50,1992-08-18,diffraction
4,1skg_B,1skg,B,5,protein,VAFRS,VAFRS,,"['MOH', 'SO4']",Daboia russellii pulchella; SYNTHETIC CONSTRUCT,1.21,2004-03-04,diffraction
...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,7kpu_B,7kpu,B,5,protein,bisubstrate analogue (CMC-ACE-SER-GLY-ARG-GLY-...,SGRGK,,"['ACE', 'BTB', 'GOL', 'NH2', 'SO4', 'WZG']",Homo sapiens; SYNTHETIC CONSTRUCT,1.43,2020-11-12,diffraction
76,7oju_H,7oju,H,5,protein,MVNAL Peptide,MVNAL,,"['CMC', 'GOL', 'P6G']",Chaetomium thermophilum (strain DSM 1495 / CBS...,1.10,2021-05-17,diffraction
77,7pul_P,7pul,P,5,protein,GLY-ALA-GLY-ALA-ALA,GAGAA,,"['CA', 'MG']",Enterococcus faecalis,1.40,2021-09-30,diffraction
78,7x70_B,7x70,B,5,protein,peptide,AVKLQ,,[],Homo sapiens; SYNTHETIC CONSTRUCT,1.25,2022-03-08,diffraction
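
In the table read back from disk, the `ligands` entries appear as quoted strings such as `['PTR']` rather than lists, because CSV stores every cell as text. A minimal sketch of recovering the lists with `ast.literal_eval`, using only the standard library and an in-memory file; with pandas, the same function could presumably be applied to the column via `sel["ligands"].apply(ast.literal_eval)`, assuming every cell holds a list literal.

```python
import ast
import csv
import io

# Two rows from the selection above, with ligands as Python lists.
rows = [
    {"id": "1sha_B", "sequence": "YVPML", "ligands": ["PTR"]},
    {"id": "1skg_B", "sequence": "VAFRS", "ligands": ["MOH", "SO4"]},
]

# Write to an in-memory CSV; the csv module serialises the lists with str().
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "sequence", "ligands"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back: every value is now a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[1]["ligands"]

# literal_eval safely parses the Python literal without executing code.
parsed = ast.literal_eval(raw)

print(type(raw).__name__, type(parsed).__name__, parsed)
```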


## FASTA

In [40]:
from graphein.protein.utils import read_fasta
manager.to_fasta("tmp/test.fasta")

fs = read_fasta("tmp/test.fasta")
print(fs)

{'1hhz_D': 'AEKAA', '1hhz_E': 'AEKAA', '1hhz_F': 'AEKAA', '1sha_B': 'YVPML', '1skg_B': 'VAFRS', '1tjk_I': 'FLSTK', '2d5w_C': 'ASKPK', '2d5w_D': 'ASKTK', '3drj_B': 'AHAKA', '3hds_E': 'ASWSA', '3hds_F': 'ASWSA', '4j78_B': 'KTKLL', '4j82_C': 'KSHQE', '4j82_D': 'KSHQE', '4j84_C': 'ARKLD', '4j84_D': 'ARKLD', '4l9p_C': 'KCVVM', '4olr_A': 'YVVFV', '4olr_B': 'YVVFV', '4qxx_Z': 'GNLVS', '4v3i_B': 'DLTRP', '4x3o_C': 'PKKTG', '4zhb_B': 'VDAVN', '5ctv_C': 'AEKAA', '5ctv_E': 'AEKAA', '5n99_C': 'NQPWQ', '5n99_E': 'NQPWQ', '5n99_F': 'NQPWQ', '5n99_H': 'NQPWQ', '5n99_J': 'NQPWQ', '5n99_L': 'NQPWQ', '5n99_N': 'NQPWQ', '5n99_P': 'NQPWQ', '5n99_R': 'NQPWQ', '5n99_T': 'NQPWQ', '5n99_V': 'NQPWQ', '5n99_X': 'NQPWQ', '5nf0_F': 'GGGGG', '5njf_E': 'AAAAA', '5njf_F': 'AAAAA', '5onp_B': 'GAIIG', '5onq_B': 'GAIIG', '5r42_E': 'TPGVY', '5r43_E': 'TPGVY', '5r44_E': 'TPGVY', '5r45_E': 'TPGVY', '5r46_E': 'TPGVY', '5r47_E': 'TPGVY', '5r48_E': 'TPGVY', '5r49_E': 'TPGVY', '5r4a_E': 'TPGVY', '5r4b_E': 'TPGVY', '5r4c_E': '
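
The dictionary printed above maps chain identifiers to sequences. The sketch below shows one way such a mapping can be written to and parsed from FASTA text. It is not Graphein's `read_fasta` or `to_fasta`: `write_fasta` and `parse_fasta` are hypothetical helpers, and the header layout (`>` followed by the chain id) is an assumption inferred from the keys in the output.

```python
import io


def write_fasta(records, handle):
    """Write an {identifier: sequence} mapping as FASTA records."""
    for identifier, sequence in records.items():
        handle.write(f">{identifier}\n{sequence}\n")


def parse_fasta(handle):
    """Parse FASTA text into an {identifier: sequence} dictionary."""
    records = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            current = line[1:]
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so append to the open record.
            records[current] += line
    return records


records = {"1hhz_D": "AEKAA", "1sha_B": "YVPML", "1skg_B": "VAFRS"}

buffer = io.StringIO()
write_fasta(records, buffer)
buffer.seek(0)
round_trip = parse_fasta(buffer)

print(round_trip)
```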

## Downloading PDBs

In [41]:
manager.download_pdbs("tmp/pdbs/")

  0%|          | 0/55 [00:00<?, ?it/s]

In [42]:
import os
os.listdir("tmp/pdbs")

['5r46.pdb',
 '4j84.pdb',
 '6y3m.pdb',
 '5ctv.pdb',
 '4qxx.pdb',
 '5njf.pdb',
 '7etu.pdb',
 '6f4s.pdb',
 '6z00.pdb',
 '7oju.pdb',
 '1hhz.pdb',
 '4j82.pdb',
 '5r48.pdb',
 '1skg.pdb',
 '5r45.pdb',
 '4zhb.pdb',
 '5r42.pdb',
 '4v3i.pdb',
 '5onp.pdb',
 '5r47.pdb',
 '1sha.pdb',
 '4x3o.pdb',
 '5r4c.pdb',
 '5onq.pdb',
 '5n99.pdb',
 '3drj.pdb',
 '7kd7.pdb',
 '5r4a.pdb',
 '7z5z.pdb',
 '7kpu.pdb',
 '5r4b.pdb',
 '5r4d.pdb',
 '3hds.pdb',
 '5r44.pdb',
 '6diy.pdb',
 '6rd2.pdb',
 '6fbb.pdb',
 '5r43.pdb',
 '6f4r.pdb',
 '2d5w.pdb',
 '7ett.pdb',
 '5r49.pdb',
 '7etv.pdb',
 '6eaw.pdb',
 '7pul.pdb',
 '6slg.pdb',
 '6ax4.pdb',
 '6f4t.pdb',
 '1tjk.pdb',
 '4l9p.pdb',
 '5nf0.pdb',
 '4olr.pdb',
 '4j78.pdb',
 '7x70.pdb']
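
The progress bar counts 55 items although the selection holds 79 chains, presumably because structures are fetched once per PDB entry rather than once per chain. A minimal sketch of collapsing chain identifiers to entry codes, assuming the `<pdb>_<chain>` layout seen in the `id` column; the identifiers are a small subset of those in the FASTA output.

```python
from collections import Counter

# A few chain identifiers from the selection above.
chain_ids = ["1hhz_D", "1hhz_E", "1hhz_F", "1sha_B", "5n99_C", "5n99_E"]

# The part before the underscore is the PDB entry code.
chains_per_entry = Counter(chain_id.split("_")[0] for chain_id in chain_ids)
pdb_codes = sorted(chains_per_entry)

print(pdb_codes)
print(dict(chains_per_entry))
```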

## Writing Individual Chains

In [44]:
manager.write_chains()

  0%|          | 0/1 [00:00<?, ?it/s]

 78%|███████▊  | 43/55 [00:03<00:00, 12.35it/s]


FileNotFoundError: [Errno 2] No such file or directory: 'pdb/6t9m.pdb'
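
The traceback shows that a structure file expected while writing chains was not on disk; the source does not say why. Note also that the path in the error (`pdb/...`) differs from the download directory used earlier (`tmp/pdbs/`). One way to catch this before writing chains is to compare the PDB codes in the selection with the files actually present. The sketch below does this in a temporary directory; `missing_pdbs` is a hypothetical helper, and in the notebook the codes would come from the `pdb` column of `manager.df`.

```python
import tempfile
from pathlib import Path


def missing_pdbs(pdb_codes, directory, extension=".pdb"):
    """Return the codes that have no matching file in `directory`, sorted."""
    present = {path.stem for path in Path(directory).glob(f"*{extension}")}
    return sorted(set(pdb_codes) - present)


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a download directory in which one structure is absent.
    for code in ("1hhz", "1sha"):
        (Path(tmp) / f"{code}.pdb").touch()

    missing = missing_pdbs(["1hhz", "1sha", "6t9m", "1hhz"], tmp)

print(missing)
```

Any codes reported as missing could then be downloaded again or removed from the selection before retrying.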