# Creating Datasets from the PDB

Graphein provides a utility for curating and splitting datasets from the [RCSB PDB](https://www.rcsb.org/).


Initialising a `PDBManager` downloads the PDB metadata, which we can then use to make complex selections of protein structures.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb) [![GitHub](https://img.shields.io/badge/-View%20on%20GitHub-181717?logo=github&logoColor=ffffff)](https://github.com/a-r-j/graphein/blob/master/notebooks/creating_datasets_from_the_pdb.ipynb)

In [None]:
from rich import inspect
from graphein.datasets import PDBManager

manager = PDBManager(root_dir=".")

The manager wraps two dataframes:

* `manager.df` - your working selection.
* `manager.source` - a clean copy of the original metadata.

In [None]:
manager.df

In [None]:
manager.source

## Selection Properties

In [None]:
print("Num chains: ", manager.get_num_chains())
print("Num unique pdbs: ", manager.get_num_unique_pdbs())
print("Longest chain: ", manager.get_longest_chain())
print("Shortest chain: ", manager.get_shortest_chain())
print("Best Resolution: ", manager.get_best_resolution())
print("Worst Resolution: ", manager.get_worst_resolution())
print("Experiment Types: ", manager.get_experiment_types())
print("Molecule Types: ", manager.get_molecule_types())
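For intuition about what these summaries report, the sketch below computes similar quantities from a small hand-written table. The records, field names (`pdb`, `chain`, `length`, `resolution`) and values are invented for illustration and are not Graphein's schema. The one domain fact it relies on is that a lower resolution value (in Å) means a better-resolved structure.

```python
# Toy chain-level records; field names and values are hypothetical.
records = [
    {"pdb": "aaaa", "chain": "A", "length": 120, "resolution": 1.8},
    {"pdb": "aaaa", "chain": "B", "length": 118, "resolution": 1.8},
    {"pdb": "bbbb", "chain": "A", "length": 45, "resolution": 3.2},
]

num_chains = len(records)  # one record per chain
num_unique_pdbs = len({r["pdb"] for r in records})  # several chains can share a PDB entry
longest_chain = max(r["length"] for r in records)
shortest_chain = min(r["length"] for r in records)
# Resolution is measured in angstroms: smaller numbers are better.
best_resolution = min(r["resolution"] for r in records)
worst_resolution = max(r["resolution"] for r in records)

print(num_chains, num_unique_pdbs, longest_chain, shortest_chain)
print(best_resolution, worst_resolution)
```

Note that the number of chains and the number of unique PDB entries differ whenever an entry contributes more than one chain.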

# Making Selections

Selection functions return a `pd.DataFrame`. All selection functions provide an `update: bool` argument that controls whether `manager.df` is updated in place:

In [None]:
print("Number of chains: ", len(manager.df))

print(len(manager.resolution_better_than_or_equal_to(2.0)))
print(len(manager.df))

# Update in place
manager.resolution_better_than_or_equal_to(2.0, update=True)
print(len(manager.df))

If you want to reset the selection:

In [None]:
manager.reset()
print(len(manager.df))
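The `update` and `reset` behaviour can be pictured with a toy stand-in. This is a minimal sketch of the working-copy pattern described above, not Graphein's implementation: `ToyManager` is a hypothetical class, and it stores plain lists of dictionaries rather than dataframes.

```python
import copy


class ToyManager:
    """Hypothetical stand-in illustrating the source/working-copy pattern."""

    def __init__(self, rows):
        self.source = copy.deepcopy(rows)  # pristine copy, never filtered
        self.df = copy.deepcopy(rows)      # working selection

    def resolution_better_than_or_equal_to(self, cutoff, update=False):
        # Lower resolution values are better, so keep rows at or below the cutoff.
        selected = [r for r in self.df if r["resolution"] <= cutoff]
        if update:
            self.df = selected  # persist the filter in the working selection
        return selected

    def reset(self):
        # Restore the working selection from the untouched source.
        self.df = copy.deepcopy(self.source)


toy = ToyManager([{"id": "a", "resolution": 1.5}, {"id": "b", "resolution": 2.5}])

preview = toy.resolution_better_than_or_equal_to(2.0)
print(len(preview), len(toy.df))  # filter previewed, working selection unchanged

toy.resolution_better_than_or_equal_to(2.0, update=True)
print(len(toy.df))  # working selection now narrowed

toy.reset()
print(len(toy.df))  # back to the full source
```

Keeping an untouched source alongside the working selection is what makes `reset` cheap: no metadata has to be downloaded or parsed again.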

Here is an example selection:

In [None]:
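# Restrict chain length using an upper and a lower bound
manager.length_shorter_than(6, update=True)
manager.length_longer_than(4, update=True)
# Keep protein chains only
manager.molecule_type("protein", update=True)
# Keep structures with a resolution of 1.5 Å or better
manager.resolution_better_than_or_equal_to(1.5, update=True)
# Keep structures determined by diffraction
manager.experiment_type("diffraction", update=True)
# Drop sequences that contain non-standard alphabet characters
manager.remove_non_standard_alphabet_sequences(update=True)
manager.df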
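Because every call above uses `update=True`, each filter is applied to the result of the previous one, so the final selection contains only chains that satisfy all conditions. The sketch below illustrates this on invented records. The field names, values, and the strictness of the length inequalities are assumptions made for the example, not a description of Graphein's behaviour.

```python
# Hypothetical chain records; field names and values are invented for illustration.
chains = [
    {"id": "c1", "length": 5, "type": "protein", "resolution": 1.2},
    {"id": "c2", "length": 5, "type": "na", "resolution": 1.0},
    {"id": "c3", "length": 300, "type": "protein", "resolution": 1.4},
    {"id": "c4", "length": 5, "type": "protein", "resolution": 2.8},
]

# Each predicate plays the role of one selection call with update=True.
# Strict inequalities for the length bounds are an assumption.
filters = [
    lambda c: c["length"] < 6,
    lambda c: c["length"] > 4,
    lambda c: c["type"] == "protein",
    lambda c: c["resolution"] <= 1.5,
]

selection = chains
for keep in filters:
    # Every step filters the output of the previous step.
    selection = [c for c in selection if keep(c)]

# The same result in one pass: a chain must satisfy all conditions at once.
combined = [c for c in chains if all(keep(c) for keep in filters)]

print([c["id"] for c in selection])
print(selection == combined)
```

Since the conditions are combined with a logical AND, the order of the calls does not change the final selection.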

# I/O

We can write our selection to CSV or FASTA files, or download the relevant PDB files in our selection and write them to disk:

## CSV

In [None]:
import os
import pandas as pd

os.makedirs("tmp/", exist_ok=True)
manager.to_csv("tmp/test.csv")

sel = pd.read_csv("tmp/test.csv")
sel

## FASTA

In [None]:
from graphein.protein.utils import read_fasta
manager.to_fasta("tmp/test.fasta")

fs = read_fasta("tmp/test.fasta")
print(fs)
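For readers unfamiliar with the format, a FASTA file is plain text in which each record has a header line starting with `>` followed by one or more sequence lines. The sketch below writes and re-reads a tiny file using only the standard library. The helpers `write_fasta` and `parse_fasta` and the identifiers are hypothetical, and the header layout that Graphein writes may differ.

```python
import os
import tempfile

# Hypothetical identifiers and sequences, for illustration only.
sequences = {"chain_1": "MKTAYIAK", "chain_2": "GSHMLE"}


def write_fasta(path, seqs):
    """Write each sequence as a '>' header line followed by the sequence."""
    with open(path, "w") as handle:
        for name, seq in seqs.items():
            handle.write(f">{name}\n{seq}\n")


def parse_fasta(path):
    """Return a dict mapping each header (without '>') to its sequence."""
    parsed, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                name = line[1:]
                parsed[name] = ""
            elif name is not None:
                parsed[name] += line  # sequences may wrap over several lines
    return parsed


with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, "toy.fasta")
    write_fasta(fasta_path, sequences)
    round_trip = parse_fasta(fasta_path)

print(round_trip == sequences)
```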

## Downloading PDBs

In [None]:
manager.download_pdbs("tmp/pdbs/")

In [None]:
import os
os.listdir("tmp/pdbs")

## Writing Individual Chains

In [None]:
manager.write_chains()