# Building the Datasets

This tutorial shows how to build the two datasets we used to train the models.

# Download

First we get the folders that contain the CA (alpha-carbon) coordinates of the experimental structures, together with the .txt file listing the PDB codes of every protein, by extracting the archives stored in `datasets_path`.

In [1]:
from torchmdexp.datasets.proteinfactory import ProteinFactory
import numpy as np
import tarfile
import os

In [2]:
datasets_path = '../data/datasets'
output_path = '.'
notebook_dir = os.getcwd()

### Fast-folders:

In [3]:
with tarfile.open(os.path.join(datasets_path, 'fastfolders.tar.gz')) as tar:
    tar.extractall(output_path)
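
Before extracting a large archive, it can help to check what it contains. The following is a minimal sketch that uses only the Python standard library; the helper name `list_archive_members` is hypothetical and is not part of torchmdexp.

```python
import os
import tarfile


def list_archive_members(archive_path, limit=10):
    """Return the names of up to `limit` members of a tar archive."""
    names = []
    with tarfile.open(archive_path) as tar:
        # Iterating over a TarFile yields one TarInfo object per member.
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


# Same location as in the cells above; only inspected if the archive is present.
archive = os.path.join('../data/datasets', 'fastfolders.tar.gz')
if os.path.exists(archive):
    for name in list_archive_members(archive):
        print(name)
```

Stopping after `limit` members keeps the check quick even for archives with many thousands of files.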

In [4]:
protein_factory = ProteinFactory()
protein_factory.create_dataset(dataset = os.path.join(notebook_dir, 'fastfolders/ff.txt'), 
                               data_dir = os.path.join(notebook_dir, 'fastfolders/ff'),
                               out_dir = os.path.join(notebook_dir, 'fastfolders/ff.npy')
                              )

100%|███████████████████████████████████████████| 12/12 [00:00<00:00, 13.85it/s]


### Computationally Solved Structures

In [None]:
with tarfile.open(os.path.join(datasets_path, 'csm_50.tar.gz')) as tar:
    tar.extractall(output_path)

In [18]:
protein_factory = ProteinFactory()
protein_factory.create_dataset(dataset = os.path.join(notebook_dir, 'csm_50/csm_50.txt'), 
                               data_dir = os.path.join(notebook_dir, 'csm_50/csm_dataset_50'),
                               out_dir = os.path.join(notebook_dir, 'csm_50/csm_50.npy')
                              )

100%|█████████████████████████████████████████████████████████| 15021/15021 [07:49<00:00, 31.99it/s]


The datasets are now ready for training our models. Custom datasets can also be created from PDB structures. All that is needed is a folder containing a subfolder `x_0` with the PDB structures, and a .txt file listing the names of the structures to use.
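
The structure list for a custom dataset could be generated along these lines. This is a minimal sketch under stated assumptions: the helper `write_structure_list` and the `my_dataset` paths are hypothetical, and writing one structure name per line without the `.pdb` extension is a guess at the expected format. Compare with `fastfolders/ff.txt` and match its format for your own list.

```python
from pathlib import Path


def write_structure_list(data_dir, list_path):
    """Write the names of the PDB files found in <data_dir>/x_0 to a text file.

    Assumption: one name per line, without the .pdb extension.
    """
    pdb_dir = Path(data_dir) / 'x_0'
    # Sort so the list is reproducible across file systems.
    names = sorted(p.stem for p in pdb_dir.glob('*.pdb'))
    Path(list_path).write_text('\n'.join(names) + '\n')
    return names


# Hypothetical paths; nothing is written unless the x_0 subfolder exists.
data_dir = Path('my_dataset/structures')
if (data_dir / 'x_0').is_dir():
    write_structure_list(data_dir, 'my_dataset/my_dataset.txt')
```

The resulting .txt file and folder would then be passed to `create_dataset` as `dataset` and `data_dir`, in the same way as in the cells above.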