
# **PDB-CAT: Protein Data Bank Categorization Tool**
    
This notebook is designed to process and categorize structural data from the Protein Data Bank (PDB).
It allows users to analyze proteins and their interactions with ligands, apply filters based on sequence length, and classify structures based on mutations and binding properties.
    
## How to Use This Notebook
- **Step 1:** Run the setup and import cells in order.
- **Step 2:** Adjust the parameters (e.g., folder name, residue threshold) in the *Define Parameters* cell.
- **Step 3:** Run the remaining cells, or use the *Run All* option. The script processes the structures and generates categorized output files.
    
    

In [1]:
import os

# Clone and install PDB-CAT once; the PDB-CAT_READY marker file records that setup has already run
if not os.path.isfile("PDB-CAT_READY"):
    os.system("git clone https://github.com/URV-cheminformatics/PDB-CAT.git")
    os.chdir("PDB-CAT")  # Enter the cloned repository
    !pip install -r requirements.txt  # Install the dependencies listed in requirements.txt
    os.chdir("..")  # Return to the original directory
    open("PDB-CAT_READY", "w").close()  # Create the marker file to indicate successful cloning and installation
print("PDB-CAT installed")

github = 'PDB-CAT/'

def ensure_directories():
    # Join path components separately to avoid a doubled slash ("PDB-CAT//cif-test")
    cif_dir = os.path.join(github, "cif-test")
    out_dir = os.path.join(github, "out")

    # Create the 'cif-test' and 'out' directories if they do not exist yet
    os.makedirs(cif_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    print("Directories ensured")

ensure_directories()


PDB-CAT installed
Directories ensured


## Step 1: Import Required Libraries
The following libraries are required to parse and process PDB structures, manage data, and perform sequence alignments.

In [2]:
# Import libraries
import pandas as pd
import time
import re
import shutil
from datetime import datetime
import psutil
from pdbecif.mmcif_io import CifFileReader
from pdbecif.mmcif_tools import MMCIF2Dict
from Bio.Align import PairwiseAligner
from Bio.PDB import *
from Bio import SeqIO
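import sys
# Make the cloned repository importable, then load the PDB-CAT helper functions.
# Assumption: read_blacklist, extract_sequences, write_output, mutation_classification
# and bond_classification are defined in /content/PDB-CAT/PDBCAT_module.py.
sys.path.append('/content/PDB-CAT')
from PDBCAT_module import *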

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

## Step 2: Verify Required Folders
The script checks if the required directories exist. If they do not, it creates them automatically.
    

In [None]:
# Check if you have the correct folders
cif_dir = os.path.join(os.getcwd(), "cif-test")

if not os.path.exists(cif_dir):
    os.mkdir(cif_dir)
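
The folder check above can also be written with `pathlib`, which creates the folder only when it is missing and makes it easy to list the `.cif` files it contains. This is a minimal sketch and not part of PDB-CAT; the helper names `ensure_dir` and `list_cif_files` are hypothetical.

```python
from pathlib import Path


def ensure_dir(path):
    """Create `path` (and any missing parents) if it does not exist yet."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def list_cif_files(directory):
    """Return the sorted names of the .cif files found directly in `directory`."""
    return sorted(p.name for p in Path(directory).glob("*.cif"))
```

For example, `ensure_dir(Path.cwd() / "cif-test")` is safe to call repeatedly, and `list_cif_files("cif-test")` gives a quick way to confirm that the input structures are in place before processing.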

## Step 3: Define Parameters
Adjust these parameters as needed before running the main script.
- **folder_name**: Name of the directory containing `.cif` files.
- **res_threshold**: Minimum number of residues to distinguish proteins from peptides.
- **mutation**: Set to `True` if mutation analysis is needed.
- **pdb**: PDB ID for mutation analysis (used only if `mutation = True`).
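
To illustrate what `res_threshold` controls, the sketch below labels a chain by its length. It is an assumption about the comparison rather than PDB-CAT's own code: treating a chain with exactly `res_threshold` residues as a protein subunit follows from reading the threshold as a minimum, and `label_chain` is a hypothetical name.

```python
def label_chain(sequence, res_threshold=20):
    """Label a one-letter amino-acid sequence as 'protein' or 'peptide' by length.

    Assumption: chains with at least `res_threshold` residues count as protein
    subunits; shorter chains count as peptides.
    """
    return "protein" if len(sequence) >= res_threshold else "peptide"
```

With the default threshold of 20, a 19-residue chain would be labeled a peptide and a 20-residue chain a protein subunit.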

In [None]:
"""
=========
INITIAL INFORMATION. CHANGE THE CONTENT OF THESE VARIABLES IF NECESSARY
=========
"""

# Name of the folder with the cif files to process
folder_name = "cif-test"
# Choose a threshold for the number of amino acids, to discriminate between peptides and the subunits of the protein
res_threshold = 20
# Analyze mutations. True or False
mutation = False
# PDB code of the protein to analyze. If mutation is False, this variable is not used.
pdb = " "
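
A mistake in these variables only surfaces later, during processing, so it can help to check them up front. The sketch below is an illustration and not part of PDB-CAT: `check_parameters` is a hypothetical helper, and the rule that a `<pdb>.fasta` file must exist in the input folder when `mutation` is `True` is inferred from how the main script builds the FASTA path.

```python
import os


def check_parameters(folder_name, res_threshold, mutation, pdb, base_dir="."):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # The input folder must exist before any .cif file can be read
    folder = os.path.join(base_dir, folder_name)
    if not os.path.isdir(folder):
        problems.append(f"folder not found: {folder}")

    # The threshold is a residue count, so it must be a positive integer
    # (bool is rejected explicitly because it is a subclass of int)
    valid_int = isinstance(res_threshold, int) and not isinstance(res_threshold, bool)
    if not valid_int or res_threshold < 1:
        problems.append("res_threshold must be a positive integer")

    # Assumption: mutation analysis needs a PDB code and its reference FASTA file
    if mutation:
        if not pdb.strip():
            problems.append("pdb must be set when mutation is True")
        else:
            fasta = os.path.join(folder, f"{pdb}.fasta")
            if not os.path.isfile(fasta):
                problems.append(f"reference FASTA not found: {fasta}")

    return problems
```

Calling `check_parameters(folder_name, res_threshold, mutation, pdb)` right after the parameter cell and printing the returned list gives early feedback without touching the main script.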


## Step 4: Process the CIF Data
The main script processes the `.cif` files in the selected folder and classifies structures based on the defined criteria.
No modifications are required in this section.

In [None]:
"""
===================================================================================================================================================
"""

# Path to the folder with the cif files to process
directory_path = os.getcwd() + "/" + folder_name
# Path and name of the FIRST csv output file (protein-centered) (.csv)
out_file = f"df-{folder_name}.csv"
# Path and name of the SECOND csv output file (ligand-centered) (.csv)
out_file_ligands = f"df-ligand-{folder_name}.csv"
# Path for the new categorizing folders
output_path = f"{folder_name}-out/"


"""
===================================================================================================================================================
"""

"""
MAIN CODE. YOU DO NOT NEED TO CHANGE THIS PART
"""
start_time = time.time()
# Path to the blacklist file, which contains the codes of the small molecules not considered ligands
blacklist, blacklist_dict = read_blacklist("./blacklist.txt")

# READ THE REFERENCE SEQUENCES from the FASTA file.
if mutation:
    fasta_file = f"{directory_path}/{pdb}.fasta"
    sequences_dict = extract_sequences(fasta_file)
else:
    sequences_dict = None

## OUTPUT
write_output(directory_path, out_file, out_file_ligands, blacklist_dict, mutation, blacklist, sequences_dict, res_threshold)

# Classify whether there is a mutation
if not mutation:
    # No mutation analysis: treat every file as non-mutated and strip its extension (e.g. ".cif")
    no_mutated_list = [os.path.splitext(filename)[0] for filename in os.listdir(directory_path)]
else:
    no_mutated_list, non_mut_path = mutation_classification(directory_path, out_file, output_path)
    output_path = non_mut_path

# Classify according to the type of bond
bond_classification(directory_path, out_file, no_mutated_list, output_path, mutation)
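
When `mutation` is `True`, the script above expects a reference FASTA file named after the PDB code inside the input folder. The sketch below shows one way such a file could be read into a dictionary using only the standard library. It is not the `extract_sequences` function from PDB-CAT, whose return format is not shown here; `read_fasta` is a hypothetical stand-in.

```python
def read_fasta(path):
    """Parse a FASTA file into {header: sequence}; headers lose the leading '>'."""
    sequences = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # Skip blank lines between records
            if line.startswith(">"):
                header = line[1:]
                sequences[header] = []
            elif header is not None:
                # Sequence lines may be wrapped, so collect the pieces per record
                sequences[header].append(line)
    return {name: "".join(parts) for name, parts in sequences.items()}
```

Inspecting the result of `read_fasta(f"{directory_path}/{pdb}.fasta")` before a run is a quick way to confirm that the reference file is present and contains the expected chains.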