# Essential Toolkit for DTI Project Implementation

Since you've chosen a modern, heterogeneous model (GNN for drug, CNN/RNN for protein), you'll need a specialized stack of libraries. PyTorch is generally preferred in the research community for its flexibility in defining custom architectures like this.

Here is the essential toolkit for implementing your GNN/CNN Drug-Target Interaction (DTI) Model:


### 1. Data Handling & Cheminformatics (Phase 1 & 2 Core)

| Library | Purpose | Key Functionality |
|---|---|---|
| RDKit | Core cheminformatics | Converts SMILES strings to Mol objects, extracts atom/bond features, generates molecular graphs, and calculates molecular descriptors/fingerprints. Essential for the drug side. |
| Pandas / NumPy | Data cleaning and manipulation | Handles large tabular datasets (ChEMBL/BindingDB exports), performs data filtering and label assignment, and provides efficient numerical operations. |
| Biopython | Protein sequence handling | You'll mostly handle sequences as plain strings, but Biopython provides utilities for parsing sequence files (FASTA) and basic sequence manipulation if needed. |
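Because protein sequences are mostly handled as plain strings, it can help to see how little a FASTA file actually contains. The following is a minimal sketch in plain Python, not a replacement for Biopython's parser; the function name `read_fasta`, the record IDs, and the toy sequences are all made up for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dict."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may wrap across several lines, so concatenate them.
            records[current_id] += line.upper()
    return records


# Toy input with hypothetical IDs and made-up sequences.
example_lines = """>prot_A toy example header
MKTAYIAKQR
QISFVK
>prot_B
mstnpkpq
""".splitlines()

sequences = read_fasta(example_lines)
```

For real datasets with unusual headers or very large files, Biopython's FASTA utilities are likely the safer choice.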

### 2. Deep Learning Frameworks (Phase 3 & 4 Core)

| Library | Purpose | Key Functionality |
|---|---|---|
| PyTorch | Core deep learning framework | Defines the full model as an `nn.Module`, supplies standard layers such as `nn.Conv1d` for the protein CNN branch, and provides `Dataset`/`DataLoader` for batching. Generally preferred in research for its flexibility with custom architectures. |
| PyTorch Geometric | Graph neural networks | Provides the GNN components for the drug branch, operating on the molecular graphs built from RDKit output. |

### 3. Protein Sequence Encoding (Phase 3)

| Library | Purpose | Key Functionality |
|---|---|---|
| PyTorch (`nn.Conv1d`, recurrent layers) | Protein sequence encoder | Encodes the protein sequence, handled as a string of residues, with a 1D CNN or RNN branch whose output is combined with the drug GNN representation. |
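Before a CNN or RNN can consume a protein sequence, the string has to be turned into fixed-length integer indices. The sketch below shows one common way to do that; the vocabulary layout (index 0 for padding, 1 for unknown residues), the name `encode_sequence`, and the `max_len` parameter are assumptions for illustration, not something this guide prescribes.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD_IDX = 0  # assumption: index 0 is reserved for padding
UNK_IDX = 1  # assumption: index 1 covers any non-standard symbol (X, B, Z, ...)
VOCAB: Dict[str, int] = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(seq: str, max_len: int) -> List[int]:
    """Map residues to integer indices, truncating or right-padding to max_len."""
    ids = [VOCAB.get(ch, UNK_IDX) for ch in seq.upper()[:max_len]]
    ids += [PAD_IDX] * (max_len - len(ids))
    return ids


# "X" is not a standard residue, so it maps to UNK_IDX; the tail is padding.
encoded = encode_sequence("MKTX", max_len=6)
print(encoded)  # [12, 10, 18, 1, 0, 0]
```

The resulting index lists can be converted to a tensor and passed through an embedding layer ahead of the convolutional or recurrent layers.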

### 4. Utility, Optimization, and Evaluation (Phase 4 & 5)

| Library | Purpose | Key Functionality |
|---|---|---|
| Scikit-learn (sklearn) | Metrics and baseline models | Calculates final evaluation metrics (AUROC, AUPRC, $R^2$, etc.) on the test set. Also useful for initial data splitting and baseline models. |
| Optuna / WandB (Weights & Biases) | Hyperparameter tuning and experiment tracking | Optuna automates hyperparameter optimization (Bayesian-style search with pruning of unpromising trials). WandB logs, visualizes, and monitors training runs (loss, metrics, model versions). |
| Matplotlib / Seaborn | Visualization | Generates plots of data distributions, attention/saliency maps, ROC curves, and t-SNE/UMAP visualizations in the Interpretation Phase (Phase 5). |
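To make the AUROC metric concrete: it equals the probability that a randomly chosen positive pair is scored higher than a randomly chosen negative pair, with ties counted as half. The sketch below computes it by brute force in plain Python purely to illustrate that definition; the function name `auroc` and the toy labels and scores are made up, and in a real project scikit-learn's `roc_auc_score` would normally be used instead.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy binary interaction labels and predicted scores.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auroc)  # 0.75
```

This pairwise loop is quadratic in the number of samples, so it is only suitable for small sanity checks.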

A common project setup is:

1. Use RDKit to parse each SMILES string and build a graph object from its atoms and bonds.

2. Use a custom PyTorch `Dataset` and `DataLoader` to manage the heterogeneous (graph + sequence) batches.

3. Define the model as a single PyTorch `nn.Module`, using PyTorch Geometric components for the GNN side and standard PyTorch `nn.Conv1d` for the CNN side.

4. Track all experiments using WandB.
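The batching step is the least obvious part of this setup, because graphs have different numbers of atoms and sequences have different lengths. The framework-agnostic sketch below shows the underlying logic with plain Python lists: graphs are merged into one large disjoint graph by offsetting node indices, and sequences are right-padded to the longest one in the batch. The sample layout (`x`, `edges`, `seq`, `y`) and the function name `collate` are assumptions for illustration; in practice the outputs would be tensors, and PyTorch Geometric's own loader performs this kind of graph merging for you.

```python
from typing import Any, Dict, List, Tuple


def collate(samples: List[Dict[str, Any]], pad_idx: int = 0) -> Dict[str, Any]:
    """Merge (graph, sequence, label) samples into a single batch."""
    node_feats: List[List[float]] = []
    edges: List[Tuple[int, int]] = []
    graph_ids: List[int] = []  # which graph each node belongs to
    offset = 0
    for g_idx, s in enumerate(samples):
        n_nodes = len(s["x"])
        node_feats.extend(s["x"])
        # Shift node indices so each graph occupies its own index range.
        edges.extend((u + offset, v + offset) for u, v in s["edges"])
        graph_ids.extend([g_idx] * n_nodes)
        offset += n_nodes
    # Right-pad every sequence to the longest one in this batch.
    max_len = max(len(s["seq"]) for s in samples)
    seqs = [s["seq"] + [pad_idx] * (max_len - len(s["seq"])) for s in samples]
    return {
        "x": node_feats,
        "edges": edges,
        "graph_ids": graph_ids,
        "seq": seqs,
        "y": [s["y"] for s in samples],
    }


# Two toy samples: a 3-atom and a 2-atom graph with made-up features.
toy_samples = [
    {"x": [[1.0], [0.0], [1.0]], "edges": [(0, 1), (1, 2)], "seq": [12, 10, 18], "y": 1},
    {"x": [[0.0], [1.0]], "edges": [(0, 1)], "seq": [2, 3], "y": 0},
]
batch = collate(toy_samples)
```

The `graph_ids` vector is what lets a pooling step later reduce node embeddings back to one vector per molecule, which can then be combined with the per-protein sequence embedding.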