# Road map

The workflow moves from data acquisition through model training to evaluation and deployment. Once the model has been evaluated, we can also build a web application around it.

### Phase 1: Data Acquisition and Preprocessing
The goal here is to gather and clean the raw data needed to train the model.
##### Step 1.1: Data Sourcing & Collection
- Identify and download DTI data from public databases (e.g., ChEMBL, BindingDB, or DrugBank).
- Focus on obtaining pairs of Small Molecule ID/SMILES and Protein Target ID/Sequence along with a corresponding Interaction Label (e.g., Active/Positive/Binding or Inactive/Negative/Non-binding, or a quantitative affinity value like Ki or IC50).
##### Step 1.2: Data Cleaning & Filtering
- Standardize the data: Remove duplicate entries. Filter out targets without a valid full sequence and drugs without a valid SMILES string.
- Handle Imbalance: DTI datasets are often imbalanced (far more non-interactions than interactions). Choose a strategy, e.g., over- or undersampling, synthetic data generation, or a class-weighted loss function.
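
The cleaning rules above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the record layout (`smiles`, `sequence`, `label` keys) and the function names are assumptions, the SMILES check only rejects empty strings (a real pipeline would parse each string with a cheminformatics toolkit), and `positive_class_weight` shows just one of the imbalance strategies, a weight for a class-weighted loss.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def clean_records(records):
    """Drop duplicates and rows with a missing SMILES or an invalid sequence."""
    seen = set()
    cleaned = []
    for rec in records:
        smiles = (rec.get("smiles") or "").strip()
        sequence = (rec.get("sequence") or "").strip().upper()
        if not smiles or not sequence:
            continue  # missing drug or target
        if not set(sequence) <= VALID_AA:
            continue  # sequence contains a non-standard residue
        key = (smiles, sequence)
        if key in seen:
            continue  # duplicate pair: keep the first occurrence only
        seen.add(key)
        cleaned.append({"smiles": smiles, "sequence": sequence, "label": rec["label"]})
    return cleaned


def positive_class_weight(records):
    """Ratio of negatives to positives, usable as a weight on the positive class."""
    n_pos = sum(1 for r in records if r["label"] == 1)
    n_neg = len(records) - n_pos
    return n_neg / n_pos if n_pos else 1.0


# Made-up example rows, for illustration only.
raw = [
    {"smiles": "CCO", "sequence": "MKTAYIAK", "label": 1},
    {"smiles": "CCO", "sequence": "MKTAYIAK", "label": 1},  # exact duplicate
    {"smiles": "", "sequence": "MKTAYIAK", "label": 0},     # missing SMILES
    {"smiles": "CCN", "sequence": "MKT?YIAK", "label": 0},  # invalid residue
    {"smiles": "CCN", "sequence": "MKTAYIAK", "label": 0},
    {"smiles": "CCC", "sequence": "GAVLI", "label": 0},
]
cleaned = clean_records(raw)
print(len(cleaned), positive_class_weight(cleaned))
```

Note that this sketch keeps the first copy of a duplicated pair and does not reconcile duplicates that carry conflicting labels; that policy would need to be decided separately.
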
##### Step 1.3: Data Splitting
- Split the cleaned dataset into three distinct sets: Training Set (for model fitting), Validation Set (for hyperparameter tuning), and Test Set (for final unbiased evaluation).
- Split in a way that avoids data leakage: for example, split by unique drugs or targets so that the same molecule or protein never appears in both the training and test sets.
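
One way to avoid leakage is a "cold-drug" split, where whole drugs (rather than individual pairs) are assigned to a set. The sketch below assumes records are dicts with a `smiles` key; the function name, the 80/10/10 proportions, and the toy records are illustrative choices, not part of the original plan. Splitting by target would follow the same pattern with the `sequence` key.

```python
import random


def split_by_drug(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign every record of a given drug to exactly one of train/val/test."""
    drugs = sorted({r["smiles"] for r in records})
    random.Random(seed).shuffle(drugs)  # seeded for reproducibility
    n_test = max(1, round(len(drugs) * test_frac))
    n_val = max(1, round(len(drugs) * val_frac))
    test_drugs = set(drugs[:n_test])
    val_drugs = set(drugs[n_test:n_test + n_val])

    split = {"train": [], "val": [], "test": []}
    for r in records:
        if r["smiles"] in test_drugs:
            split["test"].append(r)
        elif r["smiles"] in val_drugs:
            split["val"].append(r)
        else:
            split["train"].append(r)
    return split


# Toy data: 10 made-up drugs, each paired with 2 made-up targets.
records = [
    {"smiles": f"C{'C' * i}O", "sequence": seq, "label": (i + j) % 2}
    for i in range(10)
    for j, seq in enumerate(["MKTAYIAK", "GAVLIPFW"])
]
split = split_by_drug(records)
print({name: len(rows) for name, rows in split.items()})
```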

### Phase 2: Feature Engineering (The Encoding Core)
This phase transforms the raw biological data into numerical formats suitable for a deep learning model.
##### Step 2.1: Drug Molecule Encoding
- Convert SMILES strings into a numerical representation:
  - Option A (GNN input): Generate a Molecular Graph (nodes = atoms, edges = bonds) for input into a Graph Neural Network.
  - Option B (Traditional DL): Calculate molecular descriptors (e.g., physicochemical properties) or use a fixed-size Morgan Fingerprint (a type of bit vector).
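
To make Option A concrete, the sketch below builds the two arrays a GNN typically consumes: a node feature matrix and an adjacency matrix. It is a hand-written toy under stated assumptions: the atoms and bonds of ethanol are typed in manually instead of being parsed from SMILES (a real pipeline would use a cheminformatics toolkit for that), and `ATOM_TYPES` and `build_graph` are hypothetical names with a deliberately tiny atom vocabulary.

```python
ATOM_TYPES = ["C", "N", "O", "S"]  # illustrative vocabulary, far from exhaustive


def build_graph(atoms, bonds):
    """Return (node_features, adjacency) for an undirected molecular graph.

    atoms: list of element symbols, one per node
    bonds: list of (i, j) index pairs, one per edge
    """
    # One-hot encode each atom's element as its node feature vector.
    node_features = [[1.0 if a == t else 0.0 for t in ATOM_TYPES] for a in atoms]
    n = len(atoms)
    adjacency = [[0] * n for _ in range(n)]
    for i, j in bonds:
        adjacency[i][j] = 1
        adjacency[j][i] = 1  # bonds are undirected
    return node_features, adjacency


# Ethanol (SMILES "CCO"), heavy atoms only, written out by hand.
x, adj = build_graph(["C", "C", "O"], [(0, 1), (1, 2)])
print(x)
print(adj)
```
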
##### Step 2.2: Protein Sequence Encoding
- Convert the Amino Acid sequence into a numerical matrix:
  - Option A (DL input): Use One-Hot Encoding, rows of a BLOSUM substitution matrix, or a position-specific scoring matrix (PSSM) to represent each amino acid as a vector for CNN/RNN input.
  - Option B (Embeddings): Use pre-trained protein language models (like ProtT5 or ESM) to generate high-dimensional embeddings for the sequence.
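
A minimal sketch of the one-hot variant of Option A follows. The function name and the `max_len` padding/truncation policy are assumptions made for illustration; unknown or non-standard residues are left as all-zero rows here, which is one possible convention among several.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 matrix.

    Sequences longer than max_len are truncated; shorter ones are zero-padded.
    """
    matrix = [[0.0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(sequence[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:  # unknown residues stay all-zero
            matrix[pos][idx] = 1.0
    return matrix


encoded = one_hot_encode("MKT", max_len=5)
print(len(encoded), len(encoded[0]))
```
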
##### Step 2.3: Data Pipeline Creation
- Create a robust pipeline (e.g., using PyTorch/TensorFlow DataLoaders) to efficiently feed the encoded drug and protein features in batches to the model during training.
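
The core job of such a pipeline (shuffle, slice into batches, group the fields) can be illustrated without any framework. The generator below is a dependency-free stand-in for what a PyTorch or TensorFlow loader would do, not a replacement for one; the sample layout (`drug`, `protein`, `label` keys) and the function name are assumptions.

```python
import random


def iterate_batches(samples, batch_size, shuffle=True, seed=0):
    """Yield (drugs, proteins, labels) tuples of at most batch_size samples."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)  # seeded so runs are repeatable
    for start in range(0, len(order), batch_size):
        batch = [samples[i] for i in order[start:start + batch_size]]
        drugs = [s["drug"] for s in batch]
        proteins = [s["protein"] for s in batch]
        labels = [s["label"] for s in batch]
        yield drugs, proteins, labels


# Placeholder features: one number per drug and per protein.
samples = [
    {"drug": [float(i)], "protein": [float(-i)], "label": i % 2}
    for i in range(10)
]
batches = list(iterate_batches(samples, batch_size=4))
print([len(labels) for _, _, labels in batches])
```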

### Phase 3: Model Architecture Design and Implementation
This is where the deep learning model is built according to the proposed architecture.
##### Step 3.1: Define Sub-Models (Feature Learners)
- Drug Module: Implement a Graph Neural Network (GNN) (e.g., GCN, GAT) to process the molecular graph, outputting a fixed-size drug feature vector.
- Protein Module: Implement a 1D Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN/LSTM) to process the protein sequence matrix, outputting a fixed-size protein feature vector.
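
Both modules must map variable-size inputs to a fixed-size vector. For the protein side, the sketch below shows how a 1D convolution followed by global max pooling does this: the output length equals the number of filters, regardless of sequence length. It is written in plain Python for clarity, with hand-picked filter values and hypothetical names; a real model would use a deep learning framework with learned filters and a non-linearity.

```python
def conv1d_global_max(matrix, kernels):
    """Apply each filter along the sequence axis, then keep its maximum response.

    matrix:  L x C sequence encoding (L positions, C channels), with L >= filter width
    kernels: list of K x C filters
    Returns one value per filter, so the output size does not depend on L.
    """
    features = []
    for kernel in kernels:
        k = len(kernel)
        responses = []
        for start in range(len(matrix) - k + 1):
            window = matrix[start:start + k]
            responses.append(
                sum(w * x
                    for kernel_row, input_row in zip(kernel, window)
                    for w, x in zip(kernel_row, input_row))
            )
        features.append(max(responses))  # global max pooling
    return features


# Two toy sequences of different lengths over a 2-letter alphabet.
seq_a = [[1, 0], [0, 1], [1, 0]]
seq_b = [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1]]
kernels = [
    [[1, 0], [0, 1]],  # fires on the pattern "A then B"
    [[0, 1], [1, 0]],  # fires on the pattern "B then A"
]
vec_a = conv1d_global_max(seq_a, kernels)
vec_b = conv1d_global_max(seq_b, kernels)
print(vec_a, vec_b)
```
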
##### Step 3.2: Implement the Fusion Mechanism
- Concatenation: Combine the two feature vectors (drug and protein) into a single, longer vector.
- Attention Mechanism (Advanced): Implement a co-attention mechanism to allow the model to learn which parts of the drug and protein are most important for binding.
##### Step 3.3: Implement the Prediction Head
- Add a final feed-forward network (FFN), i.e., one or more fully connected layers, on top of the fused vector.
- The final layer uses a Sigmoid activation for binary classification (Binding/Non-Binding) or a linear/softplus activation for regression (predicting affinity).
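
The concatenation fusion and a classification head can be sketched together. The example below uses a single linear layer followed by a sigmoid; the weights are placeholder values rather than trained parameters, and `predict_binding` is a hypothetical name. A real head would usually stack several layers inside a deep learning framework.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_binding(drug_vec, protein_vec, weights, bias):
    """Concatenate the two feature vectors and apply a linear layer + sigmoid."""
    fused = drug_vec + protein_vec  # list concatenation = feature fusion
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused vector length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(logit)


drug_vec = [0.5, -1.0]                 # stand-in for the drug module output
protein_vec = [2.0, 0.0, 1.0]          # stand-in for the protein module output
weights = [1.0, 1.0, 0.5, 0.0, -0.5]   # placeholder values, not trained
prob = predict_binding(drug_vec, protein_vec, weights, bias=0.0)
print(prob)
```

For regression, the sigmoid would be dropped (linear output) or replaced by softplus, as noted above.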

### Phase 4: Model Training and Optimization
This is the core phase, in which the model learns to predict interactions.
##### Step 4.1: Model Training
- Select an Optimizer (e.g., Adam or SGD) and a Loss Function (e.g., Binary Cross-Entropy for classification or Mean Squared Error for regression).
- Train the model iteratively, monitoring the loss and performance metric (e.g., AUROC, AUPRC) on the Validation Set after each epoch.
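
For reference, here is what the Binary Cross-Entropy loss computes, written out in plain Python. The optional `pos_weight` argument is an assumption added for illustration: it shows one way the class imbalance from Phase 1 could be handled in the loss. Frameworks provide their own optimized versions of this function.

```python
import math


def binary_cross_entropy(probs, labels, pos_weight=1.0, eps=1e-7):
    """Mean BCE over a batch; pos_weight > 1 up-weights the positive class."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Made-up predictions: one positive scored 0.9, one negative scored 0.2.
loss = binary_cross_entropy([0.9, 0.2], [1, 0])
weighted_loss = binary_cross_entropy([0.9, 0.2], [1, 0], pos_weight=2.0)
print(loss, weighted_loss)
```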
##### Step 4.2: Hyperparameter Tuning
- Use techniques like Grid Search or Bayesian Optimization to find the hyperparameters (e.g., learning rate, batch size, number of layers, hidden dimension size) that maximize performance on the validation set.
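
Grid Search amounts to evaluating every combination of candidate values and keeping the best one. In the sketch below, `toy_validation_score` is a made-up stand-in for "train a model with these settings and return its validation metric"; the grid values are likewise illustrative.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # higher is better, e.g. validation AUROC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    # Hypothetical scoring function that peaks at lr=1e-3, hidden_dim=128.
    return (1.0
            - abs(params["learning_rate"] - 1e-3) * 100
            - abs(params["hidden_dim"] - 128) / 1000)


grid = {"learning_rate": [1e-2, 1e-3, 1e-4], "hidden_dim": [64, 128, 256]}
best_params, best_score = grid_search(grid, toy_validation_score)
print(best_params, best_score)
```

The number of runs grows multiplicatively with each added hyperparameter, which is the usual reason to move to Bayesian Optimization for larger search spaces.
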
##### Step 4.3: Regularization and Early Stopping
- Apply Regularization (e.g., Dropout, L2 penalty) to prevent overfitting.
- Implement Early Stopping based on the validation loss to halt training when performance plateaus or degrades.
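
A patience-based Early Stopping rule can be sketched as a small helper class. The class name, the `patience` and `min_delta` parameters, and the validation losses are all illustrative assumptions. The helper also records the best epoch, which is the checkpoint one would keep.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch  # this is where weights would be saved
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]  # made-up values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch, stopper.best_loss)
```

In this made-up run, training halts before the last value (0.49) is seen, which illustrates the trade-off in choosing `patience`: too small a value can stop before a later improvement.
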
##### Step 4.4: Final Model Selection
- Save the model weights that achieved the best performance on the validation set.

### Phase 5: Evaluation and Interpretation (The Validation)
The final phase is to rigorously test the model and understand its predictions.
##### Step 5.1: Performance Evaluation
- Run the best-performing model on the completely unseen Test Set.
- Calculate key classification metrics: Area Under the Receiver Operating Characteristic curve (AUROC), Area Under the Precision-Recall Curve (AUPRC), Accuracy, Precision, Recall, and F1-Score.
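
Some of these metrics are simple enough to compute by hand, which helps when sanity-checking library output. The sketch below computes AUROC from its pairwise definition (the probability that a random positive is scored above a random negative) and Precision, Recall, and F1 at a fixed threshold. The function names, the 0.5 threshold, and the scores are assumptions; the pairwise AUROC is quadratic in dataset size, so a metrics library is the practical choice for real test sets.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(scores, labels, threshold=0.5):
    """Threshold the scores, then compute precision, recall, and F1."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up test-set predictions.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auroc(scores, labels))
print(precision_recall_f1(scores, labels))
```
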
##### Step 5.2: Interpretation and Visualization
- Use Attention Weights (if an attention mechanism was used) to visualize and identify which amino acids on the protein and which atoms/bonds on the drug molecule were most critical to the binding prediction. This provides biological interpretability.
- Visualize Embeddings (e.g., using t-SNE or UMAP) to check if the model is grouping similar drugs/targets in the latent space.
##### Step 5.3: Reporting and Future Work
- Document the entire process, including data sources, model architecture, training parameters, and final test metrics.
- Discuss limitations and suggest avenues for improvement (e.g., integrating 3D structure data, ensemble modeling).

