# Phase 1: Data Acquisition and Preprocessing
This phase is the foundation of the whole pipeline. Errors or biases introduced here will cripple even the best deep learning model, regardless of how complex its architecture is. The goal is to turn unstructured, messy, large-scale public data into a standardized, clean, and balanced set of (SMILES, sequence, label) triplets.

This phase is broken down into three critical steps: **sourcing**, **cleaning**, and **finalizing**.

### 1. Data Sourcing and Collection (The Raw Input)
The primary task here is to query a high-quality public database to collect the initial set of Drug-Target Interaction (DTI) records.
- **1.1. Identify Target Space:** Select a protein family or a single target (like a specific kinase, as used in the code example) to restrict the scope. This is crucial for controlling the data volume and ensuring some structural homogeneity among targets.
- **1.2. Query Bioactivity Database:** Use the API or client library of a database such as ChEMBL (e.g., chembl_webresource_client) or BindingDB.
    - Data Points to Extract: For every interaction, you must extract:
        - Molecule Identifier/SMILES: The compound (drug candidate) representation.
        - Target Identifier/Accession: The protein ID (e.g., UniProt ID).
        - Bioactivity Value: The measured affinity, typically IC50 (half maximal inhibitory concentration) or Kd (dissociation constant).
- **1.3. Initial Filtering (Quantitative Data):** Only retain records where the affinity value is an exact measurement (e.g., relation='=' in ChEMBL) and the unit is standardized (e.g., nM). Ignore entries with relations like '>' or '<', which are less precise.
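
The filter in step 1.3 can be sketched in plain Python as follows. This is a minimal illustration, not a full query pipeline: the field names follow ChEMBL's activity records, while the identifiers and values are made up, and the helper name `keep_exact_nm` is hypothetical.

```python
# Hypothetical records shaped like ChEMBL activity rows.
# Field names follow ChEMBL; identifiers and values are invented for illustration.
records = [
    {"molecule_chembl_id": "CHEMBL_A", "canonical_smiles": "CCO",
     "target_chembl_id": "CHEMBL_T1", "standard_type": "IC50",
     "standard_relation": "=", "standard_value": "250", "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_B", "canonical_smiles": "CCN",
     "target_chembl_id": "CHEMBL_T1", "standard_type": "IC50",
     "standard_relation": ">", "standard_value": "10000", "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_C", "canonical_smiles": "CCC",
     "target_chembl_id": "CHEMBL_T1", "standard_type": "IC50",
     "standard_relation": "=", "standard_value": "0.5", "standard_units": "uM"},
    {"molecule_chembl_id": "CHEMBL_D", "canonical_smiles": "CCCl",
     "target_chembl_id": "CHEMBL_T1", "standard_type": "IC50",
     "standard_relation": "=", "standard_value": None, "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_E", "canonical_smiles": "CCBr",
     "target_chembl_id": "CHEMBL_T1", "standard_type": "Kd",
     "standard_relation": "=", "standard_value": "12", "standard_units": "nM"},
]


def keep_exact_nm(record, allowed_types=("IC50", "Kd")):
    """Return True for exact ('=') measurements in nM with a positive numeric value."""
    if record.get("standard_type") not in allowed_types:
        return False
    if record.get("standard_relation") != "=":
        return False
    if record.get("standard_units") != "nM":
        return False
    try:
        value = float(record.get("standard_value"))
    except (TypeError, ValueError):
        # Missing or non-numeric values cannot be used as labels.
        return False
    return value > 0


filtered = [r for r in records if keep_exact_nm(r)]
kept_ids = [r["molecule_chembl_id"] for r in filtered]
```

Dropping the record reported in uM is the simplest choice for a sketch; converting such values to nM instead would retain more data.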

### 2. Data Cleaning, Standardization, and Labeling (The Conversion)
This step transforms the raw data into a machine-learning-ready format and defines the prediction label.
- **2.1. Standardize Drug Molecules (SMILES Cleaning):**
    - Canonicalization: A single molecule can have multiple valid SMILES strings. Use a cheminformatics library like RDKit to convert all SMILES into their Canonical SMILES form. This ensures that every unique molecule has one consistent input string.
    - Sanitization: Strip salts and explicit solvent molecules, and standardize charges and tautomers. This prevents the model from learning features related to experimental preparation instead of intrinsic chemical structure.
- **2.2. Retrieve Target Sequences:** Use the protein ID (e.g., UniProt accession) to fetch the full amino acid sequence for the target protein. This is the raw sequence that will be encoded by the CNN/RNN module.
- **2.3. Create the Classification Label (Crucial Step):** The DTI task is often framed as binary classification (Binder vs. Non-binder).
    - Thresholding: Convert the quantitative affinity value (e.g., IC50) into a pIC50 value, pIC50 = -log10(IC50), with IC50 expressed in molar units. For a value reported in nM this is pIC50 = 9 - log10(IC50).
    - Define a cutoff: Pairs with affinity better than a standard threshold (e.g., pIC50 ≥ 6, corresponding to 1 µM affinity) are labeled as Positive (1). Pairs with affinity worse than a separate, lower threshold are labeled as Negative (0).
- **2.4. Handling Imbalance and Negative Samples:**
    - Bioactivity data is inherently imbalanced, heavily skewed towards known binders (Positive samples). A robust model needs reliable Negative (Non-interacting) samples.
    - Strategy: Generate Negative samples by combining random drug molecules with random protein targets where no interaction is known (these are presumed, not confirmed, non-binders), or by using pairs that were explicitly tested and found to be inactive.
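
Steps 2.3 and 2.4 can be sketched with the standard library alone. The function names are hypothetical, and the negative cutoff of pIC50 < 5 (10 µM) is an assumption chosen for illustration; the text above only requires that it be lower than the positive cutoff.

```python
import math
import random


def pic50_from_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L); for an nM input this is 9 - log10(IC50_nM)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def label_pair(ic50_nm, pos_cutoff=6.0, neg_cutoff=5.0):
    """Return 1 (binder), 0 (non-binder), or None for the ambiguous zone in between.

    neg_cutoff=5.0 is an assumed value, not one prescribed by the text.
    """
    p = pic50_from_nm(ic50_nm)
    if p >= pos_cutoff:
        return 1
    if p < neg_cutoff:
        return 0
    return None  # between the two cutoffs: leave unlabeled


def sample_random_negatives(drugs, targets, known_pairs, n, seed=0):
    """Draw up to n (drug, target) pairs that have no record in known_pairs.

    These are presumed negatives: a missing record is not proof of inactivity.
    Enumerating all pairs is fine for a small sketch but not for large datasets.
    """
    rng = random.Random(seed)
    candidates = [(d, t) for d in drugs for t in targets
                  if (d, t) not in known_pairs]
    rng.shuffle(candidates)
    return candidates[:n]


# Placeholder drugs and target names, for illustration only.
drugs = ["CCO", "CCN", "CCC"]
targets = ["P_A", "P_B"]
known_pairs = {("CCO", "P_A"), ("CCN", "P_B")}
negatives = sample_random_negatives(drugs, targets, known_pairs, n=3)
```

Pairs that fall between the two cutoffs are left unlabeled here, which keeps borderline measurements out of both classes. Whether to drop them or assign them to a class is a design decision for the dataset.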

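For the SMILES standardization in step 2.1, a minimal RDKit sketch might look like the following. It assumes RDKit is installed, and the helper name `standardize_smiles` is hypothetical. Keeping the largest fragment is used here as a simple stand-in for salt and solvent removal, and tautomer standardization is left out for brevity.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Reusable standardizer objects from RDKit's MolStandardize module.
_fragment_chooser = rdMolStandardize.LargestFragmentChooser()
_uncharger = rdMolStandardize.Uncharger()


def standardize_smiles(smiles):
    """Return canonical SMILES of the largest, neutralized fragment, or None if unparsable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for SMILES it cannot parse or sanitize.
        return None
    mol = _fragment_chooser.choose(mol)  # drop counter-ions and small fragments
    mol = _uncharger.uncharge(mol)       # neutralize charges where possible
    return Chem.MolToSmiles(mol)         # canonical SMILES by default


# Two spellings of ethanol should map to the same canonical string.
same_molecule = standardize_smiles("OCC") == standardize_smiles("CCO")
# Sodium acetate should reduce to the same string as acetic acid.
salt_stripped = standardize_smiles("CC(=O)[O-].[Na+]") == standardize_smiles("CC(=O)O")
```

The largest-fragment heuristic can pick the wrong component when a counter-ion is larger than the compound of interest, so results are worth spot-checking on a sample of the data.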
### 3. Finalization: Data Splitting (Preventing Leakage)
The final dataset must be partitioned carefully to ensure a fair test of the model's ability to generalize.
- **3.1. Training, Validation, and Testing Split:** Split the final, cleaned dataset into three sets (e.g., 80%, 10%, 10%).
    - Training Set: Used to update model weights.
    - Validation Set: Used for tuning hyperparameters and early stopping during training.
    - Test Set: Kept completely separate and untouched until the very end (Phase 5) for final, unbiased performance evaluation.
- **3.2. Stratified Splitting:** Use stratified splitting (e.g., from scikit-learn) to ensure that the ratio of Positive (1) to Negative (0) labels is roughly the same in all three sets.
- **3.3. Cold Start Split (Advanced):** For a more rigorous test, consider:
    - Cold Drug Split: Ensuring that the Test Set contains only drugs never seen in the Training/Validation sets. This tests the model's ability to generalize to new drugs.
    - Cold Target Split: Ensuring that the Test Set contains only protein targets never seen in the Training/Validation sets. This tests the model's ability to generalize to new targets.
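
The splitting strategies in 3.2 and 3.3 can be sketched with the standard library. This is a hand-rolled illustration with hypothetical function names and made-up rows. In practice, scikit-learn's `train_test_split` with its `stratify` argument covers the stratified case.

```python
import random
from collections import defaultdict


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), label_key="label", seed=0):
    """Split rows into train/val/test while keeping label ratios roughly equal."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def cold_split(rows, key="smiles", fractions=(0.8, 0.1, 0.1), seed=0):
    """Split by entity so that no drug (or target) appears in more than one set."""
    rng = random.Random(seed)
    entities = sorted({row[key] for row in rows})
    rng.shuffle(entities)
    n_train = round(fractions[0] * len(entities))
    n_val = round(fractions[1] * len(entities))
    bucket = {}
    for i, entity in enumerate(entities):
        bucket[entity] = 0 if i < n_train else (1 if i < n_train + n_val else 2)
    splits = ([], [], [])
    for row in rows:
        splits[bucket[row[key]]].append(row)
    return splits


# Made-up dataset: 20 placeholder drugs x 5 placeholder targets.
rows = [
    {"smiles": f"drug_{i}", "target": f"target_{j}",
     "label": 1 if (i + j) % 4 == 0 else 0}
    for i in range(20) for j in range(5)
]

train, val, test = stratified_split(rows)
cold_train, cold_val, cold_test = cold_split(rows, key="smiles")
```

Passing `key="target"` to `cold_split` gives the cold target variant. Note that a cold split does not by itself preserve label ratios, so the class balance of each resulting set is worth checking.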