# Phase 2: Feature Engineering (the Encoding core)

### Drug Molecule Encoding for GNN (SMILES to Graph)
The goal is to transform a SMILES string (a linear text representation of a molecule) into a molecular graph (a data structure that a GNN can consume directly).

**1. Toolkit Selection**

You'll need a cheminformatics library to parse the SMILES string and extract graph components.
- **Primary Tool:** RDKit (Open-source cheminformatics software).
- **Deep Learning/Graph Libraries:** PyTorch Geometric (PyG) or DeepChem (which often wraps RDKit functions).
 
**2. Step-by-Step Conversion**
- **Step A: SMILES Parsing to Mol Object**
The SMILES string is first converted into an RDKit Mol object, which holds the internal representation of the molecule's structure.

      from rdkit import Chem
      mol = Chem.MolFromSmiles("CCO")  # SMILES for ethanol; returns None if parsing fails
- **Step B: Node Feature Extraction (Atom Features)**
Each atom in the molecule becomes a node in your graph. You must assign a numerical feature vector to each node (X). This vector represents the atom's chemical properties.

|      Feature Name           |      Description                                              |      Example Values/Encoding                             |
|-----------------------------|---------------------------------------------------------------|----------------------------------------------------------|
|     Atom Type               |     Chemical element of the atom.                             |     One-hot encoded (e.g., C, N, O, S, F, P, etc.)       |
|     Atom Degree             |     Number of bonds the atom has (excluding implicit H's).    |     One-hot encoded (e.g., 0 to 5, or 'MoreThanFive')    |
|     Formal Charge           |     Electrical charge assigned to the atom.                   |     One-hot encoded (e.g., -2, -1, 0, 1, 2)              |
|     Aromaticity             |     Whether the atom is part of an aromatic ring.             |     Binary (1 or 0)                                      |
|     Chirality               |     Stereochemistry of the atom (R/S or unspecified).         |     One-hot encoded                                      |
|     Number of Implicit H    |     Number of hydrogens implied by the valence.               |     One-hot encoded (e.g., 0 to 4)                       |

The final node feature vector (X) for an atom is the concatenation of these individual encodings.
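
As a rough illustration of that concatenation, the sketch below one-hot encodes each property with an extra "other" slot and joins the pieces into a single vector. The vocabularies (`ATOM_TYPES`, `DEGREES`, and so on) and the helper names are assumptions made for this example, not a fixed standard. With RDKit, the inputs would typically come from atom accessors such as `GetSymbol()`, `GetDegree()`, `GetFormalCharge()`, `GetIsAromatic()`, `GetChiralTag()`, and `GetNumImplicitHs()`; here they are passed in as plain values so the example runs without RDKit.

```python
from typing import List, Sequence

# Hypothetical vocabularies; adjust them to the elements and ranges in your data.
ATOM_TYPES = ["C", "N", "O", "S", "F", "P", "Cl", "Br", "I"]
DEGREES = [0, 1, 2, 3, 4, 5]
FORMAL_CHARGES = [-2, -1, 0, 1, 2]
CHIRAL_TAGS = ["CHI_UNSPECIFIED", "CHI_TETRAHEDRAL_CW", "CHI_TETRAHEDRAL_CCW"]
NUM_IMPLICIT_HS = [0, 1, 2, 3, 4]


def one_hot(value, choices: Sequence) -> List[int]:
    """One-hot encode `value`; the last slot catches anything not in `choices`."""
    vec = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1
    return vec


def atom_features(symbol: str, degree: int, charge: int, aromatic: bool,
                  chiral_tag: str, num_hs: int) -> List[int]:
    """Concatenate the per-property encodings into one node feature vector."""
    return (
        one_hot(symbol, ATOM_TYPES)
        + one_hot(degree, DEGREES)
        + one_hot(charge, FORMAL_CHARGES)
        + [int(aromatic)]
        + one_hot(chiral_tag, CHIRAL_TAGS)
        + one_hot(num_hs, NUM_IMPLICIT_HS)
    )


# The oxygen of ethanol (CCO): one heavy-atom neighbour, neutral, one implicit H.
oxygen = atom_features("O", 1, 0, False, "CHI_UNSPECIFIED", 1)
```

Stacking one such vector per atom gives the node feature matrix X. The catch-all slot keeps the vector length fixed when a molecule contains an element or value outside the chosen vocabulary.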

- **Step C: Edge Definition (Bond Features & Adjacency)**
  - **Adjacency Matrix (A):** This defines the graph structure. It is a binary matrix where A_ij = 1 if atoms i and j are connected by a bond, and 0 otherwise. GNN libraries typically store it in sparse form as an edge index: a tensor of shape (2, N_edges) in which each column is a (source, target) pair of atom indices.
  - **Edge Features (E):** Each bond becomes an edge in your graph. You can assign a feature vector to each edge to describe the bond's properties (optional, but highly recommended for GNNs).

|      Feature Name      |      Description                                        |      Example Values/Encoding     |
|------------------------|---------------------------------------------------------|----------------------------------|
|     Bond Type          |     Single, double, triple, or aromatic.                |     One-hot encoded              |
|     Conjugation        |     Whether the bond is part of a conjugated system.    |     Binary (1 or 0)              |
|     Ring Membership    |     Whether the bond is part of a ring.                 |     Binary (1 or 0)              |
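
A minimal sketch of how the edge index and edge features could be assembled from a bond list is shown below. It assumes each bond is given as a tuple `(begin_atom, end_atom, bond_type, is_conjugated, in_ring)`; that tuple layout and the names `BOND_TYPES`, `bond_features`, and `build_edges` are hypothetical. With RDKit, the same values would typically be read from `mol.GetBonds()` via `GetBeginAtomIdx()`, `GetEndAtomIdx()`, `GetBondType()`, `GetIsConjugated()`, and `IsInRing()`.

```python
from typing import List, Tuple

BOND_TYPES = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]

# Assumed layout: (begin_atom, end_atom, bond_type, is_conjugated, in_ring)
Bond = Tuple[int, int, str, bool, bool]


def bond_features(bond_type: str, conjugated: bool, in_ring: bool) -> List[int]:
    """One-hot bond type followed by the two binary flags."""
    return [int(bond_type == t) for t in BOND_TYPES] + [int(conjugated), int(in_ring)]


def build_edges(bonds: List[Bond]) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (edge_index, edge_attr), adding each bond in both directions."""
    src, dst, feats = [], [], []
    for i, j, btype, conjugated, in_ring in bonds:
        f = bond_features(btype, conjugated, in_ring)
        src += [i, j]          # i -> j and j -> i
        dst += [j, i]
        feats += [f, f]        # both directions share the same bond features
    return [src, dst], feats


# Ethanol (CCO): atoms 0=C, 1=C, 2=O, joined by two single bonds.
ethanol_bonds = [(0, 1, "SINGLE", False, False), (1, 2, "SINGLE", False, False)]
edge_index, edge_attr = build_edges(ethanol_bonds)
```

Bonds are undirected, but message-passing layers usually operate on directed edges, which is why each bond contributes two columns to the edge index and two rows to the edge feature matrix.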

**3. Final GNN Input Structure**

The GNN for the drug molecule will take the following input:

**1. Node Features (X):** A matrix of shape (N_atoms, N_node_features).

**2. Edge Index (Adjacency List):** A tensor of shape (2, 2 × N_bonds). Each undirected bond is stored as two directed edges, one per direction.

**3. Edge Features (E):** A matrix of shape (2 × N_bonds, N_edge_features), if used.
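
Before training, it can help to check that the three inputs agree with each other. The container below is a hypothetical, pure-Python stand-in (the name `MolGraph` and its `validate` method are invented for this sketch); in PyTorch Geometric the same three fields would normally be held as tensors in a `Data(x=..., edge_index=..., edge_attr=...)` object.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MolGraph:
    x: List[List[float]]                           # (N_atoms, N_node_features)
    edge_index: List[List[int]]                    # (2, 2 * N_bonds)
    edge_attr: Optional[List[List[float]]] = None  # (2 * N_bonds, N_edge_features)

    def validate(self) -> None:
        """Raise ValueError if the shapes are mutually inconsistent."""
        n_atoms = len(self.x)
        if len(self.edge_index) != 2 or len(self.edge_index[0]) != len(self.edge_index[1]):
            raise ValueError("edge_index must have two rows of equal length")
        n_edges = len(self.edge_index[0])
        if n_edges % 2 != 0:
            raise ValueError("expected every bond to appear in both directions")
        if any(not 0 <= a < n_atoms for row in self.edge_index for a in row):
            raise ValueError("edge_index refers to an atom that is not in x")
        if self.edge_attr is not None and len(self.edge_attr) != n_edges:
            raise ValueError("edge_attr needs one row per directed edge")


# Ethanol (CCO) with toy 2-dimensional node features and 1-dimensional edge features.
graph = MolGraph(
    x=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    edge_index=[[0, 1, 1, 2], [1, 0, 2, 1]],
    edge_attr=[[1.0], [1.0], [1.0], [1.0]],
)
graph.validate()  # three atoms, two bonds, four directed edges
```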

### Open-Source DTI Data Sources
For a robust DTI project, you need large, clean datasets of both interactions and non-interactions.

**1. ChEMBL (Recommended Starting Point)**

- **Focus:** A manually curated database of bioactive molecules with drug-like properties. It links small molecules to targets.

- **What to Extract:** Look for bioactivity assays with quantitative binding data, typically Ki, Kd, IC50, or EC50 values.

- **Data Labeling Strategy (Classification):**
  - Positive (Binding): DTI pairs with an affinity value below a chosen threshold (e.g., Ki < 100 nM, or pIC50 > 6.0, which corresponds to IC50 < 1 µM).
  - Negative (Non-Binding): Pairs that were tested but showed very low affinity (e.g., Ki > 10,000 nM), or "random" pairs not known to interact. Random pairs are often used to balance the dataset, but they are unverified negatives and some may be undiscovered binders.
    
**2. BindingDB**

- **Focus:** A public domain database of measured binding affinities for drugs and targets, including protein-ligand complexes.

- **What to Extract:** Similar to ChEMBL, it's rich in quantitative binding data and can be used to complement ChEMBL or be the primary source.

**3. DeepDTA/Davis/KIBA Benchmark Sets**

- **Focus:** Researchers in the DTI/DTA (Drug-Target Affinity) field often use established benchmark datasets that are already pre-processed and curated for direct use.

- **Value:** Starting with one of these (e.g., the Davis or KIBA datasets) is highly recommended, as they provide a cleaner, smaller set of verified DTI pairs, allowing you to focus on the model architecture and quickly establish a baseline. You can then scale up to the larger raw databases.
