# 01 Molecular features

Molecular fingerprints are representations of chemical structures that encode the presence or absence of specific substructures or molecular fragments.
They are typically used in cheminformatics to compare and analyze chemical compounds.
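
To make "compare" concrete, here is a minimal sketch of one common similarity measure for bit fingerprints, the Tanimoto (Jaccard) coefficient. The helper `tanimoto` and the two short bit lists are made up for illustration; real fingerprints are much longer.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two equal-length 0/1 sequences."""
    if len(bits_a) != len(bits_b):
        raise ValueError("Fingerprints must have the same length")
    # Bits set in both fingerprints
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    # Bits set in at least one fingerprint
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    return both / either if either else 0.0


# Hypothetical 8-bit fingerprints, invented for this example
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]
similarity = tanimoto(fp_a, fp_b)
print(similarity)
```

A value of 1 means the two fingerprints set exactly the same bits, and a value of 0 means they share none.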

## Data and processing

In this section, we will use the PyArrow library to load our dataset. PyArrow is a powerful library for working with large, columnar data structures, and it provides efficient tools for reading and writing data in the Parquet format, among others.

First, we import the necessary module from PyArrow:

In [1]:
import pyarrow.dataset as ds

Next, we define the path to our training dataset:

In [2]:
PATH_TRAIN_DATA = "../../../data/train.parquet"

Finally, we load the dataset using the PyArrow dataset function:

In [3]:
DATA = ds.dataset(source=PATH_TRAIN_DATA, format="parquet")

## Protein selection

In this section, we focus on a single protein target. The dataset contains binding data for three different proteins. To keep this beginner tutorial manageable, we select only one of them, which is enough to demonstrate the essential concepts.

In [4]:
protein_selection = "sEH"

Because our dataset is massive, loading the entire dataset into memory is impractical. Instead, we use PyArrow's scanner functionality to efficiently filter and process the data. Scanners allow us to query and load only the necessary parts of the dataset into memory, making our analysis more efficient and scalable.

To handle the protein selection and binding information, we will create two separate scanners: one for molecules that bind to the protein and one for those that do not. This separation will help us handle the unbalanced nature of the dataset and make it more manageable for our beginner approach.

Filters are written as PyArrow compute expressions. The scanner evaluates them while it reads the data, so rows that do not match are discarded instead of accumulating in memory.

To achieve this, we use the `pyarrow.compute` module, which provides a set of functions for performing computations and filtering on Arrow arrays and tables.

In [5]:
import pyarrow.compute as pc

We then create two scanners to filter the dataset based on the protein selection and binding status.

The filters use boolean conditions to specify the criteria for selecting rows, with the `&` operator serving as a logical `AND` to combine these conditions. This setup allows us to efficiently handle the large dataset by focusing our analysis on a specific protein and its binding characteristics.

In [6]:
scanner_protein_bind = DATA.scanner(
    filter=(pc.field("protein_name") == protein_selection) & (pc.field("binds") == 1)
)
scanner_protein_no_bind = DATA.scanner(
    filter=(pc.field("protein_name") == protein_selection) & (pc.field("binds") == 0)
)

-   `scanner_protein_bind`: keeps only rows for the selected protein (sEH) where the molecule binds (i.e., `binds` is `1`). We use this subset to analyze molecules that bind to the protein.
-   `scanner_protein_no_bind`: keeps only rows for the selected protein (sEH) where the molecule does not bind (i.e., `binds` is `0`). We use this subset to analyze molecules that do not bind to the protein.

## Molecule examples

In [7]:
from rdkit import Chem
from rdkit.Chem import AllChem
from leash_bio_kaggle.mol import clean_mol_str

In [8]:
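def get_mol(smiles):
    """Build a 3D, MMFF-optimized RDKit molecule from a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Add explicit hydrogens before generating 3D coordinates
    mol = Chem.AddHs(mol)
    # EmbedMolecule returns -1 when no conformer could be generated
    if AllChem.EmbedMolecule(mol) == -1:
        raise ValueError(f"Could not embed SMILES: {smiles!r}")
    AllChem.MMFFOptimizeMolecule(mol, maxIters=200)
    return mol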

In [9]:
example_row_bind = scanner_protein_bind.head(num_rows=1)
example_row_no_bind = scanner_protein_no_bind.head(num_rows=1)

In [10]:
smiles_bind = clean_mol_str(example_row_bind["buildingblock3_smiles"][0].as_py())
smiles_no_bind = clean_mol_str(example_row_no_bind["buildingblock3_smiles"][0].as_py())

In [11]:
mol_bind = get_mol(smiles_bind)
mol_no_bind = get_mol(smiles_no_bind)

## MACCS keys

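MACCS keys are a fixed dictionary of 166 predefined substructure patterns. Each bit records whether the molecule contains the corresponding pattern.

https://pubs.acs.org/doi/10.1021/ci010132r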

In [12]:
from rdkit.Chem import MACCSkeys

In [13]:
maccs = MACCSkeys.GenMACCSKeys(mol_bind)
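
To see what the result contains, we can list the keys that are switched on. The sketch below uses phenol as a stand-in molecule (an assumption for illustration, not a molecule from the dataset). RDKit returns 167 bits because bit 0 is unused, so bit positions line up with key numbers.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

# Phenol, used here only as a small stand-in molecule
toy_mol = Chem.MolFromSmiles("c1ccccc1O")
toy_maccs = MACCSkeys.GenMACCSKeys(toy_mol)

n_bits = toy_maccs.GetNumBits()
# Positions of the keys whose substructure pattern matched
on_bits = list(toy_maccs.GetOnBits())

print(n_bits)
print(on_bits)
```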

## Morgan fingerprint

Morgan fingerprints, also known as Extended-Connectivity Fingerprints (ECFPs), are a type of molecular fingerprint used extensively in cheminformatics for tasks such as structure-activity relationship (SAR) modeling, similarity searching, clustering, and classification.
These fingerprints are designed to capture the structural features of molecules in a compact and efficient manner.

Morgan fingerprints are derived using a variant of the Morgan algorithm, which was originally developed to solve the molecular isomorphism problem: determining whether two molecular graphs represent the same structure even when their atoms are numbered differently.


### Generation of Morgan fingerprints

The ECFP generation process involves three main stages:

1.  **Initial Assignment of Atom Identifiers**: Each atom in the molecule is assigned an initial integer identifier.
    This identifier typically encodes information about the atom's properties, such as atomic number, valence, and whether it is part of a ring.
2.  **Iterative Updating of Identifiers**: In each iteration, the identifier of each atom is updated to reflect the identifiers of its neighboring atoms.
    This process captures the local structural environment of each atom.
    The identifiers from each iteration are collected into a set, forming the extended-connectivity fingerprint.
3.  **Duplicate Identifier Removal**: After a specified number of iterations, duplicate identifiers are removed, resulting in a set of unique identifiers that define the fingerprint.
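
The three stages above can be sketched in plain Python. This is a simplified illustration, not RDKit's implementation: `stable_hash`, `toy_ecfp`, and `fold` are hypothetical helpers, the initial identifier uses only atomic number and degree, and duplicates are removed by identifier value alone, whereas the published algorithm also compares the underlying substructures. The `fold` step, which maps identifiers onto a fixed number of bits, is an added assumption about how a bit vector is obtained.

```python
import zlib


def stable_hash(values):
    """Deterministic 32-bit hash of a tuple of integers."""
    return zlib.crc32(repr(tuple(values)).encode("utf-8"))


def toy_ecfp(atomic_numbers, neighbors, radius):
    """Return the set of atom-environment identifiers up to `radius`."""
    # Stage 1: initial identifiers from simple atom properties
    identifiers = [
        stable_hash((z, len(neighbors[i]))) for i, z in enumerate(atomic_numbers)
    ]
    collected = list(identifiers)
    # Stage 2: fold each atom's neighbor identifiers into its own identifier
    for iteration in range(1, radius + 1):
        updated = []
        for i in range(len(atomic_numbers)):
            # Sorting makes the result independent of neighbor order
            neighbor_ids = sorted(identifiers[j] for j in neighbors[i])
            updated.append(stable_hash((iteration, identifiers[i], *neighbor_ids)))
        identifiers = updated
        collected.extend(identifiers)
    # Stage 3: drop duplicate identifiers
    return set(collected)


def fold(identifiers, n_bits):
    """Map identifiers onto a fixed-length 0/1 list."""
    bits = [0] * n_bits
    for identifier in identifiers:
        bits[identifier % n_bits] = 1
    return bits


# Ethanol heavy atoms (C-C-O) as an adjacency list, in two atom orderings
ethanol_features = toy_ecfp([6, 6, 8], [[1], [0, 2], [1]], radius=2)
ethanol_reordered = toy_ecfp([8, 6, 6], [[1], [0, 2], [1]], radius=2)
ethanol_bits = fold(ethanol_features, n_bits=16)

print(ethanol_features == ethanol_reordered)
print(ethanol_bits)
```

Because each update depends only on an atom's own identifier and the sorted identifiers of its neighbors, the resulting set does not depend on how the atoms are numbered.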

#### Additional readings

- https://pubs.acs.org/doi/10.1021/ci100050t

In [14]:
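ecfp_example = AllChem.GetMorganFingerprintAsBitVect(
    mol_bind,
    radius=3,  # neighborhood radius in bonds (diameter 6, comparable to ECFP6)
    nBits=248,  # length of the folded bit vector
)
print(list(ecfp_example))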

[1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
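
Recent RDKit releases also provide a generator-based interface in `rdkit.Chem.rdFingerprintGenerator`. The sketch below uses it with the same radius and length as above and compares two fingerprints with the Tanimoto coefficient. The two SMILES strings are stand-ins chosen for illustration, not molecules from the dataset.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Stand-in molecules (assumptions for illustration)
mol_a = Chem.MolFromSmiles("CCOc1ccccc1")
mol_b = Chem.MolFromSmiles("CCOc1ccccc1C")

# One generator can be reused for many molecules
morgan_generator = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=248)
fp_a = morgan_generator.GetFingerprint(mol_a)
fp_b = morgan_generator.GetFingerprint(mol_b)

# Fraction of set bits shared by the two fingerprints
similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)

print(fp_a.GetNumBits(), fp_a.GetNumOnBits())
print(similarity)
```

With a vector this short, unrelated substructures are more likely to collide on the same bit, so a larger `fpSize` is worth considering when fingerprints feed a model.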
