# 02 Basic classifier

This notebook walks through developing a basic predictive model.

The essential steps are data preprocessing, feature engineering, model selection, training, and evaluation.

The goal is to create a machine learning model that can predict the binding affinity of small molecules to a single protein target.

The system path, managed by the `sys.path` variable in Python, is a list of directory paths that the Python interpreter searches to locate modules and packages when an `import` statement is executed.

Printing the current `sys.path` shows which directories are searched by default:

In [2]:
import sys

print(sys.path)

['/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python311.zip', '/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11', '/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11/lib-dynload', '', '/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11/site-packages', '/Users/ria/leash-bio-kaggle']


This list contains the directories that Python searches for modules.
Here’s a brief explanation of some key elements in this list:

-   `'/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python311.zip'`: This is the path to the Python standard library zip archive, which contains standard modules.
-   `'/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11'`: This is the directory containing the core Python library.
-   `'/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11/lib-dynload'`: This directory contains dynamic modules (e.g., compiled extension modules).
-   `''`: The empty string represents the current directory, allowing Python to search for modules in the directory from which the script is run.
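-   `'/opt/anaconda3/envs/leash_bio_kaggle-dev/lib/python3.11/site-packages'`: This is the directory for third-party packages installed in the current virtual environment.
-   `'/Users/ria/leash-bio-kaggle'`: This is the project's root directory, which makes the `leash_bio_kaggle` package inside it importable.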
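
To see how these entries behave on your own machine, a minimal sketch like the following lists each `sys.path` entry and whether it exists on disk. The helper name `describe_sys_path` is hypothetical and not part of the notebook.

```python
import os
import sys


def describe_sys_path():
    """Return (entry, exists) pairs for every entry in sys.path."""
    described = []
    for entry in sys.path:
        # An empty string stands for the current working directory.
        location = entry if entry else os.getcwd()
        described.append((entry, os.path.exists(location)))
    return described


for entry, exists in describe_sys_path():
    print(f"{entry!r}: exists={exists}")
```

Entries that do not exist on disk, which commonly includes the `.zip` entry, are skipped by the interpreter during module lookup.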

### Relative imports

In Python, the `os.path.abspath` function is used to obtain the absolute path of a given path.
An absolute path is a complete path from the root of the file system to the desired directory or file, leaving no ambiguity about its location.

#### Why use `abspath`?

`os.path.abspath` converts a relative path into an absolute one by resolving it against the current working directory at the moment of the call. Once computed, the absolute path is interpreted the same way no matter how the working directory changes afterwards.

This is useful in environments like Jupyter notebooks, where the working directory can differ depending on how the notebook is launched.

#### Why use a relative path?

Using a relative path in conjunction with `os.path.abspath` helps keep the project portable.
The relative path (`'../../../'`) specifies the location of the `leash_bio_kaggle` module relative to the current working directory, which for a notebook is usually the directory that contains it.
As long as the notebook is run from its own directory, the entire project can be moved to a different directory or system without breaking the import paths.

This practice helps in organizing the project structure logically.
By using relative paths, we can easily locate and manage modules within the project hierarchy.
It also keeps the code cleaner and more understandable, as the structure is evident from the path specified.
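
Because `os.path.abspath` resolves a relative path against the current working directory, the same relative path can point to different places depending on where the code is run. A minimal sketch, using temporary directories and a hypothetical helper `resolve_from`, makes this visible:

```python
import os
import tempfile


def resolve_from(cwd: str, relative: str) -> str:
    """Return os.path.abspath(relative) as seen from working directory `cwd`."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        return os.path.abspath(relative)
    finally:
        # Always restore the original working directory.
        os.chdir(previous)


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b")
    os.makedirs(nested)
    # The same relative path, resolved from two different directories.
    from_root = resolve_from(root, "../pkg")
    from_nested = resolve_from(nested, "../pkg")

print(from_root)
print(from_nested)
```

The two printed paths differ, which is why the existence check in the next cell is worth keeping: it fails loudly if the notebook is started from an unexpected directory.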

In [3]:
import os

module_path = os.path.abspath(os.path.join('../../../', 'leash_bio_kaggle'))
print(module_path)

if not os.path.exists(module_path):
    raise RuntimeError("Cannot find the Python module `leash_bio_kaggle`")

/Users/ria/leash-bio-kaggle/leash_bio_kaggle


After determining the correct path to our custom module using `os.path.abspath`, the next step is to ensure that this path is included in the `sys.path` list.
Let’s see how we can add our module path to the system path:

In [4]:
if module_path not in sys.path:
    sys.path.append(module_path)

The condition checks whether our custom module path is already present in `sys.path`, which prevents adding the same path multiple times when the cell is re-run.
If the path is not found, we append it.
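
The same check can be wrapped in a small reusable function. This is only a sketch: the name `add_to_sys_path` and the example path are hypothetical.

```python
import sys


def add_to_sys_path(path: str) -> bool:
    """Append `path` to sys.path once; return True if it was added."""
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


# Hypothetical path, used only to show that a second call is a no-op.
example_path = "/hypothetical/project/root"
first = add_to_sys_path(example_path)
second = add_to_sys_path(example_path)
print(first, second)
```

Python locates a package by looking for its folder inside a `sys.path` entry, so for `import leash_bio_kaggle` to resolve, the entry that matters is the directory that contains the `leash_bio_kaggle` folder.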

## Data and processing

We use PyArrow to load our dataset. PyArrow is a library for working with large, columnar data structures, and it provides efficient tools for reading and writing data in the Parquet format, among others.

First, we import the necessary module from PyArrow:

In [5]:
import pyarrow.dataset as ds

Next, we define the path to our training dataset:

In [6]:
PATH_TRAIN_DATA = "../../../data/train.parquet"

Finally, we load the dataset using the PyArrow dataset function:

In [7]:
DATA = ds.dataset(source=PATH_TRAIN_DATA, format="parquet")

### Protein selection

In this section, we will focus on a single protein target from our dataset. The dataset contains binding affinity data for three different proteins. To simplify our approach and make it more manageable for this beginner tutorial, we will select only one protein target. This will help us avoid overcomplicating our analysis while still demonstrating the essential concepts.

In [8]:
protein_selection = "sEH"

A scanner allows us to query and load only the necessary parts of the dataset into memory.

The `pyarrow.compute` module provides functions for performing computations and filtering on Arrow arrays and tables. We use it to build filter expressions that are applied while the dataset is scanned, so the full dataset never has to be loaded into memory.

In [9]:
import pyarrow.compute as pc

We create two separate scanners that filter the dataset by the selected protein and by binding status: one for molecules that bind, and one for molecules that do not.

The filters use boolean conditions to specify the criteria for selecting rows, with the `&` operator combining these conditions.

This allows us to handle the large dataset efficiently by focusing our analysis on a specific protein and its binding characteristics.

In [10]:
scanner_protein_bind = DATA.scanner(
    filter=(pc.field("protein_name") == protein_selection) & (pc.field("binds") == 1)
)
scanner_protein_no_bind = DATA.scanner(
    filter=(pc.field("protein_name") == protein_selection) & (pc.field("binds") == 0)
)

-   `scanner_protein_bind`: This scanner filters the dataset to include only rows for the selected protein (sEH) where the molecule binds (i.e., `binds` is `1`). This subset of data will be used to analyze molecules that successfully bind to the protein.
-   `scanner_protein_no_bind`: This scanner filters the dataset to include only rows for the selected protein (sEH) where the molecule does not bind (i.e., `binds` is `0`). This subset will be used to analyze molecules that do not bind to the protein.

### Subsampling dataset

Subsampling means selecting a smaller, representative portion of the data for analysis or training.

* A balanced subsample allows the machine learning model to learn the characteristics of both classes more effectively.
* A well-chosen subsample captures the essential patterns and variations in the data, so a model trained on it can still generalize to unseen data.

We are working with a large dataset containing binding affinity data for molecules against a specific protein.

* The dataset is likely imbalanced, with more molecules that do not bind to the protein than molecules that do.
    * This can lead to a model that is biased toward predicting the majority class.
* We therefore create a balanced subset of the data for our initial analysis and model training.

We have two scanners: one for binding molecules and one for non-binding molecules.
By subsampling from each, we get a more balanced representation of both classes and a dataset small enough to make our machine learning task tractable.
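
To illustrate why imbalance is a problem, consider a trivial classifier that always predicts "does not bind". The class proportions below (10 binders out of 1,000 molecules) are made up for this sketch and are not measured from the dataset.

```python
import numpy as np

# Made-up labels: 10 binders (1) and 990 non-binders (0).
labels = np.array([1] * 10 + [0] * 990)

# A "model" that ignores its input and always predicts the majority class.
predictions = np.zeros_like(labels)

accuracy = float((predictions == labels).mean())
# Recall: fraction of true binders that were predicted as binders.
recall = float(predictions[labels == 1].mean())

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks excellent even though not a single binder is found, which is the kind of bias that a balanced subsample helps to avoid.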

#### Counting Rows

First, we count the number of rows (samples) for binding and non-binding molecules.
`count_rows()` is a scanner method that returns the number of rows matching the filter.

In [11]:
n_rows_bind = scanner_protein_bind.count_rows()
n_rows_no_bind = scanner_protein_no_bind.count_rows()

When working with large and potentially imbalanced datasets, it's important to choose sample sizes that create a balanced and manageable subset for analysis and model training. 
* In this example, we make specific choices for the number of binding and non-binding samples, but these numbers can be adjusted based on the characteristics of your dataset and the goals of your project.

In [12]:
n_bind = 10_000
ratio_no_bind = 1.0
n_no_bind = int(ratio_no_bind * n_bind)

1.  Setting the Number of Binding Samples (`n_bind`).

    We set the desired number of binding molecule samples to 10,000. This number is chosen to provide a substantial amount of data for training and validating our machine learning model while keeping the dataset size manageable.
    * The choice of 10,000 is somewhat arbitrary and should be based on the size of your dataset and the computational resources available.
2.  Setting the Ratio of Non-Binding to Binding Samples (`ratio_no_bind`).

    We set the ratio of non-binding to binding samples to 1.0, meaning we want as many non-binding samples as binding samples.
    * This creates a balanced dataset, which is important for many machine learning algorithms that perform better when classes are balanced.

    The ratio should be chosen based on the level of imbalance in your dataset.
    * For highly imbalanced datasets, any ratio below the dataset's natural non-binding to binding ratio downsamples the majority class. An alternative technique is oversampling the minority class.
    * A ratio of 1.0 is a good starting point for creating balanced datasets.
3.  Calculating the Number of Non-Binding Samples (`n_no_bind`).

    The number of non-binding samples is calculated from the specified ratio.
    * In this case, with a ratio of 1.0 and `n_bind` of 10,000, `n_no_bind` will also be 10,000, resulting in a balanced subset.

    Ensure that the calculated number of samples does not exceed the available data.
    * If your dataset has fewer non-binding samples than calculated, you may need to lower `ratio_no_bind` or `n_bind`, or consider techniques like upsampling the smaller class.
    * Always verify that the chosen sample sizes are feasible given the dataset's composition.
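
As an alternative to failing when the requested sizes are too large, the sizes can be shrunk to fit the available rows while keeping the ratio. This is a sketch of one possible policy, not what the notebook does; the helper name `fit_sample_sizes` and the row counts in the usage lines are hypothetical.

```python
def fit_sample_sizes(n_bind, ratio_no_bind, n_rows_bind, n_rows_no_bind):
    """Shrink the requested sample sizes so that both fit the available rows."""
    # Never ask for more binders than exist.
    n_bind = min(n_bind, n_rows_bind)
    n_no_bind = int(ratio_no_bind * n_bind)
    if n_no_bind > n_rows_no_bind:
        # Too few non-binders: cap them, then scale the binders down
        # so that the requested ratio is preserved.
        n_no_bind = n_rows_no_bind
        n_bind = int(n_no_bind / ratio_no_bind)
    return n_bind, n_no_bind


# Hypothetical row counts, chosen only to exercise each branch.
print(fit_sample_sizes(10_000, 1.0, 50_000, 2_000_000))
print(fit_sample_sizes(10_000, 1.0, 4_000, 2_000_000))
print(fit_sample_sizes(10_000, 2.0, 50_000, 12_000))
```

Raising an error, as the next cell does, is the stricter choice: it forces an explicit decision instead of silently training on fewer samples than intended.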

In [13]:
if n_bind > n_rows_bind:
    raise ValueError("`n_bind` is greater than `n_rows_bind`")
if n_no_bind > n_rows_no_bind:
    raise ValueError("`n_no_bind` is greater than `n_rows_no_bind`")

Before we proceed with subsampling the dataset, we need to import the necessary libraries. In this example, we use NumPy and PyArrow to perform random sampling and handle data efficiently.

In [14]:
import numpy as np
import pyarrow as pa

Next, we perform random sampling on the dataset to select a specified number of binding molecule samples. This step ensures that we have a manageable subset of binding data for further analysis and model training.

In [15]:
bind_indices = np.random.choice(n_rows_bind, size=n_bind, replace=False)

In this code block, we use the `np.random.choice` function to generate an array of random indices. Specifically, `np.random.choice(n_rows_bind, size=n_bind, replace=False)` generates `n_bind` random indices from the total number of binding samples (`n_rows_bind`). The `replace=False` parameter ensures that each index is unique, meaning the same index is not selected more than once. This method of random sampling helps in creating a diverse and representative subset of binding data.
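
Because the indices are random, each run of the notebook selects a different subset. If reproducibility matters, one option is NumPy's `Generator` API with a fixed seed. The seed value and the sizes below are arbitrary choices for this sketch.

```python
import numpy as np

n_rows_example = 1_000  # made-up population size
n_sample_example = 10   # made-up sample size

# A seeded generator makes the draw repeatable.
rng = np.random.default_rng(seed=42)
indices = rng.choice(n_rows_example, size=n_sample_example, replace=False)

# Re-creating the generator with the same seed reproduces the same draw.
indices_again = np.random.default_rng(seed=42).choice(
    n_rows_example, size=n_sample_example, replace=False
)

print(indices)
```

With `replace=False`, requesting more samples than the population contains raises a `ValueError`, which is another reason to validate the sample sizes beforehand.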

Once we have the random indices, we use the `take` method of the scanner object to extract the corresponding rows from the dataset. The `scanner_protein_bind.take(indices=...)` method selects the rows at the specified indices, providing us with a subset of binding molecule samples. This subset will be used in our subsequent analysis and model training.

In [16]:
bind_table = scanner_protein_bind.take(indices=bind_indices)

Next, we perform a similar process to sample a specified number of non-binding molecule samples. 

In [17]:
no_bind_table = scanner_protein_no_bind.take(
    indices=np.random.choice(n_rows_no_bind, size=n_no_bind, replace=False)
)

After we have sampled the specified number of binding and non-binding molecule samples, the next step is to combine these subsets into a single dataset.
This is achieved by concatenating the tables containing the sampled data.

In [18]:
table = pa.concat_tables([bind_table, no_bind_table])

By concatenating the tables, we merge the subsets of binding and non-binding samples into a single, unified dataset.

Most machine learning algorithms expect a single input dataset for training and evaluation. By concatenating the binding and non-binding samples into one table, we prepare the data in a format that is ready to be fed into machine learning models. This unified dataset can then be split into training and testing sets, features can be extracted, and models can be trained without additional steps to merge data.

Having a single table that contains all the samples makes it easier to handle the data. We can apply transformations, feature extraction, and other preprocessing steps uniformly across the entire dataset without needing to manage multiple tables separately. This simplifies the workflow and reduces the potential for errors or inconsistencies in data processing.

## Features

TODO: We define functions for cleaning the molecule string and extracting features using RDKit. The features are generated using Morgan fingerprints, a common method for encoding molecular structures.

In [21]:
from rdkit import Chem
from rdkit.Chem import AllChem
from leash_bio_kaggle.mol import clean_mol_str

ModuleNotFoundError: No module named 'selfies'
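
The traceback above reports that `selfies` cannot be found, presumably because `leash_bio_kaggle.mol` imports it and it is not installed in the active environment (an assumption, since that module's source is not shown here). A small check like the following, with the hypothetical helper `missing_modules`, can confirm which top-level packages are missing before running the rest of the notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this section appears to depend on.
print(missing_modules(["rdkit", "selfies"]))
```

Any name in the printed list needs to be installed into the environment before the import cell can succeed.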

In [20]:
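def get_features(smiles: str, radius: int = 3, nBits: int = 2048):
    # Parse the SMILES string; RDKit returns None when parsing fails.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Encode the molecule as a fixed-length Morgan fingerprint bit vector.
    features = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=nBits)
    return np.array(features)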

In [22]:
def process_row(row):
    smiles = row['molecule_smiles']
    fingerprint = get_features(clean_mol_str(smiles))
    return fingerprint

In [23]:
from concurrent.futures import ThreadPoolExecutor

In [26]:

def split_table_into_batches(table, batch_size):
    # Yield consecutive slices of at most `batch_size` rows.
    num_rows = len(table)
    for i in range(0, num_rows, batch_size):
        yield table.slice(i, min(batch_size, num_rows - i))


def generate_features_parallel(table, num_workers=4, batch_size=1000):
    all_features = []

    # Create the thread pool once and reuse it for every batch.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        for batch in split_table_into_batches(table, batch_size):
            # Only the SMILES column is needed to compute the features.
            smiles_list = batch["molecule_smiles"].to_pylist()
            rows = ({"molecule_smiles": smiles} for smiles in smiles_list)
            # `executor.map` returns results in the same order as the input rows.
            batch_features = np.array(list(executor.map(process_row, rows)))
            all_features.append(batch_features)

    # Stack the per-batch arrays into one (n_rows, n_bits) array.
    features = np.vstack(all_features)
    return features
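
The batching function relies on results coming back in the same order as the input rows, so that each feature vector lines up with its label. The sketch below checks this property with a stand-in featurizer; `toy_featurize` is hypothetical and has nothing to do with real fingerprints.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def toy_featurize(text: str) -> np.ndarray:
    """Stand-in featurizer: string length and number of 'C' characters."""
    return np.array([len(text), text.count("C")])


inputs = ["CCO", "c1ccccc1", "CC(=O)O", "N"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # map() yields results in input order, regardless of completion order.
    parallel = np.vstack(list(executor.map(toy_featurize, inputs)))

sequential = np.vstack([toy_featurize(text) for text in inputs])

print(np.array_equal(parallel, sequential))
```

Whether threads give a real speed-up depends on how much of the work releases Python's global interpreter lock, so it is worth timing the parallel version against a plain loop.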

In [27]:
features_array = generate_features_parallel(table)

NameError: name 'clean_mol_str' is not defined

## Training