# Loading cdrpy Datasets

The `Dataset` class standardizes how cancer drug response datasets are represented in cdrpy.

In this tutorial, we will walk through a simple example of loading and exploring a custom dataset using cdrpy.

## Creating a Dataset

In [11]:
import itertools
from pathlib import Path

import numpy as np
import pandas as pd

The `Dataset` class is available in the `data` module of cdrpy.

In [12]:
from cdrpy.data import Dataset

The only requirement for creating a cdrpy dataset is a `pandas.DataFrame` instance containing drug response observations. This DataFrame must contain the following four columns:

1. *id* - the drug response ID (unique to each combination of sample and drug)
2. *cell_id* - the unique identifier for the corresponding sample (e.g., cell line)
3. *drug_id* - the unique identifier for the corresponding drug/treatment
4. *label* - the observed drug response value (e.g., an IC50 value)
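Before constructing a `Dataset`, it can help to check a DataFrame against the column list above. The following is a minimal sketch using only pandas; `REQUIRED_COLUMNS` and `check_response_frame` are hypothetical names introduced here for illustration and are not part of cdrpy, which may perform its own validation differently.

```python
import pandas as pd

# Hypothetical helper names; cdrpy itself may validate differently.
REQUIRED_COLUMNS = ("id", "cell_id", "drug_id", "label")


def check_response_frame(df: pd.DataFrame) -> list[str]:
    """Returns a list of problems found in a drug response DataFrame."""
    problems = []

    # All four columns must be present before any further checks make sense.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems

    # Each response ID should appear exactly once.
    if df["id"].duplicated().any():
        problems.append("duplicate values in 'id'")

    return problems


good = pd.DataFrame(
    {
        "id": [0, 1],
        "cell_id": ["C1", "C1"],
        "drug_id": ["D1", "D2"],
        "label": [0.5, -1.2],
    }
)
bad = good.drop(columns="label")

print(check_response_frame(good))
print(check_response_frame(bad))
```

An empty list means the frame has the expected columns and unique response IDs.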

As an example, we will generate some synthetic drug response data and create a new `Dataset`.

In [13]:
def simulate_response_data(n_drugs: int, n_cells: int) -> pd.DataFrame:
    """Generates simulated drug response observations."""
    n_responses = n_cells * n_drugs

    cell_ids = [f"C{i+1}" for i in range(n_cells)]
    drug_ids = [f"D{i+1}" for i in range(n_drugs)]

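    # Enumerate every (cell, drug) pair, then transpose the pairs into two ID columns.
    cell_col, drug_col = zip(*itertools.product(cell_ids, drug_ids))
    labels = np.random.normal(size=n_responses)
    data = zip(range(n_responses), cell_col, drug_col, labels)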

    return pd.DataFrame(data, columns=["id", "cell_id", "drug_id", "label"])


response_data = simulate_response_data(n_drugs=10, n_cells=20)
response_data.head()

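|   | id | cell_id | drug_id | label |
|---|----|---------|---------|-------|
| 0 | 0 | C1 | D1 | -0.29193 |
| 1 | 1 | C1 | D2 | -2.285385 |
| 2 | 2 | C1 | D3 | 0.576527 |
| 3 | 3 | C1 | D4 | -0.126431 |
| 4 | 4 | C1 | D5 | 1.501209 |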
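The `zip(*itertools.product(...))` idiom in `simulate_response_data` explains why the first rows all share `C1`. A small standalone sketch, using shorter ID lists chosen here purely for illustration, shows the intermediate values:

```python
import itertools

cell_ids = ["C1", "C2"]
drug_ids = ["D1", "D2", "D3"]

# product() yields one (cell, drug) tuple per combination,
# with the first argument (cells) varying slowest.
pairs = list(itertools.product(cell_ids, drug_ids))

# zip(*pairs) transposes the list of pairs into two parallel columns.
cell_col, drug_col = zip(*pairs)

print(pairs[:3])
print(cell_col)
print(drug_col)
```

Because cells vary slowest, every drug for `C1` is listed before the first row for `C2` appears.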


In [14]:
D = Dataset(response_data, name="sim-dataset")
print(D)

Dataset(name=sim-dataset, size=200, n_cells=20, n_drugs=10)
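The counts in the printed summary can be cross-checked against the underlying DataFrame with plain pandas. This sketch assumes that `size` counts response rows and that `n_cells` and `n_drugs` count distinct IDs, which is consistent with the values shown above but is not confirmed from the cdrpy source. The table is rebuilt here so the example runs on its own.

```python
import itertools

import numpy as np
import pandas as pd

# Rebuild a 20-cell x 10-drug response table like the one above.
cells = [f"C{i+1}" for i in range(20)]
drugs = [f"D{i+1}" for i in range(10)]
pairs = list(itertools.product(cells, drugs))

response_data = pd.DataFrame(pairs, columns=["cell_id", "drug_id"])
response_data.insert(0, "id", np.arange(len(response_data)))
response_data["label"] = np.random.default_rng(0).normal(size=len(response_data))

# Assumed meaning of the summary fields: row count and distinct ID counts.
summary = {
    "size": len(response_data),
    "n_cells": response_data["cell_id"].nunique(),
    "n_drugs": response_data["drug_id"].nunique(),
}
print(summary)
```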


## Subsetting and selection

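cdrpy provides methods for subsetting cancer drug response data by cell line or by drug. Specific cell lines/samples can be selected using the `Dataset.select_cells` method. As the output below shows, the result is itself a `Dataset`, and its name is not carried over from the original.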

In [15]:
cell_ids = ["C1", "C2", "C3"]
D.select_cells(cell_ids)

Dataset(name=None, size=30, n_cells=3, n_drugs=10)

Specific drugs can be selected using the `Dataset.select_drugs` method.

In [16]:
drug_ids = ["D1", "D2", "D3"]
D.select_drugs(drug_ids)

Dataset(name=None, size=60, n_cells=20, n_drugs=3)
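For intuition, the sizes reported above match what a simple row filter on the response table would produce. The sketch below is a pandas analogue only, under the assumption that `select_cells` and `select_drugs` keep the rows whose IDs are in the given list; it does not reproduce how cdrpy implements these methods.

```python
import itertools

import pandas as pd

# A 20-cell x 10-drug grid of (cell_id, drug_id) pairs, as in the simulated data.
cells = [f"C{i+1}" for i in range(20)]
drugs = [f"D{i+1}" for i in range(10)]
df = pd.DataFrame(list(itertools.product(cells, drugs)), columns=["cell_id", "drug_id"])

# Assumed pandas analogue of D.select_cells(["C1", "C2", "C3"]).
by_cell = df[df["cell_id"].isin(["C1", "C2", "C3"])]

# Assumed pandas analogue of D.select_drugs(["D1", "D2", "D3"]).
by_drug = df[df["drug_id"].isin(["D1", "D2", "D3"])]

# Applying both filters keeps only the 3 x 3 block of responses.
both = by_cell[by_cell["drug_id"].isin(["D1", "D2", "D3"])]

print(len(by_cell), len(by_drug), len(both))
```

The first two counts agree with the `size` values printed by the two selections above.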

## Loading data from CSV files

In many cases, raw response data is stored in CSV files. cdrpy can load CSV files