# Modelling walkthrough

The purpose of this notebook is to show how a new, sample model with custom dependencies would be developed and integrated into the pipeline.

This notebook follows a hypothetical scenario where Machine Learning Engineer Maya is developing a new model, with the aim of generating her own predictions of drug-disease treatment efficacy scores. Maya is new to the EveryCure / Matrix ecosystem, and is learning as she goes.

In the end, she wants to train and submit a new model to the pipeline, and have it evaluated along with the other models.

## Modelling assumptions

Maya's goal is to train a new model that will predict the efficacy of drug-disease interactions.

**Embeddings from Knowledge Graph:** Maya knows that EveryCure has generated embeddings for biomedical knowledge graph nodes, which meaningfully encode semantics of the nodes. Many of those nodes are drugs and diseases between which she wants to predict treatment efficacy.

**Training Data:** Maya expects the training data to be a set of known positives and negatives, i.e. drug-disease pairs for which the treatment is known to be effective or ineffective.

**Evaluation:** Maya assumes that the model will be evaluated using AUC-ROC. She also assumes that she will need to perform train-validation splits on her data, and that Matrix's pipeline downstream will be able to further test the predictions of her model.

**Retrieving Data:** Importantly, Maya will retrieve the embeddings and training data from pipelines other than the modelling pipeline. She will avoid preprocessing the data itself as much as possible, relying on other resources provided by EveryCure.
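
To make the evaluation assumption above concrete, here is a minimal sketch of how AUC-ROC is computed with scikit-learn. The labels and scores are made up for illustration and do not come from the Matrix data.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels: 1 = known effective pair, 0 = known ineffective pair
y_true = [0, 0, 1, 1]
# Hypothetical model scores for the same four pairs
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC-ROC is the fraction of (positive, negative) pairs ranked in the right order
auc = roc_auc_score(y_true, y_score)
print(f"AUC-ROC: {auc:.2f}")
```

Three of the four positive-negative combinations are ranked correctly here, so the score is 0.75.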


# Prerequisites

Maya needs access to the GCS bucket containing the data (currently, `gs://mtrx-us-central1-hub-dev-storage`). Maya will use `gsutil` to copy the data to her local machine, and will find the exact paths to the files in the Kedro Data Catalog.

## Required data sources

### Embeddings 

Maya will use the embeddings generated by the Knowledge Graph pipeline to encode the drugs and diseases into a vector space.

In the embeddings pipeline, embeddings are extracted from Neo4j and saved to GCS. The node that does this is defined in [`embeddings/pipeline.py`](https://github.com/matrix-ml/matrix/blob/main/pipelines/matrix/src/matrix/pipelines/embeddings/pipeline.py):

```python
node(
    func=nodes.extract_node_embeddings,
    inputs={
        "nodes": "embeddings.model_output.graphsage",
        "string_col": "params:embeddings.write_topological_col",
    },
    outputs="embeddings.feat.nodes",
    name="extract_nodes_edges_from_db",
    tags=[
        "argowf.fuse",
        "argowf.fuse-group.topological_embeddings",
        "argowf.template-neo4j",
    ],
),
```

The embeddings are saved to the following Kedro dataset:

```yml
embeddings.feat.nodes:
  <<: *_spark_parquet
  filepath: ${globals:paths.embeddings}/feat/nodes_with_embeddings
```


Maya knows that `${globals:paths.embeddings}/feat/nodes_with_embeddings` resolves to `gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings`.
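
The substitution can be illustrated with a small sketch. This is not Kedro's actual resolver, only a hand-rolled stand-in; the `globals_config` dictionary and the `resolve` helper are hypothetical, and the value of `paths.embeddings` is inferred from the resolved path quoted above.

```python
import re

# Hypothetical globals; the value is inferred from the resolved path in the text
globals_config = {
    "paths": {
        "embeddings": "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings",
    }
}


def resolve(template: str, config: dict) -> str:
    """Replace each ${globals:a.b} placeholder with config['a']['b']."""

    def lookup(match: re.Match) -> str:
        value = config
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\$\{globals:([^}]+)\}", lookup, template)


resolved = resolve("${globals:paths.embeddings}/feat/nodes_with_embeddings", globals_config)
print(resolved)
```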

She chooses the latest available version of the embeddings, `v0.2.4-rc.1`; the choice is otherwise arbitrary. She will run the command below to copy the data to her local machine.

The files take up about 24GB of space.

In [None]:
!python --version


In [None]:
!mkdir -p data

!mkdir -p data/embeddings

!gsutil -m cp \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/_SUCCESS" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00000-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00001-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00002-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00003-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00004-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00005-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00006-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00007-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00008-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00009-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00010-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00011-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00012-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00013-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00014-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00015-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  data/embeddings/
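
The part files in the command above follow one naming pattern, so the list of URIs can also be generated rather than typed out. A minimal sketch, assuming the run id and the count of 16 parts read off the command above stay the same; the variable names are hypothetical.

```python
# Values read off the gsutil command above
BASE = (
    "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/"
    "v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings"
)
RUN_ID = "1a0c48cc-af15-4889-8eaf-c6aa98d21c98"
N_PARTS = 16

# part-00000 ... part-00015, zero-padded to five digits
part_uris = [
    f"{BASE}/part-{i:05d}-{RUN_ID}-c000.snappy.parquet" for i in range(N_PARTS)
]
all_uris = [f"{BASE}/_SUCCESS", *part_uris]

print(len(all_uris), "files to copy")
print(all_uris[-1])
```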
  

### Ground truth data

Maya needs to retrieve the training data from the preprocessing pipeline: the known true positives and true negatives on which she can train her model, using the previously retrieved embeddings as features. The first input to the other modelling pipelines is `modelling.raw.ground_truth.positives@spark`, so Maya will retrieve that dataset first, together with its negative counterpart `modelling.raw.ground_truth.negatives@spark`.

```python
node(
    func=nodes.create_int_pairs,
    inputs=[
        "embeddings.feat.nodes",
        "modelling.raw.ground_truth.positives@spark",
        "modelling.raw.ground_truth.negatives@spark",
    ],
    outputs="modelling.int.known_pairs@spark",
    name="create_int_known_pairs",
),
```

Maya retrieves the ground truth data (conflated true positives and true negatives) from GCS. Both were produced by the `preprocessing` pipeline as the datasets `modelling.raw.ground_truth.positives@pandas` and `modelling.raw.ground_truth.negatives@pandas`, and are read in as `@spark` dataframes by the modelling steps. Maya will run the command below to copy the data to her local machine. As in the previous step, the choice of version is arbitrary.

Maya sees that other files live alongside the `*_conflated.tsv` files, and decides to download and investigate them.

In [None]:
!mkdir -p data/known_pairs

!gsutil -m cp \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/01_raw/ground_truth/translator/v2.7.3/tn_pairs_conflated.tsv" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/01_raw/ground_truth/translator/v2.7.3/tp_pairs_conflated.tsv" \
  data/known_pairs/


# Preprocessing

By now, Maya has obtained the embeddings and ground truth data. She will now preprocess the data to create the input for her model. She will also need to create splits for cross-validation.

Maya will first inspect the ground truth data.

In [101]:
import pandas as pd
# True positives
df_tp = pd.read_csv("data/known_pairs/tp_pairs_conflated.tsv", sep="\t")
df_tp.head()

Unnamed: 0,source,target
0,CHEBI:3699,MONDO:0007186
1,UNII:84H8Z9550J,MONDO:0007186
2,CHEBI:7915,MONDO:0007186
3,CHEBI:6375,MONDO:0007186
4,CHEBI:33130,MONDO:0007186


In [102]:
# True negatives

df_tn = pd.read_csv("data/known_pairs/tn_pairs_conflated.tsv", sep="\t")
df_tn.head()


Unnamed: 0,source,target
0,CHEBI:32149,MONDO:0006807
1,CHEBI:32588,MONDO:0006807
2,CHEBI:6804,MONDO:0007186
3,CHEBI:8094,MONDO:0007186
4,CHEMBL.COMPOUND:CHEMBL1187846,MONDO:0007186


True positives and true negatives are represented as sets of source-target pairs. `source` is the drug, `target` is the disease.
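
Before using the pairs as labels, it can be worth checking for duplicated pairs and for pairs that appear in both sets. The sketch below uses toy stand-ins for `df_tp` and `df_tn` with made-up identifiers; it makes no claim about what the real files contain.

```python
import pandas as pd

# Toy stand-ins for df_tp and df_tn; identifiers are made up
toy_tp = pd.DataFrame(
    {"source": ["DRUG:1", "DRUG:2", "DRUG:2"], "target": ["DIS:1", "DIS:1", "DIS:1"]}
)
toy_tn = pd.DataFrame({"source": ["DRUG:2", "DRUG:3"], "target": ["DIS:1", "DIS:2"]})

pair_cols = ["source", "target"]

# Pairs listed more than once within the positives
n_duplicate_tp = int(toy_tp.duplicated(subset=pair_cols).sum())

# Pairs labelled both positive and negative
conflicts = toy_tp.drop_duplicates(pair_cols).merge(
    toy_tn.drop_duplicates(pair_cols), on=pair_cols, how="inner"
)

print(f"Duplicated positive pairs: {n_duplicate_tp}")
print(f"Pairs in both sets: {len(conflicts)}")
```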

Now, Maya will inspect the embeddings that will be used as features for her model.


In [103]:
import os

raw_embeddings_directory = 'data/embeddings'
sample_file_name = "part-00000-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet"
df = pd.read_parquet(os.path.join(raw_embeddings_directory, sample_file_name))

df.head()

Unnamed: 0,<id>,<labels>,topological_embedding,pca_embedding,id,category
0,0,[Entity],"[-0.030589668, 0.10044213, 0.19094326, 0.16870...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",ATC:L01XY01,biolink:Drug
1,1,[Entity],"[0.3552283, 0.15644005, -0.07593296, 0.6228076...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",AEO:0000187,biolink:AnatomicalEntity
2,2,[Entity],"[0.02153505, 0.3099397, 0.038633507, 0.0247304...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",AraPort:AT1G01720,biolink:Gene
3,3,[Entity],"[0.2947414, 0.15319654, 0.2776162, 0.40766323,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",ATC:A07XA02,biolink:ChemicalEntity
4,4,[Entity],"[-0.057893604, 0.02334209, 0.040926114, 0.1583...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",ATC:M01AX04,biolink:ChemicalEntity


In [104]:
df["category"].unique()

array(['biolink:Drug', 'biolink:AnatomicalEntity', 'biolink:Gene',
       'biolink:ChemicalEntity', 'biolink:SequenceVariant',
       'biolink:MolecularEntity', 'biolink:SmallMolecule',
       'biolink:Polypeptide', 'biolink:NucleicAcidEntity',
       'biolink:MolecularMixture', 'biolink:BiologicalEntity',
       'biolink:Protein', 'biolink:OrganismTaxon', 'biolink:CellLine',
       'biolink:Cell', 'biolink:Disease', 'biolink:PhenotypicFeature',
       'biolink:BiologicalProcess', 'biolink:Transcript',
       'biolink:MolecularActivity', 'biolink:CellularComponent',
       'biolink:Pathway', 'biolink:GeneFamily',
       'biolink:DiseaseOrPhenotypicFeature',
       'biolink:GrossAnatomicalStructure', 'biolink:PhysiologicalProcess',
       'biolink:Food', 'biolink:ChemicalMixture', 'biolink:RNAProduct',
       'biolink:Behavior', 'biolink:BehavioralFeature'], dtype=object)

## Preprocess embeddings


In [None]:
!mkdir -p data/embeddings_filtered

First, Maya (a) drops the `pca_embedding` column and (b) removes all entities that are not drugs or diseases.

In [105]:
import pyarrow.parquet as pq
import pyarrow as pa

from pathlib import Path

filtered_embeddings_path = Path('data/embeddings_filtered/embeddings.parquet')

if filtered_embeddings_path.exists():
    print("Filtered embeddings already exist, deleting...")
    filtered_embeddings_path.unlink()

# These categories are defined in `conf/base/modelling/parameters/defaults.yml`
categories_to_keep = {
    "biolink:DiseaseOrPhenotypicFeature",
    "biolink:Drug",
    "biolink:Disease",
    "biolink:BehavioralFeature",
    "biolink:SmallMolecule",
    "biolink:PhenotypicFeature",
}

schema = pa.schema([
    ('topological_embedding', pa.list_(pa.float32())),
    ('id', pa.string()),
])

with pq.ParquetWriter(filtered_embeddings_path, schema=schema) as writer:
    for file_no, file_name in enumerate(os.listdir(raw_embeddings_directory)):
        if not file_name.endswith(".parquet"):
            continue
            
        print(f"Processing file number {file_no}: {file_name}")
        file_path = os.path.join(raw_embeddings_directory, file_name)
        parquet_file = pq.ParquetFile(file_path)
        
        for i in range(parquet_file.num_row_groups):
            chunk = parquet_file.read_row_group(i).to_pandas()
            filtered_chunk = chunk[chunk["category"].isin(categories_to_keep)]
            filtered_chunk = filtered_chunk[["topological_embedding", "id"]]
            filtered_chunk = filtered_chunk.dropna(subset=["id", "topological_embedding"])
            
            # Convert to pyarrow table with explicit schema
            table = pa.Table.from_pandas(filtered_chunk, schema=schema)
            writer.write_table(table)

Filtered embeddings already exist, deleting...
Processing file number 0: part-00006-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 1: part-00011-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 2: part-00001-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 3: part-00000-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 4: part-00007-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 5: part-00010-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 6: part-00002-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 7: part-00009-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 8: part-00015-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing file number 9: part-00005-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet
Processing fi

`filtered_embeddings_path` now contains topological embeddings of drugs and diseases.

Maya collects the set of node ids that appear in the ground truth, so that she can check how many of them have an embedding.

In [106]:
ground_truth_ids = set(df_tn["target"].unique()) | set(df_tn["source"].unique()) | set(df_tp["target"].unique()) | set(df_tp["source"].unique())


In [107]:
embeddings_df = pd.read_parquet(filtered_embeddings_path)
embeddings_df.head()

Unnamed: 0,topological_embedding,id
0,"[0.0002950218, 0.0009237228, 0.00020888452, 0....",FOODON:03530192
1,"[0.11091984, -0.014966668, 0.08064084, 0.09933...",GTOPDB:12689
2,"[0.06940694, -0.015641756, 0.23790395, 0.32063...",GTOPDB:313
3,"[0.08519988, 0.07789688, 0.06879597, 0.0517256...",GTOPDB:450
4,"[0.17523064, 0.15815642, 0.23141605, 0.1906943...",GTOPDB:5527


In [108]:
# Fraction of ground truth ids that have an embedding

len(ground_truth_ids.intersection(embeddings_df["id"].unique())) / len(ground_truth_ids)


0.9672624018707199

In [109]:
df_tn_filtered = df_tn[df_tn["target"].isin(embeddings_df["id"]) & df_tn["source"].isin(embeddings_df["id"])]
df_tp_filtered = df_tp[df_tp["target"].isin(embeddings_df["id"]) & df_tp["source"].isin(embeddings_df["id"])]

In [None]:
# check how many of the ground truth pairs are left
(len(df_tn_filtered) + len(df_tp_filtered)) / (len(df_tn) + len(df_tp))
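
Node-level coverage and pair-level retention are different quantities: a pair is dropped as soon as either of its two ids lacks an embedding. A toy sketch with made-up ids shows how the two can diverge.

```python
import pandas as pd

# Toy ground truth pairs and a toy set of ids that have an embedding
toy_pairs = pd.DataFrame({"source": ["A", "B", "C"], "target": ["X", "Y", "X"]})
ids_with_embedding = {"A", "B", "X"}

# Node-level coverage: share of distinct ids that have an embedding
all_ids = set(toy_pairs["source"]) | set(toy_pairs["target"])
node_coverage = len(all_ids & ids_with_embedding) / len(all_ids)

# Pair-level retention: share of pairs where BOTH ids have an embedding
keep = toy_pairs["source"].isin(ids_with_embedding) & toy_pairs["target"].isin(
    ids_with_embedding
)
pair_retention = keep.mean()

print(f"Node coverage: {node_coverage:.2f}")
print(f"Pair retention: {pair_retention:.2f}")
```

In this toy case 60% of the ids are covered, but only one pair in three survives the filter.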


Maya has reduced the embeddings from 24GB to 3GB by keeping only the relevant drugs and diseases. About 97% of the drug and disease ids in the ground truth data have an embedding in the filtered set, which she considers satisfactory.

Now, she can proceed to creating her model. 

- `embeddings_df` is the filtered embeddings plus node ids
- `df_tn_filtered` and `df_tp_filtered` are the ground truth data, filtered down to only include rows with a drug and disease that have an embedding


## Prepare data for modelling

Maya will now create a dataset that combines the embeddings and the ground truth data.


In [110]:
# Concatenate true positives and negatives, adding label column
df_model = pd.concat([
    df_tp_filtered.assign(label=1),
    df_tn_filtered.assign(label=0)
]).reset_index(drop=True)

# Join with embeddings to get source and target embeddings
df_model = (
    df_model
    .merge(
        embeddings_df[['id', 'topological_embedding']],
        left_on='source',
        right_on='id',
        how='left'
    )
    .drop('id', axis=1)
    .rename(columns={'topological_embedding': 'source_embedding'})
    .merge(
        embeddings_df[['id', 'topological_embedding']], 
        left_on='target',
        right_on='id',
        how='left'
    )
    .drop('id', axis=1)
    .rename(columns={'topological_embedding': 'target_embedding'})
)

print(f"Final dataset shape: {df_model.shape}")
df_model.head()


Final dataset shape: (47450, 5)


Unnamed: 0,source,target,label,source_embedding,target_embedding
0,CHEBI:3699,MONDO:0007186,1,"[0.1749186, 0.43398678, 0.7452298, 0.65872484,...","[0.44671196, -0.3108637, 0.06650227, -0.133686..."
1,UNII:84H8Z9550J,MONDO:0007186,1,"[0.01990571, -0.08038043, 0.20040062, 0.161179...","[0.44671196, -0.3108637, 0.06650227, -0.133686..."
2,CHEBI:7915,MONDO:0007186,1,"[0.07026903, 0.65059316, 0.7452017, 0.7127575,...","[0.44671196, -0.3108637, 0.06650227, -0.133686..."
3,CHEBI:6375,MONDO:0007186,1,"[-0.02374817, 0.30726308, 0.9234296, 1.0461868...","[0.44671196, -0.3108637, 0.06650227, -0.133686..."
4,CHEBI:33130,MONDO:0007186,1,"[0.24461012, 0.22785787, -0.19872648, 0.362895...","[0.44671196, -0.3108637, 0.06650227, -0.133686..."
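
A left merge silently multiplies rows if an id occurs more than once on the right-hand side. A minimal sketch of guarding against that with the `validate` argument of `DataFrame.merge`, using toy data in which the id `"A"` is duplicated on purpose:

```python
import pandas as pd

toy_pairs = pd.DataFrame({"source": ["A", "B"], "target": ["X", "X"], "label": [1, 0]})
# "A" appears twice on purpose to trigger the check
toy_embeddings = pd.DataFrame(
    {"id": ["A", "A", "B", "X"], "topological_embedding": [[0.1], [0.2], [0.3], [0.4]]}
)

n_duplicate_ids = int(toy_embeddings["id"].duplicated().sum())

try:
    # many_to_one: raise if the join key is not unique in the right frame
    toy_pairs.merge(
        toy_embeddings, left_on="source", right_on="id", how="left", validate="many_to_one"
    )
    merge_ok = True
except pd.errors.MergeError as err:
    merge_ok = False
    print(f"Merge rejected: {err}")

# After dropping duplicate ids the same merge keeps one row per pair
deduplicated = toy_embeddings.drop_duplicates(subset="id")
merged = toy_pairs.merge(
    deduplicated, left_on="source", right_on="id", how="left", validate="many_to_one"
)
print(merged.shape)
```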


In [111]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score


def prepare_features(df):
    # Convert list embeddings to numpy arrays
    source_embeddings = np.vstack(df['source_embedding'].values)
    target_embeddings = np.vstack(df['target_embedding'].values)
    
    # Concatenate the embeddings horizontally
    return np.hstack([source_embeddings, target_embeddings])

# Prepare features
X = prepare_features(df_model)
y = df_model['label'].values
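
To see what `prepare_features` produces, here is the same logic applied to a toy frame with three-dimensional embeddings. The values are made up; only the shapes matter.

```python
import numpy as np
import pandas as pd


def prepare_features(df):
    # Stack the per-row embedding lists into 2-D arrays
    source_embeddings = np.vstack(df["source_embedding"].to_list())
    target_embeddings = np.vstack(df["target_embedding"].to_list())
    # One row per pair: [source embedding | target embedding]
    return np.hstack([source_embeddings, target_embeddings])


toy = pd.DataFrame(
    {
        "source_embedding": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "target_embedding": [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
    }
)

X_toy = prepare_features(toy)
print(X_toy.shape)
print(X_toy[0])
```

Two pairs with three-dimensional embeddings give a feature matrix of shape `(2, 6)`: each row is the source embedding followed by the target embedding.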

In [112]:
# Setup cross-validation
n_splits = 5
cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
model = LogisticRegression(
    random_state=42,
    max_iter=1000,  # Increased from default 100
)

# Perform cross-validation
cv_scores = cross_val_score(
    model, 
    X, 
    y, 
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1
)

# Print results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean AUC-ROC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

Cross-validation scores: [0.92268325 0.92757629 0.92734511 0.92554705 0.92701778]
Mean AUC-ROC: 0.926 (+/- 0.004)
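
Plain `KFold` does not guarantee that each fold has the same share of positives. If the labels are imbalanced, `StratifiedKFold` is a common alternative. The sketch below uses synthetic data and is only an illustration; it is not a statement about how the Matrix pipeline splits its data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic, imbalanced labels: 20 positives and 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)
# Shift the positive class so the toy problem is learnable
X_toy = rng.normal(size=(100, 4)) + y_toy[:, None]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each test fold receives the same number of positives
positives_per_fold = [
    int(y_toy[test_idx].sum()) for _, test_idx in cv.split(X_toy, y_toy)
]
print(f"Positives per test fold: {positives_per_fold}")

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=cv, scoring="roc_auc"
)
print(f"Mean AUC-ROC: {scores.mean():.3f}")
```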


In [113]:
final_model = LogisticRegression(
    random_state=42,
    max_iter=1000  # Same parameters as in cross-validation
)
final_model.fit(X, y)

In [None]:
# Next step: build a matrix of all drug-disease pairs and look up the embeddings for each pair
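
One possible way to build such a matrix is a cross join followed by a pivot. The sketch below is a toy illustration under stated assumptions: the drug and disease frames, their two-dimensional embeddings, the labels and `toy_model` are all made up, and `toy_model` merely stands in for `final_model`. It also assumes that drugs and diseases are available as separate lists, which the filtered embeddings above do not provide on their own because the `category` column was dropped.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical drug and disease embeddings (2 dimensions for readability)
toy_drugs = pd.DataFrame(
    {"id": ["DRUG:1", "DRUG:2"], "embedding": [[0.9, 0.1], [0.2, 0.8]]}
)
toy_diseases = pd.DataFrame(
    {"id": ["DIS:1", "DIS:2", "DIS:3"], "embedding": [[0.5, 0.5], [0.1, 0.9], [0.8, 0.2]]}
)


def pair_features(df):
    """Concatenate drug and disease embeddings into one feature row per pair."""
    drug = np.vstack(df["embedding_drug"].to_list())
    disease = np.vstack(df["embedding_disease"].to_list())
    return np.hstack([drug, disease])


# Cross join: one row per (drug, disease) combination
pairs = toy_drugs.merge(toy_diseases, how="cross", suffixes=("_drug", "_disease"))

# Toy model fitted on made-up labels, standing in for final_model
toy_labels = [1, 0, 1, 0, 1, 0]
toy_model = LogisticRegression(max_iter=1000).fit(pair_features(pairs), toy_labels)

# Probability of the positive class for every pair
pairs["score"] = toy_model.predict_proba(pair_features(pairs))[:, 1]

# Reshape to a drugs x diseases matrix of scores
score_matrix = pairs.pivot(index="id_drug", columns="id_disease", values="score")
print(score_matrix)
```

For the full set of drugs and diseases the cross join grows as the product of the two list sizes, so scoring in batches may be needed.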

