# Modelling walkthrough

The purpose of this notebook is to show how a new, sample model with custom dependencies would be developed and integrated into the pipeline.

This notebook follows a hypothetical scenario where Machine Learning Engineer Maya is developing a new model, with the aim of generating her own predictions of drug-disease treatment efficacy scores. Maya is new to the EveryCure / Matrix ecosystem, and is learning as she goes.

In the end, she wants to train and submit a new model to the pipeline, and have it evaluated along with the other models.

## Modelling assumptions

Maya's goal is to train a new model that will predict the efficacy of drug-disease interactions.

**Embeddings from Knowledge Graph:** Maya knows that EveryCure has generated embeddings for biomedical knowledge graph nodes, which meaningfully encode semantics of the nodes. Many of those nodes are drugs and diseases between which she wants to predict treatment efficacy.

**Training Data:** Maya expects the training data to be a set of known positives and negatives, i.e. drug-disease pairs for which the treatment is known to be effective or ineffective.

**Evaluation:** Maya assumes that the model will be evaluated using AUC-ROC. She also assumes that she will need to perform train-validation splits on her data, and that Matrix's pipeline downstream will be able to further test the predictions of her model.

**Retrieving Data:** Importantly, Maya will retrieve the embeddings and training data from pipelines other than the modelling pipeline. She will avoid preprocessing the data herself as much as possible, relying on other resources provided by EveryCure.
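
To make the evaluation assumption above concrete: AUC-ROC can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it by brute-force pair counting. The function name `auc_roc` and the toy labels and scores are made up for illustration and are not part of the Matrix pipeline.

```python
from itertools import product


def auc_roc(labels, scores):
    """AUC-ROC by pair counting: P(positive score > negative score), ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")
    wins = 0.0
    for pos_score, neg_score in product(positives, negatives):
        if pos_score > neg_score:
            wins += 1.0
        elif pos_score == neg_score:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = treats, 0 = does not treat) and model scores
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.4, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
print(f"AUC-ROC: {toy_auc:.3f}")
```

Pair counting is quadratic in the number of examples, so it only suits small illustrations; a library implementation is the better choice on real data.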


# Prerequisites

Maya needs access to the GCS bucket containing the data (currently, `gs://mtrx-us-central1-hub-dev-storage`). Maya will use `gsutil` to copy the data to her local machine, and will find the exact paths to the files in the Kedro Data Catalog.

## Required data sources

### Embeddings 

Maya will use the embeddings generated by the Knowledge Graph pipeline to encode the drugs and diseases into a vector space.

In the embeddings pipeline, embeddings are extracted from Neo4j and saved to GCS. The node that does this is defined in the [embeddings pipeline](https://github.com/matrix-ml/matrix/blob/main/pipelines/matrix/src/matrix/pipelines/embeddings/pipeline.py):

```python
node(
    func=nodes.extract_node_embeddings,
    inputs={
        "nodes": "embeddings.model_output.graphsage",
        "string_col": "params:embeddings.write_topological_col",
    },
    outputs="embeddings.feat.nodes",
    name="extract_nodes_edges_from_db",
    tags=[
        "argowf.fuse",
        "argowf.fuse-group.topological_embeddings",
        "argowf.template-neo4j",
    ],
),
```

Kedro Dataset to which the embeddings are saved: 

```yml
embeddings.feat.nodes:
  <<: *_spark_parquet
  filepath: ${globals:paths.embeddings}/feat/nodes_with_embeddings
```


Maya knows that `${globals:paths.embeddings}/feat/nodes_with_embeddings` resolves to `gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings`.
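
The substitution can be mimicked in plain Python. This is only an illustration of the string replacement, not how Kedro resolves the placeholder internally. The helper `resolve_globals` is hypothetical, and the value of `paths.embeddings` is inferred from the resolved path above.

```python
import re

# Assumed value of the `paths.embeddings` global, inferred from the resolved path above
GLOBALS = {
    "paths.embeddings": (
        "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/"
        "v0.2.4-rc.1/datasets/embeddings"
    ),
}


def resolve_globals(template: str, values: dict) -> str:
    """Replace every ${globals:<key>} placeholder with its value (hypothetical helper)."""
    return re.sub(r"\$\{globals:([^}]+)\}", lambda match: values[match.group(1)], template)


resolved = resolve_globals("${globals:paths.embeddings}/feat/nodes_with_embeddings", GLOBALS)
print(resolved)
```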

She arbitrarily chooses to use the latest version of the embeddings, which happens to be `v0.2.4-rc.1`. She will run the command below to copy the data to her local machine.

The files take up about 24 GB of disk space.
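
The part files in the copy command below differ only in their five-digit index, so the list of URIs can also be generated rather than typed out. The sketch below assumes exactly the 16 part files and the `_SUCCESS` marker named in that command; the variable names are illustrative.

```python
# Values copied from the copy command in this notebook
BASE_URI = (
    "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/"
    "v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings"
)
PART_SUFFIX = "1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet"
NUM_PARTS = 16

# part-00000 ... part-00015, zero-padded to five digits
part_uris = [f"{BASE_URI}/part-{index:05d}-{PART_SUFFIX}" for index in range(NUM_PARTS)]
all_uris = [f"{BASE_URI}/_SUCCESS", *part_uris]

for uri in all_uris:
    print(uri)
```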

In [None]:
!python --version


In [None]:
!mkdir -p data

!mkdir -p data/embeddings data/pca_embeddings

!gsutil -m cp \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/_SUCCESS" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00000-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00001-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00002-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00003-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00004-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00005-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00006-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00007-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00008-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00009-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00010-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00011-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00012-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00013-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00014-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings/part-00015-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet" \
  data/embeddings/
  

### Ground truth data

Maya needs to retrieve the training data produced by the preprocessing pipeline: the true positives and true negatives that she can use to train her model on the previously retrieved embeddings. The first input to the other modelling pipelines is `modelling.raw.ground_truth.positives@spark`, so Maya will retrieve that dataset first, together with its negative counterpart `modelling.raw.ground_truth.negatives@spark`.

```python
node(
    func=nodes.create_int_pairs,
    inputs=[
        "embeddings.feat.nodes",
        "modelling.raw.ground_truth.positives@spark",
        "modelling.raw.ground_truth.negatives@spark",
    ],
    outputs="modelling.int.known_pairs@spark",
    name="create_int_known_pairs",
),
```

Maya retrieves the ground truth data (conflated true positives and true negatives) from GCS. Both files were produced by the `preprocessing` pipeline as the datasets `modelling.raw.ground_truth.positives@pandas` and `modelling.raw.ground_truth.negatives@pandas`, and the modelling steps read them in as `@spark` dataframes. Maya will run the command below to copy the data to her local machine. As in the previous step, the choice of version is arbitrary.

Maya sees that other files live alongside the `*_conflated.tsv` files, and decides to download and investigate them.

In [None]:
!mkdir -p data/known_pairs

!gsutil -m cp \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/01_raw/ground_truth/translator/v2.7.3/tn_pairs_conflated.tsv" \
  "gs://mtrx-us-central1-hub-dev-storage/kedro/data/01_raw/ground_truth/translator/v2.7.3/tp_pairs_conflated.tsv" \
  data/known_pairs/


# Preprocessing

By now, Maya has obtained the embeddings and ground truth data. She will now preprocess the data to create the input for her model. She will also need to create splits for cross-validation.
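
As a reminder of what cross-validation splits look like, here is a minimal sketch using scikit-learn's `StratifiedKFold` on mock labels. The number of folds, the random seed and the class balance are assumptions made for illustration; the Matrix pipeline may use a different splitting strategy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Mock labels: 30 positive and 70 negative pairs (made-up class balance)
y_mock = np.array([1] * 30 + [0] * 70)
X_mock = np.arange(len(y_mock)).reshape(-1, 1)  # placeholder features, one row per pair

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(splitter.split(X_mock, y_mock))

for fold_no, (train_idx, val_idx) in enumerate(folds):
    # Stratification keeps the positive rate of each fold close to that of the full dataset
    positive_rate = y_mock[val_idx].mean()
    print(f"fold {fold_no}: train={len(train_idx)}, val={len(val_idx)}, val positive rate={positive_rate:.2f}")
```

Every pair lands in the validation set of exactly one fold, so each pair is scored once by a model that never saw it during training.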

Maya will first inspect the ground truth data.

In [None]:
import pandas as pd
# True positives
df_tp = pd.read_csv("data/known_pairs/tp_pairs_conflated.tsv", sep="\t")
df_tp.head()

In [None]:
# True negatives

df_tn = pd.read_csv("data/known_pairs/tn_pairs_conflated.tsv", sep="\t")
df_tn.head()


True positives and true negatives are represented as sets of source-target pairs. `source` is the drug, `target` is the disease.
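
For training, the two sets of pairs eventually need to live in one table with a label. A minimal sketch of that step on toy data follows. The node ids are hypothetical, the label column name `y` is an assumption, and the real files may contain more columns than `source` and `target`.

```python
import pandas as pd

# Hypothetical drug and disease ids, for illustration only
toy_tp = pd.DataFrame({"source": ["drug:A", "drug:B"], "target": ["disease:X", "disease:Y"]})
toy_tn = pd.DataFrame({"source": ["drug:A"], "target": ["disease:Y"]})

# Label positives with 1 and negatives with 0, then stack them into one table
known_pairs = pd.concat(
    [toy_tp.assign(y=1), toy_tn.assign(y=0)],
    ignore_index=True,
)
print(known_pairs)
```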

Now, Maya will inspect the embeddings that will be used as features for her model.


In [None]:
import os

raw_embeddings_directory = 'data/embeddings'
sample_file_name = "part-00000-1a0c48cc-af15-4889-8eaf-c6aa98d21c98-c000.snappy.parquet"
df = pd.read_parquet(os.path.join(raw_embeddings_directory, sample_file_name))

df.head()

In [None]:
df["category"].unique()

## Preprocess embeddings


In [None]:
!mkdir -p data/embeddings_filtered

First, Maya (a) keeps only the `id` and `topological_embedding` columns, which drops the PCA embeddings, and (b) removes all entities that are not drugs or diseases.

In [None]:
import pyarrow.parquet as pq
import pyarrow as pa

from pathlib import Path

filtered_embeddings_path = Path('data/embeddings_filtered/embeddings.parquet')

if filtered_embeddings_path.exists():
    print("Filtered embeddings already exist, deleting...")
    filtered_embeddings_path.unlink()

categories_to_keep = {"biolink:DiseaseOrPhenotypicFeature", "biolink:Drug", "biolink:Disease"}
columns_to_keep = ["topological_embedding", "id"]

# Skip non-parquet files such as _SUCCESS, and sort for a reproducible processing order
parquet_files = sorted(f for f in os.listdir(raw_embeddings_directory) if f.endswith(".parquet"))

# Derive the output schema from the first file.
# The pandas index is dropped so that every chunk is written with the same schema.
first_chunk = pd.read_parquet(os.path.join(raw_embeddings_directory, parquet_files[0]))
filtered_chunk = first_chunk[first_chunk["category"].isin(categories_to_keep)][columns_to_keep]
table = pa.Table.from_pandas(filtered_chunk, preserve_index=False)

with pq.ParquetWriter(filtered_embeddings_path, schema=table.schema) as writer:
    for file_no, file_name in enumerate(parquet_files):
        print(f"Processing file number {file_no}: {file_name}")
        parquet_file = pq.ParquetFile(os.path.join(raw_embeddings_directory, file_name))

        # Read one row group at a time to keep memory usage bounded
        for i in range(parquet_file.num_row_groups):
            chunk = parquet_file.read_row_group(i).to_pandas()
            filtered_chunk = chunk[chunk["category"].isin(categories_to_keep)][columns_to_keep]
            writer.write_table(pa.Table.from_pandas(filtered_chunk, preserve_index=False))


Maya reduces the ground truth data to a set of node ids, which she will use to filter the embeddings.

In [None]:
filter_ids = set(df_tn["target"].unique()) | set(df_tn["source"].unique()) | set(df_tp["target"].unique()) | set(df_tp["source"].unique())
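
A set like `filter_ids` can then be used to keep only the embedding rows that appear in the ground truth. A minimal sketch on toy data, with hypothetical ids and 2-dimensional vectors standing in for the real embeddings:

```python
import pandas as pd

# Hypothetical ids and embeddings, for illustration only
toy_embeddings = pd.DataFrame({
    "id": ["drug:A", "drug:B", "disease:X", "disease:Z"],
    "topological_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
})
toy_filter_ids = {"drug:A", "disease:X"}

# Boolean mask: True for rows whose id occurs in the ground truth
in_ground_truth = toy_embeddings["id"].isin(toy_filter_ids)
toy_filtered = toy_embeddings[in_ground_truth].reset_index(drop=True)
print(toy_filtered)
```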


In [None]:
df.head()

TODO (Mateusz):

1. Transform the ground truth data into `known_pairs`, the dataset that combines the embeddings with the ground truth data. Explain what `known_pairs` are and what goes into that dataset.
2. Establish where the drugs and diseases come from, how they are connected to the embeddings, and what comes out of Matrix.
3. Create splits. Explain what splits are.
4. Build a model that takes all of the above, and train it. Quantify the performance.

## Ground Truths
### EDA

## Modelling

Maya needs to create and train a new model.
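
Before turning to mock data, it helps to see how real features could be assembled from labelled pairs and embeddings. The sketch below concatenates the drug and disease embeddings of each pair into one feature row. Concatenation is one common choice and is assumed here; it is not confirmed as the approach used in Matrix. All ids, vectors and the `y` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and 2-dimensional embeddings, for illustration only
toy_embeddings = pd.DataFrame({
    "id": ["drug:A", "drug:B", "disease:X", "disease:Y"],
    "topological_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
})
toy_pairs = pd.DataFrame({
    "source": ["drug:A", "drug:B", "drug:A"],
    "target": ["disease:X", "disease:Y", "disease:Y"],
    "y": [1, 1, 0],
})

# Map each node id to its embedding vector
lookup = {
    node_id: np.asarray(vector, dtype=float)
    for node_id, vector in zip(toy_embeddings["id"], toy_embeddings["topological_embedding"])
}

# One feature row per pair: [drug embedding | disease embedding]
X_pairs = np.stack([
    np.concatenate([lookup[source], lookup[target]])
    for source, target in zip(toy_pairs["source"], toy_pairs["target"])
])
y_pairs = toy_pairs["y"].to_numpy()

print(X_pairs.shape, y_pairs.shape)
```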


In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Mock data for simplicity
np.random.seed(42)
num_samples = 100
num_features = 10

X = np.random.rand(num_samples, num_features)  # Features
true_weights = np.random.rand(num_features)    # True underlying weights
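# Threshold at the median so that both classes are present;
# a fixed 0.5 cutoff would label nearly every sample positive and make the fit fail
scores = X @ true_weights + np.random.normal(0, 0.1, num_samples)
y = scores > np.median(scores)  # Binary targets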

# Initialize and train a logistic regression model
model = LogisticRegression()
model.fit(X, y.astype(int))  # Convert y to integers for compatibility

# Predict on some test data
X_test = np.random.rand(10, num_features)
predictions = model.predict(X_test)

print("Predictions:", predictions)
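
To quantify performance rather than just print predictions, the model can be scored on a held-out validation split. The sketch below is self-contained and uses its own mock data; the split size, the random seed and the median threshold for the labels are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Mock data: a noisy linear score, thresholded at its median so both classes occur
rng = np.random.default_rng(42)
X_all = rng.random((200, 10))
weights = rng.random(10)
noisy_scores = X_all @ weights + rng.normal(0, 0.1, 200)
y_all = (noisy_scores > np.median(noisy_scores)).astype(int)

# Hold out 25% of the samples, keeping the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)

# AUC-ROC is computed from scores, so use the positive-class probability, not hard predictions
val_scores = clf.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_scores)
print(f"Validation AUC-ROC: {val_auc:.3f}")
```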