# Modelling walkthrough

The purpose of this notebook is to show how a new, sample model with custom dependencies would be developed and integrated into the pipeline.

This notebook follows a hypothetical scenario where Machine Learning Engineer Maya is developing a new model, with the aim of generating her own predictions of drug-disease treatment efficacy scores. Maya is new to the EveryCure / Matrix ecosystem, and is learning as she goes.

In the end, she wants to train and submit a new model to the pipeline, and have it evaluated along with the other models.

## Modelling assumptions

Maya's goal is to train a new model that will predict the efficacy of drug-disease interactions.

**Embeddings from Knowledge Graph:** Maya knows that EveryCure has generated embeddings for biomedical knowledge graph nodes, which meaningfully encode semantics of the nodes. Many of those nodes are drugs and diseases between which she wants to predict treatment efficacy.

**Training Data:** Maya expects the training data to be a set of known positives and negatives, i.e. drug-disease pairs for which the treatment is known to be effective or ineffective.

**Evaluation:** Maya assumes that the model will be evaluated using AUC-ROC. She also assumes that she will need to perform train-validation splits on her data, and that Matrix's pipeline downstream will be able to further test the predictions of her model.

**Retrieving Data:** Importantly, Maya will retrieve the embeddings and training data from pipelines other than the modelling pipeline. She will avoid preprocessing the data herself as much as possible, relying on other resources provided by EveryCure.


# Prerequisites

Maya needs access to the GCS bucket containing the data (currently, `gs://mtrx-us-central1-hub-dev-storage`). She could retrieve the data manually from the bucket, using tools such as `gsutil`, but a much better way is to use Kedro's API.

## Dev Environment

Maya will build her prototype using Kedro's Jupyter integration. It is a standard Jupyter environment, but with the Kedro context loaded into the notebook. To start it with the `cloud` environment (which gives access to the datasets in the cloud), Maya will run the command:

```
kedro jupyter notebook --env cloud
```

Importantly, she also set her `RELEASE_VERSION` in her `.env` file to `v0.2.4-rc.1`. She chose this release arbitrarily.

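Now Maya has access to Kedro datasets via the Kedro API. The `context` and `catalog` variables used below are made available by Kedro's notebook integration. For a full tutorial on using Kedro in notebooks, see [the official documentation](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html).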

In [1]:
context


KedroContext(
    project_path=PosixPath('/Users/mpw/Code/matrix/pipelines/matrix'),
    config_loader=OmegaConfigLoader(conf_source=/Users/mpw/Code/matrix/pipelines/matrix/conf, env=cloud, config_patterns={'catalog': ['catalog*', 'catalog*/**', '**/catalog*'], 'parameters': <BoxList: ['parameters*', 'parameters*/**', '**/parameters*', '**/parameters*/**']>, 'credentials': ['credentials*', 'credentials*/**', '**/credentials*'], 'globals': <BoxList: ['globals*', 'globals*/**', '**/globals*'

In [2]:
catalog.list("^embeddings.")


[
    'embeddings.feat.bucketized_nodes@spark',
    'embeddings.feat.bucketized_nodes@partitioned',
    'embeddings.feat.graph.node_embeddings@partitioned',
    'embeddings.feat.graph.node_embeddings@spark',
    'embeddings.feat.graph.pca_node_embeddings',
    'embeddings.feat.graph.edges_for_topological',
    'embeddings.tmp.input_nodes',
    'embeddings.tmp.input_edges',
    'embeddings.models.topological',
    'embeddings.model_output.topological',
    'embeddings.feat.nodes',
    'embeddings.reporting.loss',
    'embeddings.reporting.topological_pca',
    'embeddings.reporting.topological_pca_plot'
]


## Required data sources

### Embeddings 

Maya will use the embeddings generated by the Knowledge Graph pipeline to encode the drugs and diseases into a vector space.

The embeddings pipeline extracts the embeddings from Neo4j and saves them to GCS. The relevant node is defined in the [embeddings pipeline](https://github.com/matrix-ml/matrix/blob/main/pipelines/matrix/src/matrix/pipelines/embeddings/pipeline.py):

```python
node(
    func=nodes.extract_node_embeddings,
    inputs={
        "nodes": "embeddings.model_output.graphsage",
        "string_col": "params:embeddings.write_topological_col",
    },
    outputs="embeddings.feat.nodes",
    name="extract_nodes_edges_from_db",
    tags=[
        "argowf.fuse",
        "argowf.fuse-group.topological_embeddings",
        "argowf.template-neo4j",
    ],
),
```

Kedro Dataset to which the embeddings are saved: 

```yml
embeddings.feat.nodes:
  <<: *_spark_parquet
  filepath: ${globals:paths.embeddings}/feat/nodes_with_embeddings
```


Maya knows that `${globals:paths.embeddings}/feat/nodes_with_embeddings` resolves to `gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.2.4-rc.1/datasets/embeddings/feat/nodes_with_embeddings`, and this is where that dataset will be available in GCP. However, a much easier way to retrieve it is via the Kedro catalog.
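
To make the path resolution concrete, the sketch below rebuilds the GCS URI from its parts. The template layout and the helper name `resolve_embeddings_path` are assumptions inferred from the single resolved path quoted above; the actual value of `paths.embeddings` is defined in the project's globals configuration and may be composed differently.

```python
from string import Template

# Assumed layout, inferred from the resolved path quoted in the text above.
# The real value of `paths.embeddings` lives in the project's globals config.
EMBEDDINGS_ROOT = Template("$bucket/kedro/data/releases/$release/datasets/embeddings")


def resolve_embeddings_path(bucket: str, release: str, suffix: str) -> str:
    """Hypothetical helper: expand the embeddings root and append a dataset suffix."""
    root = EMBEDDINGS_ROOT.substitute(bucket=bucket, release=release)
    return f"{root}/{suffix}"


path = resolve_embeddings_path(
    bucket="gs://mtrx-us-central1-hub-dev-storage",
    release="v0.2.4-rc.1",
    suffix="feat/nodes_with_embeddings",
)
print(path)
```

Changing `RELEASE_VERSION` would, under this assumption, only change the `release` segment of the path.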


In [3]:
catalog.datasets.embeddings__feat__nodes

<matrix.datasets.gcp.SparkDatasetWithBQExternalTable object at 0x30484c8d0>

In [4]:
embeddings_spark = catalog.load("embeddings.feat.nodes")


In [5]:
embeddings_spark

DataFrame[<id>: bigint, <labels>: array<string>, topological_embedding: array<float>, pca_embedding: array<double>, id: string, category: string]

### Ground truth data

Maya needs to retrieve the training data produced by the preprocessing pipeline: known positive and known negative drug-disease pairs that she can use, together with the previously retrieved embeddings, to train her model. The other modelling pipelines take `modelling.raw.ground_truth.positives@spark` as an input, so Maya will retrieve that dataset first (together with its negative counterpart, `modelling.raw.ground_truth.negatives@spark`).

```python
node(
    func=nodes.create_int_pairs,
    inputs=[
        "embeddings.feat.nodes",
        "modelling.raw.ground_truth.positives@spark",
        "modelling.raw.ground_truth.negatives@spark",
    ],
    outputs="modelling.int.known_pairs@spark",
    name="create_int_known_pairs",
),
```

The ground truth data (conflated true positives and true negatives) lives in GCS. Both datasets were produced by the `preprocessing` pipeline, as `modelling.raw.ground_truth.positives@pandas` and `modelling.raw.ground_truth.negatives@pandas`, and the modelling steps read them in as `@spark` dataframes. Maya loads both through the Kedro catalog. As in the previous step, the release version used is arbitrary.

Maya sees that other files live alongside the `*_conflated.tsv` files, and decides to download and investigate them.

In [6]:
ground_truths_positives = catalog.load("modelling.raw.ground_truth.positives@spark")
ground_truths_negatives = catalog.load("modelling.raw.ground_truth.negatives@spark")

                                                                                

                                                                                

# Preprocessing

By now, Maya has obtained the embeddings and ground truth data. She will now preprocess the data to create the input for her model. She will also need to create splits for cross-validation.

Maya will first inspect the ground truth data.

In [7]:
ground_truths_positives_df = ground_truths_positives.toPandas()
ground_truths_negatives_df = ground_truths_negatives.toPandas()

                                                                                

In [8]:
# True positives
ground_truths_positives_df.head()

| | source | target |
|---|---|---|
| 0 | CHEBI:3699 | MONDO:0007186 |
| 1 | UNII:84H8Z9550J | MONDO:0007186 |
| 2 | CHEBI:7915 | MONDO:0007186 |
| 3 | CHEBI:6375 | MONDO:0007186 |
| 4 | CHEBI:33130 | MONDO:0007186 |


In [9]:
# True negatives
ground_truths_negatives_df.head()

| | source | target |
|---|---|---|
| 0 | CHEBI:32149 | MONDO:0006807 |
| 1 | CHEBI:32588 | MONDO:0006807 |
| 2 | CHEBI:6804 | MONDO:0007186 |
| 3 | CHEBI:8094 | MONDO:0007186 |
| 4 | CHEMBL.COMPOUND:CHEMBL1187846 | MONDO:0007186 |


True positives and true negatives are represented as sets of source-target pairs. `source` is the drug, `target` is the disease.

Maya does not want to load the entire PySpark dataframe of embeddings into memory. Before conducting any exploratory data analysis (EDA), she wants to drop the columns she does not need.

She also wants to see what biolink categories the embeddings might have.

In [10]:
from pyspark.sql.functions import col

# Get unique values from the "category" column
unique_categories = embeddings_spark.select(col("category")).distinct()

# Collect unique values to a list (will bring data to the driver)
unique_values = [row["category"] for row in unique_categories.collect()]

unique_values


[
    'biolink:AnatomicalEntity',
    'biolink:Drug',
    'biolink:ChemicalEntity',
    'biolink:SequenceVariant',
    'biolink:Gene',
    'biolink:Pathway',
    'biolink:Polypeptide',
    'biolink:Protein',
    'biolink:MolecularActivity',
    'biolink:GrossAnatomicalStructure',
    'biolink:DiseaseOrPhenotypicFeature',
    'biolink:PhysiologicalProcess',
    'biolink:Disease',
    'biolink:NucleicAcidEntity',
    'biolink:PhenotypicFeature',
    'biolink:BiologicalProcess',
    'biolink:GeneFamily',
    'biolink:Transcript',
    'biolink:CellularComponent',
    'biolink:ChemicalMixture',
    'biolink:MolecularMixture',
    'biolink:MolecularEntity',
    'biolink:Cell',
    'biolink:OrganismTaxon',
    'biolink:CellLine',
    'biolink:SmallMolecule'

## Preprocess embeddings


First, Maya (a) drops the `pca_embedding` column and (b) removes all entities that are not drugs or diseases.

In [11]:
from pyspark.sql import functions as F

# Define the categories to keep
categories_to_keep = {
    "biolink:DiseaseOrPhenotypicFeature", 
    "biolink:Drug", 
    "biolink:Disease", 
    "biolink:BehavioralFeature", 
    "biolink:SmallMolecule", 
    "biolink:PhenotypicFeature"
}

# Filter the PySpark DataFrame
filtered_df = embeddings_spark.filter(F.col("category").isin(*categories_to_keep)) \
    .select("topological_embedding", "id") \
    .na.drop(subset=["id", "topological_embedding"])

# Convert the filtered PySpark DataFrame to a Pandas DataFrame
embeddings_df = filtered_df.toPandas()

                                                                                

`embeddings_df` now contains the topological embeddings of drugs and diseases.

Maya reduces the ground truths to a set of node ids, which she uses to check embedding coverage before building the training data for her model.

In [16]:
ground_truth_ids = set(ground_truths_negatives_df["target"].unique()) | set(ground_truths_negatives_df["source"].unique()) | set(ground_truths_positives_df["target"].unique()) | set(ground_truths_positives_df["source"].unique())


In [17]:
embeddings_df.head()

| | topological_embedding | id |
|---|---|---|
| 0 | [-0.030589668, 0.10044213, 0.19094326, 0.16870... | ATC:L01XY01 |
| 1 | [-0.008864332, -0.022826852, 0.056452665, 0.07... | ATC:A10BD14 |
| 2 | [-0.0649597, 0.020033212, 0.106236674, 0.15575... | ATC:G03AA11 |
| 3 | [0.014651415, 0.009964101, 0.05925943, 0.06414... | ATC:C09DX02 |
| 4 | [-0.015761247, -0.045603417, 0.102571644, 0.13... | ATC:J05AP51 |


In [18]:
# calculate how many ground truths ids have an embedding

len(ground_truth_ids.intersection(embeddings_df["id"].unique())) / len(ground_truth_ids)


0.9672624018707199

In [20]:
df_tn_filtered = ground_truths_negatives_df[ground_truths_negatives_df["target"].isin(embeddings_df["id"]) & ground_truths_negatives_df["source"].isin(embeddings_df["id"])]
df_tp_filtered = ground_truths_positives_df[ground_truths_positives_df["target"].isin(embeddings_df["id"]) & ground_truths_positives_df["source"].isin(embeddings_df["id"])]

In [22]:
# check how many of the ground truth pairs are left
(len(df_tn_filtered) + len(df_tp_filtered)) / (len(ground_truths_negatives_df) + len(ground_truths_positives_df))


0.9346806918015995

Maya filtered the embeddings dataset down from 24GB to 3GB, keeping only relevant drugs and diseases. About 97% of the drugs and diseases in the ground truth data have an embedding, and about 93% of the ground truth pairs survive the filtering, which is satisfactory.
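
The two ratios computed above measure different things: the first counts nodes, the second counts pairs. A pair is kept only if both its drug and its disease have an embedding, so a single missing node removes every pair it appears in. The sketch below illustrates the difference on made-up identifiers; none of these values come from the real datasets.

```python
# Toy data (made up for illustration): three ground-truth pairs,
# one node ("drug:C") without an embedding.
pairs = [
    ("drug:A", "disease:X"),
    ("drug:B", "disease:X"),
    ("drug:C", "disease:Y"),
]
ids_with_embedding = {"drug:A", "drug:B", "disease:X", "disease:Y"}

# Node coverage: share of distinct ground-truth ids that have an embedding.
node_ids = {node for pair in pairs for node in pair}
node_coverage = len(node_ids & ids_with_embedding) / len(node_ids)

# Pair coverage: share of pairs where BOTH ends have an embedding.
kept_pairs = [
    (source, target)
    for source, target in pairs
    if source in ids_with_embedding and target in ids_with_embedding
]
pair_coverage = len(kept_pairs) / len(pairs)

print(f"node coverage: {node_coverage:.2f}")
print(f"pair coverage: {pair_coverage:.2f}")
```

Here four of five nodes are covered, but only two of three pairs survive, which mirrors the gap between the two figures Maya observed.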

Now, she can proceed to creating her model. 

- `embeddings_df` is the filtered embeddings plus node ids
- `df_tn_filtered` and `df_tp_filtered` are the ground truth data, filtered down to only include rows with a drug and disease that have an embedding


## Prepare data for modelling

Maya combines the filtered embeddings with the ground truth data to create a dataset for model training. She concatenates true positives and negatives, adding a label column.


In [None]:
import pandas as pd

# Concatenate true positives and negatives, adding label column
df_model = pd.concat([
    df_tp_filtered.assign(label=1),
    df_tn_filtered.assign(label=0)
]).reset_index(drop=True)

# Join with embeddings to get source and target embeddings
df_model = (
    df_model
    .merge(
        embeddings_df[['id', 'topological_embedding']],
        left_on='source',
        right_on='id',
        how='left'
    )
    .drop('id', axis=1)
    .rename(columns={'topological_embedding': 'source_embedding'})
    .merge(
        embeddings_df[['id', 'topological_embedding']], 
        left_on='target',
        right_on='id',
        how='left'
    )
    .drop('id', axis=1)
    .rename(columns={'topological_embedding': 'target_embedding'})
)

print(f"Final dataset shape: {df_model.shape}")
df_model.head()


## Prepare Dataframe with all drug-disease pairs

Maya creates a cartesian product of all unique drugs and diseases to generate every possible drug-disease combination that needs a prediction. This creates a comprehensive matrix of all possible pairs, regardless of whether they were in the training data or not.

The resulting matrix (visualised as a heatmap later in this notebook) allows for easy visualization of predicted relationships across the entire drug-disease space.

The code shows that this creates a large number of pairs (number of unique drugs × number of unique diseases), which is why Maya later implements batch processing to handle the predictions efficiently.
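
A back-of-the-envelope estimate shows why materialising features for every pair at once is impractical. The helper `estimate_feature_matrix_bytes` and all numbers below are hypothetical; the real counts of drugs and diseases and the embedding dimension are not stated in this notebook.

```python
def estimate_feature_matrix_bytes(
    n_drugs: int, n_diseases: int, embedding_dim: int, bytes_per_value: int = 4
) -> int:
    """Rough size of a dense feature matrix covering every drug-disease pair.

    Each pair contributes one row holding the drug embedding followed by the
    disease embedding, so a row has 2 * embedding_dim values.
    """
    n_pairs = n_drugs * n_diseases
    return n_pairs * 2 * embedding_dim * bytes_per_value


# Hypothetical numbers, chosen only to show the order of magnitude.
n_bytes = estimate_feature_matrix_bytes(n_drugs=2_000, n_diseases=3_000, embedding_dim=256)
print(f"{2_000 * 3_000:,} pairs -> about {n_bytes / 1e9:.1f} GB of float32 features")
```

Because the number of rows is a product, doubling either the number of drugs or the number of diseases doubles the memory needed.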


In [None]:
# Get unique diseases from both dataframes
all_diseases = pd.concat([
    df_tn_filtered["target"],
    df_tp_filtered["target"]
]).dropna().unique()

# Get unique drugs from both dataframes 
all_drugs = pd.concat([
    df_tn_filtered["source"],
    df_tp_filtered["source"]
]).dropna().unique()

print(f"Number of unique diseases: {len(all_diseases)}")
print(f"Number of unique drugs: {len(all_drugs)}")

In [None]:
# Create all possible combinations of drugs and diseases
all_pairs = pd.DataFrame(
    [(drug, disease) for drug in all_drugs for disease in all_diseases],
    columns=['source', 'target']
)

print(f"Total number of drug-disease pairs: {len(all_pairs):,}")
print(f"Shape of all pairs dataframe: {all_pairs.shape}")
all_pairs.head()


## Prepare features dataset


Maya prepares features by converting embeddings into numpy arrays and concatenating them to form input features for the model.

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score


def prepare_features(df):
    # Convert list embeddings to numpy arrays
    source_embeddings = np.vstack(df['source_embedding'].values)
    target_embeddings = np.vstack(df['target_embedding'].values)
    
    # Concatenate the embeddings horizontally
    return np.hstack([source_embeddings, target_embeddings])

# Prepare features
X = prepare_features(df_model)
y = df_model['label'].values

### Create Train / Test split

Maya splits the data into training and test sets, ensuring an 80/20 split, and stratifies the data based on labels.


In [None]:
# First, create a train/test split
from sklearn.model_selection import train_test_split

# Split into train and test sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
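
Stratifying on `y` keeps the share of positives roughly the same in the training and test sets. The sketch below illustrates the idea in plain Python on toy labels. It is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper written only for this illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Hypothetical helper: return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label_indices in indices_by_label.values():
        rng.shuffle(label_indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(label_indices) * test_size)
        test_idx.extend(label_indices[:n_test])
        train_idx.extend(label_indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 30 positives and 70 negatives.
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)

train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"positive rate: train={train_pos_rate:.2f}, test={test_pos_rate:.2f}")
```

Without stratification, a random split of imbalanced data can leave the test set with noticeably more or fewer positives than the training set, which makes the test AUC-ROC harder to interpret.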

# Training

Maya uses logistic regression for model training, performing cross-validation to evaluate the model's performance on the training data.

In [None]:
# Setup cross-validation on training data only
n_splits = 5
cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
model = LogisticRegression(
    random_state=42,
    max_iter=1000,
)

# Perform cross-validation on training data
cv_scores = cross_val_score(
    model, 
    X_train, 
    y_train, 
    cv=cv,
    scoring='roc_auc',
    n_jobs=-1
)

# Print cross-validation results
print("Cross-validation results (on training data):")
print(f"CV scores: {cv_scores}")
print(f"Mean AUC-ROC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")



In [None]:
# Train final model on all training data
final_model = LogisticRegression(
    random_state=42,
    max_iter=1000,
)
trained_model = final_model.fit(X_train, y_train)

# Evaluation

Maya evaluates the trained model on a held-out test set, calculating the AUC-ROC to assess its performance.


In [None]:
# Evaluate on held-out test set
test_predictions = trained_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_predictions)
print(f"\nFinal model performance on test set:")
print(f"Test AUC-ROC: {test_auc:.3f}")
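
To make the metric concrete: AUC-ROC equals the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it directly from that definition on made-up scores. It is an illustration only, not how `roc_auc_score` is implemented, and the quadratic loop would be too slow for real data.

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n_pos * n_neg) loop is for illustration only.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")

    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.2]
print(auc_roc(y_true, scores))
```

Three of the four positive-negative comparisons are ordered correctly here. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.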

## Produce model predictions

Maya generates efficacy score predictions for all possible drug-disease pairs using a batching function to handle large data efficiently.

In [None]:
trained_model

Now Maya wants to generate efficacy score predictions for all drug-disease pairs. To avoid building the features for every pair in memory at once, she creates a batching function.

In [None]:
def batch_pairs_with_embeddings(all_pairs, embeddings_df, batch_size=10000):
    """
    Generate batches of drug-disease pairs with their embeddings.
    
    Args:
        all_pairs (pd.DataFrame): DataFrame with 'source' and 'target' columns
        embeddings_df (pd.DataFrame): DataFrame with 'id' and 'topological_embedding' columns
        batch_size (int): Number of pairs to process in each batch
        
    Yields:
        np.array: Array of concatenated source and target embeddings for the batch
        pd.DataFrame: Corresponding batch of pairs
    """
    # Create embeddings lookup dictionary for faster access
    embeddings_dict = dict(zip(embeddings_df['id'], embeddings_df['topological_embedding']))
    
    # Process pairs in batches
    for start_idx in range(0, len(all_pairs), batch_size):
        end_idx = min(start_idx + batch_size, len(all_pairs))
        batch_pairs = all_pairs.iloc[start_idx:end_idx]
        
        # Get embeddings for the batch
        source_embeddings = np.vstack([
            embeddings_dict[source] for source in batch_pairs['source']
        ])
        target_embeddings = np.vstack([
            embeddings_dict[target] for target in batch_pairs['target']
        ])
        
        # Concatenate embeddings
        batch_features = np.hstack([source_embeddings, target_embeddings])
        
        yield batch_features, batch_pairs

# Score every pair batch by batch and collect the results
def predict_all_pairs(model, all_pairs, embeddings_df, batch_size=10000):
    """
    Generate predictions for all drug-disease pairs.
    
    Args:
        model: Trained model with predict_proba method
        all_pairs (pd.DataFrame): DataFrame with all drug-disease pairs
        embeddings_df (pd.DataFrame): DataFrame with embeddings
        batch_size (int): Batch size for processing
        
    Returns:
        pd.DataFrame: Original pairs with prediction scores
    """
    all_processed_pairs = []
    
    for batch_features, batch_pairs in batch_pairs_with_embeddings(all_pairs, embeddings_df, batch_size):
        # Get predictions for the batch
        batch_predictions = model.predict_proba(batch_features)[:, 1]
        
        # Store results
        batch_results = batch_pairs.copy()
        batch_results['prediction_score'] = batch_predictions
        all_processed_pairs.append(batch_results)
            
    # Combine all results
    final_results = pd.concat(all_processed_pairs, ignore_index=True)
    return final_results
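
The slicing pattern at the heart of `batch_pairs_with_embeddings` can be checked in isolation. In the sketch below, `iter_batches` is a hypothetical stand-in that uses a plain list instead of a DataFrame, and the pairs are made up.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy input: 25 pairs processed 10 at a time.
pairs = [(f"drug:{i}", "disease:X") for i in range(25)]
batches = list(iter_batches(pairs, batch_size=10))

print([len(batch) for batch in batches])
print(len(batches) == math.ceil(len(pairs) / 10))
```

The last batch is simply shorter than the others, so no pair is dropped or scored twice. Only one batch of features needs to be held in memory at a time, while the full embeddings lookup stays resident throughout.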

In [None]:
# Generate predictions
results_df = predict_all_pairs(
    model=trained_model,
    all_pairs=all_pairs,
    embeddings_df=embeddings_df,
    batch_size=50000
)

# View results
print(f"Generated predictions for {len(results_df):,} pairs")
print("\nSample predictions:")
print(results_df.sort_values('prediction_score', ascending=False).head())


## Evaluate Prediction Score distributions

Maya visualizes the distribution of prediction scores using histograms and density plots to understand the model's output.

### Test results

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create figure and axes
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))

# Histogram
sns.histplot(
    data=test_predictions,
    bins=50,
    ax=ax1
)
ax1.set_title('Distribution of Prediction Scores (Histogram)')
ax1.set_xlabel('Prediction Score')
ax1.set_ylabel('Count')

# Separate distributions for positive and negative classes
sns.kdeplot(
    data=test_predictions[y_test == 1],
    ax=ax2,
    label='Positive Class',
    color='green'
)
sns.kdeplot(
    data=test_predictions[y_test == 0],
    ax=ax2,
    label='Negative Class',
    color='red'
)
ax2.set_title('Distribution of Prediction Scores by Class (Density)')
ax2.set_xlabel('Prediction Score')
ax2.set_ylabel('Density')
ax2.legend()

plt.tight_layout()
plt.show()

# Print some basic statistics
print("\nPrediction Score Statistics:")
print(f"Mean: {test_predictions.mean():.3f}")
print(f"Median: {np.median(test_predictions):.3f}")
print(f"Std Dev: {test_predictions.std():.3f}")
print(f"Min: {test_predictions.min():.3f}")
print(f"Max: {test_predictions.max():.3f}")

## Full Matrix predictions

Maya first plots the distribution of prediction scores across all drug-disease pairs, and then creates a heatmap for a random sample of drug-disease pairs to get a closer look at the model's predictions.


In [None]:
# Create figure and axes
fig, ax = plt.subplots(figsize=(12, 6))

# Overall distribution of all predictions
sns.kdeplot(
    data=results_df['prediction_score'],
    ax=ax,
    label='All Predictions',
    color='blue'
)

ax.set_title('Distribution of Prediction Scores for All Drug-Disease Pairs')
ax.set_xlabel('Prediction Score')
ax.set_ylabel('Density')
ax.legend()

plt.tight_layout()
plt.show()

# Print basic statistics for all predictions
print("\nPrediction Score Statistics for All Pairs:")
print(f"Mean: {results_df['prediction_score'].mean():.3f}")
print(f"Median: {results_df['prediction_score'].median():.3f}")
print(f"Std Dev: {results_df['prediction_score'].std():.3f}")
print(f"Min: {results_df['prediction_score'].min():.3f}")
print(f"Max: {results_df['prediction_score'].max():.3f}")
print(f"Total number of predictions: {len(results_df):,}")

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Randomly sample 20 drugs and 20 diseases
np.random.seed(42)
sample_drugs = np.random.choice(all_drugs, size=20, replace=False)
sample_diseases = np.random.choice(all_diseases, size=20, replace=False)

# Filter results for sampled drugs and diseases
sample_results = results_df[
    results_df['source'].isin(sample_drugs) & 
    results_df['target'].isin(sample_diseases)
]

# Create prediction matrix
pred_matrix = sample_results.pivot(
    index='source', 
    columns='target', 
    values='prediction_score'
)

# Create heatmap
plt.figure(figsize=(15, 12))
sns.heatmap(
    pred_matrix,
    cmap='YlOrRd',
    center=0.5,
    vmin=0,
    vmax=1,
    cbar_kws={'label': 'Prediction Score'}
)
plt.title('Drug-Disease Prediction Scores Heatmap\n(20 random drugs and diseases)')
plt.xlabel('Diseases')
plt.ylabel('Drugs')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Print some statistics about the sampled predictions
print("\nSample Prediction Statistics:")
print(f"Mean: {sample_results['prediction_score'].mean():.3f}")
print(f"Median: {sample_results['prediction_score'].median():.3f}")
print(f"Std Dev: {sample_results['prediction_score'].std():.3f}")

## Evaluation conclusion

Maya generated a full matrix of drug-disease treatment efficacy scores.

We can already see that the model is far from perfect, and that it exhibits some of the issues our more advanced models have run into: many drugs and diseases are "frequent flyers", with consistently high scores across the board. Maya can also see that far too many drug-disease pairs have treat scores close to 1.
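
One simple way to put a number on the "frequent flyer" effect is to average each drug's score over all diseases and flag drugs whose average is suspiciously high. The sketch below does this on made-up scores. The helper name and the threshold are hypothetical, and this diagnostic is not part of the Matrix pipeline.

```python
from collections import defaultdict
from statistics import mean


def mean_score_per_drug(rows):
    """Average prediction score for each drug across all diseases it was scored against."""
    scores_by_drug = defaultdict(list)
    for drug, _disease, score in rows:
        scores_by_drug[drug].append(score)
    return {drug: mean(scores) for drug, scores in scores_by_drug.items()}


# Made-up scores: "drug:A" scores high against every disease, "drug:B" does not.
rows = [
    ("drug:A", "disease:X", 0.95),
    ("drug:A", "disease:Y", 0.90),
    ("drug:A", "disease:Z", 0.97),
    ("drug:B", "disease:X", 0.10),
    ("drug:B", "disease:Y", 0.85),
    ("drug:B", "disease:Z", 0.05),
]

threshold = 0.8  # arbitrary cut-off, chosen only for this example
means = mean_score_per_drug(rows)
frequent_flyers = sorted(drug for drug, avg in means.items() if avg > threshold)
print(frequent_flyers)
```

On Maya's real output, the same aggregation could be written as `results_df.groupby('source')['prediction_score'].mean()`, and the analogous grouping on `target` would surface frequent-flyer diseases.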

However, her model is only a basic logistic regression, and for the sake of this exercise we will not focus on improving the results she has obtained.

Now Maya will add her new model as a Kedro node.





# Integrating model with Kedro

Now that Maya has created and prototyped her model, she wants to integrate it with EveryCure's Matrix repository, and to train and run it as part of the `matrix` pipeline.

### Integration with Kedro Pipeline

1. **Pipeline Overview**:
   - Data is prepared and preprocessed in `raw`, `kg_raw`, `ingestion`, `integration`, and `embeddings` pipelines.
   - These datasets are consumed by downstream `modelling`, `evaluation`, `matrix_generation`, and `inference` pipelines.
   - Kedro handles many steps automatically, such as sharding and working with Apache Spark datasets.

2. **Model Configuration**:
   - Models are injected into the Kedro pipeline via a dependency injection mechanism.
   - Maya adds her model to `DYNAMIC_PIPELINES_MAPPING` in `pipelines/matrix/src/matrix/settings.py`:

   ```python
   DYNAMIC_PIPELINES_MAPPING = {
       "modelling": [
           {"model_name": "xg_baseline", "num_shards": 1, "run_inference": False},
           {"model_name": "xg_ensemble", "num_shards": 3, "run_inference": True},
           {"model_name": "rf", "num_shards": 1, "run_inference": False},
           {"model_name": "xg_synth", "num_shards": 1, "run_inference": False},
           {"model_name": "mayas_logistic_regression", "num_shards": 1, "run_inference": False},
       ],
       "evaluation": [
   ```

3. **Model Parameters**:
   - Configure parameters in `pipelines/matrix/conf/base/modelling/parameters/mayas_logistic_regression.yml`:

   ```yml
   modelling.mayas_logistic_regression:
       _overrides:
         model_tuning_args:
           tuner:
             object: matrix.pipelines.modelling.tuning.NopTuner
             estimator:
               object: sklearn.linear_model.LogisticRegression
               random_state: ${globals:random_state}
               device: cpu # TODO: Add cuda
           features:
             - source_+
             - target_+
           target_col_name: y
       model_options: ${merge:${.._model_options},${._overrides}}
   ```

   - **Key Points**:
     - The `estimator` is set to `sklearn.linear_model.LogisticRegression`. If she wanted to use an actual custom model object (one she would define and customise herself), she could reference its `sklearn`-compliant class here.
     - Uses `NopTuner` in place of a hyperparameter tuner (the name suggests a no-op, i.e. the estimator is used as configured).
     - The parameters listed under `estimator` are automatically passed to, and injected into, the model.
     - Following the same logic, `modelling.<MODEL_NAME>.model_tuning_args` defines the parameters of the estimator object, so any parameter that `sklearn.linear_model.LogisticRegression` accepts (e.g. `penalty`, `C`, `class_weight`) can be passed via the config file. For the full list of parameters, have a look at the [official documentation](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html#logisticregression).
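
The idea behind config-driven injection can be sketched in a few lines: read the dotted path under `object`, import the class, and pass the remaining keys as keyword arguments. The `instantiate` helper below is hypothetical and only illustrates the pattern; Matrix's own injection mechanism may work differently. A standard-library class stands in for the estimator so that the sketch runs without scikit-learn installed.

```python
import importlib


def instantiate(config: dict):
    """Hypothetical helper: build an object from a config entry.

    The `object` key holds a dotted import path; every other key is passed
    to the constructor as a keyword argument.
    """
    kwargs = dict(config)  # copy so the caller's config is left untouched
    module_path, _, class_name = kwargs.pop("object").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)


# Standard-library stand-in so the sketch runs without scikit-learn installed.
delta = instantiate({"object": "datetime.timedelta", "days": 1, "hours": 12})
print(delta.total_seconds())
```

With scikit-learn installed, calling this helper with `{"object": "sklearn.linear_model.LogisticRegression", "C": 0.5, "class_weight": "balanced"}` would return a configured estimator in the same way. A key that the constructor does not accept would raise a `TypeError` at instantiation time.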





## Running model

Now that the model has been added to the modelling suite, it will be trained and used to generate the matrix when the pipeline is executed.