# Example notebook: easy access to the Matrix Knowledge Graph for notebook-based model development

*Note: you can freely change this notebook, because changes to it are not tracked by git.*

Let's first set up git so that you can easily pull new changes from the repo.

In [None]:
import subprocess
from getpass import getpass

gh_username = input("Please enter your GitHub username: ")
# getpass hides the token while you type, so it does not show up in the cell output
gh_token = getpass("Please enter your GitHub token: ")

# Construct the remote URL with embedded credentials
# (note: git stores this URL, including the token, in plain text in .git/config)
remote_url = f"https://{gh_username}:{gh_token}@github.com/everycure-org/matrix.git"
subprocess.run(["git", "remote", "set-url", "origin", remote_url], check=True)
print("Git remote URL updated successfully with your credentials")
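
If a username or token ever contains characters such as `@` or `/`, embedding it verbatim would produce a malformed URL. The sketch below shows one way to percent-encode the credentials first. `build_remote_url` is a hypothetical helper, not part of the repo, and the credentials are placeholders.

```python
from urllib.parse import quote


def build_remote_url(username, token, repo="everycure-org/matrix"):
    """Embed percent-encoded credentials in an HTTPS GitHub remote URL."""
    # safe="" makes quote() also encode "/" so it cannot break the URL structure
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{repo}.git"


# Placeholder credentials, for illustration only
example_url = build_remote_url("some-user", "tok/en@123")
print(example_url)
```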

In [None]:
# This loads various objects into the context, see
# https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html#kedro-line-magics
import os
RELEASE_VERSION = "v0.3.0"  # change the release version here
os.environ["RELEASE_VERSION"] = RELEASE_VERSION

%load_ext kedro.ipython
%reload_kedro  --env cloud

## Option A) Accessing datasets easily through the catalog

The Kedro catalog is now available: call `catalog` in a cell to see the list of available datasets.

```python
catalog.list("unified") # this function accepts any regular expression
```
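As a rough illustration of what a regular-expression filter does to a list of dataset names, the sketch below reimplements the idea with Python's `re` module. The `filter_names` helper and the example names are hypothetical and only stand in for the real catalog. The sketch assumes a case-insensitive search anywhere in the name, which may differ from what `catalog.list` does in your Kedro version.

```python
import re


def filter_names(names, pattern):
    """Return the names in which the regular expression matches anywhere."""
    compiled = re.compile(pattern, flags=re.IGNORECASE)
    return [name for name in names if compiled.search(name)]


# Hypothetical dataset names, for illustration only
example_names = [
    "integration.prm.unified_nodes",
    "integration.prm.unified_edges",
    "data_release.prm.bigquery_nodes",
]

unified = filter_names(example_names, "unified")    # substring-style match
nodes_only = filter_names(example_names, r"nodes$")  # anchored at the end of the name
print(unified)
print(nodes_only)
```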

> **Note:** `catalog.list` does not show dynamic datasets like `integration.int.{source}.edges`. This is a known issue we are trying to solve. You can still access them, however, for example by calling `catalog.load("integration.int.robokop.edges")`.
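
To make the idea of a dynamic dataset pattern more concrete, here is a minimal sketch of how a name such as `integration.int.robokop.edges` can be matched against `integration.int.{source}.edges`. The `match_pattern` helper is hypothetical and uses the standard library only; it is not how Kedro resolves such patterns internally, and it assumes placeholder values never contain a dot.

```python
import re


def match_pattern(pattern, name):
    """Match a dataset name against a `{placeholder}` pattern.

    Returns a dict of placeholder values, or None if the name does not match.
    """
    # Splitting on a capturing group keeps the placeholder names at odd indices
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text
        else:
            regex += f"(?P<{part}>[^.]+)"     # assumption: values contain no dots
    found = re.fullmatch(regex, name)
    return found.groupdict() if found else None


resolved = match_pattern("integration.int.{source}.edges", "integration.int.robokop.edges")
unresolved = match_pattern("integration.int.{source}.edges", "integration.int.robokop.nodes")
print(resolved, unresolved)
```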




In [None]:
nodes = catalog.load("data_release.prm.bigquery_nodes")
edges = catalog.load("data_release.prm.bigquery_edges")

In [None]:
nodes.groupBy("upstream_data_source").count().show()
edges.groupBy("upstream_data_source").count().show()

## Option B) Access data through BigQuery queries  

Note that these queries run on BigQuery and only the results are loaded into the notebook, as pandas DataFrames. The upside is that you can easily reduce the data to the columns and rows you need. The downside is that you cannot simply load the entire dataset into memory: pandas cannot handle data of that size.

Check the [BigQuery SQL documentation](https://cloud.google.com/bigquery/docs/introduction-sql) for more information.
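
As a sketch of the column and row reduction described above, the helpers below build a query string that selects only a few columns and caps the number of rows. `release_table` and `build_query` are hypothetical helpers, and `id` and `category` are hypothetical column names. The project and table name are the ones used in this notebook for release `v0.3.0`; the assumption that the BigQuery dataset name is the release version with dots replaced by underscores is inferred from that single example.

```python
def release_table(release_version, table, project="mtrx-hub-dev-3of"):
    """Build a fully qualified table name, assuming `release_<version>` datasets."""
    dataset = "release_" + release_version.replace(".", "_")
    return f"{project}.{dataset}.{table}"


def build_query(release_version, table, columns, limit=None):
    """Select only the given columns, optionally capping the row count."""
    sql = f"SELECT {', '.join(columns)} FROM `{release_table(release_version, table)}`"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


# `id` and `category` are hypothetical column names
query = build_query("v0.3.0", "nodes", ["id", "category"], limit=1000)
print(query)
```

The resulting string can be pasted into a `%%bigquery` cell.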

In [None]:
%%bigquery nodes_upstream_sources_count
SELECT upstream_data_source, COUNT(*) AS count
FROM `mtrx-hub-dev-3of.release_v0_3_0.nodes`
GROUP BY upstream_data_source

## Option C) Access data directly on GCS

Some people want to side-step the catalog / BigQuery and work directly on the GCS data.
The cell below shows how you can do this. We use polars for this example, but you can use any other library you want.
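
Since the GCS path embeds the release version, a small helper can keep it in sync with `RELEASE_VERSION`. This is a sketch only: `release_glob` is a hypothetical helper, and the bucket and folder layout are copied from the path used in this section. Other dataset folder names, such as `bigquery_edges`, would be an assumption by analogy with the catalog entries and are not verified here.

```python
BUCKET = "gs://mtrx-us-central1-hub-dev-storage"


def release_glob(release_version, dataset):
    """Build the parquet glob for a released dataset under the assumed layout."""
    return (
        f"{BUCKET}/kedro/data/releases/{release_version}"
        f"/datasets/release/prm/{dataset}/*.parquet"
    )


nodes_glob = release_glob("v0.3.0", "bigquery_nodes")
print(nodes_glob)
```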

In [None]:
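!uv pip install polars
import polars as pl

# scan_parquet is lazy: it defers reading the data until the query is collected
df = pl.scan_parquet("gs://mtrx-us-central1-hub-dev-storage/kedro/data/releases/v0.3.0/datasets/release/prm/bigquery_nodes/*.parquet")
# Resolve the schema to list the column names without loading the rows
# (recent polars versions warn when calling .columns on a LazyFrame)
df.collect_schema().names()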