# Use published tables from Python

To access the published tables from Python, the DuckDB Python package is a convenient choice. [Google Colab](https://colab.research.google.com/) ships with the `duckdb` package preinstalled, so `import duckdb` is all you need there. For other environments, see the [DuckDB Python client documentation](https://duckdb.org/docs/stable/clients/python/overview).

## Accessing the published tables

In [1]:
import duckdb

con = duckdb.connect()

# Attach Layer 1 database and use it
con.sql("""
ATTACH 'http://ep.dbcls.jp/tabulae/layer1.duckdb' (READ_ONLY);
USE layer1;
""")


## Show table list

In [2]:
con.sql("""
SHOW TABLES;
""")

┌───────────────────────────────────────────────────────────┐
│                           name                            │
│                          varchar                          │
├───────────────────────────────────────────────────────────┤
│ chembl_compound_alogp                                     │
│ chembl_compound_atc_classification                        │
│ chembl_compound_chebi                                     │
│ chembl_compound_drug_indication_highest_development_phase │
│ chembl_compound_drug_indication_mesh                      │
│ chembl_compound_hba                                       │
│ chembl_compound_hbd                                       │
│ chembl_compound_mw                                        │
│ chembl_compound_psa                                       │
│ chembl_compound_pubchem                                   │
│        ·                                                  │
│        ·                                                  │
│       

## Query a table

In [3]:
con.sql("""
SELECT * FROM 'chembl_compound_atc_classification' LIMIT 5;
""")

┌────────────────────┬────────────┬─────────┐
│ chembl_compound_id │   label    │   atc   │
│      varchar       │  varchar   │ varchar │
├────────────────────┼────────────┼─────────┤
│ CHEMBL1027         │ TIAGABINE  │ N03AG06 │
│ CHEMBL1089         │ PHENELZINE │ N06AF03 │
│ CHEMBL115          │ INDINAVIR  │ J05AE02 │
│ CHEMBL11672        │ ROQUINIMEX │ L03AX02 │
│ CHEMBL1171837      │ PONATINIB  │ L01EA05 │
└────────────────────┴────────────┴─────────┘

## Join tables and convert to DataFrame

In [4]:
con.sql("""
FROM 'chembl_compound_ro5'
NATURAL JOIN 'chembl_compound_alogp'
NATURAL JOIN 'chembl_compound_mw'
NATURAL JOIN 'chembl_compound_hba'
NATURAL JOIN 'chembl_compound_hbd';
""").to_df()


        chembl_compound_id  ro5  alogp      mw  hba  hbd
0             CHEMBL204011  0.0   4.18  370.38  4.0  1.0
1             CHEMBL204015  0.0   4.84  406.48  4.0  1.0
2             CHEMBL204016  1.0   3.99  616.69  9.0  3.0
3             CHEMBL204017  0.0   4.06  324.45  4.0  2.0
4             CHEMBL204019  0.0   1.81  164.20  2.0  1.0
...                    ...  ...    ...     ...  ...  ...
1886753        CHEMBL97074  0.0   3.63  375.49  4.0  1.0
1886754        CHEMBL98199  0.0   4.98  448.61  4.0  1.0
1886755        CHEMBL98239  2.0   3.76  707.98  8.0  6.0
1886756        CHEMBL98787  0.0   0.12  376.20  6.0  5.0
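`NATURAL JOIN` matches rows on every column name the two tables share, and keeps only the rows that have a match on both sides. The following is a minimal sketch of that behavior on invented toy tables. It uses Python's built-in `sqlite3` module only so that it runs without extra packages; the table names, column names, and values are assumptions for illustration and are not part of the Layer 1 database.

```python
import sqlite3

# In-memory toy database (hypothetical tables, not the published ones).
toy_con = sqlite3.connect(":memory:")
toy_con.executescript("""
CREATE TABLE toy_mw  (compound_id TEXT, mw REAL);
CREATE TABLE toy_hba (compound_id TEXT, hba INTEGER);
INSERT INTO toy_mw  VALUES ('C1', 370.38), ('C2', 406.48);
INSERT INTO toy_hba VALUES ('C1', 4), ('C3', 9);
""")

# The only shared column name is compound_id, so it becomes the join key.
# C2 and C3 each appear in only one table and therefore drop out.
joined = toy_con.execute(
    "SELECT * FROM toy_mw NATURAL JOIN toy_hba"
).fetchall()
print(joined)
```

Because the join key is implicit, any additional column name that two tables happen to share also becomes part of the join condition, so it is worth checking the column names of the tables before chaining several natural joins.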


# Train ML models with AutoGluon

The following example joins three tables and takes the first letter of the ATC code as the prediction target. It then builds a one-hot feature vector for each compound with the `PIVOT` operation. Calling `.to_df()` on the result returns a pandas DataFrame, which we use to train a predictor with [AutoGluon](https://auto.gluon.ai/stable/index.html).

In [5]:
# Install AutoGluon
!pip install autogluon > /dev/null
from autogluon.tabular import TabularDataset, TabularPredictor

## Join tables

In [6]:
combined = con.sql("""
SELECT
  chembl_compound_id, tissue_label, atc[1] AS target
FROM 'chembl_compound_atc_classification'
NATURAL JOIN 'chembl_compound_uniprot_via_activity_assay'
NATURAL JOIN 'uniprot_tissue';
""")

combined


┌────────────────────┬──────────────────┬─────────┐
│ chembl_compound_id │   tissue_label   │ target  │
│      varchar       │     varchar      │ varchar │
├────────────────────┼──────────────────┼─────────┤
│ CHEMBL1517         │ T-cell lymphoma  │ D       │
│ CHEMBL939          │ T-cell           │ L       │
│ CHEMBL1517         │ Teratocarcinoma  │ D       │
│ CHEMBL689          │ Testis           │ A       │
│ CHEMBL939          │ Testis           │ L       │
│ CHEMBL2362016      │ Testis           │ L       │
│ CHEMBL121          │ Tongue           │ A       │
│ CHEMBL64           │ Uterus           │ J       │
│ CHEMBL2216870      │ Uterus           │ L       │
│ CHEMBL413          │ Uterus           │ L       │
│     ·              │   ·              │ ·       │
│     ·              │   ·              │ ·       │
│     ·              │   ·              │ ·       │
│ CHEMBL600325       │ Hippocampus      │ C       │
│ CHEMBL3707331      │ Leukemic T-cell  │ A       │
│ CHEMBL1274

## Turn the table into one-hot vectors

In [7]:
wide_table = con.sql("""
PIVOT
    (FROM combined)
ON tissue_label
USING COUNT(*) > 0
""")

wide_table.limit(2) # show 2 entries only

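To make the effect of the pivot concrete, here is a plain-Python sketch of the same idea: group the long table by compound and target, then emit one boolean column per tissue label. The toy rows and the helper name `one_hot_pivot` are invented for illustration. Absent combinations are written as `False` here, which may differ from how DuckDB represents them.

```python
from collections import defaultdict

# Toy rows shaped like the joined relation: (compound id, tissue label, ATC code).
# These values are made up for illustration.
toy_rows = [
    ("C1", "Liver", "N03AG06"),
    ("C1", "Brain", "N03AG06"),
    ("C2", "Liver", "J05AE02"),
]

def one_hot_pivot(rows):
    """Return one dict per (compound, target) with a boolean column per tissue."""
    tissues = sorted({tissue for _, tissue, _ in rows})
    seen = defaultdict(set)
    for compound_id, tissue, atc in rows:
        # Python strings are 0-indexed, so atc[0] plays the role of SQL atc[1].
        seen[(compound_id, atc[0])].add(tissue)
    return [
        {"id": cid, "target": target, **{t: t in present for t in tissues}}
        for (cid, target), present in sorted(seen.items())
    ]

toy_wide = one_hot_pivot(toy_rows)
for row in toy_wide:
    print(row)
```

Each distinct tissue label becomes its own column, which is why the real pivoted table is very wide.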

## Train a model

In [8]:
# To DataFrame
df = wide_table.to_df()

# Make the TabularDataset from the DataFrame, dropping the ID column:
train = TabularDataset(df.drop(columns=["chembl_compound_id"]))

# Train
predictor = TabularPredictor(label="target").fit(train)

No path specified. Models will be saved in: "AutogluonModels/ag-20250325_071605"
Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.11.11
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
CPU Count:          2
Memory Avail:       10.94 GB / 12.67 GB (86.3%)
Disk Space Avail:   61.10 GB / 107.72 GB (56.7%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong acc

## Show the model summary

In [9]:
summary = predictor.fit_summary(show_plot=True)



*** Summary of fit() ***
Estimated performance of each model:
                  model  score_val eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2   0.290650    accuracy       0.102980  177.445785                0.000945           0.109984            2       True         12
1              CatBoost   0.284553    accuracy       0.015835  174.850149                0.015835         174.850149            1       True          6
2       NeuralNetFastAI   0.270325    accuracy       0.016985   10.891056                0.016985          10.891056            1       True          1
3        NeuralNetTorch   0.264228    accuracy       0.006376   11.217554                0.006376          11.217554            1       True         10
4            LightGBMXT   0.264228    accuracy       0.036968    7.543878                0.036968           7.543878            1       True          2
5              LightGBM   

# NetworkX example

We will calculate the number of hops between proteins in the protein-protein interaction network, using the UniProt protein-protein interaction data from the Layer 1 database.

Save the following SPARQL query as a layer1 query and generate a table:

In [10]:
import networkx as nx
import pandas as pd

rel = con.sql("FROM uniprot_uniprot_interatcion")
df = rel.to_df() # uniprot_id1, uniprot_id2

# Create a graph from the DataFrame
graph = nx.from_pandas_edgelist(df, 'uniprot_id1', 'uniprot_id2')

# Compute shortest-path lengths (hop counts) between all pairs of connected nodes
all_pairs = list(nx.all_pairs_shortest_path_length(graph))

# Extract the data for the DataFrame
data = []
for source, targets in all_pairs:
    for target, distance in targets.items():
        data.append({'source': source, 'target': target, 'distance': distance})

# Create the DataFrame
result_df = pd.DataFrame(data)
result_df

             source      target  distance
0            Q05513      Q05513         0
1            Q05513      O95999         1
2            Q14241      Q14241         0
3            Q14241      O60447         1
4            Q14241      Q8N715         2
...             ...         ...       ...
1186297      Q96N95      Q96N95         0
1186298      Q96N95      P17028         1
1186299  A0A0S2Z3N5  A0A0S2Z3N5         0
1186300      M0QZM1      M0QZM1         0
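For an unweighted graph, the hop count between two nodes is the length of the shortest path found by breadth-first search. The sketch below shows that computation for a single source node in plain Python. The function name `hop_counts` and the toy edge list are assumptions for illustration, and this is not a description of how NetworkX implements it internally.

```python
from collections import deque

def hop_counts(edges, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distances = {source: 0}  # the source is 0 hops from itself
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances

# Toy interaction pairs (invented identifiers, two separate components).
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
toy_distances = hop_counts(toy_edges, "P1")
print(toy_distances)
```

As in the table above, each node has distance 0 to itself, and nodes in a different connected component (here `P4` and `P5`) do not appear in the result at all.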
