# Simple workflow with Dummy Data

If you have installed the `sawmil` package, drop the `src.` prefix from the imports below (for example, `from sawmil.data import generate_dummy_bags`).
The examples cover dummy (synthetic) data and molecular data (`musk`).
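
Since the import prefix depends on whether the package is installed, one option is to resolve it at run time. The helper below is a minimal sketch using only the standard library; `first_importable` is a hypothetical name and is not part of `sawmil`.

```python
import importlib.util


def first_importable(candidates):
    """Return the first module name in `candidates` that can be located, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. `src`) is missing or malformed.
            continue
    return None


# Prefer the installed package, fall back to the in-repo layout.
prefix = first_importable(["sawmil", "src.sawmil"])
print("Import prefix:", prefix)
```

If the printed prefix is `None`, neither layout is visible from the current working directory.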

## 1. Examples with Dummy Data
### 1.1 Load the dummy data

In [1]:
from src.sawmil.data import generate_dummy_bags
import numpy as np
rng = np.random.default_rng(0)

ds = generate_dummy_bags(
    n_pos=100, n_neg=100, inst_per_bag=(5, 15), d=2,
    pos_centers=((+2,+1), (+4,+3)),
    neg_centers=((-1.5,-1.0), (-3.0,+0.5)),
    pos_scales=((2.0, 0.6), (1.2, 0.8)),
    neg_scales=((1.5, 0.5), (2.5, 0.9)),
    pos_intra_rate=(0.25, 0.85),
    ensure_pos_in_every_pos_bag=True,
    neg_pos_noise_rate=(0.00, 0.05),
    pos_neg_noise_rate=(0.00, 0.20),
    outlier_rate=0.1,
    outlier_scale=8.0,
    random_state=42,
)

# Quick sanity check:
X_pos, pos_idx = ds.positive_instances()
X_neg, neg_idx = ds.negative_instances()
print("Number of bags:", len(ds.bags))


Number of bags: 200
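
To make the bag structure concrete, here is a minimal pure-Python sketch of a toy generator. It is not the implementation of `generate_dummy_bags`: `make_toy_bag` and its parameters are hypothetical, it draws from a single cluster per class, and it ignores the noise and outlier settings used above. It only illustrates the idea behind `ensure_pos_in_every_pos_bag=True`, assuming the standard multiple-instance convention that a bag is positive when it contains at least one positive instance.

```python
import random


def make_toy_bag(is_positive, rng, size_range=(5, 15), pos_rate=0.5):
    """Return (instances, instance_labels, bag_label) for one toy 2-D bag."""
    n = rng.randint(*size_range)
    # Negative bags contain no positive instances in this simplified sketch.
    labels = [int(is_positive and rng.random() < pos_rate) for _ in range(n)]
    if is_positive and not any(labels):
        labels[rng.randrange(n)] = 1  # guarantee at least one positive instance
    # Positive instances cluster around (+2, +1), negative ones around (-1.5, -1).
    centers = {1: (2.0, 1.0), 0: (-1.5, -1.0)}
    instances = [tuple(rng.gauss(c, 1.0) for c in centers[lab]) for lab in labels]
    return instances, labels, int(any(labels))


rng = random.Random(42)
toy_bags = [make_toy_bag(i < 3, rng) for i in range(6)]  # 3 positive, 3 negative
for X, inst_y, bag_y in toy_bags:
    print(len(X), "instances, bag label:", bag_y)
```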


### 1.2 Fit the model

In [3]:
from src.sawmil.kernels import get_kernel
k = get_kernel("linear")  # base (single-instance) kernel
# To use a bag kernel outside of the models, build it explicitly:
from src.sawmil.bag_kernels import make_bag_kernel
bag_k = make_bag_kernel(k, normalizer="none", p=1.0)  # bag kernel
# Otherwise, each model builds the bag kernel internally.
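
For intuition, a bag kernel can be built from a single-instance kernel by summing it over all instance pairs of two bags. The sketch below follows the usual set-kernel definition; how `make_bag_kernel` interprets `normalizer` and `p` internally is an assumption here, and `set_kernel` is a hypothetical name.

```python
def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def set_kernel(bag_a, bag_b, base=linear_kernel, p=1.0, normalizer="none"):
    """Sum base(x, y) ** p over all instance pairs; optionally average by bag sizes."""
    total = sum(base(x, y) ** p for x in bag_a for y in bag_b)
    if normalizer == "average":
        total /= len(bag_a) * len(bag_b)
    elif normalizer != "none":
        raise ValueError(f"unknown normalizer: {normalizer!r}")
    return total


bag_a = [(1.0, 0.0), (0.0, 1.0)]
bag_b = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
print(set_kernel(bag_a, bag_b))                        # 6.0
print(set_kernel(bag_a, bag_b, normalizer="average"))  # 1.0
```

Averaging makes the value independent of bag size, which matters when bags hold between 5 and 15 instances as in the dummy data.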


#### 1.2.1 Fit NSK with the Linear Kernel

In [4]:
from src.sawmil import NSK

k = get_kernel("linear", normalizer="average")
clf = NSK(C=1, kernel=k, 
          # bag kernel settings
          normalizer='average',
          p=1.0,
          # solver settings
          scale_C=True, 
          tol=1e-8, 
          verbose=False, 
          solver='osqp').fit(ds, None)
print("Train acc:", clf.score(ds, ds.y))
# clf.predict(ds), clf.decision_function(ds)

Linear kernel has no parameters to fit.


Train acc: 0.895


#### 1.2.2 Fit NSK with the RBF Kernel

In [5]:
k = get_kernel("rbf", gamma=0.8)
clf = NSK(C=10, kernel=k, scale_C=True, tol=1e-8, verbose=False, solver='osqp').fit(ds, None)
print("Train acc:", clf.score(ds, ds.y))

Train acc: 0.955


#### 1.2.3 Fit NSK with Combined Kernels

In [6]:
from src.sawmil.kernels import Product, Polynomial, Linear, RBF, Sum, Scale

k = Sum(Linear(), 
        Scale(0.5, 
              Product(Polynomial(degree=2), RBF(gamma=1.0))))
clf = NSK(C=100, kernel=k, 
          # params to create a bag kernel
          normalizer="none",
          # svm params
          scale_C=True, 
          tol=1e-8, verbose=False, solver='gurobi').fit(ds, None)
print("Train acc:", clf.score(ds, ds.y))

Linear kernel has no parameters to fit.


Set parameter Username
Academic license - for non-commercial use only - expires 2026-08-04
Train acc: 1.0
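
To see what the composite kernel evaluates on a pair of instances, here is a minimal pure-Python sketch of `Sum(Linear(), Scale(0.5, Product(Polynomial(degree=2), RBF(gamma=1.0))))`. The polynomial form `(x·y + coef0) ** degree` with `coef0=1.0` is an assumption; the library's defaults may differ.

```python
import math


def linear(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial(x, y, degree=2, coef0=1.0):
    # Assumed form; coef0 is a hypothetical default.
    return (linear(x, y) + coef0) ** degree


def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def combined(x, y):
    """linear + 0.5 * (polynomial * rbf), mirroring the Sum/Scale/Product nesting."""
    return linear(x, y) + 0.5 * polynomial(x, y, degree=2) * rbf(x, y, gamma=1.0)


x = (1.0, 2.0)
print(combined(x, x))  # 5 + 0.5 * 36 * 1 = 23.0
```

Sums and products of valid kernels, and positive scalings of them, are again valid kernels, which is why such nesting is safe to pass to an SVM solver.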


#### 1.2.4 Fit sMIL with the Linear Kernel

In [7]:
from src.sawmil import sMIL
from src.sawmil.kernels import Linear

In [8]:
k = Linear()
clf = sMIL(C=10, kernel=k, 
           # params to create a bag kernel
           normalizer="none",
           # svm params
           scale_C=True, 
           tol=1e-8, verbose=False, solver='osqp').fit(ds, None)

Linear kernel has no parameters to fit.


In [9]:
y = np.array([b.y for b in ds.bags])
# yhat = clf.predict(ds)
print("Train acc:", clf.score(ds, y))

Train acc: 0.89


#### 1.2.5 Fit sAwMIL with the Linear Kernel

In [10]:
from src.sawmil import sAwMIL
from src.sawmil.kernels import get_kernel

In [16]:
k = get_kernel('linear')
clf = sAwMIL(C=10, kernel=k,
             solver="gurobi", eta=0.99) # here eta is high, since all items in the bag are relevant
clf.fit(ds)
print("Train acc:", clf.score(ds, ds.y))
clf.predict(ds)

Linear kernel has no parameters to fit.
Linear kernel has no parameters to fit.


Train acc: 0.81


array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.,
       0., 0., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 1., 1.,
       0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,
       1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0.,
       0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1.,
       1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.])

## 2. Experiments with the molecular data
### 2.1 Load the `musk` data

In [17]:
from src.sawmil.data import load_musk_bags
train_ds, test_ds, scaler = load_musk_bags(standardize=True, test_size=0.3, random_state=42)
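
The call above returns a `scaler` alongside the two splits. As a rough illustration of what `standardize=True` typically means, the sketch below fits per-feature means and standard deviations on training rows and reuses them for test rows. That `load_musk_bags` fits the scaler on the training split only is an assumption, and `fit_scaler` and `transform` are hypothetical helpers.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and (population) standard deviation."""
    cols = list(zip(*rows))
    means = [fmean(c) for c in cols]
    stds = [pstdev(c) or 1.0 for c in cols]  # constant columns: avoid division by zero
    return means, stds


def transform(rows, scaler):
    """Apply (value - mean) / std column-wise using statistics from `fit_scaler`."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


toy_train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
toy_test = [[2.0, 12.0]]
toy_scaler = fit_scaler(toy_train)
print(transform(toy_test, toy_scaler))  # [[0.0, 2.0]]
```

Reusing training statistics on the test rows keeps information about the test distribution out of the fitted model.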

### 2.2 Fit models
#### 2.2.1 Fit NSK with Combined Kernel

In [18]:
from src.sawmil import NSK
from src.sawmil.kernels import Linear, Sum, RBF
from sklearn.metrics import matthews_corrcoef as mcc


k = Sum(Linear(),RBF(gamma=0.5))

clf = NSK(C=10, kernel=k, 
          normalizer="none",
          scale_C=True, tol=1e-8, verbose=True, solver='osqp').fit(train_ds, None)

y = test_ds.y
yhat = clf.predict(test_ds)
print('Matthews Correlation Coefficient:', mcc(y, yhat))

Linear kernel has no parameters to fit.


-----------------------------------------------------------------
           OSQP v1.0.0  -  Operator Splitting QP Solver
              (c) The OSQP Developer Team
-----------------------------------------------------------------
problem:  variables n = 71, constraints m = 72
          nnz(P) + nnz(A) = 2698
settings: algebra = Built-in,
          OSQPInt = 4 bytes, OSQPFloat = 8 bytes,
          linear system solver = QDLDL v0.1.8,
          eps_abs = 1.0e-08, eps_rel = 1.0e-08,
          eps_prim_inf = 1.0e-04, eps_dual_inf = 1.0e-04,
          rho = 1.00e-01 (adaptive: 50 iterations),
          sigma = 1.00e-06, alpha = 1.60, max_iter = 20000
          check_termination: on (interval 25, duality gap: on),
          time_limit: 1.00e+10 sec,
          scaling: on (10 iterations), scaled_termination: off
          warm starting: on, polishing: on, 
iter   objective    prim res   dual res   gap        rel kkt    rho         time
   1  -6.9000e+00   1.05e+00   1.04e+00  -3.59e+00   1.05

#### 2.2.2 Fit sMIL with Combined Kernel

In [19]:
from src.sawmil.kernels import Linear, RBF, Sum
from src.sawmil import sMIL

k = Sum(Linear(),RBF(gamma=0.5))

solver_params = {
    #'env': {'LogFile': 'gurobi.log'},
    'model': {'Method': 2, 'Threads': 10},
    #'start': np.zeros(train_ds.n_instances) # Specify a warm start (weights for each instance)
}

clf = sMIL(C=10, kernel=k, normalizer='none', scale_C=True, tol=1e-8, verbose=True, solver='gurobi', solver_params=solver_params).fit(train_ds, None)

Linear kernel has no parameters to fit.


Using solver 'gurobi' with params: {'model': {'Method': 2, 'Threads': 10}}
Set parameter Method to value 2
Set parameter Threads to value 10
Gurobi Optimizer version 12.0.3 build v12.0.3rc0 (mac64[arm] - Darwin 24.6.0 24G90)

CPU model: Apple M1 Max
Thread count: 10 physical cores, 10 logical processors, using up to 10 threads

Non-default parameters:
Method  2
Threads  10

Optimize a model with 1 rows, 2199 columns and 2199 nonzeros
Model fingerprint: 0xdc6b2233
Model has 2418900 quadratic objective terms
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [5e-01, 1e+00]
  QObjective range [4e-04, 9e+02]
  Bounds range     [5e-03, 4e-01]
  RHS range        [0e+00, 0e+00]
Presolve time: 0.18s
Presolved: 1 rows, 2199 columns, 2199 nonzeros
Presolved model has 2418900 quadratic objective terms
Ordering time: 0.07s

Barrier statistics:
 Free vars  : 2188
 AA' NZ     : 2.395e+06
 Factor NZ  : 2.397e+06 (roughly 20 MB of memory)
 Factor Ops : 3.499e+09 (less than 1 

In [20]:
y = test_ds.y
yhat = clf.predict(test_ds)
print('Matthews Correlation Coefficient:', mcc(y, yhat))

Matthews Correlation Coefficient: 0.6369615602528665
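
For reference, the Matthews correlation coefficient for binary labels can be computed directly from the confusion-matrix counts. The sketch below is a plain-Python illustration with a hypothetical name, `mcc_binary`; it returns 0.0 when the denominator is zero, which is the convention `sklearn.metrics.matthews_corrcoef` also follows.

```python
import math


def mcc_binary(y_true, y_pred, positive=1):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
print(mcc_binary(toy_true, toy_pred))  # (2*2 - 1*1) / 9 = 0.333...
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), so it stays informative when the classes are imbalanced.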
