# Example Notebook

In this notebook, we will run through a few examples to show how to use SEFA and apply it to your own data.

Basic points:

- SEFA represents feature subsets with masks. The feature vector $x$ always contains every feature,
and the mask vector $m$, with entries in $\{0, 1\}$, tells us what the model sees and what it does not.
Unit tests check that a feature with mask $0$ cannot affect the model output.
- At deployment an unmeasured feature has no value, so we fill it with $0$ and make sure its mask value is also $0$.
- Acquisition works with the sefa.acquire(x, mask_acq, mask_data, return_features) function.
mask_acq tells us what has been measured so far, and mask_data tells us what is available to measure;
at deployment this might be all $1$s. The function returns the new mask, i.e. it sets one additional value in
mask_acq to $1$ for each instance. It does not change the value of x, so in deployment the newly measured values
must be written into x separately. With return_features=True the function also returns the selected features alongside the mask.
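
The masking and deployment points above can be sketched in NumPy as follows. This is only an illustration and not part of SEFA: `mask_missing` and `fill_acquired` are hypothetical helper names, `measure` is a hypothetical callback, and missing values are assumed to arrive as NaN.

```python
import numpy as np


def mask_missing(x_raw):
    """Zero-fill missing entries (NaN) and return the matching 0/1 mask."""
    mask = (~np.isnan(x_raw)).astype(np.float32)
    x = np.nan_to_num(x_raw, nan=0.0).astype(np.float32)
    return x, mask


def fill_acquired(x, old_mask, new_mask, measure):
    """Write freshly measured values into x wherever the mask flipped 0 -> 1.

    `measure(row, col)` is a hypothetical callback that returns the
    measured value of feature `col` for instance `row`.
    """
    x = x.copy()
    rows, cols = np.nonzero((new_mask == 1) & (old_mask == 0))
    for r, c in zip(rows, cols):
        x[r, c] = measure(r, c)
    return x


x_raw = np.array([[0.5, np.nan, 2.0],
                  [np.nan, 1.0, np.nan]])
x, m = mask_missing(x_raw)

# Pretend an acquisition step switched on one new feature per instance.
m_new = m.copy()
m_new[0, 1] = 1.0
m_new[1, 0] = 1.0
x_new = fill_acquired(x, m, m_new, measure=lambda r, c: 9.0)
```

The resulting arrays can be converted with torch.tensor(...).float(), as the examples in this notebook do.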

SEFA works with a mix of continuous and categorical features. To do this, the continuous features **must** come first in the feature vector,
followed by all categorical features. Additionally, if there are categorical features, SEFA must know the maximum number of categories any of them can take.
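
As a rough sketch of this ordering requirement, the hypothetical helper below (`order_features` is not part of SEFA) moves continuous columns to the front and derives the counts the config needs. It assumes categories are coded $0, 1, \dots, k-1$ and that every category appears at least once in the data.

```python
import numpy as np


def order_features(x, is_categorical):
    """Put continuous columns first, then categorical ones."""
    is_categorical = np.asarray(is_categorical, dtype=bool)
    con_idx = np.flatnonzero(~is_categorical)
    cat_idx = np.flatnonzero(is_categorical)
    order = np.concatenate([con_idx, cat_idx])
    x_ordered = x[:, order]
    # Categories are assumed to be coded 0..k-1, so the count is max + 1.
    most_categories = int(x[:, cat_idx].max()) + 1 if cat_idx.size else 0
    return x_ordered, order, len(con_idx), len(cat_idx), most_categories


x = np.array([[2.0, 0.1, 0.0, -1.3],
              [0.0, 0.7, 1.0, 0.4]])
x_ordered, order, num_con, num_cat, most_categories = order_features(
    x, is_categorical=[True, False, True, False]
)
```

Keeping `order` around makes it possible to map a selected feature index back to the original column.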

We pass this information to SEFA in a dictionary when initialising it. An example is below:

sefa_config = {  
$~~$ "num_con_features": 5,  
$~~$ "num_cat_features": 5,  
$~~$ "most_categories": 3,  
$~~$ "out_dim": 10,  
$~~$ "latent_dim": 4,  
$~~$ "num_hidden_predictor": 2,  
$~~$ "hidden_dim_predictor": 100,  
$~~$ "num_hidden_encoder": 2,  
$~~$ "hidden_dim_encoder": 20,  
$~~$ "num_samples_train": 100,  
$~~$ "num_samples_predict": 100,  
$~~$ "num_samples_acquire": 100,  
$~~$ "beta": 0.001,  
$~~$ "epochs": 5,  
$~~$ "lr": 0.001,  
$~~$ "batchsize": 128,  
$~~$ "patience": 5,  
}

The out_dim entry is the number of classes in the prediction problem.
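
Before training, it can help to check that a config agrees with the data. The sketch below is a hypothetical helper (`check_config_against_data` is not provided by SEFA) that uses only the config keys shown above and the value ranges described in this notebook.

```python
import numpy as np


def check_config_against_data(config, x, y):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    num_con = config["num_con_features"]
    num_cat = config["num_cat_features"]
    if x.shape[1] != num_con + num_cat:
        problems.append(
            f"x has {x.shape[1]} columns, config expects {num_con + num_cat}"
        )
    elif num_cat > 0:
        cat = x[:, num_con:]  # categorical columns come after the continuous ones
        if cat.min() < 0 or cat.max() > config["most_categories"] - 1:
            problems.append("categorical value outside 0 .. most_categories - 1")
        if not np.all(cat == np.round(cat)):
            problems.append("categorical columns contain non-integer values")
    if y.min() < 0 or y.max() > config["out_dim"] - 1:
        problems.append("label outside 0 .. out_dim - 1")
    return problems


config = {"num_con_features": 1, "num_cat_features": 2,
          "most_categories": 3, "out_dim": 2}
x_ok = np.array([[0.3, 0.0, 2.0], [-1.2, 1.0, 1.0]])
x_bad = np.array([[0.3, 0.0, 3.0], [-1.2, 1.0, 1.0]])
y = np.array([0, 1])

problems_ok = check_config_against_data(config, x_ok, y)
problems_bad = check_config_against_data(config, x_bad, y)
```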

To train SEFA we use a TensorDataset containing X, y and M. SEFA saves the best version of
the model during training, so the fit function requires a save path. We will use a dummy folder, creating and deleting it
as we go through the notebook.

That is *roughly* everything you need to know. On to the examples.

In [None]:
import os
import os.path as osp

import numpy as np
import matplotlib.pyplot as plt
import torch

from torch.utils.data import TensorDataset

from models.sefa import SEFA

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
print("device:", device)

In [None]:
from experiments.metrics_dict import metrics_dict

auroc = metrics_dict["auroc"]
accuracy = metrics_dict["accuracy"]

In [None]:
def create_dummy_folder():
  os.makedirs("dummy_folder", exist_ok=True)


def remove_dummy_folder():
  # Walk bottom-up so files and subfolders are removed before their parents.
  for root, dirs, files in os.walk("dummy_folder", topdown=False):
    for name in files:
      os.remove(osp.join(root, name))
    for name in dirs:
      os.rmdir(osp.join(root, name))
  os.rmdir("dummy_folder")
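
An alternative to the two helpers above is a temporary directory from the standard library, which is removed even if training raises an error. The sketch below assumes nothing about SEFA: `run_with_temp_save_path` and `fake_training` are hypothetical names, and `fake_training` only stands in for a real training call.

```python
import os
import tempfile


def run_with_temp_save_path(train_fn):
    """Call train_fn(save_path) inside a directory that is removed afterwards."""
    with tempfile.TemporaryDirectory(prefix="dummy_folder_") as save_path:
        result = train_fn(save_path)
    return result, save_path


def fake_training(save_path):
    # Stand-in for a real training call; it only writes a small file.
    with open(os.path.join(save_path, "best_model.txt"), "w") as f:
        f.write("placeholder")
    return os.listdir(save_path)


files, used_path = run_with_temp_save_path(fake_training)
print(files, os.path.exists(used_path))
```

With SEFA, train_fn could be a lambda that forwards the path to the save_path argument of sefa.fit.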

# Indicator Example

The first example is the indicator example from the paper. We have $d$ binary features,
and one indicator feature taking values from $1$ to $d$. The indicator tells us which feature gives the label.
So:

$y = x_{x_{d+1}}$

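CMI maximization fails here because it does not select the indicator first: on its own, the indicator carries no information about the label.
So the first test is whether SEFA makes this long-term acquisition and selects the indicator first. The second test is whether it then
selects the feature chosen by the indicator, which changes for each instance.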
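
To make the failure of single-step selection concrete, the sketch below estimates the mutual information between each single feature and the label on simulated indicator data. It is an illustration only: `empirical_mutual_information` is a hypothetical helper using a simple plug-in estimate, and the data is generated here rather than taken from the paper.

```python
import numpy as np


def empirical_mutual_information(a, b):
    """Plug-in estimate of I(a; b) in nats for two integer-coded arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)  # count co-occurrences
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / outer[nonzero])))


rng = np.random.default_rng(0)
d, n = 10, 50_000
xd = rng.integers(0, 2, size=(n, d))
indicator = rng.integers(0, d, size=n)
y = xd[np.arange(n), indicator]

mi_binary = [empirical_mutual_information(xd[:, j], y) for j in range(d)]
mi_indicator = empirical_mutual_information(indicator, y)
print("MI(x_j; y):", np.round(mi_binary, 4))
print("MI(indicator; y):", round(mi_indicator, 4))
```

Each binary feature agrees with the label slightly more often than chance, so it has a small positive score, while the indicator is independent of the label and scores close to zero. A selector that only looks one step ahead therefore prefers a binary feature.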

First we generate the data as torch tensors. Even though this is categorical data, SEFA does not need one-hot encodings:
it takes each feature as a float from $0$ to $\text{max categories} - 1$. In this case the max categories is $d$, the first $d$ features are $\in \{0.0, 1.0\}$,
and the indicator (0-indexed in the code) is $\in \{0.0, \dots, d-1.0\}$.

The labels are given as longs, so for this example $\in \{0, 1 \}$.

The masks are given as floats $\in \{0.0, 1.0\}$.

We put the train and val data in torch TensorDatasets.

In [None]:
def generate_indicator_data(num_data, d):
  xd = np.random.randint(low=0, high=2, size=(num_data, d))
  indicator = np.random.randint(low=0, high=d, size=(num_data, 1))
  x = np.concatenate([xd, indicator], axis=-1)
  y = xd[np.arange(num_data), indicator.flatten()]
  x = torch.tensor(x).float()
  y = torch.tensor(y).long()
  m = torch.ones_like(x)
  return x, y, m


d = 10
num_train = 60_000
num_val = 10_000
num_test = 10_000

X_train, y_train, M_train = generate_indicator_data(num_train, d)
X_val, y_val, M_val = generate_indicator_data(num_val, d)
X_test, y_test, M_test = generate_indicator_data(num_test, d)

train_data = TensorDataset(X_train, y_train, M_train)
val_data = TensorDataset(X_val, y_val, M_val)

Next we initialise SEFA, using the configs described previously. We have $0$ continuous features, $d+1$ categorical features, and max categories $d$.
We have binary classification so the out_dim is $2$.

All other entries are hyperparameters.

In [None]:
sefa_config = {
  "num_con_features": 0,
  "num_cat_features": d+1,
  "most_categories": d,
  "out_dim": 2,
  "latent_dim": 4,
  "num_hidden_predictor": 2,
  "hidden_dim_predictor": 100,
  "num_hidden_encoder": 2,
  "hidden_dim_encoder": 20,
  "num_samples_train": 100,
  "num_samples_predict": 100,
  "num_samples_acquire": 100,
  "beta": 0.001,
  "epochs": 5,
  "lr": 0.001,
  "batchsize": 128,
  "patience": 5,
}

sefa = SEFA(sefa_config).to(device)

Next we train SEFA with the fit function. We must give SEFA the train_data and val_data in the torch TensorDatasets.

We also need to specify the save_path, where the best model is saved. This can be used in larger scale experiments if needed.

Finally, we specify a predictive metric. This is used to quantify classification performance, which SEFA uses to calculate how well the acquisition has gone, and to select the best model (auroc and accuracy were functions defined at the start of the notebook).

In [None]:
create_dummy_folder()

# Fit the model
sefa.fit(train_data=train_data, val_data=val_data, save_path="dummy_folder", metric_f=auroc)

remove_dummy_folder()

With a trained SEFA model we can test the acquisitions. We call the .acquire function with return_features=True so that it also returns the selections.
We test whether the first selection is the indicator, and whether the second selection is the feature given by the indicator.

To do this we must create a new mask_acq, full of zeros, to tell SEFA to act as if it has no features initially. We will use the .acquire function to fill mask_acq.

In [None]:
# Test the acquisition.

X_test = X_test.to(device)
M_test = M_test.to(device)
m_acq = torch.zeros_like(M_test)

# First acquisition
m_acq, selected = sefa.acquire(x=X_test, mask_acq=m_acq, mask_data=M_test, return_features=True)
print("Proportion of first selections that are indicator:", (selected == d).float().mean().item())

# Second selection
m_acq, selected = sefa.acquire(x=X_test, mask_acq=m_acq, mask_data=M_test, return_features=True)
print("Proportion of second selections that are given by indicator:", (selected == X_test[:, -1]).float().mean().item())

# Syn1

Next we will look at the Syn1 example from the paper. Here we have all continuous features.

The features are sampled from $\mathcal{N}(0, 1)$.

We define a logit $l$ based on the value of $x_{11}$:

- If $x_{11} < 0$, $l = 4x_1x_2$
- If $x_{11} \geq 0$, $l = 1.2 (x_3^2 + x_4^2 + x_5^2 + x_6^2) - 4.2$

Then labels are sampled from $p(Y=1| x) = 1/(1+e^l)$.

The optimal strategy is to select $x_{11}$ first and then the features for the relevant logit.

First we generate the data. Again, $x$ and $m$ are float tensors and $y$ is a tensor of longs. We put the train and val data into torch TensorDatasets.

In [None]:
def generate_syn1_data(num_data):
  x = np.random.normal(size=(num_data, 11))
  logit1 = 4*x[:, 0]*x[:, 1]
  logit2 = 1.2*np.sum(x[:, 2:6]**2, axis=-1) - 4.2
  feature_11_less_0 = (x[:, -1] < 0).astype(np.float32)
  logit = logit1*feature_11_less_0 + logit2*(1-feature_11_less_0)
  y = np.random.binomial(1, 1/(1+np.exp(logit)), size=num_data)
  x = torch.tensor(x).float()
  y = torch.tensor(y).long()
  m = torch.ones_like(x)
  return x, y, m


num_train = 60_000
num_val = 10_000
num_test = 10_000

X_train, y_train, M_train = generate_syn1_data(num_train)
X_val, y_val, M_val = generate_syn1_data(num_val)
X_test, y_test, M_test = generate_syn1_data(num_test)

train_data = TensorDataset(X_train, y_train, M_train)
val_data = TensorDataset(X_val, y_val, M_val)

This time in the config we have $11$ continuous features, and no categorical features or max categories.

In [None]:
sefa_config = {
  "num_con_features": 11,
  "num_cat_features": 0,
  "most_categories": 0,
  "out_dim": 2,
  "latent_dim": 6,
  "num_hidden_predictor": 2,
  "hidden_dim_predictor": 150,
  "num_hidden_encoder": 2,
  "hidden_dim_encoder": 50,
  "num_samples_train": 100,
  "num_samples_predict": 200,
  "num_samples_acquire": 200,
  "beta": 0.0005,
  "epochs": 30,
  "lr": 0.001,
  "batchsize": 128,
  "patience": 5,
}

sefa = SEFA(sefa_config).to(device)

We fit the model as before.

In [None]:
create_dummy_folder()

# Fit the model.
sefa.fit(train_data=train_data, val_data=val_data, save_path="dummy_folder", metric_f=auroc)

remove_dummy_folder()

Next we run the acquisition, tracking which features are acquired at each step.

In [None]:
# Run acquisition.

X_test = X_test.to(device)
M_test = M_test.to(device)
m_acq = torch.zeros_like(M_test)

selected = []

for _ in range(M_test.shape[-1]):
  m_acq, selected_tmp = sefa.acquire(x=X_test, mask_acq=m_acq, mask_data=M_test, return_features=True)
  selected.append(selected_tmp)

selected = torch.stack(selected, dim=0)
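
A numerical summary of how often the acquisitions follow the optimal Syn1 strategy can be sketched as below. `oracle_agreement` is a hypothetical helper, not part of SEFA. It assumes `selected` holds 0-based feature indices with shape [num_steps, batch], and the toy inputs are made up purely for illustration.

```python
import numpy as np


def oracle_agreement(selected, x11):
    """Fraction of instances whose early picks match the optimal Syn1 strategy.

    selected: int array [num_steps, batch] of 0-based feature indices.
    x11: float array [batch] holding the value of the 11th feature.
    Returns (fraction picking x11 first, fraction following the full strategy).
    """
    first_is_x11 = selected[0] == 10
    neg = x11 < 0
    # For x11 < 0 the relevant features are x1, x2 (indices 0, 1), in any order.
    neg_ok = np.all(np.isin(selected[1:3], [0, 1]), axis=0)
    # For x11 >= 0 the relevant features are x3..x6 (indices 2..5), in any order.
    pos_ok = np.all(np.isin(selected[1:5], [2, 3, 4, 5]), axis=0)
    followup_ok = np.where(neg, neg_ok, pos_ok)
    return float(np.mean(first_is_x11)), float(np.mean(first_is_x11 & followup_ok))


# Toy selections for four instances (one column per instance).
toy_selected = np.array([[10, 10, 10, 0],
                         [1, 4, 0, 10],
                         [0, 2, 1, 1],
                         [2, 5, 2, 2],
                         [3, 3, 3, 3]])
toy_x11 = np.array([-1.0, 0.5, 0.5, -2.0])
first_rate, full_rate = oracle_agreement(toy_selected, toy_x11)
```

On the notebook's tensors, the inputs would be selected.cpu().numpy() and X_test[:, -1].cpu().numpy().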

The following code only does plotting: it recreates the heatmaps from the paper (first column of Figure 5).

In [None]:
def draw_rectangle_axis(ax, left, down, right, up, linestyle="-", color="lime"):
  ax.plot([left, right], [up, up], color=color, linestyle=linestyle)
  ax.plot([left, right], [down, down], color=color, linestyle=linestyle)
  ax.plot([left, left], [up, down], color=color, linestyle=linestyle)
  ax.plot([right, right], [up, down], color=color, linestyle=linestyle)


def hm_trajectories_on_axis(ax, selections):
  # HM means heatmap, so we have a heat map behind sample trajectories.
  selections = selections.numpy()  # Shape: [num_acquisitions, batch]; here num_acquisitions == num_features.
  num_features = selections.shape[0]
  hm = []
  for f in range(num_features):
    hm.append(np.mean(selections == f, axis=-1))
  hm = np.stack(hm, axis=0)
  ax.imshow(hm, cmap="Blues", aspect=0.4, origin="lower", vmin=0.0, vmax=1.0)

  max_features = 6
  trajectories = selections[:max_features, np.random.choice(a=selections.shape[1], size=400, replace=False)]
  ax.plot(trajectories, color="red", linewidth=2.5, alpha=0.018)
  ax.set_xlim(-0.5, max_features-0.5)
  ax.set_xticks(ticks=np.arange(max_features), labels=np.arange(max_features)+1)
  ax.set_yticks(ticks=np.arange(num_features), labels=np.arange(num_features)+1)


fig, ax = plt.subplots(1, ncols=2, figsize=[8, 3.5])

neg_ids = torch.where(X_test[:, -1] < 0)[0]
pos_ids = torch.where(X_test[:, -1] >= 0.0)[0]

hm_trajectories_on_axis(ax[0], selected[:, neg_ids].cpu())
hm_trajectories_on_axis(ax[1], selected[:, pos_ids].cpu())

ax[0].set_title("x11 < 0")
ax[0].set_xlabel("Acquisition")
ax[0].set_ylabel("Feature")
ax[1].set_title("x11 >= 0")
ax[1].set_xlabel("Acquisition")
ax[1].set_ylabel("Feature")


# Draw the rectangles highlighting the correct acquisitions.
draw_rectangle_axis(ax[0], -0.5, 9.5, 0.5, 10.5)
draw_rectangle_axis(ax[1], -0.5, 9.5, 0.5, 10.5)
draw_rectangle_axis(ax[0], 0.5, -0.5, 2.5, 1.5)
draw_rectangle_axis(ax[1], 0.5, 1.5, 4.5, 5.5)
ax[0].axvline(x=2.5, color="black", linestyle="--")
ax[1].axvline(x=4.5, color="black", linestyle="--")

# Jointly Informative Data

We provide one more example of SEFA making long-term acquisitions, even when there are features that provide immediate information.
This dataset does not have instance-wise orderings, but there can still be myopic behavior.

The last 3 features are binary, but we map them to $-1$ or $1$. The label is the
product of these three features, so any one of the three can flip the label, and it is impossible to know anything about the
label without having all three.

The other features are sampled based on the label. If $y=1$ they are sampled from a Bernoulli with
$p=0.55$, and if $y=-1$ they are sampled from a Bernoulli with $p=0.45$. Each of these features is noisy but still provides some information
about the label on its own, so CMI would select these features first.

The optimal strategy is to acquire the jointly informative features first, accepting poor performance until all three are acquired and reaching perfect performance afterwards.
The greedy strategy that CMI would yield is to keep acquiring noisy features, slowly seeing what the most likely value of $y$ is.

This example shows SEFA can acquire jointly informative (but individually uninformative) features, taking into account their possible values and interactions.

(We map $-1$ back to $0$ in the jointly informative features and in $y$ after generating the noisy features.)
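
The trade-off described above can be checked directly on simulated data, without any model. The sketch below is an illustration under the same sampling scheme. The majority-vote rule and the parity rule are simple hand-written predictors chosen for this example; they are not what SEFA or a CMI method computes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

x_ji = rng.integers(0, 2, size=(n, 3))   # jointly informative features, coded 0/1
y = np.prod(2 * x_ji - 1, axis=1)        # label in {-1, +1}
x_noisy = rng.binomial(1, 0.5 + 0.05 * y[:, None], size=(n, 10))
y01 = (y + 1) // 2                       # label coded 0/1

# A single jointly informative feature is uncorrelated with the label in expectation.
corr_single = [abs(np.corrcoef(x_ji[:, j], y01)[0, 1]) for j in range(3)]

# Majority vote over the ten noisy features (ties go to class 1).
vote = (x_noisy.sum(axis=1) >= 5).astype(int)
acc_noisy = float(np.mean(vote == y01))

# With 0/1 coding the product rule becomes a parity rule: label 1 iff an odd number of ones.
acc_ji = float(np.mean((x_ji.sum(axis=1) % 2) == y01))

print("abs. correlation of each single JI feature with y:", np.round(corr_single, 3))
print("accuracy, majority vote over noisy features:", round(acc_noisy, 3))
print("accuracy, parity of the three JI features:", acc_ji)
```

The three jointly informative features together recover the label exactly, while all ten noisy features together give only a modest improvement over chance.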

In [None]:
def generate_jointly_informative_data(num_data):
  num_ji_features = 3
  num_noisy_features = 10
  x_ji = 2*np.random.binomial(1, 0.5, size=(num_data, num_ji_features)) - 1
  y = np.prod(x_ji, axis=1)
  p = 0.5 + 0.05*y
  p = p.reshape(-1, 1)
  p = np.tile(p, (1, num_noisy_features))
  x_noisy = np.random.binomial(1, p)
  x_ji = (x_ji + 1) / 2  # Convert back to 0, 1
  y = (y + 1) / 2
  x = np.concatenate([x_noisy, x_ji], axis=1)  # The final three features are the jointly informative features.
  x = torch.tensor(x).float()
  y = torch.tensor(y).long()
  m = torch.ones_like(x)
  return x, y, m


num_train = 60_000
num_val = 10_000
num_test = 10_000

X_train, y_train, M_train = generate_jointly_informative_data(num_train)
X_val, y_val, M_val = generate_jointly_informative_data(num_val)
X_test, y_test, M_test = generate_jointly_informative_data(num_test)

train_data = TensorDataset(X_train, y_train, M_train)
val_data = TensorDataset(X_val, y_val, M_val)

Again we have no continuous features; all $13$ features are categorical and binary.

In [None]:
sefa_config = {
  "num_con_features": 0,
  "num_cat_features": 13,
  "most_categories": 2,
  "out_dim": 2,
  "latent_dim": 4,
  "num_hidden_predictor": 2,
  "hidden_dim_predictor": 150,
  "num_hidden_encoder": 2,
  "hidden_dim_encoder": 50,
  "num_samples_train": 100,
  "num_samples_predict": 200,
  "num_samples_acquire": 200,
  "beta": 0.001,
  "epochs": 10,
  "lr": 0.001,
  "batchsize": 128,
  "patience": 5,
}

sefa = SEFA(sefa_config).to(device)

In [None]:
create_dummy_folder()

# Fit the model
sefa.fit(train_data=train_data, val_data=val_data, save_path="dummy_folder", metric_f=auroc)

remove_dummy_folder()

During acquisition we track the AUROC. We also track it for a greedy baseline that acquires the features in order, so the noisy features come first and the jointly informative features last.

We then plot the performance as a function of the number of acquired features, and report the proportion of SEFA's first, second and third acquisitions that are jointly informative features.

In [None]:
X_test = X_test.to(device)
M_test = M_test.to(device)
m_acq = torch.zeros_like(M_test)
m_seq = torch.zeros_like(M_test)

preds_with_no_features = sefa.predict(x=X_test, mask=m_seq).detach().cpu()
auroc_no_features = auroc(preds_with_no_features, y_test)
metrics_acq = [auroc_no_features]
metrics_seq = [auroc_no_features]

selected = []

for i in range(X_test.shape[-1]):
  m_acq, selected_tmp = sefa.acquire(x=X_test, mask_acq=m_acq, mask_data=M_test, return_features=True)
  m_seq[:, i] = 1.0  # Acquiring the noisy features then jointly informative features in order.
  selected.append(selected_tmp)

  preds_acq = sefa.predict(x=X_test, mask=m_acq*M_test).detach().cpu()
  preds_seq = sefa.predict(x=X_test, mask=m_seq*M_test).detach().cpu()
  metrics_acq.append(auroc(preds_acq, y_test))
  metrics_seq.append(auroc(preds_seq, y_test))


# Print proportion of selections.
selected = torch.stack(selected, dim=0)
print(f"Proportion of first acquisitions being jointly informative features: {torch.mean((selected[0] > 9).float()).item()}")
print(f"Proportion of second acquisitions being jointly informative features: {torch.mean((selected[1] > 9).float()).item()}")
print(f"Proportion of third acquisitions being jointly informative features: {torch.mean((selected[2] > 9).float()).item()}")


# Plot the curves.
metrics_acq = np.array(metrics_acq)
metrics_seq = np.array(metrics_seq)
plt.plot(metrics_acq, label="Active Acquisitions")
plt.plot(metrics_seq, label="Greedy Acquisitions")
plt.grid()
plt.legend()
plt.xlabel("Acquisition")
plt.ylabel("AUROC")