This notebook shows a simple demo of how to use Aug-imodels (`AugLinearClassifier` and `AugTreeClassifier`). These models use LLMs to augment transparent models (e.g. GAMs and decision trees) to improve their performance.

Both follow a simple sklearn-style interface, but may be slow to fit (because of the LLM augmentation). Both are extremely fast at test time, as they no longer use an LLM.

In [None]:
%load_ext autoreload
%autoreload 2
from imodelsx import AugLinearClassifier, AugTreeClassifier
# from imodelsx import AugGAMClassifier as AugLinearClassifier
import datasets
import numpy as np

### Load some data
Here, we load some training/validation data from the rotten-tomatoes movie dataset. To make things fast, we restrict the training and validation sets to 300 randomly chosen examples each.

In [None]:
dset = datasets.load_dataset('rotten_tomatoes')['train']
dset = dset.select(np.random.choice(len(dset), size=300, replace=False))

dset_val = datasets.load_dataset('rotten_tomatoes')['validation']
dset_val = dset_val.select(np.random.choice(
    len(dset_val), size=300, replace=False))
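
The subsets above are drawn from NumPy's global random state, so they change from run to run. If reproducible subsets are wanted, one option is a seeded generator. This is only a sketch: the helper name `sample_indices`, the seed value, and the placeholder dataset size of 1000 are illustrative and not part of the original notebook.

```python
import numpy as np


def sample_indices(n_total, n_keep=300, seed=0):
    """Draw n_keep distinct row indices from range(n_total), reproducibly."""
    rng = np.random.default_rng(seed)  # local generator, independent of global state
    return rng.choice(n_total, size=n_keep, replace=False)


# 1000 is a placeholder size; in the notebook it would be len(dset)
idx = sample_indices(1000)
```

The resulting indices could then be passed to `dset.select(...)` in the same way as the `np.random.choice` output above.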

# Aug-Linear

### Fit AugLinearClassifier
Fitting AugLinear takes a single `fit` call. The constructor accepts a few hyperparameters, such as the LLM `checkpoint` and the maximum ngram order.

In [None]:
m = AugLinearClassifier(
    checkpoint='textattack/distilbert-base-uncased-rotten-tomatoes',
    ngrams=2,
    all_ngrams=True,  # also use lower-order ngrams
)
m.fit(dset['text'], dset['label'])

### Interpretation

We now have a linear model of ngrams. The `fit` function above has precomputed the linear coefficients for the ngrams it saw during training and saved them to `m.coefs_dict_`. Let's take a look at some of them.

In [None]:
print('Total ngram coefficients: ', len(m.coefs_dict_))
print('Most positive ngrams')
for k, v in sorted(m.coefs_dict_.items(), key=lambda item: item[1], reverse=True)[:8]:
    print('\t', k, round(v, 2))
print('Most negative ngrams')
for k, v in sorted(m.coefs_dict_.items(), key=lambda item: item[1])[:8]:
    print('\t', k, round(v, 2))

### Predictions
Now, let's take a look at how we make predictions. This is very fast, as it just uses the precomputed dictionary `m.coefs_dict_`.

In [None]:
preds = m.predict(dset['text'])
print('acc_train', np.mean(preds == dset['label']))
preds_proba = m.predict_proba(dset['text'])

In [None]:
preds = m.predict(dset_val['text'])
print('acc_val', np.mean(preds == dset_val['label']))
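
To make the dictionary-lookup idea concrete, here is a toy sketch of scoring a text with a dictionary of ngram coefficients. It is not the library's implementation: the whitespace tokenizer, the helper names `extract_ngrams` and `toy_score`, the coefficient values, and the choice to give unseen ngrams a contribution of zero are all assumptions made for illustration.

```python
def extract_ngrams(text, max_n=2):
    """Return all ngrams of order 1..max_n using a naive whitespace tokenizer."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def toy_score(text, coefs, max_n=2):
    """Sum the coefficients of the ngrams in text; unseen ngrams count as 0 (assumption)."""
    return sum(coefs.get(g, 0.0) for g in extract_ngrams(text, max_n))


# made-up coefficients, standing in for a fitted coefficient dictionary
toy_coefs = {"great": 1.5, "boring": -2.0, "not great": -2.5}

score_pos = toy_score("a great movie", toy_coefs)         # 1.5
score_neg = toy_score("not great and boring", toy_coefs)  # 1.5 - 2.0 - 2.5 = -3.0
```

Because each prediction is a handful of dictionary lookups and a sum, no LLM call is needed at this stage, which is why test-time prediction is cheap.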

Note: we may want to infer the coefficients for ngrams we didn't see during training. To do this, we call the `cache_linear_coefs` function on the test-set inputs. This adds coefficients for the previously unseen ngrams to the dictionary `m.coefs_dict_`. Then we can call `predict` as before.

In [None]:
m.cache_linear_coefs(dset_val['text'])
preds = m.predict(dset_val['text'])
print('acc_val', np.mean(preds == dset_val['label']))
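
The caching step can be pictured as filling in missing dictionary entries. The sketch below is a hypothetical illustration, not what `cache_linear_coefs` does internally: `cache_missing_coefs` and `fake_score` are invented names, and `fake_score` is a cheap stand-in for the expensive LLM-based step that would produce a real coefficient.

```python
def cache_missing_coefs(texts, coefs, score_fn, max_n=2):
    """Add a coefficient for every ngram in texts that is not yet in coefs.

    Returns the number of entries added. Existing entries are left untouched.
    """
    n_added = 0
    for text in texts:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])
                if ngram not in coefs:
                    coefs[ngram] = score_fn(ngram)
                    n_added += 1
    return n_added


def fake_score(ngram):
    """Hypothetical stand-in for the LLM-derived coefficient of an ngram."""
    return 0.1 * len(ngram.split())


toy_coefs = {"great": 1.5}
n_new = cache_missing_coefs(["a great film"], toy_coefs, fake_score)
```

Calling the function a second time on the same texts adds nothing, since every ngram is already cached.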

# Aug-Tree

In [None]:
import imodelsx.augtree.data

# set your OpenAI key
import openai
openai.api_key = open('/home/chansingh/.OPENAI_KEY', 'r').read().strip()

# prepare data
X_text = list(dset['text'])
# optionally, convert data to ngrams
X, _, feature_names = imodelsx.augtree.data.convert_text_data_to_counts_array(
    X_text, [], ngrams=2)
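
For intuition about what an ngram counts array looks like, here is a toy pure-Python version. It is a sketch under assumptions and may differ from `convert_text_data_to_counts_array` in tokenization, in which ngram orders are included, and in feature ordering; `toy_counts_array` is a hypothetical helper name.

```python
from collections import Counter


def toy_counts_array(texts, ngrams=2):
    """Return (X, feature_names), where X[i][j] counts feature j in text i."""
    def grams(text):
        toks = text.lower().split()  # naive whitespace tokenizer (assumption)
        return [" ".join(toks[i:i + n])
                for n in range(1, ngrams + 1)
                for i in range(len(toks) - n + 1)]

    per_text = [Counter(grams(t)) for t in texts]
    # one column per distinct ngram, in sorted order
    feature_names = sorted(set().union(*per_text))
    X = [[c[f] for f in feature_names] for c in per_text]
    return X, feature_names


X_toy, names_toy = toy_counts_array(["good good film", "bad film"])
```

Each row of `X_toy` describes one text and each column one ngram, which is the shape of input passed as `X` together with `feature_names` when fitting the tree.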

In [None]:
m = AugTreeClassifier(
    max_depth=2,  # depth of the tree
    max_features=1,
    # this tells the classifier to actually use the llm (defaults to text-davinci-003)
    refinement_strategy='llm',
    verbose=True,
    # folder to store cached ngram expansions
    cache_expansions_dir='/home/chansingh/aug-models/augtree/results/gpt3_cache',
)
m.fit(X=X, y=dset['label'], feature_names=feature_names, X_text=X_text)

In [None]:
print(m)