# SetFit for Text Classification

In this notebook, we'll learn how to do few-shot text classification with SetFit.

## Setup

If you're running this notebook on Colab or another cloud platform, you will need to install the `setfit` library, plus `pandas`, which is used below to prepare the data. Run the following cell:

In [None]:
 %pip install setfit
 %pip install pandas


To be able to share your model with the community, there are a few more steps to follow.

First, you have to store your authentication token from the Hugging Face Hub (sign up [here](https://huggingface.co/join) if you haven't already!). To do so, execute the following cell and input an [access token](https://huggingface.co/docs/hub/security-tokens) associated with your account:

In [None]:
from huggingface_hub import notebook_login

notebook_login()
# Paste your access token into the login widget; never hard-code it in the notebook.

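
If you prefer not to type the token interactively, one option is to read it from an environment variable. The sketch below is a minimal example under stated assumptions: the helper names `get_hf_token` and `login_from_env` are hypothetical, and the variable name `HF_TOKEN` is an assumption about how the token is stored in your environment.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in the given environment variable.

    Hypothetical helper: the variable name is an assumption, not a requirement.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token.")
    return token


def login_from_env() -> None:
    """Log in to the Hugging Face Hub with the token read from the environment."""
    # Imported lazily so get_hf_token() works even without huggingface_hub installed.
    from huggingface_hub import login

    login(token=get_hf_token())
```

Calling `login_from_env()` would then replace the interactive widget, and the token never appears in the notebook source.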

Then you need to install Git-LFS, which you can do by running the following command:

In [None]:
 !apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.9.2-1).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


Finally, you may need to configure Git on your system by providing details about who you are:

In [None]:
 !git config --global user.email "agarcf15@estudiantes.unileon.es"
 !git config --global user.name "agarcf15"

This notebook is designed to work with any multiclass [text classification dataset](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads) and pretrained [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub. Change the values below to try a different dataset / model!

# Exporting CSVs

In [None]:
# Convert the Excel sheets of the dataset to CSV
import pandas as pd

bt = pd.read_excel('/bin_train.xlsx')
bv = pd.read_excel('/bin_validation.xlsx')

bt.to_csv('bin_train.csv', index=False)
bv.to_csv('bin_validation.csv', index=False)



In [None]:
csv_bt = pd.read_csv('bin_train.csv')
csv_bt

Unnamed: 0,sentence,label
0,Application Programming Interface (API) adopt...,1
1,Two medical device vulnerabilities in select ...,1
2,Two patients are seeking class-action status ...,1
3,"Hunt\nAugust 12, 2021 - Long Island Jewish For...",1
4,"Hunt\nAugust 11, 2021 - A ransomware attack on...",1
...,...,...
2886,"\nIn October, the Federal Financial Institutio...",0
2887,Richard P. Salgado\nUSA Bulletin - (March 2001...,0
2888,\nThe Pew Internet and American Life Project r...,0
2889,"\nBut over the past five years, operating-syst...",0


In [None]:
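csv_bt = csv_bt.dropna()
# Replace newlines with ". " so that each article fits on a single line
csv_bt = csv_bt.replace('\n', '. ', regex=True)
# Drop double quotes
csv_bt = csv_bt.replace('"', '', regex=True)
# Remove a period at the start of a cell, then any whitespace followed by a period
# (raw strings avoid invalid-escape warnings for \. and \s)
csv_bt = csv_bt.replace(r'^\.', '', regex=True)
csv_bt = csv_bt.replace(r'\s\.', '', regex=True)

csv_bt.to_csv('bin_train_clean.csv', index=False)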

In [None]:
csv_bt

Unnamed: 0,sentence,label
0,Application Programming Interface (API) adopt...,1
1,Two medical device vulnerabilities in select ...,1
2,Two patients are seeking class-action status ...,1
3,"Hunt. August 12, 2021 - Long Island Jewish For...",1
4,"Hunt. August 11, 2021 - A ransomware attack on...",1
...,...,...
2886,"In October, the Federal Financial Institution...",0
2887,Richard P. Salgado. USA Bulletin - (March 2001...,0
2888,The Pew Internet and American Life Project re...,0
2889,"But over the past five years, operating-syste...",0


In [None]:
csv_bv = pd.read_csv('bin_validation.csv')

csv_bv = csv_bv.dropna()

# Same cleaning steps as for the training split
csv_bv = csv_bv.replace('\n', '. ', regex=True)
csv_bv = csv_bv.replace('"', '', regex=True)
csv_bv = csv_bv.replace(r'^\.', '', regex=True)
csv_bv = csv_bv.replace(r'\s\.', '', regex=True)
csv_bv.to_csv('bin_validation_clean.csv', index=False)
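
To see what the four replacements do, it can help to apply them to a single string. The sketch below mirrors the cleaning cells with the standard `re` module; `clean_text` is a hypothetical helper introduced only for illustration, and the sample strings are made up.

```python
import re


def clean_text(text: str) -> str:
    """Apply the same four substitutions as the cleaning cells, in the same order."""
    text = re.sub('\n', '. ', text)    # newline -> ". "
    text = re.sub('"', '', text)       # drop double quotes
    text = re.sub(r'^\.', '', text)    # drop a period at the very start
    text = re.sub(r'\s\.', '', text)   # drop whitespace followed by a period
    return text


# A sentence that already ends in a period picks up a second one.
print(repr(clean_text('First line.\nSecond "quoted" line')))
# A leading newline becomes ". ", whose period is then removed.
print(repr(clean_text('\nHello')))
```

Two side effects are visible here: a line that already ends in a period yields `..`, and a leading newline leaves a single leading space.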

In [None]:
import csv
import json

csv_file = "bin_train_clean.csv"
json_file = "bin_train_clean.json"

# Read the CSV file into a list of dicts (one dict per row)
with open(csv_file, "r") as f:
    csv_data = csv.DictReader(f)
    data = [row for row in csv_data]

# Write JSON file
with open(json_file, "w") as f:
    json.dump(data, f)
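
One detail worth knowing: `csv.DictReader` returns every field as a string, so the `label` column ends up as `"1"` or `"0"` in the JSON file. The sketch below shows this on an in-memory sample and casts the label back to an integer. The sample rows are made up, and whether you need the cast depends on how the JSON file is consumed later.

```python
import csv
import io
import json

# Made-up sample with the same columns as the cleaned CSV
sample_csv = "sentence,label\nFirst example,1\nSecond example,0\n"

rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(type(rows[0]["label"]))  # the label is read as a string

# Cast the label explicitly so it is serialized as a JSON number
typed_rows = [{"sentence": r["sentence"], "label": int(r["label"])} for r in rows]
json_text = json.dumps(typed_rows)
print(json_text)
```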


# Model and dataset

In [None]:
dataset_id = "agarc15/CIULE"
#model_id = "sentence-transformers/all-mpnet-base-v2"

#model_id = "sentence-transformers/paraphrase-mpnet-base-v2"
#model_id = "sentence-transformers/all-roberta-large-v1"



#dataset_id = "SetFit/enron_spam"
model_id = "kauffinger/xlm-roberta-base-finetuned-enron"


## Loading and sampling the dataset

We will use the 🤗 Datasets library to download the data, which can be done as follows:

In [None]:
from datasets import load_dataset

dataset = load_dataset(dataset_id)
dataset


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 2865
    })
    validation: Dataset({
        features: ['sentence', 'label'],
        num_rows: 40
    })
})

Most datasets on the Hub have many more labeled examples than those one encounters in few-shot settings. To simulate the effect of training on a limited number of examples, let's subsample the training set to have 8 labeled examples per class:

In [None]:
from setfit import sample_dataset

train_dataset = sample_dataset(dataset["train"])
train_dataset

Dataset({
    features: ['sentence', 'label'],
    num_rows: 16
})
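
Conceptually, the sampling step draws the same number of examples from every class. The sketch below illustrates that idea in plain Python. It is not the `setfit` implementation: `sample_per_class`, its parameters and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Return up to `num_samples` randomly chosen rows for each label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take the whole group if it has fewer rows than requested
        sampled.extend(rng.sample(group, min(num_samples, len(group))))
    return sampled


# Toy data: 100 rows with alternating labels 0 and 1
toy = [{"sentence": f"text {i}", "label": i % 2} for i in range(100)]
subset = sample_per_class(toy)
print(len(subset))
```

With two classes and 8 samples per class, the subset holds 16 rows, which matches the `num_rows` shown above.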

Here we have 16 total examples to train with, since the dataset has two classes (labels `0` and `1`) and we sample 8 examples from each. For evaluation, we'll use the validation split:

In [None]:
eval_dataset = dataset["validation"] 

Okay, now we have the dataset, let's load and train a model!

## Fine-tuning the model

To train a SetFit model, the first thing to do is download a pretrained checkpoint from the Hub. We can do so by using the `from_pretrained()` method associated with the `SetFitModel` class:

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained(model_id)


Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/kauffinger_xlm-roberta-base-finetuned-enron were not used when initializing XLMRobertaModel: ['classifier.dense.weight', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /root/.cache/torch/sentence_transformers/kauffinger_xlm-roberta-base-finetuned-enron and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bi

Here, we've downloaded a pretrained Sentence Transformer from the Hub and added a logistic classification head to create the SetFit model. As indicated in the message, we need to train this model on some labeled examples. We can do so by using the `SetFitTrainer` class as follows:

In [None]:
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitTrainer

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    batch_size=8,  # above 8 I run out of VRAM
    num_iterations=40,
    #num_epochs=8, # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},
)

The main arguments to notice in the trainer are the following:

* `loss_class`: The loss function to use for contrastive learning with the Sentence Transformer body
* `num_iterations`: The number of text pairs to generate for contrastive learning
* `column_mapping`: The `SetFitTrainer` expects the inputs to be found in a `text` and `label` column. This mapping automatically formats the training and evaluation datasets for us.
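
To make the role of `num_iterations` more concrete, here is a rough sketch of how contrastive pairs could be built from a small labeled set. It assumes that each iteration pairs every example with one random positive (same label, similarity 1.0) and one random negative (different label, similarity 0.0). `generate_pairs` is a hypothetical helper written for this illustration and is not taken from the `setfit` source.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=0):
    """Build (text_a, text_b, similarity) triples for contrastive training.

    Assumes at least two classes and at least two examples per class.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates with the same label, excluding the example itself
            positives = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            # Candidates with a different label
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(positives), 1.0))
            pairs.append((text, rng.choice(negatives), 0.0))
    return pairs


# Toy data: 16 examples, 8 per class
toy_texts = [f"example {i}" for i in range(16)]
toy_labels = [i % 2 for i in range(16)]
toy_pairs = generate_pairs(toy_texts, toy_labels, num_iterations=40)
print(len(toy_pairs))
```

A loss such as `CosineSimilarityLoss` then pushes the embeddings of each pair towards the target similarity.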

Now that we've created a trainer, we can train it!

In [None]:
trainer.train()

Applying column mapping to training dataset


Generating Training Pairs:   0%|          | 0/40 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 1280
  Num epochs = 1
  Total optimization steps = 160
  Total train batch size = 8


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/160 [00:00<?, ?it/s]
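
The numbers in the log can be reproduced with a little arithmetic. The sketch below assumes two pairs (one positive, one negative) per example and iteration, which is consistent with the figures printed above. `contrastive_schedule` is a hypothetical helper.

```python
import math


def contrastive_schedule(num_examples, num_iterations, batch_size, num_epochs=1):
    """Return (number of training pairs, number of optimization steps)."""
    num_pairs = num_examples * num_iterations * 2  # assumed: 2 pairs per example and iteration
    steps = math.ceil(num_pairs / batch_size) * num_epochs
    return num_pairs, steps


# 16 examples, num_iterations=40, batch_size=8
print(contrastive_schedule(16, 40, 8))  # (1280, 160)
```

Under this assumption, halving `num_iterations` halves the number of pairs, and doubling the batch size halves the number of steps.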

The final step is to compute the model's performance using the `evaluate()` method:

In [None]:
metrics = trainer.evaluate()
metrics
#0.44467713787085517
#0.46457242582897035 BS=7, E=6
#0.4024432809773124 BS=7, E=1

#0.32774869109947646 bs=7 e=1

#0.6226876090750436 bs=8 e=1
#0.6331588132635253 bs8 e4
#0.6558464223385689 bs8 e8


Applying column mapping to evaluation dataset
***** Running evaluation *****


{'accuracy': 0.7}
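
Accuracy is the fraction of predictions that match the reference labels. The sketch below computes it by hand on made-up predictions for a 40-example split; `accuracy` is a hypothetical helper, and the prediction values are invented so that 28 of 40 are correct.

```python
def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(int(p == r) for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up example: 40 labels, 28 predicted correctly
references = [1] * 20 + [0] * 20
predictions = [1] * 14 + [0] * 6 + [0] * 14 + [1] * 6
score = accuracy(predictions, references)
print(score)  # 0.7
```

On a 40-example validation split, each additional correct prediction moves the accuracy by 0.025.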

And once the model is trained, you can push it to the Hub:

In [None]:
trainer.push_to_hub("agarc15/TEST")

'https://huggingface.co/agarc15/TEST/tree/main/'

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `your-username/the-name-you-picked` so for instance:

In [None]:
from setfit import SetFitModel

model = SetFitModel.from_pretrained("agarc15/TEST")

# Run inference
# The article is split across adjacent string literals, which Python joins into a single string
preds = model([
    "As the federal government’s zero-trust journey continues, cybersecurity officials say they are working to review individual agency plans, harmonize implementation guidance and set up alternative standards to judge the progress of smaller agencies and offices.. Chris DeRusha, the federal government’s chief information security officer, said now that agencies have submitted their plans for moving to a zero-trust architecture, they will need to go through an Office of Management and Budget review, a process that is likely to lead to further changes and refinement.. “What we’re doing right now is going through those plans and making sure that they align to what we asked [agencies] to do in the memo, making sure that they’re sound plans working with the budget side to make sure that they have awareness, as well,” DeRusha said in an interview Wednesday after speaking at an event hosted by Institute for Critical Infrastructure Technology.. Agencies have been naming a mixture of CIOs, CISOs and other officials as their leads for implementation, and part of OMB’s process is evaluating whether those designated officials are the best fit for the job. The agency is also incorporating technical input from staff at the Office of the National Cyber Director.. Where possible, zero-trust items have been incorporated into respective agency budgets, but DeRusha said OMB and the White House designed the zero-trust mandates with a general three-year deadline in order to maintain enough flexibility to work through each agency’s unique IT environment.. “A reminder of why we did it this way as opposed to setting concrete deadlines for all the tasks in the memo is we wanted to be mindful of this [reality], DeRusha said. “We understand that every agency is in a different spot in their journey across these five pillars in the strategy, and we really want to make sure that we have this opportunity to develop strong points.”. "
    "There’s also a challenge in synthesizing all the different guidance that agencies are receiving. OMB is leading the implementation of zero trust in the civilian federal government and has put out its own zero-trust outline that agencies must follow. Others have also weighed in, with the Cybersecurity and Infrastructure Security Agency, the NSA, the National Institute for Standards and Technology and military agencies like the Defense Information Systems Agency publishing or in the process of developing zero-trust guidance and strategy documents for downstream agencies to follow.. While some of these documents are meant to serve specific purposes (for example, CISA’s guidance is meant to help agencies reconcile their zero-trust tasks with the technical and cybersecurity maturity of their IT environment), they have also created a mash of documentation for agencies to ingest and some confusion.. According to CISA Deputy Director Nitin Natarajan, that diversity of resources is by design and part of a broader effort to collaborate with other stakeholders and achieve buy-in for the work ahead.. “The federal civilian enterprise is a wide-open space. A lot of people perceive it to be we just reach out to a bunch of CIOs, say ‘do X’ and it happens,” Natarajan said. ”But realistically, that’s not the reality that we’re in, so how do we make sure that we can talk about … where we need to go, what is the best way to get there and then how do we invest in that?”. Like DeRusha, he reflected on the need for a process that is measured and can take into account the unique budgetary, staffing and technology needs at each agency.. “You know, there’s not a magic checkbook in government, so how do we make sure that we’re resourcing these things effectively to get to success?” he said. “If we’re not resourcing correctly, we can’t get there from here. "
    "And the federal budget process is [slow], so how do we make sure we can get investments where we need them to be to really be on the forefront of that? It’s going to take some time, it’s going to take some prioritization and some commitment.”. Small agencies bring big cybersecurity challenges. One of the more complex challenges facing OMB and other agencies is figuring out how zero trust mandates will trickle down to smaller and mid-sized agencies. The federal civilian government is a vast empire of departments, agencies and offices, some with hundreds of thousands of employees and billions of dollars in spending authority, while others have only a few dozen employees and a budget measured in the millions of dollars.. It is often impossible to craft mandates that are relevant to the IT realities of the Department of Veterans Affairs ($316 billion budget) and the Selective Service System (with an annual budget of less than $30 million) and documents like OMB’s zero trust guidance are often developed with the former in mind.. Some have questioned whether agencies can really complete the work, which includes identifying every network connected device, implementing multifactor authentication and encryption, microsegmentation of networks, accelerating cloud deployments, deploying endpoint detection and response systems and more, by 2024. There is of course another event that is taking place around that same time which could be influencing that timeline: the end of President Joe Biden’s first term in office.. Greg Touhill, who served as federal chief information security officer under President Barack Obama, told SC Media that the timelines established aren’t impossible, but do speak to the reality that those in charge of implementing the plan may not be around to see it through past 2024.. "
    "“You’ve got to acknowledge the political realities and the ‘Cinderella strike of midnight’ aspects of the administration,” said Touhill, now director of the CERT at the Software Engineering Institute. “I think it’s certainly [achievable] \u2014 it’s late to need \u2014 but we’ve got to choose wisely.”  Touhill said many smaller and micro agencies are simply not going to have the resources or staff to effectively manage the kind of technological requirements that will come with the cybersecurity executive order and zero-trust mandates. He has advocated for a managed security service provider (MSSP) model in government that can handle the cybersecurity needs of smaller- and less-resourced agencies and offices, and said the government must stop buying technology that requires months of training and a legion of cyber professionals to properly install, configure or manage.. “I do think for the small agencies out there, just like the small- or medium- [sized] business, having an MSSP type of relationship provided by one of those related-, larger- and better-funded agencies, might be a prescription for moving faster and providing better protection of the people’s information,” Touhill said.. DeRusha, for his part, has been singing the same tune since last year, saying that smaller agencies won't be judged by the same standards as larger- or mid-sized agencies when it comes to implementing the administration’s mandates. He told SC Media Wednesday that OMB hasn’t yet determined what those standards will look like, but as they get more data it will help them craft alternative options for implementation and budget needs.. “I think it’s too early to say [right now] but I will be very transparent that it will be different and we’re going to really work with the small and mediums to make sure that we come up with a successful plan, because it may end up looking different than for the large [agencies],” he said."
])
preds

tensor([1])

## Fine-tuning with a pure PyTorch model

`setfit` also provides a pure PyTorch implementation of `SetFitModel`, where the head is a dense layer instead of a classifier from `scikit-learn`. This allows one to do backprop end-to-end and have more fine-grained control over the training process.

To use the PyTorch model, we load a pretrained model with `use_differentiable_head=True` and specify the number of classes to include in the head:

In [None]:
from setfit import SetFitModel

num_classes = len(train_dataset.unique("label"))
model = SetFitModel.from_pretrained(model_id, use_differentiable_head=True, head_params={"out_features": num_classes})

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/kauffinger_xlm-roberta-base-finetuned-enron were not used when initializing XLMRobertaModel: ['classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.out_proj.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at /root/.cache/torch/sentence_transformers/kauffinger_xlm-roberta-base-finetuned-enron and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bi

As before, we instantiate the trainer:

In [None]:
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    column_mapping={"sentence": "text", "label": "label"},
)

Next, we freeze the weights of the final layer and apply contrastive learning:

In [None]:
trainer.freeze()
trainer.train(body_learning_rate=1e-5, num_epochs=1)

Applying column mapping to training dataset


Generating Training Pairs:   0%|          | 0/20 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 40
  Total train batch size = 16


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/40 [00:00<?, ?it/s]

OutOfMemoryError: ignored

Note that here we can specify the learning rate for the model's body. We find that small values in the 1e-5 range work well for this step.

Now that the model body is tuned, we can unfreeze the head and train it:

In [None]:
trainer.unfreeze(keep_body_frozen=True)
trainer.train(learning_rate=1e-2, num_epochs=50)

Applying column mapping to training dataset


Epoch:   0%|          | 0/50 [00:00<?, ?it/s]

Note that a larger learning rate is used when training the head. We recommend using values in the 1e-2 range. Now that the model is trained, we can evaluate it as usual:

In [None]:
trainer.evaluate()

Applying column mapping to evaluation dataset
***** Running evaluation *****


{'accuracy': 0.8577981651376146}

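Treat this figure with caution. The contrastive step above stopped with an `OutOfMemoryError`, and 0.8578 is not a value that a 40-example validation split can produce (accuracy there moves in steps of 0.025), so this output appears to come from a different run. Rerun the cells with a smaller `batch_size` before comparing the result with the 0.7 obtained with the `scikit-learn` head.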