Release v2.7.0 - CachedGISTEmbedLoss, easy Matryoshka inference & evaluation, CrossEncoder, Intel Gaudi2 · UKPLab/sentence-transformers

This release introduces a promising new loss function, easier inference for Matryoshka models, new functionality for CrossEncoders, and inference on Intel Gaudi2, along with much more.

Install this version with

pip install sentence-transformers==2.7.0

New loss function: CachedGISTEmbedLoss (#2592)

For a number of years, MultipleNegativesRankingLoss (also known as SimCSE, InfoNCE, in-batch negatives loss) has been the state of the art in embedding model training. Notably, this loss function performs better with a larger batch size.
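To make the batch-size remark concrete, here is a minimal pure-Python sketch of an in-batch negatives loss. It is an illustration, not the library's implementation: the helper names, the toy vectors, and the scale of 20 are all assumptions chosen for the example. Each anchor is scored against every positive in the batch, so a larger batch means more negatives per anchor.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positive in the batch acts as a negative."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled cosine similarity of this anchor to every positive in the batch.
        logits = [scale * dot(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy 2-dimensional "embeddings", made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_shuffled = in_batch_negatives_loss(anchors, shuffled)
print(f"aligned: {loss_aligned:.4f}, shuffled: {loss_shuffled:.4f}")
```

The loss is small when each anchor is closest to its own positive and large when the pairs are misaligned.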

Recently, various improvements have been introduced:

CachedMultipleNegativesRankingLoss was introduced, which allows you to pick much higher batch sizes (e.g. 65536) with constant memory.
GISTEmbedLoss takes a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.
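The guide-model idea can be sketched as a masking step applied before the in-batch loss. This is a simplified illustration under stated assumptions, not the library's implementation: the function name and both similarity matrices are hypothetical. A candidate negative is dropped for an anchor whenever the guide rates it as more similar to that anchor than the anchor's own positive.

```python
NEG_INF = float("-inf")

def mask_false_negatives(model_sim, guide_sim):
    """Return a copy of model_sim where candidate (i, j) is set to -inf
    whenever the guide scores it above the true pair (i, i)."""
    masked = []
    for i, row in enumerate(model_sim):
        threshold = guide_sim[i][i]  # guide's score for the true pair
        masked.append([
            NEG_INF if j != i and guide_sim[i][j] > threshold else score
            for j, score in enumerate(row)
        ])
    return masked

# Hypothetical anchor-vs-positive similarity matrices (rows are anchors).
model_sim = [
    [0.80, 0.75, 0.10],
    [0.20, 0.90, 0.15],
    [0.05, 0.30, 0.85],
]
guide_sim = [
    [0.82, 0.91, 0.12],  # the guide rates candidate 1 above the true pair for anchor 0
    [0.25, 0.88, 0.10],
    [0.08, 0.20, 0.90],
]

masked = mask_false_negatives(model_sim, guide_sim)
print(masked[0])
```

A score of negative infinity contributes nothing to a softmax, so the likely false negative no longer pushes the anchor away from a text that may in fact be relevant.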

Now, @JacksonCakes has combined these two approaches to produce the best of both worlds: CachedGISTEmbedLoss. This loss function allows for high batch sizes with constant memory usage, while also using a guide model to assist with the in-batch negative sample selection.

As can be seen in our Loss Overview, this loss function should be used with (anchor, positive) pairs or (anchor, positive, negative) triplets, much like MultipleNegativesRankingLoss, CachedMultipleNegativesRankingLoss, and GISTEmbedLoss. In short, any example using those loss functions can be updated to use CachedGISTEmbedLoss! Feel free to experiment, e.g. with this training script.

Automatic Matryoshka model truncation (#2573)

Sentence Transformers v2.4.0 introduced Matryoshka models: models whose embeddings are still useful after truncation. Since then, many useful Matryoshka models have been trained.

As of this release, the truncation for these Matryoshka embedding models can be done automatically via a new truncate_dim constructor argument:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

matryoshka_dim = 64
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True, truncate_dim=matryoshka_dim)

embeddings = model.encode(
    [
        "search_query: What is TSNE?",
        "search_document: t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.",
        "search_document: Amelia Mary Earhart was an American aviation pioneer and writer.",
    ]
)
print(embeddings.shape)
# => (3, 64)

similarities = cos_sim(embeddings[0], embeddings[1:])
# => tensor([[0.7839, 0.4933]])
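Conceptually, truncation keeps only the leading values of each embedding. The following pure-Python sketch shows that idea on made-up 8-dimensional vectors; the helper names and the numbers are assumptions for illustration, and real Matryoshka models are trained so that the leading dimensions carry most of the signal.

```python
import math

def truncate(embedding, dim):
    """Keep only the first `dim` values of an embedding."""
    return embedding[:dim]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings, for illustration only.
query = [0.9, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01, 0.01]
related = [0.8, 0.4, 0.1, 0.2, 0.03, 0.05, 0.02, 0.00]
unrelated = [-0.2, 0.9, -0.3, 0.1, 0.04, 0.01, 0.03, 0.02]

for dim in (8, 4, 2):
    q, r, u = (truncate(e, dim) for e in (query, related, unrelated))
    print(dim, round(cosine_similarity(q, r), 3), round(cosine_similarity(q, u), 3))
```

Because cosine similarity divides by the vector norms, the truncated vectors do not need to be re-normalized before comparison.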

Extra information:

Model truncation in all evaluators (#2582)

Alongside easier inference with Matryoshka models, evaluating them is now also much easier: you can pass truncate_dim to any Evaluator. This way, you can easily check the performance of any Sentence Transformer model at various truncated dimensions (even if the model was not trained with MatryoshkaLoss!).

from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers import SentenceTransformer
import datasets

model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")

stsb = datasets.load_dataset("mteb/stsbenchmark-sts", split="test")

for dim in [768, 512, 256, 128, 64, 32, 16, 8, 4]:
    evaluator = EmbeddingSimilarityEvaluator(
        stsb["sentence1"],
        stsb["sentence2"],
        [score / 5 for score in stsb["score"]],
        name=f"sts-test-{dim}",
        truncate_dim=dim,
    )
    print(f"dim={dim:<3}: {evaluator(model) * 100:.2f} Spearman Correlation")

dim=768: 86.81 Spearman Correlation
dim=512: 86.76 Spearman Correlation
dim=256: 86.66 Spearman Correlation
dim=128: 86.20 Spearman Correlation
dim=64 : 85.40 Spearman Correlation
dim=32 : 82.42 Spearman Correlation
dim=16 : 79.31 Spearman Correlation
dim=8  : 72.82 Spearman Correlation
dim=4  : 63.44 Spearman Correlation
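For readers unfamiliar with the reported metric, here is a small pure-Python sketch of Spearman correlation between predicted similarities and gold labels. The data is hypothetical and the helpers are simplified (ties are not handled), so treat it as an illustration rather than what the evaluator runs internally. It also shows why rescaling the gold scores, as with `score / 5` above, leaves the correlation unchanged: only the ranks matter.

```python
import math

def ranks(values):
    """Rank positions (0 = smallest). Ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order):
        result[index] = float(position)
    return result

def pearson(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gold = [4.8, 1.2, 3.5, 0.4, 2.9]            # hypothetical 0-5 similarity labels
predicted = [0.91, 0.35, 0.62, 0.10, 0.70]  # hypothetical cosine similarities

rho_raw = spearman(predicted, gold)
rho_scaled = spearman(predicted, [g / 5 for g in gold])
print(rho_raw, rho_scaled)
```

Both calls give about 0.9 here, since dividing the labels by 5 does not change their ordering.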

Here are some example training scripts that use this new truncate_dim option to assist with training Matryoshka models:

CrossEncoder improvements

This release improves the support for CrossEncoder reranker models.

`push_to_hub` (#2524)

You can now push trained CrossEncoder models to the 🤗 Hugging Face Hub!

from sentence_transformers import CrossEncoder

...

model = CrossEncoder("distilroberta-base")

# Train the model
model.fit(
    train_dataloader=train_dataloader,
    evaluator=evaluator,
    epochs=num_epochs,
    warmup_steps=warmup_steps,
)

model.push_to_hub("tomaarsen/distilroberta-base-stsb-cross-encoder")

Docs: CrossEncoder.push_to_hub

`trust_remote_code` for custom models (#2595)

You can now load custom models from the Hugging Face Hub, i.e. models with custom modelling code that requires trust_remote_code to load.

from sentence_transformers import CrossEncoder

# Note: this model does not require `trust_remote_code=True` - there are currently no models that require it.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", trust_remote_code=True)

# We want to compute the similarity between the query sentence
query = "A man is eating pasta."

# With all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)

# Print the scores
print("Query:", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")

Docs: CrossEncoder
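The loop above reads `corpus_id` and `score` from each ranked entry. The sketch below mimics that result shape in pure Python, assuming the entries are sorted by descending score; `rank_by_score` is a hypothetical helper and the scores are made up, so this illustrates the data structure rather than the model's behavior.

```python
def rank_by_score(scores):
    """Build one dict per corpus entry and sort by score, best first."""
    results = [{"corpus_id": i, "score": s} for i, s in enumerate(scores)]
    return sorted(results, key=lambda r: r["score"], reverse=True)

corpus = [
    "A man is eating food.",
    "The girl is carrying a baby.",
    "A man is eating a piece of bread.",
]
scores = [8.6, -10.9, 5.1]  # made-up relevance scores, one per corpus entry

ranks = rank_by_score(scores)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
```

Keeping `corpus_id` in each entry lets you map a ranked result back to the original text even after sorting.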

Inference on Intel Gaudi2 (#2557)

From this release onwards, you will be able to perform inference on Intel Gaudi2 accelerators. No modifications are needed, as the library will automatically detect the hpu device and configure the model accordingly. Thanks to Intel Habana for the support here.
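As a rough idea of what automatic detection can look like, the sketch below checks whether a Habana Python package is importable and falls back to the CPU otherwise. This is an assumption-laden illustration, not the library's code: the package name `habana_frameworks` and the helper `guess_device` are assumptions, and a real check would presumably also confirm that a device is actually available.

```python
import importlib.util

def guess_device() -> str:
    """Return "hpu" if the (assumed) Habana package is installed, else "cpu"."""
    # Assumption: the Habana PyTorch bridge is importable as `habana_frameworks`.
    if importlib.util.find_spec("habana_frameworks") is not None:
        return "hpu"
    return "cpu"

device = guess_device()
print(device)
```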

All changes

[docs] Add simple Makefile for building docs by @tomaarsen in #2566
[examples] Add Matryoshka evaluation plot by @kddubey in #2564
Adding push_to_hub to CrossEncoder by @imvladikon in #2524
Fix semantic_search_usearch() for single query by @karmi in #2572
[requirements] Set minimum transformers version to 4.34.0 for is_nltk_available by @tomaarsen in #2574
[docs] Update link: retrieve_rerank_simple_wikipedia.py -> .ipynb by @tomaarsen in #2580
Document dev reqs, add ruff pre-commit by @kddubey in #2576
Enable Sentence Transformer Inference with Intel Gaudi2 GPU Supported ( 'hpu' ) by @ZhengHongming888 in #2557
[feat] Add truncation support by @kddubey in #2573
[examples] Add model upload for training_nli_v3 with GISTEmbedLoss by @tomaarsen in #2584
Add truncation support in evaluators by @kddubey in #2582
Add ST annotation to evaluators by @kddubey in #2586
[fix] Matryoshka training always patch original forward, and check matryoshka_dims by @kddubey in #2593
corrected comment from kmeans to agglomerative by @DhruvMakwana in #2590
Update transformers requirement in setup.py to match requirements.txt by @maxfriedrich in #2589
feat: add trust remote code to cross encoders by @bwanglzu in #2595
Add CachedGISTEmbedLoss by @JacksonCakes in #2592
[docs] Fix search bar on sbert.net by @tomaarsen in #2597
[clip] Prevent warning with padding when tokenizing for CLIP by @tomaarsen in #2599

New Contributors

@imvladikon made their first contribution in #2524
@karmi made their first contribution in #2572
@ZhengHongming888 made their first contribution in #2557
@DhruvMakwana made their first contribution in #2590
@maxfriedrich made their first contribution in #2589
@JacksonCakes made their first contribution in #2592

I especially want to thank @JacksonCakes for their excellent CachedGISTEmbedLoss PR and @kddubey for their wonderful PRs surrounding Matryoshka models and general repository housekeeping.

Full Changelog: v2.6.1...v2.7.0
