# Transforming LLMs into High-Quality Text Embeddings with LLM2Vec

In the paper "LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders" (August 2024), the authors introduce LLM2Vec, a simple, unsupervised method that turns any decoder-only LLM into a strong text encoder.

The technique produces embeddings from decoder-only models such as GPT in three steps:
✅ Bidirectional Attention: The usual causal attention mask is replaced with an all-ones matrix, so each token can attend to every other token in the sequence, which gives it a “bidirectional” view.

☑️ Masked Next Token Prediction (MNTP): MNTP combines next-token prediction with masked language modeling. A fraction of the input tokens is masked, and the model is trained to recover each masked token from the logits at the previous position, which teaches it to make use of its new bidirectional attention.

❎ Unsupervised Contrastive Learning: Using SimCSE, this step teaches the model to produce distinct sentence representations by maximizing the similarity between two representations of the same sentence while minimizing the similarity with representations of other sentences.
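
To make the first step concrete, here is a minimal sketch in plain Python of the difference between a causal mask and the all-ones mask described above. The helper names `causal_mask` and `bidirectional_mask` are hypothetical and are not part of LLM2Vec; a real model applies the mask inside its attention layers.

```python
def causal_mask(n):
    """Lower-triangular mask: token i may attend to tokens 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """All-ones mask: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]


tokens = ["The", "cat", "sat"]
masks = {"causal": causal_mask(len(tokens)), "bidirectional": bidirectional_mask(len(tokens))}
for name, mask in masks.items():
    print(name)
    for token, row in zip(tokens, mask):
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {token!r} attends to {visible}")
```

With the causal mask the first token only sees itself, while with the all-ones mask every token sees the whole sentence.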



*This notebook is inspired by the article [Llama 3.2 Embeddings: Training and Evaluation with LLM2Vec](https://newsletter.kaitchup.com/p/llama-32-embeddings-training) by Benjamin Marie, and by its accompanying notebook.*

I also reviewed the *LLM2Vec GitHub* repo, where you can find examples using other language models.


In this notebook, we will build a text embedding model from Qwen2 0.5B. We will go through every step in detail: masked next-token prediction training, contrastive learning, and evaluation of the resulting embeddings.
You can find the base model on Hugging Face: [Qwen2 0.5B Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct).

To train and evaluate the embedding model, I used an RTX 3090 from Vast.ai.


## Sections

* Package installation
* Masked next-token prediction training
* Contrastive learning
* Merging the adapter into the base model and saving it to the Hugging Face Hub
* Evaluating the model
* Downloading the model and running inference

# Installation

In [1]:
!git clone https://github.com/McGill-NLP/llm2vec.git
!cd llm2vec && pip install -e .[evaluation]
!pip install flash-attn --no-build-isolation

Cloning into 'llm2vec'...
remote: Enumerating objects: 915, done.
remote: Counting objects: 100% (345/345), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 915 (delta 244), reused 197 (delta 197), pack-reused 570 (from 1)
Receiving objects: 100% (915/915), 1.48 MiB | 6.35 MiB/s, done.
Resolving deltas: 100% (532/532), done.
Obtaining file:///workspace/llm2vec
  Preparing metadata (setup.py) ... done
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting transformers<=4.44.2,>=4.43.1
  Downloading transformers-4.44.2-py3-none-any.whl.metadata (43 kB)
Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux

You will need a Hugging Face access token to download Qwen2 0.5B and, later, to push the trained model to the Hub.

In [2]:
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')

In [None]:
hf_token='<YOUR HUGGINGFACE API KEY>'
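
Outside Colab, an alternative to pasting the token into a cell is to read it from an environment variable. The sketch below assumes the variable is called `HF_TOKEN`; the helper name `get_hf_token` is hypothetical and only uses the standard library.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token
```

Calling `hf_token = get_hf_token()` before the login cell keeps the token out of the saved notebook.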

In [3]:
from huggingface_hub import login
login(token=hf_token)  # pass your Hugging Face access token to download Qwen2 and push models to the Hub

# Masked next-token prediction training

Set the model and training parameters and save them to a JSON file.

In [4]:
JSON_CONFIG='''
 {
    "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-103-raw-v1",
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "do_train": true,
    "do_eval": true,
    "max_seq_length": 512,
    "mask_token_type": "blank",
    "data_collator_type": "default",
    "mlm_probability": 0.2,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp/Qwen2-0.5B",
    "evaluation_strategy": "steps",
    "eval_steps": 100,
    "save_steps": 250,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "dataloader_num_workers": 4,
    "dataloader_prefetch_factor": 2
}
'''

with open("mtnp_qwen2_config.json", 'w') as f:
  f.write(JSON_CONFIG)
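
The `mlm_probability` of 0.2 in the config above sets the share of tokens that is masked during MNTP. The following is a simplified sketch of how the masked inputs and their training targets relate, assuming every selected token is replaced by a single mask id (a real data collator may also keep or randomize some of them). The helpers `mask_tokens` and `mntp_targets` and the token ids are hypothetical, not LLM2Vec code.

```python
import random

IGNORE_INDEX = -100  # label value the loss skips (Hugging Face convention)


def mask_tokens(token_ids, mask_id, mlm_probability, rng):
    """Replace a random fraction of tokens with mask_id; keep originals as labels."""
    inputs, labels = [], []
    for token in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(mask_id)
            labels.append(token)
        else:
            inputs.append(token)
            labels.append(IGNORE_INDEX)
    return inputs, labels


def mntp_targets(labels):
    """Pair each masked position i with position i - 1, whose logits predict it."""
    return [
        (i - 1, label)
        for i, label in enumerate(labels)
        if label != IGNORE_INDEX and i > 0
    ]


rng = random.Random(0)
token_ids = list(range(100, 120))  # hypothetical token ids
inputs, labels = mask_tokens(token_ids, mask_id=0, mlm_probability=0.2, rng=rng)
for logit_position, target in mntp_targets(labels):
    print(f"logits at position {logit_position} are trained to predict token {target}")
```

A masked token at position 0 has no previous position, so in this sketch it contributes no training target.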

Now we can run the MNTP training with the script provided by LLM2Vec. The script trains with LoRA, so we only train an adapter, which keeps training relatively cheap; this adapter is loaded again in the next steps. The training dataset is WikiText-103 (`wikitext-103-raw-v1` in the config above).
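
To see why a LoRA adapter is cheap, the sketch below counts its trainable parameters for `lora_r` = 16. The layer shapes and the list of adapted projections are assumptions (taken from the published Qwen2-0.5B configuration and a common choice of target modules), not values stated in this notebook. Under these assumptions the total comes out at 8,798,208, the same number the SimCSE training log reports further down as `trainable params`.

```python
# Assumed Qwen2-0.5B shapes (not stated in this notebook)
HIDDEN = 896
KV_DIM = 128          # 2 key/value heads x 64 dimensions
INTERMEDIATE = 4864
NUM_LAYERS = 24
LORA_R = 16           # "lora_r" in the config above

# (in_features, out_features) of each adapted linear layer; the target list is an assumption
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A LoRA pair adds A (r x in) and B (out x r) to one linear layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} trainable parameters per layer, {total:,} in total")
```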

In [5]:
!python llm2vec/experiments/run_mntp.py mtnp_qwen2_config.json

12/01/2024 10:05:49 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=4,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=2,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=

# Contrastive learning

We need to download the training dataset for the contrastive step: Wiki1M (`wiki1m_for_simcse.txt`), a file of one million Wikipedia sentences. For this step, LLM2Vec uses SimCSE (Simple Contrastive Learning of Sentence Embeddings).
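
As an illustration of the SimCSE objective, here is a minimal plain-Python sketch of a contrastive loss with in-batch negatives: row *i* of the first view should be most similar to row *i* of the second view. In unsupervised SimCSE the two views are the same sentence encoded twice with different dropout masks. The function names and the toy vectors are hypothetical, and treating a scale of 20 as a multiplier on the cosine similarity is an assumption about how `loss_scale` is applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def simcse_loss(view_a, view_b, scale=20.0):
    """Cross-entropy over scaled similarities; the positive for row i is column i."""
    losses = []
    for i, a in enumerate(view_a):
        logits = [scale * cosine(a, b) for b in view_b]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two toy "sentences", each encoded twice (hypothetical vectors)
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
print("aligned pairs:   ", round(simcse_loss(view_a, view_b), 4))
print("mismatched pairs:", round(simcse_loss(view_a, view_b[::-1]), 4))
```

The loss is small when each sentence is closest to its own second view and large when the pairs are mismatched.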

In [6]:
!wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt

--2024-12-01 10:53:48--  https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt
Resolving huggingface.co (huggingface.co)... 13.32.110.109, 13.32.110.55, 13.32.110.77, ...
Connecting to huggingface.co (huggingface.co)|13.32.110.109|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/datasets/princeton-nlp/datasets-for-simcse/7b1825863a99aa76479b0456f7c210539dfaeeb69598b41fb4de4f524dd5a706?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27wiki1m_for_simcse.txt%3B+filename%3D%22wiki1m_for_simcse.txt%22%3B&response-content-type=text%2Fplain&Expires=1733309628&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMzMwOTYyOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9kYXRhc2V0cy9wcmluY2V0b24tbmxwL2RhdGFzZXRzLWZvci1zaW1jc2UvN2IxODI1ODYzYTk5YWE3NjQ3OWIwNDU2ZjdjMjEwNTM5ZGZhZWViNjk1OThiNDFmYjRkZTRmNTI0ZGQ1YTcwNj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9

We create a new JSON file containing the parameters.

In [8]:
JSON_CONFIG='''
{
    "model_name_or_path": "Qwen/Qwen2-0.5B-Instruct",
    "peft_model_name_or_path": "output/mntp/Qwen2-0.5B",
    "simcse_dropout": 0.3,
    "bidirectional": true,
    "pooling_mode": "mean",
    "dataset_name": "Wiki1M",
    "dataset_file_path": "wiki1m_for_simcse.txt",
    "remove_unused_columns": false,
    "learning_rate": 3e-5,
    "loss_scale": 20,
    "per_device_train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "do_train": true,
    "disable_tqdm": false,
    "max_seq_length": 128,
    "overwrite_output_dir": true,
    "output_dir": "output/mntp-simcse/Qwen2-0.5B",
    "logging_steps": 50,
    "save_steps": 250,
    "save_only_model": true,
    "stop_after_n_steps": 1000,
    "lora_r": 16,
    "gradient_checkpointing": true,
    "torch_dtype": "bfloat16",
    "attn_implementation": "flash_attention_2",
    "seed": 422
}
'''

with open("simcse_qwen2_config.json", 'w') as f:
  f.write(JSON_CONFIG)
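
A quick sanity check of what this config means in terms of data seen: the sketch below computes the effective batch size and the fraction of Wiki1M covered before `stop_after_n_steps`. It assumes a single GPU and one million training sentences. The `epoch` values it prints for steps 50 to 200 (0.01, 0.01, 0.02, 0.03) agree with those in the training log of the run below.

```python
# Values copied from the SimCSE config above
per_device_train_batch_size = 32
gradient_accumulation_steps = 4
stop_after_n_steps = 1000
logging_steps = 50

num_gpus = 1               # assumption: a single GPU
num_sentences = 1_000_000  # assumption: size of the Wiki1M file

sentences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
sentences_seen = sentences_per_step * stop_after_n_steps


def epoch_at(step):
    """Fraction of one pass over the dataset completed after `step` optimizer steps."""
    return step * sentences_per_step / num_sentences


print(f"effective batch size: {sentences_per_step}")
print(f"sentences seen after {stop_after_n_steps} steps: {sentences_seen:,}")
for step in range(logging_steps, 5 * logging_steps, logging_steps):
    print(f"step {step}: epoch {round(epoch_at(step), 2)}")
```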

Now, it's time to run the contrastive learning training.

In [9]:
!python llm2vec/experiments/run_simcse.py simcse_qwen2_config.json

2024-12-01 10:57:55 - llm2vec.dataset.Wiki1M - INFO - Loading Wiki1M data from wiki1m_for_simcse.txt...
2024-12-01 10:57:57 - llm2vec.dataset.Wiki1M - INFO - Loaded 1000000 samples.
Loading train examples...: 100%|██| 1000000/1000000 [00:03<00:00, 320123.83it/s]
2024-12-01 10:58:03 - peft.tuners.tuners_utils - INFO - Already found a `peft_config` attribute in the model. This will lead to having multiple adapters in the model. Make sure to know what you are doing!
Model's Lora trainable parameters:
trainable params: 8,798,208 || all params: 502,830,976 || trainable%: 1.7497
{'loss': 0.8613, 'grad_norm': 103.16072082519531, 'learning_rate': 2.9935995903737838e-05, 'epoch': 0.01}
{'loss': 0.1583, 'grad_norm': 32.38646697998047, 'learning_rate': 2.987199180747568e-05, 'epoch': 0.01}
{'loss': 0.0821, 'grad_norm': 20.49149513244629, 'learning_rate': 2.9807987711213516e-05, 'epoch': 0.02}
{'loss': 0.0574, 'grad_norm': 21.583349227905273, 'learning_rate': 2.9743983614951356e-05, 'epoch': 0.03}

# Merging the adapter

In [12]:
!cd llm2vec && pip install -e .

Obtaining file:///workspace/llm2vec
  Preparing metadata (setup.py) ... done
Installing collected packages: llm2vec
  Attempting uninstall: llm2vec
    Found existing installation: llm2vec 0.2.2
    Uninstalling llm2vec-0.2.2:
      Successfully uninstalled llm2vec-0.2.2
  DEPRECATION: Legacy editable install of llm2vec==0.2.2 from file:///workspace/llm2vec (setup.py develop) is deprecated. pip 25.0 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. Discussion can be found at https://github.com/pypa/pip/issues/11457
  Running setup.py develop for llm2vec
Successfully installed llm2vec-0.2.2

Let's merge the adapter into the base model.

In [10]:
import torch
from llm2vec.llm2vec import LLM2Vec

In [11]:
l2v_model = LLM2Vec.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    peft_model_name_or_path="output/mntp-simcse/Qwen2-0.5B/checkpoint-1000/",
    device_map="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.bfloat16,
    merge_peft=True
)
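
Conceptually, merging a LoRA adapter folds the low-rank update into the frozen weight, `W' = W + (alpha / r) * B @ A`, so the merged model gives the same outputs without a separate adapter path. The sketch below shows this on a tiny example in plain Python; it is an illustration under that standard LoRA formulation, not the code that `merge_peft=True` runs, and the function names and matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * lora_b @ lora_a."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (out x in)
lora_a = [[1.0, 2.0]]              # r x in, with r = 1
lora_b = [[1.0], [3.0]]            # out x r
merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)

x = [[1.0], [1.0]]                 # one input column vector
adapter_path = matmul(lora_b, matmul(lora_a, x))
unmerged = [[b[0] + 2 * a[0]] for b, a in zip(matmul(weight, x), adapter_path)]
print("merged and unmerged outputs match:", matmul(merged, x) == unmerged)
```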


Save the merged model to disk, then push it to the HF Hub.

In [12]:
l2v_model.save("Qwen2-0.5B-mntp-simcse")

In [14]:
l2v_model.model.push_to_hub("edumunozsala/Qwen2-0.5B-mntp-simcse")

model.safetensors:   0%|          | 0.00/988M [00:00<?, ?B/s]

HTTP Error 502 thrown while requesting PUT https://hf-hub-lfs-us-east-1.s3-accelerate.amazonaws.com/repos/68/e4/68e4108e4ec796f8f61eba28ea62eb9f2d21b1c90183d11e762bb99445398781/08802948e85b7176a2dbaf30a1b2b44debde0e624713515731162bb6f4c2569a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIA2JU7TKAQLC2QXPN7%2F20241201%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241201T115147Z&X-Amz-Expires=86400&X-Amz-Signature=1f6543b28a59dc9dc7153f6e3022df4ca352d1b374f053220b1c9216e5de13a8&X-Amz-SignedHeaders=host&partNumber=14&uploadId=BOvj0uiTqUp4F5Axk1H_yk7YqW3j.96zQo21p3_aiDJwIe2XHIMUhwPhnAAsz7Y0IVN_2Xvz3pdLKu5411gYr.ZexYAs1bhXeQ5EmYw1JcNNA4qQONk5mk_LW.p7iI_b&x-id=UploadPart
Retrying in 1s [Retry 1/5].


CommitInfo(commit_url='https://huggingface.co/edumunozsala/Qwen2-0.5B-mntp-simcse/commit/a6317a00246c6bcaf80d121226fb3d1fc0aa40fa', commit_message='Upload model', commit_description='', oid='a6317a00246c6bcaf80d121226fb3d1fc0aa40fa', pr_url=None, repo_url=RepoUrl('https://huggingface.co/edumunozsala/Qwen2-0.5B-mntp-simcse', endpoint='https://huggingface.co', repo_type='model', repo_id='edumunozsala/Qwen2-0.5B-mntp-simcse'), pr_revision=None, pr_num=None)

## Inference

In [15]:
# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v_model.encode(queries)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


In [16]:
# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v_model.encode(documents)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [17]:
# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

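print(cos_sim)
# Note: the string below is a reference result kept from the example this cell was adapted from.
# It is not this model's output; the tensor printed by the line above is.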
"""
tensor([[0.6470, 0.1619],
        [0.0786, 0.5844]])
"""

tensor([[0.8209, 0.5430],
        [0.6533, 0.7960]])


'\ntensor([[0.6470, 0.1619],\n        [0.0786, 0.5844]])\n'
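
To read the similarity matrix: each row is a query, each column is a document, and the largest value in a row marks the best-matching document. The sketch below reimplements the cosine computation in plain Python and ranks the documents using the values printed above. The helper names are hypothetical.

```python
import math


def cosine_matrix(queries, documents):
    """Cosine similarity of every query vector with every document vector."""
    def normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec]

    q_norm = [normalize(q) for q in queries]
    d_norm = [normalize(d) for d in documents]
    return [[sum(a * b for a, b in zip(q, d)) for d in d_norm] for q in q_norm]


def best_document(similarity_row):
    """Index of the highest-scoring document for one query."""
    return max(range(len(similarity_row)), key=lambda j: similarity_row[j])


# Similarities printed by the cell above (rows: queries, columns: documents)
cos_sim = [[0.8209, 0.5430], [0.6533, 0.7960]]
for query_index, row in enumerate(cos_sim):
    print(f"query {query_index} -> document {best_document(row)}")
```

Both queries retrieve the matching document, although the margin between the correct and the incorrect document is modest.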

## Evaluation

The Massive Text Embedding Benchmark (MTEB) is an evaluation framework that measures the quality of text embeddings across diverse tasks, datasets, and languages. It covers 8 task types, 58 datasets, and 112 languages.

The task types include clustering, reranking, classification, semantic textual similarity (STS), and retrieval, so an embedding model is assessed from several angles.
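
For the STS tasks used below, a model is typically scored by the Spearman rank correlation between the cosine similarity of each sentence pair and the human similarity rating. The following plain-Python sketch shows that metric; the function names and the toy scores are hypothetical, and MTEB itself relies on library implementations rather than this code.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


predicted = [0.91, 0.35, 0.62, 0.10]  # hypothetical cosine similarities
gold = [4.8, 2.0, 3.1, 0.5]           # hypothetical human ratings
print(round(spearman(predicted, gold), 4))
```

Because only the ordering matters, the embeddings do not need to reproduce the rating scale, only to rank the pairs the way annotators did.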

In [18]:
!pip install mteb==1.14.10

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting mteb==1.14.10
  Downloading mteb-1.14.10-py3-none-any.whl.metadata (26 kB)
Downloading mteb-1.14.10-py3-none-any.whl (1.1 MB)
Installing collected packages: mteb
  Attempting uninstall: mteb
    Found existing installation: mteb 1.21.1
    Uninstalling mteb-1.21.1:
      Successfully uninstalled mteb-1.21.1
Successfully installed mteb-1.14.10

Evaluation on all the STS tasks:

In [20]:
for t in ["STS16","STS13","STS14","STS15","STS17","STS22","STS12","BIOSSES","STSBenchmark","SICK-R"]:
    !python llm2vec/experiments/mteb_eval_custom.py --base_model_name_or_path Qwen/Qwen2-0.5B-Instruct --peft_model_name_or_path output/mntp-simcse/Qwen2-0.5B/checkpoint-1000/ \
    --task_name {t} \
    --task_to_instructions_fp llm2vec/test_configs/mteb/task_to_instructions.json \
    --output_dir results

─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS16, s2s

Batches: 100%|██████████████████████████████████| 38/38 [00:03<00:00, 10.75it/s]
Batches: 100%|██████████████████████████████████| 38/38 [00:02<00:00, 13.12it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS13, s2s

Batches: 100%|██████████████████████████████████| 47/47 [00:04<00:00, 11.29it/s]
Batches: 100%|██████████████████████████████████| 47/47 [00:03<00:00, 13.83it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS14, s2s

Batches: 100%|████████████████████████████████| 118/118 [00:09<00:00, 12.40it/s]
Batches: 100%|████████████████████████████████| 118/118 [00:08<00:00, 13.25it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS15, s2s

Batches: 100%|██████████████████████████████████| 94/94 [00:07<00:00, 12.13it/s]
Batches: 100%|██████████████████████████████████| 94/94 [00:07<00:00, 13.17it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS17, s2s, multilingual 11 / 11 Subsets

Batches: 100%|██████████████████████████████████| 89/89 [00:07<00:00, 11.34it/s]
Batches: 100%|██████████████████████████████████| 89/89 [00:07<00:00, 12.39it/s]
Batches: 100%|████████████████████████████████████| 8/8 [00:00<00:00, 12.93it/s]
Batches: 100%|████████████████████████████████████| 8/8 [00:00<00:00, 12.96it/s]
Batches: 100%|████████████████████████████████████| 8/8 [00:00<00:00, 13.07it/s]
B

─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS22, p2p, multilingual 18 / 18 Subsets

Batches: 100%|██████████████████████████████████| 17/17 [00:04<00:00,  3.73it/s]
Batches: 100%|██████████████████████████████████| 17/17 [00:03<00:00,  4.72it/s]
Batches: 100%|████████████████████████████████████| 6/6 [00:01<00:00,  4.03it/s]
Batches: 100%|████████████████████████████████████| 6/6 [00:01<00:00,  4.14it/s]
Batches: 100%|████████████████████████████████████| 7/7 [00:01<00:00,  4.50it/s]
B

─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STS12, s2s

Batches: 100%|██████████████████████████████████| 98/98 [00:08<00:00, 11.95it/s]
Batches: 100%|██████████████████████████████████| 98/98 [00:07<00:00, 12.87it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - BIOSSES, s2s

Batches: 100%|████████████████████████████████████| 4/4 [00:01<00:00,  3.98it/s]
Batches: 100%|████████████████████████████████████| 4/4 [00:00<00:00, 10.71it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - STSBenchmark, s2s

Batches: 100%|██████████████████████████████████| 44/44 [00:03<00:00, 11.08it/s]
Batches: 100%|██████████████████████████████████| 44/44 [00:03<00:00, 13.16it/s]


─────────────────────────────── Selected tasks ────────────────────────────────
STS
    - SICK-R, s2s

Batches: 100%|████████████████████████████████| 311/311 [00:24<00:00, 12.91it/s]
Batches: 100%|████████████████████████████████| 311/311 [00:23<00:00, 13.36it/s]


## Attach the adapter to the base model from the HF Hub

In [21]:
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel


In [22]:
# Load the base Qwen2 model and its tokenizer from the HF Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", padding_side='left' 
)
config = AutoConfig.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

In [23]:
# Attach the trained LoRA adapter (the MNTP + SimCSE checkpoint) to the base model.
model = PeftModel.from_pretrained(
    model,
    "output/mntp-simcse/Qwen2-0.5B/checkpoint-1000/",
)

In [24]:
# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)
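
The `pooling_mode="mean"` option above turns the per-token hidden states into a single sentence vector by averaging them. Here is a minimal plain-Python sketch of mean pooling that skips padding positions, which with `padding_side='left'` sit at the start of the sequence. The helper name and the toy values are hypothetical, and LLM2Vec's own pooling may apply further rules, for example for instruction tokens.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the vectors of the tokens whose mask value is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


# One left-padded sequence: the first position is padding and must be ignored
hidden_states = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]
attention_mask = [0, 1, 1]
print(mean_pool(hidden_states, attention_mask))
```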


In [25]:
# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

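print(cos_sim)
# Note: the string below is a reference result kept from the example this cell was adapted from.
# It is not this model's output; the tensor printed by the line above is.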
"""
tensor([[0.6266, 0.4199],
        [0.3429, 0.5240]])
"""

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

tensor([[0.8217, 0.5439],
        [0.6554, 0.7964]])


'\ntensor([[0.6266, 0.4199],\n        [0.3429, 0.5240]])\n'

## Save to the Hugging Face Hub

Now, we save the adapter and the tokenizer to the Hugging Face Hub.

In [26]:
model.push_to_hub("edumunozsala/Qwen2-0.5B-mntp-simcse")
tokenizer.push_to_hub("edumunozsala/Qwen2-0.5B-mntp-simcse")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/35.2M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/edumunozsala/Qwen2-0.5B-mntp-simcse/commit/6b2e3012e8a72923a15ff33306b7bcc8c3719f12', commit_message='Upload model', commit_description='', oid='6b2e3012e8a72923a15ff33306b7bcc8c3719f12', pr_url=None, repo_url=RepoUrl('https://huggingface.co/edumunozsala/Qwen2-0.5B-mntp-simcse', endpoint='https://huggingface.co', repo_type='model', repo_id='edumunozsala/Qwen2-0.5B-mntp-simcse'), pr_revision=None, pr_num=None)

## Test the model saved on the HF Hub

In [31]:
# Load the base Qwen2 model and its tokenizer from the HF Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", padding_side='left' 
)
config = AutoConfig.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "Qwen/Qwen2-0.5B-Instruct",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda" if torch.cuda.is_available() else "cpu",
)

In [32]:
# Attach the LoRA adapter downloaded from the HF Hub to the base model.
model = PeftModel.from_pretrained(
    model,
    "edumunozsala/Qwen2-0.5B-mntp-simcse",
)

In [33]:
# Wrapper for encoding and pooling operations
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=512)


In [34]:
# Encoding queries using instructions
instruction = (
    "Given a web search query, retrieve relevant passages that answer the query:"
)
queries = [
    [instruction, "how much protein should a female eat"],
    [instruction, "summit define"],
]
q_reps = l2v.encode(queries)

# Encoding documents. Instructions are not required for documents
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1  the highest point of a mountain : the top of a mountain. : 2  the highest level. : 3  a meeting or series of meetings between the leaders of two or more governments.",
]
d_reps = l2v.encode(documents)

# Compute cosine similarity
q_reps_norm = torch.nn.functional.normalize(q_reps, p=2, dim=1)
d_reps_norm = torch.nn.functional.normalize(d_reps, p=2, dim=1)
cos_sim = torch.mm(q_reps_norm, d_reps_norm.transpose(0, 1))

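print(cos_sim)
# Note: the string below is a reference result kept from the example this cell was adapted from.
# It is not this model's output; the tensor printed by the line above is.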
"""
tensor([[0.6266, 0.4199],
        [0.3429, 0.5240]])
"""

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

tensor([[0.8217, 0.5439],
        [0.6554, 0.7964]])


'\ntensor([[0.6266, 0.4199],\n        [0.3429, 0.5240]])\n'