<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/vec2text_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Inverting Embeddings to Text
- GitHub repository: https://github.com/jxmorris12/vec2text
- arXiv paper: https://arxiv.org/abs/2310.06816


## Synopsis
Current NLP techniques rely heavily on text embeddings for similarity computation. A piece of text is encoded into a vector of numerical values called an embedding. This raises a natural question: can a text embedding be decoded, or inverted, back into the original text?

In this study, the authors explore solutions to the problem of inverting embeddings. A basic method that decodes text directly from the embedding was not very successful. However, by refining the technique so that it corrects its guess step by step and re-embeds the text after each correction, they obtained impressive results. With this improved method, they were able to reconstruct 92% of 32-token texts exactly.
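
The correct-and-re-embed loop described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: `toy_embed` is a hypothetical bag-of-words counter standing in for a real embedding model, and `correct_step` is a greedy single-word swap, not the trained corrector model used by the authors. The only thing it is meant to show is the control flow: propose a revision, embed it again, and keep it only if it lands closer to the target embedding.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real embedder has no such restriction.
VOCAB = ["the", "cat", "dog", "sat", "ran", "on", "mat", "road"]

def toy_embed(text):
    """Stand-in embedder (assumption): word counts over VOCAB."""
    counts = Counter(text.split())
    return [float(counts[w]) for w in VOCAB]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correct_step(hypothesis, target):
    """Try every single-word substitution and return the closest candidate."""
    words = hypothesis.split()
    best, best_score = hypothesis, cosine(toy_embed(hypothesis), target)
    for i in range(len(words)):
        for w in VOCAB:
            candidate = " ".join(words[:i] + [w] + words[i + 1:])
            score = cosine(toy_embed(candidate), target)  # re-embed the guess
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score

def invert(target, initial, num_steps=10):
    """Repeat the correction step until it stops improving or steps run out."""
    hypothesis = initial
    score = cosine(toy_embed(hypothesis), target)
    for _ in range(num_steps):
        revised, new_score = correct_step(hypothesis, target)
        if new_score <= score:
            break  # no revision moved closer to the target embedding
        hypothesis, score = revised, new_score
    return hypothesis, score

target = toy_embed("the cat sat on the mat")
recovered, similarity = invert(target, initial="the dog ran on the road")
print(recovered, round(similarity, 3))
```

On this toy input the loop reaches the target sentence after a few revisions. A bag-of-words embedding ignores word order, so the toy says nothing about how well order can be recovered in general.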

Most importantly, when they applied the technique to more advanced embedding models, they found that it could even recover personal details, such as full names, from clinical notes. This raises privacy concerns about how such embeddings are stored and shared, and it underscores the importance of securing them.

In [1]:
!pip install -q accelerate transformers sentence_transformers

In [2]:
!pip install -q openai

In [3]:
!pip install -q vec2text

In [4]:
import openai

In [7]:
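# Path to a plain-text file that holds the OpenAI API key (placeholder; replace with your own path).
openai_key_path = "OPENAI_KEY_PATH"

with open(openai_key_path, "r") as f:
    # strip() drops the trailing newline that readline() would otherwise leave in the key.
    openai_key = f.readline().strip()

openai.api_key = openai_key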

In [9]:
import vec2text

corrector = vec2text.load_corrector("text-embedding-ada-002")

In [10]:
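# Adjacent string literals are concatenated, which keeps the source indentation out of the text
# (a backslash continuation inside the quotes would embed the leading spaces).
texts = [
    "Emily Thompson, 28, resides at 123 Maple Street, Toronto, with her cat Whiskers. "
    "Her email is emilyT@example.com and her phone number is (416) 555-1234.",
    "Carlos Rodriguez, 35, works at TechFlow Corp, 456 Pine Avenue, San Francisco. "
    "His birthday is on July 8th, and his passport number is AB1234567.",
]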

In [11]:
vec2text.invert_strings(
    texts,
    corrector=corrector
)

['Emily Thompson lives in Toronto, Ontario. Her email address is emily@whittingham.com, and her phone number is (613) 224-2828, ext. 305.',
 'Pedro Rodriguez, a technologist, works at CF Technologies in San Francisco. His birthdate is July 28, 1956. His passport is CF-2828.']

In [12]:
vec2text.invert_strings(
    texts,
    corrector=corrector,
    num_steps=20
)

['Emily Thompson, 28, resides at 123 Maple Street in Toronto, with her cat Whispers. Her email address is emmy.thomson@gmail.com and her phone number is (604) 559-5656.',
 'Carlos Rodriguez, 35, works at TechFlow Corporation on Pine Street in San Francisco. His birthday is July 8, and his passport number is AB124868.']

In [13]:
vec2text.invert_strings(
    texts,
    corrector=corrector,
    num_steps=20,
    sequence_beam_width=4
)

['Emily Thompson, 28, resides at 123 Maple Street in Toronto, with her pet cat Whiskers. Her email address is emily.whiskers@example.com and her phone number is (416) 443-4444.',
 'Carlos Rodriguez, 35, works at TechFlow, Inc. on 123 Pine Street in San Francisco. His birthday is July 8th, and his passport number is AB44778.']
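
The three runs above are easier to compare with a small scoring helper. The sketch below is one possible way to do that, not something provided by vec2text: `similarity` and `recovered_fields` are hypothetical helper names, the character-level ratio comes from the standard-library `difflib`, and the `fields` dictionary is filled in by hand from the first input text. The reconstruction string is the first output of the `num_steps=20, sequence_beam_width=4` run.

```python
from difflib import SequenceMatcher

def similarity(original, reconstruction):
    """Character-level similarity ratio in [0, 1]; 1.0 means an exact match."""
    return SequenceMatcher(None, original, reconstruction).ratio()

def recovered_fields(fields, reconstruction):
    """For each sensitive field, report whether it appears verbatim in the reconstruction."""
    return {name: value in reconstruction for name, value in fields.items()}

original = (
    "Emily Thompson, 28, resides at 123 Maple Street, Toronto, with her cat Whiskers. "
    "Her email is emilyT@example.com and her phone number is (416) 555-1234."
)
reconstruction = (
    "Emily Thompson, 28, resides at 123 Maple Street in Toronto, with her pet cat Whiskers. "
    "Her email address is emily.whiskers@example.com and her phone number is (416) 443-4444."
)

# Hand-picked sensitive values from the original text (assumption for illustration).
fields = {
    "name": "Emily Thompson",
    "street": "123 Maple Street",
    "email": "emilyT@example.com",
    "phone": "(416) 555-1234",
}

report = recovered_fields(fields, reconstruction)
print(f"similarity={similarity(original, reconstruction):.2f}", report)
```

For this pair the name and street address come back verbatim, while the email address and phone number do not, which matches what can be read off the outputs by eye.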

In [14]:
import math
import torch

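def get_embeddings_openai(text_list, model="text-embedding-ada-002") -> torch.Tensor:
    """Embed `text_list` with the OpenAI API, sending at most 128 texts per request."""
    # Note: openai.Embedding.create is the interface of openai-python releases before 1.0.
    batch_size = 128
    batches = math.ceil(len(text_list) / batch_size)
    outputs = []
    for batch in range(batches):
        text_list_batch = text_list[batch * batch_size : (batch + 1) * batch_size]
        response = openai.Embedding.create(
            input=text_list_batch,
            model=model,
            encoding_format="float",  # override default base64 encoding...
        )
        outputs.extend([e["embedding"] for e in response["data"]])
    # One row per input text.
    return torch.tensor(outputs)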


embeddings = get_embeddings_openai(texts)

vec2text.invert_embeddings(
    embeddings=embeddings.cuda(),
    corrector=corrector
)

['Emily Thompson, whose email address is emma@whitneysmokers.ca, lives in Toronto. Her email is at emma@whitneysmokers.ca,  28 Maple Street, Toronto, ON, 08001.',
 'Rafael Rodriguez, a flow cc, works at Techcrunch. His passport is PHONE # 78. His birthday is June 30, 1988.']
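
The batching inside `get_embeddings_openai` splits the input into slices of at most 128 texts. As a minimal sketch of the same chunking in isolation, the generator below (the name `batched` is a hypothetical helper, not part of vec2text or the OpenAI client) produces the slices without computing the batch count up front.

```python
def batched(items, batch_size=128):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 300 dummy items split into two full batches and one partial batch.
sizes = [len(batch) for batch in batched(list(range(300)))]
print(sizes)  # [128, 128, 44]
```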

In [15]:
vec2text.invert_embeddings(
    embeddings=embeddings.mean(dim=0, keepdim=True).cuda(),
    corrector=corrector
)

['Terry Thompson, whose name is Emily Carr, lives in El Capitan, CA. Her email address is: e-mail@email.com, fax: 977-834-3636.']

In [16]:
embeddings.shape

torch.Size([2, 1536])

In [17]:
import numpy as np

# Interpolate between the two embeddings: alpha=0.0 is the first text, alpha=1.0 the second.
for alpha in np.arange(0.0, 1.1, 0.1):
    mixed_embedding = torch.lerp(input=embeddings[0], end=embeddings[1], weight=alpha)
    text = vec2text.invert_embeddings(
        embeddings=mixed_embedding[None].cuda(),  # [None] adds the batch dimension
        corrector=corrector,
        # num_steps=20,
        # sequence_beam_width=4,
    )[0]
    print(f'alpha={alpha:.1f}\t', text)

alpha=0.0	 Emily Thompson, whose email address is emma@whitneysmokers.ca, lives in Toronto. Her email is at emma@whitneysmokers.ca,  28 Maple Street, Toronto, ON, 08001.
alpha=0.1	 Emily Thompson, whose name is Emily Swinburne, lives in Toronto. Her email is: emily@trentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrentrent.
alpha=0.2	 Emily Thompson, whose name is Emily Thompson, lives in Toronto, Ontario. Her email is e-mail@twitter.com.  28th Street, Suite 305, Toronto, ON, 97201.
alpha=0.3	 Emily Thompson, whose name is Emily Thompson, lives at 333 Maple Street, Toronto, Ontario, Canada. Her email is e-mail@twitter.com.
alpha=0.4	 Emily Thompson, whose name is S
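
The interpolation in the loop above follows the formula that `torch.lerp` documents, `input + weight * (end - input)`. The plain-Python sketch below reproduces that formula on small made-up vectors (the values of `a` and `b` are arbitrary, and the `lerp` helper is written here only for illustration) to show that a weight of 0.5 gives the same result as averaging the two vectors, which is what the earlier `embeddings.mean(dim=0, keepdim=True)` cell computes for two inputs.

```python
def lerp(start, end, weight):
    """Element-wise start + weight * (end - start)."""
    return [s + weight * (e - s) for s, e in zip(start, end)]

# Arbitrary example vectors standing in for two embeddings.
a = [1.0, 0.0, 2.0]
b = [3.0, 4.0, 0.0]

midpoint = lerp(a, b, 0.5)
mean = [(x + y) / 2 for x, y in zip(a, b)]

print(lerp(a, b, 0.0))  # the first vector
print(lerp(a, b, 1.0))  # the second vector
print(midpoint, mean)   # identical at weight 0.5
```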

In [18]:
import numpy as np

for alpha in np.arange(0.0, 1.0, 0.1):
    mixed_embedding = torch.lerp(input=embeddings[0], end=embeddings[1], weight=alpha)
    text = vec2text.invert_embeddings(
        embeddings=mixed_embedding[None].cuda(),
        corrector=corrector,
        num_steps=20,
        sequence_beam_width=4,
    )[0]
    print(f'alpha={alpha:.1f}\t', text)

alpha=0.0	 Emily Thompson, 28, resides at 123 Maple Street, Toronto, with her cat Whiskers.  her email address is emmy.whiskers@gmail.com, and her phone number is (519) 548-4444.
alpha=0.1	 Emily Thompson, 28, resides at 123 Maple Street, Toronto, with her cat Whiskers.  her email address is emily@example.com and her phone number is (416) 123-4545.
alpha=0.2	 Emily Thompson, 28, resides at 123 Maple Street in Toronto, with her cat, Whiskers.  Her phone number is (416) 548-4444, and her email is emily.work@gmail.com.
alpha=0.3	 Emily Thompson, 28, lives at 123 Maple Street, in Toronto.  Her email address is emily.whiskers@catfish.com, and her Passport Number is (401) 742-7880.
alpha=0.4	 Emily Thompson, 28, lives at 123 Rodr­guez Street, in Toronto, and is employed by CatFlowers, Inc.  her phone number is (516) 844-5000  her e-mail address is emily.tech@gmail.com.
alpha=0.5	 Christopher Thompson, 28, lives at 123 Elm Street, in San Francisco, and works for FlowRiders, Inc.  July 4, 2015