
Text Embeddings Reveal (Almost) As Much As Text #56

Open
YeonwooSung opened this issue Dec 17, 2023 · 0 comments


paper, code

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when reembedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.

Personal Thoughts

Recovering the original text from its embedding can be seen as an AI-specific vulnerability, one that could cause unintended privacy leakage.

Screenshot 2023-12-17 at 9.49.25 PM

The paper states that Vec2Text's recovery rate can be reduced by adding Gaussian noise directly to each embedding.
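A minimal sketch of this kind of defense, assuming the noise is standard Gaussian and scaled by a single factor. The function name `add_gaussian_noise` and the parameter `noise_level` are hypothetical names chosen for illustration, not taken from the paper's code.

```python
import numpy as np


def add_gaussian_noise(embeddings, noise_level, seed=None):
    """Return a copy of `embeddings` with scaled standard Gaussian noise added.

    `noise_level` is a hypothetical name for the scaling factor; 0.0 leaves
    the embeddings unchanged.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    # One independent N(0, 1) sample per embedding coordinate.
    noise = rng.standard_normal(embeddings.shape)
    return embeddings + noise_level * noise


# Usage: perturb a small batch of placeholder embeddings before storing them.
batch = np.ones((2, 8))
noisy_batch = add_gaussian_noise(batch, noise_level=0.01, seed=0)
print(noisy_batch.shape)
```

Whether to re-normalize the noisy vectors afterwards depends on the similarity metric the retrieval system uses, so that step is left out here.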

As the chart above shows, there is a noise level at which the gap between retrieval performance and the Vec2Text recovery rate is largest: retrieval performance is roughly preserved while the recovery rate drops drastically.
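To get a feel for this kind of trade-off, here is a toy sweep over noise levels on synthetic data. It is only a sketch under loud assumptions: the "documents" are random unit vectors, each query is a slightly perturbed copy of its document, and the cosine similarity between clean and noisy embeddings is used as a crude stand-in for how much signal an inverter would still see. It does not run Vec2Text or reproduce the paper's numbers, and `sweep_noise_levels` is a hypothetical helper.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def sweep_noise_levels(noise_levels, n_docs=200, dim=64, seed=0):
    """Return (noise_level, top1_accuracy, mean_cosine) for each noise level."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-ins for document embeddings (assumption: random vectors).
    docs = normalize(rng.standard_normal((n_docs, dim)))
    # Each query is a slightly perturbed copy of the document it should find.
    queries = normalize(docs + 0.1 * rng.standard_normal((n_docs, dim)))
    targets = np.arange(n_docs)

    results = []
    for level in noise_levels:
        # Store noisy document embeddings instead of the clean ones.
        noisy = normalize(docs + level * rng.standard_normal(docs.shape))
        # Retrieval: does each query still rank its own document first?
        scores = queries @ noisy.T
        top1 = float(np.mean(np.argmax(scores, axis=1) == targets))
        # Crude proxy for what an inverter sees: clean-vs-noisy cosine similarity.
        mean_cosine = float(np.mean(np.sum(docs * noisy, axis=1)))
        results.append((level, top1, mean_cosine))
    return results


for level, top1, cos in sweep_noise_levels([0.0, 0.01, 0.1, 1.0]):
    print(f"noise={level:<5} top1={top1:.3f} mean_cosine={cos:.3f}")
```

On real data the two curves would be measured with an actual retrieval benchmark and an actual inversion model; the toy only shows how such a sweep can be structured.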

This reminds me that adding the "proper" amount of noise to embeddings can improve AI-based systems!
