Embeddings are occasionally wrong #157

Open
EliahKagan opened this issue Jun 13, 2023 · 0 comments

EliahKagan commented Jun 13, 2023

Over the last week or two, I've been seeing curious test failures like the one here from time to time:

======================================================================
FAIL: test_en_and_es_sentence_are_very_similar_1_dogwalk (tests.test_embed.TestEmbedOne)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/Library/Caches/pypoetry/virtualenvs/embeddingscratchwork-mVBGSMi1-py3.10/lib/python3.10/site-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
  File "/Users/runner/work/EmbeddingScratchwork/EmbeddingScratchwork/tests/_bases.py", line 113, in test_en_and_es_sentence_are_very_similar
    self.assertGreaterEqual(result, 0.9)
AssertionError: -0.0025300044 not greater than or equal to 0.9

We determined our 0.9 threshold experimentally, and I would not call embeddings wrong merely for falling short of it. But the dot product shown there is a small negative number; even for conceptually unrelated text, we rarely see values below 0.6.
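For context, the failing assertion boils down to something like the following sketch. This is not the project's actual test code: the `embed_one` helper name and the example sentences are assumptions for illustration, and it uses the pre-1.0 `openai` Python library API (with the API key taken from the `OPENAI_API_KEY` environment variable).

```python
import numpy as np
import openai

def embed_one(text):
    """Hypothetical helper: embed a single text with text-embedding-ada-002."""
    response = openai.Embedding.create(input=text, model="text-embedding-ada-002")
    return np.array(response["data"][0]["embedding"], dtype=np.float32)

# Roughly what the failing test checks: embeddings of an English sentence and
# its Spanish translation should be very similar (dot product >= 0.9).
en = embed_one("The dog is walked every morning.")
es = embed_one("El perro es paseado cada mañana.")
result = float(np.dot(en, es))
assert result >= 0.9, f"{result} not greater than or equal to 0.9"
```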

Some embedding models (e.g., some of the SBERT models) regularly return embeddings whose similarities to one another are negative. That can be meaningful, indicating texts that are less related than average or that have opposite meanings. But text-embedding-ada-002 is not such a model. The cosine similarities we are accustomed to getting from it suggest that the vectors are distributed such that the angle between two vectors is acute in the great majority of cases, with a mean angle somewhere around $\pi / 4$.
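To make the geometry concrete, here's a small sketch converting a dot product of unit vectors into an angle (the specific similarity values below are just illustrative, not measurements):

```python
import math

def angle_from_similarity(similarity):
    """Angle in radians between two unit vectors, given their dot product (= cosine similarity)."""
    return math.acos(max(-1.0, min(1.0, similarity)))

print(angle_from_similarity(0.9))            # ~0.45 rad (~26 deg): typical for these test pairs
print(angle_from_similarity(0.7071))         # ~pi/4 (~45 deg): around the model's usual angle
print(angle_from_similarity(-0.0025300044))  # ~pi/2 (~90 deg): the failing value, essentially orthogonal
```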

Furthermore, we usually get values >0.9 for the comparisons in those particular tests. Yet sometimes, even in other jobs of the same workflow runs, we get small-magnitude numbers (often small negative numbers as shown above, but sometimes small positive numbers). As far as I've noticed, this doesn't happen more often on any particular Python version or platform. However, I think we should look into that specifically, in case there are versions/platforms where it happens much more often, or where it doesn't happen at all.

It seems to me that this is caused by (a) a bug in our code, (b) a bug in OpenAI's code or infrastructure, or (c) bugs in both that combine to cause it. A cursory web search has not revealed anything, but I could certainly search more broadly and deeply.

The frequency of this problem seems to fluctuate between roughly once every couple of days and a failed job in more than half of the test workflow's runs in a single day. However, this characterization is very unscientific, and it is affected by how often I am paying attention, how often I work on the project in ways that cause lots of CI runs, and other factors.

Before this started, I noticed occasional HTTP 500 errors with what seemed like similar frequency. While those were happening, we never got these failures in the similarity tests. Now the HTTP 500 errors seem to be gone completely, but we have this, so maybe there is some kind of connection; I am not sure. Although that suggests a problem on OpenAI's end could contribute, I haven't seen anything on https://status.openai.com/ that seems related, and I haven't heard of anybody else having problems with the model.

If we can detect when the embeddings we're getting are wrong, then this may not be a problem, whether or not we figure out exactly what's causing it and whether or not it gets fixed. My hope is that, when this happens, it's happening due to at least one of the embeddings having a low magnitude, because that's a very obvious and easily detected way for it to be wrong. The embeddings from text-embedding-ada-002 are normalized: each vector we get should have a magnitude (i.e., Euclidean norm, i.e., length) of 1, give or take a small amount of rounding error. (Vectors do not have to have a low magnitude for their dot product to be small, of course. This is just a hope/hunch.)
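If that hunch holds, a check along these lines would flag the bad responses as they arrive. This is a minimal sketch, not the project's code: the function name and the tolerance are assumptions.

```python
import numpy as np

def check_norm(embedding, tolerance=1e-3):
    """Return the Euclidean norm of an embedding, raising if it isn't ~1
    as expected from text-embedding-ada-002."""
    norm = float(np.linalg.norm(embedding))
    if abs(norm - 1.0) > tolerance:
        raise ValueError(f"embedding norm {norm} is not close to 1")
    return norm
```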

I've opened #158 to add norm tests, which would probably be a good thing to have anyway. Maybe this will shed light on the matter.

If we don't gain insight into the circumstances under which this happens and we don't find a way to detect it reliably, it may block the way forward on the usc branch (#136), since that branch will eventually involve embedding a high volume of text (divided into a large number of pieces), where occasional incorrect results would be hard to detect and redoing the embeddings repeatedly would be difficult. But I'm not extremely worried about that at this point, and I think it's still reasonable to continue work on that branch.

EliahKagan added a commit to EliahKagan/EmbeddingScratchwork that referenced this issue Jun 13, 2023
This is motivated by the goal of investigating issue dmvassallo#157, but I
think it's valuable to have such tests anyway.
EliahKagan added a commit that referenced this issue Jun 13, 2023
This is motivated by the goal of investigating issue #157, but I
think it's valuable to have such tests anyway.