This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Order dependence in ElmoEmbedder? #1169

Closed
ngoodman opened this issue May 2, 2018 · 6 comments

Comments

@ngoodman

ngoodman commented May 2, 2018

Apologies for opening an issue with what is likely a conceptual misunderstanding on my part!

I'm playing around with the pre-trained ELMo embeddings (which are cool, thanks!) and noticing that the embedder seems to be stateful. That is, if I embed the same sentence twice, I don't get the same result:

import numpy as np
import scipy as sp
import scipy.spatial.distance  # scipy does not import submodules automatically
from allennlp.commands.elmo import ElmoEmbedder

ee = ElmoEmbedder()

# embed_sentence returns a (num_layers, num_tokens, dim) array;
# [2, 0, :] selects the top layer's vector for the first token
v1 = np.squeeze(ee.embed_sentence("hello my name is mud .".split())[2, 0, :])
v2 = np.squeeze(ee.embed_sentence("hello my name is mud .".split())[2, 0, :])

print("embed twice test: ", sp.spatial.distance.cosine(v1, v2))

This gives a cosine distance of about 0.02: not huge, but problematic for the same sentence!

Where does the statefulness come from? Am I misusing the embedder?

@schmmd
Member

schmmd commented May 2, 2018

Hi @ngoodman, that's expected behavior. ELMo has internal state and adapts to your domain over time. We've been thinking about how to make the output more consistent, as this is unexpected behavior for our users.

@ngoodman
Author

ngoodman commented May 2, 2018

Oh, I see! So this is giving me the embedding in the context of the "corpus" of sentences I've asked it to embed so far?

Testing my understanding, I tried the above test using two separate instances of ElmoEmbedder, and indeed got identical embeddings. This seems infeasible in practice for lots of sentences, though, because constructing an ElmoEmbedder() takes a long time... Any workaround?

It's certainly true that this was unexpected for me, but a few small changes would have clued me in, e.g. if the call had been ee.embed_next_sentence(...) and/or there were a note in the docs.

At any rate, thanks for the super fast response, and the nice open software!

@schmmd
Member

schmmd commented May 2, 2018

@ngoodman here's an attempt at improving the docs. I'll also see what I can do about a more comprehensive solution. #1169

@schmmd schmmd closed this as completed May 2, 2018
@matt-peters
Contributor

@ngoodman - I added a longer description of the statefulness to this PR: #1167

The TL;DR is that the stateful aspect is a consequence of how the biLM was originally trained.

Except for the first few batches, the predictions won't vary much from batch to batch, assuming you are using the same ElmoEmbedder instance. The recommended usage is to load a single ElmoEmbedder instance (it holds the internal states) and send all batches through it. If you are concerned about the non-determinism, run a batch or two through it first to "warm up" the states, then start making predictions on your data.

For example, modifying your code to run the same sentence multiple times, the embeddings are constant after the first batch:

distances = []
v1 = np.squeeze(ee.embed_sentence("hello my name is mud .".split())[2, 0, :])
for k in range(5):
    v2 = np.squeeze(ee.embed_sentence("hello my name is mud .".split())[2, 0, :])
    distances.append(sp.spatial.distance.cosine(v1, v2))
    v1 = v2

print(distances)

Displays [0.02286398410797119, 3.7550926208496094e-06, 0.0, 5.960464477539063e-08, 0.0].
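The convergence matt-peters describes can be mimicked with a self-contained toy (a hypothetical StatefulCell below, not the actual biLM): because the hidden state persists across calls and each output depends on it, feeding the same input repeatedly drives the state toward a fixed point, so only the first call or two differ noticeably.

```python
import numpy as np

# Toy illustration of a stateful recurrent cell, analogous to (but much
# simpler than) the persistent LSTM states inside ElmoEmbedder.
class StatefulCell:
    def __init__(self, size=4, seed=0):
        rng = np.random.RandomState(seed)
        self.W = rng.randn(size, size) * 0.1  # small recurrent weights
        self.h = np.zeros(size)               # persistent hidden state

    def forward(self, x):
        # The new hidden state depends on the previous one, so repeated
        # identical inputs give different outputs until h converges.
        self.h = np.tanh(self.W @ self.h + x)
        return self.h.copy()

cell = StatefulCell()
x = np.ones(4)
outs = [cell.forward(x) for _ in range(5)]
diffs = [float(np.linalg.norm(outs[i + 1] - outs[i])) for i in range(4)]
print(diffs)  # differences shrink toward zero after the first call
```

This mirrors the pattern in the distances above: a noticeable gap on the first pass, then near-identical outputs once the state has "warmed up".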

@zpaines

zpaines commented Oct 24, 2018

@matt-peters
Is the statefulness simply a consequence of calling _get_initial_states in encoder_base.py#sort_and_run_forward?

These states simply represent the memory and output for each timestep in the batch, correct? What does it mean that they "adapt to the domain"? Is there a human-understandable version of the information they are storing, or is it simply some weighted product of internal states?

@zpaines

zpaines commented Oct 26, 2018

@schmmd is there a human-understandable description of what this "context" describes? I think I understand what the LSTMs are doing: they essentially try to predict the next word in a given sentence given the previous (or, in the backward case, following) words. But what's not clear to me is how this adapts to the given domain (beyond just predicting 'x' after 'y' if it tends to appear that way in past inputs).
