
Working with the new ._.trf_data object (3.7+) #13137

Closed
ahalterman opened this issue Nov 17, 2023 · 7 comments · Fixed by #13164
Labels
docs (Documentation and website) · feat / transformer (Feature: Transformer)

Comments

@ahalterman

tl;dr: how do I access the transformer tensors in 3.7+?

I have a Python package that uses the spaCy transformer encodings for each token in classifier and similarity models.

In the pre-3.7 spaCy models, I could access the tensors using doc._.trf_data.tensors (full example). However, after the 3.7 update, this attribute no longer exists, and I'm not sure how to access the tensors now.

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

As expected, there's a doc._.trf_data object.

The documentation for the transformer component's assigned attributes says that doc._.trf_data is of type TransformerData. The docs for TransformerData say that the class has the following attributes:

  • tokens
  • model_output
  • tensors
  • align
  • width

However, dir(doc._.trf_data) shows the following attributes (methods omitted for space):

'all_hidden_layer_states', 'all_outputs', 'embedding_layer', 'last_hidden_layer_state', 'last_layer_only', 'num_outputs'

which doesn't include the expected tokens, model_output, etc.

When I call type(doc._.trf_data), I get <class 'spacy_curated_transformers.models.output.DocTransformerOutput'>, which doesn't seem to match the TransformerData type I expected from the docs.

Any help would be much appreciated, and apologies if I'm just not looking in the right place for the tensors.

Your Environment

  • spaCy version: 3.7.2
  • Platform: macOS-14.1.1-x86_64-i386-64bit
  • Python version: 3.9.9
  • Pipelines: en_coreference_web_trf (3.4.0a2), en_core_web_sm (3.7.1), en_core_web_lg (3.7.1), en_core_web_trf (3.7.3)
@danieldk
Contributor

spaCy 3.7 switched to the Curated Transformers library. The DocTransformerOutput class is documented here:

https://spacy.io/api/curatedtransformer#doctransformeroutput

The last_hidden_layer_state property provides the per-token hidden representations for every document.
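
A minimal sketch of accessing it (assuming the en_core_web_trf 3.7+ pipeline is installed; the Ragged attribute names here come from thinc and aren't spelled out in this thread):

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

# last_hidden_layer_state is a thinc Ragged: one row per wordpiece,
# grouped into spaCy tokens via the lengths array
hidden = doc._.trf_data.last_hidden_layer_state
print(type(hidden))        # expected: thinc.types.Ragged
print(hidden.data.shape)   # (total_pieces, hidden_width)
print(hidden.lengths)      # wordpiece count per spaCy token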

@adrianeboyd
Contributor

Ah, this probably should have been documented better as part of the release.

At first glance, the DocTransformerOutput seems to contain quite a bit less information than the spacy-transformers ModelOutput; in particular, I don't see enough info to align the tensors with anything in the doc. But maybe I am mistaken?

@adrianeboyd
Contributor

Ah, the Ragged lengths align to spacy tokens? (I admit that I hadn't looked too closely at the details here before, which is part of why this was missed in the release notes.)

@danieldk
Contributor

> Ah, the Ragged lengths align to spacy tokens? (I admit that I hadn't looked too closely at the details here before, which is part of why this was missed in the release notes.)

Yeah, they do. spacy-curated-transformers applies piecing to tokens, so it doesn't have to do the same alignment as spacy-transformers (modulo whitespace tokens).
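
As a rough sketch of what that alignment means in practice (my own illustration, assuming a doc with no whitespace tokens, per the caveat above), each entry in lengths is the wordpiece count for the corresponding spaCy token:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

hidden = doc._.trf_data.last_hidden_layer_state
# One lengths entry per spaCy token (this doc has no whitespace tokens)
assert len(hidden.lengths) == len(doc)
for token, n_pieces in zip(doc, hidden.lengths):
    print(token.text, int(n_pieces))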

@adrianeboyd
Contributor

So to get back to the original question, doc._.trf_data.last_hidden_layer_state is a Ragged object where you can use the spacy token index to access the tensor data for that token, without having to do any additional alignment on your side.

The data for each token is also a Ragged object:

import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("DocTransformerOutput.last_hidden_layer_state is a Ragged object")

# for the tensors corresponding to "DocTransformerOutput.last_hidden_layer_state"
# (token index 0), you can access doc._.trf_data.last_hidden_layer_state[0].data
assert doc._.trf_data.last_hidden_layer_state[0].data.shape == (12, 768)
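
And as a follow-up sketch (my own addition, not from the thread): if you want a single fixed-size vector per spaCy token, e.g. for the classifier/similarity use case above, one option is to mean-pool each token's wordpiece vectors:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("DocTransformerOutput.last_hidden_layer_state is a Ragged object")

hidden = doc._.trf_data.last_hidden_layer_state
# Mean-pool each token's wordpiece rows into one hidden_width vector
# (assumes no whitespace tokens, so there is one Ragged entry per token)
token_vectors = [hidden[i].data.mean(axis=0) for i in range(len(doc))]
assert token_vectors[0].shape == (768,)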

@ahalterman
Author

Great! That answers my question, and that's a very intuitive way to access the tensor by token index.
