
Working with the new ._.trf_data object (3.7+) #13137

Closed
ahalterman opened this issue Nov 17, 2023 · 7 comments · Fixed by #13164
Labels
docs (Documentation and website) · feat / transformer (Feature: Transformer)

Comments

@ahalterman

tl;dr: how do I access the transformer tensors in 3.7+?

I have a Python package that uses the spaCy transformer encodings for each token in classifier and similarity models.

In the pre-3.7 spaCy models, I could access the tensors using doc._.trf_data.tensors (full example). However, after the 3.7 update, this attribute no longer exists, and I'm not sure how to access the tensors now.

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

As expected, there's a doc._.trf_data object.

The documentation for the transformer component's assigned attributes says that doc._.trf_data is of type TransformerData. The docs for TransformerData say that the class has the following attributes:

  • tokens
  • model_output
  • tensors
  • align
  • width

However, dir(doc._.trf_data) shows the following attributes (methods omitted for space):

'all_hidden_layer_states', 'all_outputs', 'embedding_layer', 'last_hidden_layer_state', 'last_layer_only', 'num_outputs'

which doesn't include the expected tokens, model_output, etc.

When I call type(doc._.trf_data), I get <class 'spacy_curated_transformers.models.output.DocTransformerOutput'>, which doesn't seem to match the TransformerData type I expected from the docs.

Any help would be much appreciated, and apologies if I'm just not looking in the right place for the tensors.

Your Environment

  • spaCy version: 3.7.2
  • Platform: macOS-14.1.1-x86_64-i386-64bit
  • Python version: 3.9.9
  • Pipelines: en_coreference_web_trf (3.4.0a2), en_core_web_sm (3.7.1), en_core_web_lg (3.7.1), en_core_web_trf (3.7.3)
@danieldk
Contributor

spaCy 3.7 switched to the Curated Transformers library. The DocTransformerOutput class is documented here:

https://spacy.io/api/curatedtransformer#doctransformeroutput

The last_hidden_layer_state property provides the per-token hidden representations for every document.
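
A minimal sketch of accessing it (assuming the en_core_web_trf 3.7+ pipeline is installed; the Ragged attribute names here come from thinc and aren't spelled out in this thread):

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

# last_hidden_layer_state is a thinc Ragged: one row per wordpiece,
# grouped into spaCy tokens via the lengths array
hidden = doc._.trf_data.last_hidden_layer_state
print(type(hidden))        # expected: thinc.types.Ragged
print(hidden.data.shape)   # (total_pieces, hidden_width)
print(hidden.lengths)      # wordpiece count per spaCy token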

@adrianeboyd
Contributor

Ah, this probably should have been documented better as part of the release.

At first glance, the DocTransformerOutput seems to contain quite a bit less information than the spacy-transformers ModelOutput; in particular, I don't see enough info to align the tensors with anything in the doc. But maybe I am mistaken?

@adrianeboyd
Contributor

Ah, the Ragged lengths align to spacy tokens? (I admit that I hadn't looked too closely at the details here before, which is part of why this was missed in the release notes.)

@danieldk
Contributor

> Ah, the Ragged lengths align to spacy tokens? (I admit that I hadn't looked too closely at the details here before, which is part of why this was missed in the release notes.)

Yeah, they do. spacy-curated-transformers applies piecing to tokens, so it doesn't have to do the same alignment as spacy-transformers (modulo whitespace tokens).
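
As a rough sketch of what that alignment means in practice (my own illustration, assuming a doc with no whitespace tokens, per the caveat above), each entry in lengths is the wordpiece count for the corresponding spaCy token:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("We visited Alexanderplatz.")

hidden = doc._.trf_data.last_hidden_layer_state
# One lengths entry per spaCy token (this doc has no whitespace tokens)
assert len(hidden.lengths) == len(doc)
for token, n_pieces in zip(doc, hidden.lengths):
    print(token.text, int(n_pieces))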

@adrianeboyd
Contributor

So to get back to the original question, doc._.trf_data.last_hidden_layer_state is a Ragged object where you can use the spacy token index to access the tensor data for that token, without having to do any additional alignment on your side.

The data for each token is also a Ragged object:

import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("DocTransformerOutput.last_hidden_layer_state is a Ragged object")

# for the tensors corresponding to "DocTransformerOutput.last_hidden_layer_state"
# (token index 0), you can access doc._.trf_data.last_hidden_layer_state[0].data
assert doc._.trf_data.last_hidden_layer_state[0].data.shape == (12, 768)
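
And as a follow-up sketch (my own addition, not from the thread): if you want a single fixed-size vector per spaCy token, e.g. for the classifier/similarity use case above, one option is to mean-pool each token's wordpiece vectors:

import spacy

nlp = spacy.load("en_core_web_trf")
doc = nlp("DocTransformerOutput.last_hidden_layer_state is a Ragged object")

hidden = doc._.trf_data.last_hidden_layer_state
# Mean-pool each token's wordpiece rows into one hidden_width vector
# (assumes no whitespace tokens, so there is one Ragged entry per token)
token_vectors = [hidden[i].data.mean(axis=0) for i in range(len(doc))]
assert token_vectors[0].shape == (768,)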

@ahalterman
Author

Great! That answers my question, and that's a very intuitive way to access the tensor by token index.
