Update processing-pipelines.md to mention method for doc metadata (#7480)

* Update processing-pipelines.md

Under "things to try," inform users they can save metadata when using nlp.pipe(foobar, as_tuples=True)

Link to a new example on the attributes page detailing the following:

> ```
> data = [
>   ("Some text to process", {"meta": "foo"}),
>   ("And more text...", {"meta": "bar"})
> ]
> 
> for doc, context in nlp.pipe(data, as_tuples=True):
>     # Let's assume you have a "meta" extension registered on the Doc
>     doc._.meta = context["meta"]
> ```

from https://stackoverflow.com/questions/57058798/make-spacy-nlp-pipe-process-tuples-of-text-and-additional-information-to-add-as

* Updating the attributes section

Update the attributes section with example of how extensions can be used to store metadata.

* Update processing-pipelines.md

* Update processing-pipelines.md

Made as_tuples example executable and relocated to the end of the "Processing Text" section.

* Update processing-pipelines.md

* Update processing-pipelines.md

Removed extra line

* Reformat and rephrase

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
langdonholmes and adrianeboyd authored Apr 19, 2021
1 parent 0e7f94b commit df541c6
Showing 1 changed file with 33 additions and 0 deletions.
33 changes: 33 additions & 0 deletions website/docs/usage/processing-pipelines.md
@@ -91,6 +91,37 @@ have to call `list()` on it first:

</Infobox>

You can use the `as_tuples` option to pass additional context along with each
doc when using [`nlp.pipe`](/api/language#pipe). If `as_tuples` is `True`, then
the input should be a sequence of `(text, context)` tuples and the output will
be a sequence of `(doc, context)` tuples. For example, you can pass metadata in
the context and save it in a [custom attribute](#custom-components-attributes):

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc

if not Doc.has_extension("text_id"):
    Doc.set_extension("text_id", default=None)

text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"})
]

nlp = spacy.load("en_core_web_sm")
doc_tuples = nlp.pipe(text_tuples, as_tuples=True)

docs = []
for doc, context in doc_tuples:
    doc._.text_id = context["text_id"]
    docs.append(doc)

for doc in docs:
    print(f"{doc._.text_id}: {doc.text}")
```
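
To make the `(text, context)` in, `(doc, context)` out contract concrete without loading a model, here is a minimal pure-Python sketch. The helper `pipe_as_tuples` and its `process` argument are hypothetical stand-ins for `nlp.pipe(..., as_tuples=True)` and the pipeline itself; this is an illustration of the pairing behavior, not spaCy's actual implementation.

```python
from itertools import tee


def pipe_as_tuples(pairs, process):
    """Hypothetical stand-in for nlp.pipe(pairs, as_tuples=True).

    `process` plays the role of the pipeline: it turns one text into one
    processed result. Contexts are passed through untouched and re-paired
    with the results in the original order.
    """
    # Duplicate the input stream so texts and contexts can be read separately.
    pairs_for_texts, pairs_for_contexts = tee(pairs)
    texts = (text for text, _ in pairs_for_texts)
    contexts = (context for _, context in pairs_for_contexts)
    processed = (process(text) for text in texts)
    return zip(processed, contexts)


text_tuples = [
    ("This is the first text.", {"text_id": "text1"}),
    ("This is the second text.", {"text_id": "text2"}),
]

# str.upper is only a placeholder for real processing.
results = list(pipe_as_tuples(text_tuples, process=str.upper))

for processed_text, context in results:
    print(f"{context['text_id']}: {processed_text}")
```

The context can be any object, so a dictionary of metadata, a database ID, or a label all work the same way, as long as each input item is a two-element tuple.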

### Multiprocessing {#multiprocessing}

spaCy includes built-in support for multiprocessing with
@@ -1373,6 +1404,8 @@ There are three main types of extensions, which can be defined using the
[`Span.set_extension`](/api/span#set_extension) and
[`Token.set_extension`](/api/token#set_extension) methods.
1. **Attribute extensions.** Set a default value for an attribute, which can be
overwritten manually at any time. Attribute extensions work like "normal"
variables and are the quickest way to store arbitrary information on a `Doc`,