## Scaling and Performance

If you need to process large volumes of text in spaCy, use the nlp.pipe method. It processes the texts as a stream, batching them internally, and yields Doc objects. This is usually much faster than looping over the texts and calling nlp(text) on each string.

In [4]:
import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("tweets.json") as f:
    TEXTS = json.load(f)

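# Stream the texts through the pipeline and print the adjectives.
# Iterate over nlp.pipe directly: wrapping it in list() would build
# every Doc up front and give away the benefit of streaming.
for doc in nlp.pipe(TEXTS):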
    print(doc.text)
    print([token.text for token in doc if token.pos_ == "ADJ"])
    print()

McDonalds is my favorite restaurant.
['favorite']

Here I thought @McDonalds only had precooked burgers but it seems they only have not cooked ones?? I have no time to get sick..
['sick']

People really still eat McDonalds :(
[]

The McDonalds in Spain has chicken wings. My heart is so happy 
['happy']

@McDonalds Please bring back the most delicious fast food sandwich of all times!!....The Arch Deluxe :P
['delicious', 'fast']

please hurry and open. I WANT A #McRib SANDWICH SO BAD! :D
['open', 'BAD']

This morning i made a terrible decision by gettin mcdonalds and now my stomach is payin for it
['terrible']
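
Because nlp.pipe accepts any iterable of strings, a large corpus does not have to be loaded into memory before processing starts. The following is a minimal sketch under stated assumptions: the tweets are stored one JSON object per line with a "text" field, and both that file layout and the iter_texts helper are hypothetical, not part of the example above.

```python
import json


def iter_texts(path, field="text"):
    """Yield one text at a time from a JSON Lines file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)[field]


# Usage with spaCy (hypothetical file name; batch_size is an optional
# nlp.pipe argument that controls how many texts are buffered at once):
# for doc in nlp.pipe(iter_texts("tweets.jsonl"), batch_size=256):
#     print([token.text for token in doc if token.pos_ == "ADJ"])
```

Since the generator reads one line at a time, memory use depends on the batch size and not on the size of the file.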



The pipe method also accepts (text, context) tuples if you set as_tuples=True. It then yields (doc, context) tuples instead of plain Doc objects. This is useful for carrying additional metadata alongside each Doc, and you can even store that context in custom extension attributes.

In [6]:
from spacy.lang.en import English
from spacy.tokens import Doc

with open("bookquotes.json") as f:
    BOOK = json.load(f)

nlp = English()

# Register the Doc metadata 'author' (default None)
Doc.set_extension("author", default=None)

# Register the Doc metadata 'book' (default None)
Doc.set_extension("book", default=None)

for doc, context in nlp.pipe(BOOK, as_tuples=True):
    # Set book and author for each doc
    doc._.book = context["book"]
    doc._.author = context["author"]

    # Print the text and custom attribute data
    print(doc.text, "\n", "— '{}' by {}".format(doc._.book, doc._.author), "\n")

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 
 — 'Metamorphosis' by Franz Kafka 

I know not all that may be coming, but be it what it will, I'll go to it laughing. 
 — 'Moby-Dick or, The Whale' by Herman Melville 

It was the best of times, it was the worst of times. 
 — 'A Tale of Two Cities' by Charles Dickens 

The only people for me are the mad ones, the ones who are mad to live, mad to talk, mad to be saved, desirous of everything at the same time, the ones who never yawn or say a commonplace thing, but burn, burn, burn like fabulous yellow roman candles exploding like spiders across the stars. 
 — 'On the Road' by Jack Kerouac 

It was a bright cold day in April, and the clocks were striking thirteen. 
 — '1984' by George Orwell 

Nowadays people know the price of everything and the value of nothing. 
 — 'The Picture Of Dorian Gray' by Oscar Wilde 
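
With as_tuples=True, each input item must already be a (text, context) pair. The contents of bookquotes.json are not shown here, so the following is only a sketch: it assumes the data arrives as a list of records with "text", "book" and "author" keys, and to_text_context_pairs is a hypothetical helper that reshapes such records.

```python
def to_text_context_pairs(records, text_key="text"):
    """Turn dict records into (text, context) pairs for nlp.pipe(..., as_tuples=True)."""
    for record in records:
        # Everything except the text itself becomes the context dict.
        context = {k: v for k, v in record.items() if k != text_key}
        yield record[text_key], context


# Assumed record layout, using one quote from the output above.
records = [
    {
        "text": "It was the best of times, it was the worst of times.",
        "book": "A Tale of Two Cities",
        "author": "Charles Dickens",
    },
]
pairs = list(to_text_context_pairs(records))
# Each pair can be passed on as: nlp.pipe(pairs, as_tuples=True)
```

Keeping the context as a plain dict means the loop body above (context["book"], context["author"]) works unchanged.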



You can also process text faster by running only the tokenizer. This is useful when you do not need the other pipeline components, for example because a custom function of your own already does the part-of-speech tagging. The nlp.make_doc method turns a text into a Doc without running any of the other components.

In [7]:
nlp = spacy.load("en_core_web_sm")
text = (
    "Chick-fil-A is an American fast food restaurant chain headquartered in "
    "the city of College Park, Georgia, specializing in chicken sandwiches."
)

# Only tokenize the text
doc = nlp.make_doc(text)
print([token.text for token in doc])

['Chick', '-', 'fil', '-', 'A', 'is', 'an', 'American', 'fast', 'food', 'restaurant', 'chain', 'headquartered', 'in', 'the', 'city', 'of', 'College', 'Park', ',', 'Georgia', ',', 'specializing', 'in', 'chicken', 'sandwiches', '.']


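You can also temporarily disable selected pipeline components by calling nlp.disable_pipes in a with block. The disabled components are restored when the block exits. (In spaCy v3, disable_pipes is deprecated in favor of nlp.select_pipes(disable=[...]).)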

In [8]:
# Disable the tagger and parser
with nlp.disable_pipes("tagger", "parser"):
    # Process the text
    doc = nlp(text)
    # Print the entities in the doc
    print(doc.ents)

(American, College Park, Georgia)
