### Processing Text

**When you call nlp on a text, spaCy first tokenizes it and then calls each pipeline component on the resulting Doc, in order. It then returns the processed Doc for you to work with.**

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
doc = nlp("This is raw text")

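- In the code above, doc = nlp("This is raw text") first runs the tokenizer, which splits the text into tokens and creates a Doc.
- In an nlp pipeline, tokenization is followed by the tagger, the parser, and the named entity recognizer (ner). Each component receives the Doc, adds its annotations, and passes the Doc on to the next one.
- The same components can also be applied to an existing Doc, since each one is a function that takes a Doc and returns it.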
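The two stages described above can be written out by hand. This is a rough equivalent of calling nlp(text), not spaCy's actual implementation: nlp.make_doc runs only the tokenizer, and nlp.pipeline is the ordered list of (name, component) pairs. The fallback to spacy.blank("en") is an assumption added here so the sketch still runs when en_core_web_sm is not installed; in that case the pipeline is simply empty.

```python
import spacy

# Assumption: fall back to a blank English pipeline (tokenizer only)
# if the en_core_web_sm package is not installed.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

text = "This is raw text"

# Stage 1: tokenization only; the Doc has no other annotations yet.
manual_doc = nlp.make_doc(text)

# Stage 2: call each pipeline component on the Doc, in order.
for name, component in nlp.pipeline:
    manual_doc = component(manual_doc)

# The one-line call does roughly the same thing.
doc = nlp(text)

print(nlp.pipe_names)
print([token.text for token in manual_doc])
print([token.text for token in doc])
```

Both Docs should contain the same tokens; the loop only makes the order of operations visible.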

**When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy's nlp.pipe method takes an iterable of texts and yields processed Doc objects. The batching is done internally.**

In [4]:
text = ["This is a raw text", "There is a lot of text"]

In [6]:
docs = list(nlp.pipe(text))

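- Because we use nlp.pipe, the texts are grouped into batches internally and the models process several texts per call, which reduces the per-text overhead. Batching on its own does not mean the texts are processed in parallel.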
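A minimal sketch of batched processing with nlp.pipe. The batch_size keyword is an existing argument of nlp.pipe; the value 2 and the third sample text are arbitrary choices for illustration. As before, the fallback to spacy.blank("en") is an assumption so the snippet runs without the model package.

```python
import spacy

# Assumption: use a blank English pipeline if the model is not installed.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

texts = ["This is a raw text", "There is a lot of text", "One more short text"]

# nlp.pipe returns a generator: nothing is processed until it is consumed.
doc_stream = nlp.pipe(texts, batch_size=2)

# Docs are yielded in the same order as the input texts.
docs = list(doc_stream)

for doc in docs:
    print(len(doc), [token.text for token in doc])
```

For a corpus that does not fit in memory, iterating over the generator directly instead of calling list() keeps only one batch of Docs alive at a time.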

**Tip for efficient processing**

- Only apply the pipeline components you need. Getting predictions from the model that you don't actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable the components you don't need.

In [10]:
# In this example we only need the entities of each doc, so the other components are disabled

import spacy

text = ["Net income was $9.4 million compared to the prior year of $2.7 million.", "Revenue exceeded twelve billion dollars, with a loss of $1b."]

nlp = spacy.load("en_core_web_sm")

docs = nlp.pipe(text, disable=["tagger", "parser"])
for doc in docs:
    print([(ent.text, ent.label_) for ent in doc.ents])
    print()

[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]

[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]



- Neither NLTK nor spaCy is 100% accurate. Natural language is highly varied and the meaning of a word changes with context, so spaCy's statistical models will sometimes extract incorrect or incomplete information. In the output above, for instance, the second text yields '1b' rather than '$1b'.

- When you load a model, spaCy first consults the model's meta.json.

This is the kind of metadata stored in meta.json:

    {
        "lang": "en",
        "name": "core_web_sm",
        "description": "Example model for spaCy",
        "pipeline": ["tagger", "parser", "ner"]
    }
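The same metadata can be inspected on a loaded pipeline through nlp.meta, which is a plain dictionary. This is a small sketch; the exact set of keys varies between models and spaCy versions, so the lookups use .get() rather than assuming a key is present. The spacy.blank("en") fallback is an assumption for environments without the model package.

```python
import spacy

# Assumption: use a blank English pipeline if the model is not installed.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

meta = nlp.meta

# .get() avoids a KeyError for keys that a given model does not define.
for key in ("lang", "name", "description", "pipeline"):
    print(key, "->", meta.get(key))
```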

- Note that the tokenizer always runs and is not listed as a pipeline component. After tokenization, each component (tagger, parser, ner, etc.) is called on the whole Doc in turn.

- Fundamentally, a spaCy model consists of three components: the weights, i.e. the binary data loaded in from a directory; a pipeline of functions called in order (tagging, parsing, ner); and language data like the tokenization rules and annotation schemes.


### Search for built-in pipeline components online

### Disabling and Modifying pipeline components

If you don't need a particular component of the pipeline, for example the tagger or the parser, you can disable loading it. This can sometimes make a big difference and improve loading speed.

In [11]:
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])

In [12]:
nlp

<spacy.lang.en.English at 0xc152495a58>

- Sometimes you do want to load all the pipeline components and their weights, because you need them at different points in your application. However, if a particular step only needs a Doc with named entities, there is no need to run every component on it. In that case, load the full pipeline and pass disable to nlp.pipe for that step, as in the entity example earlier.

In [13]:
doc = nlp("Apple is buying a Startup")

In [14]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
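To confirm which components are active after loading with disable, nlp.pipe_names is more informative than the object repr. A short sketch follows. Whether a disabled component is skipped at load time or loaded and left inactive depends on the spaCy version, so the check below relies only on nlp.pipe_names. The spacy.blank("en") fallback is an assumption for environments without the model package.

```python
import spacy

disabled_names = ["tagger", "parser"]

# Assumption: use a blank English pipeline if the model is not installed.
try:
    nlp = spacy.load("en_core_web_sm", disable=disabled_names)
except OSError:
    nlp = spacy.blank("en")

# Only the active components are listed here.
print(nlp.pipe_names)

doc = nlp("Apple is buying a Startup")
for ent in doc.ents:
    print(ent.text, ent.label_)
```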


In [None]:
# Temporarily disable a component, then restore it
disabled = nlp.disable_pipes("ner")
doc = nlp("ner is disabled now")  # runs without the entity recognizer
disabled.restore()  # re-enables ner so it can be used again
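The object returned by nlp.disable_pipes can also be used as a context manager, which restores the component automatically when the block ends. This is a sketch under a few assumptions: the variable names are illustrative, the guard exists because disabling a name that is not in the pipeline is expected to raise an error, and the spacy.blank("en") fallback covers environments without the model package. Newer spaCy releases (v3) provide nlp.select_pipes for the same purpose, so check the documentation for the version you have installed.

```python
import spacy

# Assumption: use a blank English pipeline if the model is not installed.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

before = list(nlp.pipe_names)
during = before

# Guard: only disable ner if the pipeline actually contains it.
if "ner" in before:
    with nlp.disable_pipes("ner"):
        during = list(nlp.pipe_names)
        doc = nlp("ner is disabled now")
        print(doc.ents)  # the entity recognizer did not run on this doc

# Leaving the with block restores the disabled component.
after = list(nlp.pipe_names)
print(before, during, after)
```

The context manager form avoids forgetting the restore() call, which matters when the same nlp object is shared across an application.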