# spaCy
spaCy is an advanced, modern library for Natural Language Processing (NLP). It is open source and built for industrial-grade use. This document will guide you through using spaCy for various tasks.
spaCy comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), lemmatization, and producing word vectors.

In [1]:
#pip install -U spacy
import spacy

If you're dealing with a particular language, you can load the spaCy model specific to that language using the spacy.load() function.

In [2]:
# Load the small English model (install it with: python -m spacy download en_core_web_sm)
# Available models: https://spacy.io/models
nlp = spacy.load("en_core_web_sm")
nlp

<spacy.lang.en.English at 0x1e84164ca30>

This returns a Language object that comes ready with multiple built-in capabilities.
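If the model package is not installed, spacy.load() raises an OSError. The sketch below shows one way to guard against that. The helper name `load_pipeline` and the fallback to a blank English pipeline (tokenizer only, no trained components) are assumptions made for illustration and are not part of the original notebook.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load a trained pipeline, or fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised when the model package is not installed.
        # spacy.blank("en") returns a tokenizer-only English pipeline.
        return spacy.blank("en")


nlp = load_pipeline()

# Names of the components that run after the tokenizer
# (an empty list when the blank fallback is used).
print(nlp.pipe_names)
```

With the blank fallback, attributes that depend on trained components (POS tags, entities, dependency labels) stay empty, so this is only suitable for tokenizer-level experiments.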

### The Doc object
Let us say you have your text data in a string. What can be done to understand the structure of the text? First, call the loaded nlp object on the text. It returns a processed Doc object.

In [3]:
# Parse the text with the 'nlp' pipeline
my_text = ''' The economic situation of the country is on edge, as the stock market crashed causing loss of millions. Citizens who had their main investment in the share-market are facing a great loss. Many companies might lay off thousands of people to reducelabor cost'''
my_doc = nlp(my_text)
type(my_doc)

spacy.tokens.doc.Doc

What exactly is a Doc object?

It is a sequence of tokens that contains not just the original text but all the results produced by the spaCy model after processing the text. Useful information such as the lemma of each token, whether it is a stop word or not, named entities, word vectors and so on is pre-computed and readily stored in the Doc object.
The good thing is that you have complete control over what information is pre-computed and how it is customized. We will see all of that shortly.

And although the text gets split into tokens, no information from the original text is actually lost.
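This claim can be checked directly. The sketch below uses a blank English pipeline (an assumption made so that no trained model is required; only the tokenizer runs) and an illustrative sentence with irregular spacing, then rebuilds the original string from the tokens.

```python
import spacy

# A blank pipeline contains only the tokenizer, which is enough for this check.
nlp = spacy.blank("en")

# Illustrative text with leading, doubled and trailing spaces.
text = "  The stock market crashed,  causing a loss of millions. "
doc = nlp(text)

# Doc.text reproduces the input, including the irregular whitespace.
print(doc.text == text)

# Each token keeps its trailing whitespace, so the text can be rebuilt from tokens.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Extra whitespace is stored as separate whitespace tokens rather than discarded, which is why the first token printed for `my_doc` is a blank line.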

### Tokenization with spaCy

In [4]:
# Printing the tokens of a doc
for token in my_doc:
    print(token.text)

 
The
economic
situation
of
the
country
is
on
edge
,
as
the
stock
market
crashed
causing
loss
of
millions
.
Citizens
who
had
their
main
investment
in
the
share
-
market
are
facing
a
great
loss
.
Many
companies
might
lay
off
thousands
of
people
to
reducelabor
cost


##### What is Tokenization?

Tokenization is the process of splitting a text into smaller units, based on certain predefined rules. For example, sentences are tokenized into words (and optionally punctuation), and paragraphs into sentences, depending on the context.

This is typically the first step for NLP tasks like text classification, sentiment analysis, etc. Each token in spaCy has different attributes that tell us a great deal about it.
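A few of these attributes are purely lexical and are available without a trained model. The sketch below assumes a blank English pipeline and a short illustrative sentence, and prints each token's position and some boolean flags.

```python
import spacy

# Tokenizer only; lexical attributes do not need a trained model.
nlp = spacy.blank("en")
doc = nlp("The market crashed.")

for token in doc:
    # token.i   : index of the token within the Doc
    # token.idx : character offset of the token in the original text
    print(token.i, token.idx, token.text, token.is_alpha, token.is_punct)
```

Attributes such as `token.pos_`, `token.dep_` or `token.lemma_` need the corresponding trained components, so they would be empty in a blank pipeline.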

The tokens above contain punctuation and common words like "a", "the", "was" etc. These usually add little to the meaning of your text. Such common words are called stop words.
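spaCy ships a default stop word list for English, which can be inspected directly. The following is a small sketch; the words tested are only illustrative.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# STOP_WORDS is a plain Python set of lowercase strings.
print(len(STOP_WORDS))

for word in ["a", "the", "was", "market"]:
    print(word, word in STOP_WORDS)
```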


### Text Preprocessing with spaCy

There is 'noise' in the tokens. You have punctuation like commas, brackets and full stops, and some extra white spaces too. The process of removing this noise from the doc is called *text cleaning* or *preprocessing*.

#### What is the need for text preprocessing?

Whatever NLP task you perform, be it classification, sentiment analysis or topic modelling, the quality of the output depends heavily on the quality of the input text.

Stop words and punctuation usually (not always) don't add value to the meaning of the text and can potentially affect the outcome. To avoid this, it might make sense to remove them. Cleaning the text of unwanted characters can also reduce the size of the corpus.
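The "not always" matters in practice. For instance, "not" is part of spaCy's default English stop word list, so removing stop words can drop a negation. The sketch below assumes a blank English pipeline and an illustrative sentence.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The movie was not good")

# Same filter as used for cleaning: drop stop words and punctuation.
kept = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(kept)
```

The negation disappears from the cleaned tokens, which could mislead a sentiment model. For such tasks it may be better to keep selected stop words.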

##### How to identify and remove stop words and punctuation?

Tokens in spaCy have attributes that help you identify whether a token is a stop word (is_stop) or punctuation (is_punct).

In [5]:
# Print each token with its is_stop and is_punct flags
for token in my_doc:
    print(token.text, "--", token.is_stop, "---", token.is_punct)

  -- False --- False
The -- True --- False
economic -- False --- False
situation -- False --- False
of -- True --- False
the -- True --- False
country -- False --- False
is -- True --- False
on -- True --- False
edge -- False --- False
, -- False --- True
as -- True --- False
the -- True --- False
stock -- False --- False
market -- False --- False
crashed -- False --- False
causing -- False --- False
loss -- False --- False
of -- True --- False
millions -- False --- False
. -- False --- True
Citizens -- False --- False
who -- True --- False
had -- True --- False
their -- True --- False
main -- False --- False
investment -- False --- False
in -- True --- False
the -- True --- False
share -- False --- False
- -- False --- True
market -- False --- False
are -- True --- False
facing -- False --- False
a -- True --- False
great -- False --- False
loss -- False --- False
. -- False --- True
Many -- True --- False
companies -- False --- False
might -- True --- False
lay -- False --- False
off

In [7]:
# Remove stop words and punctuation

my_doc_cleaned = [token for token in my_doc if not token.is_stop and not token.is_punct]

for token in my_doc_cleaned:
    print(token.text)

 
economic
situation
country
edge
stock
market
crashed
causing
loss
millions
Citizens
main
investment
share
market
facing
great
loss
companies
lay
thousands
people
reducelabor
cost


Removing stop words and punctuation cut the number of tokens roughly in half (from 49 to 25 here), which lowers the computational cost of every later step. The effect grows with the size of the text data, so it is worth comparing token counts before and after preprocessing on a larger corpus.
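The reduction can be measured directly. The sketch below assumes a blank English pipeline and reuses the first sentence of the example text. It also filters whitespace tokens with `is_space`, which is an addition to the filter used earlier and explains the blank line at the top of the previous output.

```python
import spacy

nlp = spacy.blank("en")
text = (" The economic situation of the country is on edge, "
        "as the stock market crashed causing loss of millions.")
doc = nlp(text)

# Drop stop words, punctuation and (additionally) whitespace-only tokens.
cleaned = [
    token for token in doc
    if not token.is_stop and not token.is_punct and not token.is_space
]

print(len(doc), "tokens before cleaning,", len(cleaned), "after")
```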

## spaCy pipelines

You have used tokens and Docs in several ways so far. In this section, let's dive deeper and understand the basic pipeline behind them.

When you call the nlp object on a text, the text is first segmented into tokens to create a Doc object. Following this, various processing steps are carried out on the Doc to add attributes such as POS tags, lemmas, dependency labels etc.
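These two stages can be run by hand to see what the nlp call does. The sketch below assumes spaCy v3 and a blank English pipeline with only the built-in sentencizer added, so no trained model is needed; the text is illustrative.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # built-in rule-based sentence splitter

text = "The market crashed. Many companies might lay off people."

# Stage 1: the tokenizer alone turns the raw text into a Doc.
doc = nlp.make_doc(text)

# Stage 2: each component receives the Doc and returns it with more annotations.
for name, component in nlp.pipeline:
    doc = component(doc)

print([sent.text for sent in doc.sents])
```

Calling `nlp(text)` performs the same two stages in one step.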

### What are pipeline components?

The processing pipeline consists of components, where each component performs its task and passes the processed Doc to the next component. These are called pipeline components.
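To make the "Doc in, Doc out" contract concrete, here is a minimal sketch of a custom component, assuming the spaCy v3 registration API. The component name `token_counter` and its behaviour are hypothetical and chosen only for illustration.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # A component receives a Doc, may inspect or modify it, and must return it.
    print("token_counter saw", len(doc), "tokens")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("token_counter")  # appended, so it runs after the sentencizer

doc = nlp("The market crashed. Citizens are facing a great loss.")
print(nlp.pipe_names)
```

Because every component returns the Doc, components can be chained in any order that respects their dependencies.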

spaCy provides several built-in pipeline components:

**Tokenizer** : It is responsible for segmenting the text into tokens and returning a Doc object. This is the first and compulsory step in a pipeline.

**Tagger** : This component is called **tagger**. It is responsible for assigning part-of-speech tags. It takes a Doc as input and sets Doc[i].tag.

**Dependency Parser** : This component is called **parser**. It is responsible for assigning dependency labels to each token. It takes a Doc as input and returns the processed Doc.

**Entity Recognizer** : This component is referred to as **ner**. It is responsible for identifying named entities and assigning labels to them.

**Text Categorizer** : This component is called **textcat**. It assigns categories to Docs.

**Entity Ruler** : This component is called **entity_ruler**. It is responsible for assigning named entities based on pattern rules. Revisit Rule Based Matching to know more.

**Sentencizer** : This component is called **sentencizer** and can perform rule-based sentence segmentation.

**merge_noun_chunks** : This component merges each noun chunk into a single token. It has to be added to the pipeline after **tagger** and **parser**.

**merge_entities** : This component merges each named entity into a single token. It has to be added after **ner**.

**merge_subtokens** : This component merges subtokens into a single token.

These are the various built-in pipeline components. It is not necessary for every spaCy model to have each of the above components.
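You can check which components a pipeline actually has and add built-in ones by name. The sketch below assumes spaCy v3 and starts from a blank English pipeline. The entity pattern and the example text are illustrative assumptions.

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)            # a blank pipeline has no components yet
print(nlp.has_pipe("textcat"))   # check for a specific component by name

# Add two rule-based built-ins that need no trained weights.
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])  # illustrative pattern

doc = nlp("spaCy is an NLP library. It is open source.")
print(nlp.pipe_names)
print([(ent.text, ent.label_) for ent in doc.ents])
print(len(list(doc.sents)), "sentences")
```

Trained components such as tagger, parser and ner come with a pretrained model like en_core_web_sm; adding them to a blank pipeline by name would create untrained components.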