# spaCy

spaCy is a free, open-source Python library for natural language processing (NLP), designed to simplify building information extraction systems. It is built for production use, supports 64+ languages, and is fast. It also includes built-in visualizers for syntax and named entities.

In [1]:
# import required libraries
import spacy

#### en_core_web_sm

<b>en_core_web_sm</b> is a pre-trained small English language model provided by spaCy. It is designed to support tasks like part-of-speech tagging, dependency parsing, named entity recognition (NER), and tokenization in English text.

It can be installed with the following command:

*python -m spacy download en_core_web_sm*
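
Before loading the model, it can help to check whether the package is already installed. The snippet below is a minimal sketch, assuming spaCy v3, where `spacy.util.is_package` reports whether a name refers to an installed Python package. The names `MODEL_NAME` and `model_installed` are illustrative and not part of the original notebook.

```python
import spacy

MODEL_NAME = "en_core_web_sm"

# is_package returns True when the name refers to an installed Python package
model_installed = spacy.util.is_package(MODEL_NAME)

if model_installed:
    print(f"{MODEL_NAME} is installed")
else:
    print(f"{MODEL_NAME} is missing; run: python -m spacy download {MODEL_NAME}")
```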

In [2]:
# load en_core_web_sm for further processing
# spacy.load returns a Language object, which holds the processing pipeline
nlp = spacy.load("en_core_web_sm")

In [3]:
type(nlp)

spacy.lang.en.English

In [4]:
# let us process a sample string with spaCy

text = 'We are learning spaCy for Natural Language Processing'
doc = nlp(text)

In [5]:
type(doc)

spacy.tokens.doc.Doc

#### As seen above, the nlp object converts the text into a Doc object, a container that holds the tokens, linguistic annotations, and relationships of the processed text.
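
To make the container idea concrete, the sketch below indexes and slices a Doc. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes and needs no model download. The names `nlp_blank`, `first_token`, and `phrase` are illustrative.

```python
import spacy
from spacy.tokens import Span, Token

# A blank English pipeline contains only the tokenizer
nlp_blank = spacy.blank("en")
doc = nlp_blank("We are learning spaCy for Natural Language Processing")

first_token = doc[0]   # indexing a Doc gives a Token
phrase = doc[5:8]      # slicing a Doc gives a Span

print(len(doc))                          # number of tokens in the Doc
print(first_token.text, first_token.i)   # token text and its index in the Doc
print(phrase.text)                       # text covered by the Span
print(isinstance(first_token, Token), isinstance(phrase, Span))
```

Tokens and spans are views into the Doc rather than copies, so they keep their position information (`token.i`, `span.start`, `span.end`).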

In [6]:
doc.text

'We are learning spaCy for Natural Language Processing'

#### Tokenization

In [7]:
# Let us look at the tokens created
print([token.text for token in doc])

['We', 'are', 'learning', 'spaCy', 'for', 'Natural', 'Language', 'Processing']


#### Sentence Segmentation

In [8]:
doc =  nlp('We are learning NLP. We are getting introduced to spaCy.')

print([sent.text for sent in doc.sents])

['We are learning NLP.', 'We are getting introduced to spaCy.']


In [9]:
# print each sentence with its length; len(sent) counts tokens, not characters

doc =  nlp('We are learning NLP. We are getting introduced to spaCy.')

for sent in doc.sents:
    print('Sentence is "{0}" with length {1}'.format(sent.text,len(sent)))

Sentence is "We are learning NLP." with length 5
Sentence is "We are getting introduced to spaCy." with length 7
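
The lengths printed above are token counts, with punctuation counted as a token. The sketch below contrasts token and character counts. It assumes a blank English pipeline with the rule-based `sentencizer` component instead of the parser-based segmentation of `en_core_web_sm`, so it runs without the model. The names `nlp_blank`, `sent_doc`, and `lengths` are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")  # rule-based sentence splitter

sent_doc = nlp_blank("We are learning NLP. We are getting introduced to spaCy.")

# len(sent) is the number of tokens; len(sent.text) is the number of characters
lengths = [(len(sent), len(sent.text)) for sent in sent_doc.sents]

for n_tokens, n_chars in lengths:
    print(n_tokens, "tokens,", n_chars, "characters")
```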


#### Lemmatization

In [10]:
doc = nlp('We are learning NLP')

for token in doc:
    print('Token: ',token.text,' Lemma: ', token.lemma_)

Token:  We  Lemma:  we
Token:  are  Lemma:  be
Token:  learning  Lemma:  learn
Token:  NLP  Lemma:  NLP


#### POS (part-of-speech) tagging with spaCy

In [11]:
verb_sent = 'I watch TV.'

for token in nlp(verb_sent):
    print('Token: {0}, POS: {1}, POS Explanation: {2}'.format(token.text, token.pos_, spacy.explain(token.pos_)))

Token: I, POS: PRON, POS Explanation: pronoun
Token: watch, POS: VERB, POS Explanation: verb
Token: TV, POS: NOUN, POS Explanation: noun
Token: ., POS: PUNCT, POS Explanation: punctuation


In [12]:
# compare with a sentence in which "watch" is a noun
noun_sent = 'I left without my watch.'

for token in nlp(noun_sent):
    print('Token: {0}, POS: {1}, POS Explanation: {2}'.format(token.text, token.pos_, spacy.explain(token.pos_)))

Token: I, POS: PRON, POS Explanation: pronoun
Token: left, POS: VERB, POS Explanation: verb
Token: without, POS: ADP, POS Explanation: adposition
Token: my, POS: PRON, POS Explanation: pronoun
Token: watch, POS: NOUN, POS Explanation: noun
Token: ., POS: PUNCT, POS Explanation: punctuation


#### Named Entity Recognition (NER)

In [13]:
doc = nlp('Albert Einstein was genius')

for ent in doc.ents:
    print('Text {0} is {1} with start position {2} and end position {3}'.format(ent.text, ent.label_, ent.start_char, ent.end_char))

Text Albert Einstein is PERSON with start position 0 and end position 15


In [14]:
doc = nlp('Taj Mahal is in Agra')

for ent in doc.ents:
    print('Text {0} is {1} with start position {2} and end position {3}'.format(ent.text, ent.label_, ent.start_char, ent.end_char))

Text Taj Mahal is PERSON with start position 0 and end position 9
Text Agra is GPE with start position 16 and end position 20


As the output shows, spaCy's NER sometimes mislabels entities: here "Taj Mahal" is tagged as PERSON.

In [15]:
doc = nlp('Taj Mahal is in agra')

for ent in doc.ents:
    print('Text {0} is {1} with start position {2} and end position {3}'.format(ent.text, ent.label_, ent.start_char, ent.end_char))

Text Taj Mahal is PERSON with start position 0 and end position 9


Also, when the place name is written in lowercase ("agra"), it is not detected as an entity at all.
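
One possible way to handle such misses is to add rule-based patterns with spaCy's `entity_ruler` component. This is a hedged sketch assuming spaCy v3, and it is not something the original notebook does. It uses a blank pipeline so it runs without the model. The labels `FAC` and `GPE` and the two patterns are illustrative choices.

```python
import spacy

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Token patterns match on the lowercase form, so casing no longer matters
ruler.add_patterns([
    {"label": "FAC", "pattern": [{"LOWER": "taj"}, {"LOWER": "mahal"}]},
    {"label": "GPE", "pattern": [{"LOWER": "agra"}]},
])

rule_doc = nlp_rules("Taj Mahal is in agra")
found = [(ent.text, ent.label_) for ent in rule_doc.ents]
print(found)
```

With the pretrained pipeline, the same component could be added with `nlp.add_pipe("entity_ruler", before="ner")` so that the rules take precedence over the statistical predictions.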

An alternative way to access entity types is through the tokens of the Doc object:

In [16]:
doc = nlp('Albert Einstein was genius')

for token in doc:
    print('Token {0} is of type {1}'.format(token.text, token.ent_type_))

Token Albert is of type PERSON
Token Einstein is of type PERSON
Token was is of type 
Token genius is of type 


For tokens that are not part of a recognized entity, token.ent_type_ is an empty string.
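
Because the empty string is falsy in Python, tokens that belong to an entity can be filtered with a plain truth test. In the sketch below the entity is marked by hand on a blank pipeline so that the example runs without the model. That manual step is an assumption made for illustration, and the names `ent_doc` and `entity_tokens` are illustrative.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
ent_doc = nlp_blank("Albert Einstein was genius")

# Assumption: label tokens 0-1 as PERSON by hand instead of running a trained NER model
ent_doc.ents = [Span(ent_doc, 0, 2, label="PERSON")]

# Keep only tokens whose ent_type_ is non-empty
entity_tokens = [(token.text, token.ent_type_) for token in ent_doc if token.ent_type_]
print(entity_tokens)
```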

### displaCy

spaCy ships with a built-in visualizer, <b>displaCy</b>. Its entity visualizer highlights named entities and their labels in the text.

In [None]:
from spacy import displacy

doc = nlp('Albert Einstein was genius')

displacy.serve(doc, style='ent', port=5001)




Using the 'ent' visualizer
Serving on http://0.0.0.0:5001 ...
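`displacy.serve` starts a web server and keeps the cell busy until it is stopped. As an alternative, `displacy.render` returns the markup as a string. The sketch below assumes spaCy v3 and uses displaCy's manual mode (`manual=True`), where the entities are supplied as a dictionary, so it runs without the model. The character offsets match the NER output shown earlier, and the names `manual_doc` and `html` are illustrative.

```python
from spacy import displacy

# Manual mode: text plus character offsets and labels for each entity
manual_doc = {
    "text": "Albert Einstein was genius",
    "ents": [{"start": 0, "end": 15, "label": "PERSON"}],
    "title": None,
}

# jupyter=False makes render return the HTML string instead of displaying it
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False)
print(html[:80])
```

Inside a notebook, passing a processed `doc` with `style='ent'` and leaving `jupyter` unset should display the highlighted text inline.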

