### spaCy

1. spaCy is used to extract features from text data.
2. Machine learning models can then be applied to these features for tasks such as classification and sentiment analysis.

- A key advantage of spaCy is that we can write our own custom rules, based on our knowledge of English, to extract features from text data.
- spaCy takes raw text as input and returns a processed document (a Doc object).

### Steps Involved In NLP:

1. Tokenization
2. Tagging
3. Parsing
4. Named Entity Recognition
5. Custom rules, if any
6. The final document is classified

### 1. Tokenization

- Tokenization is the task of splitting text into meaningful segments called tokens. The input to the tokenizer is Unicode text, and the output is a Doc object.

In [8]:
import spacy

In [9]:
nlp = spacy.load("en_core_web_sm")

In [12]:
doc = nlp("Apple is looking to buy a U.K. startup for $1 billion")

In [13]:
for token in doc:
    print(token.text)

Apple
is
looking
to
buy
a
U.K.
startup
for
$
1
billion
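
To see why "$1" ends up as two tokens while "U.K." stays whole, here is a minimal sketch of a rule-based tokenizer using only the standard library. It is a toy illustration, not spaCy's actual algorithm; the pattern and the name `toy_tokenize` are assumptions made for this example.

```python
import re

# Toy tokenizer: NOT spaCy's algorithm, just an illustration of ordered rules.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"   # abbreviations such as U.K. stay in one piece
    r"|\w+"                 # runs of letters or digits
    r"|[^\w\s]"             # any other single symbol, e.g. $
)

def toy_tokenize(text):
    """Return the list of token strings matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

tokens = toy_tokenize("Apple is looking to buy a U.K. startup for $1 billion")
print(tokens)
```

The abbreviation rule is tried first, so "U.K." is matched before the generic rules can split it at the periods.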


### 2. Lemmatization

- Lemmatization reduces each word to its base form (its lemma), so that inflected forms such as "looking" and "look" are counted as the same word. This shrinks the vocabulary we have to work with.

In [14]:
doc

Apple is looking to buy a U.K. startup for $1 billion

In [20]:
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
looking look
to to
buy buy
a a
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion
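
The vocabulary-shrinking effect can be sketched with a small lookup table. This is a hypothetical table written for illustration only; spaCy uses its own lemmatizer data and rules, and `LEMMA_TABLE` and `toy_lemma` are names invented for this example.

```python
# Hypothetical lookup table for illustration; not spaCy's lemmatizer data.
LEMMA_TABLE = {
    "is": "be", "was": "be",
    "looking": "look", "looked": "look", "looks": "look",
}

def toy_lemma(word):
    # Fall back to the word itself when the table has no entry.
    return LEMMA_TABLE.get(word.lower(), word)

words = ["is", "was", "looking", "looked", "looks", "buy"]
lemmas = [toy_lemma(w) for w in words]

print(lemmas)
print(len(set(words)), "distinct words ->", len(set(lemmas)), "distinct lemmas")
```

Six distinct surface forms collapse to three lemmas, which is the kind of reduction a downstream model benefits from.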


In [26]:
# print text, lemma and POS tag in aligned columns using an f-string

for token in doc:
    print(f"{token.text:{20}} {token.lemma_:{10}} {token.pos_:{10}}")

Apple                Apple      PROPN     
is                   be         AUX       
looking              look       VERB      
to                   to         PART      
buy                  buy        VERB      
a                    a          DET       
U.K.                 U.K.       PROPN     
startup              startup    NOUN      
for                  for        ADP       
$                    $          SYM       
1                    1          NUM       
billion              billion    NUM       
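
The `{token.text:{20}}` syntax pads each value to a fixed column width. As a small standalone sketch (the variable names here are made up for the example), the nested `{20}` form behaves like writing the width directly, and becomes useful when the width is held in a variable:

```python
word, lemma, pos = "looking", "look", "VERB"

nested = f"{word:{20}} {lemma:{10}} {pos:{10}}"   # width given as a nested field
plain = f"{word:20} {lemma:10} {pos:10}"          # same widths written directly

width = 20
variable = f"{word:{width}} {lemma:10} {pos:10}"  # nesting lets the width be a variable

print(repr(nested))
print(nested == plain == variable)
```

Strings are left-aligned by default, so the padding spaces are added on the right of each value.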


- The meaning of each POS tag is documented at spacy.io/api/annotations

### Stopwords

- Stopwords are very common words (such as "is", "to" and "a") that do not convey much information about the text.

In [27]:
for token in doc:
    print(f"{token.text:{20}} {token.lemma_:{10}} {token.pos_:{10}} {token.is_stop}")

Apple                Apple      PROPN      False
is                   be         AUX        True
looking              look       VERB       False
to                   to         PART       True
buy                  buy        VERB       False
a                    a          DET        True
U.K.                 U.K.       PROPN      False
startup              startup    NOUN       False
for                  for        ADP        True
$                    $          SYM        False
1                    1          NUM        False
billion              billion    NUM        False
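
A common use of the stop-word flag is to drop those tokens before feature extraction. The sketch below uses plain Python with the (token, is_stop) pairs copied from the output above, so it runs without a model; with a real Doc, the equivalent filter would be `[token.text for token in doc if not token.is_stop]`.

```python
# (token text, is_stop) pairs copied from the output above
rows = [
    ("Apple", False), ("is", True), ("looking", False), ("to", True),
    ("buy", False), ("a", True), ("U.K.", False), ("startup", False),
    ("for", True), ("$", False), ("1", False), ("billion", False),
]

# keep only the tokens that are not flagged as stop words
content_tokens = [text for text, is_stop in rows if not is_stop]
print(content_tokens)
```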


### Dependency Parsing

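- Dependency parsing identifies how each word/token depends on another token in the sentence (for example, which noun is the subject of which verb).
- A chunk is a group of words; a noun chunk is a noun together with the words that describe it.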

In [32]:
for chunk in doc.noun_chunks:
    print(f"{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_:{15}}")

Apple                          Apple           nsubj          
a U.K. startup                 startup         dobj           


There are two noun chunks:
1. Apple
2. a U.K. startup

- spaCy's predictions are not 100% accurate. Where the model gets something wrong, a custom rule-based approach can work well.

### Named Entity Recognition

- It identifies whether a word or phrase is a person's name, an organization, a time, a date, a location, etc.

In [34]:
doc

Apple is looking to buy a U.K. startup for $1 billion

In [36]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Sentence Segmentation

- Sentence segmentation means splitting a text document into its individual sentences.

In [38]:
doc

Apple is looking to buy a U.K. startup for $1 billion

In [39]:
for sent in doc.sents:
    print(sent) # this has only one sentence

Apple is looking to buy a U.K. startup for $1 billion


In [40]:
doc1 = nlp("My name is Ashish. I am very happy. I love myself and life is good.")

In [41]:
for sent in doc1.sents:
    print(sent)

My name is Ashish.
I am very happy.
I love myself and life is good.
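
For comparison, a purely punctuation-based splitter produces the same three sentences on this text. The sketch below is a rough standard-library rule for illustration only; it is not how the loaded spaCy pipeline finds sentence boundaries.

```python
import re

text = "My name is Ashish. I am very happy. I love myself and life is good."

# Split at whitespace that follows ., ! or ? (a rough rule, for illustration only).
sentences = re.split(r"(?<=[.!?])\s+", text)
for s in sentences:
    print(s)

# The same rule wrongly splits after the abbreviation "U.K."
pieces = re.split(r"(?<=[.!?])\s+", "Apple is looking to buy a U.K. startup for $1 billion")
print(len(pieces))
```

The second call shows the weakness of such a rule: it breaks the Apple sentence in two after "U.K.", whereas the spaCy output earlier kept it as one sentence.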


- By default, a new sentence typically starts after a full stop, exclamation mark, or question mark.
- An underscore does not start a new sentence: the text in the next example is kept as a single sentence.

In [42]:
doc2 = nlp("Welcome to Ashish_datascience Thanks for watching")

In [44]:
for sent in doc2.sents:
    print(sent)

Welcome to Ashish_datascience Thanks for watching


In [46]:
doc3 = nlp("Welcome to...Ashish datascience...Thanks for watching")

In [47]:
for sent in doc3.sents:
    print(sent)

Welcome to...
Ashish datascience...
Thanks for watching


- We can write our own custom rule to define where a new sentence starts.
- To do that, we add a custom pipeline component that marks the sentence-start tokens before the parser runs.

In [52]:
def set_rule(doc):
    # mark the token that follows each "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

In [54]:
#nlp.remove_pipe("set_rule")

('set_rule', <function __main__.set_rule(doc)>)

In [55]:
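# spaCy v2 API; in v3, register with @Language.component("set_rule") and call nlp.add_pipe("set_rule", before="parser")
nlp.add_pipe(set_rule, before="parser")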

In [56]:
doc3 = nlp("Welcome to...Ashish datascience...Thanks for watching")

In [57]:
for sent in doc3.sents:
    print(sent)

Welcome to...
Ashish datascience...
Thanks for watching
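
The logic of the custom rule can also be traced without spaCy. The sketch below works on a plain list of token strings, which is assumed to match how the text is tokenized (with "..." as its own token); the function name `split_after_ellipsis` is invented for this example.

```python
# Token texts for the example, as a plain list (assumed tokenization, no spaCy needed)
tokens = ["Welcome", "to", "...", "Ashish", "datascience", "...",
          "Thanks", "for", "watching"]

def split_after_ellipsis(tokens):
    """Start a new sentence at the token that follows each '...'."""
    starts = {0}                                  # the first token always starts a sentence
    for i, text in enumerate(tokens[:-1]):        # skip the last token, as set_rule does
        if text == "...":
            starts.add(i + 1)
    bounds = sorted(starts) + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]

for sentence in split_after_ellipsis(tokens):
    print(" ".join(sentence))
```

Skipping the last token mirrors `doc[:-1]` in the rule above: a trailing "..." has no following token to mark, so looking one position ahead would otherwise fail.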


### Visualization of Dependencies and Entities using displaCy

In [58]:
from spacy import displacy

In [59]:
doc

Apple is looking to buy a U.K. startup for $1 billion

In [60]:
displacy.render(doc, style = "dep")

In [63]:
# make the dependency chart compact

displacy.render(doc, style="dep", options={"compact": True})

In [66]:
# compact chart with the distance between words set to 100

displacy.render(doc, style="dep", options={"compact": True, "distance": 100})

In [67]:
# visualize the named entities

displacy.render(doc, style="ent")