### Installation

In [1]:
conda install -c conda-forge spacy


Note: you may need to restart the kernel to use updated packages.


In [2]:
!python -m spacy download en_core_web_sm

[!] Skipping model package dependencies and setting `--no-deps`. You don't seem
to have the spaCy package itself installed (maybe because you've built from
source?), so installing the model dependencies would cause spaCy to be
downloaded, which probably isn't what you want. If the model package has other
dependencies, you'll have to install them manually.
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')




### Tokenization

In [6]:
import spacy

In [7]:
nlp = spacy.load("en_core_web_sm")

In [14]:
doc = nlp("Apple isn't looking to buy U.K. startup for $1 billion")

In [15]:
for token in doc:
    print(token.text)

Apple
is
n't
looking
to
buy
U.K.
startup
for
$
1
billion


### Tagger

In [16]:
# Part-of-speech tagging

In [17]:
doc

Apple isn't looking to buy U.K. startup for $1 billion

In [18]:
for token in doc:
    print(token.text, token.lemma)

Apple 6418411030699964375
is 10382539506755952630
n't 447765159362469301
looking 16096726548953279178
to 3791531372978436496
buy 9457496526477982497
U.K. 14409890634315022856
startup 7622488711881293715
for 16037325823156266367
$ 11283501755624150392
1 5533571732986600803
billion 1231493654637052630


In [19]:
# token.lemma returns a hash ID; add a trailing underscore (token.lemma_) to get the human-readable string
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
n't not
looking look
to to
buy buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion


In [27]:
for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')

Apple           Apple           PROPN      False
is              be              AUX        True
n't             not             PART       True
looking         look            VERB       False
to              to              PART       True
buy             buy             VERB       False
U.K.            U.K.            PROPN      False
startup         startup         NOUN       False
for             for             ADP        True
$               $               SYM        False
1               1               NUM        False
billion         billion         NUM        False
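The column alignment in the table above comes from the f-string format spec: the number after the colon is a minimum field width, and strings are left-aligned and padded with spaces by default. The nested form `{15}` used in the cell is equivalent to writing the width directly. A minimal plain-Python sketch, using three rows copied from the output above (the helper name `format_row` is only illustrative):

```python
# Rows copied from the output above: (text, lemma, pos, is_stop)
rows = [
    ("Apple", "Apple", "PROPN", False),
    ("is", "be", "AUX", True),
    ("n't", "not", "PART", True),
]

def format_row(text, lemma, pos, is_stop):
    # A literal width (15) pads exactly like the nested form ({15})
    return f"{text:15} {lemma:15} {pos:10} {is_stop}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

The nested braces are only needed when the width is itself a variable, for example `f"{text:{width}}"`.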


### Dependency Parsing

In [29]:
# Print each noun chunk, its root token, and the root's dependency label
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{25}} {chunk.root.text:{15}} {chunk.root.dep_}')

Apple                     Apple           nsubj
U.K. startup              startup         dobj


### Named Entity Recognition

In [30]:
doc

Apple isn't looking to buy U.K. startup for $1 billion

In [31]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY


### Sentence Segmentation

In [32]:
#doc.sents
doc

Apple isn't looking to buy U.K. startup for $1 billion

In [33]:
for sent in doc.sents:
    print(sent)

Apple isn't looking to buy U.K. startup for $1 billion


In [38]:
doc1 = nlp("This is Anmol Pant...Existence is pain. Will any of this ever end?")

In [39]:
for sent in doc1.sents:
    print(sent)

This is Anmol Pant...
Existence is pain.
Will any of this ever end?


In [42]:
doc2 = nlp("This is Anmol.*.Pant...Existence is pain.*.Will any of this ever end?")

In [43]:
for sent in doc2.sents:
    print(sent)

This is Anmol.*.Pant...
Existence is pain.*.Will any of this ever end?


In [49]:
# Custom rule to control where sentences start: force a boundary after every "..." token

In [50]:
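def set_rule(doc):
    # Skip the last token: there is nothing after it to mark as a sentence start
    for token in doc[:-1]:
        if token.text == '...':
            # Mark the token following the ellipsis as the start of a new sentence
            doc[token.i + 1].is_sent_start = True
    return doc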

In [51]:
# Remove a previously added copy of the rule, if any, so this notebook can be re-run safely
if 'set_rule' in nlp.pipe_names:
    nlp.remove_pipe('set_rule')

In [52]:
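# Add the rule to the pipeline before the parser, so the parser respects the preset boundaries.
# This is the spaCy v2 call style; in spaCy v3, decorate set_rule with
# @Language.component("set_rule") and call nlp.add_pipe("set_rule", before="parser") instead.
nlp.add_pipe(set_rule, before='parser')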

In [53]:
text = "Welcome... Here is another text... for testing purposes!"
doc4 = nlp(text)

In [54]:
for sent in doc4.sents:
    print(sent)

Welcome...
Here is another text...
for testing purposes!


In [56]:
for token in doc4:
    print(token.text)

Welcome
...
Here
is
another
text
...
for
testing
purposes
!
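
The token list above shows why the rule works: spaCy keeps `...` as a single token, so `set_rule` can match it with a plain string comparison. To make the boundary logic easier to follow, here is a rough plain-Python sketch of the same idea applied to that token list. It does not use spaCy; the helper names `sentence_starts` and `split_sentences` are hypothetical, and in the real pipeline the parser may add further boundaries of its own.

```python
# Tokens copied from the output above
tokens = ["Welcome", "...", "Here", "is", "another", "text",
          "...", "for", "testing", "purposes", "!"]

def sentence_starts(tokens):
    # The first token always starts a sentence
    starts = {0}
    # Mirror set_rule: skip the last token, mark the token after each "..."
    for i, tok in enumerate(tokens[:-1]):
        if tok == "...":
            starts.add(i + 1)
    return sorted(starts)

def split_sentences(tokens):
    # Cut the token list at every marked start index
    bounds = sentence_starts(tokens) + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]

sentences = split_sentences(tokens)
for sent in sentences:
    print(sent)
```

The three groups correspond to the three sentences printed by `doc4.sents` above.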
