Tokenization - segmenting text into words, punctuation marks, etc.

Part-of-speech (POS) tagging - assigning word types to tokens, such as verb or noun

Dependency parsing - assigning syntactic dependency labels that describe the relations between tokens

Lemmatization - assigning the base form of a word, e.g. was -> be, rats -> rat

Sentence boundary detection - finding individual sentences

Named entity recognition (NER) - labelling named real-world objects, such as the company Samsung

Entity linking (EL) - disambiguating textual entities to unique identifiers in a knowledge base

Text classification - assigning categories or labels to a whole document or parts of it

Training - updating and improving a statistical model's predictions

Serialization - saving objects to files or byte strings

In [2]:
! pip install -U spacy

Requirement already up-to-date: spacy in c:\users\musingila\anaconda3\lib\site-packages (2.2.4)


In [3]:
! pip install -U spacy-lookups-data

Collecting spacy-lookups-data
  Downloading spacy_lookups_data-0.2.0.tar.gz (29.2 MB)
Building wheels for collected packages: spacy-lookups-data
  Building wheel for spacy-lookups-data (setup.py): started
  Building wheel for spacy-lookups-data (setup.py): finished with status 'done'
  Created wheel for spacy-lookups-data: filename=spacy_lookups_data-0.2.0-py2.py3-none-any.whl size=29164787 sha256=de4fbdbcbe40e1956d2d14f6288d628bee1de67b3720f8e958872dbc2f458573
  Stored in directory: c:\users\musingila\appdata\local\pip\cache\wheels\f6\da\3e\eb4d09aaca732f374c9ec075d8f5e13c6b33a7545403cc3b9d
Successfully built spacy-lookups-data
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-0.2.0


In [4]:
! python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py): started
  Building wheel for en-core-web-md (setup.py): finished with status 'done'
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051309 sha256=a148a8b7728d7aca97163100fdc4ca71b963db89238e7512fb836b99f63e9bd1
  Stored in directory: C:\Users\MUSING~1\AppData\Local\Temp\pip-ephem-wheel-cache-l1fs6x7d\wheels\69\c5\b8\4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
[+] Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


Why spaCy?
Processing raw text intelligently is difficult: most words are rarely used, and it's common for words that look completely different to mean almost the same thing.

spaCy is designed to take in raw text and give back a Doc object that comes with a variety of annotations.

### Tokenization

In [5]:
# Tokenization splits text into meaningful segments called tokens. Input: Unicode text; output: a Doc object
import spacy


In [6]:
nlp = spacy.load('en_core_web_md')

In [24]:
doc = nlp("Apple isn't looking at buying a U.K. startup for 1$ billion")

In [25]:
for token in doc:
    print(token.text)

Apple
is
n't
looking
at
buying
a
U.K.
startup
for
1
$
billion


### Part-of-Speech (POS) Tagging

In [26]:
doc

Apple isn't looking at buying a U.K. startup for 1$ billion

In [27]:
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
n't not
looking look
at at
buying buy
a a
U.K. U.K.
startup startup
for for
1 1
$ $
billion billion
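
The lemma column above maps each surface form to a base form and leaves words such as Apple unchanged. As a rough illustration of that mapping only, the sketch below uses a plain dictionary lookup with a fallback to the original text. It is a toy, not spaCy's lemmatizer: the names `TOY_LEMMAS` and `toy_lemma` are hypothetical, and the table entries are simply copied from the output above.

```python
# Toy lookup table: surface form -> base form.
# Entries are copied from the lemma output above; this is not spaCy's lemmatizer.
TOY_LEMMAS = {
    "is": "be",
    "n't": "not",
    "looking": "look",
    "buying": "buy",
}


def toy_lemma(text):
    """Return the table entry for text, or text itself when there is none."""
    return TOY_LEMMAS.get(text, text)


for word in ["Apple", "is", "n't", "looking", "startup"]:
    print(word, toy_lemma(word))
```

Words missing from the table fall through unchanged, which mirrors how Apple and startup keep their form in the output above.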


In [28]:
for token in doc:
    print(f'{token.text:{15}}  {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')

Apple            Apple           PROPN      False
is               be              AUX        True
n't              not             PART       True
looking          look            VERB       False
at               at              ADP        True
buying           buy             VERB       False
a                a               DET        True
U.K.             U.K.            PROPN      False
startup          startup         NOUN       False
for              for             ADP        True
1                1               NUM        False
$                $               SYM        False
billion          billion         NUM        False
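
The table above is aligned with format specs such as `{token.text:{15}}`. The inner `{15}` is a nested replacement field that evaluates to the column width, so it behaves like writing `:15` directly. The short sketch below, using a made-up `word` and `width`, shows the equivalent spellings side by side.

```python
# Three equivalent ways to left-align a string in a 15-character column.
word = "Apple"
width = 15

nested = f"{word:{15}}"       # the notebook's form: the nested field evaluates to 15
literal = f"{word:15}"        # the same width written directly in the format spec
variable = f"{word:{width}}"  # nested fields are most useful with a variable width

# Strings are left-aligned by default, so all three match str.ljust.
print(nested == literal == variable == word.ljust(width))
print(len(nested))
```

The nested form is mainly worth keeping when the width comes from a variable, for example the length of the longest token.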


### Dependency Parsing

In [34]:
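# Noun chunks are base noun phrases; chunk.root.dep_ is the dependency label of each chunk's root token
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{15}} {chunk.root.text:{15}} {chunk.root.dep_:{15}}')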

Apple           Apple           nsubj          
a U.K. startup  startup         dobj           


### Named Entity Recognition

In [38]:
for ent in doc.ents:
    print(f'{ent.text:{10}} {ent.label_}')

Apple      ORG
U.K.       GPE
1$ billion MONEY


### Sentence Segmentation

In [None]:
# doc.sents yields the sentences of a Doc as Span objects

In [39]:
doc 

Apple isn't looking at buying a U.K. startup for 1$ billion

In [41]:
for sent in doc.sents:
    print(sent)

Apple isn't looking at buying a U.K. startup for 1$ billion


In [42]:
doc1 = nlp('Welcome to my trap house. Thank you for coming please get high')

In [43]:
for sent in doc1.sents:
    print(sent)

Welcome to my trap house.
Thank you for coming please get high


In [52]:
doc1 = nlp('Welcome to.*.my trap house... Thank you for coming please get high')

In [54]:
for sent in doc1.sents:
    print(sent)

Welcome to.*.my trap house...
Thank you for coming please get high


In [73]:
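# Custom sentence-boundary rule: start a new sentence after every '...' token
def set_rule(doc):
    # Skip the last token because it has no following token to mark
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc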

In [74]:
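# Remove the copy of set_rule added by an earlier run of the next cell
# (raises ValueError if the component is not in the pipeline yet)
nlp.remove_pipe('set_rule')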

('set_rule', <function __main__.set_rule(doc)>)

In [75]:
# Add the rule to the pipeline before the parser so the parser respects the preset boundaries
nlp.add_pipe(set_rule, before='parser')

In [78]:
doc1 = nlp('Welcome to...my trap house... Thank you for coming please get high')

In [79]:
for sent in doc1.sents:
    print(sent)

Welcome to...
my trap house...
Thank you for coming please get high


In [80]:
for token in doc1:
    print(token.text)

Welcome
to
...
my
trap
house
...
Thank
you
for
coming
please
get
high
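
The logic of the custom rule can also be checked without loading a model. The sketch below applies the same idea, "start a new sentence after every `...` token", to a plain list of token strings copied from the output above. The helper name `split_after_ellipsis` is hypothetical, and the sketch covers only the custom rule, not the parser's own boundary decisions.

```python
def split_after_ellipsis(tokens):
    """Group token strings into sentences, closing a sentence after each '...'."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok == "...":
            # The token after an ellipsis starts a new sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep any tokens left over after the last ellipsis.
        sentences.append(current)
    return sentences


# Token strings copied from the token listing above.
tokens = ["Welcome", "to", "...", "my", "trap", "house", "...",
          "Thank", "you", "for", "coming", "please", "get", "high"]

for sentence in split_after_ellipsis(tokens):
    print(" ".join(sentence))
```

Joining with spaces does not reproduce the original whitespace (spaCy keeps that on each token), but the three groups correspond to the three sentences printed by `doc1.sents`.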


### Visualization

In [81]:
from spacy import displacy

In [82]:
doc

Apple isn't looking at buying a U.K. startup for 1$ billion

In [83]:
# Draw the dependency parse as an arc diagram
displacy.render(doc, style='dep')

In [84]:
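# compact mode uses square arrows that take up less space; distance is the spacing between words in pixels
displacy.render(doc, style='dep', options={'compact': True, 'distance': 100})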

In [85]:
# Highlight the named entities in the text
displacy.render(doc, style='ent')