In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [17]:
spacy.__version__

'3.0.6'

# Tokenization

Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a Unicode string, and the output is a Doc object.

A Doc is a sequence of Token objects, and we can iterate over it to access the individual tokens.

In [3]:
doc = nlp("Apple is looking at buyig U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buyig
U.K.
startup
for
$
1
billion


In [4]:
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
n't
looking
at
buyig
U.K.
startup
for
$
1
billion
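
Beyond its text, each token also records where it sits in the Doc and in the original string. The following is a minimal sketch of that idea. It assumes a blank English pipeline (`spacy.blank("en")`), which contains only the rule-based tokenizer, so no trained model is needed; the variable names and the corrected spelling "buying" are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: a blank English pipeline is enough here because only the
# rule-based tokenizer is used; no trained components are involved.
nlp_blank = spacy.blank("en")

text = "Apple isn't looking at buying U.K. startup for $1 billion"
doc_tok = nlp_blank(text)

tokens = [token.text for token in doc_tok]
print(len(doc_tok), tokens)

# token.i is the token's position in the Doc,
# token.idx is its character offset in the original string.
for token in doc_tok:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Tokenization is non-destructive: concatenating text_with_ws
# restores the original input exactly.
rebuilt = "".join(token.text_with_ws for token in doc_tok)
print(rebuilt == text)
```

Because the whitespace is stored on the tokens, the original text can always be recovered from a Doc, even though contractions such as "isn't" are split into two tokens.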


# Lemmatization

Closely related to tokenization, lemmatization is the process of reducing a word to its base form, also known as its dictionary form. This base form is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is useful because it collapses the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
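
To illustrate what "analyzed as a single item" can mean in practice, here is a small sketch that counts words by surface form and by lemma. It assumes the lemmas are supplied by hand through the `Doc` constructor so that it runs without a trained pipeline; with `en_core_web_sm` loaded, the lemmatizer would predict them instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# Assumption: lemmas are hand-annotated for this sketch. A trained
# pipeline such as en_core_web_sm would normally predict them.
vocab = spacy.blank("en").vocab
words = ["organizes", "organized", "organizing"]
lemmas = ["organize", "organize", "organize"]
doc_lemmas = Doc(vocab, words=words, lemmas=lemmas)

# Counting surface forms keeps the three inflections apart...
surface_counts = Counter(token.text for token in doc_lemmas)
# ...while counting lemmas groups them under one entry.
lemma_counts = Counter(token.lemma_ for token in doc_lemmas)

print(surface_counts)
print(lemma_counts)
```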

In [6]:
for token in doc:
    print(token.text, token.lemma_)

Apple Apple
is be
n't n't
looking look
at at
buyig buyig
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion


# Part-of-speech tagging

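Part-of-speech (POS) tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. The example below also prints token.is_stop, which flags whether the token is a common stop word.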

In [7]:
for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')

Apple           Apple           PROPN      False
is              be              AUX        True
n't             n't             PART       True
looking         look            VERB       False
at              at              ADP        True
buyig           buyig           NOUN       False
U.K.            U.K.            PROPN      False
startup         startup         NOUN       False
for             for             ADP        True
$               $               SYM        False
1               1               NUM        False
billion         billion         NUM        False
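
The tag abbreviations in the output above can be decoded with `spacy.explain`, which looks up a short description in spaCy's glossary. A minimal sketch, with the tag list copied by hand from that output:

```python
import spacy

# Coarse-grained POS tags that appear in the output above.
pos_tags = ["PROPN", "AUX", "PART", "VERB", "ADP", "NOUN", "SYM", "NUM"]

# spacy.explain returns a human-readable description for a known label.
descriptions = {tag: spacy.explain(tag) for tag in pos_tags}

for tag, description in descriptions.items():
    print(f"{tag:{8}} {description}")
```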


# Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents. The head of the sentence has no head of its own and is called the root of the sentence; it is usually the main verb. Every other word is attached, directly or indirectly, to the root.
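
The head and dependent relationship can be inspected through `token.head`, `token.dep_` and `token.children`. The sketch below assumes a hand-annotated toy sentence: the `heads` and `deps` values are written out manually so that it runs without a trained parser, and they are illustrative rather than the output of `en_core_web_sm`.

```python
import spacy
from spacy.tokens import Doc

# Assumption: the parse is hand-annotated for illustration only.
vocab = spacy.blank("en").vocab
words = ["Apple", "is", "looking", "at", "startups"]
heads = [2, 2, 2, 2, 3]  # index of each token's head; the root points to itself
deps = ["nsubj", "aux", "ROOT", "prep", "pobj"]
doc_dep = Doc(vocab, words=words, heads=heads, deps=deps)

# Print each token with its dependency label and its head.
for token in doc_dep:
    print(f"{token.text:{10}} {token.dep_:{6}} {token.head.text}")

# The root is the only token that is its own head.
root = [token for token in doc_dep if token.head.i == token.i][0]
root_children = [child.text for child in root.children]
print(root.text, root_children)
```

Here "startups" is not a direct child of the root: it hangs off the preposition "at", which in turn attaches to "looking".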

Noun chunks are “base noun phrases”: flat phrases that have a noun as their head. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

In [8]:
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')

Apple                          Apple           nsubj
buyig                          buyig           pobj
U.K.                           U.K.            nsubj


# Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

One common use of NER is to populate tags for a set of documents in order to improve keyword search. Named entities are available as the ents property of a Doc.

In [9]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
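
As a sketch of the tagging use case, the entities of a Doc can be grouped by label. The example below assumes the entity annotations are supplied by hand in IOB format through the `Doc` constructor, mirroring the output above, so it runs without a trained NER model; `tags_by_label` is a hypothetical name chosen for illustration.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Doc

# Assumption: entities are hand-annotated in IOB format instead of
# being predicted by a trained NER component.
vocab = spacy.blank("en").vocab
words = ["Apple", "is", "looking", "at", "buying", "U.K.",
         "startup", "for", "$", "1", "billion"]
# No space between "$" and "1", and none after the final token.
spaces = [True] * 8 + [False, True, False]
iob = ["B-ORG", "O", "O", "O", "O", "B-GPE",
       "O", "O", "B-MONEY", "I-MONEY", "I-MONEY"]
doc_ner = Doc(vocab, words=words, spaces=spaces, ents=iob)

# Group entity texts by label, e.g. to attach them to a document as tags.
tags_by_label = defaultdict(list)
for ent in doc_ner.ents:
    tags_by_label[ent.label_].append(ent.text)
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

print(dict(tags_by_label))
```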


In [10]:
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_get_array_attrs',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'se

# Sentence Segmentation

Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. spaCy uses the dependency parse to determine sentence boundaries. In spaCy, the sents property of a Doc is used to extract sentences.

In [11]:
for sent in doc.sents:
    print(sent)

Apple isn't looking at buyig U.K. startup for $1 billion


In [12]:
doc1 = nlp("Welcome to ML tutorials. Thanks for watching. Please like and subscribe")
for sent in doc1.sents:
    print(sent)

Welcome to ML tutorials.
Thanks for watching.
Please like and subscribe
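
When no dependency parse is available, spaCy also ships a rule-based `sentencizer` component that splits on sentence-final punctuation. The following is a minimal sketch on a blank English pipeline; it is an alternative to the parser-based segmentation used above, not the method the loaded model relies on.

```python
import spacy

# Assumption: a blank pipeline with only the rule-based sentencizer,
# so no trained parser is involved.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc_rules = nlp_rules("Welcome to ML tutorials. Thanks for watching. Please like and subscribe")
sentences = [sent.text for sent in doc_rules.sents]
print(sentences)
```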


In [13]:
doc1 = nlp("Welcome to.*.ML tutorials.*.Thanks for watching")
for sent in doc1.sents:
    print(sent)

Welcome to.*.ML tutorials.*.Thanks for watching


In the example above, sentence segmentation fails to detect the sentence boundaries because of the unusual delimiters. In such cases, we can write our own custom rules to detect sentence boundaries based on the delimiters.

In [27]:
text = 'Welcome to KGP Talkie...Thanks...Like and Subscribe!'
doc = nlp(text)
for sent in doc.sents:
    print(sent)

Welcome to KGP Talkie...
Thanks...Like and Subscribe!


In [36]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1b01d554c20>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1b01d1b49a0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1b01d3a3b80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1b01d55f3a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1b01e839940>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1b01e84a1c0>),
 ('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x1b02044b780>)]

In [40]:
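def set_rule(doc):
    # Mark the token that follows each '...' as the start of a new sentence.
    # doc[:-1] skips the last token, so doc[token.i + 1] is always valid.
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc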
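
A rule like this only takes effect once it runs as part of a pipeline. In spaCy 3, a function is registered with the `@Language.component` decorator and then added by name with `add_pipe`. The sketch below shows one way this could look. It assumes a blank English pipeline and a hypothetical component name, `ellipsis_boundaries`, so that it does not depend on the trained model; in a pipeline that includes a parser, such a component would typically be placed before the parser.

```python
import spacy
from spacy.language import Language


# Hypothetical component name; the rule itself mirrors the function above.
@Language.component("ellipsis_boundaries")
def ellipsis_boundaries(doc):
    # Start a new sentence after every '...' token.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc


# Assumption: a blank pipeline, so the custom rule is the only source
# of sentence boundaries.
nlp_custom = spacy.blank("en")
nlp_custom.add_pipe("ellipsis_boundaries")

doc_custom = nlp_custom("Welcome to KGP Talkie...Thanks...Like and Subscribe!")
custom_sentences = [sent.text for sent in doc_custom.sents]
print(custom_sentences)
```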