
This notebook is a guide to Natural Language Processing (NLP) using spaCy. Key topics include:
1. Introduction to spaCy and its linguistic annotations.
2. Processing text to create Doc objects for linguistic analysis.
3. Tokenization to break down text into words, punctuation marks, etc.
4. Sentence Detection for identifying sentence boundaries.
5. Customizing Tokenizer to modify the default tokenization process.
6. Handling Stop Words to filter out common words from the analysis.
7. Lemmatization to reduce words to their base or dictionary form.
8. Part-of-Speech (POS) Tagging for grammatical analysis of words.
9. Named-Entity Recognition (NER) to identify and categorize entities like names, places, etc.
10. Rule-Based Matching for finding sequences of tokens based on patterns.
11. Visualization with displaCy for rendering entities and dependency parses.

Each section below demonstrates these features with code examples and short explanations.


- The Doc Object for Processed Text

The spacy.load() function returns a callable Language object, which is commonly assigned to a variable called nlp.

To start processing your input, you call nlp on a string, which returns a Doc object. A Doc object is a sequence of Token objects, each representing one lexical token. Each Token object has information about a particular piece of text, typically one word.
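
As a minimal sketch of the "Doc as a sequence of tokens" idea, the snippet below indexes, slices, and iterates over a Doc. It assumes only that spaCy is installed and uses spacy.blank("en"), a tokenizer-only English pipeline, so that no trained model is required; the variable names are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to build a Doc
nlp = spacy.blank("en")
doc = nlp("This tutorial is about NLP in spaCy")

# A Doc behaves like a sequence of Token objects
print(len(doc))        # number of tokens
print(doc[1].text)     # text of the second token
print(doc[1:3].text)   # slicing returns a Span

# token.i is the token's position in the Doc, token.idx its character offset
for token in doc:
    print(token.i, token.idx, token.text)
```

Linguistic attributes such as lemmas, POS tags, and entities are only filled in by a trained pipeline like en_core_web_sm, which is what the cells below load.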

In [2]:
# spaCy Tutorial Summary
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp)

<spacy.lang.en.English object at 0x000001B3AC904110>


In [3]:
# Processing Text
doc = nlp(
    "This tutorial is about NLP in Spacy"
)

# Use a name other than `type` so the built-in is not shadowed
doc_type = type(doc)
print(doc_type)

# Tokenization
tokens = [token.text for token in doc]
print(tokens)

# Reading in data, for example:
# import pathlib
# file_name = "introduction.txt"
# introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
# print([token.text for token in introduction_doc])



<class 'spacy.tokens.doc.Doc'>
['This', 'tutorial', 'is', 'about', 'NLP', 'in', 'Spacy']


--Advanced Features--

* Sentence Detection
* Customizing Tokenizer
* Handling Stop Words
* Lemmatization
* Part-of-Speech Tagging
* Named-Entity Recognition
* Rule-Based Matching
* Visualization with displaCy





Sentence Detection

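In spaCy, the .sents property is used to extract sentences from the Doc object. It yields one Span per sentence and requires a pipeline component that sets sentence boundaries, such as the dependency parser in en_core_web_sm.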
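
A hedged sketch of sentence detection without a trained model is shown below. It assumes the spaCy v3 API, where the built-in rule-based "sentencizer" component is added by name; this component splits on punctuation and may place boundaries differently from the parser used by en_core_web_sm.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by their string name
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based splitter that relies on punctuation

text = (
    "Gus Proto is a Python developer currently working for a "
    "London-based Fintech company. He is interested in learning "
    "Natural Language Processing."
)
doc = nlp(text)

# doc.sents is a generator of Span objects, one per sentence
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text)
```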

In [4]:
#Sentence Detection
sentences = list(doc.sents)
print(sentences)

about_text = ("Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.")

about_doc = nlp(about_text)

about_text_sentences = list(about_doc.sents)
print("sentence # =", len(about_text_sentences))


[This tutorial is about NLP in Spacy]
sentence # = 2


Tokens in spaCy

In [5]:
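# Customizing Tokenizer
import spacy

# Check the character offset (token.idx) of each token in the doc
# for token in about_doc:
#     print(token, token.idx)

from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

def custom_tokenizer(nlp):
    # Treat "@" as an extra infix; list() guards against infixes being a tuple
    infix_re = compile_infix_regex(list(nlp.Defaults.infixes) + [r"@"])
    # Reuse the existing exceptions and prefix/suffix rules so that
    # only the infix handling changes
    return Tokenizer(
        nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=nlp.tokenizer.prefix_search,
        suffix_search=nlp.tokenizer.suffix_search,
        infix_finditer=infix_re.finditer,
    )

nlp.tokenizer = custom_tokenizer(nlp)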
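
To see what the extra "@" infix changes, the sketch below tokenizes the same string before and after swapping the tokenizer. It assumes a blank English pipeline (only the tokenizer defaults are needed), and the sample string is hypothetical.

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

# Assumption: a blank English pipeline, used only for its tokenizer defaults
nlp = spacy.blank("en")
sample = "for a London@based Fintech company"  # hypothetical input

default_tokens = [token.text for token in nlp(sample)]

# Rebuild the tokenizer with the default rules plus "@" as an extra infix
infix_re = compile_infix_regex(list(nlp.Defaults.infixes) + [r"@"])
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
    infix_finditer=infix_re.finditer,
)
custom_tokens = [token.text for token in nlp(sample)]

print(default_tokens)
print(custom_tokens)
```

With the custom tokenizer, "London@based" is split into three tokens around the "@" sign. Documents created before the swap, such as doc above, keep their original tokenization.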


In [6]:
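# Handling Stop Words
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS  # the default set of English stop words
# Keep only the tokens that are not stop words
filtered_tokens = [token for token in doc if not token.is_stop]
print(filtered_tokens)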


[tutorial, NLP, Spacy]
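
The sketch below looks at the stop-word list itself and separates a Doc into stop words and content words. It assumes a blank English pipeline, on the understanding that is_stop is a lexical attribute derived from the language defaults rather than from a trained model.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Assumption: is_stop comes from the language defaults, so no model is needed
nlp = spacy.blank("en")
doc = nlp("This tutorial is about NLP in spaCy")

print(len(STOP_WORDS))          # size of the default English stop list
print(sorted(STOP_WORDS)[:5])   # a few entries, in alphabetical order

# Split the tokens into stop words and content words
stop_tokens = [token.text for token in doc if token.is_stop]
content_tokens = [token.text for token in doc if not token.is_stop]
print(stop_tokens)
print(content_tokens)
```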


In [7]:
#Lemmatization

lemmas = [token.lemma_ for token in doc]
print(lemmas)

['this', 'tutorial', 'be', 'about', 'NLP', 'in', 'Spacy']


Part of speech (POS) is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:

    Noun
    Pronoun
    Adjective
    Verb
    Adverb
    Preposition
    Conjunction
    Interjection

Relevant attributes and helpers:

    1) tag_ displays a fine-grained tag.
    2) pos_ displays a coarse-grained tag, which is a reduced version of the fine-grained tags.
    3) spacy.explain() is a function that returns a short description of a particular POS tag, which can be a valuable reference tool.
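
As a small illustration of spacy.explain(), the sketch below looks up a few coarse-grained tags plus one fine-grained tag (VBZ). The lookup reads spaCy's built-in glossary, so no pipeline needs to be loaded; the unknown label at the end is made up for the example.

```python
import spacy

# Coarse-grained tags plus one fine-grained tag (VBZ)
for tag in ["DET", "NOUN", "AUX", "ADP", "PROPN", "VBZ"]:
    print(f"{tag:<6}{spacy.explain(tag)}")

# A label missing from the glossary yields None rather than raising an error
# (recent spaCy versions may also emit a warning)
print(spacy.explain("NOT_A_REAL_TAG"))
```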

    


In [10]:
#Part-of-Speech Tagging

pos_tags = [(token.text, token.pos_) for token in doc]
print(pos_tags)


# Visual POS: use displaCy
from spacy import displacy

# Shows how the sentence is built up, as a dependency parse with POS tags.
# displacy.serve() starts a local web server for the browser and blocks
# until it is stopped manually.
# displacy.serve(doc, style="dep")


[('This', 'DET'), ('tutorial', 'NOUN'), ('is', 'AUX'), ('about', 'ADP'), ('NLP', 'PROPN'), ('in', 'ADP'), ('Spacy', 'PROPN')]



Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [8]:
#Named-Entity Recognition

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)



[('NLP', 'ORG')]


In [9]:
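# Rule-Based Matching

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Match "hello", one punctuation token, then "world" (case-insensitive)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

# The doc above contains no such sequence, so nothing is printed for it
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)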
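
Because the tutorial sentence contains no "hello world" sequence, the sketch below runs the same pattern on a hypothetical text that does. It assumes a blank English pipeline, since LOWER and IS_PUNCT only depend on the tokenizer.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: LOWER and IS_PUNCT are lexical, so a blank pipeline suffices
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

# Hypothetical text that contains the pattern
doc = nlp("Hello, world! This tutorial is about NLP in spaCy")

matched = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # map the hash back to the rule name
    span = doc[start:end]
    matched.append((rule_name, span.text))
    print(rule_name, start, end, span.text)
```

Each match is a (match_id, start, end) triple of token indices, so doc[start:end] recovers the matched Span.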



In [10]:
#Visualization with displaCy

from spacy import displacy

displacy.render(doc, style="ent")
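
Outside a notebook, displacy.render() can return the markup as a string instead of displaying it. The sketch below is one way to do that under stated assumptions: it uses a blank pipeline and sets the "NLP" entity by hand to mirror the NER output above, because a blank pipeline has no entity recognizer.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("This tutorial is about NLP in spaCy")

# Assumption: label the token "NLP" (index 4) as ORG manually,
# standing in for what a trained NER component would predict
doc.ents = [Span(doc, 4, 5, label="ORG")]

# jupyter=False makes render() return the HTML instead of displaying it;
# page=True wraps the markup in a complete HTML page
html = displacy.render(doc, style="ent", jupyter=False, page=True)
print(html[:200])
```

The returned string can be written to an .html file and opened in a browser, which avoids running the blocking displacy.serve() server.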
