In [1]:
import spacy
# loading en_core_web_sm and creating an nlp object
nlp = spacy.load('en_core_web_sm')


text = "NLP is becoming increasingly popular for providing business solutions."
# Creating a Doc container for the text object
doc = nlp(text)

# Creating a list containing the text of each token in the Doc container
print([token.text for token in doc])

['NLP', 'is', 'becoming', 'increasingly', 'popular', 'for', 'providing', 'business', 'solutions', '.']


### spaCy NLP pipeline

- Use `spacy.load()` to return `nlp`, a `Language` object
- The `Language` object is the text processing pipeline
- Call `nlp()` on any text to get a `Doc` container
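
The points above can be checked with a minimal sketch. It assumes spaCy v3 and uses `spacy.blank("en")` (a tokenizer-only English pipeline) in place of `en_core_web_sm`, so that no model download is needed; that substitution is an assumption for illustration only.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Assumption: a blank English pipeline stands in for a trained model here
nlp = spacy.blank("en")

# Calling the Language object on text returns a Doc container
doc = nlp("NLP is becoming increasingly popular.")

print(isinstance(nlp, Language))  # the pipeline object is a Language instance
print(isinstance(doc, Doc))       # the result of nlp(text) is a Doc
print([token.text for token in doc])
```

With a trained pipeline loaded through `spacy.load()`, the same checks apply; only the available annotations differ.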

### Container objects in spaCy
- `Doc` - A container for accessing linguistic annotations of text
- `Span` - A slice from a `Doc` object
- `Token` - An individual token, i.e. a word, punctuation, whitespace, etc.
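
A short sketch of how the three containers relate, assuming spaCy v3 and a blank English pipeline (used here only to avoid a model download): indexing a `Doc` with an integer gives a `Token`, and slicing it gives a `Span`.

```python
import spacy
from spacy.tokens import Doc, Span, Token

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline for illustration
doc = nlp("NLP is becoming increasingly popular.")

token = doc[0]   # integer index -> Token
span = doc[2:5]  # slice -> Span (token positions 2, 3 and 4)

print(type(doc).__name__, type(span).__name__, type(token).__name__)
print("Token text:", token.text)
print("Span text:", span.text)
```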

### Pipeline components

- Tokenizer
- Tagger
- Lemmatizer
- EntityRecognizer
- Language (the class that holds and runs the pipeline, rather than a component inside it)
- DependencyParser
- Sentencizer
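
To see which components a pipeline actually contains, `nlp.pipe_names` can be inspected. The sketch below assumes spaCy v3 (where `add_pipe` takes a component name as a string) and starts from a blank English pipeline, so the list is empty until a component is added. The tokenizer runs first but is stored separately in `nlp.tokenizer` and does not appear in `pipe_names`.

```python
import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline, no trained components
print("Before:", nlp.pipe_names)

# Add the rule-based Sentencizer by its registered name (spaCy v3 style)
nlp.add_pipe("sentencizer")
print("After:", nlp.pipe_names)
```

For a trained pipeline such as `en_core_web_sm`, `nlp.pipe_names` lists the components that ship with that model.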

### Sentence segmentation

- More complex than tokenization
- In trained pipelines, handled by the `DependencyParser` component

In [2]:
text = "We are learning NLP. This course introduces spaCy."
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)

We are learning NLP.
This course introduces spaCy.
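
When no dependency parser is available, the rule-based `Sentencizer` listed among the pipeline components can set sentence boundaries instead. This is a minimal sketch under that assumption, using spaCy v3 and a blank English pipeline; it splits on sentence-final punctuation and is not the parser-based method used in the cell above.

```python
import spacy

nlp = spacy.blank("en")      # assumption: no trained parser in this pipeline
nlp.add_pipe("sentencizer")  # punctuation-based sentence boundaries

doc = nlp("We are learning NLP. This course introduces spaCy.")
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```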


### Lemmatization

- A lemma is the base form of a token, e.g. the lemma of `seeing` is `see`

In [3]:
doc = nlp('We are seeing her after one year.')
print([(token.text, token.lemma_) for token in doc])

[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'), ('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]


### Linguistic features in spaCy

#### POS tagging
- Categorizing words grammatically, based on function and context within a sentence

#### Named Entity Recognition
- A named entity is a word or phrase that refers to a specific entity with a name
- Named-entity recognition (NER) classifies named entities into pre-defined categories
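
The POS tags and entity labels are short codes; `spacy.explain()` returns a human-readable description for each. The sketch below needs no trained model. The particular list of labels is chosen for illustration.

```python
import spacy

# A few POS tags and entity labels (selection chosen for illustration)
labels = ["PROPN", "AUX", "GPE", "TIME", "MONEY"]

# Map each label to its glossary description
descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(label, "->", description)
```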

### displaCy

- The displaCy entity visualizer highlights named entities and their labels

In [4]:
from spacy import displacy

text = 'Albert Einstein was a genius.'
doc = nlp(text)  # reuses the nlp object loaded in In [1]
# displacy.serve() starts a local web server and blocks until it is interrupted
displacy.serve(doc, style='ent')




Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [5]:
displacy.render(doc, style='ent')
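
Outside a notebook, `displacy.render()` can return the markup as a string instead of displaying it. The sketch below assumes spaCy v3 and a blank English pipeline; because a blank pipeline has no entity recognizer, the `PERSON` span is set by hand (an assumption for illustration, not a model prediction).

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: no trained NER component
doc = nlp("Albert Einstein was a genius.")

# Hand-labelled entity covering tokens 0 and 1 ("Albert Einstein")
doc.ents = [Span(doc, 0, 2, label="PERSON")]

# jupyter=False makes render() return the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```

The returned string can be written to an `.html` file and opened in a browser, which avoids the blocking server started by `displacy.serve()`.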

In [6]:
texts = ['What is the arrival time in San francisco for the 7:55 AM flight leaving Washington?',
 'Cheapest airfare from Tacoma to Orlando is 650 dollars.',
 'Round trip fares from Pittsburgh to Philadelphia are under 1000 dollars!']

In [7]:
### POS tagging

documents = [nlp(text) for text in texts]

for doc in documents:
    for token in doc:
        print("text: ", token.text, "| POS tag: ", token.pos_, "| POS explanation: ", spacy.explain(token.pos_))
        print('\n')

text:  What | POS tag:  PRON | POS explanation:  pronoun


text:  is | POS tag:  AUX | POS explanation:  auxiliary


text:  the | POS tag:  DET | POS explanation:  determiner


text:  arrival | POS tag:  NOUN | POS explanation:  noun


text:  time | POS tag:  NOUN | POS explanation:  noun


text:  in | POS tag:  ADP | POS explanation:  adposition


text:  San | POS tag:  PROPN | POS explanation:  proper noun


text:  francisco | POS tag:  PROPN | POS explanation:  proper noun


text:  for | POS tag:  ADP | POS explanation:  adposition


text:  the | POS tag:  DET | POS explanation:  determiner


text:  7:55 | POS tag:  NUM | POS explanation:  numeral


text:  AM | POS tag:  PROPN | POS explanation:  proper noun


text:  flight | POS tag:  NOUN | POS explanation:  noun


text:  leaving | POS tag:  VERB | POS explanation:  verb


text:  Washington | POS tag:  PROPN | POS explanation:  proper noun


text:  ? | POS tag:  PUNCT | POS explanation:  punctuation


text:  Cheape

In [8]:
# Named entities via doc.ents, then the entity type of a single token via token.ent_type_

documents = [nlp(text) for text in texts]

for doc in documents:
    print([(ent.text, ent.label_) for ent in doc.ents])
    
print("\ntext:", documents[1][5].text, "| Entity type: ", documents[1][5].ent_type_)

[('San francisco', 'GPE'), ('7:55 AM', 'TIME'), ('Washington', 'GPE')]
[('Tacoma', 'GPE'), ('Orlando', 'GPE'), ('650 dollars', 'MONEY')]
[('Pittsburgh', 'GPE'), ('Philadelphia', 'GPE')]

text: Orlando | Entity type:  GPE


In [9]:
# Create a list to store the sentences of each Doc container in documents
sentences = [list(doc.sents) for doc in documents]

# Create a list to track the number of sentences per Doc container in documents
num_sentences = [len(doc_sents) for doc_sents in sentences]
print("Number of sentences in documents:\n", num_sentences, "\n")

# Record the text and label of each entity in the third Doc container
third_text_entities = [(ent.text, ent.label_) for ent in documents[2].ents]

print("Third text entities:\n", third_text_entities, "\n")

Number of sentences in documents:
 [1, 1, 1] 

Third text entities:
 [('Pittsburgh', 'GPE'), ('Philadelphia', 'GPE')] 
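
When several texts are processed, `nlp.pipe()` streams them through the pipeline as a batch, which is generally more efficient than calling `nlp()` in a list comprehension. This is a minimal sketch assuming spaCy v3. It uses a blank English pipeline with the rule-based `sentencizer` rather than `en_core_web_sm`, so it reproduces only the sentence counts, not the entities.

```python
import spacy

nlp = spacy.blank("en")      # assumption: stand-in for en_core_web_sm
nlp.add_pipe("sentencizer")  # provides doc.sents without a parser

texts = ['What is the arrival time in San francisco for the 7:55 AM flight leaving Washington?',
         'Cheapest airfare from Tacoma to Orlando is 650 dollars.',
         'Round trip fares from Pittsburgh to Philadelphia are under 1000 dollars!']

# nlp.pipe() yields one Doc per input text, in order
documents = list(nlp.pipe(texts))
num_sentences = [len(list(doc.sents)) for doc in documents]
print("Number of sentences in documents:", num_sentences)
```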



### Linguistic Features