### Using pre-trained and readily available spaCy models for **Advanced Processing**
The cell below shows tokenization, lemmatization, POS tagging, and several other steps in action.

Simple processing, by contrast, covers lowercasing, punctuation removal, stemming, and lemmatization.
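As a rough illustration of the simple end of that spectrum, here is a minimal sketch that uses only the Python standard library. The function name `simple_preprocess` is hypothetical, and the sketch covers only lowercasing and punctuation removal; stemming and lemmatization need a library (such as spaCy, used in the next cell) and are left out here.

```python
import string

def simple_preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # str.translate with a deletion table removes every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

tokens = simple_preprocess("Hello, world! Hello world!")
print(tokens)
```

Whitespace splitting is a crude stand-in for a real tokenizer; it is shown only to contrast with the model-driven pipeline that follows.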

In [10]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Charles Spencer Chaplin was born on 16 April 1889 toHannah Chaplin')
for token in doc:
    print(token.text, ': ', token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop, '\n')

Charles :  Charles PROPN Xxxxx True False 

Spencer :  Spencer PROPN Xxxxx True False 

Chaplin :  Chaplin PROPN Xxxxx True False 

was :  be AUX xxx True True 

born :  bear VERB xxxx True False 

on :  on ADP xx True True 

16 :  16 NUM dd False False 

April :  April PROPN Xxxxx True False 

1889 :  1889 NUM dddd False False 

toHannah :  toHannah PROPN xxXxxxx True False 

Chaplin :  Chaplin PROPN Xxxxx True False 



### Start with Simple Heuristics
A popular approach to incorporating heuristics in your system is using regular expressions. Let’s say we’re developing a system to extract different forms of information from text documents, such as dates, phone numbers, and the names of people who work in a given organization. Some of this information, such as email IDs, dates, and telephone numbers, can be extracted using normal (albeit complex) regular expressions.
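To make this concrete, here is a minimal sketch of regex-based extraction with Python's built-in `re` module. The three patterns, the `extract_entities` helper, and the sample text are all illustrative assumptions: they handle only one simple format each (for example, dates written as "16 April 1889" and phone numbers written as "555-123-4567") and would need to be extended for real documents.

```python
import re

# Illustrative patterns only; each covers a single, simple format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DATE_RE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December) \d{4}\b"
)
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def extract_entities(text):
    """Return every email, date, and phone number the patterns find in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

# Hypothetical sample text, not taken from any real document
sample = (
    "Charles Spencer Chaplin was born on 16 April 1889. "
    "Contact: info@example.com or 555-123-4567."
)
result = extract_entities(sample)
print(result)
```

Patterns like these grow complicated quickly as more formats are supported, which is one reason to consider token-level matching instead of raw character patterns.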

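spaCy's rule-based matching goes a step further: instead of matching raw characters, it lets you define regex-like patterns over tokens and their attributes (such as lowercase form or punctuation flags): https://spacy.io/usage/rule-based-matching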

In [13]:
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello, world! Hello world!")  # the third "Hello world" has no punctuation, so it does not match
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

15578876784678163569 HelloWorld 0 3 Hello, world
15578876784678163569 HelloWorld 4 7 Hello, world
