Code from https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

In [1]:
import spacy
import textacy.extract

In [2]:
text = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
"""

In [3]:
# tip: download the model first with `python -m spacy download en_core_web_sm`
nlp = spacy.load('en_core_web_sm')

In [4]:
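# typical NLP pipeline (conceptual steps from the article; not every step
# is a component of en_core_web_sm, e.g. coreference resolution is not included):
# * sentence segmentation
# * tokenization
# * part-of-speech tagging
# * lemmatization
# * stop words
# * dependency parsing
# * named entity recognition
# * coreference resolution

doc = nlp(text)  # runs every component of the loaded pipeline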

for entity in doc.ents:
    print(f'"{entity.text}" ({entity.label_})')

"London" (GPE)
"England" (GPE)
"
" (GPE)
"the United Kingdom" (GPE)
" " (ORDINAL)
"the River Thames" (ORG)
"
" (GPE)
"Great Britain" (GPE)
"London" (GPE)
"
" (GPE)
"two" (CARDINAL)
"Romans" (NORP)
"Londinium" (PERSON)
"
" (GPE)

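The newline and lone-space "entities" in the output above most likely come from the hard line breaks and double spaces in `text`, which spaCy keeps as whitespace tokens. A minimal sketch of one workaround, assuming it is acceptable to collapse whitespace before calling `nlp` (`normalize_whitespace` is a hypothetical helper, not part of spaCy):

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = """London is the capital and most populous city of England and 
the United Kingdom.  Standing on the River Thames in the south east 
of the island of Great Britain, London has been a major settlement 
for two millennia. It was founded by the Romans, who named it Londinium.
"""

clean = normalize_whitespace(raw)
print(clean)
# clean can then be passed to the pipeline in place of text: doc = nlp(clean)
```

Whether the entity labels themselves improve after this cleanup depends on the model, so it is worth re-running the entity loop to check.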

Suppose we need to redact every person's name from a set of documents...

In [5]:
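def replace_name_with_placeholder(token):
    # ent_iob is 0 when no entity tag is set; only PERSON entities are redacted
    if token.ent_iob != 0 and token.ent_type_ == 'PERSON':
        return '[REDACTED]'

    # text_with_ws replaces token.string, which spaCy 3 removed
    return token.text_with_ws


def scrub(text):
    doc = nlp(text)
    # merge each entity into a single token so a full name yields one placeholder
    # (doc.retokenize() replaces ent.merge(), which spaCy 3 removed)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)

    tokens = map(replace_name_with_placeholder, doc)
    return ''.join(tokens)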

In [6]:
s = """
In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s 
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
"""

print(s)
print('---- ##### ----')
print(scrub(s))


In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence". In 1957, Noam Chomsky’s 
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.

---- ##### ----

In 1950, [REDACTED]published his famous article "Computing Machinery and Intelligence". In 1957, [REDACTED]
Syntactic Structures revolutionized Linguistics with 'universal grammar', a rule based system of syntactic structures.
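
In the scrubbed output above, `[REDACTED]` runs into the next word because the placeholder replaces the token's text together with its trailing whitespace. A minimal sketch of one fix, using plain tuples as hypothetical stand-ins for spaCy tokens (in spaCy itself the corresponding attributes would be `token.text`, `token.whitespace_`, and `token.ent_type_`):

```python
# Each tuple mimics one (merged) token: (text, trailing whitespace, entity type).
# The values are illustrative stand-ins, not real spaCy output.
tokens = [
    ("In", " ", ""),
    ("1950", "", "DATE"),
    (",", " ", ""),
    ("Alan Turing", " ", "PERSON"),
    ("published", " ", ""),
    ("his", " ", ""),
    ("famous", " ", ""),
    ("article", "", ""),
    (".", "", ""),
]


def redact(text, whitespace, ent_type):
    """Swap PERSON tokens for a placeholder but keep their trailing whitespace."""
    if ent_type == "PERSON":
        return "[REDACTED]" + whitespace
    return text + whitespace


scrubbed = "".join(redact(*token) for token in tokens)
print(scrubbed)
```

With real tokens, the same idea would be to return `'[REDACTED]' + token.whitespace_` from `replace_name_with_placeholder`.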



What if we want some facts about a subject?

In [7]:
text = """London is the capital and most populous city of England and  the United Kingdom.  
Standing on the River Thames in the south east of the island of Great Britain, 
London has been a major settlement  for two millennia.  It was founded by the Romans, 
who named it Londinium.
"""

doc = nlp(text)

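# positional form from older textacy releases; newer releases appear to expect
# keyword arguments instead, e.g. entity='London', cue='be'
statements = textacy.extract.semistructured_statements(doc, 'London')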

print('Here are the things I know about London')

for statement in statements:
    (subject, verb, fact) = statement
    print(f' - {fact}')

Here are the things I know about London
 - the capital and most populous city of England and  the United Kingdom.  

 - a major settlement  for two millennia.  


What if we wanted some kind of autocompletion, like Google's? The multi-word noun chunks in the text are a simple source of suggestions...

In [14]:
text = """London is the capital and most populous city of England and  the United Kingdom.  
Standing on the River Thames in the south east of the island of Great Britain, 
London has been a major settlement  for two millennia.  It was founded by the Romans, 
who named it Londinium.
"""

doc = nlp(text)

# if using a bigger text, increase min_freq
noun_chunks = textacy.extract.noun_chunks(doc, min_freq=1)

noun_chunks = map(str, noun_chunks)
noun_chunks = map(str.lower, noun_chunks)

for noun_chunk in set(noun_chunks):
    if len(noun_chunk.split(' ')) > 1:
        print(noun_chunk)

major settlement
united kingdom
most populous city
two millennia
great britain
south east
river thames
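
One way to turn these noun chunks into suggestions is plain prefix matching. A minimal sketch, assuming the phrases printed above are the whole vocabulary (`suggest` is a hypothetical helper; a real autocompleter would probably also rank candidates by frequency):

```python
from bisect import bisect_left

# Phrases copied from the output above, sorted so that all phrases sharing
# a prefix sit next to each other.
phrases = sorted([
    "major settlement",
    "united kingdom",
    "most populous city",
    "two millennia",
    "great britain",
    "south east",
    "river thames",
])


def suggest(prefix, vocabulary=phrases, limit=5):
    """Return up to `limit` phrases from the sorted vocabulary that start with `prefix`."""
    prefix = prefix.lower()
    # Binary search for the first phrase that is >= prefix.
    start = bisect_left(vocabulary, prefix)
    matches = []
    for phrase in vocabulary[start:]:
        if not phrase.startswith(prefix):
            break  # sorted order: no later phrase can match either
        matches.append(phrase)
        if len(matches) == limit:
            break
    return matches


print(suggest("m"))
print(suggest("Gr"))
```

Sorting once and using binary search keeps each lookup cheap even when the vocabulary comes from a much larger text.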
