<a href="https://colab.research.google.com/github/dk-wei/nlp-algo-implementation/blob/main/NLP_Fundamentals_Toolkit_(spaCy_%2B_NLTK).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Reference 1: [Python for NLP: Tokenization, Stemming, and Lemmatization with SpaCy Library](https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/)  
Reference 2: [Getting started in Natural Language Processing with spaCy](https://blog.devgenius.io/getting-started-in-natural-language-processing-with-spacy-part-1-5026748cadc2)  
Reference 3: [https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/](https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/)

This document covers best practices for the basic preprocessing steps in NLP.




# spaCy Intro

Let's start with the heart of spaCy: its statistical models.

These models power spaCy. They enable it to perform NLP tasks such as part-of-speech tagging, named entity recognition, and dependency parsing.

The English statistical models and their specifications are listed below:

- `en_core_web_sm`: English multi-task CNN trained on OntoNotes. Size – 11 MB
- `en_core_web_md`: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
- `en_core_web_lg`: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB

Loading a model is straightforward: call `spacy.load('model_name')`, as shown below:

In [180]:
import spacy
nlp = spacy.load('en_core_web_sm')

These statistical models include the following pipeline components:

![](https://miro.medium.com/max/1400/1*w4qkY84JfG5h2ChhR8SKnA.png)

In [181]:
nlp.pipeline, nlp.pipe_names

([('tagger', <spacy.pipeline.pipes.Tagger at 0x7f5604372610>),
  ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f560543d9f0>),
  ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f560543d830>)],
 ['tagger', 'parser', 'ner'])

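We can also disable some components to speed things up (the cell below was not executed in this run, so the `pipe_names` output that follows still lists all three components):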

In [None]:
nlp.disable_pipes('tagger', 'parser')

In [182]:
nlp.pipe_names

['tagger', 'parser', 'ner']
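
Conceptually, a pipeline is an ordered list of named components, each of which receives the document and returns it with more annotations attached. The sketch below is a plain-Python toy, not spaCy's API: `run_pipeline` and the three component functions are hypothetical stand-ins, and the "ner" step is a placeholder that simply picks title-cased tokens. It only illustrates why skipping components saves work.

```python
def tagger(doc):
    # Placeholder: attach one dummy tag per token.
    doc["tags"] = ["TAG"] * len(doc["tokens"])
    return doc

def parser(doc):
    # Placeholder: attach one dummy dependency label per token.
    doc["deps"] = ["DEP"] * len(doc["tokens"])
    return doc

def ner(doc):
    # Placeholder: treat title-cased tokens as "entities".
    doc["ents"] = [tok for tok in doc["tokens"] if tok.istitle()]
    return doc

# Ordered (name, component) pairs, mirroring the shape of nlp.pipeline.
PIPELINE = [("tagger", tagger), ("parser", parser), ("ner", ner)]

def run_pipeline(text, disable=()):
    """Run every component in order, skipping the names listed in disable."""
    doc = {"tokens": text.split()}
    for name, component in PIPELINE:
        if name in disable:
            continue
        doc = component(doc)
    return doc

full = run_pipeline("Apple builds a factory")
ner_only = run_pipeline("Apple builds a factory", disable=("tagger", "parser"))
print(sorted(full))
print(sorted(ner_only))
```

With two of the three components disabled, only the `ents` annotation is computed, which is the effect the `disable` argument is used for in the rest of this notebook.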

Now let's look at the individual tasks:

# [Stopwords](https://stackabuse.com/removing-stop-words-from-strings-in-python/#usingpythonsnltklibrary)

For details, see: [Removing Stop Words from Strings in Python](https://stackabuse.com/removing-stop-words-from-strings-in-python/#usingpythonsnltklibrary)

## NLTK version

In [128]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [126]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [133]:
stopwords.words('english')[:10]   # without 'english', stop words for every language are returned

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [130]:
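text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)

# build the English stop-word set once; stopwords.words() with no argument covers every language
english_stopwords = set(stopwords.words('english'))
tokens_without_sw = [word for word in text_tokens if word not in english_stopwords]

print(tokens_without_sw)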

['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']
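
The filter above looks each token up in a stop-word list. Here is a minimal sketch of the same idea without NLTK, assuming a tiny hand-written stop-word set (hypothetical and far shorter than NLTK's list) and a naive regex tokenizer in place of `word_tokenize`. It shows the two habits that usually matter: build the set once, and lowercase tokens before the lookup.

```python
import re

# Hypothetical mini stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"to", "he", "is", "not", "too", "of"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split text into word and punctuation tokens and drop stop words."""
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Set membership is a constant-time lookup; compare lowercased so "He" is removed too.
    return [tok for tok in tokens if tok.lower() not in stop_words]

filtered = remove_stop_words(
    "Nick likes to play football, however he is not too fond of tennis."
)
print(filtered)
```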


## spaCy version

In [153]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])   # disabling unused components speeds up processing

In [154]:
list(nlp.Defaults.stop_words)[:10]

['get',
 '‘s',
 'two',
 'else',
 'do',
 'nevertheless',
 'over',
 '‘re',
 'too',
 'former']

In [155]:
nlp.vocab['myself'].is_stop

True

### To add a stop word

In [156]:
len(nlp.Defaults.stop_words)

329

In [157]:
nlp.Defaults.stop_words.add('name_one')

In [158]:
len(nlp.Defaults.stop_words)

329

In [159]:
nlp.vocab['name_one'].is_stop

True

### To remove a stop word

In [160]:
len(nlp.Defaults.stop_words)

329

In [161]:
nlp.Defaults.stop_words.remove('name_one')

In [162]:
len(nlp.Defaults.stop_words)

328

In [163]:
nlp.vocab['name_one'].is_stop

False
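
`nlp.Defaults.stop_words` is an ordinary Python `set`, so the usual set semantics apply: adding a word that is already present leaves the size unchanged, and `remove` raises `KeyError` for a missing word. The sketch below uses a three-word stand-in set (illustrative only) to show this behaviour.

```python
# Stand-in for nlp.Defaults.stop_words; the contents are illustrative only.
stop_words = {"the", "and", "name_one"}

size_before = len(stop_words)
stop_words.add("name_one")        # already present, so the set does not grow
size_after_add = len(stop_words)

stop_words.discard("name_one")    # discard() never raises, unlike remove()
stop_words.discard("name_one")    # a second call is a harmless no-op
size_after_discard = len(stop_words)

print(size_before, size_after_add, size_after_discard)
```

This is one plausible reason a count can stay at 329 after `add`, as in the output above: if the cell had been run before, `name_one` was already in the set.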

# Tokenization

In [177]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])   # disabling unused components speeds up processing
doc = nlp('did displaying words testing') # spaCy convention: pass the text through nlp first, then work with the returned Doc

In [87]:
# make sure you have downloaded the English model with "python -m spacy download en_core_web_sm"

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.")
print(doc.text)

Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.


In [97]:
print(doc[2])

oranges


In [88]:
for token in doc:
  # the text must first go through nlp to become a Doc container; lemmas are then read from its tokens
  #print(token, token.lemma, token.lemma_)
  print(token, '--->', token.lemma_, '--->', token.pos_,  '--->',token.dep_)

Apples ---> apple ---> NOUN ---> nsubj
and ---> and ---> CCONJ ---> cc
oranges ---> orange ---> NOUN ---> conj
are ---> be ---> AUX ---> ROOT
similar ---> similar ---> ADJ ---> acomp
. ---> . ---> PUNCT ---> punct
Boots ---> boot ---> NOUN ---> nsubj
and ---> and ---> CCONJ ---> cc
hippos ---> hippos ---> NOUN ---> conj
are ---> be ---> AUX ---> ROOT
n't ---> not ---> PART ---> neg
. ---> . ---> PUNCT ---> punct
Testing ---> test ---> VERB ---> compound
engineers ---> engineer ---> NOUN ---> nsubj
are ---> be ---> AUX ---> ROOT
really ---> really ---> ADV ---> advmod
in ---> in ---> ADP ---> prep
demand ---> demand ---> NOUN ---> pobj
. ---> . ---> PUNCT ---> punct


# Stemming

You may want to reduce words to their **root form** for the sake of uniformity.

For instance, `compute`, `computer`, `computing`, and `computed` all reduce to the same root.

spaCy provides lemmatization only, not stemming, so we use NLTK for this step.

## Porter Stemmer


In [60]:
import nltk

from nltk.stem.porter import PorterStemmer

In [62]:
stemmer = PorterStemmer()

In [63]:
tokens = ['compute', 'computer', 'computed', 'computing']

In [64]:
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


## Snowball Stemmer

The Snowball stemmer is a **slightly improved version** of the Porter stemmer, in both logic and speed, and is usually preferred over it. Let's see the Snowball stemmer in action:

In [65]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))


compute --> comput
computer --> comput
computed --> comput
computing --> comput
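
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer in plain Python. It is not the Porter or Snowball algorithm (both apply many more rules and conditions); the suffix list and the `naive_stem` helper are assumptions chosen only to cover the four example words.

```python
# Hypothetical suffix list, checked in order so longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "er", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["compute", "computer", "computed", "computing"]}
for word, stem in stems.items():
    print(word + " --> " + stem)
```

A rule set this small over-strips many real words (for example, `paper` becomes `pap`), which is why the full algorithms add conditions on what remains of the word.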


# Lemmatization

Lemmatization reduces a word to its lemma, the base form under which it appears in the dictionary. Unlike the truncated roots a stemmer returns, lemmas are actual dictionary words and are semantically complete.
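
The difference can be sketched with a lookup table: a lemmatizer maps each form to a dictionary entry, including irregular forms that no suffix rule could recover. The table and the `lookup_lemma` helper below are tiny hand-made assumptions, not spaCy's lemmatizer data or API.

```python
# Hypothetical lookup table; real lemmatizers combine large tables with rules
# and usually take the part of speech into account.
LEMMA_TABLE = {
    "did": "do",
    "displaying": "display",
    "words": "word",
    "testing": "test",
    "mice": "mouse",
    "children": "child",
}

def lookup_lemma(token):
    """Return the dictionary form if known, otherwise the lowercased token."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

lemmas = [lookup_lemma(tok) for tok in "did displaying words testing".split()]
print(" ".join(lemmas))
```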

In [70]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words testing')
print (" ".join([token.lemma_ for token in doc]))

do display word test


In [75]:
# make sure you have downloaded the English model with "python -m spacy download en_core_web_sm"

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apples and oranges are similar. Boots and hippos children aren't. Testing engineers are really in demand.")
print(doc.text)

Apples and oranges are similar. Boots and hippos children aren't. Testing engineers are really in demand.


In [76]:
for token in doc:
  # the text must first go through nlp to become a Doc container; lemmas are then read from its tokens
  #print(token, token.lemma, token.lemma_)
  print(token, '===>', token.lemma_, '===>', token.pos_)

Apples ===> apple ===> NOUN
and ===> and ===> CCONJ
oranges ===> orange ===> NOUN
are ===> be ===> AUX
similar ===> similar ===> ADJ
. ===> . ===> PUNCT
Boots ===> boot ===> NOUN
and ===> and ===> CCONJ
hippos ===> hippos ===> NOUN
children ===> child ===> NOUN
are ===> be ===> AUX
n't ===> not ===> PART
. ===> . ===> PUNCT
Testing ===> test ===> VERB
engineers ===> engineer ===> NOUN
are ===> be ===> AUX
really ===> really ===> ADV
in ===> in ===> ADP
demand ===> demand ===> NOUN
. ===> . ===> PUNCT


### Reader-friendly version

In [119]:
def show_lemmas(text):
  for token in text:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
    
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !


# Part-of-Speech Tagging (POS)

In [89]:
# make sure you have downloaded the English model with "python -m spacy download en_core_web_sm"

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.")
print(doc.text)

Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.


In [92]:
for token in doc:
  # the text must first go through nlp to become a Doc container; lemmas and POS tags are then read from its tokens
  #print(token, token.lemma, token.lemma_)
  print(token, '--->',token.lemma_,  '--->',token.pos_,)

Apples ---> apple ---> NOUN
and ---> and ---> CCONJ
oranges ---> orange ---> NOUN
are ---> be ---> AUX
similar ---> similar ---> ADJ
. ---> . ---> PUNCT
Boots ---> boot ---> NOUN
and ---> and ---> CCONJ
hippos ---> hippos ---> NOUN
are ---> be ---> AUX
n't ---> not ---> PART
. ---> . ---> PUNCT
Testing ---> test ---> VERB
engineers ---> engineer ---> NOUN
are ---> be ---> AUX
really ---> really ---> ADV
in ---> in ---> ADP
demand ---> demand ---> NOUN
. ---> . ---> PUNCT


# Dependency Parsing

In [114]:
import spacy
from spacy import displacy

In [115]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers Net income was $9.4 million compared to the prior year of $2.7 million.")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers Net income income pobj toward
the prior year year pobj to


In [116]:
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
            [child for child in token.children])

Autonomous amod cars NOUN []
cars nsubj shift VERB [Autonomous]
shift ROOT shift VERB [cars, liability]
insurance compound liability NOUN []
liability dobj shift VERB [insurance, toward]
toward prep liability NOUN [income]
manufacturers compound income NOUN []
Net amod income NOUN []
income pobj toward ADP [manufacturers, Net]
was ROOT was AUX [million, compared, .]
$ quantmod million NUM []
9.4 compound million NUM []
million attr was AUX [$, 9.4]
compared prep was AUX [to]
to prep compared VERB [year]
the det year NOUN []
prior amod year NOUN []
year pobj to ADP [the, prior, of]
of prep year NOUN [million]
$ quantmod million NUM []
2.7 compound million NUM []
million pobj of ADP [$, 2.7]
. punct was AUX []
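
A dependency parse can be read as a set of head pointers: every token points at its head, and a token's children are the tokens that point back at it. The sketch below rebuilds the children lists from (token, head) pairs copied from the first rows of the output above. It keys on token text for brevity, which is an assumption that only works because no word repeats in this fragment; a real implementation would key on token positions.

```python
# (token, head) pairs copied from the output above; the ROOT token is its own head.
ARCS = [
    ("Autonomous", "cars"),
    ("cars", "shift"),
    ("shift", "shift"),
    ("insurance", "liability"),
    ("liability", "shift"),
    ("toward", "liability"),
]

def children_of(arcs):
    """Invert head pointers into a token -> list of children mapping."""
    kids = {token: [] for token, _ in arcs}
    for token, head in arcs:
        if token != head:          # skip the ROOT's self-loop
            kids[head].append(token)
    return kids

tree = children_of(ARCS)
print(tree["shift"])
print(tree["liability"])
```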


## Visualization

In [117]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [111]:
doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
# jupyter=True renders the parse inline; without it, render() returns the raw SVG markup as a string
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

# Additional Token Attributes

spaCy assigns further information to each token, some of which is illustrated here:
![](https://miro.medium.com/max/1400/1*uXwjvPXtANA1z_Ivw4JQeA.jpeg)

In [106]:
# make sure you have downloaded the English model with "python -m spacy download en_core_web_sm"

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.")
print(doc.text)

Apples and oranges are similar. Boots and hippos aren't. Testing engineers are really in demand.


In [95]:
for token in doc:
  # the text must first go through nlp to become a Doc container; token attributes are then read from its tokens
  #print(token, token.lemma, token.lemma_)
  print(token, '--->', token.tag_, '--->',token.is_alpha,  '--->',token.is_stop,)

Apples ---> NNS ---> True ---> False
and ---> CC ---> True ---> True
oranges ---> NNS ---> True ---> False
are ---> VBP ---> True ---> True
similar ---> JJ ---> True ---> False
. ---> . ---> False ---> False
Boots ---> NNS ---> True ---> False
and ---> CC ---> True ---> True
hippos ---> NN ---> True ---> False
are ---> VBP ---> True ---> True
n't ---> RB ---> False ---> True
. ---> . ---> False ---> False
Testing ---> VBG ---> True ---> False
engineers ---> NNS ---> True ---> False
are ---> VBP ---> True ---> True
really ---> RB ---> True ---> True
in ---> IN ---> True ---> True
demand ---> NN ---> True ---> False
. ---> . ---> False ---> False
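
spaCy documents `token.is_alpha` as equivalent to `token.text.isalpha()`, and `is_stop` as membership in the stop list. The plain-Python sketch below mimics both flags for a few tokens from the output above; the stop-word set is a hypothetical subset written by hand, not spaCy's list.

```python
tokens = ["Apples", "and", "are", "n't", ".", "really"]

# Hypothetical subset of a stop list, just large enough for these tokens.
STOP_WORDS = {"and", "are", "n't", "really"}

# One (text, is_alpha, is_stop) row per token.
rows = [(tok, tok.isalpha(), tok.lower() in STOP_WORDS) for tok in tokens]
for tok, is_alpha, is_stop in rows:
    print(tok, '--->', is_alpha, '--->', is_stop)
```

The apostrophe in `n't` and the lone `.` are why those two tokens report `False` for the alphabetic check.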


# Detecting Entities (NER)

The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. 

In [44]:
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"], n_process=2):
    # Do something with the doc here
    print([(ent.text, ent.label_, str(spacy.explain(ent.label_))) for ent in doc.ents])

[('$9.4 million', 'MONEY', 'Monetary values, including unit'), ('the prior year', 'DATE', 'Absolute or relative dates or periods'), ('$2.7 million', 'MONEY', 'Monetary values, including unit')]
[('twelve billion dollars', 'MONEY', 'Monetary values, including unit'), ('1b', 'MONEY', 'Monetary values, including unit')]


## Visualization

In [112]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True)

# Detecting Nouns
**Focus on nouns (together with the adjectives attached to them)**

`Noun chunks` are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as **a noun plus the words describing the noun** – for example, in Sheb Wooley's 1958 song, a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.


In [59]:
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load('en_core_web_sm')

for doc in nlp.pipe(texts, n_process=2):
    # Do something with the doc here
    print([chunk.text for chunk in doc.noun_chunks])

['Net income', 'the prior year']
['Revenue', 'twelve billion dollars', 'a loss']


# Vocabulary and Matching

In [164]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Import the Matcher library
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [165]:
pattern1 = [{'LOWER': 'solarpower'}] # matches a single token whose lowercase text is 'solarpower'
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}] # matches two adjacent tokens that read 'solar' and 'power', in that order
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}] # matches three adjacent tokens whose middle token is any punctuation mark; single spaces are not tokenized, so they never count as punctuation
# spaCy v3 takes the patterns as one list; spaCy 2.x also accepted matcher.add('SolarPower', None, pattern1, pattern2, pattern3)
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

In [166]:
doc = nlp(u'The Solar Power industry continues to grow as demand for solarpower increases. Solar-power cars are gaining popularity.')

In [167]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


`matcher` returns a list of tuples. Each tuple contains an ID for the match, followed by the start and end token indices of the matched span `doc[start:end]`.

In [169]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


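## Setting pattern options and quantifiers

Adding `'OP': '*'` to a token pattern lets that token match zero or more times, so a single pattern can cover both `solar power` and `solar-power`.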

In [170]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2])  # spaCy v3 signature: patterns passed as one list

In [171]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]
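
To see why one pattern with `'OP': '*'` can replace two fixed-length patterns, here is a toy matcher in plain Python. It is a simplified sketch, not spaCy's `Matcher`: it supports only the `LOWER`, `IS_PUNCT`, and `'OP': '*'` keys, returns at most one span per start position, and works on a hand-written token list that is assumed to mirror spaCy's tokenization of the sentence above (the helper names are hypothetical).

```python
import string

def token_matches(token, spec):
    """Check one token against one pattern dict (LOWER and IS_PUNCT only)."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True

def match_end(tokens, pos, pattern):
    """Return the end index of a match of pattern starting at pos, or None."""
    if not pattern:
        return pos
    spec, rest = pattern[0], pattern[1:]
    if spec.get("OP") == "*":
        # Zero occurrences: skip this spec and try the rest of the pattern.
        end = match_end(tokens, pos, rest)
        if end is not None:
            return end
        # One more occurrence: consume a token and stay on the same spec.
        if pos < len(tokens) and token_matches(tokens[pos], spec):
            return match_end(tokens, pos + 1, pattern)
        return None
    if pos < len(tokens) and token_matches(tokens[pos], spec):
        return match_end(tokens, pos + 1, rest)
    return None

def find_matches(tokens, pattern):
    """Collect (start, end) spans, trying every start position once."""
    spans = []
    for start in range(len(tokens)):
        end = match_end(tokens, start, pattern)
        if end is not None:
            spans.append((start, end))
    return spans

# Hand-written tokens, assumed to mirror spaCy's tokenization of the example.
tokens = ["The", "Solar", "Power", "industry", "continues", "to", "grow", "as",
          "demand", "for", "solarpower", "increases", ".", "Solar", "-",
          "power", "cars", "are", "gaining", "popularity", "."]

pattern2 = [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}]
spans = find_matches(tokens, pattern2)
print(spans)
print([" ".join(tokens[s:e]) for s, e in spans])
```

The optional punctuation step is tried first with zero tokens (matching `Solar Power`) and then with one (matching `Solar - power`), which is the behaviour the quantifier adds.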


In [173]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # LEMMA instead of LOWER, so inflected forms such as 'powered' can match
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2])  # spaCy v3 signature: patterns passed as one list

In [174]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


In [175]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', [pattern1, pattern2, pattern3, pattern4])  # spaCy v3 signature: patterns passed as one list

found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


## Other token attributes

Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
![](https://miro.medium.com/max/1400/1*MYiYSqmsAMxFjviTo1Du2g.jpeg)


## PhraseMatcher

In [176]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)