### Sentence tokenization

In [2]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

In [3]:
alice = gutenberg.raw(fileids = 'carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax,structure and design philosophies.There is a defined hierarchical syntax for Python code which you should remember when writing code !Python is a really powerful programming language!'

In [4]:
print (len(alice))

144395


In [5]:
print(alice[0:100])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


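nltk.sent_tokenize is the default sentence tokenizer recommended by nltk. Internally it uses an instance of the PunktSentenceTokenizer class. This is not an ordinary, untrained object: it ships with pre-trained models for several languages and works well on many major languages besides English.

The following code shows the basic usage of this function on the sample texts: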

In [6]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text = alice)
sample_sentences = default_st(text = sample_text)

In [7]:
print ('Total sentences in sample_text:',len(sample_sentences))

Total sentences in sample_text: 1


In [8]:
print('Sample text sentences:-')
pprint(sample_sentences)

Sample text sentences:-
['We will discuss briefly about the basic syntax,structure and design '
 'philosophies.There is a defined hierarchical syntax for Python code which '
 'you should remember when writing code !Python is a really powerful '
 'programming language!']
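The single-sentence result above most likely comes from the missing whitespace after the sentence-ending punctuation in sample_text, since Punkt looks for a terminator followed by whitespace. A minimal sketch of a pre-processing pass, using only the standard `re` module, is shown below. The helper names `add_space_after_terminators` and `naive_sentence_split` are hypothetical, and the regex splitter is a stand-in for illustration, not the Punkt algorithm.

```python
import re

sample_text = ('We will discuss briefly about the basic syntax,structure and design '
               'philosophies.There is a defined hierarchical syntax for Python code which '
               'you should remember when writing code !Python is a really powerful '
               'programming language!')

def add_space_after_terminators(text):
    # Insert one space after '.', '!' or '?' when an uppercase letter follows directly.
    return re.sub(r'([.!?])(?=[A-Z])', r'\1 ', text)

def naive_sentence_split(text):
    # Split on whitespace that comes right after '.', '!' or '?'.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

fixed_text = add_space_after_terminators(sample_text)
sentences = naive_sentence_split(fixed_text)
for sentence in sentences:
    print(sentence)
```

The repaired string could then be passed to nltk.sent_tokenize. Note that this simple rule would also insert spaces inside abbreviations such as "U.S.A", so it is only suitable for text without them.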


In [9]:
print('\nTotal sentences in alice:',len(alice_sentences))


Total sentences in alice: 1625


In [10]:
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']


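### Word tokenization

Main interfaces:
word_tokenize

TreebankWordTokenizer

RegexpTokenizer

Tokenizers that inherit from RegexpTokenizer

We will use the sentence "The brown fox wasn't that quick and he couldn't win the race" as input for the various tokenizers. nltk.word_tokenize is the default word tokenizer recommended by nltk. It is essentially a wrapper around an instance of the TreebankWordTokenizer class.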

In [11]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


Using TreebankWordTokenizer

In [12]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


Using regular expressions and the RegexpTokenizer class

In [13]:
TOKEN_PATTERN = r'\w+'

In [14]:
regex_wt = nltk.RegexpTokenizer(pattern = TOKEN_PATTERN,gaps = False)
words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [15]:
GAP_PATTERN = r'\s+'
regex_wt = nltk.RegexpTokenizer(pattern = GAP_PATTERN,gaps = True)
words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']


In [16]:
word_indices = list(regex_wt.span_tokenize(sentence))
print (word_indices)
print([sentence[start:end]for start ,end in word_indices])

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
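For comparison, the same whitespace-delimited tokens and their character spans can be reproduced with the standard `re` module alone. This is a small sketch rather than how RegexpTokenizer is implemented; it matches runs of non-whitespace instead of splitting on the gaps.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Each match of \S+ is one token; span() gives its (start, end) offsets.
spans = [match.span() for match in re.finditer(r'\S+', sentence)]
tokens = [sentence[start:end] for start, end in spans]

print(spans)
print(tokens)
```

Keeping the spans is useful when tokens need to be mapped back to their position in the original string, for example when highlighting matches.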


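## Text normalization

Text normalization is a process made up of a series of steps that, in order, transform, clean, and standardize text data into a form that NLP and analytics systems and applications can consume. Tokenization itself is usually part of text normalization. Beyond tokenization, the other techniques include text cleaning, case conversion, word correction, stopword removal, stemming, and lemmatization. Text normalization is also often called text cleaning or text wrangling.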

In [17]:
import nltk
import re
import string
from pprint import pprint
corpus = ["The brown fox wasn't that quick and he couldn't win the race ",
         "Hey that's a great deal ! I just bought a phone for 199",
         "@@ You'll (learn) a **lot** in the book .Python is an amazing language!@@"]

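### Text cleaning

Older nltk releases provided a clean_html function for stripping unnecessary HTML markup; in NLTK 3 it is no longer implemented and points to BeautifulSoup's get_text() instead, so the BeautifulSoup library is the usual choice for parsing HTML. You can also use custom logic, including regular expressions, or xpath and lxml for parsing XML data. Getting data out of JSON is easier because it has an explicit key-value structure.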
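As a dependency-free option for simple cases, the standard library's `html.parser` can pull the text out of markup. The sketch below is a minimal illustration; the class name `TextExtractor` and the function `strip_html` are hypothetical names, and the approach keeps all text nodes, including the contents of script and style elements.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)

def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes and collapse runs of whitespace into single spaces.
    return ' '.join(''.join(parser.parts).split())

html_doc = '<html><body><h1>Title</h1>\n<p>Hello <b>world</b>!</p></body></html>'
clean_text = strip_html(html_doc)
print(clean_text)
```

For real-world pages with malformed markup, a dedicated parser such as BeautifulSoup is generally more robust.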

### Text tokenization

Here is a general-purpose tokenization function:

In [18]:
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence)for sentence in sentences]
    return word_tokens

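This function takes text data, extracts the sentences from it, and then splits each sentence into tokens, which may be words, special characters, or punctuation marks. The following code demonstrates it: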

In [19]:
token_list = [tokenize_text(text)for text in corpus]
pprint(token_list)

[[['The',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book',
   '.Python',
   'is',
   'an',
   'amazing',
   'language',
   '!'],
  ['@', '@']]]


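### Removing special characters

Punctuation and special characters usually carry little meaning. They can be removed either before or after tokenization. The following shows removal after tokenization: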

In [20]:
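def remove_characters_after_tokenization(tokens):
    # Strip every punctuation character from each token, then drop tokens that became empty
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = [pattern.sub('', token) for token in tokens]
    return [token for token in filtered_tokens if token]

In [21]:
# Clean each sentence, then drop sentences that are left with no tokens (filter() is lazy in Python 3, so lists are built explicitly)
filtered_list_1 = [[cleaned for cleaned in (remove_characters_after_tokenization(tokens) for tokens in sentence_tokens) if cleaned] for sentence_tokens in token_list]
print(filtered_list_1)

[[['The', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book', 'Python', 'is', 'an', 'amazing', 'language']]]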


Removing special characters before tokenization (recommended):

In [22]:
def remove_characters_before_tokenization(sentence, keep_apostrophes=False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # remove only the listed special characters; add others here as needed
        PATTERN = r'[?$&*%@()~|]'
    else:
        # keep only letters, digits and spaces so that word boundaries survive
        PATTERN = r'[^a-zA-Z0-9 ]'
    return re.sub(PATTERN, r'', sentence)

In [23]:
filtered_list_2 = [remove_characters_before_tokenization(sentence) for sentence in corpus]
print(filtered_list_2)

['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal  I just bought a phone for 199', ' Youll learn a lot in the book Python is an amazing language']


In [24]:
cleaned_corpus = [remove_characters_before_tokenization(sentence,keep_apostrophes = True)for sentence in corpus]
print(cleaned_corpus)

["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal ! I just bought a phone for 199", " You'll learn a lot in the book .Python is an amazing language!"]


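### Expanding contractions

Contractions are shortened forms of words or syllables. They exist in both written and spoken language; for example, "is not" contracts to "isn't". The apostrophe marks the contraction, and some vowels and other letters are dropped. Contractions are usually avoided in formal writing but are widely used in informal settings.

Contractions pose a problem for NLP and text analytics, first because the word contains a special apostrophe character, and second because a single contraction stands for two or more words.

Contractions can be expanded with a mapping. We build a vocabulary of contractions and their expanded forms, which is kept in the file contractions.py.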
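The mapping approach can be sketched as follows. The dictionary `CONTRACTION_MAP` below is a tiny hypothetical sample made up for this illustration (the real contractions.py is not shown here), and the function name `expand_contractions` is likewise an assumption.

```python
import re

# Hypothetical sample mapping; a real table would contain many more entries.
CONTRACTION_MAP = {
    "isn't": "is not",
    "wasn't": "was not",
    "couldn't": "could not",
    "that's": "that is",
    "you'll": "you will",
}

def expand_contractions(sentence, contraction_map=CONTRACTION_MAP):
    # Build one alternation of all known contractions, longest first.
    keys = sorted(contraction_map, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b',
                         flags=re.IGNORECASE)

    def expand_match(match):
        contraction = match.group(0)
        expanded = contraction_map[contraction.lower()]
        # Keep the original first character so capitalization is preserved.
        return contraction[0] + expanded[1:]

    return pattern.sub(expand_match, sentence)

sample = ["The brown fox wasn't that quick and he couldn't win the race",
          "Hey that's a great deal ! You'll learn a lot"]
expanded_sample = [expand_contractions(s) for s in sample]
print(expanded_sample)
```

Some contractions are ambiguous ("that's" can mean "that is" or "that has"), so a fixed mapping is only an approximation. Typographic apostrophes would also need to be normalized first.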

### Case conversion

Use the string methods lower() and upper().
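A short illustration of both methods on one of the sample sentences; nothing here goes beyond the built-in `str` type.

```python
text = "The brown fox wasn't that quick and he couldn't win the race"

lowered = text.lower()   # every cased character becomes lowercase
uppered = text.upper()   # every cased character becomes uppercase

print(lowered)
print(uppered)
```

Lowercasing before matching against word lists is a common choice, because such lists are usually stored in lowercase.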

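### Removing stopwords

Stopwords are words with little or no meaning. They are usually removed from text during processing so that the words carrying the most meaning and context remain. If you aggregate a corpus by individual tokens and check the word frequencies, you will find that stopwords occur most often. Words such as "a", "the", and "me" are stopwords, and each domain may also have its own set. The following code shows one way to filter out English stopwords: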

In [25]:
def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

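This function relies on the English stopword list that ships with nltk. Contraction expansion is not carried out in this notebook, so no expanded_corpus exists; instead we tokenize the cleaned_corpus from the special-character step with tokenize_text and then remove stopwords with the function above. Because the nltk list is all lowercase, capitalized tokens such as 'The' are kept unless the tokens are lowercased first.

In [26]:
cleaned_corpus_tokens = [tokenize_text(text) for text in cleaned_corpus]
filtered_list_3 = [[remove_stopwords(tokens) for tokens in sentence_tokens] for sentence_tokens in cleaned_corpus_tokens]
print(filtered_list_3)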
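Stopword filtering itself does not depend on nltk. The sketch below uses a small hand-written stopword set (an assumption made for this illustration; nltk's English list is much longer) and compares tokens in lowercase so that capitalized stopwords are removed as well. The function name `remove_stopwords_ci` is hypothetical.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {'the', 'a', 'an', 'and', 'he', 'that', 'was', 'is', 'in', 'for', 'i'}

def remove_stopwords_ci(tokens, stopwords=STOPWORDS):
    # Compare in lowercase, but keep the original casing of the surviving tokens.
    return [token for token in tokens if token.lower() not in stopwords]

tokens = ['The', 'brown', 'fox', 'was', 'not', 'that', 'quick', 'and',
          'he', 'could', 'not', 'win', 'the', 'race']
filtered = remove_stopwords_ci(tokens)
print(filtered)
```

Using a set rather than a list makes each membership test constant time, which matters on large corpora. Whether negations such as "not" belong in the stopword set depends on the task, since removing them can change the meaning of a sentence.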

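## Word correction

One of the main challenges in text normalization is incorrect words in the text. "Incorrect" here covers misspelled words as well as words in which some letters are repeated too many times. For example, someone expressing strong emotion might write "finally" as "finallllyyyy". Our goal is to normalize such words to their correct form.

#### 1. Correcting repeated characters

Here we introduce a spelling-correction approach that combines syntax and semantics: we start by correcting the syntax of these words and then move on to semantics.

The first step of the algorithm uses a regular expression to identify repeated characters in a word and then uses substitution to remove them one at a time. Taking "finallllyyyy" from the example above, the pattern r'(\w*)(\w)\2(\w*)' identifies a character that is repeated between two other parts of the word. Using the regex match groups (groups 1, 2, and 3) with the replacement pattern r'\1\2\3', one repeated character is removed per substitution, and the process is iterated until no repeated characters remain.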

In [33]:
old_word = 'finallllyyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1
while True:
    #remove one repeated character
    new_word = repeat_pattern.sub(match_substitution,old_word)
    if new_word != old_word:
        print('Step:{}Word:{}'.format(step,new_word))
        step += 1  #update step
        old_word = new_word
        continue
    else:
        print("Final word:",new_word)
        break

Step:1Word:finallllyyy
Step:2Word:finallllyy
Step:3Word:finalllly
Step:4Word:finallly
Step:5Word:finally
Step:6Word:finaly
Final word: finaly
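To see why each pass removes a character from the right-hand end first, it helps to inspect the match groups for a single step. The snippet below is a small diagnostic sketch using the same pattern; it relies only on the standard `re` module.

```python
import re

repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
word = 'finallllyyyy'

match = repeat_pattern.match(word)
# Group 1 is greedy, so it backs off only as far as needed to find a doubled character.
groups = match.groups()
print(groups)

# Substituting \1\2\3 drops the second copy of the doubled character.
one_step = repeat_pattern.sub(r'\1\2\3', word)
print(one_step)
```

Because group 1 is greedy, the doubled pair that gets matched is the last one in the word, which is why the trailing run of "y" shrinks before the run of "l".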


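In fact the correct word "finally" was already reached at step 5, but the loop kept going and produced "finaly". We will now use the WordNet corpus to check the word at every step and stop the loop as soon as a valid word is obtained. This adds the semantic check the algorithm needs, as shown in the following code: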

In [32]:
from nltk.corpus import wordnet 
old_word = 'finallllyyyy'
repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
match_substitution = r'\1\2\3'
step = 1

while True:
    if wordnet.synsets(old_word):
        print("Final correct word:",old_word)
        break
    new_word = repeat_pattern.sub(match_substitution,old_word)
    if new_word != old_word:
        print('Step:{},Word:{}'.format(step,new_word))
        step += 1
        old_word = new_word
        continue
    else:
        print("Final word:",new_word)
        break

Step:1,Word:finallllyyy
Step:2,Word:finallllyy
Step:3,Word:finalllly
Step:4,Word:finallly
Step:5,Word:finally
Final correct word: finally


This logic can be wrapped in a function to make it more generally usable for correcting words, as in the following snippet:

In [34]:
from nltk.corpus import wordnet

def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution,old_word)
        return replace(new_word) if new_word != old_word else new_word
    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens

This code uses the inner function replace() to implement the algorithm, and the outer function remove_repeated_characters() calls it on each token of the sentence.

In [36]:
sample_sentence = 'My schooooooooooooooooooooooooooooooooool is reallllllllllllllllllllllllllllyyy amazzzzzzzzzzzzzzzzzinggggggggg 23333'
sample_sentence_tokens = tokenize_text(sample_sentence)[0]
print (sample_sentence_tokens)

['My', 'schooooooooooooooooooooooooooooooooool', 'is', 'reallllllllllllllllllllllllllllyyy', 'amazzzzzzzzzzzzzzzzzinggggggggg', '23333']


In [37]:
print (remove_repeated_characters(sample_sentence_tokens))

['My', 'school', 'is', 'really', 'amazing', '23']


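#### 2. Correcting spelling errors

The other problem we face is misspellings caused by human error, or even machine-generated misspellings introduced by features such as autocorrect.

The best-known algorithm for handling spelling errors was written by Peter Norvig, Director of Research at Google; the full, detailed algorithm is available at http://norvig.com/spell-correct.html

Our main goal is, given a word, to find its most likely correct form. The approach is to generate a set of candidate words that are similar to the input word and to select the most likely one from that set as the correct word. We use a corpus of standard English words and, based on word frequency in that corpus, identify the correct word among the candidates closest to the input word. This distance (how far a word is from the input word) is called the edit distance. The input corpus we use contains books from the Gutenberg corpus, Wiktionary, and a list of the most frequent words from the British National Corpus.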

In [44]:
import re,collections
def tokens(text):
    """
    Get all words from the corpus
    """
    return re.findall('[a-z]+',text.lower())
WORDS = tokens(open('big.txt').read())
WORD_COUNTS = collections.Counter(WORDS)
#top10 words in the corpus
print (WORD_COUNTS.most_common(10))

[('the', 66492), ('of', 33683), ('and', 30259), ('to', 22830), ('in', 18743), ('a', 17369), ('that', 9222), ('he', 8952), ('was', 8760), ('it', 8615)]


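With the vocabulary in place, we can define three functions that compute the sets of words at edit distance 0, 1, and 2 from the input word. These edits are produced by deletion, transposition, replacement, and insertion operations.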

In [67]:
import sys,string
def edits0(word):
    """
    Return all strings that are zero edits away
    from the input word(i.e.,the word itself)
    """
    return {word}
def edits1(word):
    """Return all strings that are one edit away from the input word"""
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    def splits(word):
        return [(word[:i],word[i:])for i in range(len(word)+1)]

    pairs  =  splits(word)
    deletes =  [a+b[1:]  for (a,b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a,b) in pairs if len(b)>1]
    replaces  = [a+c+b[1:] for (a,b)in pairs for c in alphabet if b]
    inserts = [a+c+b for (a,b) in pairs for c in alphabet]
    return set(deletes+transposes+replaces+inserts)
def edits2(word):
    return {e2 for e1 in edits1(word)for e2 in edits1(e1)}
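To get a feel for how many candidates a single edit produces, the four edit types can be counted separately. The sketch below re-implements the same four list comprehensions in a helper named `edit_lists` (a hypothetical name) so that it runs on its own; for a word of length n it yields n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, that is 54n+25 strings before duplicates are removed.

```python
import string

def edit_lists(word, alphabet=string.ascii_lowercase):
    # All ways to cut the word into a left part and a right part.
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes': [a + b[1:] for a, b in pairs if b],
        'transposes': [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1],
        'replaces': [a + c + b[1:] for a, b in pairs if b for c in alphabet],
        'inserts': [a + c + b for a, b in pairs for c in alphabet],
    }

word = 'finally'
lists = edit_lists(word)
raw_total = sum(len(values) for values in lists.values())
unique = set().union(*lists.values())

for name, values in lists.items():
    print(name, len(values))
print('raw total:', raw_total, 'unique:', len(unique))
```

The set of unique candidates is smaller than the raw total because different edits can produce the same string; for instance, deleting either "l" in "finally" gives "finaly". The input word itself is also among the candidates, since replacing a letter with the same letter changes nothing.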

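We can also define a known() function that returns the subset of the candidates produced by the edit functions that actually appear in the vocabulary dictionary WORD_COUNTS. This gives us a list of valid words from the candidate set: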

In [55]:
def known(words):
    return {w for w in words if w in WORD_COUNTS}

In [56]:
word = 'finally'
edits0(word)

{'finally'}

In [57]:
known(edits0(word))

{'finally'}

In [63]:
edits1(word)

{'afinally',
 'ainally',
 'bfinally',
 'binally',
 'cfinally',
 'cinally',
 'dfinally',
 'dinally',
 ...}

In [64]:
known(edits1(word))

{'finally'}

In [68]:
edits2(word)

{'finzalwy',
 'tinallny',
 'fbyally',
 'finazhy',
 'finalkyy',
 'finlallyo',
 'finajlqly',
 'tfibally',
 ...}

In [69]:
known(edits2(word))

{'fatally', 'final', 'finally', 'finely', 'vitally'}

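The output above shows a set of candidate words that could replace a misspelled input word. We choose among them by giving higher priority to words with a smaller edit distance: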

In [71]:
candidates = (known(edits0(word))or
              known(edits1(word))or
              known(edits2(word))or
              [word])
candidates

{'finally'}

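If two candidates have the same edit distance, we can pick the one that occurs most frequently in the vocabulary dictionary WORD_COUNTS by using max(candidates, key = WORD_COUNTS.get). We now define the spelling-correction function with this logic: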

In [73]:
def correct(word):
    candidates = (known(edits0(word))or
              known(edits1(word))or
              known(edits2(word))or
              [word])
    return max(candidates,key = WORD_COUNTS.get)
correct('fianlly')

'finally'

In [74]:
correct('FIANLLY')

'FIANLLY'

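As you can see, this function is case-sensitive and cannot correct words that are not lowercase. We therefore write the following functions so that both uppercase and lowercase words can be corrected. The logic is to record the original casing of the word, convert all letters to lowercase, correct the spelling, and finally use the case_of function to convert the result back to the original casing: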

In [75]:
def correct_match(match):
    word = match.group()
    def case_of(text):
        return (str.upper if text.isupper() else
                str.lower if text.islower() else
                str.title if text.istitle() else
                str)
    return case_of(word)(correct(word.lower()))
def correct_text_generic(text):
    return re.sub('[a-zA-Z]+',correct_match,text)

The function above can now correct both uppercase and lowercase words:

In [76]:
correct_text_generic('fianlly')

'finally'

In [77]:
correct_text_generic('FIANLLY')

'FINALLY'

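Of course this approach is not always accurate. If a word does not appear in the vocabulary dictionary, it may not be corrected; using more vocabulary data to cover more words helps with this. The pattern library also offers a similar out-of-the-box algorithm: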

In [80]:
from pattern.en import suggest 
print (suggest('fianlly'))

[('finally', 1.0)]


In [82]:
print(suggest('flaot'))

[('flat', 0.85), ('float', 0.15)]


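### Stemming

A morpheme is the smallest independent unit in a natural language. Morphemes comprise stems and affixes. Affixes are units such as prefixes and suffixes that attach to a stem to change its meaning or create a new word. The stem is also often called the base form of a word. Creating new words by adding affixes to a stem is called inflection; the reverse process, obtaining the base form from an inflected form, is called stemming.

Take the word "jump" as an example. You can add affixes to it to form new words such as "jumps", "jumped", and "jumping". In these cases the base word "jump" is the stem, and stemming any of the three inflected forms yields that base form.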
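The idea can be illustrated with a deliberately naive suffix stripper. This is a toy sketch, not the Porter or Lancaster algorithm; the function name `naive_stem`, the suffix list, and the minimum stem length of three characters are all assumptions chosen for this example.

```python
# Suffixes to try, longest first; a toy list for illustration only.
SUFFIXES = ('ing', 'ed', 's')

def naive_stem(word):
    for suffix in SUFFIXES:
        # Strip the suffix only if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [naive_stem(w) for w in ['jump', 'jumps', 'jumped', 'jumping']]
print(stems)
print(naive_stem('running'))
```

All four forms of "jump" reduce to the same stem, but "running" becomes "runn", which shows why real stemmers need additional rules (for example, for doubled consonants) beyond plain suffix removal.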

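The nltk package has several stemmer implementations, which live in the stem module. There is also the Porter2 algorithm (the Snowball English stemmer), which is an improved version. The following code demonstrates the Porter stemmer: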

In [92]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
#print (ps.stem('jumping'),ps.stem('jumps'),ps.stem('jumped'))
print(ps.stem('jumping'))
print(ps.stem('selfless'))

jump
selfless


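The Lancaster stemmer is based on the Lancaster stemming algorithm and has more than 120 rules that specify how affixes are removed or replaced to obtain the stem: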

In [91]:
from nltk.stem import LancasterStemmer
ls = LancasterStemmer()
print(ls.stem('jumping'))
print(ls.stem('selfless'))

jump
selfless


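On these two examples the output matches that of the Porter stemmer, although in general the two stemmers follow different rules and can produce different stems.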

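### Lemmatization

Lemmatization is very similar to stemming: affixes are removed to obtain the base form of a word. Here, however, the base form is called the root word rather than the stem. The difference is that a stem is not necessarily a standard, correct word, that is, it may not exist in the dictionary. The root word, also called the lemma, is always present in the dictionary.

Lemmatization is considerably slower than stemming. nltk has a robust lemmatization module that uses WordNet together with the word's syntax and semantics (such as part of speech and context) to obtain the root word or lemma.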

In [93]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

print(wnl.lemmatize('cars','n'))
print(wnl.lemmatize('men','n'))

car
men


In [94]:
print(wnl.lemmatize('running','v'))

run


In [95]:
print(wnl.lemmatize('ate','v'))

eat


This concludes the discussion of techniques for processing and normalizing text.