### Sentence Tokenization

In [1]:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint

In [3]:
alice = gutenberg.raw(fileids = 'carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax,structure and design philosophies.There is a defined hierarchical syntax for Python code which you should remember when writing code !Python is a really powerful programming language!'

In [5]:
print (len(alice))

144395


In [6]:
print(alice[0:100])

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


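nltk.sent_tokenize is the default sentence tokenizer that nltk recommends. Internally it uses an instance of the PunktSentenceTokenizer class. This is not an ordinary, freshly constructed object: it comes pre-trained on models for several languages, and it works well on many major languages besides English.

The following code shows the basic usage of this function on the sample texts: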

In [13]:
default_st = nltk.sent_tokenize
alice_sentences = default_st(text = alice)
sample_sentences = default_st(text = sample_text)

In [14]:
print ('Total sentences in sample_text:',len(sample_sentences))

Total sentences in sample_text: 1


In [15]:
print('Sample text sentences:-')
pprint(sample_sentences)

Sample text sentences:-
['We will discuss briefly about the basic syntax,structure and design '
 'philosophies.There is a defined hierarchical syntax for Python code which '
 'you should remember when writing code !Python is a really powerful '
 'programming language!']
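
The single-sentence result for sample_text is most likely caused by the missing whitespace after the sentence-ending punctuation ("philosophies.There", "code !Python"). The sketch below is a hypothetical preprocessing step that uses only the standard `re` module, not part of nltk: it inserts a space after `.`, `!` or `?` when a letter follows directly, then applies a naive regex split as a stand-in for a real sentence tokenizer. The helper name `add_space_after_terminators` is an assumption made for illustration.

```python
import re

sample_text = ('We will discuss briefly about the basic syntax,structure and design '
               'philosophies.There is a defined hierarchical syntax for Python code which '
               'you should remember when writing code !Python is a really powerful '
               'programming language!')


def add_space_after_terminators(text):
    """Insert a space after '.', '!' or '?' when a letter follows immediately."""
    return re.sub(r'([.!?])(?=[A-Za-z])', r'\1 ', text)


fixed_text = add_space_after_terminators(sample_text)

# Naive stand-in for a sentence tokenizer: split on whitespace that follows
# a sentence terminator. This is NOT the Punkt algorithm.
naive_sentences = re.split(r'(?<=[.!?])\s+', fixed_text)

print(len(naive_sentences))
for sentence in naive_sentences:
    print(sentence)
```

The repaired text can then be passed to nltk.sent_tokenize in place of the raw string. This heuristic is crude: it would also break up abbreviations, URLs, and file names such as "example.txt", so it only suits text where that is acceptable.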


In [16]:
print('\nTotal sentences in alice:',len(alice_sentences))


Total sentences in alice: 1625


In [18]:
print('First 5 sentences in alice:-')
pprint(alice_sentences[0:5])

First 5 sentences in alice:-
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']


### Word Tokenization

Main interfaces:

word_tokenize

TreebankWordTokenizer

RegexpTokenizer

Tokenizers that inherit from RegexpTokenizer

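We will use the example sentence "The brown fox wasn't that quick and he couldn't win the race" as input to each tokenizer. nltk.word_tokenize is the default word tokenizer that nltk recommends. In the nltk version used here, it is essentially a wrapper around an instance of the TreebankWordTokenizer class.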

In [20]:
sentence = "The brown fox wasn't that quick and he couldn't win the race"
default_wt = nltk.word_tokenize
words = default_wt(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


Using TreebankWordTokenizer

In [21]:
treebank_wt = nltk.TreebankWordTokenizer()
words = treebank_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']


Using regular expressions and the RegexpTokenizer class

In [22]:
TOKEN_PATTERN = r'\w+'

In [23]:
regex_wt = nltk.RegexpTokenizer(pattern = TOKEN_PATTERN,gaps = False)
words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']


In [24]:
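GAP_PATTERN = r'\s+'
# With gaps=True the pattern describes the separators between tokens
regex_wt = nltk.RegexpTokenizer(pattern = GAP_PATTERN,gaps = True)
# Re-tokenize with the new tokenizer; otherwise the previous result is printed
words = regex_wt.tokenize(sentence)
print(words)

['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']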


In [26]:
word_indices = list(regex_wt.span_tokenize(sentence))
print (word_indices)
print([sentence[start:end]for start ,end in word_indices])

[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
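
The two RegexpTokenizer modes roughly correspond to two operations of the standard `re` module: matching the tokens themselves (gaps = False) versus splitting on the separators between them (gaps = True). The sketch below reproduces both, plus the token spans, without nltk. It is an illustration of the idea, not a replacement for the nltk classes.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# gaps=False: the pattern describes the tokens, so collect every match
tokens_by_pattern = re.findall(r'\w+', sentence)

# gaps=True: the pattern describes the separators, so split on it
# and discard any empty strings
tokens_by_gaps = [tok for tok in re.split(r'\s+', sentence) if tok]

# Spans of the whitespace-delimited tokens, as (start, end) offsets
spans = [(m.start(), m.end()) for m in re.finditer(r'\S+', sentence)]

print(tokens_by_pattern)
print(tokens_by_gaps)
print(spans)
```

Because `\w` does not match an apostrophe, the pattern-based version splits "wasn't" into "wasn" and "t", while the gap-based version keeps contractions intact.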


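## Text Normalization

Text normalization is defined as a process consisting of a series of steps that, in turn, transform, clean, and standardize text data into a form that NLP and analytics systems and applications can consume. Tokenization itself is usually part of text normalization. Beyond tokenization there are various other techniques, including text cleaning, case conversion, word correction, stopword removal, stemming, and lemmatization. Text normalization is also often called text cleaning or text transformation.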

In [27]:
import nltk
import re
import string
from pprint import pprint
corpus = ["The brown fox wasn't that quick and he couldn't win the race ",
         "Hey that's a great deal ! I just bought a phone for 199",
         "@@ You'll (learn) a **lot** in the book .Python is an amazing language!@@"]

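### Text Cleaning

You can clean unwanted markup out of HTML with nltk's clean_html function (note that in NLTK 3 this function is no longer implemented and raises NotImplementedError, pointing you to BeautifulSoup instead), or use the BeautifulSoup library to parse HTML data. You can also use custom logic, including regular expressions, XPath, and lxml, to parse XML data. Extracting data from JSON is comparatively easy because it has an explicit key-value structure.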
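
As a minimal sketch of HTML cleaning, the example below strips tags with the standard library's `html.parser` module, used here as a dependency-free stand-in for BeautifulSoup. The class name `TextExtractor`, the helper `strip_html`, and the sample markup are all hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join the text nodes and collapse repeated whitespace
    return ' '.join(' '.join(parser.parts).split())


html_doc = ('<html><body><h1>Title</h1>'
            '<p>Python is an <b>amazing</b> language!</p></body></html>')
plain_text = strip_html(html_doc)
print(plain_text)
```

This simple extractor also keeps the contents of `<script>` and `<style>` elements, so real pages usually need extra filtering or a full parser such as BeautifulSoup.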

### Text Tokenization

Here is a general-purpose tokenization function:

In [33]:
def tokenize_text(text):
    sentences = nltk.sent_tokenize(text)
    word_tokens = [nltk.word_tokenize(sentence)for sentence in sentences]
    return word_tokens

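This function takes text data, extracts the sentences from it, and then splits each sentence into tokens, which can be words, special characters, or punctuation marks. The following code demonstrates the function: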

In [36]:
token_list = [tokenize_text(text)for text in corpus]
pprint(token_list)

[[['The',
   'brown',
   'fox',
   'was',
   "n't",
   'that',
   'quick',
   'and',
   'he',
   'could',
   "n't",
   'win',
   'the',
   'race']],
 [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
  ['I', 'just', 'bought', 'a', 'phone', 'for', '199']],
 [['@',
   '@',
   'You',
   "'ll",
   '(',
   'learn',
   ')',
   'a',
   '**lot**',
   'in',
   'the',
   'book',
   '.Python',
   'is',
   'an',
   'amazing',
   'language',
   '!'],
  ['@', '@']]]


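### Removing Special Characters

Punctuation and special characters usually carry little meaning. They can be removed either before or after tokenization. The following shows removal after tokenization: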

In [37]:
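def remove_characters_after_tokenization(tokens):
    # Character class that matches every ASCII punctuation character
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    # Strip punctuation from each token and drop tokens that become empty;
    # wrap in list() because filter() is lazy in Python 3
    filtered_tokens = list(filter(None,[pattern.sub('',token)for token in tokens]))
    return filtered_tokens

In [39]:
# The outer filter drops sentences that are left with no tokens at all
filtered_list_1 = [list(filter(None,[remove_characters_after_tokenization(tokens)for tokens in sentence_tokens]))for sentence_tokens in token_list]
print (filtered_list_1)

[[['The', 'brown', 'fox', 'was', 'nt', 'that', 'quick', 'and', 'he', 'could', 'nt', 'win', 'the', 'race']], [['Hey', 'that', 's', 'a', 'great', 'deal'], ['I', 'just', 'bought', 'a', 'phone', 'for', '199']], [['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book', 'Python', 'is', 'an', 'amazing', 'language']]]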


Removing special characters before tokenization (recommended):

In [46]:
def remove_characters_before_tokenization(sentence,keep_apostrophes = False):
    sentence = sentence.strip()
    if keep_apostrophes:
        # Inside a character class no '|' separators are needed
        PATTERN = r'[?$&*%@()~]'  # add other characters here to remove them
        filtered_sentence = re.sub(PATTERN,r'',sentence)
    else:
        # Keep alphanumeric characters and spaces, so word boundaries survive
        PATTERN = r'[^a-zA-Z0-9 ]'
        filtered_sentence = re.sub(PATTERN,r'',sentence)
    return filtered_sentence

In [48]:
filtered_list_2 = [remove_characters_before_tokenization(sentence)for sentence in corpus]
print (filtered_list_2)

['The brown fox wasnt that quick and he couldnt win the race', 'Hey thats a great deal  I just bought a phone for 199', ' Youll learn a lot in the book Python is an amazing language']


In [49]:
cleaned_corpus = [remove_characters_before_tokenization(sentence,keep_apostrophes = True)for sentence in corpus]
print(cleaned_corpus)

["The brown fox wasn't that quick and he couldn't win the race", "Hey that's a great deal ! I just bought a phone for 199", " You'll learn a lot in the book .Python is an amazing language!"]
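
Removing characters can leave stray whitespace behind, such as the leading space in the third cleaned sentence above. A small follow-up step, sketched here with the standard `re` module, collapses whitespace runs and trims both ends. The helper name `normalize_whitespace` and the second example string are hypothetical.

```python
import re


def normalize_whitespace(sentence):
    """Collapse any run of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', sentence).strip()


examples = [
    " You'll learn a lot in the book .Python is an amazing language!",
    "Hey   there ,  nice\tdeal ",  # made-up string with irregular spacing
]
tidied = [normalize_whitespace(sentence) for sentence in examples]
print(tidied)
```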


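### Expanding Contractions

A contraction is a shortened form of a word or syllable. Contractions exist in both written and spoken language; for example, "is not" is contracted to "isn't". The apostrophe marks the contraction, and some vowels and other letters are dropped. Contractions are usually avoided in formal writing but are widely used in informal contexts.

Contractions pose a problem for NLP and text analytics. First, the word contains a special apostrophe character. Second, a single contraction stands for two or more words.

Contractions can be expanded with a mapping. We create a vocabulary of contractions and their expanded forms, which is kept in the Python file contractions.py.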
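
The mapping approach can be sketched as follows. The dictionary `CONTRACTION_MAP` here is a deliberately tiny, hypothetical stand-in for the full vocabulary in contractions.py, and `expand_contractions` is an illustrative helper, not necessarily how the original file is used. The result is named expanded_corpus to match the name used in the stopword section later.

```python
import re

# Hypothetical, deliberately tiny mapping; a real one holds many more entries
CONTRACTION_MAP = {
    "wasn't": "was not",
    "couldn't": "could not",
    "isn't": "is not",
    "that's": "that is",   # ambiguous: could also mean "that has"
    "you'll": "you will",
}


def expand_contractions(sentence, contraction_map=CONTRACTION_MAP):
    # One alternation that matches any known contraction as a whole word
    pattern = re.compile(
        r'\b({})\b'.format('|'.join(re.escape(key) for key in contraction_map)),
        flags=re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0)
        expanded = contraction_map[found.lower()]
        # Keep the original capitalisation of the first letter
        return found[0] + expanded[1:]

    return pattern.sub(replace, sentence)


cleaned_corpus = [
    "The brown fox wasn't that quick and he couldn't win the race",
    "Hey that's a great deal ! I just bought a phone for 199",
    " You'll learn a lot in the book .Python is an amazing language!",
]
expanded_corpus = [expand_contractions(sentence) for sentence in cleaned_corpus]
print(expanded_corpus)
```

A lookup table cannot resolve ambiguous contractions such as "that's", so each entry simply picks one expansion.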

### Case Conversion

Use the string methods lower() and upper().
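
A minimal illustration of both methods on one of the corpus sentences:

```python
text = "The brown fox wasn't that quick and he couldn't win the race"

lowered = text.lower()   # every cased character converted to lowercase
uppered = text.upper()   # every cased character converted to uppercase

print(lowered)
print(uppered)
```

Lowercasing is the more common choice before matching tokens against word lists, because such lists are usually stored in lowercase.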

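### Removing Stopwords

Stopwords are words that carry little or no meaning. They are usually removed from text during processing so that the words with the most meaning and context remain. If you aggregate a corpus by individual tokens and check word frequencies, you will find that stopwords occur most often. Words such as "a", "the", and "me" are stopwords. Each domain may also have its own set of stopwords. The following code shows one way to filter out and remove English stopwords: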

In [50]:
def remove_stopwords(tokens):
    stopword_list = nltk.corpus.stopwords.words('english')
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens

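This function uses the English stopword list that ships with nltk. We use the tokenize_text function to tokenize expanded_corpus, the contraction-expanded corpus from the previous section, and then remove stopwords with the function above. Note that expanded_corpus was never built in this notebook, so the cell below fails with a NameError: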

In [56]:
expanded_corpus_tokens = [tokenize_text(text)for text in expanded_corpus]
filtered_list_3 = [[remove_stopwords(tokens)for tokens in sentence_tokens]for sentence_tokens in expanded_corpus_tokens]
print(filtered_list_3)

NameError: name 'expanded_corpus' is not defined
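
The nltk English stopword list is stored in lowercase, so a case-sensitive membership test such as the one in remove_stopwords leaves capitalised tokens like "The" in place. The sketch below shows a case-insensitive variant. `STOPWORDS` is a tiny hypothetical set standing in for nltk.corpus.stopwords.words('english'), and the function name is an assumption made for illustration.

```python
# Tiny hypothetical stopword set, standing in for the nltk English list
STOPWORDS = {'a', 'the', 'me', 'and', 'that', 'he', 'not', 'was', 'could'}


def remove_stopwords_case_insensitive(tokens, stopwords=STOPWORDS):
    # Compare the lowercased token, but keep the original token in the result
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ['The', 'brown', 'fox', 'was', 'not', 'that', 'quick',
          'and', 'he', 'could', 'not', 'win', 'the', 'race']
filtered = remove_stopwords_case_insensitive(tokens)
print(filtered)
```

Using a set rather than a list also makes each membership test constant time, which matters on large corpora. Whether negations such as "not" should count as stopwords depends on the task, since removing them can flip the meaning of a sentence.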
