# 2. Tokenization

Tokenization: split a document, any string, into discrete **tokens of meaning**.

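This chapter covers:

- splitting a sentence into tokens / n-grams
- handling special punctuation, such as emoticons
- stemming / lemmatization: merging (compressing) tokens further
- creating a vector representation for each sentence
- building a sentiment analyzer from handcrafted token scores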

## 1. Challenges

### A typical preprocessing pipeline:

1. tokenization
    - separate punctuation from words
    - split contractions: we'll -> we will
    - emoticons
    - math symbols
    - **what is a token?**
        - ice cream: 1 token or 2 tokens? 
    
2. stemming
    - based on a token's syllables, prefixes, and suffixes
    - processing, processed, process
    - difficulties:
        - remove ing
            - ending -> end
            - running -> run (not runn)
            - sing -> s (wrong)
        - plural form:
            - words -> word
            - bus -> bu (wrong)
        - more info: ch05_word2vec

3. invisible words
    - don't (do that)!
    
4. n-grams
    - including pairs of words
    - filter out n-grams that rarely occur together (low frequency)
    - keep common combinations, e.g. ice cream, Mr. Smith
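
The stemming pitfalls listed above are easy to reproduce. The sketch below is a deliberately naive suffix stripper (the function name `naive_stem` is made up for illustration, not a library API); it removes a trailing "ing" or "s" without any further checks, which is exactly what produces the wrong stems.

```python
def naive_stem(word):
    """Strip a trailing 'ing' or 's' with no further checks (deliberately naive)."""
    if word.endswith("ing"):
        return word[:-3]
    if word.endswith("s"):
        return word[:-1]
    return word


for w in ["ending", "running", "sing", "words", "bus"]:
    # 'running' becomes 'runn', 'sing' becomes 's', 'bus' becomes 'bu'
    print(w, "->", naive_stem(w))
```

A real stemmer needs extra rules (doubled consonants, minimum stem length, exceptions) to avoid these cases.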


## 2. Building your vocabulary with a tokenizer

#### natural language processing v.s. programming language compiler

A tokenizer used for compiling computer languages is often called a **scanner**, **lexical analyzer** or **lexer**.
The vocabulary (the set of all the valid tokens) for a computer language is often called a **lexicon**.

| Domain                        | Processor | Tokenizer                        | Vocabulary |
|-------------------------------|-----------|----------------------------------|------------|
| Natural Language Processing   | Parser    | Tokenizer                        | Vocabulary |
| Programming Language Compiler | Compiler  | Scanner, Lexer, Lexical Analyzer | Lexicon    |

A tokenizer breaks unstructured data, natural language text, into chunks of information that can be counted as discrete elements. (unstructured string -> numerical data structure)

#### The simplest tokenizer: splitting on whitespace

In [1]:
sentence = "Thomas Jefferson began building Monticello at the age of 26."

In [7]:
print(sentence.split())

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']


In [8]:
print(str.split(sentence))

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26.']


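As the example shows, the period after the final 26 is not split off. Usually words and punctuation should be separated. We will refine the tokenizer step by step later; for now, we focus on the pipeline.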

### one-hot vector


In [4]:
import numpy as np
token_sequence = str.split(sentence)
vocab = sorted(set(token_sequence))

In [6]:
print(vocab)

['26.', 'Jefferson', 'Monticello', 'Thomas', 'age', 'at', 'began', 'building', 'of', 'the']


In [15]:
num_tokens = len(token_sequence)
print('the corpus has {} tokens.'.format(num_tokens))

vocab_size = len(vocab)
print('the corpus has {} unique tokens (vocab size).'.format(vocab_size))

onehot_vectors = np.zeros((num_tokens, vocab_size), int)  # #rows = #token, #cols = |vocab|
for i, word in enumerate(token_sequence):
    onehot_vectors[i, vocab.index(word)] = 1
    print('The {}th word: {}\tone-hot: {}'.format(i, word, onehot_vectors[i]))

the corpus has 10 tokens.
the corpus has 10 unique tokens (vocab size).
The 0th word: Thomas	one-hot: [0 0 0 1 0 0 0 0 0 0]
The 1th word: Jefferson	one-hot: [0 1 0 0 0 0 0 0 0 0]
The 2th word: began	one-hot: [0 0 0 0 0 0 1 0 0 0]
The 3th word: building	one-hot: [0 0 0 0 0 0 0 1 0 0]
The 4th word: Monticello	one-hot: [0 0 1 0 0 0 0 0 0 0]
The 5th word: at	one-hot: [0 0 0 0 0 1 0 0 0 0]
The 6th word: the	one-hot: [0 0 0 0 0 0 0 0 0 1]
The 7th word: age	one-hot: [0 0 0 0 1 0 0 0 0 0]
The 8th word: of	one-hot: [0 0 0 0 0 0 0 0 1 0]
The 9th word: 26.	one-hot: [1 0 0 0 0 0 0 0 0 0]


For better visualization, we use the **Pandas** library.
In the DataFrame, each row is the one-hot vector of one token.

In [14]:
import pandas as pd
pd.DataFrame(onehot_vectors, columns=vocab)

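|   | 26. | Jefferson | Monticello | Thomas | age | at | began | building | of | the |
|---|-----|-----------|------------|--------|-----|----|-------|----------|----|-----|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |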


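You can picture one-hot encoding as playing the piano: each key stands for one token, laid out in order, and only one key can be played at a time. For example, the first note is the 4th key (Thomas), the second is the 2nd key (Jefferson), and so on.


With one-hot vectors, we turn a token into numbers that a computer can understand and compute with.
One-hot vectors like these are commonly: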

- used in neural nets
- sequence-to-sequence language models
- generative language models
- etc.

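#### **Point++**: no information is lost
(Only the original whitespace cannot be recovered, but whitespace itself carries very little information.)

#### **Point--**: the one-hot vector representation yields an extremely sparse matrix.

For real-world applications this is impractical. A rough estimate:
- Suppose we have 3,000 books, each with 3,500 lines of 15 words. The total number of rows in the table is: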

`#rows = 3000*3500*15 = 157500000`

Suppose the vocabulary has 100,000 words; the total number of columns is then:

`#cols = 100000`

If each cell of the matrix is stored in 1 byte, the total size of the matrix is:

`#size_in_byte = 157500000 * 100000 = 15750000000000`

So the total size is 15.75 TB:

`size_in_byte / 1e12 = 15.75`

So we need to apply **dimension reduction** to this one-hot matrix.
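
The estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the assumed figures from the text (3,000 books, 3,500 lines, 15 words per line, 100,000-word vocabulary, 1 byte per cell); the variable names are chosen here for illustration.

```python
num_books = 3000
lines_per_book = 3500
words_per_line = 15
est_vocab_size = 100_000
bytes_per_cell = 1

# One row per token in the corpus, one column per vocabulary entry.
num_rows = num_books * lines_per_book * words_per_line
size_in_bytes = num_rows * est_vocab_size * bytes_per_cell
size_in_tb = size_in_bytes / 1e12  # decimal terabytes

print(num_rows)       # 157500000
print(size_in_bytes)  # 15750000000000
print(size_in_tb)     # 15.75
```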



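### Bag-of-Words

In many cases, word order has little effect on our understanding of what a sentence means.

So we make an assumption: the meaning of a sentence can be gleaned from just the words themselves.

We can therefore put all the words in a bag and ignore their order. Each bag represents one document, for example a sentence.

A bag-of-words vector can be obtained by summing the one-hot vectors of every token in a document. (Combining them with an element-wise OR instead gives the binary variant, the set of words described below.) So a bag-of-words vector counts the **frequency** of words, **not their order**. Before, each token had its own vector of length |vocab|; now each document has a single vector of length |vocab|.

A one-hot vector is like a solo, one key at a time; a bag of words is more like a chord, where several keys can be played at once.

Using bag of words thus compresses the initial one-hot matrix.

In addition, bag-of-words vectors can be used to index documents, because they let us quickly check whether a given document contains a given word.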
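
As a small illustration of the relationship between one-hot rows and a bag-of-words vector, the sketch below uses plain Python lists and a made-up example sentence (not from the corpus in these notes). Summing the columns gives counts; taking the column-wise maximum (equivalent to OR for 0/1 values) gives the binary variant.

```python
doc = "the cat sat on the mat".split()  # hypothetical example document
vocab = sorted(set(doc))                # ['cat', 'mat', 'on', 'sat', 'the']

# One one-hot row per token, one column per vocabulary entry.
onehot = [[1 if v == tok else 0 for v in vocab] for tok in doc]

# Column-wise sum -> word counts (bag of words).
bow_counts = [sum(col) for col in zip(*onehot)]
# Column-wise max (OR on 0/1 values) -> presence flags (set of words).
bow_binary = [max(col) for col in zip(*onehot)]

print(dict(zip(vocab, bow_counts)))
print(dict(zip(vocab, bow_binary)))
```

Six one-hot rows collapse into a single vector of length |vocab|, at the cost of losing the token order.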

#### Set of words

The code below implements a set of words (the difference is that each token has only two states, 0: absent and 1: present, with no counting).

For a set of words, each vector is a binary vector (0 and 1). The advantage of binary vectors is that all modern CPUs have hardwired memory addressing instructions that can efficiently hash, index, and search a large set of binary vectors like this.

In [47]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1

print(sorted(sentence_bow.items()))

[('26.', 1), ('Jefferson', 1), ('Monticello', 1), ('Thomas', 1), ('age', 1), ('at', 1), ('began', 1), ('building', 1), ('of', 1), ('the', 1)]


Because each token's count can only be 0 or 1, we can store the tokens that occur directly in a set, which saves space.
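
A minimal sketch of that idea, using Python's built-in `set` (the name `token_set` is chosen here for illustration): membership tests replace the 0/1 lookup.

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."
token_set = set(sentence.split())

print("Monticello" in token_set)  # True: the token occurs in the sentence
print("Paris" in token_set)       # False: the token does not occur
```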

In [17]:
import pandas as pd
df = pd.DataFrame(pd.Series(dict([(token, 1) for token in sentence.split()])), columns=['sent']).T

In [18]:
df

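|      | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. |
|------|--------|-----------|-------|----------|------------|----|-----|-----|----|-----|
| sent | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |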


Next, we add a few more sentences to the corpus.

In [20]:
sentences = "Thomas Jefferson began building Monticello at the age of 26.\n"
sentences += "Construction was done mostly by local masons and carpenters.\n"
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."

print(sentences)

Thomas Jefferson began building Monticello at the age of 26.
Construction was done mostly by local masons and carpenters.
He moved into the South Pavilion in 1770.
Turning Monticello into a neoclassical masterpiece was Jefferson's obsession.


In [26]:
import pprint

corpus = {}

"""
Normally you should use .splitlines() 
but here you explicitly add a single '\n' character to the end of each line/
sentence, so you need to explicitly split on this character.
"""
for i, sent in enumerate(sentences.split('\n')):  # build a bag-of-words vector for each sentence
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())
    
pp = pprint.PrettyPrinter()
pp.pprint(corpus)

{'sent0': {'26.': 1,
           'Jefferson': 1,
           'Monticello': 1,
           'Thomas': 1,
           'age': 1,
           'at': 1,
           'began': 1,
           'building': 1,
           'of': 1,
           'the': 1},
 'sent1': {'Construction': 1,
           'and': 1,
           'by': 1,
           'carpenters.': 1,
           'done': 1,
           'local': 1,
           'masons': 1,
           'mostly': 1,
           'was': 1},
 'sent2': {'1770.': 1,
           'He': 1,
           'Pavilion': 1,
           'South': 1,
           'in': 1,
           'into': 1,
           'moved': 1,
           'the': 1},
 'sent3': {"Jefferson's": 1,
           'Monticello': 1,
           'Turning': 1,
           'a': 1,
           'into': 1,
           'masterpiece': 1,
           'neoclassical': 1,
           'obsession.': 1,
           'was': 1}}


In [27]:
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T
df

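|       | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. | ... | South | Pavilion | in | 1770. | Turning | a | neoclassical | masterpiece | Jefferson's | obsession. |
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |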


### Similarity between sentences

One way to check for the similarities between sentences is to count the number of overlapping tokens using a **dot product**. (also called the scalar product because it produces a single scalar value as its output)

In [28]:
# pd.np is deprecated; use NumPy directly (np was imported above)
v1 = np.array([1, 2, 3])
v2 = np.array([2, 3, 4])
v1.dot(v2)

20

In [29]:
(v1 * v2).sum()  # fast

20

In [30]:
sum([x1 * x2 for x1, x2 in zip(v1, v2)])  # slow 

20

In [32]:
# use the numpy matrix product operator: the np.matmul() function or the @ operator
v1.reshape(-1, 1).T @ v2.reshape(-1, 1)

array([[20]])

In [33]:
df

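|       | Thomas | Jefferson | began | building | Monticello | at | the | age | of | 26. | ... | South | Pavilion | in | 1770. | Turning | a | neoclassical | masterpiece | Jefferson's | obsession. |
|-------|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sent0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| sent3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |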


In [38]:
df = df.T
df

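|            | sent0 | sent1 | sent2 | sent3 |
|------------|-------|-------|-------|-------|
| Thomas     | 1 | 0 | 0 | 0 |
| Jefferson  | 1 | 0 | 0 | 0 |
| began      | 1 | 0 | 0 | 0 |
| building   | 1 | 0 | 0 | 0 |
| Monticello | 1 | 0 | 0 | 1 |
| at         | 1 | 0 | 0 | 0 |
| the        | 1 | 0 | 1 | 0 |
| age        | 1 | 0 | 0 | 0 |
| of         | 1 | 0 | 0 | 0 |
| 26.        | 1 | 0 | 0 | 0 |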


In [39]:
df.sent0.dot(df.sent1)

0

In [40]:
df.sent0.dot(df.sent2)

1

In [41]:
df.sent0.dot(df.sent3)

1

In [42]:
df.sent1.dot(df.sent2)

0

In [43]:
df.sent1.dot(df.sent3)

1

In [44]:
df.sent3.dot(df.sent2)

1

Next, we can inspect which words two sentences have in common.

In [46]:
[(k, v) for (k, v) in (df.sent2 & df.sent3).items() if v]

[('into', 1)]

### A token improvement

Often we need to split not only on whitespace but also on other special characters.

In [48]:
import re

tokens = re.split(r'[-\s.,;!?]+', sentence)
print(tokens)

['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '']


### regex

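Note that when splitting on the `-` character, the `-` must come immediately after the opening bracket, i.e. first in the character class (placing it last or escaping it as `\-` also works). The reason is that `-` has a special meaning inside a character class; for example, `r'[A-Z]'` matches any single uppercase letter.

`re.split` behaves like `str.split`, except that matches of the regular expression act as the separators.

- `[...]` defines a character class; matching any one of its characters is enough
- `(...)` defines a regex group; the whole group must match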
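
The difference between the two forms matters for `re.split`: when the pattern contains a capturing group, the text matched by the group is also returned in the result list. A short sketch on a made-up fragment (the variable names are illustrative):

```python
import re

text = "age of 26."

# Character class only: the separators are discarded.
plain = re.split(r'[-\s.,;!?]+', text)
print(plain)    # ['age', 'of', '26', '']

# Capturing group: the last character matched by the group is kept in the output.
grouped = re.split(r'([-\s.,;!?])+', text)
print(grouped)  # ['age', ' ', 'of', ' ', '26', '.', '']
```

This is why a split with a grouped pattern yields whitespace and punctuation entries that have to be filtered out afterwards.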

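#### Benefits of a compiled regex:
- Faster: Python only caches the compiled objects for the most recently used regular expressions (MAXCACHE=100; the exact limit varies across Python versions), whereas a pattern you compile yourself is always reused.

Next, we try to remove invalid tokens, such as empty strings.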

In [49]:
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
[x for x in tokens if x and x not in '- \t\n.,;!?']

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

In [52]:
# second approach: use filter
list(filter(lambda x: x if x and x not in '- \t\n.,;!?' else None, tokens))

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

### tokenizer libraries:

- spaCy: accurate, flexible, fast, Python
- Stanford CoreNLP: more accurate, less flexible, fast, depends on Java 8
- NLTK: standard used by many NLP contests and comparisons, popular, Python

First, let's look at an NLTK tokenizer.

In [54]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')  # escape $ so it matches a literal dollar sign, not end-of-string
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

It also separates sentence-ending trailing punctuation from tokens that do not contain any other punctuation characters.

#### TreeBank Tokenizer

- separates phrase-terminating punctuation
- English contractions: isn't = is n't
    - use case: syntax trees

In [56]:
from nltk.tokenize import TreebankWordTokenizer
new_sentence = "Monticello wasn't designated as UNESCO World Heritage Site until 1987."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(new_sentence)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']

#### Casual Tokenizer



In [57]:
from nltk.tokenize.casual import casual_tokenize
message = "RT @TJMonticello Best day everrrrrrr at Monticello. Awesommmmmmeeeeeeee day :*)"
casual_tokenize(message)

['RT',
 '@TJMonticello',
 'Best',
 'day',
 'everrrrrrr',
 'at',
 'Monticello',
 '.',
 'Awesommmmmmeeeeeeee',
 'day',
 ':*)']

In [58]:
# reduce_len: reduce the number of repeated characters within a token
# strip_handles: strip usernames
casual_tokenize(message, reduce_len=True, strip_handles=True)

['RT',
 'Best',
 'day',
 'everrr',
 'at',
 'Monticello',
 '.',
 'Awesommmeee',
 'day',
 ':*)']

### n-grams

An n-gram is a sequence containing up to n elements that have been extracted from a sequence of those elements, usually a string. Here we deal with n-grams of words, not characters.

Compared with bag of words, n-grams retain more information (because part of the sequence information is kept). For example, with n-grams a "not" stays attached to its neighboring words. N-grams are one of the ways to maintain context information as data passes through your pipeline.

- First, we build the n-grams
- Second, we select the most likely n-grams by frequency (next chapter): prioritization of n-grams
    - Rare n-grams won't be helpful for classification problems

Below we use NLTK's `ngrams` function to get n-grams. It returns a Python generator (memory efficient).

In [65]:
from nltk.util import ngrams
print(tokens)
valid_tokens = [token for token in tokens if token and token not in ['', ' ']]
print(valid_tokens)  # the valid tokens

['Thomas', ' ', 'Jefferson', ' ', 'began', ' ', 'building', ' ', 'Monticello', ' ', 'at', ' ', 'the', ' ', 'age', ' ', 'of', ' ', '26', '.', '']
['Thomas', 'Jefferson', 'began', 'building', 'Monticello', 'at', 'the', 'age', 'of', '26', '.']


In [66]:
list(ngrams(valid_tokens, 2))  # compute 2-grams

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26'),
 ('26', '.')]

In [67]:
list(ngrams(valid_tokens, 3))  # compute 3-grams

[('Thomas', 'Jefferson', 'began'),
 ('Jefferson', 'began', 'building'),
 ('began', 'building', 'Monticello'),
 ('building', 'Monticello', 'at'),
 ('Monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', '26'),
 ('of', '26', '.')]

The n-grams above are returned as tuples; below we join them back into strings.

In [69]:
two_grams = list(ngrams(valid_tokens, 2))
[" ".join(x) for x in two_grams]

['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26',
 '26 .']

With n-grams, the vocabulary size grows exponentially. When the vocabulary size exceeds the number of documents in your corpus, problems arise and overfitting becomes very likely. (**This is like having fewer equations than features.**)

If your feature vector dimensionality exceeds the number of all your documents, your feature extraction step is counterproductive. So n-grams that occur too infrequently are filtered out.

#### Another problem with n-grams

Some n-grams occur very frequently, such as "at the", yet carry little meaning. If n-grams are that common, they are not really useful for discriminating between the meanings of your documents, so they have little predictive power.

So n-grams that occur too often should be filtered out as well (for example, above 25%).
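
One way to apply both filters is to count, for every bigram, the number of documents it appears in, and keep only those between a lower and an upper bound. The sketch below uses a made-up toy corpus and illustrative thresholds (`min_docs`, `max_doc_ratio`); with only four documents the upper bound is set to 75% rather than 25% so that something survives.

```python
from collections import Counter

docs = [
    "he moved into the house at the end of the year",
    "she arrived at the station in the morning",
    "ice cream melts at the beach",
    "ice cream is sold at the station",
]  # hypothetical toy corpus


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


# Document frequency: in how many documents does each bigram occur?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(bigrams(doc.split())))

min_docs, max_doc_ratio = 2, 0.75  # illustrative thresholds
kept = sorted(
    bg for bg, n in doc_freq.items()
    if n >= min_docs and n / len(docs) <= max_doc_ratio
)
print(kept)  # [('ice', 'cream'), ('the', 'station')]
```

Here "at the" appears in every document and is dropped as too common, while bigrams seen only once are dropped as too rare.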

### Stop words

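Stop words occur with a high frequency but carry much less substantive information about the meaning of a phrase.

#### Should you remove stop words?
- If you have enough data, memory, and processing power, there is no need to remove stop words
    - Stop words have little effect on vocabulary size: there are typically only tens to a few hundred of them, while the vocabulary has tens of thousands of entries.
    - Keeping stop words sometimes lets you discover named entities, such as movie or restaurant names.
- If you have too little data, an oversized vocabulary causes overfitting, and with limited memory you need to control the vocabulary size
    - In that case, prefer better dimension-reduction methods such as stemming over dropping stop words
- Stop word lists obtained from NLTK / sklearn usually change as those libraries are updated.
    - So removing stop words can make your results irreproducible or introduce inconsistencies.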
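
If you do remove stop words, one way to sidestep the reproducibility issue is to pin the list in your own code instead of relying on whatever a library version ships. A minimal sketch, with a tiny hand-written list (`STOP_WORDS` is hypothetical, not the NLTK or sklearn list):

```python
# Hypothetical, hand-pinned stop word list; it will not change with library upgrades.
STOP_WORDS = frozenset({"a", "an", "the", "at", "of", "in", "and"})

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
          'at', 'the', 'age', 'of', '26']

# Compare case-insensitively, but keep the original token spelling.
content_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
print(content_tokens)
```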
    
Next, let's look at the NLTK stop words.

In [71]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chenwang/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [72]:
stop_words = nltk.corpus.stopwords.words('english')
len(stop_words)

179

In [73]:
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [75]:
[sw for sw in stop_words if len(sw) == 1]

['i', 'a', 's', 't', 'd', 'm', 'o', 'y']

As shown above, there are several single-letter stop words, which looks odd, but they make sense once you use the NLTK tokenizer and the Porter stemmer.

Below we can see that sklearn has 318 stop words, 119 of which overlap with NLTK's.

In [76]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
len(sklearn_stop_words)

318

In [78]:
len(sklearn_stop_words.union(stop_words))

378

In [79]:
len(sklearn_stop_words.intersection(stop_words))

119

### Normalizing your vocabulary

#### case folding (case normalization)

case “denormalized”: Hello and hello

It helps consolidate words that are intended to mean the same thing (and be spelled the same way).

Case folding can be done before tokenization by lowercasing the whole document with `doc.lower()`, or after tokenization with a list comprehension.

In [80]:
tokens = ['House', 'Visitor', 'Center']
normalized_tokens = [x.lower() for x in tokens]
print(normalized_tokens)

['house', 'visitor', 'center']


#### When to use case normalization

- Capitalization sometimes carries meaning, for example marking a proper noun. So for **named entity recognition**, think carefully before adding case folding to the processing pipeline.

- Case normalization is particularly useful for a search engine.
    - Use a quoted "keyword" to tell the search engine to turn off case folding

- Camel case can also carry special meaning, e.g. WordPerfect and FedEx. Lowercasing everything may strip these words of their distinctive meaning. So a better approach for case normalization is to lowercase only the first word of a sentence and allow all other words to retain their capitalization.

The best way to find out what works is to try several different approaches, and see which approach gives you the best performance for the objectives of your NLP project.
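
The "lowercase only the first word" strategy can be sketched in a few lines. The helper name `fold_first_word` is made up for illustration, and the function assumes its input is the token list of a single sentence.

```python
def fold_first_word(sentence_tokens):
    """Lowercase only the first token of a sentence; leave the rest untouched."""
    if not sentence_tokens:
        return []
    return [sentence_tokens[0].lower()] + list(sentence_tokens[1:])


print(fold_first_word(['The', 'FedEx', 'truck', 'arrived']))
# ['the', 'FedEx', 'truck', 'arrived']
```

The trade-off is that a sentence starting with a proper noun (for example "Thomas Jefferson began ...") still loses the capital on its first token.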

### Stemming

A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word. 

e.g. housing and houses -> hous.

Each stemmer has one parameter: aggressiveness.

Stemming is important for keyword search or information retrieval (fuzzy search). When you search for a phrase, words that share the same stem can be treated as identical. This "broadens" your search, so you are less likely to miss a relevant document or web page.

Like case folding, stemming:
- improves recall
- loses precision
- reduces dimensionality, which helps avoid overfitting

Below, we first write a simple stemmer ourselves.

In [81]:
def stem(phrase):
    # The strip method ensures that some possessive words can be stemmed along with plurals.
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'") for word in phrase.lower().split()])
print(stem('houses'))
print(stem("Doctor House's calls"))


house
doctor house call


Two of the most popular stemming algorithms are the **Porter** and **Snowball** stemmers.

In [82]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'

### Lemmatization

**Lemmatization**: associate several words together even if their spelling is quite different (lemma: semantic root of a word).

Lemmatization is a potentially more accurate way to normalize a word than stemming or case normalization because it takes into account a word’s meaning.

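For example, stemming "better" might yield "bett" or "bet", so it could be treated as the same word as "bets", "betting", and so on, which is wrong.
With lemmatization, "better" and "good" can be treated as the same, because it is based on meaning.

A lemmatizer uses the following information to determine the lemma:
- a knowledge base of word synonyms (a thesaurus)
- word endings
- part of speech (POS)
    - improves accuracy
    - requires context for each word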
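
To make the role of the part of speech concrete, here is a toy dictionary-based lemmatizer. It is only a sketch: the lemma table `LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins for a real knowledge base, and the POS tags ("n", "v", "a") are chosen here for illustration.

```python
# Hypothetical, hand-written lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "a"): "good",    # adjective
    ("better", "v"): "better",  # verb, as in "to better oneself"
    ("houses", "n"): "house",
    ("ran", "v"): "run",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when unknown."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("better"))           # 'better' (treated as a noun, no entry)
print(toy_lemmatize("better", pos="a"))  # 'good'
print(toy_lemmatize("houses"))           # 'house'
```

Without the right POS the lookup misses and the word comes back unchanged, which is why a lemmatizer benefits from context.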

#### When to use lemmatization and stemming
- Both methods reduce your vocabulary size and increase the ambiguity of the text.
- In search scenarios, both methods can (negligibly) improve recall and (significantly) reduce precision (by returning irrelevant words / documents).
- Lemmatization usually works better than stemming (which is why some packages, such as spaCy, do not provide stemming)
- Stemmers are generally faster
- Stemming is typically used for information retrieval
- You can run a lemmatizer before the stemmer for better results
    - Because a lemma is a valid English word, a stemmer works well on the output of a lemmatizer
- Bottom line: avoid lemmatization and stemming unless the vast majority of your documents contain no meaningful capitalized words, such as FedEx.

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("better")  # no POS given, so the word is returned unchanged

If no POS is specified, NLTK defaults to noun.

## 3. Sentiment

Every token carries rich information, and one kind of information is sentiment. One application area of NLP is sentiment analysis.

The author claims NLP has less chance of bias.

A chatbot needs to understand the user's sentiment.

If you can't say something nice, don't say anything at all. So you need your bot to measure the niceness of everything you're about to say and use that to decide whether to respond.

positivity: -1 ~ 1

- rule-based (heuristics)
    - keywords - mapping to numeric scores/weights
    - sum the scores to get the sentiment score of a sentence
    - VADER algorithm
- machine learning
    - self-labeled dataset
        - twitter: #happy - can be treated as a label
        - product review: 5 star
### 3.1 VADER: A rule-based sentiment analyzer

VADER: Valence Aware Dictionary for sEntiment Reasoning. "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text" by Hutto and Gilbert (http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf)

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

In [92]:
sentences = ["VADER is smart, handsome, and funny.", # positive sentence example
             "VADER is smart, handsome, and funny!", # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!",# booster words & punctuation make this close to ceiling for score
             "The book was good.",         # positive sentence
             "The book was kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "A really bad, horrible book.",       # negative sentence with booster words
             "At least it isn't a horrible book.", # negated negative sentence with contraction
             ":) and :D",     # emoticons handled
             "",              # an empty string is correctly handled
             "Today sux",     #  negative slang handled
             "Today sux!",    #  negative slang with punctuation emphasis handled
             "Today SUX!",    #  negative slang with capitalization emphasis
             "Today kinda sux! But I'll get by, lol" # mixed sentiment example with slang and constrastive conjunction "but"
            ]

In [95]:
tricky_sentences = [
    "Most automated sentiment analysis tools are shit.",
    "VADER sentiment analysis is the shit.",
    "Sentiment analysis has never been good.",
    "Sentiment analysis with VADER has never been this good.",
    "Warren Beatty has never been so entertaining.",
    "I won't say that the movie is astounding and I wouldn't claim that the movie is too banal either.",
    "I like to hate Michael Bay films, but I couldn't fault this one",
    "It's one thing to watch an Uwe Boll film, but another thing entirely to pay for it",
    "The movie was too good",
    "This movie was actually neither that funny, nor super witty.",
    "This movie doesn't care about cleverness, wit or any other kind of intelligent humor.",
    "Those who find ugly meanings in beautiful things are corrupt without being charming.",
    "There are slow and repetitive parts, BUT it has just enough spice to keep it interesting.",
    "The script is not fantastic, but the acting is decent and the cinematography is EXCELLENT!",
    "Roger Dodger is one of the most compelling variations on this theme.",
    "Roger Dodger is one of the least compelling variations on this theme.",
    "Roger Dodger is at least compelling as a variation on the theme.",
    "they fall in love with the product",
    "but then it breaks",
    "usually around the time the 90 day warranty expires",
    "the twin towers collapsed today",
    "However, Mr. Carter solemnly argues, his client carried out the kidnapping under orders and in the ''least offensive way possible.''"
]

In [97]:
sid = SentimentIntensityAnalyzer()
for sentence in sentences:
    print(sentence)
    ss = sid.polarity_scores(sentence)
    for k in sorted(ss):
        print('{0}: {1}, '.format(k, ss[k]), end='')
    print()

This cell raises a `LookupError` (`Resource vader_lexicon not found`) when the VADER lexicon has not been downloaded yet. Run `nltk.download('vader_lexicon')` first, then re-run the cell.


### 3.2 Naive Bayes

A Naive Bayes model tries to find keywords in a set of documents that are **predictive** of your target (output) variable. (a.k.a feature selection)

Compared with VADER, there is no need to supply scores by hand: the internal coefficients of Naive Bayes map tokens to scores (just like VADER). The machine will find the “best” scores for any problem.
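
To show how such token scores can be learned rather than handcrafted, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. The training data is a made-up toy set, and the helper names (`log_likelihood`, `predict`, `token_score`) are chosen for illustration; a real project would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled toy data: (text, label).
train = [
    ("good great fun", "pos"),
    ("great acting good plot", "pos"),
    ("bad boring plot", "neg"),
    ("horrible bad acting", "neg"),
]

class_counts = Counter(label for _, label in train)
token_counts = defaultdict(Counter)
for text, label in train:
    token_counts[label].update(text.split())
vocab = {tok for counts in token_counts.values() for tok in counts}


def log_likelihood(token, label):
    """log P(token | label) with add-one (Laplace) smoothing."""
    total = sum(token_counts[label].values())
    return math.log((token_counts[label][token] + 1) / (total + len(vocab)))


def predict(text):
    """Return the label with the highest log posterior; unknown tokens are skipped."""
    scores = {}
    for label in class_counts:
        log_prior = math.log(class_counts[label] / len(train))
        scores[label] = log_prior + sum(
            log_likelihood(tok, label) for tok in text.split() if tok in vocab
        )
    return max(scores, key=scores.get)


def token_score(token):
    """Learned per-token score: log-odds of 'pos' versus 'neg'."""
    return log_likelihood(token, "pos") - log_likelihood(token, "neg")


print(predict("good fun"), predict("bad boring"))
print(round(token_score("good"), 2), round(token_score("bad"), 2))
```

The log-odds returned by `token_score` play the role of the handcrafted lexicon scores: positive for tokens seen mostly in positive examples, negative for the opposite.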
