# 2. NLP Pipeline

> The whole is more than the sum of its parts. It is more correct to say that the whole is something else than the sum of its parts, because summing up is a meaningless procedure, whereas the whole-part relationship is meaningful.
> -- Kurt Koffka

The general approach to solving an NLP problem:
1. Break the big problem into several smaller sub-problems.
2. Solve these sub-problems step by step.

This step-by-step processing of text is known as a **pipeline**.


#### The figure below shows the main components of a generic pipeline for modern-day, data-driven NLP system development:

1. Data acquisition

2. Text cleaning

3. Pre-processing

4. Feature engineering

5. Modeling

6. Evaluation

7. Deployment

8. Monitoring and model updating

<img src="../figures/2-1.png" alt="drawing" width="600"/>

Depending on the phase of the project, different steps can take different amounts of time. In the initial phases, most of the time is used in modeling and evaluation, whereas once the system matures, feature engineering can take far more time.

## 1. Data Acquisition

This section covers various strategies for gathering relevant data for an NLP project.

In the ideal scenario, we already have the data along with its labels. Below, we focus on the less-than-ideal scenarios.

### 1.1 Use a public dataset

- NLP dataset: https://github.com/niderhoff/nlp-datasets
- Google dataset search: https://datasetsearch.research.google.com/

Caveat: a public dataset may differ from the data seen in production.

### 1.2 Scrape data

Collect data from the web (for example, from forums), then have human annotators label it.

### 1.3 Product intervention

An AI model ultimately serves a product. The AI team should work with the product team to collect more and richer data by developing better instrumentation in the product.

Product intervention is often the best way to collect data for building intelligent applications in industrial settings. 

### 1.4 Data augmentation

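One problem with the approaches above is that continuously collecting product data takes time.

Data augmentation addresses this by creating text that is syntactically similar to the source text data.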

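#### Word replacement
- Synonym replacement: randomly choose K tokens that are not stop words and replace them with their synonyms.
- TF-IDF–based word replacement
- Replacing entities: identify entities, then replace each with another of the same type, e.g., replace a person name with another person name, a city with another city, etc.
- Adding noise to data (replacing words with misspelled variants)
    - Spelling errors are common in real applications; adding noise helps train more robust models.
    - Randomly choose a word in a sentence and replace it with another word that is closer in spelling to the first word.
    - "Fat finger" errors: replace a character with one next to it on the keyboard.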
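
The keyboard-neighbor noise described in the list above can be sketched with the standard library alone. This is a minimal illustration, not a reference implementation: the `KEY_NEIGHBORS` map is a hypothetical, partial QWERTY adjacency table, and `fat_finger`, `rate`, and `seed` are names introduced here for the example.

```python
import random

# Hypothetical, partial QWERTY adjacency map; extend it for real use.
KEY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "r": "edft",
}

def fat_finger(text, rate=0.1, seed=None):
    """Replace some characters with a neighboring key to simulate typos."""
    rng = random.Random(seed)  # a seed makes the augmentation reproducible
    out = []
    for ch in text:
        neighbors = KEY_NEIGHBORS.get(ch)
        # Only lowercase characters present in the map can be corrupted.
        if neighbors and rng.random() < rate:
            out.append(rng.choice(neighbors))
        else:
            out.append(ch)
    return "".join(out)

source = "the restaurant is not that great"
noisy = fat_finger(source, rate=0.3, seed=0)
print(noisy)
```

Keeping `rate` low matters in practice: the augmented sentence should stay recognizable so that the original label still applies.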


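#### Creating documents with similar meaning
- Back translation: translate a sentence into another language and then back again, which yields a sentence very similar to the original.
    - Commonly used for text classification.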

<img src="../figures/2-2.png" alt="drawing" width="600"/>

- Snorkel
    - using heuristics and creating synthetic data by transforming existing data and creating new data samples
    - ref: Snorkel DryBell: [A Case Study in Deploying Weak Supervision at Industrial Scale](https://arxiv.org/pdf/1812.00417.pdf)

- NLP data augmentation library:

    - Easy Data Augmentation (EDA)
    - NLPAug

#### Local modifications

- Bigram flipping
    - Split a sentence into bigrams, pick one bigram at random, and swap its two words.
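
Bigram flipping is simple enough to sketch directly. The version below assumes whitespace tokenization, and the function name `flip_random_bigram` is made up for this example.

```python
import random

def flip_random_bigram(sentence, seed=None):
    """Swap the two words of one randomly chosen bigram."""
    rng = random.Random(seed)
    tokens = sentence.split()
    if len(tokens) < 2:
        return sentence  # nothing to flip
    i = rng.randrange(len(tokens) - 1)  # index of the bigram's first word
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)

original = "I am going to the supermarket"
flipped = flip_random_bigram(original, seed=0)
print(flipped)
```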
    
#### Active learning
When we have a large amount of unlabeled data, manually labeling all of it is costly.
The question to answer is: for which data points should we ask for labels to maximize learning while keeping the labeling cost low?
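
One common way to answer this question is uncertainty sampling: ask for labels on the examples the current model is least sure about. The sketch below uses the least-confidence criterion, which is an assumption on our part (these notes do not prescribe a particular strategy), and the probabilities are invented for illustration.

```python
def least_confident(probabilities, k):
    """Return the indices of the k examples whose top class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:k]

# Hypothetical model outputs: class probabilities for four unlabeled texts.
probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
to_label = least_confident(probs, k=2)
print(to_label)  # [3, 1]
```

The selected examples are sent to annotators, the model is retrained on the enlarged labeled set, and the loop repeats.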

## 2. Text Extraction and Cleanup

As shown above, data can come from many sources: public datasets, labeled datasets, and augmented datasets. The next step, data cleaning, is therefore an important one.

Text extraction and cleanup means extracting raw text from the input data by:
1. removing all non-textual information, such as markup, metadata, etc.,
2. and converting the text to the required encoding format.

<img src="../figures/2-3.png" alt="drawing" width="600"/>

### 2.1 HTML parsing and cleanup

If we need to crawl data from a forum, for example to extract question-answer pairs, we find that questions and answers have special tags associated with them.

We can therefore use this known HTML document structure to extract the information we want. Below, we walk through an example that extracts a question-answer pair from Stack Overflow. Commonly used libraries are Beautiful Soup and Scrapy.

#### Web scraping using Beautiful Soup

Beautiful Soup is one of many libraries that allow us to scrape web pages. Depending on your needs, you can choose among the available options, such as Beautiful Soup, Scrapy, Selenium, etc.

```python
# Make the necessary imports
from pprint import pprint
from bs4 import BeautifulSoup
from urllib.request import urlopen
```

```python
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"  # specify the URL
html = urlopen(myurl).read()  # query the website so that it returns an HTML page
soupified = BeautifulSoup(html, 'html.parser')  # parse the HTML in the 'html' variable and store it in Beautiful Soup format
```

```python
pprint(soupified.prettify())  # to get an idea of the HTML structure of the web page
```

```
('<!DOCTYPE html>\n'
 '<html class="html__responsive" itemscope="" '
 'itemtype="http://schema.org/QAPage">\n'
 ' <head>\n'
 '  <title>\n'
 '   datetime - How to get the current time in Python - Stack Overflow\n'
 '  </title>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196" '
 'rel="shortcut icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="apple-touch-icon"/>\n'
 '  <link '
 'href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a" '
 'rel="image_src"/>\n'
 '  <link href="/opensearch.xml" rel="search" title="Stack Overflow" '
 'type="application/opensearchdescription+xml"/>\n'
 '  <link '
 'href="https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python" '
 'rel="canonical">\n'
 '   <meta content="width=device-width, height=device-height, '
 'initial-scale=1.0, minimum-scale=1.0" name="viewport"/>\n'
 '   
```

```python
soupified.title  # to get the title of the web page
```

```
<title>datetime - How to get the current time in Python - Stack Overflow</title>
```

Next, let's see how to extract the question. The screenshot of the page below shows where the question appears.

<img src="fig/stackoverflow_question.png" alt="drawing" width="600"/>


```python
question = soupified.find("div", {"class": "question"})  # find the necessary tag and the class it belongs to
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())
```

```
Question: 
 What is the module/method used to get the current time?
```


Next, the best answer:

<img src="fig/stackoverflow_best_answer.png" alt="drawing" width="600"/>



```python
answer = soupified.find("div", {"class": "answer"})  # find the necessary tag and the class it belongs to
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())
```

```
Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
```


### 2.2 Unicode Normalization

Scenarios:
1. There are multiple encoding schemes, and the default encoding can differ between operating systems.
2. Unicode characters, including symbols, emojis, and graphic characters.

Unicode normalization (text encoding) is covered in Chapter 8.

```python
text = 'I love 🍕!  Shall we book a 🚕 to gizza'
text
```

```
'I love 🍕!  Shall we book a 🚕 to gizza'
```

```python
text_utf_8 = text.encode("utf-8")
print(text_utf_8)
```

```
b'I love \xf0\x9f\x8d\x95!  Shall we book a \xf0\x9f\x9a\x95 to gizza'
```
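
The cell above only encodes text as UTF-8. As a small complementary sketch, Python's standard `unicodedata` module shows why normalization matters: the same visible string can be stored as different code point sequences. The example word is our own choice for illustration.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' stored as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# The two strings render identically but do not compare equal.
print(composed == decomposed)  # False

# Normalizing to the composed form (NFC) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)  # True
```

Normalizing all incoming text to one form before tokenization or deduplication avoids treating such variants as different strings.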


### 2.3 Spelling Correction

Spelling correction is prevalent in search engines, etc.

Misspellings may hurt the linguistic understanding of the data.

Below, we use the Bing Spell Check cloud service.

```python
import requests
import json

api_key = "<ENTER-KEY-HERE>"
# Assumed Bing Spell Check v7 endpoint; check your own service resource for the exact URL
endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/spellcheck"
example_text = "Hollo, wrld"  # the text to be spell-checked

data = {'text': example_text}
params = {
    'mkt': 'en-us',
    'mode': 'proof'
}
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
}
response = requests.post(endpoint, headers=headers, params=params, data=data)
json_response = response.json()
print(json.dumps(json_response, indent=4))
```

After running it, you will see a result similar to the following.

```
"suggestions": [
            {
               "suggestion": "Hello",
               "score": 0.9115257530801
            },
            {
               "suggestion": "Hollow",
               "score": 0.858039839213461
            },
            {
               "suggestion": "Hallo",
               "score": 0.597385084464481
            }
```

### 2.4 System-Specific Error Correction

Different data sources bring different problems.

#### PDF to text
Extracting text from PDF documents.

[What's so hard about PDF text extraction?](https://filingdb.com/b/pdf-text-extraction#:~:text=The%20main%20problem%20is%20that,control%20over%20the%20resulting%20document)

#### OCR (Optical Character Recognition)
Extracting text from scanned images.

For example, suppose we want to extract the text from the image below.

<img src="../figures/2-5.png" alt="drawing" width="600"/>



```python
from PIL import Image
from pytesseract import image_to_string
```

```python
filename = "../figures/2-5.png"
text = image_to_string(Image.open(filename))
print(text)
```

```
In the nineteenth century the only kind of linguistics considered
seriously was this comparative and historical study of words in |: eS
known or believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen wil remember how firmly he
declares that linguistic science is historical And those who have noticed
```

Figure 2-5. An example of scanned text



As we can see, OCR output can contain errors. Common ways to handle them:

- spelling correction: [pyenchant](https://github.com/pyenchant/pyenchant)
- post-correction: [ochre: Toolbox for OCR post-correction](https://github.com/KBNLresearch/ochre)

#### ASR (Automatic Speech Recognition)

Converting speech to text also introduces errors. The same two approaches described above can be used for cleanup.

## 3. Pre-Processing

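#### How pre-processing differs from the earlier cleanup

Suppose we crawled a web page from Wikipedia. In the text extraction and cleanup stage, we did the following:
- remove HTML tags
- convert the text encoding
- get clean text

In the pre-processing step, we need to:
- split the document into sentences
- split sentences into words
- remove some words (for example, stop words)
- apply case normalization
- ...

Which pre-processing steps to use usually depends on your application. We can roughly group these steps into the following categories: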

- Preliminaries
    - Sentence segmentation 
    - word tokenization.

- Frequent steps
    - Stop word removal
    - stemming and lemmatization
    - removing digits/punctuation
    - lowercasing
    - etc.

- Other steps
    - Normalization
    - language detection
    - code mixing
    - transliteration
    - etc.

- Advanced processing
    - POS tagging
    - parsing
    - coreference resolution
    - etc.

### 3.1 Preliminaries

#### Sentence Segmentation

We use the NLTK sentence tokenizer.

```python
# Assumes the NLTK tokenizer data has already been downloaded with nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize

mytext = """In the previous chapter, we saw examples of some common NLP 
applications that we might encounter in everyday life. If we were asked to 
build such an application, think about how we would approach doing so at our 
organization. We would normally walk through the requirements and break the 
problem down into several sub-problems, then try to develop a step-by-step 
procedure to solve them. Since language processing is involved, we would also
list all the forms of text processing needed at each step. This step-by-step 
processing of text is known as pipeline. It is the series of steps involved in
building any NLP model. These steps are common in every NLP project, so it 
makes sense to study them in this chapter. Understanding some common procedures
in any NLP pipeline will enable us to get started on any NLP problem encountered 
in the workplace. Laying out and developing a text-processing pipeline is seen 
as a starting point for any NLP application development process. In this
chapter, we will learn about the various steps involved and how they play  
important roles in solving the NLP problem and we’ll see a few guidelines
about when and how to use which step. In later chapters, we’ll discuss  
specific pipelines for various NLP tasks (e.g., Chapters 4–7)."""

my_sentences = sent_tokenize(mytext)
```

```python
len(my_sentences)
```

```
11
```

```python
my_sentences[0]
```

```
'In the previous chapter, we saw examples of some common NLP \napplications that we might encounter in everyday life.'
```

#### Word Tokenization

```python
word_tokenize(my_sentences[0])
```

```
['In',
 'the',
 'previous',
 'chapter',
 ',',
 'we',
 'saw',
 'examples',
 'of',
 'some',
 'common',
 'NLP',
 'applications',
 'that',
 'we',
 'might',
 'encounter',
 'in',
 'everyday',
 'life',
 '.']
```

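Note that off-the-shelf tokenizers are far from perfect and may not suit your use case. For example, Twitter messages contain many emoticons. In the example below, a smiley is split into three separate characters. So we sometimes need to write a custom tokenizer that fits our own scenario.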

```python
emoji = ':-)'

word_tokenize(emoji)
```

```
[':', '-', ')']
```
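
A custom tokenizer for this situation can be sketched with a regular expression that tries emoticons before words and punctuation. The pattern below is an assumption for illustration only: it covers a handful of common western emoticons and is far from exhaustive.

```python
import re

# Order matters: emoticons first, then words, then any other single symbol.
TOKEN_RE = re.compile(r"[:;=][-']?[)(\]\[DPp]|\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens while keeping simple emoticons intact."""
    return TOKEN_RE.findall(text)

tokens = tokenize("Great game :-) see you tomorrow!")
print(tokens)
# ['Great', 'game', ':-)', 'see', 'you', 'tomorrow', '!']
```

For real social media text, a purpose-built tokenizer (or a much larger emoticon list) is usually a better choice than a hand-written pattern like this one.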

### 3.2 Frequent Steps

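#### Remove stop words
Stop words carry little semantic information and need to be removed in some tasks. For example, in text classification, stop words do little to distinguish between classes, because every document contains many of them.

NLTK provides a default list of stop words.

#### Case normalization
In some scenarios, such as classification, case makes no difference to meaning, so all text can be converted to lowercase.

However, case normalization may not be appropriate for named entity recognition, because some proper nouns may no longer be recognized.

#### Remove punctuation and numbers
Often used in text classification, information retrieval, and social media analytics. We also remove numbers in log analysis.

None of these three steps is mandatory; whether to apply them depends on the use case.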

```python
from nltk.corpus import stopwords
from string import punctuation
```


```python
mystopwords = set(stopwords.words("english"))

# Process a single tokenized sentence
def remove_stops_digits(tokens):
    # Lowercase first so that capitalized stop words such as "In" are also removed
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in mystopwords and
            not token.isdigit() and token not in punctuation]

# Process a collection of texts
def preprocess_corpus(texts):
    return [remove_stops_digits(word_tokenize(text)) for text in texts]
```

```python
remove_stops_digits(word_tokenize(my_sentences[0]))
```

```
['previous',
 'chapter',
 'saw',
 'examples',
 'common',
 'nlp',
 'applications',
 'might',
 'encounter',
 'everyday',
 'life']
```

#### Stemming 

Stemming is also a form of normalization. **Stemming** refers to the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form.

Implementation: apply a fixed set of rules.

<img src="../figures/2-7.png" alt="drawing" width="600"/>

Stemming is commonly used in:
- **search engines** to match user queries to relevant documents 
- **text classification** to reduce the feature space to train machine learning models.

Below is a demonstration using NLTK. Note that a stem is not always a linguistically correct form, for example "revolut".

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
word1, word2 = 'cars', 'revolution'
print(stemmer.stem(word1), stemmer.stem(word2))
```

```
car revolut
```


#### Lemmatization

Lemmatization is the process of mapping all the different forms of a word to its base word, or lemma. Unlike a stem, a lemma is linguistically correct.

Lemmatization requires more linguistic knowledge, and modeling and developing efficient lemmatizers remains an open problem in NLP research even now.



```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # a is for adjective
print(lemmatizer.lemmatize("best", pos="a"))  # a is for adjective
```

```
good
best
```


```python
import spacy

sp = spacy.load('en_core_web_sm')
token = sp(u'better')
for word in token:
    print(word.text,  word.lemma_)

token = sp(u'best')
for word in token:
    print(word.text,  word.lemma_)
```

```
better well
best good
```


Lemmatization is generally more complex and slower than stemming, so it is used only when it is really needed.




### 3.3 Other Pre-Processing Steps

Above, we covered the common steps. Below are some more specialized processing steps.

#### Text normalization

Scenario: the same phone number written in different ways: 13888888888, 138-8888-8888, 138-888-88888

Solutions:
- convert all text to lowercase or uppercase
- convert digits to text (e.g., 9 to nine)
- expand abbreviations
- etc.

spaCy uses a dictionary mapping to perform text normalization: https://github.com/explosion/spaCy/blob/master/spacy/lang/norm_exceptions.py
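
For the phone number scenario above, one possible canonical form is simply the digits with all separators removed. The helper name `normalize_phone` is hypothetical, and a real system would likely also handle country codes and extensions.

```python
import re

def normalize_phone(raw):
    """Keep digits only so that different spellings of a number compare equal."""
    return re.sub(r"\D", "", raw)

variants = ["13888888888", "138-8888-8888", "138-888-88888"]
canonical = {normalize_phone(v) for v in variants}
print(canonical)  # {'13888888888'}
```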

#### Language detection

Scenario: a comment section may contain some non-English comments.

Language detection is usually the first step of the pipeline, typically followed by a language-specific pipeline. A commonly used library is [Polyglot](https://polyglot.readthedocs.io/en/latest/).

#### Code Mixing and Transliteration

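Sometimes a single sentence contains more than one language, just as many of us mix English words into our everyday speech.

- **Code mixing** refers to this phenomenon of switching between languages. 

- **Transliteration** is similar to spelling Chinese names in pinyin, or spelling Japanese names with English letters, so that the whole sentence is written using only English (Latin-script) words.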


### 3.4 Advanced Processing

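#### Use case: named entity recognition
Named entity recognition sometimes requires POS tagging first, because identifying proper nouns helps a great deal in identifying named entities.

Usually, we do not need to build our own POS tagging solution; we can directly use mature libraries such as spaCy and NLTK.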

```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 to Hannah Chaplin \
          born Hannah Harriet Pedlingham Hill Charles Chaplin Sr')

for token in doc:
    print(token.text, token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop)
```

```
Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
to to ADP xx True True
Hannah Hannah PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
                      SPACE      False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False
```


Note that different libraries, such as NLTK and spaCy, may produce different results. Which library to choose for an application is a subjective decision.

#### Use case: relation extraction

The task is to identify the relationships between named entities, for example: Satya Nadella is the CEO of Microsoft. What do we need to do in pre-processing?
1. named entity recognition
2. POS tagging
3. a description of the relation: a syntactic representation of the sentence
4. coreference resolution



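### Pre-processing summary

The steps introduced above are some of the commonly used pre-processing steps.
- Which steps to use should be decided based on the use case and on experience.
- The order in which the pre-processing steps are applied also depends on the use case.

In later chapters, we introduce different pre-processing pipelines for different scenarios. The figure below summarizes the methods covered so far.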

<img src="../figures/2-11.png" alt="drawing" width="600"/>

## 4. Feature Engineering

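After pre-processing, we need to do feature engineering, whose main purpose is to provide input to our ML algorithms, i.e., to convert text into vectors. It is therefore also called text representation.
The goal of feature engineering is to capture the characteristics of the text into a **numeric vector** that can be understood by the ML algorithms.

The figure below shows the two different pipelines for the classical approach and the DL approach.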

<img src="../figures/2-12.png" alt="drawing" width="600"/>

### 4.1 Classical NLP/ML Pipeline

In the classical pipeline, features are usually handcrafted. For example, in sentiment analysis, one can count the number of positive and negative words in each review.

For classical ML models, the features are heavily inspired by the task at hand as well as domain knowledge.
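
The word-counting feature mentioned above can be sketched as follows. The two lexicons are tiny, hypothetical stand-ins for a real sentiment lexicon, and the tokenization is deliberately naive.

```python
# Hypothetical, tiny sentiment lexicons used only for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_counts(review):
    """Turn a review into a 2-dimensional feature vector [positive, negative]."""
    tokens = [t.strip(".,!?") for t in review.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return [pos, neg]

features = sentiment_counts("Great food, but terrible service. I love the decor!")
print(features)  # [2, 1]
```

A vector like this can be fed directly to a classical classifier, and each dimension has an obvious meaning.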

#### Advantage:
Highly interpretable.

#### Disadvantage:
Handcrafted feature engineering becomes a bottleneck for both model performance and the model development cycle.

### 4.2 DL pipeline

The difference is that a DL model is capable of “learning” features from the data.


## 5. Modeling

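Next step: how to build a useful solution out of these features (numeric vectors).

The usual approach is to start with a small amount of data and a relatively simple model, then keep acquiring data, move to more complex models, and optimize performance.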

### 5.1 Start with simple heuristics - good starting point

Human-built heuristics can provide a great start in some ways.

Heuristics remain useful alongside ML/DL methods. Even when we’re building ML-based models, we can use such heuristics to handle special cases, for example, cases where the model has failed to learn well.

- Spam detection: we can maintain a blacklist and directly classify matching emails as spam.
- Information extraction: use regular expressions to extract emails, phone numbers, etc.
    - [TokensRegex](https://nlp.stanford.edu/software/tokensregex.html)
    - [spaCy’s rule-based matching](https://spacy.io/usage/rule-based-matching)
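
As a minimal sketch of regex-based information extraction, the snippet below pulls email addresses out of free text. The pattern is a simplified assumption that covers common addresses only; it is not a full implementation of the email address grammar.

```python
import re

# Simplified pattern: local part, '@', domain, and a top-level domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact alice@example.com or bob.smith@example.org for details."
emails = EMAIL_RE.findall(text)
print(emails)  # ['alice@example.com', 'bob.smith@example.org']
```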

### 5.2 Building your model

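The biggest problem with heuristic-based approaches is that they become hard to manage and maintain as the system grows more complex.

#### Create a feature from the heuristic for your ML model

For example, in spam detection we can add a feature: the number of words from the blacklist in a given email.

#### Pre-process your input to the ML model

If some heuristic rules are highly accurate, use them to filter inputs directly and send only the remaining inputs to the ML model.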
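
Both ideas can be sketched together. In the example below, the blacklist, the threshold, and `dummy_model` are hypothetical placeholders; in practice `model_predict` would be a trained classifier, and `blacklist_count` could equally be appended to its feature vector.

```python
# Hypothetical blacklist of spam-indicating words.
BLACKLIST = {"lottery", "viagra", "winner"}

def blacklist_count(email_text):
    """Heuristic feature: number of blacklisted words in the email."""
    return sum(tok.strip(".,!?").lower() in BLACKLIST for tok in email_text.split())

def classify(email_text, model_predict, threshold=2):
    """Apply the high-precision rule first; fall back to the ML model otherwise."""
    if blacklist_count(email_text) >= threshold:
        return "spam"
    return model_predict(email_text)

def dummy_model(email_text):
    # Stand-in for a trained model; always predicts "ham".
    return "ham"

print(classify("You are a lottery winner!", dummy_model))  # spam
print(classify("Meeting moved to 3pm", dummy_model))       # ham
```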



### 5.3 Building THE model

We start with a baseline approach and work toward improving it, which usually takes many iterations.

Below are some approaches for improving the model.

#### Ensemble and stacking

Use a collection of ML models that deal with different aspects of the prediction problem.

There are typically two ways to combine models, as shown in the figure below:
- **model stacking**: sequentially going from one model to another and obtaining a final output. 
- **model ensembling**: pool predictions from multiple models and make a final prediction.

<img src="../figures/2-14.png" alt="drawing" width="600"/>

Meta-model: a model that uses other models.
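
A simple form of model ensembling is majority voting over the predictions of several models. The sketch below assumes three hypothetical classifiers whose predictions are given as plain lists; voting is only one of several ways to pool predictions.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, all of the same length."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Hypothetical predictions from three models on the same three emails.
model_a = ["spam", "ham", "spam"]
model_b = ["spam", "spam", "ham"]
model_c = ["ham", "ham", "spam"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)  # ['spam', 'ham', 'spam']
```

Using an odd number of models for a binary task avoids ties in the vote.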

#### Better feature engineering

A better feature engineering step may lead to better performance. See Chapter 11.

#### Transfer learning

Transfer learning tries to transfer preexisting knowledge from a big, well-trained model to a newer model at its initial phase. Transfer learning provides a better initialization, which helps in the downstream tasks, especially when the dataset for the downstream task is smaller.

#### Reapplying heuristics

For examples where the ML model makes mistakes, dig deeper to find common patterns in the errors, and use heuristics to correct them.

This adds an extra layer of filtering and protection.

#### Active learning

When there is not enough data, we can use active learning: use user feedback or other such sources to continuously collect new data to build better models.

#### How to choose a model

| Data attribute                                               | Decision path                                                                                                                                                                                                                                 | Examples                                                                                                                                                                  |
|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Large data volume                                            | Can use techniques that require more data, like DL. Can use a richer set of features as well. If the data is sufficiently large but unlabeled, we can also apply unsupervised techniques.                                                     | If we have a lot of reviews and metadata associated with them, we can build a sentiment-analysis tool from scratch.                                                       |
| Small data volume                                            | Need to start with rule-based or traditional ML solutions that are less data hungry. Can also adapt cloud APIs and generate more data with weak supervision. We can also use transfer learning if there’s a similar task that has large data. | This often happens at the start of a completely new project.                                                                                                              |
| Data quality is poor and the data is heterogeneous in nature | More data cleaning and pre-processing might be required.                                                                                                                                                                                      | This entails issues like code mixing (different languages being mixed in the same sentence), unconventional language, transliteration, or noise (like social media text). |
| Data quality is good                                         | Can directly apply off-the-shelf algorithms or cloud APIs more easily.                                                                                                                                                                        | Legal text or newspapers.                                                                                                                                                 |
| Data consists of full-length documents                       | Choose the right strategy for breaking the document into lower levels, like paragraphs, sentences, or phrases, depending on the problem.                                                                                                      | Document classification, review analysis, etc.                                                                                                                            |




## 6. Evaluation

Key step: measure how good the model is.

The most important aspects of evaluation are:
- using the right metric for evaluation: 
    - Different NLP tasks require different evaluation metrics.
    - Different phases use different evaluation metrics: the model building and deployment phases use ML metrics, while the production phase uses business metrics.
- following the right evaluation process

Evaluation falls into two categories:
- intrinsic: focuses on intermediary objectives
    - e.g., precision and recall for spam filter
- extrinsic: focuses on evaluating performance on the final objective
    - e.g., measuring the time a user wasted because a spam email went to their inbox or a genuine email went to their spam folder.

### 6.1 Intrinsic Evaluation

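Intrinsic evaluation usually requires a test set where we have the ground truth or labels.

In some scenarios, such as document classification, evaluation can be automated. In others, such as text summarization, it cannot, and evaluation is more subjective. The table below lists commonly used metrics.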

| Metric                                      | Description                                                                                                                                                                                                                                         | Applications                                                                                                                                                                             |
|---------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Accuracy [48]                               | Used when the output variable is categorical or discrete. It denotes the fraction of times the model makes correct predictions as compared to the total predictions it makes.                                                                       | Mainly used in classification tasks, such as sentiment classification (multiclass), natural language inference (binary), paraphrase detection (binary), etc.                             |
| Precision [48]                              | Shows how precise or exact the model’s predictions are, i.e., given all the positive predictions it makes, how many of them are indeed positive?                                                                                                    | Used in various classification tasks, especially in cases where mistakes in a positive class are more costly than mistakes in a negative class, e.g., disease predictions in healthcare. |
| Recall [48]                                 | Recall is complementary to precision. It captures how well the model can recall the positive class, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly?                                       | Used in classification tasks, especially where retrieving positive results is more important, e.g., e-commerce search and other information-retrieval tasks.                             |
| F1 score [49]                               | Combines precision and recall to give a single metric, which also captures the trade-off between precision and recall, i.e., completeness and exactness. F1 is defined as (2 × Precision × Recall) / (Precision + Recall).                          | Used simultaneously with accuracy in most of the classification tasks. It is also used in sequence-labeling tasks, such as entity extraction, retrieval-based questions answering, etc.  |
| AUC [48]                                    | Captures the count of positive predictions that are correct versus the count of positive predictions that are incorrect as we vary the threshold for prediction.                                                                                    | Used to measure the quality of a model independent of the prediction threshold. It is used to find the optimal prediction threshold for a classification task.                           |
| MRR (mean reciprocal rank) [50]             | Used to evaluate the responses retrieved given their probability of correctness. It is the mean of the reciprocal of the ranks of the retrieved results.                                                                                            | Used heavily in all information-retrieval tasks, including article search, e-commerce search, etc.                                                                                       |
| MAP (mean average precision) [51]           | Used in ranked retrieval results, like MRR. It calculates the mean precision across each retrieved result.                                                                                                                                          | Used in information-retrieval tasks.                                                                                                                                                     |
| RMSE (root mean squared error) [48]         | Captures a model’s performance in a real-value prediction task. Calculates the square root of the mean of the squared errors for each data point.                                                                                                   | Used in conjunction with MAPE in the case of regression problems, from temperature prediction to stock market price prediction.                                                          |
| MAPE (mean absolute percentage error) [52]  | Used when the output variable is a continuous variable. It is the average of absolute percentage error for each data point.                                                                                                                         | Used to test the performance of a regression model. It is often used in conjunction with RMSE.                                                                                           |
| BLEU (bilingual evaluation understudy) [53] | Captures the amount of n-gram overlap between the output sentence and the reference ground truth sentence. It has many variants.                                                                                                                    | Mainly used in machine-translation tasks. Recently adapted to other text-generation tasks, such as paraphrase generation and text summarization.                                         |
| METEOR [54]                                 | A precision-based metric to measure the quality of text generated. It fixes some of the drawbacks of BLEU, such as exact word matching while calculating precision. METEOR allows synonyms and stemmed words to be matched with the reference word. | Mainly used in machine translation.                                                                                                                                                      |
| ROUGE [55]                                  | Another metric to compare quality of generated text with respect to a reference text. As opposed to BLEU, it measures recall.                                                                                                                       | Since it measures recall, it’s mainly used for summarization tasks where it’s important to evaluate how many words a model can recall.                                                   |
| Perplexity [56]                             | A probabilistic measure that captures how confused an NLP model is. It’s derived from the cross-entropy in a next word prediction task. The exact definition can be found at [56].                                                                  | Used to evaluate language models. It can also be used in language-generation tasks, such as dialog generation.                                                                           |

[48] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, Second Edition. New York: Springer, 2001. ISBN: 978-0-387-84857-0

[49] Wikipedia. “F1 score”. Last modified April 18, 2020.

[50] Wikipedia. “Mean reciprocal rank”. Last modified December 6, 2018.

[51] Wikipedia. “Evaluation measures (information retrieval)”. Last modified February 12, 2020.

[52] Wikipedia. “Mean absolute percentage error”. Last modified February 6, 2020.

[53] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: A Method for Automatic Evaluation of Machine Translation.” Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (2002): 311–318.

[54] Banerjee, Satanjeev and Alon Lavie. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (2005): 65–72.

[55] Lin, Chin-Yew. “ROUGE: A Package for Automatic Evaluation of Summaries.” Text Summarization Branches Out (2004): 74–81.

[56] Wikipedia. “Perplexity”. Last modified February 13, 2020.

#### Classification tasks: confusion matrix

A confusion matrix gives a visual evaluation: it helps us understand how “confused” the classification model is when identifying different classes.
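As a minimal sketch of what a confusion matrix contains, the snippet below counts (true label, predicted label) pairs for a toy spam classifier. The labels and predictions are made-up illustration data, and `confusion_matrix` here is a hypothetical helper written for this note, not a library function.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

# Made-up example data for illustration only.
labels = ["spam", "ham"]
y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}", row)
```

Off-diagonal cells show which classes the model mixes up: here one spam message was predicted as ham and one ham message as spam.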

#### Ranking tasks

MRR and MAP are the usual choices. Recall can also be used, for example Recall at rank K: it checks whether the ground truth appears in the top K retrieved results. If it does, the query counts as a success.
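The two ranking metrics can be sketched in a few lines, assuming each query has exactly one ground-truth item. The document IDs and helper names below are hypothetical illustration data, not part of any library.

```python
def recall_at_k(ranked_results, ground_truth, k):
    """1.0 if the ground truth is among the top-k results, else 0.0."""
    return 1.0 if ground_truth in ranked_results[:k] else 0.0

def reciprocal_rank(ranked_results, ground_truth):
    """1 / rank of the ground truth, or 0.0 if it was not retrieved."""
    for rank, item in enumerate(ranked_results, start=1):
        if item == ground_truth:
            return 1.0 / rank
    return 0.0

# (ranked results, ground truth) pairs; made-up data.
queries = [
    (["d3", "d1", "d7"], "d1"),  # ground truth at rank 2
    (["d2", "d5", "d9"], "d2"),  # ground truth at rank 1
    (["d4", "d6", "d8"], "d0"),  # ground truth not retrieved
]

recall_at_1 = sum(recall_at_k(r, g, 1) for r, g in queries) / len(queries)
recall_at_3 = sum(recall_at_k(r, g, 3) for r, g in queries) / len(queries)
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
print(recall_at_1, recall_at_3, mrr)
```

Recall at K only asks whether the right answer made the cut, while MRR also rewards placing it higher in the list.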

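#### Text generation tasks

The difficulty in evaluating text generation: the output and the reference may carry the same meaning but be worded differently. Human evaluation may be needed, which is costly.

Common approaches:

- BLEU and METEOR for translation
- Perplexity for dialog generation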
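To make perplexity concrete, here is a small sketch that derives it from cross-entropy: the exponential of the average negative log-probability the model assigned to each correct next token. The probability lists are invented for illustration; a real language model would supply them.

```python
import math

def perplexity(token_probs):
    """exp(cross-entropy), where cross-entropy is the mean negative log-probability."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# Probabilities a hypothetical model assigned to the correct next tokens.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident))
print(perplexity(uncertain))
```

Lower is better: a model that assigns probability 0.5 to every correct token is as "confused" as one choosing uniformly between 2 options, while 0.1 corresponds to choosing between 10.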



### 6.2 Extrinsic Evaluation

Extrinsic evaluation focuses on evaluating the model performance on the final objective. In industrial projects, any AI model is built with the aim of solving a business problem. 

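Sometimes a model does well in intrinsic evaluation but poorly in extrinsic evaluation. There can be many reasons, one being that the wrong metrics were chosen. In the end, what matters is the business goal.

Good practice: set up the business metrics and the process to measure them correctly at the start of the project.

#### Why not do extrinsic evaluation directly?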

- extrinsic evaluation is much more expensive.
    - extrinsic evaluation often includes project stakeholders outside the AI team—sometimes even end users. Intrinsic evaluation can be done mostly by the AI team itself. 
- Start with intrinsic evaluation, and move on to extrinsic evaluation only once we are confident enough.
    - bad results in intrinsic evaluation often imply bad results in extrinsic evaluation.

## 7. Post-Modeling Phases - model deployment, monitoring, and updating

Once the model passes evaluation, the remaining work is deployment, monitoring, and continuous updating.

### 7.1 Deployment

Once we’re happy with one final solution, it needs to be deployed in a production environment as a part of a larger system.

An NLP module is typically deployed as a web service. For example, spam detection can run as a web service that takes an email's text as input and returns the email’s category (spam or non-spam) as output.
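The request/response contract of such a service can be sketched without any web framework. In the snippet below, the JSON field names (`text`, `category`) and the keyword-based `classify` function are assumptions made for illustration; a real deployment would wrap a trained model behind an HTTP endpoint.

```python
import json

# Hypothetical stand-in for a trained spam model.
SPAM_KEYWORDS = {"lottery", "winner", "free"}

def classify(text):
    """Label text as spam if it contains any of the toy keywords."""
    tokens = set(text.lower().split())
    return "spam" if tokens & SPAM_KEYWORDS else "non-spam"

def handle_request(body):
    """Parse a JSON request body and return a JSON response body."""
    payload = json.loads(body)
    return json.dumps({"category": classify(payload["text"])})

print(handle_request('{"text": "You are a lottery winner"}'))
print(handle_request('{"text": "Meeting moved to noon"}'))
```

Keeping `handle_request` separate from the model makes it easy to swap in a new model later without changing what callers send or receive.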

Chapter 11 covers deployment.

### 7.2 Monitoring
 
Model performance is monitored constantly after deployment. If we retrain the model automatically and frequently, we have to make sure that each new model behaves in a reasonable manner.
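One simple sanity check, sketched below, compares the distribution of predicted labels in a recent window against a baseline and flags large shifts. This is only one possible monitoring signal; the data, the helper names, and the `ALERT_THRESHOLD` value are all hypothetical.

```python
from collections import Counter

def label_distribution(predictions):
    """Fraction of predictions per label."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}

def max_shift(baseline, recent):
    """Largest absolute change in any label's share between two distributions."""
    labels = set(baseline) | set(recent)
    return max(abs(baseline.get(l, 0.0) - recent.get(l, 0.0)) for l in labels)

# Made-up prediction logs for illustration.
baseline = label_distribution(["spam"] * 20 + ["non-spam"] * 80)
recent = label_distribution(["spam"] * 55 + ["non-spam"] * 45)

ALERT_THRESHOLD = 0.2  # hypothetical value; tune per project
shift = max_shift(baseline, recent)
needs_review = shift > ALERT_THRESHOLD
print(shift, needs_review)
```

A jump in the spam share from 20% to 55% may reflect a real change in traffic or a faulty retrained model; either way it is worth a human look.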

Chapter 11 covers monitoring.

### 7.3 Model Update

Once new data arrives, we retrain the model on it so that its predictions stay current.

The table below lists some approaches to the model update process, depending on the attributes of the project.

| Project attribute                                                                      | Decision paths                                                                                                                                                   | Examples                                                                                   |
|----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|
| More training data is generated post-deployment.                                       | Once deployed, extracted signals can be used to automatically improve the model. Can also try online learning to train the model automatically on a daily basis. | Abuse-detection systems where users flag data.                                             |
| Training data is not generated post-deployment.                                        | Manual labeling could be done to improve evaluation and the models. Ideally, each new model has to be manually built and evaluated.                              | A subset of a larger NLP pipeline with no direct feedback.                                 |
| Low model latency is required, or model has to be online with near-real-time response. | Need to use models that can run inference quickly. Another option is to create memoization strategies like caching or have substantially bigger computing power. | Systems that need to respond right away, like any chatbot or an emergency tracking system. |
| Low model latency is not required, or model can be run in an offline fashion.          | Can use more advanced and slower models. This can also help in optimizing costs where feasible.                                                                  | Systems that can be run on a batch process, like retail product catalog analysis.          |

## 8. Working with other languages

So far we have described the pipeline for English. Now let's look at other languages: for some the pipeline is simpler, for others it is more complex.



| Language attribute         | Example and languages                                                                                                                                                                                                                                                                                      | Action                                                                                                                                                                                                                                                                                                                       |
|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| High-resource languages    | Languages that have both ample data as well as pre-built models. Examples include English, French, and Spanish.                                                                                                                                                                                            | Possible to use pre-trained DL models. Easier to use.                                                                                                                                                                                                                                                                        |
| Low-resource languages     | Languages that have limited data and recent digital adoption. May not have pre-built models. Examples include Swahili, Burmese, and Uzbek.                                                                                                                                                                 | Depending on the task, may need to label more data as well as explore individual components.                                                                                                                                                                                                                                 |
| Morphologically rich       | Linguistic and grammatical information like subject, object, predicate, tense, and mode are not separate words, but are joined together. Examples include Latin, Turkish, Finnish, and Malayalam.                                                                                                          | If the language is not resource rich, we’ll need to explore morphological analyzers that exist for the language. In the worst case, manual rules to handle certain cases might be needed.                                                                                                                                    |
| Vocabulary variation heavy | Nonstandard spellings and high word variation. For Arabic and Hindi, the spellings are nonstandard.                                                                                                                                                                                                        | If the language is not resource rich, then we may need to first normalize the words/spellings before training any model. This may not be needed for languages with large datasets, as they can still learn of vocabulary variation.                                                                                          |
| CJK languages              | These languages are derived from ancient Chinese characters. They’re not alphabet based and have several thousand characters for basic literacy and over 40,000 characters for larger coverage. Thus, they have to be handled differently. They include Chinese, Japanese, and Korean, hence the name CJK. | Use specific tokenization schemes in these languages. Given that an ample amount of CJK data is available, it’s possible to build NLP models for various tasks from scratch. There are also pre-trained models for them. Transfer learning from models trained in other languages beyond CJK may not be useful in this case. |
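To illustrate why CJK text needs its own tokenization scheme, the sketch below contrasts whitespace splitting with a simple character-level tokenizer. This is only one possible scheme and an assumption for illustration: it covers just the basic CJK Unified Ideographs block (not Japanese kana or Korean Hangul), and real systems usually rely on dedicated word segmenters.

```python
import re

# Each CJK ideograph becomes its own token; runs of Latin letters/digits stay whole.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def char_tokenize(text):
    """Character-level tokenization for Chinese text mixed with Latin words."""
    return TOKEN_PATTERN.findall(text)

sentence = "我爱NLP"
print(sentence.split())         # whitespace splitting finds no boundaries
print(char_tokenize(sentence))  # one token per ideograph, plus the Latin word
```

Because Chinese is written without spaces between words, whitespace splitting returns the whole sentence as a single token, which is why a language-specific scheme is required.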

## 9. Case study - Uber COTA system

Many support tickets come in every day, and each ticket has many possible solutions.
COTA (Customer Obsession Ticketing Assistant) ranks these solutions and picks the best possible one.

The pipeline of this system is shown in the figure below:

- pre-processing:
    - tokenization
    - lowercasing
    - stop word removal
    - lemmatization
- feature engineering (the goal is topic modeling):
    - TF-IDF
    - LSI
    
<img src="../figures/2-15.png" alt="drawing" width="600"/>
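The first stages of such a pipeline can be sketched with the standard library alone. This is not Uber's implementation: the stop word list and the tickets are made up, lemmatization is skipped because it needs a linguistic resource, and LSI (a truncated SVD applied to the TF-IDF matrix) is left out to keep the sketch dependency-free.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "my", "to", "was", "for"}  # tiny hypothetical list

def preprocess(text):
    """Lowercase, tokenize on letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Made-up support tickets for illustration.
tickets = [
    "The driver was late to my pickup",
    "My payment was charged twice",
    "Driver cancelled the trip",
]
vectors = tf_idf(tickets)
print(preprocess(tickets[0]))
print(vectors[0])
```

In the first ticket, "late" receives a higher weight than "driver" because "driver" also appears in another ticket and is therefore less distinctive.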



## 10. Wrap-up
