# 1. NLP Primer


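Computers only understand 0s and 1s, so the central problem of NLP is how to hand natural language to a program in the form of 0s and 1s and have the program understand it.
NLP deals with methods to analyze, model, and understand human language.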


## 1. NLP tasks and applications

<img src="../figures/1-1.png" alt="drawing" width="800"/>

### 1.1 NLP applications
- **Email platforms**: spam classification, priority inbox, calendar event extraction, auto-complete, etc. 
- **Voice-based assistants**: interact with the user, understand user commands, and respond accordingly.
- **Search engine**: query understanding, query expansion, question answering, information retrieval, and ranking and grouping of the results
- **Machine translation**
- **Social media**: sentiment analysis
- **E-Commerce platforms**: understanding customer reviews
- **Text generation**: reports
- **Spelling check**
- **Plagiarism detection**
- **Build large knowledge bases**: knowledge base / graph is used for search and QA.
- many others...

### 1.2 NLP tasks

Most of the NLP applications above can be broken down into the following tasks:

- Language modeling:
    - The goal of this task is to learn the probability of a sequence of words appearing in a given language.
    - Use cases: speech recognition, OCR (Optical Character Recognition), handwriting recognition, machine translation, spelling correction, etc.

- Text classification
    - Use cases: email spam identification, sentiment analysis

- Information extraction
    - Use cases: calendar events, etc.

- Information retrieval
    - This is the task of finding documents relevant to a user query from a large collection.
    - Use cases: Google search
    
- Conversational agent
    - Siri, Alexa, etc.

- Text summarization
    - Create short summaries of longer documents while retaining the core content and preserving the overall meaning of the text.

- Question answering

- Machine translation
    - Use cases: Google Translate

- Topic modeling
    - This is the task of uncovering the topical structure of a large collection of documents. 
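
To make the language modeling task above concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The toy corpus, the function names, and the choice of a bigram model are assumptions made for illustration; they do not come from the notes.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """P(sentence) as a product of add-alpha smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)
    return prob

# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
unigrams, bigrams = train_bigram_model(corpus)
vocab = {w for s in corpus for w in s.split()} | {"</s>"}

p_seen = sentence_probability("the cat sat on the mat", unigrams, bigrams, len(vocab))
p_scrambled = sentence_probability("mat the on sat cat the", unigrams, bigrams, len(vocab))
print(p_seen > p_scrambled)  # True: the fluent word order is more probable
```

A model like this is what lets a speech recognizer or spelling corrector prefer a plausible word sequence over an implausible one.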

The figure below orders these tasks by complexity.

<img src="../figures/1-2.png" alt="drawing" width="600"/>



## 2. Building blocks of language?

Natural language has four main building blocks:

<img src="../figures/1-3.png" alt="drawing" width="800"/>


### 2.1 phonemes
Phonemes are important in speech understanding.
- smallest units of sound in a language
- may not have meaning by themselves
- can induce meaning when uttered in combination with other phonemes

### 2.2 morphemes and lexemes
A morpheme is the smallest unit of language that has a meaning. For example, in "multimedia", "multi" is a morpheme.

<img src="../figures/1-5.png" alt="drawing" width="600"/>

Lexemes are the structural variations of morphemes related to one another by meaning. “run” and “running” belong to the same lexeme form.


### 2.3 syntax
Syntax is a set of rules to construct grammatically correct sentences out of words and phrases in a language.
Syntactic structure is usually represented as a parse tree (**N** stands for noun, **V** for verb, and **P** for preposition; a noun phrase is denoted by **NP** and a verb phrase by **VP**).

<img src="../figures/1-6.png" alt="drawing" width="600"/>

Entity extraction and relation extraction are some of the NLP tasks that build on this knowledge of parsing.
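
As a rough illustration of how a parse tree can be used downstream, the sketch below stores a tree as nested tuples and pulls out every noun phrase. The sentence, the extra labels `S`, `Det`, and `PP`, and the helper names are assumptions chosen for the example, not something taken from the figure.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
# The sentence and the S / Det / PP labels are hypothetical.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "girl")),
        ("VP", ("V", "laughed"),
               ("PP", ("P", "at"),
                      ("NP", ("Det", "the"), ("N", "monkey")))))

def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Collect the text of every subtree whose label equals `label`."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        found.extend(phrases(child, label))
    return found

noun_phrases = phrases(tree, "NP")
print(noun_phrases)  # ['the girl', 'the monkey']
```

Noun phrases extracted this way are natural candidates for entities, which is one reason entity and relation extraction build on parsing.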

### 2.4 context

Context is how various parts in a language come together to convey a particular meaning. The meaning of a sentence can change based on the context, as words and phrases can sometimes have multiple meanings. Context comprises semantics and pragmatics.

- semantics: the direct meaning of the words and sentences without external context
- pragmatics: adds world knowledge and external context of the conversation to enable us to infer implied meaning

Use cases: sarcasm detection, summarization, topic modeling.


## 3. Why is NLP challenging

### 3.1 Ambiguity

Ambiguity means uncertainty of meaning. See the examples below (from the **Winograd Schema Challenge**).

<img src="../figures/1-7.png" alt="drawing" width="600"/>

Because the sentences in each pair differ by only one or two words, most NLP methods struggle to tell them apart, whereas humans can do so easily.

### 3.2 Common Knowledge

In conversation, people rely on common knowledge that everyone is assumed to share, so it does not need to be explicitly mentioned.

One of the key challenges in NLP is how to encode all the things that are common knowledge to humans in a computational model.



### 3.3 Creativity

Various styles, dialects, genres, and variations are used in any language. 


### 3.4 Diversity across languages

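When we want to apply an NLP system built for one language to another language, we run into difficulty, because for most languages in the world there is no direct mapping between the vocabularies of any two languages.

A solution that works for one language may not work for another.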

## 4. Machine Learning, Deep Learning, and NLP: An Overview

Next, let's look at some general approaches to solving NLP problems.

The figure below shows how machine learning, deep learning, and NLP relate to one another:

<img src="../figures/1-8.png" alt="drawing" width="600"/>

- **Supervised Learning**: The goal is to learn the mapping function from input to output given a large number of examples in the form of input-output pairs.

- **Unsupervised Learning**: The goal is to find hidden patterns in the given input data without any reference output.

- **Reinforcement Learning**: The model learns tasks via trial and error. It is characterized by the absence of either labeled or unlabeled data in large quantities. (It is not yet widely used in NLP.)



## 5. Approaches to NLP

Approaches generally fall into three categories:
- heuristics
- machine learning
- deep learning

### 5.1 heuristics-based NLP

These approaches are rule-based.
Example: lexicon-based sentiment analysis, which counts the positive and negative words in a text.
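
A minimal sketch of lexicon-based sentiment analysis might look like the following. The two word lists are tiny, hypothetical lexicons written for this example; a real system would use a curated sentiment lexicon and handle negation.

```python
import re

# Hypothetical, deliberately tiny lexicons for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "sad"}

def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this phone, the screen is great"))  # positive
print(lexicon_sentiment("Terrible battery and poor support"))       # negative
```

Because the rule is just a word count, a phrase such as "not good" is scored as positive, which is the kind of limitation that motivates the learned approaches later in these notes.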

#### What are the advantages of heuristics-based methods?
- Put simply, rules and heuristics help you quickly build the first version of the model and get a better understanding of the problem at hand.
- Rules and heuristics can also be useful as features for machine learning–based NLP systems.
- They are more deterministic, which matters in industries with higher reliability requirements, such as health care.

#### Commonly used tools:
- Dictionaries
- Knowledge bases (e.g., WordNet)
    - Synonyms: words with similar meanings
    - Hyponyms: capture is-type-of relationships
        - baseball and tennis are hyponyms of sports
    - Meronyms: capture is-part-of relationships
        - hands and legs are meronyms of the body

<img src="../figures/1-9.png" alt="drawing" width="500"/>

- Regex
    - Rule-based systems are usually built with regexes.
    - Regexes are a great way to incorporate domain knowledge in your NLP system.
    - **Probabilistic regexes** include a probability of a match (see the pregex library).
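
As a small example of encoding domain knowledge in regexes, the sketch below extracts email addresses and ISO-style dates from text using Python's standard `re` module. Both patterns are simplified assumptions for illustration; real email and date formats are far more varied.

```python
import re

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")  # YYYY-MM-DD only

text = "Contact alice@example.com or bob.smith@mail.example.org before 2024-03-15."

emails = EMAIL_RE.findall(text)
dates = DATE_RE.findall(text)   # one (year, month, day) tuple per match

print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
print(dates)   # [('2024', '03', '15')]
```

Rules like these are fast to write and fully deterministic, and their matches can also be fed to a machine learning model as features.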

### 5.2 Machine Learning for NLP

- **classification**: classify an article to news topics
- **regression**: predict stock price based on relevant news (social media discussion and rumors)
- **clustering**: group similar documents

Every machine learning approach involves three steps:
1. extract features from text -- (chapter 3)
2. using features to learn a model
    - **Naive Bayes**
        - assumes each feature is independent of all other features
        - Pros: simple to understand, fast to train and run
        - Cons: the independence assumption is strong; in practice Naive Bayes usually serves as a starting algorithm for text classification
    - **SVM**
        - To learn a decision boundary that acts as a separation between different categories of text
        - Pros: robustness to variation and noise in the data
        - Cons: training time, inability to scale
    - **Hidden Markov Model**
        - assumes there is an underlying, unobservable process with hidden states that generates the data; each hidden state depends on the previous state(s)
        - Example: in POS tagging, the hidden states capture the underlying grammar that generated the text
    - **Conditional Random Fields**
        - performs a classification task on each element in the sequence
        - CRFs outperform HMMs for tasks such as POS tagging
3. evaluating and improving the model -- (chapter 2)
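
The three steps above can be sketched end to end with a small multinomial Naive Bayes classifier written in plain Python. The training and test examples, the bag-of-words features, and the add-one smoothing are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Step 1: bag-of-words features (lowercased whitespace tokens)."""
    return text.lower().split()

def train_naive_bayes(examples):
    """Step 2: estimate class counts and per-class word counts."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in examples:
        class_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in extract_features(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data.
train = [("win a free prize now", "spam"), ("free money click now", "spam"),
         ("meeting agenda for monday", "ham"), ("lunch with the project team", "ham")]
test = [("claim your free prize", "spam"), ("agenda for the team meeting", "ham")]

model = train_naive_bayes(train)
# Step 3: evaluate on held-out examples.
accuracy = sum(predict(text, model) == label for text, label in test) / len(test)
print(accuracy)
```

Each word contributes to the score independently of the others, which is exactly the feature-independence assumption listed under Naive Bayes.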

### 5.3 Deep learning for NLP

#### RNN
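RNNs process text sequentially and remember what they have processed so far. The memory is temporal: it is stored and updated at every time step.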

<img src="../figures/1-13.png" alt="drawing" width="500"/>

RNN is useful for text classification, named entity recognition, machine translation, text generation, etc.
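
To show what "temporal memory" means mechanically, here is a minimal sketch of a vanilla RNN forward pass in plain Python. The weights and the two-dimensional inputs are hypothetical fixed values; in a real model they would be learned, and a library such as PyTorch or TensorFlow would be used instead.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrence: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, W_xh, W_hh, b):
    """Feed the inputs one step at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

# Hypothetical weights and inputs, chosen by hand for illustration.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 1.0], [1.0, 0.0]]   # same inputs, reversed order

h_a = run_rnn(seq_a, W_xh, W_hh, b)
h_b = run_rnn(seq_b, W_xh, W_hh, b)
print(h_a != h_b)  # True: the final memory depends on word order
```

The hidden state `h` is the memory: it is the only thing carried from one step to the next, so the same words in a different order leave a different final state.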


#### LSTM
RNNs suffer from forgetful memory: they cannot remember longer contexts and therefore do not perform well when the input text is long.

LSTMs discard irrelevant information and keep only the useful context. This memory is represented as a vector.

#### GRU
Compared with LSTMs, GRUs are used more often for text generation.

#### CNN
CNNs are commonly used for text-classification tasks.

The main advantage CNNs have is their ability to look at a group of words together using a context window.
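
The context window idea can be sketched as a one-dimensional convolution over word vectors followed by max-over-time pooling. The two-dimensional embeddings and the hand-set bigram filter below are hypothetical; a trained CNN would learn many such filters.

```python
def conv1d_over_words(embeddings, kernel, bias=0.0):
    """Slide a window of len(kernel) words over the sentence; one ReLU output per window."""
    window = len(kernel)
    outputs = []
    for start in range(len(embeddings) - window + 1):
        total = bias
        for offset in range(window):
            word_vec = embeddings[start + offset]
            total += sum(w * v for w, v in zip(kernel[offset], word_vec))
        outputs.append(max(0.0, total))  # ReLU
    return outputs

def max_over_time(feature_map):
    """Keep only the strongest activation across all window positions."""
    return max(feature_map)

# Hypothetical 2-dimensional embeddings, set by hand for illustration.
sentence = ["the", "movie", "was", "not", "good"]
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.0, 0.1], [1.0, 0.0], [0.0, 1.0]]

# A bigram filter (window of 2 words) set by hand to respond to "not" followed by "good".
kernel = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv1d_over_words(embeddings, kernel)
best = feature_map.index(max_over_time(feature_map))
best_window = sentence[best:best + len(kernel)]
print(best_window)  # ['not', 'good']
```

Looking at two words at once lets the filter react to "not good" as a unit, something a bag-of-words feature cannot do.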

<img src="../figures/1-15.png" alt="drawing" width="600"/>

#### Transformers
They model the textual context, but not in a sequential manner.
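
One way to see how context can be modeled without sequential processing is scaled dot-product self-attention, the core operation in transformers. The sketch below is heavily simplified and rests on stated assumptions: queries, keys, and values are the raw input vectors (no learned projections), there is a single head, and there is no positional encoding.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Compare this word with every word in the sentence, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical word vectors: the first and third words are identical.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(vectors)

# The first word attends more to the distant third word than to its neighbor.
print(weights[0][2] > weights[0][1])  # True
```

Every word is compared with every other word in a single step, so distance in the sentence does not limit which context a word can use.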

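Large transformers have been used for transfer learning with smaller downstream tasks.
- pre-training: the model is first trained on a very large text corpus in an unsupervised manner
- fine-tuning: the pre-trained model is then fine-tuned on downstream NLP tasks, such as text classification, entity extraction, and question answering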

<img src="../figures/1-16.png" alt="drawing" width="600"/>

BERT is one example of such a pre-trained transformer model.

<img src="../figures/1-17.png" alt="drawing" width="600"/>

#### Autoencoders
Autoencoders learn a compressed vector representation of the input. They are typically used to create feature representations needed for downstream tasks.

<img src="../figures/1-18.png" alt="drawing" width="600"/>

LSTM autoencoders are one example.



## 6. Deep learning is not yet the silver bullet for NLP

Deploying deep learning in real industry settings brings some additional challenges:

#### Overfitting on small datasets

Occam’s razor suggests that a simpler solution is always preferable given that all other conditions are equal. 

When data is scarce, a simpler model works better.

#### Few-shot learning and synthetic data generation

In image processing, good models can already be trained from a small number of examples. NLP cannot do this yet.

#### Domain adaptation

Take the legal domain as an example: both the syntactic and semantic structure of the language is specific to the domain, so transferring a model to it causes performance degradation.

#### Interpretable models

Controllability and interpretability are hard for DL models because, most of the time, they work like a black box.

In image processing, DL models are less of a black box.

#### Common sense and world knowledge

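Scientists do not yet fully understand language itself. People also apply logical reasoning when they use language, which is hard for computers.

Current DL models struggle with common sense understanding and logical reasoning. How can knowledge (e.g., a knowledge graph) be integrated with DL models?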

#### Cost

In terms of both money and time.

In addition, bulky models may cause latency issues during inference time and may not be useful in cases where low latency is a must.

#### On-device deployment

Owing to limitations of the device, the solution must work with limited memory and power.

Given the challenges above, we can see that DL is not always the go-to solution for all industrial NLP applications.

This book therefore covers the whole pipeline rather than just one or two DL models.




## 7. An NLP walkthrough: conversational agents

Siri and Alexa are examples. A conversational agent includes the following major NLP components.

<img src="../figures/1-19.png" alt="drawing" width="500"/>

#### Speech recognition and synthesis

This component captures the user's input and is essential for a voice-based conversational agent.

- Speech recognition: speech signals to text
- Speech synthesis: text to speech

Both technologies are now fairly mature, and the usual approach is to use a cloud API.

#### Natural language understanding

This component understands the user's input and can be broken into the following parts:

1. Sentiment Analysis
analyze the sentiment of the user response (chapter 4)

2. Named Entity Recognition
identify all the important entities the user mentioned in their response (chapter 5)

3. Coreference resolution
find out the references of the extracted entities from the conversation history. (chapter 5)

#### Dialog management

1. Once we’ve extracted the useful information from the user’s response, we may want to understand the user’s intent.
    - **Intent classification**: We can use a text-classification system to classify the user response as one of the pre-defined intents.

2. Once we’ve figured out the user’s intent, we want to figure out which suitable action the conversational agent should take to fulfill the user’s request.

#### Response generation
Finally, the conversational agent generates a suitable action to perform based on a semantic interpretation of the user’s intent and additional inputs from the dialogue with the user. 

- retrieve information from the knowledge base
- generate response using a pre-defined template
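
The dialog management and response generation steps can be sketched as a tiny text-only pipeline: classify the intent, look up information in a knowledge base, and fill a pre-defined template. Everything named here (the intents, keywords, knowledge base entries, and templates) is hypothetical, and keyword matching stands in for a trained intent classifier.

```python
import re

# Hypothetical intents, knowledge base, and templates, for illustration only.
INTENT_KEYWORDS = {
    "get_weather": {"weather", "rain", "forecast"},
    "set_alarm": {"alarm", "wake"},
}
KNOWLEDGE_BASE = {"paris": "sunny", "london": "rainy"}
TEMPLATES = {
    "get_weather": "The weather in {city} is {condition}.",
    "set_alarm": "Alarm set for {time}.",
    "fallback": "Sorry, I did not understand that.",
}

def classify_intent(utterance):
    """Keyword overlap as a stand-in for a text-classification model."""
    tokens = set(re.findall(r"[a-z]+", utterance.lower()))
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"

def respond(utterance):
    """Choose an action for the intent and render a templated response."""
    intent = classify_intent(utterance)
    text = utterance.lower()
    if intent == "get_weather":
        # Retrieve information from the knowledge base.
        city = next((c for c in KNOWLEDGE_BASE if c in text), None)
        if city is not None:
            return TEMPLATES[intent].format(city=city.title(),
                                            condition=KNOWLEDGE_BASE[city])
    elif intent == "set_alarm":
        match = re.search(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", text)
        if match is not None:
            return TEMPLATES[intent].format(time=match.group(0))
    return TEMPLATES["fallback"]

print(respond("What is the weather in Paris?"))  # The weather in Paris is sunny.
print(respond("Wake me up at 7 am"))             # Alarm set for 7 am.
```

In a voice-based agent, speech recognition would produce the input text and speech synthesis would read the returned string aloud.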