# Chapter 4. Text Classification


Text classification (also known as topic classification, text categorization, or document categorization) is a special instance of the classification problem in which the input is a piece of text and the goal is to assign it to one or more buckets (called classes) from a predefined set of classes.

#### Types of Classification
- binary classification: each example has exactly one label out of two possible classes
- multiclass classification: each example has exactly one label out of more than two possible classes
- multilabel classification: each example can have one or more labels
    - e.g., news tagging: a news article can have more than one label
    - **hierarchical classification**: to check
    
This chapter focuses on binary classification and multiclass classification.

#### Applications

- Content classification and organization
    - tagging product descriptions in an e-commerce website; 
    - routing customer service requests in a company to the appropriate support team; 
    - email systems:
        - spam filter
        - organizing emails into personal, social, and promotions in Gmail

- Customer support
    - identify the tweets that brands must respond to (i.e., those that are actionable) and those that don’t require a response (i.e., noise) 
    
<img src="../figures/4-1.png" alt="drawing" width="500"/>

- E-commerce
    - sentiment analysis: to understand and analyze customers’ perception of a product or service based on their comments.
        - Simple case: classify all customer reviews for a product into three categories: positive, negative, and neutral.
        - Complex case: aspect-based sentiment analysis, i.e., fine-grained analysis
            - a single comment can contain multiple opinions: the food is great, the service is bad.

<img src="../figures/4-2.png" alt="drawing" width="600"/>

- Other applications:
    - Language identification
    - Authorship attribution: determining who wrote a text from the text itself
    - Fake news detection: separating fake news from genuine news

## 1. Text Classification Pipeline

1. Collect or create a labeled dataset suitable for the task.

2. Split the dataset into two (training and test) or three parts: training, validation (i.e., development), and test sets, then decide on evaluation metric(s).
    - For classification, the commonly used metrics are classification accuracy, precision, recall, F1 score, and area under the ROC curve.

3. Transform raw text into feature vectors.

4. Train a classifier using the feature vectors and the corresponding labels from the training set.

5. Using the evaluation metric(s) from Step 2, benchmark the model performance on the test set.

6. Deploy the model to serve the real-world use case and monitor its performance.

Steps 3 through 5 are iterated. Steps 2 and 3 were covered in the previous two chapters; this chapter focuses on Steps 4 and 5. Step 6 is covered in Chapter 11.
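As a rough illustration of the evaluation metrics chosen in Step 2, the sketch below computes accuracy, precision, recall, and F1 from scratch for a binary task. The function names and the toy label lists are made up for this example; in practice a library such as scikit-learn provides equivalent functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 with respect to the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)
prec, rec, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Precision and recall are computed for one designated positive class, which is why the function takes a `positive` argument.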

<img src="../figures/4-3.png" alt="drawing" width="500"/>

In production settings, key performance indicators (KPIs) specific to a given business use case are also used to evaluate the model's impact and return on investment (ROI). For example, when text classification is used to automatically route customer service requests, a possible KPI is the reduction in wait time before a request is responded to, compared to manual routing. Later we will introduce **industry verticals** and discuss KPIs in detail.

### Lexicon-based sentiment analysis

Lexicon-based sentiment analysis does not require the pipeline above.

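For example, suppose we need to decide whether a microblog post is positive or negative. We can use a list of positive words and a list of negative words and judge the sentence by which words it uses. We can go one step further and use a dictionary that assigns each word a score: 1 for positive, -1 for negative, and 0 for neutral.

The process above involves no "learning". It is often a quick way to build a Minimum Viable Product (MVP)[1]. In general, for any NLP problem, it is best to start with such simpler approaches.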

[1] A **minimum viable product** (MVP) is a version of a product with just enough features to satisfy early customers and provide feedback for future product development.
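The lexicon-based approach described above can be sketched in a few lines. The word list, its scores, and the function name are hypothetical and exist only for illustration; a real lexicon would be far larger.

```python
import re

# Hypothetical toy lexicon: +1 = positive, -1 = negative.
# Words not listed are treated as neutral (score 0).
LEXICON = {
    "great": 1, "good": 1, "love": 1,
    "bad": -1, "terrible": -1, "hate": -1,
}


def lexicon_sentiment(text, lexicon=LEXICON):
    """Sum the scores of all words in `text` and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("I love this phone, the camera is great"))
print(lexicon_sentiment("The service is bad"))
print(lexicon_sentiment("The food is great, the service is bad"))
```

The last call shows a limitation of this simple scheme: a comment with one positive and one negative opinion sums to zero and is labeled neutral, which is the kind of case aspect-based sentiment analysis is meant to handle.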

#### Using cloud APIs

Cloud APIs are more generic. If a generic solution can solve our NLP problem, we usually do not need to build our own system.

- Google Cloud. “Natural Language”. Last accessed June 15, 2020. https://cloud.google.com/natural-language/

- Amazon Comprehend. Last accessed June 15, 2020. https://aws.amazon.com/comprehend/

- Azure Cognitive Services. Last accessed June 15, 2020. https://azure.microsoft.com/en-in/services/cognitive-services/text-analytics/

## 2. One Pipeline, Many Classifiers

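This section does not present a one-size-fits-all method. Instead, it presents several different implementations of Steps 3 to 5 for different scenarios. In practice, we usually implement several approaches as well and then choose the best one for production.

A good dataset is a prerequisite for using the pipeline. By "good" we mean a dataset that is a true representation of the data we’re likely to see in production.

Commonly used NLP datasets: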
- https://github.com/niderhoff/nlp-datasets
- https://datasetsearch.research.google.com/
- https://archive.ics.uci.edu/ml/index.php
- https://www.kaggle.com/c/sa-emotions

As a demo, we use the Economic News Article Tone and Relevance dataset:
- 8000 news articles
- label: whether the article is relevant to the US economy (binary classification)
- imbalanced: ~20% of the articles are relevant
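The class imbalance matters when choosing a metric. The sketch below uses made-up labels with the same ~20% positive rate (not the actual dataset) to show that a classifier which always predicts "not relevant" still reaches 80% accuracy while finding none of the relevant articles.

```python
# Hypothetical labels mirroring a ~20% positive rate:
# 1 = relevant, 0 = not relevant.
y_true = [1] * 20 + [0] * 80

# A trivial "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the relevant class: how many relevant articles were found.
found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

This is why precision, recall, and F1 on the minority class are usually reported alongside accuracy for imbalanced data.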

We use a bag-of-words representation and three classifiers:
- Naive Bayes
- logistic regression 
- support vector machines
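As a reminder of what the bag-of-words representation looks like, here is a minimal sketch in plain Python. The tokenizer, function names, and example sentences are assumptions made for this illustration; in practice a library vectorizer would normally be used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Map every distinct token in the training docs to a column index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(doc, vocab):
    """Turn a document into a vector of token counts over the vocabulary."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokenize(doc)).items():
        if tok in vocab:  # tokens unseen during training are ignored
            vec[vocab[tok]] = count
    return vec


# Hypothetical toy documents.
docs = ["Stocks rise as the economy grows", "The team wins the game"]
vocab = build_vocabulary(docs)
print(sorted(vocab))
print(vectorize(docs[1], vocab))
print(vectorize("The economy shrinks", vocab))
```

Word order is discarded: each document becomes a fixed-length count vector, which is the input the three classifiers expect.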

### 2.1 Naive Bayes Classifier

Naive Bayes is commonly used as a baseline algorithm in classification experiments.

#### Bayes’ theorem

Naive Bayes applies Bayes' theorem to score each class for a given text:

- It estimates the conditional probability of each feature of the text given each class, based on how often that feature occurs in training texts of that class.
- It multiplies the probabilities of all the features of the text (together with the prior probability of the class) to compute the final probability for each class.
- Finally, it chooses the class with the maximum probability.
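The steps above can be sketched as a small multinomial Naive Bayes classifier. This is an illustrative sketch under several assumptions that are not in the text: whitespace tokenization, add-one (Laplace) smoothing so unseen word/class pairs do not get zero probability, and summing log-probabilities instead of multiplying raw probabilities to avoid numerical underflow. The class name and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, doc):
        """Log of (class prior * product of per-word likelihoods) per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in doc.lower().split():
                if tok not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][tok]
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)  # class with maximum probability


# Hypothetical toy training data.
docs = [
    "fed raises interest rates",
    "inflation and jobs report",
    "team wins the final",
    "new movie opens friday",
]
labels = ["relevant", "relevant", "other", "other"]

model = NaiveBayes().fit(docs, labels)
print(model.predict("interest rates and inflation"))
print(model.predict("movie wins"))
```

Because the logarithm is monotonic, picking the class with the highest log score gives the same answer as picking the class with the highest probability.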

### 2.2 Logistic Regression

Naive Bayes is a generative classifier.

Logistic regression is a discriminative classifier, which directly learns the probability distribution over classes given the input text.

- Naive Bayes estimates probabilities based on feature occurrence in classes
- Logistic regression “learns” the weights for individual features based on how important they are to making a classification decision.
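To make "learning the weights for individual features" concrete, here is a minimal sketch of binary logistic regression trained with batch gradient descent in plain Python. The feature names, toy counts, learning rate, and number of epochs are all assumptions chosen for illustration, not values from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability of the positive class for feature vector `x`."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, weights, bias) - yi
            for j, value in enumerate(xi):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Hypothetical bag-of-words counts for the words ["economy", "game", "the"].
X = [[2, 0, 1], [1, 0, 2], [0, 2, 1], [0, 1, 2]]
y = [1, 1, 0, 0]  # 1 = relevant, 0 = not relevant

weights, bias = train_logistic_regression(X, y)
print([round(w, 2) for w in weights])
print(round(predict_proba([3, 0, 1], weights, bias), 2))
```

In this toy data the count of "the" is distributed identically across both classes, so its learned weight stays near zero, while "economy" gets a positive weight and "game" a negative one. That is the sense in which the model weights features by how useful they are for the decision.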

### 2.3 SVM

SVM is also a discriminative classifier.

## 3. Using neural embeddings in text classification