# Classification

Classification lies at the heart of both human and machine intelligence. Deciding what letter, word, or image has been presented to our senses, recognizing faces or voices, sorting mail, assigning grades to homeworks; these are all examples of assigning a category to an input.

**Classification** is the type of supervised learning where **$\mathcal{y}$ is a <a href="../1_programming/41_types.html#categorical-features">discrete categorical variable</a>**.

The discrete output variable $\mathcal{y}$ is often also called the **label** or **target** or **class**.

For example, we might want to predict whether a patient has a disease or not, based on their symptoms. In this case, $\mathcal{y}$ is a binary variable, taking the value 1 if the patient has the disease, and 0 otherwise. Other examples of classification problems include predicting the sentiment of a movie review: positive, negative, or neutral.

The sentiment example is illustrated below:

<img align="center" width="90%" src="../assets/sentiment.png">

<br/>

In other words, the classification problem is to learn a function $f$ that maps the input $\mathcal{X}$ to the discrete output $\mathcal{Y}$.



### Evaluation Metrics

The most common metric for evaluating a classifier is **accuracy**: the proportion of predictions that are correct, that is, the number of correct predictions divided by the total number of predictions.

$$Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For example, if we have a test set of 100 documents, and our classifier correctly predicts the class of 80 of them, then the accuracy is 80%.
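The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `accuracy` and the label strings `"pos"` and `"neg"` are assumptions made for the example, and the data is made up to match the 80-out-of-100 case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up test set of 100 documents: 80 predicted correctly, 20 incorrectly.
y_true = ["pos"] * 100
y_pred = ["pos"] * 80 + ["neg"] * 20

print(accuracy(y_true, y_pred))  # 0.8
```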

Assuming the categorical variable that we are trying to predict is binary, we can define the accuracy in terms of the four possible outcomes of a binary classifier: 

1. True Positive (TP): The classifier correctly predicted the positive class.
2. False Positive (FP): The classifier **incorrectly** predicted the negative class as positive.
3. True Negative (TN): The classifier correctly predicted the negative class.
4. False Negative (FN):  The classifier **incorrectly** predicted the positive class as negative.

In each name, the second word (positive or negative) is the class the classifier predicted, and the first word (true or false) says whether that prediction was correct. A false positive is therefore a negative instance wrongly predicted as positive, and a false negative is a positive instance wrongly predicted as negative.

These definitions are summarized in the table below: 

| Outcome | Prediction $\hat{y} = f'(x)$ | Truth $y = f(x)$ |
| :---        |    :----:   |          ---: |
| True Negative (TN)    | 0        | 0   |
| False Negative (FN)   | 0        | 1      |
| False Positive (FP)   | 1        | 0      |
| True Positive (TP)   | 1        | 1      |
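The table can be turned into a small counting routine. The sketch below assumes labels are encoded as 0 (negative) and 1 (positive), as in the table; the function name `confusion_counts` and the example data are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN for binary labels encoded as 0 and 1."""
    # Keys are (prediction, truth) pairs, mirroring the rows of the table.
    outcome = {(1, 1): "TP", (1, 0): "FP", (0, 0): "TN", (0, 1): "FN"}
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        counts[outcome[(pred, truth)]] += 1
    return counts


# Made-up labels for eight instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_counts(y_true, y_pred))
# {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```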

In terms of the four outcomes above, the accuracy is:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

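Accuracy is a useful metric, but it can be misleading when the classes are imbalanced: a classifier that always predicts the majority class can score high accuracy without ever detecting the minority class.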
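As a concrete illustration of how accuracy can mislead, the sketch below uses made-up data with 95 negative and 5 positive instances and a trivial "classifier" that always predicts the negative class. The numbers are hypothetical and chosen only to make the effect visible.

```python
# Made-up imbalanced test set: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.95
print(tp, fn)    # 0 5
```

The accuracy is 95%, yet not a single positive instance is found, which is why accuracy alone may not be enough.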

Other metrics that are often used to evaluate classifiers are: 

* **Precision**: The proportion of positive predictions that are correct. Mathematically, precision is defined as:

$$Precision = \frac{TP}{TP + FP}$$

* **Recall**: The proportion of positive instances that are correctly predicted. Mathematically, recall is defined as:

$$Recall = \frac{TP}{TP + FN}$$
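Both formulas are easy to compute from the outcome counts. In the sketch below, the function names and the counts are hypothetical, and returning 0.0 when the denominator is zero is a convention assumed here rather than something the definitions prescribe.

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if there are no positive predictions (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no positive instances (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```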

Precision and recall are often combined into a single metric called the **F1 score**, which is their harmonic mean. The harmonic mean of two numbers $x$ and $y$ is given by:

$$\frac{2}{\frac{1}{x} + \frac{1}{y}}$$

* **F1 Score**: The harmonic mean of precision and recall.

$$F1\ Score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$$
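A short sketch of the F1 formula follows. The function name `f1_score` is an assumption for this example, as is the choice to return 0.0 when either input is zero (the formula itself is undefined there). The second call shows that the harmonic mean stays close to the smaller of the two values, unlike the arithmetic mean.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if either is zero, by assumed convention)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


print(f1_score(0.5, 0.5))            # 0.5
print(round(f1_score(1.0, 0.1), 3))  # 0.182
print((1.0 + 0.1) / 2)               # 0.55 (arithmetic mean, for comparison)
```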

<!-- $$Baseline\ Accuracy = \frac{Number\ of\ majority\ class\ predictions}{Total\ number\ of\ predictions}$$ -->

<!-- The baseline accuracy is the accuracy of the majority class classifier. It is the accuracy we would get if we just guessed the majority class for every instance. It is a useful baseline to compare our classifier to. If our classifier is not better than the baseline, then we should probably just use the baseline classifier.

Another way to evaluate a classifier is to look at the confusion matrix. A confusion matrix is a table that shows the number of correct and incorrect predictions for each class. For example, if we have a test set of 100 documents, and our classifier correctly predicts the class of 80 of them, then the accuracy is 80%. But if we had just guessed the majority class for all of them, we would have gotten 50% accuracy. This is called the baseline accuracy.

<img src="../assets/confusion_matrix.png">



<img src="../assets/classification.png">


<img src="../assets/cross_validation.png">


<img src="../assets/training_testing.png">
 -->



One method for classifying text is to use handwritten rules. There are many areas of language processing where handwritten rule-based classifiers constitute a state-of-the-art system, or at least part of it.

We focus on one common text categorization task, **sentiment analysis**: the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object. A review of a movie, book, or product on the web expresses the author’s sentiment toward the product, while an editorial or political text expresses sentiment toward a candidate or political action. Extracting consumer or public sentiment is thus relevant for fields from marketing to politics.

**Spam detection** is another important commercial application, the binary classification task of assigning an email to one of the two classes spam or not-spam.

Many lexical and other features can be used to perform this classification. For example, you might quite reasonably be suspicious of an email containing phrases like “online pharmaceutical” or “WITHOUT ANY COST” or “Dear Winner”.
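A handwritten rule of this kind might look like the sketch below. It is deliberately naive: the phrase list is taken from the examples above, while the function name `is_spam`, the case-insensitive substring matching, and the sample emails are assumptions made for illustration.

```python
# Phrases from the examples above, lowercased for case-insensitive matching.
SUSPICIOUS_PHRASES = ("online pharmaceutical", "without any cost", "dear winner")


def is_spam(email_text):
    """Flag an email as spam if it contains any suspicious phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in SUSPICIOUS_PHRASES)


print(is_spam("Dear Winner, claim your prize WITHOUT ANY COST!"))  # True
print(is_spam("The meeting has moved to 3pm tomorrow."))           # False
```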

Rules can be fragile, however, as situations or data change over time, and for some tasks humans aren’t necessarily good at coming up with the rules. Most cases of classification in language processing are instead done via supervised machine learning, and this will be the subject of the remainder of this chapter. 

Formally, the task of supervised classification is to take an input $x$ and a fixed set of output classes $Y = \{y_1, y_2, \ldots, y_M\}$ and return a predicted class $y \in Y$. For text classification, we’ll sometimes talk about $c$ (for “class”) instead of $y$ as our output variable, and $d$ (for “document”) instead of $x$ as our input variable. In the supervised situation we have a training set of $N$ documents that have each been hand-labeled with a class: $\{(d_1, c_1), \ldots, (d_N, c_N)\}$.

Our goal is to learn a classifier that is capable of mapping from a new document $d$ to its correct class $c \in C$, where $C$ is some set of useful document classes. A probabilistic classifier additionally will tell us the probability of the observation being in the class. This full distribution over the classes can be useful information for downstream decisions; avoiding making discrete decisions early on can be useful when combining systems.
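To illustrate why the full distribution can be useful, the sketch below represents a probabilistic classifier's output as a dictionary from class to probability. The probabilities, the function names, and the threshold rule are all hypothetical; the point is only that a downstream component can postpone or refine the discrete decision when it has the probabilities available.

```python
def predict_class(class_probs):
    """Return the most probable class from a dict mapping class -> probability."""
    return max(class_probs, key=class_probs.get)


def predict_with_threshold(class_probs, threshold):
    """Return the most probable class, or None if its probability is below the threshold."""
    best = predict_class(class_probs)
    return best if class_probs[best] >= threshold else None


# Made-up output of a probabilistic sentiment classifier for one document.
probs = {"positive": 0.55, "negative": 0.35, "neutral": 0.10}

print(predict_class(probs))                # positive
print(predict_with_threshold(probs, 0.8))  # None
```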

Many kinds of machine learning algorithms are used to build classifiers. This chapter introduces naive Bayes; the following one introduces logistic regression. These exemplify two ways of doing classification. Generative classifiers like naive Bayes build a model of how a class could generate some input data. Given an observation, they return the class most likely to have generated the observation. Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes. While discriminative systems are often more accurate and hence more commonly used, generative classifiers still have a role.
