# Classification


Most machine learning problems involve predicting a single random variable $\mathcal{y}$ from one or more random variables $\mathcal{X}$. 

The underlying assumption is that the value of $\mathcal{y}$ depends on the value of $\mathcal{X}$, and that the relationship between the two is governed by some unknown function $f$, i.e.

$$ \mathcal{y = f(X)} $$

Here, $\mathcal{X}$ is simply the data we have, stored as a pandas DataFrame (`pd.DataFrame`), whereas $\mathcal{y}$ is the target variable we want to predict, with one value for each observation, stored as a pandas Series (`pd.Series`).
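As a minimal sketch of this setup, the snippet below splits a small table into $\mathcal{X}$ and $\mathcal{y}$. The column names (`fever`, `cough`, `has_disease`) and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation (a patient).
data = pd.DataFrame(
    {
        "fever": [1, 0, 1, 0],
        "cough": [1, 1, 0, 0],
        "has_disease": [1, 0, 1, 0],  # assumed target column
    }
)

# X holds the input variables (a DataFrame); y holds the target (a Series),
# with exactly one value per observation.
X = data.drop(columns=["has_disease"])
y = data["has_disease"]

print(type(X).__name__, X.shape)  # DataFrame (4, 2)
print(type(y).__name__, y.shape)  # Series (4,)
```

Each row of `X` lines up with the entry of `y` at the same index, which is what "one value for each observation" means in practice.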

<img width="80%" align="center" src="../assets/supervised2.png">

<br/>

The figure above just depicts the core assumption underlying most machine learning problems. 

**Assumptions**, loosely speaking, are what we formally call **models**. 

Therefore, the basic mathematical model underlying most machine learning problems is that the target variable $\mathcal{y}$ is a function of the input variables $\mathcal{X}$ i.e. $\mathcal{y = f(X)}$.

<hr/>

The **core problem**, distinct from any models or assumptions, is that <u>the function $\mathcal{f}$ is unknown to us and we need to <i>"learn"</i> it from the data</u>.


Such problems fall under the broad category of **Supervised Learning**. In such problems, we have a set of observations, each observation consisting of a set of input variables $\mathcal{X}$ and a target variable $\mathcal{y}$, and we want to learn a function $\mathcal{f}$ that maps the input $\mathcal{X}$ to the output $\mathcal{y}$.


<img align="center" width="80%" src="../assets/supervised1.png">

**Classification** is the case where **$\mathcal{y}$ is a <a href="../1_programming/41_types.html#categorical-features">discrete categorical variable</a>**. For example, we might want to predict whether a patient has a disease or not, based on their symptoms. In this case, $\mathcal{y}$ is a binary variable, taking the value 1 if the patient has the disease, and 0 otherwise. Another example of a classification problem is predicting the sentiment of a movie review: positive, negative, or neutral.

The classification problem is to learn a function $f$ that maps the input $\mathcal{X}$ to the discrete output $\mathcal{y}$.
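To make "learning $f$ from data" concrete, here is a minimal sketch that uses a 1-nearest-neighbour rule as a stand-in for the learned function. The choice of nearest neighbour, the helper name `fit_nearest_neighbour`, and the toy patient data (temperature, cough flag) are all assumptions made for illustration; they are not the method this chapter goes on to develop.

```python
from typing import Callable, List, Sequence

Observation = Sequence[float]


def fit_nearest_neighbour(
    X: List[Observation], y: List[int]
) -> Callable[[Observation], int]:
    """Return an approximation f_hat of the unknown f, built from labelled data."""
    training = list(zip(X, y))

    def f_hat(x: Observation) -> int:
        # Squared Euclidean distance between x and one stored observation.
        def squared_distance(pair) -> float:
            features, _label = pair
            return sum((a - b) ** 2 for a, b in zip(features, x))

        # Predict the label of the closest training observation.
        _features, label = min(training, key=squared_distance)
        return label

    return f_hat


# Hypothetical observations: (body temperature in Celsius, cough flag).
X_train = [(38.9, 1.0), (39.4, 1.0), (36.8, 0.0), (37.0, 0.0)]
y_train = [1, 1, 0, 0]  # 1 = has the disease, 0 = does not

f_hat = fit_nearest_neighbour(X_train, y_train)
print(f_hat((39.0, 1.0)))  # 1
print(f_hat((36.7, 0.0)))  # 0
```

The output of `f_hat` is always one of the labels seen in `y_train`, which is what makes this a classification problem rather than a regression problem.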




Classification lies at the heart of both human and machine intelligence. Deciding what letter, word, or image has been presented to our senses, recognizing faces or voices, sorting mail, assigning grades to homework: these are all examples of assigning a category to an input.

The goal of classification is to take a single observation, extract some useful features, and thereby classify the observation into one of a set of discrete classes. One method for classifying text is to use handwritten rules. There are many areas of language processing where handwritten rule-based classifiers constitute a state-of-the-art system, or at least part of it.

We focus on one common text categorization task, **sentiment analysis**, the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object. A review of a movie, book, or product on the web expresses the author’s sentiment toward the product, while an editorial or political text expresses sentiment toward a candidate or political action. Extracting consumer or public sentiment is thus relevant for fields from marketing to politics.

Spam detection is another important commercial application, the binary classification task of assigning an email to one of the two classes spam or not-spam. Many lexical and other features can be used to perform this classification. For example, you might quite reasonably be suspicious of an email containing phrases like “online pharmaceutical” or “WITHOUT ANY COST” or “Dear Winner”.
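A handwritten rule of this kind can be sketched in a few lines. The phrase list comes from the examples above; the function name `is_spam` and the case-insensitive substring matching are assumptions made for illustration, not a production spam filter.

```python
# Phrases taken from the examples in the text, lower-cased for matching.
SUSPICIOUS_PHRASES = ("online pharmaceutical", "without any cost", "dear winner")


def is_spam(email_text: str) -> bool:
    """Flag an email as spam if it contains any suspicious phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in SUSPICIOUS_PHRASES)


print(is_spam("Dear Winner, claim your prize WITHOUT ANY COST!"))  # True
print(is_spam("Meeting moved to 3pm, see agenda attached."))       # False
```

A rule like this is easy to read but also easy to evade: a small change in wording is enough to slip past an exact phrase match.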

Rules can be fragile, however, as situations or data change over time, and for some tasks humans aren’t necessarily good at coming up with the rules. Most cases of classification in language processing are instead done via **supervised machine learning**, and this will be the subject of the remainder of this chapter. In supervised learning, we have a data set of input observations, each associated with some correct output (a ‘supervision signal’). The goal of the algorithm is to learn how to map from a new observation to a correct output.

Formally, the task of supervised classification is to take an input $x$ and a fixed set of output classes $Y = \{y_1, y_2, \ldots, y_M\}$ and return a predicted class $y \in Y$. For text classification, we’ll sometimes talk about $c$ (for “class”) instead of $y$ as our output variable, and $d$ (for “document”) instead of $x$ as our input variable. In the supervised situation we have a training set of $N$ documents that have each been hand-labeled with a class: $\{(d_1, c_1), \ldots, (d_N, c_N)\}$. Our goal is to learn a classifier that is capable of mapping from a new document $d$ to its correct class $c \in C$, where $C$ is some set of useful document classes. A probabilistic classifier additionally will tell us the probability of the observation being in the class. This full distribution over the classes can be useful information for downstream decisions; avoiding making discrete decisions early on can be useful when combining systems.
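In symbols, a probabilistic classifier assigns each class $c \in C$ a probability given the document $d$. One common way to turn that distribution into a single decision is to pick the most probable class. The notation $\hat{c}$ (the predicted class) and $P(c \mid d)$ is an assumption introduced here for illustration:

$$ \hat{c} = \underset{c \in C}{\operatorname{argmax}} \; P(c \mid d), \qquad \sum_{c \in C} P(c \mid d) = 1 $$

Keeping the whole distribution, rather than only $\hat{c}$, lets a later stage see whether the classifier was confident (say, 0.95 versus 0.05) or nearly undecided (0.51 versus 0.49).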

Many kinds of machine learning algorithms are used to build classifiers. This chapter introduces naive Bayes; the following one introduces logistic regression. These exemplify two ways of doing classification. Generative classifiers like naive Bayes build a model of how a class could generate some input data. Given an observation, they return the class most likely to have generated the observation. Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes. While discriminative systems are often more accurate and hence more commonly used, generative classifiers still have a role.
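The generative idea can be sketched with a very small multinomial naive Bayes classifier. Everything specific here is an assumption made for illustration: the whitespace tokenization, the add-one smoothing, the rule of skipping words never seen in training, the function names, and the four-document training set.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict_proba(model, text):
    """Return a probability for every class, given the words in `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        # Log prior: how common the class is in the training set.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: ignore words unseen in training
            # Log likelihood of the word under this class, add-one smoothed.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the log scores into a distribution over the classes.
    max_score = max(log_scores.values())
    unnormalized = {c: math.exp(s - max_score) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


# Hypothetical hand-labeled training set.
training_set = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("a dull and boring film", "negative"),
    ("terrible acting and a boring story", "negative"),
]

model = train_naive_bayes(training_set)
probs = predict_proba(model, "a great film")
prediction = max(probs, key=probs.get)
print(prediction)  # positive
```

The classifier scores each class by asking how likely that class is to have produced the observed words, then returns the class with the highest score. A discriminative model such as logistic regression would instead learn weights on the features directly, without modeling how the words are generated.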

<img src="../assets/classification.png">


<img src="../assets/cross_validation.png">


<img src="../assets/training_testing.png">


<img src="../assets/sentiment.png">