# Introduction to Statistical Learning

Statistical learning can be divided into two categories:
supervised learning and unsupervised learning.

`Supervised learning` refers to a collection of techniques and algorithms
that, when given a set of example inputs and example outputs,
learn to associate the inputs with the outputs.
The outputs usually need to be provided by a `supervisor`, which
could be a human or another algorithm, and this is where the name comes from.

`Unsupervised learning` refers to a collection of techniques and algorithms
that are given inputs only, with no outputs.
The goal of unsupervised learning is to learn relationships and structure
from such data.

We will look at `regression`, which refers to problems
that have continuous outputs, and we will also
look at `classification`, which refers to problems
that have categorical outputs like 0 or 1, blue or green, and so on.
We will mainly use the scikit-learn machine learning library
for Python to implement these models.

Note that this case study deals with supervised learning only
and we will leave unsupervised learning for the future.

Quantitative variables take on numerical values, such as income,
whereas qualitative variables take on values in a given category,
such as male or female.
The range of quantitative variables depends on what they measure
and what units are used.
For example, we might measure annual income
in dollars or thousands of dollars.
Similarly, qualitative variables can have two or more categories or classes.
The male-female example refers to a qualitative variable
with two categories, but in principle, there can be any number of categories.
In some cases, we convert a continuous variable to a categorical variable
by specifying the cutoff points between the categories.
For example, for income we might specify that a household with an annual income
less than `$30,000` a year is a low income household;
a household with an income between `$30,000` and `$100,000`
is a middle income household; and a household with an annual income
exceeding `$100,000` a year is a high income household.
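The income example can be sketched in plain Python. The function name `income_category` is hypothetical, and the handling of the exact boundary values is an assumption: the text does not say which category an income of exactly `$30,000` or `$100,000` belongs to, so this sketch places both in the middle category.

```python
def income_category(annual_income):
    """Map an annual household income in dollars to an income category."""
    if annual_income < 30_000:
        return "low"
    if annual_income <= 100_000:
        # Assumption: both boundary values count as middle income.
        return "middle"
    return "high"


# Hypothetical incomes, chosen to include both cutoff points.
incomes = [12_500, 30_000, 64_000, 100_000, 250_000]
categories = [income_category(value) for value in incomes]
print(categories)
```

Once the cutoffs are applied, the original quantitative variable has been replaced by a qualitative variable with three categories.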

Methods in supervised learning are divided into two groups
based on whether the output variable, also called the outcome,
is quantitative or qualitative.
If the outcome is quantitative, we talk about `regression problems`,
whereas if the outcome is qualitative, we talk about `classification problems`.

Note that this division into regression and classification problems
is made based on the nature of the output, not the inputs,
and it's common for both regression and classification problems
to involve a mixture of quantitative and qualitative inputs.

In both problems, we have some input variable X and an output variable Y,
and we seek some function f of X for predicting Y, given values of the input
X. What the best prediction is depends on the so-called `loss function`,
which is a way of quantifying how far our predictions for Y for a given
value of X are from the true observed values of Y. This
is the subject of `statistical decision theory`, which is outside our scope,
but we will state the relevant results here.
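To make the idea of a loss function concrete, here is a minimal sketch of the two losses discussed in the following paragraphs, written in plain Python. The function names and the small data sets are illustrative assumptions, not part of any library.

```python
def squared_error_loss(y_true, y_pred):
    """Squared error loss for a single quantitative prediction."""
    return (y_true - y_pred) ** 2


def zero_one_loss(y_true, y_pred):
    """0-1 loss for a single classification: 0 if correct, 1 if wrong."""
    return 0 if y_true == y_pred else 1


def average_loss(loss, y_true, y_pred):
    """Average a per-observation loss over paired true and predicted values."""
    return sum(loss(t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical regression outputs and predictions.
reg_loss = average_loss(squared_error_loss, [3.0, 5.0, 2.0], [2.5, 5.0, 4.0])

# Hypothetical class labels and predictions (one of four is wrong).
clf_loss = average_loss(
    zero_one_loss,
    ["blue", "green", "blue", "green"],
    ["blue", "blue", "blue", "green"],
)

print(reg_loss, clf_loss)
```

Under squared error loss, predictions are penalized more heavily the farther they are from the observed value, whereas 0-1 loss only records whether a classification was right or wrong.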

First, in a `regression setting`, by far the most common loss function
is the so-called `squared error loss`.
In that case, the best value to predict for a given X
is the conditional expectation, or conditional average,
of Y given X. In other words, what we should predict
is the average of all values of Y that correspond to a given value of X.
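A small numerical check can illustrate this result. The sketch below takes a few hypothetical values of Y that all share one value of X, and compares the total squared error of several candidate predictions; the data and candidate values are made up for illustration.

```python
from statistics import mean

# Hypothetical outputs observed at one and the same value of X.
y_at_x = [2.0, 3.0, 7.0]


def total_squared_error(prediction, ys):
    """Sum of squared errors when predicting one value for every y in ys."""
    return sum((y - prediction) ** 2 for y in ys)


conditional_mean = mean(y_at_x)

# Compare the mean against a handful of other candidate predictions.
candidates = [2.0, 3.0, 4.0, 5.0, 7.0]
errors = {c: total_squared_error(c, y_at_x) for c in candidates}
best_candidate = min(errors, key=errors.get)

print(conditional_mean, errors, best_candidate)
```

Among these candidates, the one with the smallest total squared error coincides with the average of the observed values, which is what the result above leads us to expect.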

Second, in a `classification setting`, we most often
use the so-called `0-1 loss function`, and in that case,
the best classification for a given X is obtained
by assigning the observation to the class with the highest
conditional probability given X. In other words, for a given value of X,
we compute the probability of each class and we then
assign the observation to the class with the highest probability.
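The classification rule can be sketched as follows, assuming the conditional probabilities are estimated by simple class frequencies among observations that share the same value of X. The data and the function names are hypothetical, and the sketch only handles values of X that appear in the data.

```python
from collections import Counter, defaultdict

# Hypothetical (x, class label) observations.
observations = [
    (1, "blue"), (1, "blue"), (1, "green"),
    (2, "green"), (2, "green"), (2, "blue"), (2, "green"),
]

# Collect the class labels observed at each value of x.
labels_by_x = defaultdict(list)
for x, label in observations:
    labels_by_x[x].append(label)


def conditional_probabilities(x):
    """Estimate the probability of each class given x by relative frequency."""
    counts = Counter(labels_by_x[x])
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def classify(x):
    """Assign x to the class with the highest estimated probability."""
    probs = conditional_probabilities(x)
    return max(probs, key=probs.get)


print(conditional_probabilities(1), classify(1))
print(conditional_probabilities(2), classify(2))
```

Estimating probabilities by counting is only one possible choice; the rule itself stays the same however the conditional probabilities are obtained.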

In the regression setting, we are estimating
the conditional expectation of Y given a specific value x.
This is simply a conditional mean, a conditional average
taken over all points that share the same value of x.
If we repeat this for all values of x, we trace out a curve.
Typically we call this f of x.
That's our regression function.
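As a rough sketch of this procedure, the code below groups hypothetical (x, y) pairs by their value of x and averages the outputs in each group. The name `f_hat` is an assumption used here for the estimated regression function.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data in which several points share the same value of x.
data = [(1, 2.0), (1, 4.0), (2, 5.0), (2, 7.0), (2, 6.0), (3, 9.0)]

# Group the outputs by the value of x they were observed at.
ys_by_x = defaultdict(list)
for x, y in data:
    ys_by_x[x].append(y)

# The estimated regression function: one conditional average per value of x.
f_hat = {x: mean(ys) for x, ys in sorted(ys_by_x.items())}

for x, prediction in f_hat.items():
    print(x, prediction)
```

This direct averaging only works when several observations share exactly the same value of x. With a truly continuous input that is rarely the case, so practical methods typically average over nearby points or assume a particular form for f.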

In a classification setting with two classes, 0 and 1,
we would want to estimate two probabilities.
These are conditional probabilities.
The first is the probability
that the random variable Y is equal to 0 given the value of x.
The second is the probability that Y is equal to 1
given that the random variable, uppercase X, is equal to the value lowercase x.
Whichever of these two probabilities is larger,
that's going to be our prediction for a given value of x.


Least squares loss is used to estimate the expected value of outputs, whereas 0-1 loss is used to estimate the probability of outputs.