\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\eg}{\emph{Example:}}
\newcommand{\ie}{\emph{i.e.}}
\newcommand{\isdef}{\stackrel{\text{def}}{=}}
\hyphenation{anti-sym-met-ric}
%%
\author{James Geddes}
inputs, a set, $Y$, of possible outputs, and a collection of $d$ pairs
$(x_i, y_i)\in X\times Y$ (for $i=1,\dots,d$), that comprise the data. For
now, we suppose no extra structure on $X$; however, on $Y$ we imagine
there is some notion of “closeness” (to be made precise
later). Consider the challenge of finding a map,
$\hat{f}\colon X\to Y$, having the property that $\hat{f}(x_i)$ is close
to $y_i$ for each~$i$. (All of this is somewhat informal. In
particular, the notion of closeness might differ by data point.)

One immediate snag is that finding such a map is \emph{far too
easy}. Consider:
make the problem tractable. “Simple” might mean “smooth,” or
“low-order” or even “linear.”}

Whatever one's definition of “reasonable,” imagine that, somehow or
other, a particular collection of functions, $\mathcal{F}$, has been
fixed. What is really wanted is a function taken from \emph{this}
collection that is in some way “close” to the data.\sidenote{It turns
out to be quite difficult to get the “size” of $\mathcal{F}$ just right. If
there are too few functions to draw from, we run the risk of not
being able to match the real function; if there are too many, we run
the risk of choosing one that is not physically reasonable just
because it is a good match to the data.} In order to make any
further progress it is now necessary to say something more specific
about the meaning of “close.”
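
As an illustration of what $\mathcal{F}$ might contain (assuming, just
for this example, that $X=\setR$), one might take the affine functions,
\[
  \mathcal{F} = \{\, x \mapsto ax + b \mid a, b \in \setR \,\},
\]
so that choosing a function from $\mathcal{F}$ amounts to choosing the
two numbers $a$ and~$b$.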

\section{Least squares}

A popular version of “close” runs as follows. First, suppose that the
set of “possible outputs,” $Y$, is the real
numbers,~$\setR$.\sidenote{On the face of it, this is quite a strong
supposition. For example, it is not true of any of the examples at
the beginning of this note.} One immediate advantage is that there
is a natural notion of the closeness of two real numbers such as $y_i$
and $f(x_i)$; namely the value of $\lvert f(x_i)-y_i\rvert$. In the
version of curve fitting known as \emph{least squares}, the closeness
of a function $f$ to the whole dataset, $(x_i, y_i)$ for $i=1,\dotsc,d$,
is measured by the value of the expression
\begin{equation}
\label{eq:least-squares-sum}
L(f) \isdef \sum_{i=1}^d {\bigl(f(x_i)-y_i\bigr)}^2.
\end{equation}
That is, one says that the function $f$ is close to the data just in
case the value of $L(f)$ is small. More generally, a function like $L$
is known as the \emph{loss function} (or sometimes the “loss
functional,” to indicate that its argument is itself a function,
although a functional is, of course, also a function) and this
particular one is sometimes called the “quadratic loss” or “squared
loss.”
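
As a small numerical illustration (with made-up values, not drawn from
any particular example): suppose $d=2$, with $y_1 = 1$ and $y_2 = 3$,
and suppose some candidate $f$ happens to satisfy $f(x_1) = 1.5$ and
$f(x_2) = 2$. Then
\[
  L(f) = (1.5 - 1)^2 + (2 - 3)^2 = 0.25 + 1 = 1.25,
\]
whereas a candidate $g$ with $g(x_1) = 1$ and $g(x_2) = 3$ achieves
$L(g) = 0$, an exact match to the data.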

The above definition of “close” has some nice properties: it is
related to the distance of each $y_i$ from the corresponding $f(x_i)$;
it is non-negative; and it is zero only when the function exactly
matches the data. One sometimes attempts to \emph{justify} this
definition, often by appeal to statistical principles. Whatever the
justification, this particular choice of $L(f)$ can be written in a
way that suggests a further simplification, one which makes the
problem significantly more tractable.

\begin{marginfigure}
\begin{center}
\asyinclude[width=4cm, height=4cm, keepAspect=false]{evalmap.asy}
\end{center}
\caption{The evaluation map, $\mathcal{E}_{\bm{x}}$, sends a function $f$ to a
point of~$\setR^d$; the loss
measures the distance from this point to the data,
$\bm{y}$.\label{fig:evalmap-on-f}}
\end{marginfigure}

Recall that $\setR^d$ is the vector space of length-$d$ tuples of
reals, with addition of tuples “element-wise.” Thus, one element
of~$\setR^d$ is the tuple $\bm{y}=(y_1,\dotsc,y_d)$. Another element
of $\setR^d$ is produced, for each $f\in\mathcal{F}$, by the following map:
\[
\begin{aligned}
\mathcal{E}_{\bm{x}} \colon \mathcal{F} &\to \setR^d \\
f &\mapsto (f(x_1), \dotsc, f(x_d)).
\end{aligned}
\]
The subscript $\bm{x}$ (in $\mathcal{E}_{\bm{x}}$) is there as a reminder that
the evaluation map depends upon the data. For fixed data,
$\mathcal{E}_{\bm{x}}$ maps a function, $f$, to the element of
$\setR^d$ given by $(f(x_1), \dotsc, f(x_d))$. This map is known as
the \emph{evaluation map}. (See figure~\ref{fig:evalmap-on-f} for an
illustration.)
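
As an illustration (assuming, just for this example, that $X = \setR$
and $d = 3$): if $\bm{x} = (0, 1, 2)$ and $f$ is the squaring function,
$f(x) = x^2$, then
\[
  \mathcal{E}_{\bm{x}}(f) = \bigl(f(0), f(1), f(2)\bigr) = (0, 1, 4) \in \setR^3.
\]
Two functions are sent to the same point of~$\setR^3$ precisely when
they agree on $x_1$, $x_2$, and~$x_3$.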

The idea, now, is to express $L(f)$ as the square of the “distance”
between $\mathcal{E}_{\bm{x}}(f)$ and $\bm{y}$; or, equivalently, as the square
of the “length” of $\mathcal{E}_{\bm{x}}(f)-\bm{y}$. A natural notion of length
in $\setR^d$ is the Euclidean one: for
$\bm{v}=(v_1, v_2, \dotsc, v_d)$, the square of the “length” of $\bm{v}$
is given by ${\lVert \bm{v} \rVert}^2 = \sum_{i=1}^d v_i^2$. The
expression $\lVert\cdot\rVert$ is called the \emph{canonical norm}.
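
For example, in $\setR^2$ the vector $\bm{v} = (3, 4)$ has
\[
  {\lVert \bm{v} \rVert}^2 = 3^2 + 4^2 = 25,
\]
so that $\lVert \bm{v} \rVert = 5$: the familiar Pythagorean length.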

Note, however, that $\bm{x}$ itself is \emph{not}, in general, a vector,
because $X$ is not, in general, a vector space. One is perfectly
entitled to write, say, $\bm{x}=(x_1, \dotsc, x_d)$, but what is
denoted is a tuple, not a vector.

We now make use of the vector space structure of $\setR^d$: the loss
function can be written
\begin{equation}
\label{eq:norm-loss}
L(f) = {\Vert \mathcal{E}_{\bm{x}}(f) - \bm{y}\rVert }^2.
\end{equation}
The vector space structure enters through the subtraction on the
right-hand side of this expression.
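
To check that nothing has changed, one can unwind the definitions: the
$i$th component of $\mathcal{E}_{\bm{x}}(f) - \bm{y}$ is $f(x_i) - y_i$, so
\[
  {\lVert \mathcal{E}_{\bm{x}}(f) - \bm{y} \rVert}^2
    = \sum_{i=1}^d {\bigl(f(x_i) - y_i\bigr)}^2,
\]
which is precisely eq.~\eqref{eq:least-squares-sum}.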

We now summarise the discussion to this point. Our problem was to
choose, from a set of functions, $\mathcal{F}$, a particular function,
Expand All @@ -230,14 +243,12 @@ \section{Least squares}
The difference between the form of the loss function in
eq.~\eqref{eq:norm-loss} and the original form,
eq.~\eqref{eq:least-squares-sum}, is just notation. It is suggestive
notation, however. On the right-hand side of eq.~\eqref{eq:norm-loss} we
have an expression built from natural vector-space concepts
(subtraction and the norm). It is a simplification to make these
assumptions about the domain of the data and the form of the loss
function. Have we simplified enough to be able to attack this general
problem?

\section{Linear regression}

