\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\eg}{\emph{Example:}}
\newcommand{\ie}{\emph{i.e.}}
\newcommand{\isdef}{\stackrel{\text{def}}{=}}
\hyphenation{anti-sym-met-ric}
%%
\author{James Geddes}
inputs, a set, $Y$, of possible outputs, and a collection of $d$ pairs
$(x_i, y_i)\in X\times Y$ (for $i=1,\dots,d$), that comprise the data. For
now, we suppose no extra structure on $X$; however, on $Y$ we imagine
there is some notion of “closeness” (to be made precise
later). Consider the challenge of finding a map,
$\hat{f}\colon X\to Y$, having the property that $\hat{f}(x_i)$ is close
to $y_i$ for each~$i$. (All of this is somewhat informal. In
particular, the notion of closeness might differ by data point.)

One immediate snag is that finding such a map is \emph{far too
easy}. Consider:
make the problem tractable. “Simple” might mean “smooth,” or
“low-order” or even “linear.”}

Whatever one's definition of “reasonable,” imagine that, somehow or
other, a particular collection of functions, $\mathcal{F}$, has been
fixed. What is really wanted is a function taken from \emph{this}
collection that is in some way “close” to the data.\sidenote{It turns
out to be quite difficult to get the “size” of $\mathcal{F}$ just right. If
there are too few functions to draw from, we run the risk of not
being able to match the real function; if there are too many, we run
the risk of choosing one that is not physically reasonable just
because it is a good match to the data.} In order to make any
further progress it is now necessary to say something more specific
about the meaning of “close.”
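
As an illustration of what $\mathcal{F}$ might contain (assuming, just
for this example, that $X=\setR$), one might take the affine functions,
\[
  \mathcal{F} = \{\, x \mapsto ax + b \mid a, b \in \setR \,\},
\]
so that choosing a function from $\mathcal{F}$ amounts to choosing the
two numbers $a$ and~$b$.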

\section{Least squares}

A popular version of “close” runs as follows. First, suppose that the
set of “possible outputs,” $Y$, is the real
numbers,~$\setR$.\sidenote{On the face of it, this is quite a strong
supposition. For example, it is not true of any of the examples at
the beginning of this note.} One immediate advantage is that there
is a natural notion of the closeness of two real numbers such as $y_i$
and $f(x_i)$; namely the value of $\lvert f(x_i)-y_i\rvert$. In the
version of curve fitting known as \emph{least squares}, the closeness
of a function $f$ to the whole dataset, $(x_i, y_i)$ for $i=1,\dotsc,d$,
is measured by the value of the expression
\begin{equation}
\label{eq:least-squares-sum}
L(f) \isdef \sum_{i=1}^d {\bigl(f(x_i)-y_i\bigr)}^2.
\end{equation}
That is, one says that the function $f$ is close to the data just in
case the value of $L(f)$ is small. More generally, a function like $L$
is known as the \emph{loss function} (or sometimes the “loss
functional,” to indicate that its argument is itself a function,
although a functional is, of course, also a function) and this
particular one is sometimes called the “quadratic loss” or “squared
loss.”
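
As a small numerical illustration (with made-up values, not drawn from
any particular example): suppose $d=2$, with $y_1 = 1$ and $y_2 = 3$,
and suppose some candidate $f$ happens to satisfy $f(x_1) = 1.5$ and
$f(x_2) = 2$. Then
\[
  L(f) = (1.5 - 1)^2 + (2 - 3)^2 = 0.25 + 1 = 1.25,
\]
whereas a candidate $g$ with $g(x_1) = 1$ and $g(x_2) = 3$ achieves
$L(g) = 0$, an exact match to the data.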

The above definition of “close” has some nice properties: it is
related to the distance of each $y_i$ from the corresponding $f(x_i)$;
it is non-negative; and it is zero only when the function exactly
matches the data. One sometimes attempts to \emph{justify} this
definition, often by appeal to statistical principles. Whatever the
justification, this particular choice of $L(f)$ can be written in a
way that suggests a further simplification, one which makes the
problem significantly more tractable.

\begin{marginfigure}
\begin{center}
\asyinclude[width=4cm, height=4cm, keepAspect=false]{evalmap.asy}
\end{center}
\caption{The evaluation map, $\mathcal{E}_{\bm{x}}$, sends a function $f$ to a
point of~$\setR^d$; the loss
measures the distance from this point to the data,
$\bm{y}$.\label{fig:evalmap-on-f}}
\end{marginfigure}

Recall that $\setR^d$ is the vector space of length-$d$ tuples of
reals, with addition of tuples “element-wise.” Thus, one element
of~$\setR^d$ is the tuple $\bm{y}=(y_1,\dotsc,y_d)$. Another element
of $\setR^d$ is produced, for each $f\in\mathcal{F}$, by the following map:
\[
\begin{aligned}
\mathcal{E}_{\bm{x}} \colon \mathcal{F} &\to \setR^d \\
f &\mapsto (f(x_1), \dotsc, f(x_d)).
\end{aligned}
\]
The subscript $\bm{x}$ (in $\mathcal{E}_{\bm{x}}$) is there as a reminder that
the evaluation map depends upon the data. For fixed data,
$\mathcal{E}_{\bm{x}}$ maps a function, $f$, to the element of
$\setR^d$ given by $(f(x_1), \dotsc, f(x_d))$. This map is known as
the \emph{evaluation map}. (See figure~\ref{fig:evalmap-on-f} for an
illustration.)
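
As an illustration (assuming, just for this example, that $X = \setR$
and $d = 3$): if $\bm{x} = (0, 1, 2)$ and $f$ is the squaring function,
$f(x) = x^2$, then
\[
  \mathcal{E}_{\bm{x}}(f) = \bigl(f(0), f(1), f(2)\bigr) = (0, 1, 4) \in \setR^3.
\]
Two functions are sent to the same point of~$\setR^3$ precisely when
they agree on $x_1$, $x_2$, and~$x_3$.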

The idea, now, is to express $L(f)$ as the square of the “distance”
between $\mathcal{E}_{\bm{x}}(f)$ and $\bm{y}$; or, equivalently, as the square
of the “length” of $\mathcal{E}_{\bm{x}}(f)-\bm{y}$. A natural notion of length
in $\setR^d$ is the Euclidean one: for
$\bm{v}=(v_1, v_2, \dotsc, v_d)$, the square of the “length” of $\bm{v}$
is given by ${\lVert \bm{v} \rVert}^2 = \sum_{i=1}^d v_i^2$. The
expression $\lVert\cdot\rVert$ is called the \emph{canonical norm}.
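
For example, in $\setR^2$ the vector $\bm{v} = (3, 4)$ has
\[
  {\lVert \bm{v} \rVert}^2 = 3^2 + 4^2 = 25,
\]
so that $\lVert \bm{v} \rVert = 5$: the familiar Pythagorean length.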

Note, however, that $\bm{x}$ itself is \emph{not}, in general, a vector,
because $X$ is not, in general, a vector space. One is perfectly
entitled to write, say, $\bm{x}=(x_1, \dotsc, x_d)$, but what is
denoted is a tuple, not a vector.

We now make use of the vector space structure of $\setR^d$: the loss
function can be written
\begin{equation}
\label{eq:norm-loss}
L(f) = {\Vert \mathcal{E}_{\bm{x}}(f) - \bm{y}\rVert }^2.
\end{equation}
The vector space structure enters through the subtraction on the
right-hand side of this expression.
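
To check that nothing has changed, one can unwind the definitions: the
$i$th component of $\mathcal{E}_{\bm{x}}(f) - \bm{y}$ is $f(x_i) - y_i$, so
\[
  {\lVert \mathcal{E}_{\bm{x}}(f) - \bm{y} \rVert}^2
    = \sum_{i=1}^d {\bigl(f(x_i) - y_i\bigr)}^2,
\]
which is precisely eq.~\eqref{eq:least-squares-sum}.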

We now summarise the discussion to this point. Our problem was to
choose, from a set of functions, $\mathcal{F}$, a particular function,
Expand All @@ -230,14 +243,12 @@ \section{Least squares}
The difference between the form of the loss function in
eq.~\eqref{eq:norm-loss} and the original form,
eq.~\eqref{eq:least-squares-sum}, is just notation. It is suggestive
notation, however. On the right-hand side of eq.~\eqref{eq:norm-loss} we
have an expression built from natural vector-space concepts
(subtraction and the norm). It is a simplification to make these
assumptions about the domain of the data and the form of the loss
function. Have we simplified enough to be able to attack this general
problem?

\section{Linear regression}

