# MATH 405/607 

# Numerical Methods for Differential Equations

[[Instructor: Christoph Ortner]](http://www.math.ubc.ca/~ortner/)  [[course page]](https://github.com/cortner/math405_2022)

# Guest Lecture: Polynomial Regression by Ruo Ning Qiu


# Polynomial Regression 

* Linear Regression 
* Linear Least Squares
* Prediction Error

### Literature 

* [Driscoll, Fundamentals of Numerical Computations, Linear Least Square](https://fncbook.github.io/fnc/leastsq/overview.html)

In [None]:
include("math405.jl")

## Fitting functions
Suppose we are interested in an unknown function $f:\R \to \R$. We have seen that we can approximate functions $f$ using polynomial interpolation by matching values of $f$ with the polynomials at the interpolation nodes. In many cases, interpolation is not the best way to fit a function. We can relax the requirement of matching values at some nodes by performing a polynomial expansion as a linear model to fit against $f$ instead.

Define the linear model $\tilde{f}$ to be

$$ \tilde{f}(x) = \sum_{i=0}^N c_i P_i(x) = c^\top P(x)$$

where $c = (c_0, \dots, c_N) \in \R^{N+1}$ is the vector of undetermined parameters, $P(x) = (P_0(x), \dots, P_N(x)) \in \R^{N+1}$ collects the basis functions, each $P_i$ is a polynomial of degree $i$ (for example the monomials $P_i(x) = x^i$), and $N \in \mathbb{N}^+$ is the maximum polynomial degree in the model. The model is linear in the parameters $c$, even though it is nonlinear in $x$.
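
As a minimal sketch of this definition, assuming the monomial basis, the model can be evaluated as follows. The helper names `monomials` and `model` and the coefficient values are illustrative only and are not part of `math405.jl`.

```julia
# Monomial basis P_i(x) = x^i for i = 0, ..., N (N + 1 functions in total).
monomials(x, N) = [x^i for i in 0:N]

# Evaluate the linear model f̃(x) = cᵀ P(x); the degree N is read off from length(c).
model(c, x) = sum(c .* monomials(x, length(c) - 1))

c = [1.0, -2.0, 3.0]   # illustrative coefficients: f̃(x) = 1 - 2x + 3x^2
model(c, 2.0)          # 1 - 4 + 12 = 9.0
```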

## The Linear Least Squares Problem
Given $M$ training data points $\{(x_j, y_j)\}_{j=1, \dots, M}$ with $y_j = f(x_j)$, we want to find parameters $c^*$ for which the linear model $\tilde{f}$ approximates $f$ well. One possible way is to minimize the sum of squared differences between the observed values $y_j$ and the model values $\tilde{f}(x_j)$ over the training points:

$$c^* = \argmin_{c \in \R^{N+1}} \sum_{j=1}^M |y_j - \tilde{f}(x_j)|^2$$

This loss function can be written in matrix form, similar to how interpolation leads to a linear system with a Vandermonde matrix:

$$ c^* = \argmin_{c \in \R^{N+1}} \| Y - Ac \|^2 = \argmin_{c \in \R^{N+1}} \| Ac - Y \|^2$$ 
where
$$ A = \begin{pmatrix}
        P_0(x_1) & P_1(x_1) &  \cdots & P_N(x_1)  \\ 
        P_0(x_2) & P_1(x_2) &  \cdots & P_N(x_2)  \\ 
          \vdots & \vdots &        & \vdots \\ 
        P_0(x_M) & P_1(x_M) &  \cdots & P_N(x_M)  \\  
        \end{pmatrix} \in \R^{M \times (N+1)}
$$
is called the design matrix, and $Y = (y_1, \dots, y_M) = (f(x_1), \dots, f(x_M)) \in \R^M$.
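
A small sketch of how such a design matrix might be assembled for the monomial basis, with columns $P_0, \dots, P_N$ as in the model definition. The helper `design` and the target `f_demo` are hypothetical names chosen for this illustration; since `f_demo` is itself a quadratic, a degree-2 fit should recover its coefficients up to rounding.

```julia
using LinearAlgebra

# Design matrix with entries A[j, i+1] = x_j^i (monomial basis, degree ≤ N).
design(X, N) = [x^i for x in X, i in 0:N]

X = range(-1, 1, length = 7)      # M = 7 sample points
f_demo(x) = 1 + 2x - x^2          # hypothetical target, itself a quadratic
Y = f_demo.(X)

A = design(X, 2)                  # size (7, 3): M rows, N + 1 columns
c = A \ Y                         # least-squares solution for a tall matrix
```

Here `c` should be close to `[1, 2, -1]`, the coefficients of `f_demo`.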

## The Normal Equation
The minimizer of the loss function
$$ \phi(c) = \| Ac - Y \|^2 = c^\top A^\top A c - 2 c^\top A^\top Y + \|Y\|^2$$
can be computed directly, since it must satisfy
$$ \nabla \phi(c) = 2 A^\top A c - 2 A^\top Y = 0.$$
We therefore solve for $c^*$ with the following linear system (the second line assumes that $A$ has full column rank, so that $A^\top A$ is invertible):
$$
\begin{align}
A^\top A c^* &= A^\top Y  \\ 
c^* &= (A^\top A)^{-1} A^\top Y
\end{align}
$$
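
The following sketch checks the normal equation numerically on a small, well-conditioned problem. The data (`exp` sampled on an equispaced grid) are an assumption made purely for illustration.

```julia
using LinearAlgebra

X = range(-1, 1, length = 20)
A = [x^i for x in X, i in 0:3]     # 20 × 4 monomial design matrix
Y = exp.(X)                        # hypothetical data

c_normal = (A' * A) \ (A' * Y)     # solve AᵀA c = AᵀY
residual = A * c_normal - Y

norm(A' * residual)                # close to 0: the gradient condition Aᵀ(Ac - Y) = 0
norm(c_normal - A \ Y)             # small: agrees with Julia's built-in least squares
```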


### QR Factorization
We learned that the built-in way of solving a linear system $Ax=b$ in Julia is `x = A \ b`. However, solving the normal equation directly can be unstable, because forming $A^\top A$ squares the condition number: $\kappa(A^\top A) = \kappa(A)^2$ in the 2-norm, which amplifies rounding errors. Check this [demo](https://fncbook.github.io/fnc/leastsq/demos/normaleqns-instab.html) for details. 
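
The squaring of the condition number can be observed directly. This is a sketch with an assumed monomial design matrix on equispaced points; `cond` from `LinearAlgebra` returns the 2-norm condition number by default.

```julia
using LinearAlgebra

X = range(-1, 1, length = 50)
A = [x^i for x in X, i in 0:7]     # 50 × 8 monomial design matrix

κA  = cond(A)
κAA = cond(A' * A)
(κA, κA^2, κAA)                    # the last two values should nearly coincide
```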

Instead, we compute a QR factorization of $A$ and substitute it into the normal equation, using the fact that every real $M \times N$ matrix $A$ with $M \geq N$ can be written as $A = QR$, where $Q$ is an $M \times N$ matrix with orthonormal columns ($Q^\top Q = I$) and $R$ is an $N \times N$ upper triangular matrix. This is the reduced (thin) QR factorization; here $N$ stands for the number of columns of $A$.

$$
\begin{align}
A^\top A c^* &= A^\top Y  \\ 
R^\top Q^\top Q R c^* &= R^\top Q^\top Y \\
R^\top R c^* &=R^\top Q^\top Y \\
R c^* &= Q^\top Y
\end{align}
$$
where we used $Q^\top Q = I$ and, in the last step, that $R$ is non-singular, which holds when $A$ has full column rank.
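
A minimal sketch of the QR-based solve, assuming the reduced factorization described above. The data (`cos(3x)` on an equispaced grid) are hypothetical; `Matrix(F.Q)` extracts the reduced factor with orthonormal columns.

```julia
using LinearAlgebra

X = range(-1, 1, length = 30)
A = [x^i for x in X, i in 0:5]     # 30 × 6, full column rank
Y = cos.(3 .* X)                   # hypothetical data

F = qr(A)
Q = Matrix(F.Q)                    # reduced factor: 30 × 6, orthonormal columns
R = F.R                            # 6 × 6 upper triangular

c_qr = R \ (Q' * Y)                # solve the triangular system R c = QᵀY
```

The result should agree with `A \ Y` up to rounding, without ever forming $A^\top A$.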

### Cost of QR Factorization vs Solving the Normal Equation Directly
We might wonder whether the QR factorization is worth its cost to avoid the instability, i.e., how to balance accuracy against computational cost. Let's compare it with solving the normal equation using an LU factorization. Suppose $A$ is an $M \times N$ matrix with $M \geq N$.

$$\mathrm{COST}(qr(A)) = O(MN^2) $$
vs 
$$\mathrm{COST}(A^\top A) + \mathrm{COST}(lu(A^\top A)) =  O(MN^2 + N^3)$$

<!-- 2 Scalar Multiple of qr than the other one  -->

Read how to get the count of operations [here](http://www.math.iit.edu/~fass/477577_Chapter_5.pdf).

Since $M \geq N$, both costs are $O(MN^2)$. The QR factorization is more expensive only by a constant factor, so the extra stability is worth it!
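
For orientation, the standard textbook leading-order operation counts (assuming a Householder-based QR factorization and exploiting the symmetry of $A^\top A$ when forming it; these assumptions are not stated in the lecture) are roughly

$$
\mathrm{COST}(qr(A)) \sim 2MN^2 - \tfrac{2}{3}N^3,
\qquad
\mathrm{COST}(A^\top A) + \mathrm{COST}(lu(A^\top A)) \sim MN^2 + \tfrac{2}{3}N^3 .
$$

For $M \gg N$ the QR route therefore costs about twice as much, while for $M \approx N$ the two counts are comparable.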

## Example
### Revisit: The Witch of Agnesi 

[some historical context](https://mathworld.wolfram.com/WitchofAgnesi.html)

$$
f(x) = \frac{1}{1 + \alpha^2 x^2}
$$ 
has poles at $x = \pm i / \alpha$. The examples below use $\alpha = 5$.

In [None]:
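# Define bases

# Monomial basis: returns [1, x, x^2, ..., x^(N-1)] (N basis functions)
function poly(x::Number, N)
    P = zeros(N)
    for n = 0:(N-1)
        P[n+1] = x^n
    end
    return P
end

# Chebyshev basis T_0, ..., T_{N-1} via T_{n+1} = 2x T_n - T_{n-1} (requires N >= 2)
function cheb(x::Number, N)
    T = zeros(N)
    T[1] = 1
    T[2] = x
    for n = 2:(N-1)
        T[n+1] = 2*x*T[n] - T[n-1]
    end
    return T
end

# Design matrix: row i contains the N basis functions evaluated at X[i]
# (entries are stored as ComplexF64; the bases here are real, hence `real` in `predict`)
function design_matrix(X, basis, N)
    A = zeros(ComplexF64, length(X), N)
    for (i, x) in enumerate(X)
        A[i, :] = basis(x, N)
    end
    return A
end

# Evaluate the fitted model at a single point x
function predict(x::Real, basis, c, N)
    B = basis(x, N)
    return real(sum(c .* B))
end

# Evaluate the fitted model at every point of a collection
function predict(rr::AbstractVector, args...)
    return [predict(r, args...) for r in rr]
end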

In [None]:
Random.seed!(405607)
f = x -> 1/(1+25*x^2)

M = 30 # sample size
# generate train data
Xtrain = rand(M) * 2 .- 1 
# generate test data 
Xtest = range(-1, 1, length=300)

basis = poly
NN1 = [5, 8, 10, 20]

xp = range(-1, 1, length=300)
P1 = plot(xp, f.(xp); lw=3, label = "exact",
          size = (400, 400), xlabel = L"x", ylabel = L"f(x)", ylim=[-2, 1.2], legend =:outerbottomright)
Y = f.(Xtrain)
plot!(P1, Xtrain, Y, lw=0, c = 1, m = :o, ms=3, label = "train data")

for (iN, N) in enumerate(NN1)
   A = design_matrix(Xtrain, basis, N)
   Q,R = qr(A)
   c = R \ (Matrix(Q)'*Y)
   plot!(P1, Xtest, predict(Xtest, basis, c, N), c = iN+1, lw=2, label = L"p_{%$(N)}")
end 
display(P1)

In contrast to the interpolation problem with equispaced or Chebyshev nodes, in practice we usually cannot choose how the training data are distributed. Observe that for $p_{20}$ (in the code, $N = 20$ is the number of basis functions, so the polynomial degree is $19$), $30$ training points are no longer sufficient: the fit oscillates near the endpoints of the interval. 

Let's examine whether the error decreases if we keep increasing the sample size of the training data while fixing the size of the basis at $N = 20$. The error here is the root-mean-square error (RMSE) on the test set:
$$
\mathrm{RMSE} = \sqrt{\frac{1}{|X_{\rm test}|} \sum_{x \in X_{\rm test}} \big(f(x) - \tilde{f}(x)\big)^2}
$$
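
As a sketch, the RMSE can be computed with a one-line helper. The name `rmse` and the sample values are illustrative; the notebook cells compute the same quantity inline as `norm(...) / sqrt(length(...))`.

```julia
using LinearAlgebra

# Root-mean-square error between exact values y and predictions yhat.
rmse(y, yhat) = sqrt(sum(abs2, y .- yhat) / length(y))

y    = [1.0, 2.0, 3.0, 4.0]        # illustrative exact values
yhat = [1.0, 2.0, 3.0, 2.0]        # one prediction is off by 2

rmse(y, yhat)                                  # sqrt(4 / 4) = 1.0
rmse(y, yhat) ≈ norm(y - yhat) / sqrt(length(y))   # same quantity via the 2-norm
```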

In [None]:
Random.seed!(405607)
N = 20
MM = [ 40, 80, 160, 320, 640, 1000, 2000, 4000]

bases = [ poly ]#, cheb ]

err = zeros(length(bases), length(MM))   
X = rand(20_000) * 2 .- 1        
Xtest = sort(X[end:-1:end-10_000])
Ytest = f.(Xtest)

p1 = plot(xp, f.(xp); lw=1, label = "exact", c=0,
          size = (400, 400), xlabel = L"x", ylabel = L"f(x)", ylim=[-2, 1.2], legend =:outerbottomright)

for (iM, M) in enumerate(MM), (iB, basis) in enumerate(bases) 
   Xtrain = X[1:M]
   Y = f.(Xtrain) 
   A = design_matrix(Xtrain, basis, N)
   Q,R = qr(A)
   c = R \ (Matrix(Q)'*Y)
   Ypred = predict(Xtest, basis, c, N)
   plot!(p1, Xtest, Ypred, c = iM+1, lw=2, label = L"M={%$(M)}")
   err[iB, iM] = norm(Ytest - Ypred, 2) / sqrt(length(Xtest))
end

p2 = plot(; legend=:topright, xlabel="M", ylabel="rmse", yscale=:log10, 
               xscale=:log10)
plot!(p2, MM, err[1,:], m=:o, label="Mono Polynomial", lw=3)
# plot!(p2, MM, err[2,:], m=:o, label="Chebyshev Polynomial", lw=2)
plot(p1, p2)


With a basis of size $20$, the root-mean-square error levels off at around $10^{-2}$ as the sample size $M$ grows. 

What if we increase the degree of the basis while supplying enough training data?

In [None]:
NN = [5, 7, 10, 13, 17, 21, 25, 30, 35, 40]
MM = NN.^2 

bases = [ poly ]#, cheb ]
err = zeros(length(bases), length(NN))

Xtrain = X[1:10_000]
Xtest = X[end:-1:end-10_000]
Ftest = f.(Xtest)

for (iN, (N, M)) in enumerate(zip(NN, MM))
   XM = X[1:M]
   Y = f.(XM) 
   for (iB, basis) in enumerate(bases)
      A = design_matrix(XM, basis, N)
      Q,R = qr(A)
      c = R \ (Matrix(Q)'*Y)
      err[iB, iN] = norm(Ftest - predict(Xtest, basis, c, N), 2) / sqrt(length(Xtest))
   end
end

plt = plot(; legend=:topright, xlabel="N - degree of basis", ylabel="RMSE", yscale=:log10, title="Increase the degree of basis with sufficient train data")
plot!(plt, NN, err[1,:], m=:o, label="Mono Polynomial", lw=2)
# plot!(plt, NN, err[2,:], m=:o, label="Chebyshev Polynomial", lw=1)
# Plot error decay
plot!(plt, NN, exp.( - asinh(1/25^(1/2)) * NN), ls=:dash, label="expected rate")
display(plt)
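
The dashed "expected rate" curve in the cell above can be read as follows. This sketch assumes the standard result that polynomial approximation of a function analytic inside a Bernstein ellipse $E_\rho$ (foci $\pm 1$) converges like $\rho^{-N}$. For $f(x) = 1/(1 + \alpha^2 x^2)$ the poles $\pm i/\alpha$ lie on the ellipse with

$$
\rho = \frac{1}{\alpha} + \sqrt{1 + \frac{1}{\alpha^2}},
\qquad
\log \rho = \operatorname{asinh}\!\left(\frac{1}{\alpha}\right),
\qquad
\rho^{-N} = \exp\!\left(-\operatorname{asinh}\!\left(\tfrac{1}{\alpha}\right) N\right).
$$

With $\alpha = 5$ this gives $\rho \approx 1.22$ and $\operatorname{asinh}(1/5) \approx 0.199$, which is the exponent used in the plotting command.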



## Regularization
To control overfitting, we can add a regularization term to the loss function:
$$ \argmin_{c \in \R^{N+1}} \| Ac - Y \|^2 + \lambda \| c \|^2$$
where $\lambda \geq 0$ is a regularization coefficient that controls the relative strength of the regularization. The regularization term here is the sum of squares of the entries of the parameter vector.

There are various kinds of regularization terms. A more general regularizer is
$$\lambda \sum_{i=0}^N | c_i |^q$$
where $q=2$ gives the quadratic regularizer from above (ridge regression). The choice $q=1$ is known as lasso regression: for sufficiently large $\lambda$, some of the $c_i$ become exactly $0$, which makes the model sparse because the corresponding basis functions $P_i$ play no role.

In practice, the quadratic regularizer is often preferred because the minimizer of the loss function has a simple closed form. Similar to A1 Q3b), the normal equation becomes

$$\begin{align}
(A^\top A + \lambda I) c^* &= A^\top Y  \\ 
c^* &= (A^\top A + \lambda I)^{-1} A^\top Y
\end{align}
$$
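
The regularized problem can also be solved by QR without forming $A^\top A$: stacking $\sqrt{\lambda}\, I$ below $A$ and zeros below $Y$ gives an ordinary least squares problem whose normal equation is exactly the one above. The following sketch checks this equivalence on assumed sample data; the variable names `A_aug` and `Y_aug` are illustrative.

```julia
using LinearAlgebra

X = range(-1, 1, length = 30)
N = 10                                  # number of basis functions in this sketch
A = [x^i for x in X, i in 0:N-1]        # 30 × 10 monomial design matrix
Y = 1 ./ (1 .+ 25 .* X.^2)
λ = 0.1

# Regularized normal equation: (AᵀA + λI) c = AᵀY
c_normal = (A' * A + λ * I) \ (A' * Y)

# Equivalent stacked least-squares problem, solved by QR
A_aug = [A; sqrt(λ) * Matrix(I, N, N)]
Y_aug = [Y; zeros(N)]
F = qr(A_aug)
c_qr = F.R \ (Matrix(F.Q)' * Y_aug)

norm(c_normal - c_qr)                   # small: both routes give the same minimizer
```

Note the factor `sqrt(λ)`: stacking `λ * I` instead would correspond to the penalty $\lambda^2 \|c\|^2$.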


In [None]:
Random.seed!(405607)
f = x -> 1/(1+25*x^2)

M = 30 # sample size
λ = 0.1 # regularizer coefficient
# generate train data
Xtrain = rand(M) * 2 .- 1 
# generate test data 
Xtest = range(-1, 1, length=300)

basis = poly
NN1 = [5, 8, 10, 20]

xp = range(-1, 1, length=300)
P1 = plot(xp, f.(xp); lw=3, label = "exact",
          size = (400, 400), xlabel = L"x", ylabel = L"f(x)", ylim=[-2, 1.2], legend =:outerbottomright)
plot!(P1, Xtrain, f.(Xtrain), lw=0, c = 1, m = :o, ms=3, label = "train data")

for (iN, N) in enumerate(NN1)
   Y = f.(Xtrain)
   A = design_matrix(Xtrain, basis, N)
   # stack sqrt(λ)*I below A so that the stacked residual equals ‖Ac - Y‖² + λ‖c‖²
   Reg = sqrt(λ)*I(N)
   A = [A; Reg]
   Y = [Y; zeros(N)]
   Q,R = qr(A)
   c = R \ (Matrix(Q)'*Y)
   plot!(P1, Xtest, predict(Xtest, basis, c, N), c = iN+1, lw=2, label = L"p_{%$(N)}")
end 

display(P1)

Although regularization damps the oscillations, it also pollutes the solution when we pick $\lambda=0.1$. Choose the regularizer, its strength, and the basis based on the context of the problem that you want to solve. 

## Summary: Polynomial Regression
* approximation of general functions by a linear model in a polynomial basis, which leads to a least squares problem
* the sample size of the training data affects the error, and a poor match between sample size and basis size can lead to problems like underfitting or overfitting
* the error convergence rate is related to the regularity of the target function

### Further reading
Book: [Pattern Recognition and Machine Learning by Christopher Bishop](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)
* (Ch3.2) Different regularizers
* (Ch3.3) Bayesian Ridge Regression
* (Ch4.3.2) Non-linear Regression: Logistic, etc.