# Numerical Integration II



[//]: # "(Robert Casella 2004 Springer) Monte Carlo Statistical Methods"

[//]: # "_**Importance sampling; MCMC (Gibbs, Metropolis-H); GHK**_"

[//]: # "Here we introduce the method of Monte Carlo Integrations."




## Monte Carlo Integration


> <div class="alert alert-block alert-info">
    In this Jupyter notebook document I labeled and cross-referenced equations using something like <b>\label{eq:myeq1}</b> and <b>\refeq{eq:myeq1}</b>. It may not work correctly if you use VSCode to read the document. Try opening and reading this .ipynb document in web browsers (as it was intended).
</div>


We often start a gentle introduction of a new method using a simple example such as $I = \int_a^b g(x) dx$, where $I$ is the area under $g(x)$ in $[a,b]$. Here, when we are introducing the (Quasi) Monte Carlo method, we will start from a slightly more general (and thus slightly more complicated) problem. The general setup allows us to see the whole picture in a more straightforward way. As you will see, the simple problem above is a special case of the general problem.


### a general setup


Let's consider an integration problem in the following form:

\begin{equation}
I = \int_\Omega f(x) p(x) dx,\label{eq:main1}
\end{equation}

where 

- $p(x)$ is a nonnegative function which is often referred to as the **weight function**,

- $\Omega$ is the domain of the integral; for example,
  - $\Omega = [a,b]$ where both $a$ and $b$ are finite (proper integral);
  - $\Omega = [-\infty, b], [a, \infty], [-\infty, \infty]$, etc. (improper integral).



> **You:** What? A *weight function*? How could this be a *general* setup? It's not general at all. My integration problem does not have a weight function.
>    
> **Me:** Well, p(x)=1 is also a "nonnegative function" and thus a "weight function". We will see more later.


###### weight functions taking the form of exponential functions

The weight function can take various forms which correspond to different methods of numerical integration. For instance, if $p(x)$ is an exponential function in specific forms, we could apply the quadrature rules (we've seen this before). Two examples:

- If $p(x) = e^{-x^2}$ and $\{a, b\} = \{-\infty, \infty\}$, then we could apply the Gauss-Hermite quadrature rule. 

- If $p(x)= e^0 = 1$ and $\{a, b\} = \{-1, 1\}$, $I$ is the area under the $f(x)$ function between $-1$ and $1$ and can be approximated using the Gauss-Legendre rule.
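
As a reminder of how the weight function enters a quadrature rule, here is a minimal sketch using hard-coded 3-point Gauss-Legendre and Gauss-Hermite rules. The names `quadrature`, `GL_NODES`, `GH_NODES`, etc. and the test integrands are chosen only for illustration. Note that only $f(x)$ is evaluated; the weight $p(x)$ is built into the nodes and weights.

```python
import math

# Hard-coded 3-point rules (exact for polynomials up to degree 5).
# Gauss-Legendre: weight p(x) = 1 on [-1, 1].
GL_NODES = (-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5))
GL_WEIGHTS = (5 / 9, 8 / 9, 5 / 9)

# Gauss-Hermite: weight p(x) = exp(-x^2) on (-inf, inf).
GH_NODES = (-math.sqrt(3 / 2), 0.0, math.sqrt(3 / 2))
GH_WEIGHTS = (
    math.sqrt(math.pi) / 6,
    2 * math.sqrt(math.pi) / 3,
    math.sqrt(math.pi) / 6,
)


def quadrature(f, nodes, weights):
    """Approximate the integral of f(x) p(x) by sum_k w_k f(x_k)."""
    return sum(w * f(x) for x, w in zip(nodes, weights))


# int_{-1}^{1} x^4 dx = 2/5
gl_value = quadrature(lambda x: x**4, GL_NODES, GL_WEIGHTS)
# int_{-inf}^{inf} x^2 exp(-x^2) dx = sqrt(pi)/2
gh_value = quadrature(lambda x: x**2, GH_NODES, GH_WEIGHTS)
print(gl_value, gh_value)
```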


###### weight functions taking the form of probability density functions

Here, we consider another possibility where $p(x)$ is a **probability density function** of a continuous random variable $X$ with the domain equal to $\Omega$; that is,

\begin{aligned}
 \int_\Omega p(x) dx = 1.
\end{aligned}


In this case, the integration problem in \eqref{eq:main1} can be interpreted as the expected value of $f(x)$ where the $x$ is drawn from $\Omega$ with the probability dictated by $p(x)$. That is,

\begin{aligned}
I = \int_\Omega f(x) p(x) dx  =\ E_p[f(x)].
\end{aligned}

Note that we use the subscript $p$ in the notation $E_p[\cdot]$ to indicate that the expectation of $f(x)$ is calculated from $x$ which is drawn based on the density function of $p(x)$. 

> So, the notation $E_p[f(x)]$ conveys two important pieces of information: the objective function (i.e., $f(x)$) whose expected value we wish to calculate, and the way $x$ is generated (i.e., $p(x)$) in calculating the expected value.

> The _expectation_ interpretation is easy to see in the case of discrete random variables, such as $E_\pi[f(x)] = f(x_1)\pi_1 +  f(x_2)\pi_2 + \ldots + f(x_n)\pi_n$, $\sum_{j=1}^n \pi_j = 1$. If the variable is continuous, the analogous notation is 
> $E_p[f(x)] = \int_\Omega f(x) p(x) dx$, where $\int_\Omega p(x)dx = 1$.


###### approximation using the strong law of large numbers

Why do we bother to express an integration problem as an expected value? Because the expected value can be approximated using the sample average (empirical average) by the strong law of large numbers:

\begin{align}
I = E_p[f(x)] \approx \frac{1}{n} \sum_{j=1}^n f(x_j), \label{eq:lln}
\end{align}

where the sample $(x_1, x_2, \ldots, x_n)$ is generated according to the density function $p(x)$. In other words, the value of the integral is calculated by drawing a sample of $x_j$s according to $p(x)$, substituting them into $f(x)$, and computing the sample average of $f(x)$.

Note that the sentence "_where the sample ..._" is an important part of the problem statement. You cannot skip it in most cases.
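
The sample-average approximation can be sketched in a few lines. This is only an illustration under assumed choices: $p(x)$ is taken to be the standard normal density, $f(x) = x^2$ (so that $I = E_p[x^2] = 1$), and the helper name `mc_expectation`, the seed, and the sample sizes are arbitrary.

```python
import random


def mc_expectation(f, draw, n):
    """Sample average (1/n) * sum_j f(x_j), with each x_j produced by draw()."""
    return sum(f(draw()) for _ in range(n)) / n


rng = random.Random(12345)              # fixed seed so the run is reproducible
f = lambda x: x * x                     # objective function f(x)
draw_p = lambda: rng.gauss(0.0, 1.0)    # x_j drawn according to p(x) = N(0, 1)

# The estimate should settle near E_p[x^2] = 1 as n grows.
for n in (100, 10_000, 200_000):
    print(n, mc_expectation(f, draw_p, n))

estimate = mc_expectation(f, draw_p, 200_000)
```

If `draw_p` were replaced by draws from a different distribution, the same sample average would estimate a different integral, which is why the "where the sample ..." clause matters.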





###### Remarks:

- How do we draw the sample of $(x_1, x_2, \ldots, x_n)$ from $p(x)$ in practice?


  - We could draw the sample randomly from $p(x)$ using the various sampling methods we introduced earlier.

    - If $p(x)$ is well defined and the inverse of its cumulative distribution function is available, we could use the inverse sampling method.
    
    - If that inverse function is unavailable, we could use the accept-reject method (which produces independent samples, but the sampling process could be inefficient and impractical) or various MCMC methods, including Gibbs sampling and Metropolis-Hastings sampling (which are efficient, but the samples are not independent; asymptotically these samples have the desired distribution). 
 
  - We could take a more strategic sample of $(x_1, x_2, \ldots, x_n)$ using the low discrepancy sequence, which we will introduce later.
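
As a small illustration of the inverse sampling route, the sketch below assumes $p(x) = \lambda e^{-\lambda x}$ on $[0, \infty)$, whose cumulative distribution function $F(x) = 1 - e^{-\lambda x}$ can be inverted by hand. The function name `draw_exponential`, the value of $\lambda$, and the choice $f(x) = x$ are assumptions made for the example.

```python
import math
import random


def draw_exponential(rng, lam):
    """Inverse sampling: solve u = F(x) = 1 - exp(-lam * x) for x."""
    u = rng.random()                   # u ~ U(0,1), a value in [0, 1)
    return -math.log(1.0 - u) / lam    # 1 - u lies in (0, 1], so log is defined


rng = random.Random(2024)
lam = 2.0
n = 100_000
sample = [draw_exponential(rng, lam) for _ in range(n)]

# I = int_0^inf x * lam * exp(-lam * x) dx = E_p[x] = 1 / lam
estimate = sum(sample) / n
print(estimate)
```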

#### a special case of $x \sim U(0,1)$


In the case of $I = \int_0^1 f(x) p(x) dx$ where $x$ follows a uniform distribution in $[0,1]$, i.e., $x \sim U(0,1)$, we have $p(x) = \frac{1}{1-0} = 1$ which is trivial. Thus, we often skip the subscript in the expectation notation for such a case and so the problem becomes 

\begin{align}
I = \int_0^1 f(x) p(x) dx = \int_0^1 f(x) dx & =  E[f(x)] \notag \\
      & \approx \frac{1}{n}\sum_{j=1}^n f(x_j), \label{eq:lln_uni}
\end{align}

where the sample $(x_1, x_2, \ldots, x_n)$ is drawn from $U(0,1)$. Therefore, $E[f(x)]$, which is without the subscript, is generally understood as the expectation of $f(x)$ where $x$ is drawn from $U(0,1)$.

Note that \eqref{eq:lln} and \eqref{eq:lln_uni} may look the same. However, the sample of $x_j$s in the former is drawn according to the density $p(x)$, while in the latter it is drawn from $U(0,1)$.



### a typical integration problem (and a special case of the general setup)

Now we are ready to talk about a more typical integration problem, which may look simpler and could be treated as a special case of the general problem:

\begin{aligned}
 I = \int_\Omega f(x) dx.
\end{aligned}


In this type of problem, there is no $p(x)$ (the weight function) to begin with (or put differently, $p(x)=1$) and the integration's domain is not necessarily $[0,1]$. It is actually the more common problem you may encounter. For example, $I= \int_a^b f(x) dx$, which is the area under the curve of $f(x)$ in $[a,b]$. 
> <div class="alert alert-block alert-success">
<b>Question:</b> If there is no probability density function in the integrand, could we still express the integration as an expected value of f(x)?<br>
<b>Answer:</b> Yes! We simply assign one to it (and make necessary adjustments)!
</div> 

#### finite domain (proper integral)

If the domain of the integration is finite, e.g., $\Omega = [a,b]$ where $a$ and $b$ are finite, we could simply assign a uniform distribution on $[a,b]$ to it: $p(x) = 1/(b-a)$. The uniform distribution is handy because the pdf does not depend on $x$. For instance,

\begin{aligned}
I  = \int_a^b f(x) dx & = \int_a^b  \frac{f(x)}{p(x)} p(x) dx \\
  & = (b-a) \int_a^b f(x) p(x) dx = (b-a) E_U[f(x)],
\end{aligned}

where $x \sim U(a,b)$. In this example, we are able to express $I$ as an expected value of $f(x)$ over $x \sim U(a,b)$ multiplied by the "_**volume**_" $(b-a)$. It is an effective approach, because we can then take advantage of the law of large numbers and use the approximation:

\begin{aligned}
I = \int_a^b f(x) dx = (b-a) E_U[f(x)] \approx (b-a)\left[ \frac{1}{n}\sum_{j=1}^n f(x_j) \right],
\end{aligned}

where the sample $(x_1, x_2, \ldots, x_n)$ is drawn from $U(a,b)$.
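
A minimal sketch of this "volume times sample average" formula, assuming the test integrand $f(x) = \sin x$ on $[0, \pi]$ (whose exact integral is $2$). The function name `mc_integrate_uniform`, the seed, and the sample size are illustrative choices.

```python
import math
import random


def mc_integrate_uniform(f, a, b, n, rng):
    """Approximate int_a^b f(x) dx by (b - a) * mean of f(x_j), x_j ~ U(a, b)."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


rng = random.Random(7)
estimate = mc_integrate_uniform(math.sin, 0.0, math.pi, 100_000, rng)
print(estimate)   # compare with the exact value 2
```

Dropping the factor `(b - a)` would return the average height of $f$ rather than the area under it.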



#### not finite domain (improper integral)

What if the domain is not finite, such as $[a, \infty]$, $[-\infty,b]$, $[-\infty, \infty]$, etc.? The above method does not seem to work because there is no uniform distribution with an unbounded domain. It turns out that we can circumvent the problem by transforming the infinite domain of $x$ to a finite domain of $t$ using the change of variables. 

Just to give you a hint: Suppose the domain is $x \in [a, \infty]$ where $a$ is finite. Let's consider the transformation rule $x = a + t/(1-t)$. If $x=a$, it's easy to see that $t=0$ is the solution. As $x\rightarrow \infty$, we see that $t \rightarrow 1$. Therefore, using the rule, we transform $x \in [a, \infty]$ to $t \in [0,1]$, which is finite. Provided that the entire integration problem is properly transformed from $x$ to $t$ (including the Jacobian $dx/dt$), we could apply the above method again ("assign a probability density function to the problem and express the answer as an expected value problem").

> <div class="alert alert-block alert-success">
    Do you see the magic here? Regardless of whether we have a probability density function in the integrand and regardless of whether the domain is finite, we can always express the result as an expected value problem. It then allows us to use the sample average to approximate the solution.
</div>
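
To make the hint concrete, here is a sketch that applies the rule $x = a + t/(1-t)$, with Jacobian $dx/dt = 1/(1-t)^2$, to the assumed test problem $\int_0^\infty e^{-x} dx = 1$. The helper name `g`, the seed, and the sample size are arbitrary.

```python
import math
import random


def g(t, f, a):
    """Transformed integrand f(x(t)) * dx/dt with x = a + t / (1 - t)."""
    x = a + t / (1.0 - t)
    jacobian = 1.0 / (1.0 - t) ** 2
    return f(x) * jacobian


rng = random.Random(99)
n = 100_000
f = lambda x: math.exp(-x)      # int_0^inf exp(-x) dx = 1

# rng.random() returns values in [0, 1), so t = 1 (division by zero) never occurs.
estimate = sum(g(rng.random(), f, 0.0) for _ in range(n)) / n
print(estimate)
```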






[//]: # ">> For the sake of completeness:
>>  - We recognize $p(x)$ as a density function:
 \begin{align}
I  = \int_a^b f(x) p(x) dx & = E_{p}[f(x)] \notag \\
  & \approx \frac{1}{n}\sum_{i=1}^n f(x_i),\label{eq:cf1}
\end{align}
 where $x_i$s are drawn according to $p(x)$. 
>> 
>> 
>>  - $p(x)$ is not a weight function or we fail to recognize it as a weight function. Then we assign a uniform distribution to it. 
>> \begin{align}
 I  = \int_a^b f(x) p(x) dx  & = (b-a) \int_a^b \frac{1}{b-a} f(x) p(x) dx \notag \\  
      & =  (b-a) E_U[f(x) p(x)] \notag \\   
      & \approx (b-a)\left[ \frac{1}{n}\sum_{i=1}^n f(x_i) p(x_i) \right], \label{eq:cf2}
 \end{align} 
>>  where $x_i$s are drawn from a uniform distribution $U$ in $[a,b]$.
>>
>> You see, you can use either \eqref{eq:cf1} or \eqref{eq:cf2} to compute the same integration. They differ in how $x_i$s are generated and used: In \eqref{eq:cf1}, they are generated according to $p(x)$. In \eqref{eq:cf2}, they are generated from a uniform distribution and then _adjusted_ in the summation through $p(x)$."

### domain transformation



We've seen that the key to carrying out Monte Carlo integration is to use a uniform probability density function to help express the solution as an expected value problem. In order to use the uniform pdf, the domain of the problem has to be finite. As it turns out, it would be even better if the transformed domain is not only finite but exactly equal to $[0,1]$. Transforming all kinds of domains to $[0,1]$ standardizes the procedure and makes a given set of tools available to all problems. To wit,

\begin{align}
I  & = \int_\Omega f(x) dx \label{eq:orig}\\
   & = \int_0^1 g(t) dt = (1-0) \int_0^1 g(t) \frac{1}{1-0} dt  = E[g(t)] \label{eq:tran}\\ 
   & \approx \frac{1}{n}\sum_{j=1}^n g(t_j),\label{eq:summ}
\end{align}

where the $t_j$s are drawn from $U(0,1)$. If going from \eqref{eq:orig} to \eqref{eq:tran} is possible regardless of $\Omega$, we can apply all the tools that help to draw $t_j$ and calculate the sample average in \eqref{eq:summ}.

#### Remarks

- What kind of sample of $(x_1, x_2, \ldots, x_n)$ from $p(x)$ should be used?


  - We could draw a _**(pseudo) random**_ sample of the $x_i$s, and the estimation is called the _**Monte Carlo integration (MCI)**_.
    - The random sample can be drawn using the inverse transformation method, rejection sampling, etc.
    - Recall that the random numbers we generated using the computer's RNG are not truly random; they are pseudo-random.


  - We could draw a _**quasi-random**_ sample of the $x_i$s, and the estimation is called the _**Quasi Monte Carlo integration (QMCI)**_.
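    - A class of quasi-random numbers is the low-discrepancy sequence (LDS). There are many types of LDS, among which the Halton sequence is a well-known one.
    - QMCI has a better convergence rate than MCI: the MCI error decreases at the rate $O(n^{-1/2})$, whereas for sufficiently smooth integrands the QMCI error can decrease at a rate close to $O(n^{-1})$, up to logarithmic factors.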
    

  - In fact, we could also generate $(x_1, x_2, \ldots, x_n)$ as an equally spaced grid ($x_i = i/n$), which is called the _**rectangle rule**_.
    - The rectangle rule works well if the integration is one-dimensional, but it does not work well for multi-dimensional problems because of the correlations between the sequences (reusing the same equally spaced sequence in every dimension places all the points on the diagonal). 
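
The three kinds of samples can be compared on a one-dimensional toy problem. The sketch below assumes the integrand $g(t) = t^2$ on $[0,1]$ (exact value $1/3$) and uses the base-2 van der Corput sequence, which is the one-dimensional Halton sequence. The function names, the seed, and $n = 1024$ are illustrative choices, and a single run like this is not evidence about convergence rates in general.

```python
import random


def van_der_corput(i, base=2):
    """i-th point (i >= 1) of the base-b van der Corput sequence."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)   # peel off the lowest digit of i
        denom *= base
        x += digit / denom           # mirror the digits about the radix point
    return x


def sample_average(f, points):
    return sum(f(t) for t in points) / len(points)


f = lambda t: t * t                  # int_0^1 t^2 dt = 1/3
n = 1024
rng = random.Random(0)

pseudo = [rng.random() for _ in range(n)]               # MCI
quasi = [van_der_corput(i) for i in range(1, n + 1)]    # QMCI
grid = [i / n for i in range(1, n + 1)]                 # rectangle rule

for name, pts in (("MCI", pseudo), ("QMCI", quasi), ("rectangle", grid)):
    est = sample_average(f, pts)
    print(f"{name:10s} {est:.6f}  error {abs(est - 1 / 3):.2e}")
```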



#### Domain Transformation


How do we do the transformation? First, we need a transformation rule that maps $x \in [a,b]$ to $t \in [0,1]$, and then we apply the rule and use the changes of variables. Regarding the rule, we need to find one such that if $x=\rho(t)$:

- when $x=a$ the corresponding value is $t=0$, i.e., $a = \rho(0)$; or, $\rho^{-1}(a)=0$;

- when $x=b$ the corresponding value is $t=1$, i.e., $b = \rho(1)$; or, $\rho^{-1}(b)=1$.


If we find such a rule, $x=\rho(t)$ and thus $t = \rho^{-1}(x)$, we apply the change of variables $dx = \rho'(t)\,dt$, so that $\int_a^b f(x)\,dx = \int_0^1 g(t)\,dt$ with

\begin{aligned}
 f(\rho(t)) \rho'(t) = g(t),
\end{aligned}

where $\rho'(t)$ is the Jacobian. 

- As we will show below, if $a$ and $b$ are both finite, a suitable transformation function is $x = a+(b-a)t$, for which the Jacobian is $(b-a)$.

- Note that if $a$ and $b$ are finite, $I = (b-a) E_{U(a,b)}[f(x)] = E[g(t)]$. The *volume* $(b-a)$ no longer appears explicitly when we use $g(t)$ because it is absorbed into $g(t)$ as the Jacobian, and the integral is literally the expected value of the *transformed* function $g(t)$, which the sample average approximates. 

- We sometimes see statements that equate Monte Carlo integration to the sample average of the function. Such a statement is so terse that it can easily be misunderstood. Literally speaking, it is correct with respect to $g(t)$ but is problematic w.r.t. $f(x)$ because the latter needs the "volume".

The following table provides useful rules of transformation. Note that the rules are not unique; there exist different rules to do the transformation.

$$\mathbf{x\,\ domain}$$ | $$\mathbf{transformation}$$ | $$\mathbf{t\,\ domain}$$ | $$\mathbf{Jacobian}$$ 
 ---     |  ---    | ---      | --- 
$$[a, b]$$            | $$x = a + (b-a)t$$ | $$[0,1]$$ | $$b - a$$
$$[-\infty, \infty]$$ | $$x = \frac{2t-1}{t-t^2}$$        | $$[0,1]$$ | $$\frac{2t^2 - 2t+1}{(t^2 -t)^2}$$
$$[a, \infty]$$       | $$x = a + \frac{t}{1-t}$$    | $$[0,1]$$ | $$\frac{1}{(t-1)^2}$$
$$[-\infty, b]$$      | $$x = b + \frac{t-1}{t}$$    | $$[0,1]$$ | $$\frac{1}{t^2}$$


Here, we give a more complete argument why we prefer transforming the domain to $[0,1]$:

- We know rules of transforming $[a,b]$ to $[0,1]$ for $a$ and $b$ ranging from $-\infty$ to $\infty$, so this part is not difficult.

- After the conversion, the random numbers are all drawn from $[0,1]$ instead of $[a,b]$. Thus, the sample from $[0,1]$ could be repeatedly used for different problems.

- Any multidimensional function with bounds on each variable can be transformed into the unit $d$-dimensional hypercube, $[0, 1]^d$.

- Because of the above, it's easy to write a computer program to automate the process, from domain and function transformation to random number sampling and to computing the final result.
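
A rough sketch of such an automated routine for the one-dimensional case is given below. It hard-codes the four rules from the table; the function names `to_unit_interval` and `mc_integrate`, the seed, the sample size, and the test integrands are assumptions made for illustration only.

```python
import math
import random


def to_unit_interval(f, a=-math.inf, b=math.inf):
    """Return g(t) = f(rho(t)) * rho'(t) for t in (0, 1), using the table's rules."""
    if math.isfinite(a) and math.isfinite(b):      # [a, b]
        return lambda t: f(a + (b - a) * t) * (b - a)
    if math.isfinite(a):                           # [a, inf)
        return lambda t: f(a + t / (1 - t)) / (1 - t) ** 2
    if math.isfinite(b):                           # (-inf, b]
        return lambda t: f(b + (t - 1) / t) / t ** 2
    # (-inf, inf)
    return lambda t: (f((2 * t - 1) / (t - t * t))
                      * (2 * t * t - 2 * t + 1) / (t - t * t) ** 2)


def mc_integrate(f, a, b, n, rng):
    """Monte Carlo estimate of int_a^b f(x) dx as the sample average of g(t)."""
    g = to_unit_interval(f, a, b)
    total = 0.0
    for _ in range(n):
        t = rng.random()
        while t == 0.0:        # some rules are singular at t = 0, so redraw
            t = rng.random()
        total += g(t)
    return total / n


rng = random.Random(42)
n = 100_000
est_sin = mc_integrate(math.sin, 0.0, math.pi, n, rng)                   # exact: 2
est_exp = mc_integrate(lambda x: math.exp(-x), 0.0, math.inf, n, rng)    # exact: 1
est_neg = mc_integrate(lambda x: math.exp(x), -math.inf, 0.0, n, rng)    # exact: 1
est_gauss = mc_integrate(lambda x: math.exp(-x * x),
                         -math.inf, math.inf, n, rng)                    # exact: sqrt(pi)
print(est_sin, est_exp, est_neg, est_gauss)
```

Because every problem ends up as a sample average of $g(t)$ with $t \in [0,1]$, the same loop (and, if desired, the same stored sample of $t_j$s) serves all four domain types.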

### review summary

> **We start with a general setup in which there is a probability density function $p(x)$ in the integrand and the density's support is equal to $\Omega$. We introduce the notation of $E_p[f(x)]$ and $E[f(x)]$.**
>> 
>> - Most general:
\begin{align}
I  = \int_\Omega f(x) p(x) dx = E_p[f(x)],\quad  \mbox{$x$ follows $p(x)$},
\end{align}
>> 
>>
>> - If $p(x)$ is the pdf of $U(0,1)$ and $\Omega = [0,1]$:
\begin{align}
I  = \int_0^1 f(x) p(x) dx = \int_0^1 f(x) dx =E[f(x)],\quad  \mbox{$x$ follows $U(0,1)$},\label{eq:general2}
\end{align}
> 
> 
> **We then talk about the common problem where there is no weight function $p(x)$ in the integrand (i.e., $p(x)=1$ regardless of $\Omega$). The trick is to transform the problem from the domain $\Omega$ to the domain $\tilde{\Omega} = [0,1]$. Then, something along the line of \eqref{eq:general2} can be done. The difficult part is in carrying out the transformation.**
> 
> 
>> A special case where $p(x)=1$ (i.e., no weight function) regardless of $\Omega$:
>> 
>> - If $\Omega$ is finite:
\begin{aligned}
I  = \int_a^b f(x) dx = (b-a)E_U[f(x)],\quad  \mbox{$x$ follows $U(a,b)$}.
\end{aligned}
>> 
>>
>> - If $\Omega$ is not only finite but also $[0,1]$:
\begin{aligned}
I  = \int_0^1 f(x) dx = E[f(x)],\quad  \mbox{$x$ follows $U(0,1)$}.
\end{aligned}
>> 
>>
>> - If $\Omega$ is not finite:
>>   - First, we map the problem's domain from $x$ (infinite domain) to $t$ which has a finite domain $\tilde{\Omega} = [a,b]$, preferably $\tilde{\Omega} = [0,1]$, 
>>   - Then,
\begin{aligned}
I  = \int_0^1 g(t) dt = E[g(t)],\quad  \mbox{$t$ follows $U(0,1)$}.
\end{aligned}

## Integration using Importance Sampling

### Motivation


We show in previous sections that we can transform an integration problem into a problem of computing the expected value of the function. It takes three basic forms:
    
 \begin{align}
   I & = \int_\Omega f(x) p(x) dx = E_p[f(x)],\quad  \mbox{$x$ follows $p(x)$}, \label{eq:case1}\\
   I & = \int_a^b f(x)  dx = (b-a) E_U[f(x)], \quad  \mbox{$x$ follows $U(a,b)$}, \label{eq:case2}\\
   I & = \int_\Omega h(x) dx = \int_0^1 g(t) dt = E[g(t)],\quad  \mbox{$t$ follows $U(0,1)$}. \label{eq:case3}
   \end{align}


Though the approach is very useful, it may not be the most efficient method for some problems. Let's consider the following scenarios.


- With regard to \eqref{eq:case2}: Uniform sampling in $[a,b]$ may not be efficient if most of the contribution of $f(x)$ to the integral comes from a particular region of the support. Uniform sampling from $U(a,b)$ treats all values of $x$ in $[a,b]$ as equally likely, whereas it would be more efficient to sample more heavily in that region and more lightly in other regions. 


- With regard to \eqref{eq:case3}: The transformation rule used to map $x\in \Omega$ into $t \in [0,1]$ may be quite nonlinear, which would impact the efficiency of the sampling. Thus, if we have a problem in the form of \eqref{eq:case1}, sometimes we don't want to go the route of \eqref{eq:case3} (by making $h(x)\equiv f(x)p(x)$).


- However, even with \eqref{eq:case1}, sometimes $p(x)$ is too difficult to sample from.


The above problems may be circumvented using the importance sampling method.


### How does it work

#### example 1

Consider the issue we mentioned with regard to \eqref{eq:case2}. Suppose we know that $f(x)$ is largest around $x=3$. We could then use a _**normalized**_ normal distribution (or a truncated normal distribution) with the mean equal to $3$ as the density function, which draws more samples of $x$ around $3$ and fewer samples elsewhere. Call this density function $q(x)$, where $\int_a^b q(x) dx =1$. We then have

\begin{aligned}
I = \int_a^b f(x) dx = \int_a^b \frac{f(x)}{q(x)} q(x) dx = E_{q}[h(x)] \approx \frac{1}{n}\sum_{j=1}^n h(x_j),
\end{aligned}

where $h(x) = [f(x)/q(x)]$ and $x_j$s are drawn according to $q(x)$. Here, $q(x)$ is often called a _**proposal distribution**_ or _**sampling distribution**_.

The above is an example of importance sampling. A good choice of $q(x)$ provides better sampling, which results in reduced variance and faster convergence. Therefore, importance sampling is often referred to as a variance reduction method.
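
The variance reduction can be seen in a small experiment. The sketch below assumes $f(x) = e^{-8(x-3)^2}$ on $[a,b] = [0,6]$, whose integral is approximately $\sqrt{\pi/8} \approx 0.6267$ because the tails outside $[0,6]$ are negligible, and a proposal $q(x)$ that is $N(3, 0.5^2)$ truncated to $[0,6]$. All names, the seed, and the parameter values are illustrative.

```python
import math
import random

A, B = 0.0, 6.0
MU, SIGMA = 3.0, 0.5                  # proposal centred where f is largest


def f(x):
    return math.exp(-8.0 * (x - 3.0) ** 2)     # sharply peaked around x = 3


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Normalising constant so that q integrates to 1 over [A, B].
MASS = normal_cdf((B - MU) / SIGMA) - normal_cdf((A - MU) / SIGMA)


def q(x):
    """Density of N(MU, SIGMA^2) truncated to [A, B]."""
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi) * MASS)


def draw_q(rng):
    """Draw from the truncated normal by discarding values outside [A, B]."""
    while True:
        x = rng.gauss(MU, SIGMA)
        if A <= x <= B:
            return x


def mean_and_variance(values):
    m = sum(values) / len(values)
    v = sum((y - m) ** 2 for y in values) / (len(values) - 1)
    return m, v


rng = random.Random(3)
n = 50_000
# Plain Monte Carlo: (b - a) * f(x_j) with x_j ~ U(a, b).
uniform_terms = [(B - A) * f(rng.uniform(A, B)) for _ in range(n)]
# Importance sampling: h(x_j) = f(x_j) / q(x_j) with x_j ~ q.
importance_terms = []
for _ in range(n):
    x = draw_q(rng)
    importance_terms.append(f(x) / q(x))

est_u, var_u = mean_and_variance(uniform_terms)
est_q, var_q = mean_and_variance(importance_terms)
print("uniform   :", est_u, "variance of terms", var_u)
print("importance:", est_q, "variance of terms", var_q)
```

Both sample averages target the same integral; the difference lies in the spread of the individual terms, which determines how quickly the average settles down.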

#### example 2

Let's consider the issue we mentioned with regard to \eqref{eq:case1} where $p(x)$ is difficult to sample from. Suppose $q2(x)$ is a probability density function ($\int_\Omega q2(x)dx=1$) which is easier to sample from. We could transform the problem into one that uses $q2(x)$, not $p(x)$, as the probability density.


\begin{aligned}
I = \int_\Omega f(x)p(x) dx = \int_\Omega \left[\frac{f(x)p(x)}{q2(x)}\right] q2(x) dx = E_{q2}[h(X)],
\end{aligned}

where $h(x) = [f(x)p(x)/q2(x)]$. To approximate the integral,  

\begin{aligned}
I = E_{q2}[h(X)] \approx \frac{1}{n} \sum_{j=1}^n h(x_j),
\end{aligned}

where the $x_j$s are sampled based on $q2(x)$ (rather than $p(x)$). 
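
Here is a minimal sketch of this reweighting. For the sake of a checkable answer it assumes $p(x)$ is the standard normal density and $f(x) = x^2$, so $I = 1$; the standard normal is of course not hard to sample from, but the code never draws from it. The proposal $q2(x)$ is assumed to be the standard Cauchy density, which has heavier tails than the normal and can be sampled by inverting its cumulative distribution function. The names `p`, `q2`, `draw_q2`, the seed, and the sample size are illustrative.

```python
import math
import random


def p(x):
    """Target density: standard normal (only evaluated, never sampled)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def q2(x):
    """Proposal density: standard Cauchy."""
    return 1.0 / (math.pi * (1.0 + x * x))


def draw_q2(rng):
    """Inverse sampling for the Cauchy: x = tan(pi * (u - 1/2)), u ~ U(0,1)."""
    return math.tan(math.pi * (rng.random() - 0.5))


f = lambda x: x * x                   # E_p[x^2] = 1 for the standard normal

rng = random.Random(11)
n = 100_000
total = 0.0
for _ in range(n):
    x = draw_q2(rng)                  # x_j drawn according to q2, not p
    total += f(x) * p(x) / q2(x)      # h(x_j) = f(x_j) p(x_j) / q2(x_j)
estimate = total / n
print(estimate)
```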


### Discussion


An interesting issue to note is that the sampling distribution (i.e., $q(x)$ and $q2(x)$) is not necessarily the true distribution of $x$ and is in fact likely to differ from it. So, would using it bias the estimate? No, it wouldn't, provided the sampling density is positive wherever the integrand is nonzero. This is because each draw is weighted to correct for the mismatch, and the correction ensures that the estimator is unbiased. In example 2, the weight is $p(x)/q2(x)$, which is called the _**likelihood ratio**_. 


Choosing a good proposal distribution ($q(x)$ or $q2(x)$) is vital: it gives a simpler expression and more efficient sampling. Some advice from the literature (using our last example to illustrate):
- select a $q2(x)$ from the same family as $p(x)$ so that they have similar shapes;
- $q2(x)$ should have thicker tails than $p(x)$; otherwise $h(x)=f(x)p(x)/q2(x)$ may become very large in the tails, or even unbounded;
- $q2(x)$ should be easy to sample from.


#### Other Remarks

- The GHK simulator is a kind of importance sampling that uses a truncated normal as the probability function. See also `(Gates 2006 SJ) A Mata Geweke-Hajivassiliou-Keane Multivariate Normal Simulator.pdf`,  `(Benz Bretz 2009) Computation of Multivariate Normal and t Probabilities.pdf`.

