# Run this cell first

In [None]:
# this code enables the automated feedback. If you remove this, you won't get any feedback
# so don't delete this cell!
try:
  import AutoFeedback
except (ModuleNotFoundError, ImportError):
  !pip install AutoFeedback
  import AutoFeedback

try:
  from testsrc import test_main
except (ModuleNotFoundError, ImportError):
  !pip install "git+https://github.com/autofeedback-exercises/exercises.git@main#subdirectory=New-MTH4332/MonteCarlo"
  from testsrc import test_main

def runtest(tlist):
  import unittest
  from contextlib import redirect_stderr
  from os import devnull
  with redirect_stderr(open(devnull, 'w')):
    suite = unittest.TestSuite()
    for tname in tlist:
      suite.addTest(eval(f"test_main.UnitTests.{tname}"))
    runner = unittest.TextTestRunner()
    try:
      runner.run(suite)
    except AssertionError:
      pass


# Introduction

The exercises in this notebook provide an introduction to Monte Carlo simulation, a technique that we will be using in the remainder of this course.  The reason for introducing this technique is to provide a computationally tractable method for calculating ensemble averages.  In the first assignment you learned that we calculate these ensemble averages using:

$$
\langle E \rangle = \frac{1}{Z} \sum_i E_i e^{-\beta E_i} \qquad \textrm{where} \qquad Z = \sum_i e^{-\beta E_i}
$$

where the sums run over all the microstates.  You also learned that the number of microstates scales exponentially with the number of particles and that using the expressions above to compute ensemble averages is thus too expensive in the vast majority of instances.  To resolve this problem the next exercise will discuss how we can approximate the ensemble average of the energy as:

$$
\langle E \rangle \approx \frac{1}{T} \sum_{t=1}^T E_t
$$

where the sum runs over a short time series.  

The theory behind a method known as Monte Carlo simulation offers a justification for this approximation.  The exercises below, therefore, explain how this numerical technique can be used to calculate (definite) integrals and how we can thus use it to calculate the ensemble average, which, when state space is continuous, is given by the following expressions:

$$
\langle E \rangle = \frac{1}{Z} \int E\Omega(E) e^{-\beta E} \textrm{d}E \qquad \textrm{where} \qquad Z = \int \Omega(E) e^{-\beta E} \textrm{d}E
$$

where $\Omega(E)$ is the density of states.
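To make the discrete sums at the start of this introduction concrete, here is a minimal sketch that evaluates them exactly for a toy system.  The three energy levels and the value of $\beta$ are made-up illustrative numbers, not values taken from the course material:

```python
import numpy as np

# Hypothetical toy system: three microstates with these energies (arbitrary units)
energies = np.array([0.0, 1.0, 2.0])
beta = 1.0  # assumed inverse temperature, chosen only for illustration

# Boltzmann weight of each microstate
weights = np.exp(-beta * energies)

# Partition function and ensemble average of the energy
Z = weights.sum()
average_energy = (energies * weights).sum() / Z

print(Z, average_energy)
```

For a system of N particles that can each be in one of two states, the `energies` array would need `2**N` entries, which is why this direct sum quickly becomes impractical.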

## A simple integral we can perform with Monte Carlo


In these exercises we use Monte Carlo to evaluate the following integral:

$$
\int_0^1 \sqrt{1-x^2} \textrm{d}x
$$

We can evaluate this integral exactly by using the substitution $x=\sin(t)$, which converts the integral above to:

$$
\int_0^{\pi/2} \sqrt{1 - \sin^2(t)} \cos(t) \textrm{d}t = \int_0^{\pi/2} \cos^2(t) \textrm{d}t
$$

We arrive at the expression after the equality here by using the Pythagorean identity $1 = \cos^2(t) + \sin^2(t)$.  Furthermore, by using the double angle formula we can write $\cos(2t) = \cos^2(t) - \sin^2(t) = \cos^2(t) - 1 + \cos^2(t)$, so that $\cos^2(t) = \frac{1}{2}\left[ 1 + \cos(2t) \right]$.  We can thus rewrite the integral above as:

$$
\frac{1}{2} \int_0^{\pi/2} \left[ 1 + \cos(2t) \right] \textrm{d}t = \frac{1}{2} \left[ t + \frac{1}{2} \sin(2t)\right]^{\pi/2}_0 = \frac{\pi}{4}
$$
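As a quick sanity check on the algebra, the value $\pi/4$ can be compared with the output of a general-purpose quadrature routine.  This sketch uses `scipy.integrate.quad`, which is not the method developed in these exercises; the function name `integrand` is introduced here purely for illustration:

```python
import numpy as np
from scipy.integrate import quad

def integrand(x):
    # The function we are integrating between 0 and 1
    return np.sqrt(1.0 - x**2)

# quad returns the estimate of the integral and an estimate of its absolute error
value, abs_error = quad(integrand, 0.0, 1.0)

print(value, np.pi / 4)
```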

However, instead of evaluating this integral analytically, we will be evaluating this integral numerically in the exercises that follow.  As always you should start by executing the following cell, which imports the libraries that we need: 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats

You should then watch the following video, which offers a brief reminder of the theory of numerical integration.

In [1]:
%%HTML 
<iframe width="560" height="315" src="https://www.youtube.com/embed/64oys2TFaHQ?si=nSrNZDPPmjhus1K-" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

# Numerical integration

By now you have hopefully realised that when we calculate a partition function we are calculating the following (definite) high-dimensional integral:

$$
Z = \int \int \dots \int e^{-\beta H(\mathbf{x},\mathbf{p})} \textrm{d}x_1 \textrm{d}x_2 \dots \textrm{d}x_N \textrm{d}p_1 \textrm{d}p_2 \dots \textrm{d}p_N 
$$

Similarly, an ensemble average is the following ratio between two high-dimensional integrals:

$$
\langle A \rangle = \frac{1}{Z} \int \int \dots \int A(\mathbf{x},\mathbf{p}) e^{-\beta H(\mathbf{x},\mathbf{p})} \textrm{d}x_1 \textrm{d}x_2 \dots \textrm{d}x_N \textrm{d}p_1 \textrm{d}p_2 \dots \textrm{d}p_N
$$

With this and all the mathematics you have learned during your degree in mind, you can hopefully see the problem that we are going to encounter.  Calculating integrals analytically is only possible when the functions within them are relatively simple.  It is thus not possible to calculate these integrals exactly when the Hamiltonian is complicated.

In order to make progress and to study complex physical systems, we are going to have to learn some numerical recipes that we can use to calculate these integrals.  The aim of this next part of the course is thus to explain some numerical tools that we might employ.

In order to get a sense of how these tools work we are going to revisit the following integral:

$$
\int_0^1 \sqrt{1-x^2} \textrm{d}x
$$

If you plot the integrand here you should see that the curve we are integrating traces out a quarter circle in the $xy$ plane.  If we plot this function we thus do not need the analytic derivation I provided above.  We can compute this integral without doing any maths: a quarter of a circle of radius one has an area of $\pi/4$.  (I would recommend plotting the function if you are unconvinced that the integrand is indeed a quarter circle).
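A minimal sketch of such a plot is given here.  The number of points (200) and the axis labels are arbitrary choices made for this illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the integrand on a fine set of x values between 0 and 1
x = np.linspace(0.0, 1.0, 200)
y = np.sqrt(1.0 - x**2)

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_aspect("equal")  # equal scaling so that the arc looks circular
ax.set_xlabel("x")
ax.set_ylabel("y")
# In a notebook the figure is displayed automatically at the end of the cell
```

Every point on the curve satisfies $x^2 + y^2 = 1$, which is the equation of the unit circle.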

Having established the value of the integral, let's now develop an algorithm that can be used to evaluate it numerically.  As you know, the integral is equal to the area under the curve.  The numerical algorithm that we are going to employ in this first exercise is thus going to work as follows:

1. We are going to create a uniformly spaced grid of points that have x values between 0 and 1 and y values between 0 and 1.
2. We are then going to determine whether each of the points on the grid is within the unit circle or not.
3. The final value of the integral will be the total number of grid points that were found to be within the circle divided by the total number of points in our grid.
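The same counting idea can be sketched for a different region, so as not to give away the exercise.  The example here estimates $\int_0^1 x^2 \textrm{d}x = 1/3$ by counting the grid points that lie below the curve $y = x^2$.  The curve and the names `centres`, `xg`, `yg`, `below_curve` and `grid_estimate` are choices made for this illustration only:

```python
import numpy as np

npoints = 100
gridspacing = 1.0 / npoints

# Coordinates of the centres of the grid squares along one axis
centres = (np.arange(npoints) + 0.5) * gridspacing

# All npoints x npoints pairs of (x, y) coordinates
xg, yg = np.meshgrid(centres, centres)

# True for every grid point that lies below the curve y = x**2
below_curve = yg < xg**2

# Fraction of grid points below the curve, which approximates the integral
grid_estimate = below_curve.sum() / (npoints * npoints)

print(grid_estimate, 1.0 / 3.0)
```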

I have started implementing this algorithm in the code cell below.  In particular, I have written the code to generate a uniform grid of points.  There are `npoints` x `npoints` points in total in this grid and each pair of neighbouring points along x or y is separated by a distance called `gridspacing`.  Within the double loop I have called a function called `incircle` that takes a pair of coordinates, (`x`,`y`), as input.  You must write this function.  The function should:

1. Return one if the input coordinates are within the unit circle.
2. Return zero if the input coordinates are not within the unit circle.

If you do this you should see that the rest of the code that I have written for you will ensure that the variable called `npoints_in_circle` will be equal to the total number of grid points that sit within the circle.  The final value printed, which is the fraction of grid points that are within the circle, will thus be an approximate value for the integral that we were asked to compute.

In [None]:
def incircle(x,y) :
  # Your code goes here

  return

npoints = 100
gridspacing = 1.0/npoints
npoints_in_circle = 0
for i in range(npoints) :
  x = (i+0.5)*gridspacing
  for j in range(npoints) :
    y = (j+0.5)*gridspacing
    npoints_in_circle = npoints_in_circle + incircle(x,y)

final_integral = npoints_in_circle / (npoints*npoints)

print( final_integral )


In [None]:
runtest(['test_function'])

# Monte Carlo Integration

The algorithm that you just implemented computed an approximate version of the integral.  The total area of the xy-plane was divided into a set of squares.  Each of these squares had an area of `1/(npoints*npoints)`.  We determined whether the centre of each of the squares was within the unit circle or not.  If the centre of the square was within the circle then we assumed the whole square was and thus approximated the integral as `nsquares_inside * area_of_square`.

To understand Monte Carlo the key thing to recognise is that we had to loop over all the little squares and determine whether or not their centres were inside the unit circle.  Obviously, the order in which we go through the squares when doing this doesn't matter.  Furthermore, if, instead of running through all the squares, we selected only N of the squares at random and ran the algorithm on only those squares we would, once we divided the number of squares whose centres were found to be inside the circle by N, obtain a number close to the value of the final integral.  This works because the ratio of the number of squares whose centre is inside the circle to the total number of squares is fixed, and the fraction found in a random sample of the squares is an estimate of that ratio.

This realisation is the basis of the Monte Carlo algorithm.  As is discussed in the video, in this algorithm:

1. A random grid of points is generated instead of a regular grid of points
2. A function (in this case whether or not the point is within the unit circle) is evaluated at each of these randomly chosen points
3. The average value of the function is evaluated.

As the video explains, calculating averages in this way gives us an approximate value for the integral because the ratio between the area of the circle and a square that encloses it is constant.

In [3]:
%%HTML 
<iframe width="560" height="315" src="https://www.youtube.com/embed/wz3K_7t2spU?si=o9vbE3faUz36lzll" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

Alternatively, because the argument in the video is a bit sketchy, we can use ideas from probability theory to write the expectation (this is the same as the quantity we have called the ensemble average in this module) of a function, $A(x)$, as:

$$
\langle A \rangle = \int_{-\infty}^\infty A(x) P(x) \textrm{d}x
$$

where $P(x)$ is a probability density.  The law of large numbers and central limit theorem tell us that:

$$
\langle A \rangle \approx \frac{1}{N} \sum_{i=1}^N A(X_i)
$$

where each $X_i$ in this expression is a random sample from the distribution with probability density $P(x)$.
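As a minimal sketch of this idea, the code here estimates $\langle x^2 \rangle$ when $P(x)$ is a standard normal density, for which the exact answer is one.  The choice of $A(x) = x^2$, the normal distribution, the seed and the value of `N` are all assumptions made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so that the illustration is repeatable
N = 100_000

# Draw N samples X_i from the distribution with density P(x)
samples = rng.normal(0.0, 1.0, size=N)

# Average A(X_i) = X_i**2 over the samples
mc_average = np.mean(samples**2)

print(mc_average)  # should be close to the exact expectation of 1
```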

With that theory explained, let's now turn to what I would like you to actually do.  I would like you to write a function called `circle_estimate`.  This function should take in a single argument `N` and should return an estimate for the area of a quarter circle.  The way the function should calculate this estimate is as follows:

1. `N` pairs of uniform random variables between 0 and 1 should be generated.
2. Each pair of random variables that you generate can be thought of as a set of (x,y) coordinates in the Cartesian plane.  You should thus test whether or not these points are within the unit circle.
3. You should calculate an estimate for the area of the quarter circle by dividing the number of points that were inside the circle by `N`.

Please note that you can generate a uniform random variable between 0 and 1 by using:

```python
U = np.random.uniform(0,1)
```

If you want to understand a little more about the theory of uniform (continuous) random variables you can read the notes [here](https://www.notion.so/Uniform-continuous-random-variable-deeb6302419b4ade9a5b36c8f105b42e).
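If you prefer to generate several random numbers at once, `np.random.uniform` also accepts a `size` argument.  The name `U_many` and the choice of five values are only for illustration:

```python
import numpy as np

# Five independent uniform random variables between 0 and 1, returned as an array
U_many = np.random.uniform(0, 1, size=5)

print(U_many)
```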


In [None]:
def circle_estimate(N) :
    # Your code goes here
    
    estimate = 0
    return estimate

# Three estimates for the area of the circle based on a random grid
# of 1000 points are printed here
print( circle_estimate(1000), circle_estimate(1000), circle_estimate(1000) )


In [None]:
runtest(['test_estimate1'])

# Error bars

As discussed in the following video, Monte Carlo simulations rely on the generation of random numbers.  The results from any Monte Carlo simulation are thus random.  When we quote the results that we obtain from a Monte Carlo simulation we __must__ provide error bars for the results.  If we do not then the results from our simulations are not [reproducible](https://www.notion.so/Reproducibility-2494dddd51a14d34bddbb40bb32f7ebc).

If you need a refresher on the theory of the sample mean then you can look at the notes you will find [here](https://www.notion.so/Sample-mean-583d58d7001343c68a7956a1b9f19f4b).

In [4]:
%%HTML 
<iframe width="560" height="315" src="https://www.youtube.com/embed/0ISUsHEPxSY?si=kxf1do1bM8H6We-U" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

To start this exercise you will need to copy the code that you wrote in the last exercise into the function called `area`.  This function takes a single number `N` as input.  It should then generate `N` pairs of x and y coordinates all of which lie within the unit square.  A test is then performed to determine whether or not each of these points is within the unit circle.  If the point is within the circle then the variable, `nin`, which measures how many of the generated points are within the unit circle, is incremented by one.  At the end of the code the final estimate for the area of the quarter-circle, `nin/N`, is returned.

It is important to note that, because random numbers are used to generate the x and y coordinates, you get a different value every time you repeat the experiment.  You can see this clearly if, after you have completed the `area` function, you run the code in the cell below.  The code that I have written will output three estimates for the area of the quarter circle and you should see that all three estimates are different.  (The relevant part of the output is the first line - the second line should contain a single 0).

The fact that you get different numbers every time you run the code because there is this element of randomness in the coordinates is problematic as it makes it difficult for another researcher to reproduce the results that we obtain.  In other words, if Alice says that she got 0.792 how does Bob know his code is doing the same thing as Alice's if he gets 0.795?  They are almost guaranteed to get different results because running the code involves generating random numbers.

The answer to this conundrum is for Alice and Bob to quote error bars on their values.  The reason for doing so is that if the two codes are the same then the distributions that Alice and Bob are sampling random numbers from should be the same.  By quoting error bars we provide information on the distribution and we can thus make a judgement as to whether the two results that Alice and Bob obtained are the same or not.

In this next exercise, therefore, I would like you to write some code to calculate these error bars.  The most conceptually simple way to compute the error bars is to run the experiment multiple times.  The way this would work for this particular problem is as follows:

1. You call the `area` function multiple times and thus obtain multiple estimates for the area, which you store in a list.
2. You sort the list.
3. You find a value, `l`, that 5% of the values in the sorted list are less than and a value, `u`, that 95% of the values in the sorted list are less than.  You can then state that if the experiment is reperformed there is a probability of 90 % that any new estimate of the area will lie between `l` and `u`.

Your task is thus to implement what I have described above.   In particular, you need to complete the function called `myerrors`.  This function takes two arguments `N` and `M` and it should return three numbers, `l`, `m` and `u`.    Within this function, you should generate `M` estimates for the area of the quarter circle each of which should be computed for a random grid of `N` $(x,y)$ coordinates.  All these estimates should then be stored in a list.  From this list, you will need to extract the following quantities:

1. `l` - a value that 5% of your estimates for the area are less than (the 5th percentile of the distribution).
2. `m` - the median of the estimates for the area that you obtained.
3. `u` - a value that 95% of your estimates for the area are less than (the 95th percentile of the distribution).

Please note that if you have a list called `data` and you would like to extract the 5th percentile of the data within the list you can use:

```python
pp = np.percentile( data, 5 )
```

Furthermore, if you use this function you don't actually need to sort the list.
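`np.percentile` can also return several percentiles in one call if it is given a list.  The sketch here applies it to made-up data (samples from a standard normal distribution, chosen only for illustration), for which the 5th and 95th percentiles should be close to $-1.645$ and $+1.645$:

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so that the illustration is repeatable

# Illustrative data: this is not the list of area estimates from the exercise
data = rng.normal(0.0, 1.0, size=100_000)

# 5th percentile, median (50th percentile) and 95th percentile in a single call
lower, median, upper = np.percentile(data, [5, 50, 95])

print(lower, median, upper)
```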


In [None]:
def area(N) :
    nin = 0
    # Your code for estimating the area of the circle goes here

print( area(1000), area(1000), area(1000) )

def myerrors(N,M) :
    # Your code goes here.
    l, m, u = 0, 0, 0

    return l, m, u


print( myerrors(1000, 100) )


In [None]:
runtest(['test_estimate2', 'test_range'])

# Error bars II

When we run Monte Carlo (or molecular dynamics) we __always__ need to make sure we quote the error bars.  If a Monte Carlo calculation is performed and no error bars are reported, the final result is not reproducible.  I am emphasizing this now because, if you stay in this field, you will see many papers that use Monte Carlo and that do not quote error bars.  Not quoting error bars is a good way of hiding the crappiness of your underlying simulations and getting to a publishable (but dubious) result.  Be warned!

The problem with the resampling scheme that we used to calculate the error bars in the previous exercise is that it is computationally expensive.  It is expensive because we had to compute multiple averages each of which was computed by generating $N$ randomly chosen coordinates.  For the exercise here, where generating each configuration is computationally cheap, this is not a big problem.  If we are doing computationally expensive things for each of the configurations we are generating this sort of resampling scheme is not an option.

In this final exercise, a computationally inexpensive way of computing the error bars will thus be introduced.  This new technique is based on the central limit theorem, which tells us that an average calculated from $N$ random variables (of most types) with expectation $\mu$ and variance $\sigma^2$ is (to a good approximation) a sample from a [normal distribution](https://www.notion.so/Normal-random-variable-86cae0d838314a3cb8aead626dc6647e) with expectation $\mu$ and variance $\sigma^2/N$.

As the final result that we get from a Monte Carlo code is an average computed from multiple random variables we can thus approximate it as a sample from a normal distribution.  Furthermore, we know the exact functional form for the probability density function for the normal distribution so we can use this function when quoting error bars on the calculated mean.
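The statement about the variance can be checked with a short numerical experiment.  In this sketch the underlying random variables are uniform on $[0,1]$, which have $\sigma^2 = 1/12$, and the values of `N`, `M` and the seed are arbitrary choices made for this illustration:

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded only so that the illustration is repeatable
N = 100    # number of random variables in each average
M = 5000   # number of averages that we compute

# Each row holds N uniform random variables; take the mean of each row
means = rng.uniform(0.0, 1.0, size=(M, N)).mean(axis=1)

# Variance of the M averages compared with the central limit theorem prediction
observed_variance = means.var(ddof=1)
predicted_variance = (1.0 / 12.0) / N

print(observed_variance, predicted_variance)
```

A histogram of `means` should also look approximately like a bell curve centred on 0.5.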


These ideas are explained in more detail in the following video.

In [5]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/505vxcAewqM?si=QNVFf80DBFgRBQ1b" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

To get started with this exercise you should copy and paste the function `area` that you wrote in your solution to the last exercise.  This code should calculate an estimate for the area of the quarter circle by calculating an average of multiple (Bernoulli) random variables.  You need to modify this function so that it also calculates the variance of these random variables:

$$
S^2 = \frac{N}{N-1} \left[ \frac{1}{N} \sum_{i=1}^N X_i^2 - \left( \frac{1}{N} \sum_{i=1}^N X_i \right)^2 \right]
$$

Computing this variance is important as this quantity appears in the expression that we derive from the central limit theorem for the error bar.  This expression is:

$$
\epsilon = \sqrt{ \frac{S^2}{N} } \Phi^{-1}\left( \frac{p_c + 1}{2} \right)
$$

In this expression, $p_c$ gives the probability that a new estimate for the mean will fall between $\mu - \epsilon$ and $\mu + \epsilon$, where $\mu$ is the true mean.  Just as in the previous exercise the error bar is thus telling us something about a range that the data has a certain probability of falling into.
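The two formulas above can be sketched for some generic data as follows.  The data here are uniform random variables rather than the Bernoulli variables from the exercise, and the values of `N`, `p_c` and the seed are assumptions made for this illustration:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)  # seeded only so that the illustration is repeatable
N = 10_000

# Illustrative data: N uniform random variables between 0 and 1
X = rng.uniform(0.0, 1.0, size=N)

# Sample mean and the sample variance S^2 from the formula above
mean = X.mean()
S2 = (N / (N - 1)) * (np.mean(X**2) - mean**2)

# Error bar for a 90% confidence level
p_c = 0.90
epsilon = np.sqrt(S2 / N) * scipy.stats.norm.ppf((p_c + 1) / 2)

print(mean - epsilon, mean, mean + epsilon)
```

The value of `S2` computed this way agrees with `np.var(X, ddof=1)`, which you can use as a check on your own implementation.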

Your task is to write a modified version of the `area` function that you wrote for the last exercise. In your answer to the previous exercise this function outputted a single variable. The modified version that you write here should output three variables.  The second of these is the average (i.e. the same quantity that is currently output by the function you wrote for the previous exercise).  The first variable is the 5th percentile of the distribution, which should be calculated using the formulas above (i.e. the average minus $\epsilon$ with $p_c = 0.9$).  The last variable is the 95th percentile of the distribution (the average plus $\epsilon$).  Again please use the formulas above to calculate this quantity. Please note that you can calculate the function $\Phi^{-1}$ using the command:

```python
ss = scipy.stats.norm.ppf(0.95)
```

In [None]:

print( area(1000) )


In [None]:
runtest(['test_errors'])