# Theoretical background

## Statistical Confidence in MD measurement
Due to the finite time of a molecular dynamics simulation, there is always some degree of correlation within a set of computational measurements. To estimate this correlation and incorporate it in the error analysis, the *statistical inefficiency (SI)* is defined through the relation between the variance of the mean and the autocorrelation function of the quantity of interest; it is then measured using different computational techniques ([Tildesley](https://doi.org/10.1093/oso/9780198803195.001.0001), [Rapaport](https://doi.org/10.1017/CBO9780511816581)).


Taking $\{R_i\}_{i=1}^{N}$ as the data set of the chain size $R$, the mean and the variance of the mean are respectively
$$\langle R\rangle=\frac{1}{N}\sum_{i=1}^{N}R_i$$
$$\sigma^2(\langle R\rangle)=\frac{s}{N}\sigma^2(R)$$
where $\sigma^2(R)$ is the bias-corrected variance of the data set
$$\sigma^2(R)=\frac{1}{N-1}\sum_{i=1}^{N}(R_i-\langle R\rangle)^2$$
and $s$ is the SI, estimated by means of the blocking method ([Flyvbjerg](https://doi.org/10.1063/1.457480)). In this method, the original data set is successively divided into blocks of length $L_{block}\in\{1,2,4,8,\dots,2^{M}\}$ with $2^{M}<N$, so that $N_{block}=\lfloor N/L_{block}\rfloor$ blocks remain after each blocking transform. Each datum in the new set is the mean of the $L_{block}$ consecutive data of the original set that fall in that block. After each blocking transform, the SI is estimated in the following way
$$s_{block} = \frac{L_{block}\sigma^2_{block}}{\sigma^2(R)}$$
where $L_{block}$ is the block length after the $M\text{-th}$ transform and $\sigma^2_{block}$ is the bias-corrected variance of the block means. For the untransformed data ($L_{block}=1$), $s_{block}$ takes the trivial value of $1$. Using the blocking method, it is also possible to find the error in $\sigma^2_{block}$
$$\Delta\sigma^2_{block}=\sqrt{\frac{2}{N_{block}-1}}\sigma^2_{block}$$
where $N_{block}$ is the number of blocks at that transform. The SI of the original data, $s$, is the intercept found by fitting a line to the plateau of the $s_{block}$ vs $\frac{1}{L_{block}}$ diagram, that is, the limiting value as $\frac{1}{L_{block}}\to 0$.
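
The blocking procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation behind `analyzer.error_calc_block`: the function name `blocking_si`, the `min_blocks` cutoff, the synthetic AR(1) test series, and the fitting window (`lengths >= 64`) are all assumptions made for the example. The error bar on $s_{block}$ is taken as the relative error of the block variance, estimated from the number of blocks, and neglects the uncertainty in $\sigma^2(R)$. The line fit is read here as an extrapolation of $s_{block}$ to $\frac{1}{L_{block}}\to 0$. For an AR(1) series with coefficient $\phi$, the exact SI is $(1+\phi)/(1-\phi)$, which gives a reference value to compare against.

```python
import numpy as np


def blocking_si(series, min_blocks=16):
    """Block lengths, SI estimates and SI errors for successive blocking transforms.

    Hypothetical helper for illustration; `min_blocks` is an assumed cutoff
    below which the block variance is considered too noisy to be useful.
    """
    x = np.asarray(series, dtype=float).copy()
    var_0 = x.var(ddof=1)  # bias-corrected variance of the original data
    l_block = 1            # current block length
    lengths, si, si_err = [], [], []
    while x.size >= min_blocks:
        n_block = x.size                # number of blocks at this transform
        var_block = x.var(ddof=1)       # bias-corrected variance of the block means
        s_block = l_block * var_block / var_0
        lengths.append(l_block)
        si.append(s_block)
        # relative error of a variance estimated from n_block samples
        si_err.append(np.sqrt(2.0 / (n_block - 1)) * s_block)
        # blocking transform: average neighbouring pairs, dropping a trailing odd point
        n_half = x.size // 2
        x = 0.5 * (x[0:2 * n_half:2] + x[1:2 * n_half:2])
        l_block *= 2
    return np.array(lengths), np.array(si), np.array(si_err)


# Synthetic correlated series (assumption: AR(1) with phi = 0.9 stands in for MD data)
rng = np.random.default_rng(0)
phi = 0.9
noise = rng.standard_normal(2**17)
r = np.empty_like(noise)
r[0] = noise[0]
for i in range(1, noise.size):
    r[i] = phi * r[i - 1] + noise[i]

lengths, si, si_err = blocking_si(r)
for l_block, s_block, err in zip(lengths, si, si_err):
    print("{:8d}{:12.4f}{:12.4f}".format(l_block, s_block, err))

# Weighted line fit of s_block against 1/L_block over an assumed plateau window;
# the intercept is the extrapolated SI.
window = lengths >= 64
slope, s_fit = np.polyfit(1.0 / lengths[window], si[window], 1, w=1.0 / si_err[window])
print("Extrapolated SI:", s_fit)
print("Reference SI for AR(1):", (1 + phi) / (1 - phi))
```

With $\phi=0.9$ the reference value is $19$. The fitted intercept should fall near it, with scatter that grows as the number of blocks shrinks, which is why the error bars matter when choosing the fitting window.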

In [None]:
%matplotlib inline
import numpy as np
from glob import glob
import matplotlib.pyplot as plt
import pandas as pd
from polyphys.analyze import analyzer

In [None]:
#tseries = glob("./N*TMon.csv")
database = "/Users/amirhsi_mini/research_data/hns_cubic/N200epshm29nh36ac2nc5969mc8l25dt0.005ndump2000adump5000ens1"
data_file = glob(database + "/*gyrTDna.txt")[0]
data = pd.DataFrame(np.loadtxt(data_file), columns=["t", "gyr"])

In [None]:
#data = pd.read_csv(tseries[0],header=0)
#data

In [None]:
_, axes = plt.subplots(nrows=1, ncols=1, figsize=(16, 9))
axes.grid(True, which="both")
for idx in range(1,2):
    result = analyzer.error_calc_block(data.iloc[:,idx].to_numpy(), './block_analysis')
    result['si'].plot()

    axes.errorbar(result['ntransfroms'], result['si'], yerr=result['si_err'], fmt='--o')
axes.set_xlabel(r"Number of transformations: $n_{block}$")  # balanced braces so mathtext can parse the label
axes.set_ylabel(r"Statistical inefficiency: $s(n_{block})$")

In [None]:
a = data.iloc[:,1].to_numpy()
result = analyzer.error_calc_block(a,'./block_analysis')
result['si'].plot()

In [None]:
result

In [None]:
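a = data.iloc[:, 1].to_numpy(dtype=float)  # Reload the series: the blocking loop below overwrites a in place
nstep = len(a)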
# Simple calculation of average and variance
a_avg   = np.sum(a)/nstep        # Sample average
a       = a - a_avg              # Centre the data
a_var_1 = np.sum(a**2)/(nstep-1) # Bias-corrected sample variance
a_err   = np.sqrt(a_var_1/nstep) # Error estimate neglecting any correlations
print( "{:40}{:15.6f}".format('Sample average value', a_avg ) )
print( 'Deviation should (typically) lie within +/- exact error estimate' )
print( "{:40}{:15.6f}".format('Sample variance', a_var_1 ) )
print( "{:40}{:15.6f}".format('Error estimate neglecting correlations', a_err ) )
print( 'This should be very over-optimistic!' )


nblock = nstep
tblock = 1

print( "{:>15}{:>15}{:>15}{:>15}".format('tblock', 'nblock', 'error estimate', 'estimate of SI') )

while True: # Loop over number, and hence length, of blocks
    nblock = nblock // 2 # Halve the number of blocks, rounding down if nblock is odd
    tblock = tblock*2    # Double the block length
    if nblock < 3:
        break
    a[0:nblock] = ( a[0:2*nblock-1:2] + a[1:2*nblock:2] ) / 2.0 # Blocking transformation, halving the data set
    a_avg       = np.sum ( a[0:nblock] ) / nblock               # Re-compute sample average
    a[0:nblock] = a[0:nblock] - a_avg                           # Re-centre in case of dropped points (odd nblock)
    a_var       = np.sum ( a[0:nblock]**2 ) / (nblock-1)        # Bias-corrected variance of block averages
    a_err       = np.sqrt ( a_var / nblock )                    # Estimate of error from block average variance
    si          = tblock * a_var / a_var_1                      # Statistical inefficiency
    print( "{:15d}{:15d}{:15.6f}{:15.6f}".format(tblock, nblock, a_err, si) )

print('Plateau at large tblock (small nblock)')
print('should agree quite well with exact error estimate')
print('Can plot SI or error**2 against 1/tblock or log2(tblock)')