*The code snippets assume a Python virtual environment based on Anaconda 5.2.0.*

<div class="alert alert-info">
    <h4>Acknowledgement</h4>
    <p>I would like to acknowledge <a href="http://www.michaelpyrcz.com/">Michael Pyrcz</a>, Associate Professor in Petroleum and Geosystems Engineering at the University of Texas at Austin, for developing course materials that helped me write this article.</p>
    <p>Check out his <a href="https://www.youtube.com/watch?v=jVRLGOsnYuw">YouTube lecture on variograms</a> and the <a href="https://github.com/GeostatsGuy/ExcelNumericalDemos/blob/master/Variogram%20Calc_Model_Demo_v2.0.xlsx">variogram Excel numerical demo</a> on his GitHub repo to better understand the statistical theories and concepts.</p>
</div>

Let's say that you are a spatial data analyst at a gold mining company and want to know the distribution of gold percentage over a 100 m x 100 m mining area. To understand the characteristics of the rock formations, you take 100 random rock samples from the mining area, but these 100 data points are obviously not enough to estimate the gold percentage at every spatial location in the area. So you analyze the available data (100 rock samples from random locations) and simulate a full 2D surface plot of gold percentage over the mining area.

 
<div class="row give-margin">
    <div class="col"><img src="jupyter_images/gold_transform.png"></div>
</div>

This 2D surface simulation from sparse spatial data is a sequential process that involves a series of geostatistical techniques.

Steps:

1. Plot experimental variogram
2. Fit variogram model
3. Apply kriging
4. Apply simulation on top of kriging
5. Run simulation multiple times and perform additional data analyses as needed

This post covers the concepts, theory, and methodology of plotting a **variogram**.

## Experimental Variogram: Theory

> **Variogram** is a measure of dissimilarity over a distance. It shows how two data points are correlated from a spatial perspective, and provides useful insights when trying to estimate the value of an unknown location using collected sample data from other locations.

[Tobler's first law of geography](https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography) states that "everything is related to everything else, but near things are more related than distant things." A variogram shows how the correlation between two spatial data points changes with distance. For example, terrains 1 km apart are more likely to be similar than terrains 100 km apart. Oil wells 500 ft apart are more likely to show similar reservoir characteristics than oil wells 5000 ft apart.

A variogram is a function of variance over distance. It has the following equation and plot:

<div class="row give-margin">
    <div class="col-md-6 col-sm-12"><img src="jupyter_images/basic_variogram.png"></div>
    <div class="col-md-6 col-sm-12 center-center">
        <div style="top: 30%">
            $$\gamma(h) = \frac{1}{2N(h)}\sum_{\alpha =1}^{N(h)}\left ( z(u_{\alpha })-z(u_{\alpha} + h) \right)^2$$  
        </div>
    </div>
</div>

<p><u>Variables Explained</u></p>

$\gamma(h)$ = a measure of dissimilarity vs distance. It is a spatial variance between two data points separated by the distance, $h$.

$N(h)$ = number of all data point pairs separated by the distance, $h$.

$h$ = lag distance. Separation between two data points.

$u_{\alpha }$ = data point on 2D or 3D space at the location, $\alpha$.

$u_{\alpha} + h$ = data point separated from $u_{\alpha }$ by the distance, $h$.

$z(u_{\alpha })$ = numerical value of data point, $u_{\alpha }$

$z(u_{\alpha} + h)$ = numerical value of data point, $u_{\alpha} + h$

$\sigma^2$ = sill. The variance reached at the lag distance, $h$, at which spatial data pairs lose correlation.
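The definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production implementation: the function name `semivariogram`, the tolerance parameter `tol`, and the four sample points are all hypothetical. Real, irregularly spaced data would normally need a wider lag tolerance (binning) so that enough pairs fall into each lag.

```python
import math
from itertools import combinations

def semivariogram(coords, values, lag, tol=1e-9):
    """Experimental gamma(h) for a single lag distance.

    coords: list of (x, y) tuples; values: list of floats.
    A pair counts toward N(h) when its separation is within `tol` of `lag`.
    Returns (gamma, n_pairs); gamma is None when no pair qualifies.
    """
    sq_diff_sum = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(coords)), 2):
        dist = math.hypot(coords[i][0] - coords[j][0],
                          coords[i][1] - coords[j][1])
        if abs(dist - lag) <= tol:
            sq_diff_sum += (values[i] - values[j]) ** 2
            n_pairs += 1
    if n_pairs == 0:
        return None, 0
    return sq_diff_sum / (2 * n_pairs), n_pairs

# Hypothetical samples on the corners of a 1 m x 1 m square
coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
values = [1.0, 2.0, 4.0, 3.0]

gamma_axis, n_axis = semivariogram(coords, values, lag=1.0)
gamma_diag, n_diag = semivariogram(coords, values, lag=math.sqrt(2))
print(gamma_axis, n_axis)  # 1.5 4
print(gamma_diag, n_diag)  # 2.0 2
```

Each unordered pair is counted once, and horizontal, vertical, and diagonal separations are treated alike because only the distance between the two points matters.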

<hr>

**Observation 1:** $z(u_{\alpha }) - z(u_{\alpha} + h)$

<div class="row give-margin-inline-plot">
    <div class="col-12"><img src="jupyter_images/grid_1.png" style="border: 1px solid;"></div>
</div>

There are two data points in the image, with values $z(u_{\alpha })$ and $z(u_{\alpha } + h)$. These two points are separated by the lag distance, $h$. The variogram equation takes the difference between the values at these two points:

$$z(u_{\alpha })-z(u_{\alpha} + h)$$  

**Observation 2:** $N(h)$

<div class="row give-margin-inline-plot">
    <div class="col-12"><img src="jupyter_images/grid_2.png" style="border: 1px solid;"></div>
</div>

$N(h)$ accounts for <u>all</u> data point pairs that are separated by lag distance $h$. Although only horizontal separation is shown in the image, separation between two data points can be horizontal, vertical, or diagonal. The variogram calculates the difference, $z(u_{\alpha })-z(u_{\alpha} + h)$, for every pair of data points separated by lag distance $h$, then squares and sums these differences:

$$\sum_{\alpha =1}^{N(h)}\left ( z(u_{\alpha })-z(u_{\alpha} + h) \right)^2$$
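As a worked example, assume a hypothetical row of four samples spaced 1 m apart with values 1, 2, 4, and 3. At $h = 1$ m there are $N(h) = 3$ pairs, and the sum of squared differences is:

$$\sum_{\alpha =1}^{3}\left ( z(u_{\alpha })-z(u_{\alpha} + h) \right)^2 = (1-2)^2 + (2-4)^2 + (4-3)^2 = 6$$

Dividing by $2N(h) = 6$ gives $\gamma(1) = 1$.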

**Observation 3:** $\gamma (h)$

$\gamma (h)$ denotes the variability of spatial data points at a lag distance, $h$. Recall that the variogram accounts for <u>all</u> pairs separated by distance $h$. It may seem very simple, but each dot on a variogram plot is actually obtained by iterating over all pairs separated by $h$.

$\underline{ h = 1m }$
<div class="row">
    <div class="col-12"><img src="jupyter_images/grid_3.png"></div>
</div>

$\underline{ h = 2m }$
<div class="row">
    <div class="col-12"><img src="jupyter_images/grid_4.png"></div>
</div>

$\underline{ h = 3m }$
<div class="row">
    <div class="col-12"><img src="jupyter_images/grid_5.png"></div>
</div>

Observe how fewer data pairs are connected by red lines for $h = 3$. As $h$ increases, fewer pairs are separated by $h$ because of the spatial limits of the sampled area.
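The shrinking pair count is easy to verify numerically. The sketch below assumes a hypothetical 4 x 4 grid of sample locations with 1 m spacing (the grid in the images may differ) and counts the pairs whose separation equals each lag. The helper `count_pairs` is illustrative only.

```python
import math
from itertools import combinations

def count_pairs(coords, lag, tol=1e-9):
    """Count unordered point pairs whose separation is within `tol` of `lag`."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(coords, 2)
        if abs(math.hypot(x1 - x2, y1 - y2) - lag) <= tol
    )

# Hypothetical 4 x 4 grid with 1 m spacing
grid = [(x, y) for x in range(4) for y in range(4)]

pair_counts = {h: count_pairs(grid, h) for h in (1, 2, 3)}
print(pair_counts)  # {1: 24, 2: 16, 3: 8}
```

With fewer pairs behind them, the variogram points at large lags are generally less reliable than the points at small lags.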

**Observation 4:** $\sigma^2$

The sill ($\sigma^2$) is the variance at which spatial data pairs lose correlation. As the distance between two data points increases, it becomes less likely that the two are related to one another. You may assume that oil wells separated by 100 ft exhibit similar geologic characteristics, but you can't assume the same for a well in Texas and a well in California. The variogram works in a similar way.

<div class="alert alert-info">
    <h4>Notes</h4>
    <p>Spatial variance may never reach the sill if there is a trend. Example: an areal trend in the variability between wells.</p>
</div>
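One way to see why the sill is tied to the variance: when every pair of samples is pooled regardless of distance, half of the mean squared difference equals the sample variance exactly. The sketch below checks this identity on hypothetical values. It only illustrates the link between the two quantities and is not a method for estimating the sill from real data.

```python
from itertools import combinations
from statistics import variance

values = [1.0, 2.0, 4.0, 3.0]  # hypothetical sample values

# Pool every unordered pair, ignoring how far apart the samples are
pairs = list(combinations(values, 2))
half_mean_sq_diff = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Unbiased sample variance (divides by n - 1)
sample_variance = variance(values)

print(half_mean_sq_diff, sample_variance)
```

Both numbers come out to 5/3. At lags large enough that pairs are no longer correlated, $\gamma(h)$ is expected to level off near this value.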

**Observation 5:** range

Range is the distance at which the spatial variability reaches the sill ($\sigma^2$). Let's say that you are an exploration engineer planning to drill a new oil well. You have drilled wells A, B, C, and D, which are 100 ft, 200 ft, 300 ft, and 400 ft, respectively, from the zone where you want to drill the next well, and you want to know if you can use the data from the previously drilled wells. The geostatisticians on your team report that the geologic formation in the region has a <u>range</u> of 350 ft. This means that the rocks in the region lose geologic correlation with one another if they are more than 350 ft apart, so you can't use data from well D because it is 400 ft away.
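The well-screening decision in this example amounts to comparing each distance against the range. A minimal sketch using the numbers from the scenario above (the variable names are hypothetical):

```python
variogram_range_ft = 350  # range reported by the geostatisticians

# Distance from each existing well to the planned drilling zone, in feet
well_distances_ft = {"A": 100, "B": 200, "C": 300, "D": 400}

# Keep only wells that are still spatially correlated with the new location
usable_wells = sorted(
    well for well, dist in well_distances_ft.items()
    if dist <= variogram_range_ft
)
print(usable_wells)  # ['A', 'B', 'C']
```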