# Chapter 1: Exploratory Data Analysis

Welcome to the first tutorial for the GeoGals (Geostatistics on Galaxies) package! In this lesson, we will demonstrate how GeoGals can be used to create a *semivariogram* -- a data visualisation tool used to look for (and characterise) the two-dimensional spatial structure of galaxies. For a gentle introduction to the semivariogram and how it relates to other methods used in astronomy to quantify spatial correlations in data, see [this Tutorial](https://arxiv.org/abs/2407.14068).

First, let's begin by importing some important Python packages (if GeoGals is not installed, you can download and install it by running `pip install geogals`).

In [1]:
import geogals as gg
import matplotlib.pyplot as plt
from astropy.io import fits

Next, we need to load some data. For this example, we will use a metallicity map of the galaxy NGC 1385, derived from data collected by [the PHANGS team](https://sites.google.com/view/phangs/home) and computed using the $O_3N_2$ metallicity diagnostic of [Curti et al. 2017](https://ui.adsabs.harvard.edu/abs/2017MNRAS.465.1384C/abstract). This `.fits` file contains the information needed to translate from pixels to sky coordinates, as well as two pieces of data: the metallicity at each pixel, and the uncertainty in the metallicity of each pixel.

In [2]:
# Open Data
data_path = '../../data/'
Z_data    = fits.open(data_path + 'NGC1385_metals.fits')

As well as the `.fits` file, we need a little extra *metadata* about the galaxy in question. This is structured as a `dict`, and must contain entries for five fields:

 * `RA` and `DEC` -- the location of the centre of the galaxy;
 * `PA`, the position angle of the galaxy (in degrees);
 * `i`, the inclination of the galaxy (in degrees);
 * and `D`, the distance to the galaxy (in units of Mpc).

As a mnemonic: before you analyse real galaxy data, you need to know where the galaxy is, and you need to get `PAiD`! You can look up these parameters for a specific galaxy in [HyperLEDA](http://atlas.obs-hp.fr/hyperleda/). They are needed to convert angular separations on the sky into physical separations, with deprojection used to mitigate the effects of inclination.
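
To make the role of these five parameters concrete, here is a minimal sketch of a standard thin-disc deprojection. It is not taken from GeoGals: the function name `deprojected_separation_kpc` is hypothetical, and the sketch assumes small angular offsets, a flat disc, and a position angle measured east of north.

```python
import numpy as np

def deprojected_separation_kpc(ra, dec, meta):
    """Deprojected distance (kpc) of sky positions from the galaxy centre.

    ra, dec : sky coordinates in degrees (scalars or arrays)
    meta    : dict with RA, DEC, PA, i (all in degrees) and D (in Mpc)
    Hypothetical helper for illustration; not part of the GeoGals API.
    """
    ra = np.asarray(ra, dtype=float)
    dec = np.asarray(dec, dtype=float)
    pa = np.radians(meta['PA'])
    inc = np.radians(meta['i'])

    # Small-angle offsets from the centre, in radians (towards east and north)
    d_east = np.radians(ra - meta['RA']) * np.cos(np.radians(meta['DEC']))
    d_north = np.radians(dec - meta['DEC'])

    # Rotate so that one axis lies along the major axis (PA is east of north)
    along_major = d_east * np.sin(pa) + d_north * np.cos(pa)
    along_minor = -d_east * np.cos(pa) + d_north * np.sin(pa)

    # Inclination foreshortens the minor axis by cos(i); undo that
    r_angle = np.hypot(along_major, along_minor / np.cos(inc))

    # Small-angle approximation: physical size = distance * angle (Mpc -> kpc)
    return meta['D'] * 1e3 * r_angle

# Example: a point 10 arcseconds north of the centre of NGC 1385
example_meta = {'RA': 54.3680, 'DEC': -24.5012, 'PA': 181.3, 'i': 44.0, 'D': 22.7}
r_kpc = deprojected_separation_kpc(54.3680, -24.5012 + 10 / 3600, example_meta)
```

The distance `D` sets the overall scale, `PA` picks out the major axis, and `i` controls how strongly offsets along the minor axis are stretched.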

In [3]:
# Input metadata
metadata = {
    'RA': 54.3680,
    'DEC': -24.5012,
    'PA': 181.3,
    'i': 44.0,
    'D': 22.7
}

Now we have everything we need to generate and plot a semivariogram for this data. This can be done very quickly even with large datasets (this example has over 80,000 data points!) thanks to an algorithm based on the fast Fourier transform, described in [Marcotte (1996)](https://ui.adsabs.harvard.edu/abs/1996CG.....22.1175M/abstract).
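
To give a feel for how a Fourier transform helps here, the sketch below computes the semivariance at every pixel lag of a gridded map with missing values, using FFT-based cross-correlations. It is an illustration of the general idea only, not the GeoGals implementation: the function name `fft_semivariogram_map` is hypothetical, lags are in pixels, and no deprojection or distance binning is applied.

```python
import numpy as np

def fft_semivariogram_map(z):
    """Semivariance at every grid lag for a 2D array, with NaN marking missing data.

    Returns (gamma, n_pairs). Entry [dy, dx] refers to pairs z[y, x], z[y+dy, x+dx];
    negative lags are reached with negative indices, e.g. gamma[-1, 2].
    Hypothetical helper for illustration; not part of the GeoGals API.
    """
    valid = np.isfinite(z)
    ind = valid.astype(float)          # indicator: 1 where data exist
    z0 = np.where(valid, z, 0.0)       # missing pixels contribute zero
    ny, nx = z.shape
    shape = (2 * ny - 1, 2 * nx - 1)   # zero-pad so that lags do not wrap around

    def xcorr(a, b):
        # sum over x of a(x) * b(x + h) for every lag h (correlation theorem)
        fa = np.fft.rfft2(a, s=shape)
        fb = np.fft.rfft2(b, s=shape)
        return np.fft.irfft2(np.conj(fa) * fb, s=shape)

    # Number of valid pairs at each lag
    n_pairs = np.rint(xcorr(ind, ind))
    # Sum of (z(x) - z(x+h))^2 over valid pairs, expanded into three correlations
    sq_diff = xcorr(z0**2, ind) + xcorr(ind, z0**2) - 2.0 * xcorr(z0, z0)

    with np.errstate(invalid='ignore', divide='ignore'):
        gamma = np.where(n_pairs > 0, sq_diff / (2.0 * n_pairs), np.nan)
    return gamma, n_pairs

# Example on random numbers (NOT real galaxy data), with 20% of pixels missing
rng = np.random.default_rng(0)
z_demo = rng.normal(size=(20, 30))
z_demo[rng.random(z_demo.shape) < 0.2] = np.nan
gamma_map, pair_counts = fft_semivariogram_map(z_demo)
```

Averaging `gamma_map` over all lags of similar length, weighted by `pair_counts`, would turn this map into a one-dimensional curve of semivariance against separation. The cost is set by a handful of FFTs rather than by a loop over every pair of pixels.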

In [None]:
# Subtract off radial trend
resid_Z_grid = gg.generate_residual_Z_grid(Z_data[0].data, Z_data[1].data, Z_data[0].header, metadata)

In [4]:
# Generate a semivariogram (with 50 pc bins, keeping all data for now)
semivariogram, separations = gg.fast_semivariogram(resid_Z_grid, Z_data[0].header, meta=metadata, bin_size=50)

AttributeError: The number of `values` elements must match the length of each `sample` dimension.

In [5]:
# Check which version of GeoGals is installed
gg.__version__

'0.1.4'

In [4]:
# Plot it

Looking at this plot, we can gather a wealth of information. Firstly, we see that the variance between data points increases as a function of their separation. This makes sense, because regions of the data that are closer to each other ought to be more correlated than pairs of data points that are farther apart. By examining the height of this graph at the smallest separation bin, we can estimate how much of the variance in our data comes from random effects with no spatial correlation, such as shot noise. This semivariogram flattens out at a distance of XXX - this is the size of the largest fluctuations in our field. Beyond this threshold, measurements become unreliable, as there are fewer pairs of data points at greater separations.
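
The quantities read off the plot above can also be estimated numerically. The sketch below uses simple heuristics chosen for illustration (they are assumptions, not a GeoGals method): the first bin as the uncorrelated-variance level, the median of the outer half of the curve as the plateau, and the first bin reaching 95% of that plateau as the flattening scale. The function name `summarise_semivariogram` is hypothetical.

```python
import numpy as np

def summarise_semivariogram(separations, semivariogram, flat_fraction=0.95):
    """Rough estimates of (nugget, plateau, flattening separation).

    Heuristic sketch for illustration; not part of the GeoGals API.
    """
    separations = np.asarray(separations, dtype=float)
    semivariogram = np.asarray(semivariogram, dtype=float)

    # Height of the smallest-separation bin: variance with no resolved spatial correlation
    nugget = semivariogram[0]

    # Plateau level: median of the outer half of the curve (robust to single noisy bins)
    plateau = np.nanmedian(semivariogram[len(semivariogram) // 2:])

    # First separation at which the curve reaches flat_fraction of the plateau
    reached = semivariogram >= flat_fraction * plateau
    flattening_sep = separations[np.argmax(reached)] if reached.any() else np.nan
    return nugget, plateau, flattening_sep

# Placeholder curve (made-up numbers, NOT measurements of NGC 1385)
demo_sep = np.arange(25.0, 2000.0, 50.0)
demo_gamma = 0.001 + 0.003 * (1.0 - np.exp(-demo_sep / 200.0))
nugget, plateau, flattening_sep = summarise_semivariogram(demo_sep, demo_gamma)
```

Because the first bin has a finite width, its height is better read as an upper limit on the uncorrelated variance. The plateau estimate also leans on the large-separation bins, which are the least reliable ones, so these numbers are a starting point for a proper model fit rather than a replacement for it.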

In short, the semivariogram reveals that much of the variance we see in this data comes from spatially correlated sources. In the next tutorial, we show how to fit a model to our data that takes these spatially correlated fluctuations into account.