# Investigation of geophysical sensor data to inform priors

We don't yet have a good idea of what constitutes a sensible set of priors for real data, so here I try to work out what is going on using assumptions that I hope are simple but robust.

In [None]:
import GPy
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
%matplotlib inline

## Noise and length scale characteristics for gravity and magnetism

We've been running with a fixed set of priors for gravity and magnetism, but in all fairness we have no idea what those should be. We know both are linear sensors that integrate over rock properties, with a 3-D sensitivity profile that gets broader with depth. So by fitting a GP to each, we get some idea of the noise, and a lower limit on the relevant length scale. Since the data are on a grid, we could also consider the autocorrelation.
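As a rough illustration of the autocorrelation idea, here is a minimal sketch that computes the normalised 2-D autocorrelation of a gridded field with an FFT. The function name `autocorr2d` and the synthetic 36 x 36 grid are assumptions for the example, not part of the analysis below. Note that the FFT version is circular (it treats the grid as periodic), so real survey data would probably need detrending or zero-padding first.

```python
import numpy as np

def autocorr2d(grid):
    """Normalised circular 2-D autocorrelation of a gridded field (hypothetical helper)."""
    z = grid - grid.mean()                 # remove the mean so lag 0 is the variance
    power = np.abs(np.fft.fft2(z)) ** 2    # power spectrum
    acf = np.fft.ifft2(power).real         # Wiener-Khinchin: inverse FFT of the power
    acf /= acf[0, 0]                       # normalise so the zero-lag value is 1
    return np.fft.fftshift(acf)            # move zero lag to the centre of the array

# Synthetic stand-in for a gridded survey: a smooth signal plus white noise
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:36, 0:36]
grid = np.sin(2 * np.pi * xx / 18.0) + 0.1 * rng.standard_normal((36, 36))

acf = autocorr2d(grid)
# Zero lag sits at index (18, 18) after fftshift on a 36 x 36 grid
print(acf[18, 18], acf[18, 27])
```

The width of the central peak gives a model-free handle on the length scale, and a sharp spike at zero lag on top of a smooth peak is the signature of white noise.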

This isn't really meant to be a Bayesian analysis. It is meant to give us an order-of-magnitude estimate of the noise, using a model that is flexible enough to respond to changes but that insists on smoothness, so we can pick off the delta-function (white-noise) component of the covariance.
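To make the "delta-function component" concrete, here is a small sketch of the covariance being assumed: a smooth Matern-3/2 term plus a noise variance that only contributes at zero separation. The function names and parameter values are made up for illustration; the Matern-3/2 form is the standard one, which I believe matches what `GPy.kern.Matern32` uses.

```python
import numpy as np

def matern32(r, variance=1.0, lengthscale=1.0):
    """Standard Matern-3/2 covariance as a function of separation r."""
    s = np.sqrt(3.0) * np.abs(r) / lengthscale
    return variance * (1.0 + s) * np.exp(-s)

def observed_cov(r, variance, lengthscale, noise_variance):
    """Smooth kernel plus a white-noise term that only appears at r == 0."""
    return matern32(r, variance, lengthscale) + noise_variance * (r == 0)

# Illustrative (assumed) parameter values, not fitted ones
r = np.array([0.0, 1e-6, 0.5, 2.0])
cov = observed_cov(r, variance=1.0, lengthscale=0.5, noise_variance=0.2)

# The jump between r = 0 and r -> 0 is the noise variance
nugget = cov[0] - cov[1]
print(cov, nugget)
```

The smoothness assumption is what lets the fit separate the two terms: anything that decorrelates faster than the kernel allows gets attributed to the noise variance.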

# Data dictionaries

In [None]:
dict_data_set1 = {
    'dir_data': '/Users/davidkohn/dev/obsidian/data/dataset1',
    'grav': {
        'fname': 'gravity_400m_Gascoyne.txt',
        'key_lat': 'Latitude',
        'key_lon': 'Longitude',
        'key_y': 'grid_code',
    },
    'mag': {
        'fname': 'mag_TMI_gascoyne.txt',
        'key_lat': 'Latitude',
        'key_lon': 'Longitude',
        'key_y': 'grid_code',
    },
}

dict_data_set2 = {
    'dir_data': '/Users/davidkohn/dev/obsidian/data/dataset2',
    'grav_north': {
        'fname': 'Gascoyne_North_Bouguer_gravity_400m_XYZ.txt',
        'key_lat': 'X',
        'key_lon': 'Y',
        'key_y': 'GASCOYNE_NORTH_1',
    },
    'grav_south': {
        'fname': 'Gascoyne_South_Bouguer_gravity_500m_XYZ.txt',
        'key_lat': '',
        'key_lon': '',
        'key_y': 'GASCOYNE_SOUTH_1',
    },
    'mag': {
        'fname': 'Bangemall_mag_125m_XYZ.txt',
        'key_lat': '',
        'key_lon': '',
        'key_y': 'MAG_PD',
    },
}

lt_val_grav = 0.05
lt_val_mag = 0.015
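For a sense of scale: `lt_val_grav` and `lt_val_mag` are compared against latitude and longitude below, so they are half-widths in degrees, and any fitted length scale will be in degrees too. A rough conversion to kilometres, assuming a spherical Earth with about 111.32 km per degree of latitude (the constant and the helper name are assumptions for this sketch):

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate value for a spherical Earth (assumption)

def half_width_km(half_width_deg, lat_deg):
    """Convert a half-width in degrees to (north-south, east-west) km at a latitude."""
    ns = half_width_deg * KM_PER_DEG_LAT
    ew = ns * math.cos(math.radians(lat_deg))  # meridians converge away from the equator
    return ns, ew

# Half-widths used in this notebook, evaluated at the latitude of the cropping centre
ns_grav, ew_grav = half_width_km(0.05, -24.85)
ns_mag, ew_mag = half_width_km(0.015, -24.85)
print(ns_grav, ew_grav, ns_mag, ew_mag)
```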

# Functions

In [None]:
msg_str0 = 'fpath: {}'
msg_str1 = '  Latitude min: {}\n  Latitude max: {}'
msg_str2 = '  Grid code min: {}\n  Grid code max: {}'
msg_str3 = '  Data shape: {}'
msg_str4 = '  X shape: {}\n  Y shape: {}'

def get_vars(dict_data_set, sub_key):
    # Look up the file path and column names for one sensor in a data dictionary
    dir_data = dict_data_set['dir_data']
    fname_data = dict_data_set[sub_key]['fname']
    fpath_data = os.path.join(dir_data, fname_data)
    key_lat = dict_data_set[sub_key]['key_lat']
    key_lon = dict_data_set[sub_key]['key_lon']
    key_y = dict_data_set[sub_key]['key_y']
    return fpath_data, key_lat, key_lon, key_y

def get_data(fpath):
    # Read the file with pandas defaults; the first row holds the column names
    print(msg_str0.format(fpath))
    data = pd.read_csv(fpath, header=0)
    return data

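def convert_to_xy(
    data, lt_val, key_lat
):
    # NOTE: despite the name, this does not convert coordinates. It crops the
    # data to a box of half-width lt_val (degrees) centred on latitude -24.85,
    # longitude 116.1.
    # NOTE: key_lon is not an argument; it is read from the notebook's global
    # scope, so it must be set (e.g. via get_vars) before this is called.
    msg_str = msg_str1.format(data[key_lat].min(), data[key_lat].max())
    print(msg_str)
    # a1. |lat + 24.85| < lt_val is |lat - (-24.85)| < lt_val, i.e. "within
    #     lt_val of the box centre"; likewise |lon - 116.1| < lt_val.
    # a2. presumably the crop keeps the number of points small enough for an
    #     exact GP fit to be practical.
    data = data[np.abs(data[key_lat] + 24.85) < lt_val]
    data = data[np.abs(data[key_lon] - 116.1) < lt_val]
    # Latitude range after cropping (previously mislabelled as "Grid code")
    msg_str = msg_str1.format(data[key_lat].min(), data[key_lat].max())
    print(msg_str)
    msg_str = msg_str3.format(data.shape)
    print(msg_str)
    return data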

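def run_gp(
    data, key_lat, key_lon, key_y
):
    # Inputs are (lat, lon) pairs with shape (n, 2); targets have shape (n, 1)
    X = np.array([data[key_lat], data[key_lon]]).T
    Y = np.array([data[key_y]]).T
    # Report the array shapes (this message was built but never printed before)
    msg_str = msg_str4.format(X.shape, Y.shape)
    print(msg_str)
    # Matern-3/2 kernel over the two input dimensions
    kernel = GPy.kern.Matern32(2)
    model = GPy.models.GPRegression(X, Y, kernel)
    model.optimize(messages=True)
    fig = plt.figure(figsize = (10, 10))
    f = model.plot()
    # The summary lists the fitted kernel variance, length scale and noise variance
    print(model)
    return X, Y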

# Dataset1: grav
This seems pretty weird: the gravity data seem to have a very long length scale and no obvious noise. But the contours show that there is some structure. Not sure what to make of that.

In [None]:
dict_data_set = dict_data_set1
sub_key = 'grav'
lt_val = lt_val_grav

In [None]:
fpath_data, key_lat, key_lon, key_y = get_vars(dict_data_set, sub_key)
data = get_data(fpath_data)
print(data.columns)
data = convert_to_xy(data, lt_val, key_lat)
X, Y = run_gp(data, key_lat, key_lon, key_y)

# Dataset1: mag
Magnetism, on the other hand, has at least some non-zero Gaussian noise. But surely the length scale is out of whack? And are those repeated points?

In [None]:
dict_data_set = dict_data_set1
sub_key = 'mag'
lt_val = lt_val_mag

In [None]:
fpath_data, key_lat, key_lon, key_y = get_vars(dict_data_set, sub_key)
data = get_data(fpath_data)
print(data.columns)
data = convert_to_xy(data, lt_val, key_lat)
X, Y = run_gp(data, key_lat, key_lon, key_y)

Oooh, looks like they are. In a way that's useful if the repeats are real, since in principle they give us the noise scale directly. But if they aren't real, it's not clear this would have worked.
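If the repeats are real, one way to read the noise scale off them is to pool the variance of the values that share a location. This is only a sketch: the helper name `noise_from_repeats` and the toy table are invented for illustration, and it assumes repeated points have exactly equal coordinates (real data might need rounding first).

```python
import pandas as pd

def noise_from_repeats(df, coord_cols, value_col):
    """Pooled standard deviation of values measured at identical coordinates."""
    groups = df.groupby(coord_cols)[value_col]
    counts = groups.count()
    variances = groups.var(ddof=1)         # NaN for locations seen only once
    repeated = counts > 1
    if not repeated.any():
        return None                        # no repeats, so no estimate
    dof = counts[repeated] - 1             # degrees of freedom per location
    pooled_var = (variances[repeated] * dof).sum() / dof.sum()
    return float(pooled_var ** 0.5)

# Toy data (made up): two repeated locations and one single observation
toy = pd.DataFrame({
    'Latitude': [-24.85, -24.85, -24.84, -24.84, -24.83],
    'Longitude': [116.10, 116.10, 116.10, 116.10, 116.10],
    'grid_code': [10.0, 12.0, 20.0, 20.0, 30.0],
})
sigma = noise_from_repeats(toy, ['Latitude', 'Longitude'], 'grid_code')
print(sigma)
```

Comparing this number with the square root of the noise variance from the GP fit would be a useful cross-check on both.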

In [None]:
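# looking at mag X, Y
# (assumes the cropped mag data hold exactly 36 x 36 = 1296 points)
# first coordinate (key_lat) down the first column of the reshaped array
dX0 = X[:,0].reshape(36,36)[:,0]
print(dX0)
# differences between neighbours; zeros would indicate repeated coordinates
print(dX0[1:] - dX0[:-1])
# second coordinate (key_lon) along the second row of the reshaped array
dX1 = X[:,1].reshape(36,36)[1,:]
print(dX1)
print(dX1[1:] - dX1[:-1])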
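The check above hard-codes a 36 x 36 layout. A sketch of the same idea that doesn't need to know the grid shape is to look at the distinct coordinate values along each axis. The helper name `axis_spacing`, the rounding tolerance, and the small demo grid are assumptions for illustration.

```python
import numpy as np

def axis_spacing(coords, decimals=6):
    """Distinct coordinate levels along one axis and the steps between them."""
    levels = np.unique(np.round(coords, decimals))  # rounding merges float jitter
    return levels, np.diff(levels)

# Made-up 4 x 3 grid standing in for the (lat, lon) columns of X
lat = np.repeat(np.linspace(-24.865, -24.835, 4), 3)
lon = np.tile(np.linspace(116.085, 116.115, 3), 4)
X_demo = np.column_stack([lat, lon])

lat_levels, lat_steps = axis_spacing(X_demo[:, 0])
lon_levels, lon_steps = axis_spacing(X_demo[:, 1])

# A point count equal to the product of the level counts is consistent with a
# full grid; a larger count would point to repeated locations
is_full_grid = len(lat_levels) * len(lon_levels) == len(X_demo)
print(lat_steps, lon_steps, is_full_grid)
```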

# Dataset2: grav

In [None]:
dict_data_set = dict_data_set2
sub_key = 'grav_north'
lt_val = lt_val_grav

In [None]:
fpath_data, key_lat, key_lon, key_y = get_vars(dict_data_set, sub_key)
data = get_data(fpath_data)
print(data.columns)
#data = convert_to_xy(data, lt_val, key_lat)
#X, Y = run_gp(data, key_lat, key_lon, key_y)