# Load and merge the surveys into one

Now comes the fun part. We need to load all the surveys and merge them into a single file/`Dataset`. To do that, we need to make sure:

* Metadata is consistent across all surveys and follows the [CF conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html).
* The `Dataset` is in a usable format.
* Coordinates are all WGS84 for easier manipulation.
* We can access the absolute gravity and do our own corrections.

This should be fun...

In [1]:
from pathlib import Path
import datetime
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
import pandas as pd
from tqdm import tqdm
import pyproj
import pooch

## Inspect one of the datasets

Load one of them and look at the data and metadata that is available.

In [2]:
data_dir = Path("..") / "data" 

In [3]:
xr.open_dataset(data_dir / "000a522322d3c7d2fd97bf91ba7179e6-P198362-point-gravity.nc")

## Load and filter surveys

Load all surveys (can take a little while).

In [4]:
all_surveys = [xr.open_dataset(fname) for fname in tqdm(list(data_dir.glob("*.nc")), ncols=100)]

100%|███████████████████████████████████████████████████████████| 1630/1630 [01:03<00:00, 25.66it/s]


Classify the surveys based on the reliability index (see the metadata in the file above).

In [5]:
reliability = [np.unique(s.reliab_index) for s in all_surveys]

In [6]:
really_bad = [np.any(r == 0) for r in reliability]
bad = [np.any(r == 2) for r in reliability]
not_good = [np.any(r == 3) for r in reliability]
print(f"Really bad: {sum(really_bad)}")
print(f"       Bad: {sum(bad)}")
print(f"  Not good: {sum(not_good)}")

Really bad: 0
       Bad: 32
  Not good: 109


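Remove the surveys that contain any data with a reliability index of 2 ("poorly controlled data which should be used cautiously"). No survey has an index of 0, so nothing else needs to be dropped.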

In [7]:
# Use a short loop variable so the outer `reliability` list isn't shadowed
surveys = [survey for survey, r in zip(all_surveys, reliability) if not np.any(r == 2)]
print(len(surveys))

1598

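Note that this filter works per survey, not per point: a single flagged observation removes every point in that survey. A toy sketch with made-up reliability arrays (the survey names and values are hypothetical) makes the behaviour explicit:

```python
import numpy as np

# Hypothetical reliability indices for three tiny "surveys" (not real data)
toy_surveys = {
    "A": np.array([4, 4, 5]),
    "B": np.array([4, 2, 4]),  # a single point with index 2
    "C": np.array([1, 3, 3]),
}

# Keep a survey only if none of its points has the rejected index
kept = [name for name, index in toy_surveys.items() if not np.any(index == 2)]
print(kept)  # ['A', 'C']
```

Survey B is dropped entirely even though two of its three points have an index of 4.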

Check the total number of observations left after filtering.

In [8]:
ndata_per_survey = [s.grav.size for s in surveys]
print(sum(ndata_per_survey))

1789365


## Select, convert, and merge

Select only the data fields we're interested in (location, ellipsoid heights, raw gravity, and accuracy measures). 
We'll also convert the datum from GDA94 to WGS84, which is easier to use with most applications.
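While we're at it, convert gravity and its accuracy to mGal (the division by 10 corresponds to a conversion from µm/s²) and store gravity and height as float32 to save space. Longitude and latitude are left as 64-bit floats.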

In [9]:
dims = ("point", )
gda_to_wgs = pyproj.Transformer.from_crs("epsg:4283", "epsg:4326", always_xy=True)
datasets = []
for survey in tqdm(surveys, ncols=100):
    # Transform coordinates to WGS84
    lon, lat, h = gda_to_wgs.transform(
        survey.longitude.values, 
        survey.latitude.values, 
        survey.ellipsoidinsthgt.values,
    )
    survey_id = np.full(survey.grav.shape, int(survey.attrs["survey_id"]), dtype=np.uint32)
    dataset = xr.Dataset(
        data_vars={
            "gravity": (dims, survey.grav.data.astype(np.float32) / 10),
            "gravity_accuracy": (dims, survey.gravacc.data / 10),
            "height_error": (dims, survey.ellipsoidinsthgterr.data),
            "reliability_index": (dims, survey.reliab_index.data.astype(np.uint8)),
            "survey_id": (dims, survey_id),
        },
        coords={
            "longitude": (dims, lon),
            "latitude": (dims, lat),
            "height": (dims, h.astype(np.float32)),
        },
    )
    datasets.append(dataset)

100%|███████████████████████████████████████████████████████████| 1598/1598 [00:48<00:00, 33.28it/s]

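One side effect of storing gravity as float32 is worth keeping in mind: a 32-bit float carries about 7 significant digits, so values near 980,000 mGal can only be represented in steps of 0.0625 mGal. A quick check with NumPy, using a made-up gravity value:

```python
import numpy as np

# Made-up absolute gravity value in mGal, similar in size to the survey data
gravity = np.float32(979314.81)

# Distance to the next representable float32 value at this magnitude
resolution = np.spacing(gravity)
print(resolution)  # 0.0625

# Two values closer together than the resolution collapse to the same float32
same = np.float32(979314.80) == np.float32(979314.81)
print(same)  # True
```

This step is coarser than the two decimal places used later in the CSV export, so keeping gravity as float64 would be the safer choice if finer resolution matters.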

Now we can merge the datasets into one and set the metadata for the entire collection.

In [10]:
data = xr.concat(datasets, "point")

data.attrs = {
    "Conventions": "CF-1.8",
    "title": "Compilation of gravity ground surveys for Australia",
    "institution": "Commonwealth of Australia (Geoscience Australia)",
    "crs": "WGS84",
    "source": (
        "Compiled from the collection by Wynne, P. 2018. "
        "NetCDF Ground Gravity Point Surveys Collection. Geoscience Australia, Canberra. "
        "https://doi.org/10.26186/5c1987fa17078 "
    ),    
    "uuid": "d6e3c3a8-5a20-4d8b-afca-e55f754e4ce1",
    "license": "Creative Commons Attribution 4.0 International Licence",
    "references": "https://doi.org/10.6084/m9.figshare.13643837",
    "history": (
        "2021-08-24 (v2.0): "
        "Redownloaded and compiled the collection to add the survey ID for each point. "
        "File P200441-point-gravity.nc was not accessible and so is left out of this version. "
        "2020-10-28 (v1.0): "
        "Data with reliability index of 0 or 2 were removed from the compilation. "
        "Coordinates were converted to WGS84. "
        "Gravity was converted to mGal. "
        "Only absolute gravity, position, ellipsoid height and error measures were kept. "
        "Metadata was edited to follow CF conventions more closely. "
    ),     
}
data.gravity.attrs = {
    "long_name": "gravity acceleration",
    "units": "mGal",
    "actual_range": (data.gravity.values.min(), data.gravity.values.max()),
    "ancillary_variables": "gravity_accuracy reliability_index",
    "description": "magnitude of the gravity acceleration vector",
}
data.gravity_accuracy.attrs = {
    "long_name": "accuracy of gravity acceleration",
    "units": "mGal",
    "actual_range": (data.gravity_accuracy.values.min(), data.gravity_accuracy.values.max()),
    "description": "accuracy of the magnitude of the gravity acceleration vector",
}
data.longitude.attrs = {
    "long_name": "longitude",
    "standard_name": "longitude",
    "units": "degrees_east",
    "actual_range": (data.longitude.values.min(), data.longitude.values.max()),
}
data.latitude.attrs = {
    "long_name": "latitude",
    "standard_name": "latitude",
    "units": "degrees_north",
    "actual_range": (data.latitude.values.min(), data.latitude.values.max()),
}
data.height.attrs = {
    "long_name": "geometric height",
    "standard_name": "height_above_reference_ellipsoid",
    "units": "m",
    "actual_range": (data.height.values.min(), data.height.values.max()),
    "description": "height above the WGS84 ellipsoid",
    "ancillary_variables": "height_error",
}
data.height_error.attrs = {
    "long_name": "geometric height error",
    "units": "m",
    "actual_range": (data.height_error.values.min(), data.height_error.values.max()),
    "description": "error in the height above the WGS84 ellipsoid",
}
data.survey_id.attrs = {
    "long_name": "survey identification number",
    "description": "unique numerical identifier of the survey to which each point belongs",
}
data.reliability_index.attrs = {
    "long_name": "station reliability",
    "standard_name": "status_flag",    
    "description": "estimate of gravity station reliability",
    "flag_values": np.arange(10, dtype=np.int8),
    "flag_meanings": (
        "unreliable_data_which_should_not_be_used_pending_remedial_action "
        "insufficient_information_to_accurately_classify_but_still_regarded_as_reliable_data "
        "poorly_controlled_data_which_should_be_used_cautiously "
        "data_with_weak_gravity_position_and_elevation_control "
        "data_with_moderate_gravity_position_and_elevation_control "
        "documented_gravity_ties_levelled_elevations_and_accurately_scaled_positions "
        "a_point_occupied_once_with_well_defined_position_and_elevation "
        "multiple_occupations_at_a_point_with_well_defined_position_and_elevation "
        "multiple_measurements_at_a_point_with_accurate_position_and_elevation "
        "data_measured_numerous_times_with_absolute_geodetic_or_first_order_precision"
        ),
}

# Have a look at the compiled Dataset
data

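The `flag_values` and `flag_meanings` attributes follow the CF flag convention: the meanings form a single space-separated string whose entries line up one-to-one with the values. A minimal sketch of how a reader of the file could turn them into a lookup table, using only the first three entries from above:

```python
# First three flag values and meanings, copied from the attributes above
flag_values = [0, 1, 2]
flag_meanings = (
    "unreliable_data_which_should_not_be_used_pending_remedial_action "
    "insufficient_information_to_accurately_classify_but_still_regarded_as_reliable_data "
    "poorly_controlled_data_which_should_be_used_cautiously"
)

# Split on whitespace and pair each meaning with its value
meanings = dict(zip(flag_values, flag_meanings.split()))
print(meanings[2].replace("_", " "))
# poorly controlled data which should be used cautiously
```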
## Save to netCDF

Export this collection to a file. We'll use integer encoding of some variables to save storage space. 
The errors span a small range, so we can scale them and store them as 16-bit integers instead of 32-bit floats. 
The horizontal coordinates can be stored as 32-bit integers instead of 64-bit floats with roughly centimeter-level accuracy (a step of 1e-7 degrees is about 1.1 cm on the ground). 
Other variables can't be easily compressed this way given their range so we'll leave them as is.

In [11]:
output_nc = Path("..") / "australia-ground-gravity.nc"

In [12]:
data.to_netcdf(
    output_nc, 
    format="NETCDF4",
    encoding={
        "gravity_accuracy": {'dtype': 'int16', 'scale_factor': 0.0001, '_FillValue': -9_999}, 
        "height_error": {'dtype': 'int16', 'scale_factor': 0.001, '_FillValue': -9_999},
        # Roughly cm level accuracy is stored for the horizontal coordinates
        "latitude": {'dtype': 'int32', 'scale_factor': 1e-07, '_FillValue': -999_999_999},
        "longitude": {'dtype': 'int32', 'scale_factor': 1e-07, '_FillValue': -999_999_999},
    },    
)

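To see what this encoding does to the values, here is a rough NumPy sketch of scale-factor packing. The `pack` and `unpack` helpers are hypothetical stand-ins for what happens on write and read, not the actual xarray implementation, and the latitudes are made up. It also prints the largest values that the two int16 encodings can hold, since anything larger would not fit:

```python
import numpy as np


def pack(values, scale_factor, dtype):
    """Quantize floats to integers: stored = round(value / scale_factor)."""
    limits = np.iinfo(dtype)
    stored = np.round(np.asarray(values, dtype="float64") / scale_factor)
    if stored.min() < limits.min or stored.max() > limits.max:
        raise OverflowError("value does not fit in the chosen integer type")
    return stored.astype(dtype)


def unpack(stored, scale_factor):
    """Recover approximate floats: value = stored * scale_factor."""
    return stored.astype("float64") * scale_factor


# Made-up latitudes in degrees with more digits than the encoding keeps
latitude = np.array([-31.11712934, -24.21855441, -43.50000006, -10.25])
stored = pack(latitude, 1e-7, np.int32)
max_error = np.abs(unpack(stored, 1e-7) - latitude).max()

# The rounding error is at most half a step of 1e-7 degrees
print(max_error <= 0.5e-7 + 1e-12)  # True

# One step in centimeters, taking one degree as roughly 111 km
print(f"{1e-7 * 111e3 * 100:.2f} cm")  # 1.11 cm

# Largest values representable with the int16 encodings used above
print(f"{np.iinfo(np.int16).max * 0.0001:.4f} mGal")  # 3.2767 mGal
print(f"{np.iinfo(np.int16).max * 0.001:.3f} m")  # 32.767 m
```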
Get the SHA256 and MD5 hashes of the file for reference.

In [13]:
print(f"sha256:{pooch.file_hash(output_nc)}")
print(f"md5:{pooch.file_hash(output_nc, alg='md5')}")

sha256:d89740b987fc9f9b26bd52b34588521a5f7aaacbbfa4e8ae9e4a93ba413b8f0e
md5:16c94a792003714efee2bdb4f3181d3a

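Anyone downloading the file can compare these values against a hash computed locally. The sketch below shows the idea using only the standard library's `hashlib`, with a hypothetical helper `sha256_of_file` and a throwaway temporary file standing in for the real data:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Throwaway file standing in for the downloaded dataset
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "example.bin"
    example.write_bytes(b"not the real gravity data")
    computed = sha256_of_file(example)
    expected = hashlib.sha256(b"not the real gravity data").hexdigest()

print(computed == expected)  # True
```

For the real file, `expected` would be the `sha256` value printed above.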

Print the file size.

In [14]:
print(f"{output_nc.name} {output_nc.stat().st_size / 1e6} MB")

australia-ground-gravity.nc 44.7534 MB


Load the data back in to check if saving and encoding worked as expected.

In [15]:
data = xr.load_dataset(output_nc)
data

## Save to CSV

Some people may prefer the CSV format, which is plain text and easier to load in other software such as GMT or MATLAB.

In [16]:
table = data.to_dataframe()
# Set the export precision for each column by converting them to strings
decimal_places = {
    "gravity": 2,
    "gravity_accuracy": 2,
    "height_error": 2,
    "longitude": 8,
    "latitude": 8,
    "height": 2,
}
for column, places in decimal_places.items():
    table[column] = table[column].map(f"{{:.{places}f}}".format)
table

point,gravity,gravity_accuracy,height_error,reliability_index,survey_id,longitude,latitude,height
0,979314.81,0.20,10.06,1,195951,138.62049700,-31.11712900,515.11
1,979322.00,0.20,10.06,1,195951,138.61269700,-31.12832900,470.83
2,979303.88,0.20,10.06,1,195951,138.75519500,-31.09402700,541.83
3,979306.88,0.20,10.06,1,195951,138.76489500,-31.08842700,526.33
4,979320.62,0.20,10.06,1,195951,138.82519400,-31.08632700,462.98
...,...,...,...,...,...,...,...,...
1789360,978682.50,0.20,6.16,4,196111,133.22959900,-24.21855400,602.45
1789361,978698.88,0.20,6.16,4,196111,135.24291400,-23.47853600,457.36
1789362,978721.31,0.20,6.16,4,196111,132.41628000,-24.56023200,565.01
1789363,978956.88,0.20,6.16,4,196111,139.11455500,-25.49683800,103.87

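Formatting each column as a string pins the number of decimals that end up in the file, regardless of how the writer would format floats by default. This matters for float32 values, which are not exactly the decimal numbers they display as. A small illustration with a made-up height value:

```python
import numpy as np

height = np.float32(515.11)  # made-up height in metres

# Widening to a Python float exposes the binary rounding of float32
print(float(height))  # 515.1099853515625

# Explicit formatting keeps only the digits we trust
formatted = "{:.2f}".format(height)
print(formatted)  # 515.11
```

The trade-off is that the formatted columns hold strings rather than numbers, so this step belongs right before the export.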

In [17]:
output_csv = Path("..") / "australia-ground-gravity.csv"

In [18]:
table.to_csv(output_csv, index=False, header=True)

Get the SHA256 and MD5 hashes of the file for reference.

In [19]:
print(f"sha256:{pooch.file_hash(output_csv)}")
print(f"md5:{pooch.file_hash(output_csv, alg='md5')}")

sha256:d0d1b8c578cb02325a92747b3806f677b43b2e8a8f72f6de9bff0ba092e0877b
md5:d47fef200d92c682dc8b63fe31b80364


Print the file size.

In [20]:
print(f"{output_csv.name} {output_csv.stat().st_size / 1e6} MB")

australia-ground-gravity.csv 110.63831 MB


Read it back in to check that saving worked properly.

In [21]:
pd.read_csv(output_csv)

,gravity,gravity_accuracy,height_error,reliability_index,survey_id,longitude,latitude,height
0,979314.81,0.2,10.06,1,195951,138.620497,-31.117129,515.11
1,979322.00,0.2,10.06,1,195951,138.612697,-31.128329,470.83
2,979303.88,0.2,10.06,1,195951,138.755195,-31.094027,541.83
3,979306.88,0.2,10.06,1,195951,138.764895,-31.088427,526.33
4,979320.62,0.2,10.06,1,195951,138.825194,-31.086327,462.98
...,...,...,...,...,...,...,...,...
1789360,978682.50,0.2,6.16,4,196111,133.229599,-24.218554,602.45
1789361,978698.88,0.2,6.16,4,196111,135.242914,-23.478536,457.36
1789362,978721.31,0.2,6.16,4,196111,132.416280,-24.560232,565.01
1789363,978956.88,0.2,6.16,4,196111,139.114555,-25.496838,103.87
