Compare two sets of geodata, reporting on various types of differences

bcgov/FIT_changedetector

FIT Change Detector

Lifecycle:Experimental

GeoBC Foundational Information and Technology (FIT) Section tool for reporting on changes to geodata over time.

Installation

GDAL and GeoPandas are required. If both are already installed in your environment, install with pip:

git clone git@github.com:bcgov/FIT_changedetector.git
cd FIT_changedetector
pip install .

For systems where GDAL and GeoPandas are not already available, installing geopandas with conda as per the guide is likely the best option.

Usage

Python module

import geopandas
import fit_changedetector as fcd

# read the data
df_a = geopandas.read_file(in_file_a, layer=layer_a)
df_b = geopandas.read_file(in_file_b, layer=layer_b)

# compare the two dataframes
diff = fcd.gdf_diff(
    df_a,
    df_b,
    <primary_key>,
    fields=<fields_to_compare>,
    precision=<precision>,
    suffix_a="a",
    suffix_b="b",
)

The function gdf_diff returns a dictionary with standard keys and geodataframes holding the corresponding records. Dictionary keys:

key                  description
NEW                  additions
DELETED              deleted records
UNCHANGED            unchanged records
MODIFIED_BOTH        records where attribute columns and geometries have changed
MODIFIED_ATTR        records where attribute columns have changed but geometries have not changed
MODIFIED_GEOM        records where geometries have changed but attribute columns have not
MODIFIED_ALL         not currently implemented
ALL_CHANGES          not currently implemented
MODIFIED_BOTH_OBSLT  not currently implemented
MODIFIED_ATTR_OBSLT  not currently implemented
MODIFIED_GEOM_OBSLT  not currently implemented
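
To make the six implemented categories concrete, here is a minimal standard-library sketch of how records might be bucketed by primary key. It is an illustration only and not the implementation used by gdf_diff: the function name classify_changes is hypothetical, geometries are stood in for by plain strings, and it assumes that dataset A is the earlier state (so keys found only in B count as NEW).

```python
from typing import Any, Dict, List, Tuple

# Each record is (attributes, geometry). Geometry is any comparable value here,
# for example a WKT string. This is a simplification for illustration.
Record = Tuple[Dict[str, Any], Any]


def classify_changes(
    a: Dict[str, Record], b: Dict[str, Record]
) -> Dict[str, List[str]]:
    """Bucket primary keys into the six implemented change categories.

    Assumption: `a` is the earlier dataset and `b` the later one.
    """
    result: Dict[str, List[str]] = {
        "NEW": [],
        "DELETED": [],
        "UNCHANGED": [],
        "MODIFIED_BOTH": [],
        "MODIFIED_ATTR": [],
        "MODIFIED_GEOM": [],
    }
    for key in sorted(set(a) | set(b)):
        if key not in a:
            result["NEW"].append(key)
        elif key not in b:
            result["DELETED"].append(key)
        else:
            attrs_changed = a[key][0] != b[key][0]
            geom_changed = a[key][1] != b[key][1]
            if attrs_changed and geom_changed:
                result["MODIFIED_BOTH"].append(key)
            elif attrs_changed:
                result["MODIFIED_ATTR"].append(key)
            elif geom_changed:
                result["MODIFIED_GEOM"].append(key)
            else:
                result["UNCHANGED"].append(key)
    return result


if __name__ == "__main__":
    # Made-up records, keyed by primary key
    parks_a = {
        "1": ({"park_name": "Oak Park"}, "POINT (0 0)"),
        "2": ({"park_name": "Elm Park"}, "POINT (1 1)"),
    }
    parks_b = {
        "2": ({"park_name": "Elm Green"}, "POINT (1 1)"),
        "3": ({"park_name": "Fir Park"}, "POINT (2 2)"),
    }
    print(classify_changes(parks_a, parks_b))
```

A real comparison also has to decide when two geometries count as equal, which is where the precision argument comes in; plain equality is used here only to keep the sketch short.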

Schemas for records contained in NEW, DELETED, UNCHANGED are as per the source data. Schemas for records contained in the MODIFIED keys include values from each input source. For example, these are some "modified attributes" records, with "_a" suffix for values from the primary dataset, and "_b" suffix for values from the secondary dataset:

>>> diff["MODIFIED_ATTR"]
            park_name_a               park_name_b           parkclasscode_a parkclasscode_b geometry
fcd_load_id                                                                                                                                                             
5           Mars Street Park          Jupiter Street Park   NaN             NaN             MULTIPOLYGON (((1196056.257 385205.986, 119607...
6           Mayfair Blue              Mayfair Green         BL              GRN             MULTIPOLYGON (((1195089.488 384997.246, 119508...
7           Quadra Heights Playground                       NaN             NaN             MULTIPOLYGON (((1195238.681 384925.001, 119527...

Because the primary keys are used as the dataframe's index, obtaining the values requires an extra step (rather than referencing the column name):

>>> diff["MODIFIED_ATTR"].index.array.tolist()
['5', '6', '7']
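
Since the result is a plain dictionary, a quick summary of a comparison can be built by taking len() of each value, which for a geodataframe is its number of rows. The helper below is a hypothetical sketch and not part of fit_changedetector; the None check is only a guess at how the keys marked "not currently implemented" might be represented.

```python
from typing import Dict, Mapping, Optional, Sized


def count_changes(diff: Mapping[str, Optional[Sized]]) -> Dict[str, int]:
    """Return the record count for each key that holds at least one record.

    Works with any mapping whose values support len(), such as the
    geodataframes returned by a comparison. Values that are None are
    skipped (an assumption about unimplemented keys).
    """
    counts: Dict[str, int] = {}
    for key, records in diff.items():
        if records is not None and len(records) > 0:
            counts[key] = len(records)
    return counts
```

With a real result this would be called as count_changes(diff).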

CLI

$ changedetector --help
Usage: changedetector [OPTIONS] COMMAND [ARGS]...

Options:
--version  Show the version and exit.
--help     Show this message and exit.

Commands:
add-hash-key  Read input data, compute hash, write to new file
compare       Compare two datasets

$ changedetector add-hash-key --help
Usage: changedetector add-hash-key [OPTIONS] IN_FILE OUT_FILE

Read input data, compute hash, write to new file

Options:
--in-layer TEXT           Name of layer to add hashed primary key
-nln, --out-layer TEXT    Output layer name
-hk, --hash-key TEXT      Name of new column containing hashed data
-d, --drop-null-geometry  Drop records with null geometry
-hf, --hash-fields TEXT   Comma separated list of fields to include in the
                            hash (not including geometry)
--crs TEXT                Coordinate reference system to use when hashing
                            geometries (eg EPSG:3005)
-v, --verbose             Increase verbosity.
-q, --quiet               Decrease verbosity.
--help                    Show this message and exit.

$ changedetector compare --help
Usage: changedetector compare [OPTIONS] IN_FILE_A IN_FILE_B

Compare two datasets

Options:
--layer-a TEXT            Name of layer to use within in_file_a
--layer-b TEXT            Name of layer to use within in_file_b
-f, --fields TEXT         Comma separated list of fields to compare (do not
                            include primary key)
-o, --out-path PATH       Output path
-pk, --primary-key TEXT   Comma separated list of primary key column(s),
                            common to both datasets
-hk, --hash-key TEXT      Name of new column to add as hash key
-hf, --hash-fields TEXT   Comma separated list of fields to include in the
                            hash (in addition to geometry)
-p, --precision FLOAT     Coordinate precision for geometry hash and
                            comparison. Default=0.01
-a, --suffix-a TEXT       Suffix to append to column names from data source
                            A when comparing attributes
-b, --suffix-b TEXT       Suffix to append to column names from data source
                            B when comparing attributes
-d, --drop-null-geometry  Drop records with null geometry
--crs TEXT                Coordinate reference system to use when hashing
                            geometries (eg EPSG:3005)
-v, --verbose             Increase verbosity.
-q, --quiet               Decrease verbosity.
--help                    Show this message and exit.

Examples:

Compare the test datasets using their known primary key:

$ changedetector compare -v \
    tests/data/parks_a.geojson \
    tests/data/parks_b.geojson \
    -pk fcd_load_id 

Compare the test datasets, using a hash of geometry and the column park_name as synthetic primary key, written to new_hash_column:

$ changedetector compare -v \
    tests/data/parks_a.geojson \
    tests/data/parks_b.geojson \
    -hf park_name \
    -hk new_hash_column

Output is always written to a new file geodatabase named changedetector.gdb, in the folder specified by --out-path (defaulting to the current working directory).
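
The synthetic primary key in the second example above, a hash of geometry plus selected columns, can be pictured with the standard-library sketch below. This is not the hashing scheme changedetector actually uses: the function name synthetic_key, the choice of SHA-256, and the way coordinates are snapped to the precision grid are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Iterable, Sequence, Tuple


def synthetic_key(
    coords: Iterable[Tuple[float, float]],
    values: Sequence[Any],
    precision: float = 0.01,
) -> str:
    """Hash a geometry's coordinates together with some attribute values.

    Coordinates are converted to integer grid cells of size `precision`,
    so differences smaller than the grid usually do not change the key.
    """
    snapped = [(round(x / precision), round(y / precision)) for x, y in coords]
    payload = repr((snapped, [str(v) for v in values]))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Two nearly identical points with the same name share a key
    key_1 = synthetic_key([(10.001, 20.002)], ["Mars Street Park"])
    key_2 = synthetic_key([(10.002, 20.001)], ["Mars Street Park"])
    print(key_1 == key_2)
```

One consequence of this kind of key is that any edit to a hashed column or any geometry change larger than the precision produces a different key, so such a record would be matched as one deletion plus one addition rather than as a modification.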

Development and testing

Presuming that GDAL is already installed to your system:

$ git clone git@github.com:bcgov/FIT_changedetector.git
$ cd FIT_changedetector
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $ pip install -e ".[test]"
(.venv) $ py.test
