Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow grid_to_table to take DataArrays #235

Merged
merged 2 commits into from Mar 6, 2020
Merged

Conversation

leouieda
Copy link
Member

@leouieda leouieda commented Mar 2, 2020

Previously would fail because we rely on getting the data variable names
from the Dataset. Need to check if it's a Dataset (has the data_vars
attribute) and set the data column name to differently if it's not.
The data column name will be DataArray.name or "scalars" (if it's None).

Fixes #225

Reminders:

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and the base __init__.py file for the package.
  • Write detailed docstrings for all functions/classes/methods. It often helps to design better code if you write the docstrings first.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.
  • Add your full name, affiliation, and ORCID (optional) to the AUTHORS.md file (if you haven't already) in case you'd like to be listed as an author on the Zenodo archive of the next release.

Previously would fail because we rely on getting the data variable name
from the Dataset. Need to check if it's a Dataset (has the `data_vars`
attribute) and set the data column name to "scalars" if it's not.

Fixes #225
@leouieda leouieda merged commit 1725ef9 into master Mar 6, 2020
@leouieda leouieda deleted the grid_to_table_dataarray branch March 6, 2020 11:27
jessepisel added a commit to jessepisel/verde that referenced this pull request Oct 28, 2020
* Bug fix in grid_to_table for wrong order of Dataset coords (fatiando#229)

We were relying on the order of the coords attribute of the data set to meshgrid 
the coordinates. This is wrong because xarray takes the coordinate order from 
dims instead, which is what we should also do. Took this chance to refactor the 
code and included a test that would have caught the bug.

* Update files from fatiando/contributing (fatiando#233)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@91699a0

* Use newer Mac on Azure Pipelines (fatiando#234)

Changed from 10.13 (High Sierra) to 10.15 (Catalina) because Azure will
remove the 10.13 VMs soon.

* Allow grid_to_table to take DataArrays (fatiando#235)

Previously would fail because we rely on getting the data variable name
from the Dataset. Need to check if it's a Dataset (has the `data_vars`
attribute) and set the data column name to "scalars" if it's not.

Fixes fatiando#225

* Add function for rolling windows on irregular data (fatiando#236)

The `rolling_window` function returns the indices of points falling on
rolling/moving windows. It works for irregularly sampled data and is a
nice complement to `xarray.DataArray.rolling`  (which only works on
grids). Uses a KDTree to select points but is not compatible with
pykdtree (it's missing the method we need).

* Add function to split data on an expanding window (fatiando#238)

For selecting data in windows of various size around a center point.
Sizes do not have to be increasing necessarily. Returns only the indices
of points falling inside each window.

* Explicitly set PR trigger for all branches on Azure (fatiando#244)

For some reason, Azure Pipelines CI wasn't starting up PRs, only on
master and tags. Explicitly setting a trigger for PRs against any branch
seems to work.

* Fix bug for 3+ coordinates in windowing functions (fatiando#243)

The `rolling_window` and `expanding_window` functions fail if given more 2 coordinates.
If we had full support for N-D coordinates, they could be N-D windowing functions
but until that time, they will have to be 2D (horizontal) only functions. Add test cases using
`extra_coords` in `grid_coordinates` to generate a third coordinate.

Fixes fatiando#242

* Generalize coordinate systems in BaseGridder docs (fatiando#240)

Make docstrings of `BaseGridder` more general, without specifying
a coordinate system. Add `dims` as a class variable that will be used by
default on `grid`, `scatter` and `profile` methods.

* Add a convex hull masking function (fatiando#237)

Use scipy's Delaunay triangulation to determine points that are outside
the convex hull of the data points. Useful to mask out wildly
extrapolated points from grids. It's easier to use than `distance_mask`
because it doesn't require distance calculations (and the associated
projections). Use the new function in some examples and tutorials.
Normalize the coordinates prior to calculations to avoid errors
with floating point conversion when triangulating (as suggested by 
Ari Hartikainen).

Fixes fatiando#126

* Fix bug with profile distances being unprojected (fatiando#231)

`BaseGridder.profile` has the option to apply a projection to the
coordinates before predicting so we can pass geographic coordinates to
cartesian gridders. In these cases, the distance along the profile is
calculated by `profile_coordinates` with the unprojected coordinates (in
the geographic case it would be degrees). The profile point calculation
is also done assuming that coordinates are Cartesian, which is clearly
wrong if inputs are longitude and latitude. To fix this, project the input
points prior to passing to `profile_coordinates`. This means that the 
distances are Cartesian and generation of profile points is also 
Cartesian (as is assumed by `profile_coordinates`). The generated 
coordinates are projected back so that the user gets longitude and 
latitude but distances are still projected Cartesian. Add a warning in the
docstring of the modified behaviour due to this bug and a section in the
tutorial on projections.

* Update TravisCI configuration to new formats (fatiando#247)

Travis started warning that some options in the config were deprecated.
Update them to the new format to avoid crashing builds later on.

* Add function to project 2D gridded data (fatiando#246)

The `project_grid` function transforms a grid to the given projection. 
It re-samples the data using `ScipyGridder` (by default) and runs a 
blocked mean (optional) to avoid aliasing when the points aren't evenly 
distributed in the projected coordinates (like in polar projections). 
Applies a `convexhul_mask` to the grid always to avoid extrapolation to 
points that had no original data (this is done by scipy for linear and 
cubic interpolation already). The function only works for 
`xarray.DataArray`s (not `Dataset`) because I couldn't find an easy way 
to guarantee that a variables have the same dimensions. Move the new 
function and `project_region` to the new module `verde/projections.py`.

* Changelog entry for v1.4.0 (fatiando#248)

New minor release with new features and bug fixes. No breaking changes
or deprecations.

* Fix broken TravisCI deploy (fatiando#250)

The docs and configuration validation tool had recommended using
`cleanup: false` instead of `skip_cleanup: true` when deploying. Turns
out Travis decided to ignore the new parameter and our Github Pages
deploy was broken since the docs were deleted before we could deploy.
Revert back to the old argument until Travis really removes it.

* Add a block version of ShuffleSplit cross-validator (fatiando#251)

Cross-validation of spatial data with random splits can often lead to 
overestimation of accuracy (see https://doi.org/10.1111/ecog.02881 
for a nice overview). To account for this, the splits can be done by 
spatial blocks. In this case, the data are partitioned into blocks and 
blocks are split randomly. That guarantees some independence of 
measurements between blocks assigned to test/train sets. This 
change adds the `BlockShuffleSplit` cross-validator which does the 
random splitting of blocks in a scikit-learn compatible class. It can be 
used with `verde.cross_val_score` like any other scikit-learn cross-
validator. In fact, it can be used entirely outside of Verde for any type 
of machine learning on spatial data. The class takes care of balancing 
the random splits so that the right amount of train/test data are 
included in each (since blocks can have wildly different numbers of 
points).

* Add optional argument to least_squares to copy Jacobian matrix (fatiando#255)

Add new copy_jacobian optional argument to least_squares to avoid
scaling Jacobian matrix inplace. Add warning on its docstring warning
about memory consumption when copy_jacobian is True. Add test function
for this new feature.

* Update files from fatiando/contributing (fatiando#256)

Changes have been made in https://github.com/fatiando/contributing to:
MAINTENANCE.md
Update the copies in this repository to match.
See fatiando/community@340dafd

* Add a blocked option for train_test_split (fatiando#253)

Add option `blocked=False` to `verde.train_test_split` to split along 
blocks instead of randomly. Uses `BlockShuffleSplit` if `True` instead 
of the scikit-learn `ShuffleSplit`. Add a gallery example comparing 
the two options.

* Fix typo in README contributing section (fatiando#258)

Equality -> Equally

* Pin pylint to 2.4.* (fatiando#259)

Version 2.5.0 introduced a bug with the multiprocessing option. This
happens often enough that we should have pylint pinned and only upgrade
when we choose to do it.

* Update tests to latest release of Scikit Learn (fatiando#267)

Change the expected representation of any gridder based on the new
default behaviour of Scikit Learn. After 0.23, Scikit Learn only shows
those parameters whose default value has been changed when giving
a string representation of the gridder. See
scikit-learn/scikit-learn#17061 and its changelog for more information.

* Gridders apply projections using only the first two coordinates (fatiando#264)

Create a private function used by BaseGridder that applies projections only on the first two
elements of the coordinates, but returns the whole set of coordinates, replacing the first two with their projected versions. Add a test function that passes extra_coords to include more than two coordinates when predicting through the grid, scatter and profile methods.

* Remove "blocked" argument from train_test_split (fatiando#257)

It was a bit redundant since we have to specify `shape` or `spacing`
along with it. So we can just ask for `spacing` or `shape` and assume
`blocked=True` if they are provided. Also fix the missing description in
the gallery example and make slight changes to its wording.

Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com>

* Add extra_coords on BaseGridder generated objects (fatiando#265)

After predicting through the profile, scatter and grid methods,
BaseGridder returns generated objects like xr.Dataset or pd.DataFrame.
Any extra coordinate passed through the extra_coords keyword argument
is now included on those objects as new columns or non-dimension
coordinates (depending on the generated object).

* Add a blocked K-Fold cross-validator (fatiando#254)

The `BlockKFold` class splits the data into spatial blocks and does
K-fold cross-validation splits on the blocks. Refactor code from 
`BlockShuffleSplit` into a base class that can be shared with the
new class. Folds are balanced by making splits based on the number
of data points in each fold instead of the number of blocks.

* Pin Cartopy to 0.17.* until we sort out problems (fatiando#270)

For some reason, the coast lines features are plotting on the entire
figure instead of only the land or ocean. Might be an error on our side.
For now, pin cartopy on CI to fix our docs.

* Changelog entry for v1.5.0 (fatiando#271)

Includes the new cross-validators and fixes to our base classes to
accommodate Harmonica.

* Allow specifing the scoring function in cross_val_score (fatiando#273)

Currently, `cross_val_score` only uses the estimator's `.score` method (which
is R² for all our gridders). But sometimes we want to use a different metric,
like MSE or MAE. These changes allow passing in any scikit-learn compatible
scorer through a new `scoring` parameter. This can be a string or callable,
like in scikit-learn's `cross_val_score`. Needed to delegate score computation
to a separate function in `verde.base.utils` to reuse in `cross_val_score` and
a dummy class to be able to use the scikit-learn scorers correctly.

* Format the doc/conf.py file with Black (fatiando#275)

Ran black on `doc/conf.py` and added it to the `black` and `flake8` runs
in the main Makefile. Removed the setting of `sys.path` since we're
installing the packaged in edit mode so it should be importable without
that hack.

* Require Black >=20.8b1 (fatiando#284)

After version 19, black changed the formatting slightly in a few files.
This updates the format and sets a minimum version of black in the
development environment and the CI.

* Fix Cartopy issue and require >= 0.18 (fatiando#283)

Unpin Cartopy because Cartopy 0.17.0 has a compatibility issue with Matplotlib > 3.3.0.
Remove the unneeded tight_layout that was causing errors.
Replace the filled polygons used for plotting land and ocean for the coastlines.

* Fix typos on docstrings (fatiando#281)

Fix typo on description of data_names parameters on BaseGridder.grid
method. Fix typo on description of BaseGridder.project_coordinates
method.

* Raise errors for invalid rolling windows arguments (fatiando#280)

Raise a specific error if no spacing or shape is passed to
rolling_window(). Raise a specific error if window size is larger than
the smaller dimension of the region. Add tests for these new methods
that include matching error messages to identify which error has been
raised.

* Remove configuration files for unused CI (fatiando#292)

We no longer use Stickler, Codacy, or CodeClimate. So there is no reason
to keep these files around.

* Update files from fatiando/contributing (fatiando#294)

Changes have been made in https://github.com/fatiando/contributing to:
CONTRIBUTING.md
Update the copies in this repository to match.
See fatiando/community@e899f3f

* Add function to convert grid to Dataset (fatiando#282)

Add a function that converts coordinates and data as np.arrays to
xarray.Dataset. This would make handling grids easier, with the option of
efficiently saving them as netCDF files. Make BaseGridder.grid method to use
this function instead of building the xarray.Dataset by itself. Add new
check_data_names function that converts a str into a tuple with a single
str. Make BaseGridder.grid, BaseGridder.scatter and BaseGridder.profile
able to take a single str for the data_names argument. Add test function
for default data_names. Change data_names on all examples and tutorials to
a single str.

* Refactor get_data_names and related check functions (fatiando#295)

Refactor `get_data_names` as a private method of `BaseGridder`: simplify
`get_data_names` logic, make `get_data_names` to use `check_data_names` to
convert single string to a tuple and check that `data_names` has the same
number of elements as data. Add `data_names_defaults` as a class attribute of
`BaseGridder`. Move `check_extra_coords_names` to `verde/base/utils.py` along
with the other check functions. Make `check_data_names` to raise error if
`data_names` is `None`. But `BaseGridder.get_data_names` will return default
values if `data_names` is `None`. Rename test functions for `make_xarray_grid`.

Co-authored-by: Leonardo Uieda <leouieda@gmail.com>
Co-authored-by: Fatiando a Terra Bot <50936856+fatiando-bot@users.noreply.github.com>
Co-authored-by: Santiago Soler <santiago.r.soler@gmail.com>
Co-authored-by: Rowan Cockett <rowanc1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow grid_to_table to take DataArray as well
1 participant