
Unstructured grid datasets #1251

Merged · 1 commit · Dec 5, 2023

Conversation

Contributor

@dbochkov-flexcompute dbochkov-flexcompute commented Nov 14, 2023

This PR introduces two new datasets, TriangularGridDataset and TetrahedralGridDataset, for 2D and 3D unstructured data. I think the easiest way to understand their basic capabilities and purpose is to go through this demonstration notebook: https://github.com/flexcompute-readthedocs/tidy3d-docs/blob/daniil/unstructured/docs/source/notebooks/UnstructuredData.ipynb. Overall I tried to follow the functionality of xarray data arrays, though obviously so far only at a basic level. Some specific points I wanted to discuss:

  • This is based on the vtk package, which is substantial. Is it OK to include it in the requirements by default?
  • There is a little bit of name clashing with TriangleMeshDataArray and TriangleMeshDataset, which are used for STL import. I guess if we strictly follow the convention that Mesh in a class name means surface mesh, while Grid means discretization grid, then it should be OK.
  • Defining PointDataArray, CellDataArray, and IndexedDataArray is somewhat clunky. They are basically 2D/1D indexed arrays, that is, coords are typically specified using np.arange(n) (see the sketch right after this list). I could use our numpy-based arrays, but I believe those are stored as JSON strings?
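
For illustration, a minimal sketch of one of these indexed arrays (the point coordinates here are made up; it mirrors the CellDataArray example from the notebook quoted later in this thread):

import numpy as np
import tidy3d as td

# a PointDataArray holding four 2D points; both dims are indexed
# by plain integer ranges rather than meaningful coordinates
points = td.PointDataArray(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    coords=dict(index=np.arange(4), axis=np.arange(2)),
)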

Next steps would be (most likely as a separate PR):

  • switch heat monitor data to use these classes
  • make custom mediums be able to accept unstructured data

To review those next steps I will probably involve @weiliangjin2021

Tagging @momchil-flex to keep him in the loop.

@tylerflex tylerflex left a comment

Thanks @dbochkov-flexcompute, looks good to me. A few minor comments, but mostly I'm not sure about adding vtk as a basic requirement vs an optional one. Might want to check with MC about the size of the requirement. Also, I'm slightly worried there could be some system-dependent installation instructions, so let's just make sure that's not the case before adding it.

@@ -12,3 +12,4 @@ pydantic==2.*
 PyYAML
 dask
 toml
+vtk
Collaborator

How large is vtk? And is it relatively easy to install? You might need to check with MC to make sure that it's small enough because I think we were trying to make the basic requirements as minimal as possible (at least this was true a few months ago, maybe things have changed).

If it's too large, or difficult to install on some systems, we might need to treat it as an optional requirement like trimesh.

Contributor Author

The installation package for vtk is about 100 MB. It doesn't seem to have any installation issues (just pip install vtk) on different systems, which is probably not surprising since it's a very mature package (especially its C++ version). But yeah, I think it probably makes sense to treat it the same way as trimesh, especially since it would be required only when working with the heat solver, or in rare cases when the user wants to import unstructured data from other solvers.

Collaborator

The idea of the basic requirements is that you should be able to load and validate simulations, even if you can't do anything else. Generally, I do think it would be best if we could encapsulate vtk imports in a VTK_AVAILABLE kind of handling similar to trimesh. Do you think it would be possible to load simulations with unstructured data if vtk is not installed? Maybe some validators would have to be skipped if not VTK_AVAILABLE?

Contributor Author

yeah, that should be possible. VTK is really needed only for performing operations like slicing/clipping/interpolating

Contributor Author

Switched vtk to be an optional package. Similar to trimesh, functions that require vtk will raise an error if it's not available. And similar to trimesh, the import is attempted in types.py, so the rest of the code uses from .types import vtk and checks if vtk is None: (the pattern is sketched after the list below). I also managed to add a test that checks that, if vtk is not available, instances can still be created and validated but most of the functionality is disabled. I did it in the following way:

  • test_datasets.py contains tests that check behavior both when vtk is and is not available, via if vtk is None:. However, running pytest test_datasets.py will check only one side of that, depending on whether vtk is installed in the Python environment
  • so, there is a separate file _test_datasets_no_vtk.py that imports the same tests but prepends them with a monkeypatch fixture that mocks vtk being unavailable. So, running pytest _test_datasets_no_vtk.py checks that everything works as expected when vtk is not installed. It has to be run as a separate pytest session, otherwise vtk gets imported in other tests and remains loaded (hence the changes to test_local.py and tox.ini)
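
For reference, a minimal sketch of this optional-import pattern (the guarded method name and error message are illustrative, not the exact tidy3d code):

# in types.py: attempt the import once and fall back to None
try:
    import vtk
except ImportError:
    vtk = None

# in the rest of the code: guard vtk-dependent functionality
from .types import vtk

def plane_slice(self, axis, pos):
    if vtk is None:
        raise ImportError("The 'vtk' package is required for this operation.")
    ...  # vtk-based slicing goes here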

_dims = ("index", "axis")


class CellDataArray(DataArray):
Collaborator

Can you explain this a bit? So CellDataArray[i,j] is the j-th vertex of cell i? And what does it store — an integer index into another cell, or into the point data array? When I worked on unstructured meshes I would usually have a 3D array with dims cell_index, vertex_index, axis. It seems that maybe this kind of data structure could be constructed out of this cell data array + the point data array?

Contributor Author

Yes, exactly, it stores the indices of the points that constitute the cell. And the representation that you mentioned can definitely be constructed from PointDataArray + CellDataArray.
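
In plain numpy terms, that construction is just fancy indexing; a small sketch with made-up data:

import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # (num_points, axis)
cells = np.array([[0, 1, 2], [1, 2, 3]])  # (cell_index, vertex_index) -> point indices

# indexing the point array by the connectivity array yields the
# (cell_index, vertex_index, axis) representation mentioned above
cell_vertices = points[cells]  # shape (2, 3, 2)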

Contributor

I might be misunderstanding the intended usage, but in the notebook you attached, in the first cell, there is

tri_grid_cells = td.CellDataArray(
    [[0, 1, 2], [1, 2, 3]],
    coords=dict(cell_index=np.arange(2), vertex_index=np.arange(3)),
)

but under what circumstance will the vertex_index not be np.arange(3) for a triangle? I guess it would be different only if the cell represents a triangle vs. a tetrahedron?

Contributor Author

dbochkov-flexcompute commented Nov 14, 2023

Another open-ended discussion that I wanted to bring up: currently I implemented these datasets in such a way that they store only one data array, similarly to SpatialDataArray. But the difference from our usual data arrays is that the grid/coords data is in fact larger than the stored field data: in SpatialDataArray the grid is defined by one-dimensional coordinate arrays, while in unstructured data arrays we store the coordinates of each grid point plus information about how they are connected into cells. So, if we want to define multiple fields on the same unstructured grid, we have to create a separate UnstructuredGridDataset for each of them, and each of these datasets will contain full information about the same unstructured grid, which seems wasteful. In VTK, and what probably makes sense overall, an unstructured grid instance can carry an unlimited number of data arrays (scalar or vector). We could make our unstructured datasets do the same, that is, field values would be a dict of arrays instead of one array. But I am not sure how to fit that into our model. Let us consider a specific example of defining a custom medium using unstructured arrays:

unstructured_perm = TetrahedralGridDataset(values=perm_values, points=grid_points, cells=grid_cells)
unstructured_cond = TetrahedralGridDataset(values=cond_values, points=grid_points, cells=grid_cells)

med = CustomMedium(
    permittivity=unstructured_perm,
    conductivity=unstructured_cond,
)

so, as you can see, both unstructured_perm and unstructured_cond contain the same grid info (grid_points, grid_cells), and that info is always larger than the data values themselves and could be very large overall. We could pack them into a single dataset

unstructured_grid = TetrahedralGridDataset(
    values={"permittivity": perm_values, "conductivity": cond_values}, 
    points=grid_points, 
    cells=grid_cells,
)

but then the issue is how to pass them to CustomMedium. Any thoughts?

@momchil-flex
Collaborator

Yeah, no good solution comes to mind, at least none that wouldn't involve restructuring CustomMedium and probably breaking backwards compatibility. Well, maybe one option is to allow optional points and cells fields as in

med = CustomMedium(
    permittivity=perm_values,
    conductivity=cond_values,
    points=grid_points, 
    cells=grid_cells,
)

and do some validation that those are provided when permittivity and conductivity are IndexedDataArrays. It seems pretty clunky... but maybe still better than replicating all the cells/points data?

@shashwat-sh
Contributor

but then the issue is how to pass them to CustomMedium. Any thoughts?

Would it make sense to store it the way you suggest (i.e. values is a dict of multiple data arrays) but use convenience methods to "unpack" it into separate data arrays in the "old" way right when defining the custom medium? So that the storage bloat is local and temporary?

@momchil-flex
Collaborator

The problem, I believe, is that you then want to, e.g., pass this CustomMedium to a simulation and send that to our servers. So it really needs to be a modification to the schema.

@dbochkov-flexcompute dbochkov-flexcompute force-pushed the daniil/unstructured branch 3 times, most recently from cc821fb to 144c9d1 Compare November 15, 2023 01:23
@dbochkov-flexcompute
Contributor Author

I guess another workaround for eliminating repeated grid storage is to change our workflow for sampling materials with perturbation models. Specifically, instead of explicitly converting all PerturbationMedium's into CustomMedium's in the frontend, defer that to the backend. For example, Simulation could have a field perturbation_fields which contains either a dict of usual SpatialDataArrays (temperature, electron_density, hole_density) or an unstructured grid dataset with a dict of data arrays (as I mentioned above). It does mean that for plotting and the local mode solver the conversion PerturbationMedium -> CustomMedium must happen on the fly every time they are called, but that shouldn't be too bad given those need only 2D fields.

It would also help with lowering storage from overlapping structures. For example, in this notebook https://docs.flexcompute.com/projects/tidy3d/en/v2.5.0rc2/notebooks/ThermallyTunedRingResonator.html the custom medium associated with the ring structure carries values in the entire bounding box of the ring, while in the interior of the ring only the values of the background medium are relevant.

@shashwat-sh shashwat-sh left a comment

Looks like this will be super useful! It's really neat that you've added those slicing features, I think that'll be a gift that keeps giving in the future.

A couple general comments here, and some minor ones below:

  • If we're using vtk, do we still need Gmsh at all? I think vtk also has some basic mesh generation abilities, and the mesh doesn't need to be super fancy for heat if I understood correctly? What's more is that one could then just use vtk to generate surface meshes for the rotations etc. that Lucas was working on, rather than the homemade mesher which he wanted to replace anyway. Might be worth checking if vtk can be used for all of these and then not worry about license stuff with Gmsh? This is a "think about in the future" comment, not directly related to this PR.
  • Are there a lot of situations where we expect someone to use an UnstructuredGridDataset without VTK? I'm asking because it seems to me a bit square-peg-in-round-hole-ish to "force" the unstructured format stored internally in VTK into an xarray format which isn't really designed for this. Why not just make the UnstructuredGridDataset a wrapper around VTK that directly stores and accesses data in VTK's native internal representation? For the custom medium discussion, I guess the methods to and from SpatialDataArray are still needed, so maybe xarray is the way to go for our specific needs, in which case feel free to ignore this comment. But thinking from the perspective of "what's the best way to store unstructured data in general", it seems strange to create our own xarray representation and then go back and forth from VTK representations, when VTK is already designed to store unstructured data.


@@ -571,5 +614,8 @@ class ChargeDataArray(DataArray):
     TriangleMeshDataArray,
     HeatDataArray,
     ChargeDataArray,
+    PointDataArray,
Contributor

One general concern I had was how to make sure that the definitions of one data array are consistent with its parent, e.g. the CellDataArray shouldn't have indices outside the associated PointDataArray. But I see that you have several validators in the dataset class which should take care of this. But is it possible that someone messes with the coords etc. of an existing TriangularGridDataset to yield a new one, but skips the validation? I guess even if this is possible, we don't need to worry about it because if someone does try this, they're playing with fire anyway?
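
For concreteness, a framework-agnostic sketch of the kind of consistency check such validators perform (a hypothetical helper, not the actual tidy3d validator):

import numpy as np

def check_connectivity(points: np.ndarray, cells: np.ndarray) -> None:
    """Ensure the cell connectivity only references existing points."""
    if cells.min() < 0 or cells.max() >= len(points):
        raise ValueError("cell connectivity references indices outside the point array")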

vtk_ind_size = vtkIdTypeArray().GetDataTypeSize()
# using val.astype(np.int32/64) directly causes issues when dataarray are later checked ==
if vtk_ind_size == 4:
    return CellDataArray(val.data.astype(np.int32, copy=False), coords=val.coords)
Contributor

Not sure I understood this part: what would happen if there are a lot of points which exceed what can be represented by int32? I guess a related question is: what decides the type used by vtk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed that vtk uses int64 for indices, but our GitHub tests showed that on Windows, for some reason, int32 is used. Perhaps related to hardware and system configuration? I wonder if you can control that when compiling vtk from source, but I'm not sure there is any control with pip install vtk.

if vtk is None:
    return offsets

vtk_ind_size = vtkIdTypeArray().GetDataTypeSize()
Contributor

looks like this is known at import time and the check is repeated a couple of times. Does it make sense instead to define a VTK_INT or something globally as int32 or int64?

Contributor Author

good point, will add that
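
For example, something along these lines (a sketch; the constant name is illustrative, and vtk may be absent):

import numpy as np

try:
    from vtk import vtkIdTypeArray

    # GetDataTypeSize() returns the size of vtkIdType in bytes (4 or 8)
    VTK_ID_TYPE = np.int32 if vtkIdTypeArray().GetDataTypeSize() == 4 else np.int64
except ImportError:
    VTK_ID_TYPE = None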

@dbochkov-flexcompute
Contributor Author

If we're using vtk, do we still need Gmsh at all? I think vtk also has some basic mesh generation abilities, and the mesh doesn't need to be super fancy for heat if I understood correctly? What's more is that one could then just use vtk to generate surface meshes for the rotations etc. that Lucas was working on, rather than the homemade mesher which he wanted to replace anyway. Might be worth checking if vtk can be used for all of these and then not worry about license stuff with Gmsh? This is a "think about in the future" comment, not directly related to this PR.

Hm, I don't think vtk really has any meshing capabilities. Maybe only surface discretization of primitives and boolean operations, but no volumetric mesh generation.

Are there a lot of situations where we expect someone to use an UnstructuredGridDataset without VTK? I'm asking because it seems to me a bit square-peg-in-round-hole-ish to "force" the unstructured format stored internally in VTK into an xarray format which isn't really designed for this. Why not just make the UnstructuredGridDataset a wrapper around VTK that directly stores and accesses data in VTK's native internal representation? For the custom medium discussion, I guess the methods to and from SpatialDataArray are still needed, so maybe xarray is the way to go for our specific needs, in which case feel free to ignore this comment. But thinking from the perspective of "what's the best way to store unstructured data in general", it seems strange to create our own xarray representation and then go back and forth from VTK representations, when VTK is already designed to store unstructured data.

Yeah, I thought about it too, but still chose the xarray route since we already have robust HDF reading/writing for that.

  • I don't think there is a big performance hit for transitions between vtk and xarray, since in the way it's currently done both representations point at the same memory. In fact, one must be careful not to pass an array to a vtk instance and then release it: if that happens, performing operations on this vtk instance will produce nonsensical results (see the sketch after this list).
  • Using such a decomposed representation should decrease file sizes. For example, native vtk instances always store 3D coordinates of points, while for our TriangularGridDataset we can eliminate one coordinate. Another example is that native vtk instances store offsets and a cell type for each cell, while we restrict those to triangles and tetrahedra.
  • Another serendipitous convenience of using xarrays is that we don't need vtk to create/load/validate these new classes, which might be beneficial for our web services.
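
As a sketch of the pitfall mentioned in the first bullet, using vtk's numpy bridge (illustrative, not the actual tidy3d code):

import numpy as np
from vtk.util.numpy_support import numpy_to_vtk

def unsafe_vtk_points():
    pts = np.random.rand(100, 3)
    # deep=False shares the numpy buffer instead of copying it
    vtk_arr = numpy_to_vtk(pts, deep=False)
    # once pts goes out of scope it can be garbage-collected,
    # leaving vtk_arr pointing at freed memory
    return vtk_arr

# keeping a reference to the numpy array alive (or passing deep=True) avoids this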

@shashwat-sh
Contributor

Hm, I don't think vtk really has any meshing capabilities. Maybe only surface discretization of primitives and boolean operations, but no volumetric mesh generation.

Oh ok, that's surprising!

Yeah, I thought about it too, but still chose the xarray route since we already have robust HDF reading/writing for that.

Ok, yeah those points make sense given our specific needs.

@dbochkov-flexcompute dbochkov-flexcompute force-pushed the daniil/unstructured branch 6 times, most recently from 38920d6 to 91a670d Compare November 29, 2023 15:23
@dbochkov-flexcompute
Contributor Author

This is ready for final review. Changes since initial review:

  • made the vtk package optional. Added tests that check that everything works as expected with and without vtk (separate commit)
  • unified checking of the vtk id type per Shashwat's suggestion
  • added a reflection operation to unstructured datasets (to use in symmetry_expanded_copy)
  • implemented arithmetic operations for unstructured datasets. Basically everything gets redirected to the field .values, as suggested in the numpy documentation (see the sketch after this list)
  • added options unstructured: bool and conformal: bool to heat monitors to request data on an unstructured grid. If conformal=True, the monitor geometry will be taken into account during meshing; otherwise data will be extracted by simply cutting through the solver grid
  • ^ correspondingly, heat monitor data now stores either a usual SpatialDataArray or an unstructured dataset
  • ^ support for plotting both Cartesian and unstructured data in HeatSimulationData.plot_field()
  • Fixed propagation of solver_version in Batch
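
Presumably the numpy suggestion referenced above is the protocol for custom array containers; a minimal sketch of the delegation pattern (hypothetical class, not the actual tidy3d implementation):

import numpy as np

class UnstructuredSketch:
    """Container that redirects ufuncs to its .values array and rewraps the result."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # swap dataset operands for their raw values, apply the ufunc, rewrap
        raw = tuple(x.values if isinstance(x, UnstructuredSketch) else x for x in inputs)
        return UnstructuredSketch(getattr(ufunc, method)(*raw, **kwargs))

data = UnstructuredSketch([1.0, 2.0])
result = np.add(data, 3.0)  # UnstructuredSketch with values [4.0, 5.0]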

Note that custom mediums do not yet support unstructured datasets; that's for another PR. So, if only unstructured data for temperature is available, one needs to first interpolate it into a SpatialDataArray by hand.

Usage is demonstrated in these updated notebooks:

@shashwat-sh shashwat-sh left a comment

Looks good to me! Just a few minor comments.

One general comment: do you think it would be helpful to have a comment or docstring somewhere, or somehow include in the docs, exactly what functionality we still have without vtk? I think having basically a list of things that make use of vtk will help a user decide beforehand if they need it or not. Could be in the FAQ section of the docs, or maybe in the docstring of one of the unstructured mesh base classes?

coords = list(self.coords.values())
data = np.array(self.data)

if center == coords[axis].data[0]:
Contributor

do we need to do this check with some tolerance for floating point issues? or not necessary here?

Contributor Author

It's not critical here (we can always have two points very close to each other), but you're right, it's better to have some tolerance here. Fixed.


shape = np.array(np.shape(data))
old_len = shape[axis]
shape[axis] = 2 * old_len - num_duplicates
Contributor

what would happen if center is set in such a way that it falls "inside" the region where the original data is defined? i.e. there's an overlap in the original region and the reflected region? I guess realistically, in the intended usage of this method, this would never happen. But since this method is externally visible, I wonder if it's worth having a check to either error or somehow handle this?

Contributor Author

Fixed. Also added frontend tests checking the reflect method.

@@ -95,7 +97,7 @@ def plot_field(
         Name of :class:`.TemperatureMonitorData` to plot.
     val : Literal['real', 'abs', 'abs^2'] = 'real'
         Which part of the field to plot.
-    scale : Literal['lin', 'dB']
+    scale : Literal['lin', 'log']
         Plot in linear or logarithmic (dB) scale.
Contributor

could you clarify whether it's dB or log, so that the type string is consistent with the explanation? I think saying logarithmic (dB) could be confusing because dB is logarithmic with some factor applied. Might be good to just write out the expression to avoid any confusion, e.g. 10 * log(power)?

Contributor Author

yeah, just a leftover, thanks for catching it

    field_name = "|T|², K²"
else:
    field_name = "T, K"

if scale == "log":
    field_data = np.log10(np.abs(field_data))
Contributor

oh it's pure log, so I guess the string above can be updated to remove the (dB)

conformal: bool = pd.Field(
    False,
    title="Conformal Monitor Meshing",
    description="Mesh monitor domain exactly. Note that this option affects the computational "
Contributor

Could you elaborate, for the user's sake, on what it means to mesh the domain exactly? So if we have unstructured=True and conformal=False, I guess it would take the mesh based on the heat domain's mesh, and if conformal=True then the heat mesh is adjusted based on the monitor? But if unstructured=False, does conformal have any effect?

Contributor Author

changed the description to this:

description="If ``True`` the heat simulation mesh will conform to the monitor's geometry. "
"While this can be set for both Cartesian and unstructured monitors, it bears higher "
"significance for the latter ones. Effectively, setting ``conformal = True`` for "
"unstructured monitors (``unstructured = True``) ensures that returned temperature values "
"will not be obtained by interpolation during postprocessing but rather directly "
"transferred from the computational grid."

@dbochkov-flexcompute
Contributor Author

One general comment: do you think it would be helpful to have a comment or docstring somewhere, or somehow include in the docs, exactly what functionality we still have without vtk? I think having basically a list of things that make use of vtk will help a user decide beforehand if they need it or not. Could be in the FAQ section of the docs, or maybe in the docstring of one of the unstructured mesh base classes?

added this to the unstructured datasets' docstrings:

Note
----
To use the full functionality of unstructured datasets one must install the ``vtk`` package
(``pip install tidy3d[vtk]`` or ``pip install vtk``). Otherwise the functionality of unstructured
datasets is limited to creation, writing to/loading from file, and arithmetic manipulations.

@dbochkov-flexcompute dbochkov-flexcompute merged commit 2d869d4 into pre/2.5 Dec 5, 2023
1 of 13 checks passed