Parallelize triangulation #619

Merged · 8 commits · Apr 14, 2023
Conversation

@travisdriver (Collaborator) commented Apr 11, 2023:

Running on my machine with 8 workers for door-12, runtime goes from 1.24 minutes without parallelism to 1.03 minutes with it. I would expect larger datasets to see even more significant runtime reductions.

@akshay-krishnan (Collaborator) left a comment:

I think for a better speedup, and to run on bigger datasets, we will need to batch this. Instead of 1 delayed operation per track, we would need 1 delayed operation for a batch of tracks (say 10-50). Each delayed operation adds some latency/overhead on dask, and we want to make sure the added overhead is << the time for each batch triangulation.

# Loop through tracks and generate delayed triangulation tasks.
triangulation_results = []
for track_2d in tracks_2d:
    triangulation_results.append(dask.delayed(point3d_initializer.triangulate)(track_2d))
Collaborator:
Can we do an nout=3 in dask.delayed, and fill in sfm_tracks, avg_track_reproj_errors, and triangulation_exit_codes here, instead of in one more loop?

travisdriver (Author):
Surprisingly, separating the variables before calling compute is slower. The following code took 1.17 minutes:

def run_triangulation(
        self,
        cameras: Dict[int, gtsfm_types.CAMERA_TYPE],
        tracks_2d: List[SfmTrack2d],
    ) -> Tuple[List[Delayed], List[Delayed], List[Delayed]]:
        """Performs triangulation of 2D tracks in parallel.

        Ref: https://docs.dask.org/en/stable/delayed-best-practices.html#compute-on-lots-of-computation-at-once

        Args:
            cameras: Dictionary mapping camera index to camera.
            tracks_2d: List of 2D tracks.

        Returns:
            sfm_tracks: List of triangulated tracks.
            avg_track_reproj_errors: List of average reprojection errors per track.
            triangulation_exit_codes: exit codes for each triangulation call.
        """
        # Initialize 3D landmark for each track
        point3d_initializer = Point3dInitializer(cameras, self.triangulation_options)

        # Loop through tracks and generate delayed triangulation tasks.
        delayed_sfm_tracks, delayed_avg_track_reproj_errors, delayed_triangulation_exit_codes = [], [], []
        for track_2d in tracks_2d:
            delayed_sfm_track, delayed_avg_track_reproj_error, delayed_triangulation_exit_code = dask.delayed(
                point3d_initializer.triangulate, nout=3
            )(track_2d)
            delayed_sfm_tracks.append(delayed_sfm_track)
            delayed_avg_track_reproj_errors.append(delayed_avg_track_reproj_error)
            delayed_triangulation_exit_codes.append(delayed_triangulation_exit_code)

        # Perform triangulation in parallel.
        sfm_tracks, avg_track_reproj_errors, triangulation_exit_codes = dask.compute(
            delayed_sfm_tracks, delayed_avg_track_reproj_errors, delayed_triangulation_exit_codes
        )

        return sfm_tracks, avg_track_reproj_errors, triangulation_exit_codes

Maybe the scheduler isn't intelligent enough to realize the three inputs are from the same function call?
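One plausible explanation, assuming standard dask.delayed behavior: with nout=3, each call adds extra getitem tasks to the graph (one per output), so the graph roughly quadruples in size. A pattern that avoids this is to keep one delayed object per call, compute the whole list, and unzip afterwards. A minimal stdlib sketch of that collect-then-unzip pattern (ThreadPoolExecutor stands in for dask here, and the toy `triangulate` returning a 3-tuple is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def triangulate(track):
    # Toy stand-in for Point3dInitializer.triangulate: returns
    # (sfm_track, avg_reproj_error, exit_code).
    return (track, 0.5, 0)

tracks_2d = ["t0", "t1", "t2"]

# One task per call; each result stays a single 3-tuple.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(triangulate, tracks_2d))

# Unzip into three parallel lists only after everything has run.
sfm_tracks, avg_errors, exit_codes = (list(xs) for xs in zip(*results))
```

This keeps the task graph at one node per track and defers the per-output bookkeeping to cheap local post-processing.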

gtsfm/data_association/data_assoc.py (resolved)
@akshay-krishnan (Collaborator):
Let me know what you think of batching; it should be easy to implement. Once you have that, I can run on larger datasets. We also want to make sure it works on the cluster.

@travisdriver (Author):

> I think for a better speedup, and to run on bigger datasets, we will need to batch this. Instead of 1 delayed operation per track, we would need 1 delayed operation for a batch of tracks (say 10-50). Each delayed operation adds some latency/overhead on dask, and we want to make sure the added overhead is << the time for each batch triangulation.

Great suggestion. I added the MAX_DELAYED_TRIANGULATION_CALLS variable to decide the batch size, i.e., batch_size = int(np.ceil(len(tracks_2d) / MAX_DELAYED_TRIANGULATION_CALLS)). Setting max calls to 1e4 runs in 1.02 minutes, and 1e3 and 1e2 max calls run in 0.99 minutes. I kept the default value at 1e3.
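The batching arithmetic above can be sketched without dask. The constant name MAX_DELAYED_TRIANGULATION_CALLS and the batch-size formula are from this PR; the `make_batches` helper itself is illustrative, not the actual implementation:

```python
import math

MAX_DELAYED_TRIANGULATION_CALLS = int(1e3)  # default chosen in this PR

def make_batches(tracks_2d, max_calls=MAX_DELAYED_TRIANGULATION_CALLS):
    """Split tracks into chunks so at most `max_calls` delayed
    triangulation tasks are created (illustrative helper)."""
    batch_size = int(math.ceil(len(tracks_2d) / max_calls))
    return [tracks_2d[i : i + batch_size] for i in range(0, len(tracks_2d), batch_size)]

# Each batch would then become a single dask.delayed call that
# triangulates all tracks in the batch and returns a list of results,
# amortizing the per-task scheduler overhead over the whole batch.
```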

@@ -196,6 +204,68 @@ def run(

return connected_data, data_assoc_metrics

def run_triangulation(
Collaborator:
What's the difference between run_triangulation() and run_da()? Is one parallelized and the other not?

Technically, neither does any data association, since we broke DSF into another module, so can we use a different name than run_da()? (I know the module is mis-named currently)

travisdriver (Author):

Technically it's still associating landmarks to keypoints. The class is also called DataAssociation, so we'd need to change the whole class name as well. Maybe we can change the naming conventions in another PR?

Collaborator:

I see; yeah, I suppose it depends a bit on our definition of data association. Agreed it's orthogonal to this PR, but I'm still a bit lost by the naming of the new function we're adding. Do we mean run_da vs. run_da_batched, or run_triangulation vs. run_triangulation_batched?

Collaborator:

Here's one definition of data association from Szeliski, that would suggest we rename this module:

In SFM, “data association” is the problem of determining correspondences, either between feature points or whole images. In the case of whole images, it can be seen as the validity of hypotheses that image pairs contain sets of matching features corresponding to the same 3D points.

http://szeliski.org/papers/Roberts_DuplicateSfM_CVPR11.pdf

Collaborator:

(but of course module renaming can be for the future)

@travisdriver (Author) commented Apr 12, 2023:

Batching is only enforced for triangulation when the MAX_DELAYED_TRIANGULATION_CALLS threshold is violated. I think a simple if-else statement is sufficient as opposed to a separate run function.

I think the function naming is straightforward. run_triangulation triangulates the 2D tracks. run_da performs the DA step (or at least what we were previously referring to as DA), i.e., validating the triangulated tracks and building the input to BA.

Yes, data association is the process of determining correspondences. This module still plays a role in determining correspondences, as we filter correspondences based on the triangulation geometry.

You're more than welcome to rename this module to whatever you'd like in a separate PR :)

@travisdriver (Author):

@stepanyanhayk Would you mind running this on the cluster and letting us know whether or not it fixes the DA runtime issues?

# Unpack results.
sfm_tracks, avg_track_reproj_errors, triangulation_exit_codes = [], [], []
if batch_size == 1:
    for result in triangulation_results:
Collaborator:

nit: may be more readable as

for (sfm_track, avg_track_reproj_error, exit_code) in triangulation_results:
    sfm_tracks.append(sfm_track)
    avg_track_reproj_errors.append(avg_track_reproj_error)
    triangulation_exit_codes.append(exit_code)

travisdriver (Author):

Done

@@ -115,9 +122,12 @@ def run(
exit_codes_wrt_computed: List[TriangulationExitCode] = []
per_accepted_track_avg_errors = []
per_rejected_track_avg_errors = []
for track_2d in tracks_2d:
assert len(tracks_2d) == len(sfm_tracks)
for j in range(len(tracks_2d)):
Collaborator:

nit: since we don't triangulate here anymore, I believe the comment below needs to be updated

travisdriver (Author):

Done

cameras_gt: List[Optional[gtsfm_types.CALIBRATION_TYPE]],
relative_pose_priors: Dict[Tuple[int, int], Optional[PosePrior]],
images: Optional[List[Image]] = None,
) -> Tuple[GtsfmData, GtsfmMetricsGroup]:
"""Perform the data association.
"""Perform the data association and compute metrics.
Collaborator:

Since we don't triangulate here, and don't compute correspondences here, I propose we totally rename this method to compute_da_metrics().

Collaborator:

(and edit the docstring also to remove the mention of "Perform the data association", since all we do is compute metrics here now, if I understand correctly)

Collaborator:

Re-reading this, looks like we do also validate the tracks here, and compute a largest connected component. How about we consider renaming this as

def validate_tracks_and_compute_metrics()

or

def validate_triangulated_tracks_and_compute_metrics()
  """ """

or

def assemble_gtsfm_data_from_tracks()

How about this as a docstring:
"""Assembles GtsfmData, validates triangulated tracks, computes metrics.

Only the largest connected component of cameras, represented in 3d tracks, is retained.
"""

travisdriver (Author):

I think these are good suggestions, but again I think it makes more sense to save this for another PR in which you can refactor the module names and functions to your liking.


Args:
num_images: Number of images in the scene.
cameras: dictionary, with image index -> camera mapping.
tracks_2d: list of 2D tracks.
sfm_tracks: List of triangulated tracks.
avg_track_repoj_errors: List of average reprojection errors per track.
Collaborator:

nit: typo on "reproj errors"

travisdriver (Author):

Done

@johnwlambert (Collaborator) left a comment:

Thanks Travis!

@stepanyanhayk (Collaborator) commented Apr 14, 2023:

Hi @travisdriver,

Thanks for this PR. I ran your changes against master on the cluster and it is much faster!

Skydio-32, deep_front_end, total wall time:

  • Master: 1.78m
  • This branch: 1.21m

Also, here is a screenshot from the dask dashboard demonstrating that the tasks are indeed distributed between all the machines.

[screenshot: dask dashboard, 2023-04-14]

@akshay-krishnan (Collaborator) left a comment:

Not sure if this is covered by unit tests; please check artifacts for at least one dataset before merging, just to be sure nothing breaks.

@travisdriver (Author):

> Not sure if this is covered by unit tests; please check artifacts for at least one dataset before merging, just to be sure nothing breaks.

[attached: comparison plot]

visual_comparison_dashboard.zip

@travisdriver travisdriver merged commit d81eeb5 into master Apr 14, 2023
@travisdriver travisdriver deleted the parallel-triang branch April 14, 2023 20:06
@akshay-krishnan (Collaborator):

Are all the differences in the visualization above coming from randomness, or did we change anything in this PR?
