
Generate old regional analysis results #957

Closed
trevorgerhardt wants to merge 11 commits into dev from generate-old-regional-analysis-results

Conversation

@trevorgerhardt
Member

@trevorgerhardt trevorgerhardt commented Feb 26, 2025

Generate all final results for old regional analyses with a task so that we can eliminate old code which was required to handle them. See #956 for the R5 code that can be simplified.

Regional analysis results had two changes to how they were stored over the years. However, the last change was several years ago. We still handle the old style of results in multiple locations in code, but if we migrate the database entries and pre-generate all of the regional analysis results we would be able to eliminate those code paths. This will make future changes easier to make.

This should be run on staging and production before #956 is merged.

@Override
public void action(ProgressListener progressListener) throws Exception {
DBObject query = QueryBuilder.start().or(
QueryBuilder.start("travleTimePercentiles").is(null).get(),
Member

Typo in field name: travelTimePercentiles

Member Author

Good catch 🙏, fixed in b85e57c

@abyrd
Member

abyrd commented Mar 5, 2025

It does seem like a good simplification to get rid of this fallback code for very old formats. Removing some of these fields means old worker versions that rely on them would break, but these are very old versions that are already effectively deprecated and shouldn't be used.

Just thinking through how this works: In recent years we generate multi-cutoff results files, from which single-cutoff files are generated in multiple formats. The multi-cutoff files are never used directly; they are only used to generate single-cutoff files on demand. For old analyses where multi-cutoff results files do not exist (or are named differently because analyses were limited to one set of destinations), this PR doesn't attempt to generate new-style results files. It just hits all code paths where multi-cutoff results would normally be used to derive single-cutoff results files, hitting the fallback code that works in the absence of new-style results files. Once that's done, it should be possible to remove that fallback code. This is assuming we've exhaustively hit all the ways the functions containing the fallback code can be invoked.

What's the planned way to run this on staging and production before #956 is merged? I guess we would perform one deployment of the patched version, observe the logs, and perform another deployment with the post-956 version soon thereafter if everything looked good?

regionalAnalysis.cutoffsMinutes = cutoffs;
regionalAnalysis.travelTimePercentiles = percentiles;
regionalAnalysis.destinationPointSetIds = destinationPointSetIds;
Persistence.regionalAnalyses.put(regionalAnalysis);
Member

#956 mentions a MongoDB migration, and it looks like this PR is intended to perform that migration (via Java code) while generating the missing single-cutoff results files. If so we might want to mention that in #956 for future reference.

Member Author

Modified the description of #956 to reference this PR and recommend that this one should be merged and run first.

OpportunityDataset destinations = Persistence.opportunityDatasets.get(destinationPointSetId);
for (int cutoffMinutes : cutoffs) {
for (int percentile : percentiles) {
for (FileStorageFormat format : FileStorageFormat.values()) {
Member

It looks like this is iterating through all 12 or so FileStorageFormat values and calling getSingleCutoffGrid with all of them. The getSingleCutoffGrid method only recognizes and handles three of those formats (GRID, PNG, and GEOTIFF) and does not write anything to the output stream for any other cases. So it looks like this might leave the output stream open and try to move an underlying empty file into storage with each of the nine other extensions.

Member Author

Updated to use only those three formats in b85e57c
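A minimal sketch of the fix discussed above (not the PR's actual code; the extra enum values here are stand-ins, since only GRID, PNG, and GEOTIFF are named in the review): restrict iteration to the formats `getSingleCutoffGrid` actually recognizes instead of looping over every `FileStorageFormat` value.

```java
import java.util.EnumSet;

// Stand-in for the real FileStorageFormat enum; the actual enum has roughly
// 12 values, of which getSingleCutoffGrid handles only the first three below.
enum FileStorageFormat { GRID, PNG, GEOTIFF, CSV, SHP }

class SingleCutoffFormats {
    // Iterating the full enum would leave the output stream open and move
    // empty files into storage for every unhandled format, per the review.
    static final EnumSet<FileStorageFormat> HANDLED =
            EnumSet.of(FileStorageFormat.GRID, FileStorageFormat.PNG, FileStorageFormat.GEOTIFF);
}
```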

).get();
try (DBCursor<RegionalAnalysis> cursor = Persistence.regionalAnalyses.find(query)) {
while (cursor.hasNext()) {
RegionalAnalysis regionalAnalysis = cursor.next();
Member

It might be good to add some logging in here so we can verify the effects of the migration when it runs. Or maybe better, enable DEBUG level logging when we run the migration to hit the log statements inside getSingleCutoffGrid.

Member Author

Added a few log statements in c2cc8f4

Member

Thanks for adding some logging. To clarify, the underlying reason I was interested in logging was to observe whether the effects of the migration matched expectations. Rather than just counts, something that would tip us off if for example records were being missed due to an incorrect field name, or extra files were being produced etc. This is why I mentioned enabling debug for getSingleCutoffGrid as it would log each new file created and stored. Maybe we could also log the key fields of each regionalAnalysis record that was migrated.

@trevorgerhardt
Member Author

What's the planned way to run this on staging and production before #956 is merged? I guess we would perform one deployment of the patched version, observe the logs, and perform another deployment with the post-956 version soon thereafter if everything looked good?

As of right now, I think this approach would be best. There are changes in the dev branch that have not been deployed from PRs #953 and #954, however those changes are very minor.

If we take that approach, #956 could also remove the code in this PR.

Generate all final results for old regional analyses with a task so that we can eliminate the old code required to handle them.
Not all `FileStorageFormat`s are valid single cutoff grid formats.
`getSingleCutoffGrid` can be run more than once on the same inputs. Perform the database changes only after all of the files are generated, so that if an error occurs during a run, the task can simply be run again.
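The ordering described in that commit message can be sketched roughly as follows (the helper interfaces and names here are assumptions for illustration, not the PR's code): all file generation happens first, and the MongoDB record is updated only once every file exists, so a mid-run failure leaves the record unmigrated and the task can safely be re-run.

```java
import java.util.List;

class GenerateThenPersist {
    interface FileGenerator { void generate(); }  // e.g. a getSingleCutoffGrid call; re-runnable
    interface RecordPersister { void persist(); } // e.g. the Persistence.regionalAnalyses.put update

    static void run(List<FileGenerator> generators, RecordPersister persister) {
        for (FileGenerator g : generators) {
            g.generate(); // idempotent: repeating this on a retry is safe
        }
        persister.persist(); // the database change is committed only after all files exist
    }
}
```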
@trevorgerhardt force-pushed the generate-old-regional-analysis-results branch from bafcfab to 353f2de on March 5, 2025 10:02
@trevorgerhardt enabled auto-merge (rebase) on March 5, 2025 10:02
@trevorgerhardt requested a review from abyrd on March 5, 2025 10:02
@ansoncfit (Member) left a comment

One thing to consider: how will the result generation handle cases where the destination opportunity dataset referenced by a regional analysis has been deleted? It should be possible to generate the single cutoff grids using just the id of the destination dataset. The name of the dataset is only required for the resultHumanFilename. Fallbacks for generating human-readable filenames when referenced data are missing would be a separate change. I just wanted to flag this edge case now in case it causes failures with the migration.

@trevorgerhardt, how does the following sound?

  1. Log at the DEBUG level for the method in question as @abyrd suggested above (add <logger name="com.conveyal.analysis.controllers.RegionalAnalysisController" level="DEBUG" /> to logback.xml?). Or if you think logging regionalAnalysis._id and regionalAnalysis.accessGroup as currently proposed (plus the existing INFO- and WARN-level statements in getSingleCutoffGrid) will be enough detail, that's fine.
  2. Take a snapshot of the staging database
  3. Run the migration and check results on staging, reporting back on log statements and time required
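For reference, the logback change suggested in step 1 might look like this (assuming a standard logback.xml in the backend; the surrounding configuration element is shown only for context, and the logger line itself is taken from the comment above):

```xml
<configuration>
  <!-- Enable DEBUG logging for the controller containing getSingleCutoffGrid -->
  <logger name="com.conveyal.analysis.controllers.RegionalAnalysisController" level="DEBUG" />
</configuration>
```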

@abyrd
Member

abyrd commented Mar 14, 2025

We may also want to do a dry run or equivalent read-only database query on production without changing anything, just to see how many records / results are concerned here and how old they are. We may be addressing a limited number of very old results. If current active users confirm that these results are quite dated and will rarely or never be used, it may be reasonable to abandon them or do a partial conversion that does not cover every edge case.

Refactor methods so that `getSingleCutoffGrid` does not require the opportunity dataset to still exist in the database.
@trevorgerhardt
Member Author

I modified the log statements to use .info instead of .debug. Additionally, I refactored the static getSingleCutoffGrid method to require only the destination point set's string ID. The full database entry was only retrieved to generate a human-readable name for downloads, which we don't need here (and missing datasets breaking that name generation is a pre-existing problem, separate from this PR).

There are currently 167 entries that match the MongoDB query on staging. I will take a snapshot and then run the migration there.
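The shape of that refactor can be sketched as follows. Everything here is an assumption for illustration (the record type, the method name, and especially the filename pattern are not the PR's actual code); the point is that only the destination point set's string ID is needed once the human-readable download name is dropped.

```java
class SingleCutoffSketch {
    // Hypothetical stand-in for the RegionalAnalysis Mongo model.
    record RegionalAnalysis(String id) {}

    // After the refactor, the full OpportunityDataset entry is no longer fetched;
    // the string ID alone identifies the stored result file. The filename pattern
    // below is invented for this sketch.
    static String storageFileName(RegionalAnalysis analysis, String destinationPointSetId,
                                  int cutoffMinutes, int percentile, String extension) {
        return String.format("%s_%s_P%d_C%d.%s",
                analysis.id(), destinationPointSetId, percentile, cutoffMinutes, extension);
    }
}
```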

@trevorgerhardt
Member Author

The changes in this PR were run on staging and from a local machine on production data with the results stored in our production S3 bucket. Closing.

auto-merge was automatically disabled April 2, 2025 03:32

Pull request was closed


3 participants