Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Celerify HydroShare Export #2897

Merged
merged 9 commits into from Aug 15, 2018
Merged

Celerify HydroShare Export #2897

merged 9 commits into from Aug 15, 2018

Conversation

rajadain
Copy link
Member

@rajadain rajadain commented Aug 1, 2018

Overview

Since exporting multiple files to HydroShare can take a long time, and projects with multiple scenarios can balloon the number of files really quickly, the exports would time out in some cases.

To mitigate that, we wrap the creation and updating methods for HydroShare resources in Celery Tasks, and return a Job ID instead. This frees the export from the constrains of a request-response cycle, while also not tying up a gunicorn worker process.

Celery Tasks have their own timeouts however, and these are set to timeout after 5 minutes. The front-end which polls for these results has the same timeout.

Since these exports take so long, as users add changes in the front-end that can keep being queued for export. But if each export takes ~4 minutes, and each single change triggers an export, we're looking at potentially hours of exports queued up. If the browser is closed, or the user navigates away during this time, the queued exports would be dropped silently. Furthermore, there is no visibility of this queue, or any sort of progress of the exports. Thus, the autosync behavior is too obfuscated with potential data loss for users, which is why the final commit disables it. The autosync options are removed from the UI, all new projects default to it disabled, and a migration turns it off for all currently-syncing HydroShare resources.

To assist or prompt users to export to HydroShare, we can think about adding a "dirty" or "stale" flag to a HydroShare resource, that can be set when any change to a project is made, and unset when a HydroShare export finishes. This is left to a future task.

The above has been incorporated here by comparing the last HydroShare export date with the last scenario modification date. If a scenario has been modified more recently than the last export, we add a message to the HydroShare Export modal saying so.

Connects #2885

Demo

image

https://beta.hydroshare.org/resource/d1f4f20cd6344b7a809552164cc512a6

image

Notes

I also tried cutting every piece of the export into a Celery Task. That work is vieweable in this diff hydroshare-celerification-2.patch. This was suboptimal because since Celery doesn't allow sharing complex objects between tasks, the HydroShare Rest Client could not be shared, and each task would have to create its own which involved making OAuth requests to HydroShare, further slowing down the process. And while splitting up into smaller tasks allowed us to have an unspecified timeout at the Celery level (since each individual task could finish under the default 90s timeout), we would still have a timeout in the front-end, so there is no getting rid of timeouts completely.

HydroShare Recommendations

There are potential improvements that can be made to the HydroShare API which would help us significantly:

  1. Support Bulk File Uploads. Since the most time is spent uploading a file at a time, even though all the files are ready at once, the ability to support bulk file uploads would speed up the process significantly. Probably by an order of magnitude or more. Support Bulk File Uploads hydroshare/hydroshare#2859 Support Bulk File Uploads hydroshare/hs_restclient#86

  2. Allow Overwriting Files. Currently, if a file already exists in HydroShare, you cannot add it again. We have to first delete that file, then add it. This leads to updates taking twice as long as creation, because for each file we must first delete it then add it. If adding files supported overwriting, this could be done in half the time. Support Overwriting Files hydroshare/hydroshare#2860 Support Overwriting Files hydroshare/hs_restclient#87

  3. Speed Up API Responses. Currently each API request takes ~1s to turn around. If this was faster the export would be too.

Testing Instructions

  • Populate the MMW_HYDROSHARE_SECRET_KEY in the app and worker VMs using the "HydroShare Beta" item in LastPass
  • Go to https://beta.hydroshare.org/ and setup an account if you don't already have one
  • Check out this branch, bundle, vagrant ssh app -c 'sudo restart mmw-app' and vagrant ssh worker -c 'sudo service celeryd restart'
  • Go to :8000/ and log in to an account
  • Go to "My Account" → "Linked Accounts" and link to HydroShare (it should automatically connect to the beta site)
  • Make a new project. Add a bunch of scenarios to it (I did 11).
  • Export to HydroShare. Ensure the export completes successfully.
  • Once exported, re-export to HydroShare. Ensure the export completes successfully.
  • Make a change to any scenario. Open the HydroShare export dialog. Ensure it has text indicating that there are new changes.
  • Test both TR-55 and MapShed projects.

@rajadain
Copy link
Member Author

rajadain commented Aug 3, 2018

Hmm, just noticed that the above demo project doesn't have GMS files. Will look into that.

@rajadain
Copy link
Member Author

rajadain commented Aug 7, 2018

One strategy that could work for notifying users that they should re-export to HydroShare is to compare the last HydroShare export date with the most recent scenario update date. If a scenario was updated after the export date, then we know there are new changes. Might need to confirm that the scenario update date is available on the front end.

Previously we would use a User Model to get an instance of
a HydroShare Service Client for them. In this effort we're
offloading some of this work to Celery Tasks, which do not
support passing non-serializable Models as parameters. So,
we will be forced to use simple data types.
These mostly follow the create and update paths in the
view with the addition of spawning their own copies of
HydroShare Service, since that cannot be given to them
across the Celery serialization boundary.
Now, in the case of POST we run a Celery job instead of
doing it all within a single request. This sends back a
job_id that can be polled for. This polling will behave
exactly as other such instances.
A HydroShareExportTaskModel is added to poll for results,
set to poll every 6 seconds since this is relatively slow
process. A task model gives us two promises: one for when
the initial request is made, and another for when polling
stops.

We handle errors such as connectivity problems etc in the
start promise failure. We handle response from HydroShare
in the polling promise resolution. The exportToHydroShare
method returns the two promises combined, to resolve when
they are both ready. This isn't currently used, but keeps
the API same as before when we returned the AJAX promise.
Since sending each file takes a few seconds, and when
multiple scenarios are present in a project they have
a large number of files, which can take a lot of time
to send, we increase the timeout of both Celery tasks
for creating and updating resources, as well as front
end that will poll to a corresponding degree.
Since now the HydroShare Export can potentially take
quite a while, during which the user can keep making
changes, with Autosync enabled that would simply add
each change to the queue, leading to sitations where
the app would constantly be exporting to HydroShare.

Furthermore, because this export queue isn't visible
to the user, they can navigate away, blocking export
from completion.

Adding a visible queue that is user-manipulatable is
currently out of scope. Instead, our compromise here
is to disable autosyncing completely, and relying on
the user to export manually when they want.
When we switched to using job uuids for transferring
MapShed results for Subbasin, we didn't update this
HydroShare export functionality which had been broken.

By switching to using the MapShed Job UUID, we enable
GMS file exports to HydroShare again.
@rajadain rajadain force-pushed the tt/hydroshare-celerification branch from 08044c7 to 4c96a44 Compare August 9, 2018 18:21
@rajadain
Copy link
Member Author

rajadain commented Aug 9, 2018

Rebased on develop, added a commit that fixes the GMS export.

@rajadain
Copy link
Member Author

Added a commit that indicates if a project has changed since the last export.

@caseycesari
Copy link
Member

Taking a look.

Copy link
Member

@caseycesari caseycesari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complex change, so I may have done something wrong, but I'm unable to share a project on Hydroshare. When I click the share button, I get the following error:

image

@rajadain
Copy link
Member Author

Think you may have found a genuine bug. Looking into it now.

@rajadain
Copy link
Member Author

That was a great catch. I had only been testing with projects that had previously been exported to HydroShare. Doing so with a new project tripped on not having hydroshare defined on the project model the first time. Now we test for that before proceeding. Also renamed some variables to read better.

Ready for another look.

Copy link
Member

@caseycesari caseycesari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for me now. Nice job. I was able to upload projects with 8 scenarios before the exports started timing out. Since these exports take a long time, consider adding a temporary message while the upload is in progress letting the user know that the upload may take a while.

@caseycesari caseycesari assigned rajadain and unassigned caseycesari Aug 14, 2018
@rajadain
Copy link
Member Author

Added a final commit that only re-exports scenarios that have changed, or have not been exported before. If you could test this one last thing, I think we'll be good to go for the near future.

Recommended tests:

  • Update branch. bundle
  • Open existing project that has been exported already. Open the dev tools and network tab.
  • Re-export the project. Inspect the request. Ensure it only has Analyze files (which will be skipped on the server side), and no scenario files.
  • Add a new scenario. Re-export the project. Inspect the request. Ensure it has the new scenario files.
  • Edit an existing scenario (by adding a modification or something). Re-export the project. Inspect the request. Ensure it has the modified scenario files.
  • Make a new project and export to HydroShare. Inspect the request. Ensure it has all scenario files.

@rajadain
Copy link
Member Author

Since re-exporting existing projects takes longer than exporting them the first time (because all files have to be checked if they exist in HydroShare, and if so, deleted, then re-exported), limiting the number of files we send on the re-export significantly speeds up the process.

Copy link
Member

@caseycesari caseycesari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements. Performance is much better.

@rajadain
Copy link
Member Author

Thanks for the multiple thorough reviews! Going to merge this now.

@rajadain
Copy link
Member Author

Will squash fixups before merging.

Since we no longer autosync projects, and the user must
be prompted to export explicitly, we add a message in the
MultiShare Modal under the HydroShare section saying if
the project has changed since last export.

This message is only visible if the date of last export
is less recent than the latest modified_at date of all
the scenarios in the project. Thus, if any scenario is
modified after an export, this message will be shown.

To allow this, we now always update the client scenario
model with the modified_at date from the server.
Previously we always exported all scenarios for
simplicity. Unfortunately, as we bump up against
the limit of how many files can be exported before
a timeout, especially when the record already exists
and each export checks for a file, deletes it, and
then exports it again, being judicious about which
files we send ensures that we almost never time
out.
@rajadain rajadain force-pushed the tt/hydroshare-celerification branch from 2570370 to 20261d5 Compare August 15, 2018 16:52
@rajadain rajadain merged commit c944636 into develop Aug 15, 2018
@rajadain rajadain deleted the tt/hydroshare-celerification branch August 15, 2018 16:53
rajadain added a commit that referenced this pull request Dec 27, 2018
These were previously not required because all the HydroShare
export processing was done in Django from withiin App. However,
in #2897 it was changed to be done in Celery from within Worker.
Without these variables being initialized correctly, the export
does not work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants