Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing a scheduler interval when a interval action still using it would crash the scheduler #2520

Closed
novuscy opened this issue Apr 27, 2020 · 14 comments · Fixed by #2552
Closed
Assignees
Labels
1-low priority denoting isolated changes bug Something isn't working geneva hanoi Hanoi release support-services
Milestone

Comments

@novuscy
Copy link

novuscy commented Apr 27, 2020

🐞 Bug Report

Affected Services

support-scheduler

Description and Minimal Reproduction

If I remove an interval when an interval action still using it, but leave the action untouched, the scheduler won't start any more next time. I had to manually delete keys associated with the action in the redis database to make the scheduler work again.

Is there a mechanism preventing an interval being removed when it's still used by some actions? something like that you can't remove a profile when it's used by some devices.

🔥 Exception or Error


level=INFO ts=2020-04-27T08:03:21.127215Z app=edgex-support-scheduler source=loader.go:144 name:=scrub-aged-events message="scheduler could not find a interval with interval name : midnight" msg="problem adding support-scheduler interval action"
level=ERROR ts=2020-04-27T08:03:21.127264Z app=edgex-support-scheduler source=init.go:63 msg="Failed to load schedules and events scheduler could not find a interval with interval name : midnight"
level=INFO ts=2020-04-27T08:03:21.12741Z app=edgex-support-scheduler source=database.go:159 msg="Database disconnected"

🌍 Your Environment

Deployment Environment:
go 1.14
EdgeX Version:
master

Anything else relevant?

@novuscy novuscy added the bug Something isn't working label Apr 27, 2020
@AnthonyMBonafide AnthonyMBonafide added 1-low priority denoting isolated changes geneva support-services labels Apr 28, 2020
@AnthonyMBonafide AnthonyMBonafide added this to New Issues in Core WG via automation Apr 28, 2020
@AnthonyMBonafide AnthonyMBonafide added this to the EdgeX Geneva milestone Apr 28, 2020
@AnthonyMBonafide AnthonyMBonafide moved this from New Issues to Bugs Backlog in Core WG Apr 28, 2020
@akramtexas
Copy link
Member

@novuscy I am unable to recreate this issue. Can you please share more-detailed steps that you carried out when you saw this issue? Thanks.

@novuscy
Copy link
Author

novuscy commented Apr 29, 2020

@novuscy I am unable to recreate this issue. Can you please share more-detailed steps that you carried out when you saw this issue? Thanks.

  1. start a new edgex with core services and scheduler, the default config of scheduler contains midnight interval and 2 actions.
  2. get the ID of the midnight interval.
  3. delete the midnight interval by ID using DELETE method, it returned true and midnight interval was deleted.
  4. restart edgex, message="scheduler could not find a interval with interval name : midnight" msg="problem adding support-scheduler interval action" showed up and scheduler won't connect to the database .

If you delete the midnight interval by NAME, edgex actually prevents you doing that with an error of "interval still in use by intervalAction(s) name: midnight", but nothing stops you delete the interval by ID.

@novuscy
Copy link
Author

novuscy commented Apr 29, 2020

same thing happened when I used edgex-cli db purge to clean the database.

@novuscy
Copy link
Author

novuscy commented Apr 29, 2020

And if I update a interval action with a new interval, the old interval seemed somehow still linked to the action, when I try to delete the old interval, an error says that the old one is still in use, and the old interval could not be deleted neither by ID nor by NAME.

@akramtexas
Copy link
Member

@novuscy Your response to my query is much appreciated 👍 This helps - I'll investigate, and will follow up, thanks.

@akramtexas
Copy link
Member

akramtexas commented Apr 29, 2020

@novuscy I followed the steps, for example:

127.0.0.1:6379> DEL 892084a1-6972-4dd6-990f-da9bfd8b0d26
(integer) 1
127.0.0.1:6379> DEL 892084a1-6972-4dd6-990f-da9bfd8b0d26
(integer) 0
  • But after that, support-scheduler service starts just fine:
level=INFO ts=2020-04-29T14:26:26.986004Z app=edgex-support-scheduler source=loader.go:38 msg="loading intervals, interval actions ..."
level=INFO ts=2020-04-29T14:26:26.986847Z app=edgex-core-data source=init.go:122 msg="Connected to zero Message Bus @ tcp://*:5563 publishing on 'events' topic"
level=INFO ts=2020-04-29T14:26:26.986884Z app=edgex-core-data source=telemetry.go:86 msg="Telemetry starting"
level=INFO ts=2020-04-29T14:26:26.986909Z app=edgex-core-data source=httpserver.go:89 msg="Web server starting (localhost:48080)"
level=INFO ts=2020-04-29T14:26:26.986932Z app=edgex-core-data source=message.go:50 msg="Service dependencies resolved..."
level=INFO ts=2020-04-29T14:26:26.987033Z app=edgex-core-data source=message.go:51 msg="Starting edgex-core-data master "
level=INFO ts=2020-04-29T14:26:26.98706Z app=edgex-core-data source=message.go:55 msg="This is the Core Data Microservice"
level=INFO ts=2020-04-29T14:26:26.987068Z app=edgex-core-data source=message.go:58 msg="Service started in: 3.144534ms"
level=INFO ts=2020-04-29T14:26:26.987924Z app=edgex-support-scheduler source=schedule.go:162 msg="added the interval with id : a1f2280e-67a8-4505-a7a0-1de44341b7f8 into the scheduler queue"
level=INFO ts=2020-04-29T14:26:26.987958Z app=edgex-support-scheduler source=loader.go:130 name=midnight id=a1f2280e-67a8-4505-a7a0-1de44341b7f8 msg="added interval"
level=INFO ts=2020-04-29T14:26:26.988302Z app=edgex-support-scheduler source=schedule.go:315 msg="added the intervalAction with id: 7784c4e4-3e7e-4468-aaec-2afadcb03d73 to interal: midnight into the queue"
level=INFO ts=2020-04-29T14:26:26.988322Z app=edgex-support-scheduler source=loader.go:152 name=scrub-pushed-events id=7784c4e4-3e7e-4468-aaec-2afadcb03d73 msg="added interval action"
level=INFO ts=2020-04-29T14:26:26.988333Z app=edgex-support-scheduler source=schedule.go:315 msg="added the intervalAction with id: cfe0a814-0824-43b5-b3f1-b428a31b8042 to interal: midnight into the queue"
level=INFO ts=2020-04-29T14:26:26.988342Z app=edgex-support-scheduler source=loader.go:152 name=scrub-aged-events id=cfe0a814-0824-43b5-b3f1-b428a31b8042 msg="added interval action"
level=INFO ts=2020-04-29T14:26:26.988352Z app=edgex-support-scheduler source=loader.go:58 msg="finished loading intervals, interval actions"
level=INFO ts=2020-04-29T14:26:26.98837Z app=edgex-support-scheduler source=telemetry.go:86 msg="Telemetry starting"
level=INFO ts=2020-04-29T14:26:26.988443Z app=edgex-support-scheduler source=httpserver.go:89 msg="Web server starting (localhost:48085)"
level=INFO ts=2020-04-29T14:26:26.988466Z app=edgex-support-scheduler source=message.go:50 msg="Service dependencies resolved..."
level=INFO ts=2020-04-29T14:26:26.988544Z app=edgex-support-scheduler source=message.go:51 msg="Starting edgex-support-scheduler master "
level=INFO ts=2020-04-29T14:26:26.988565Z app=edgex-support-scheduler source=message.go:55 msg="This is the Support Scheduler Microservice"
level=INFO ts=2020-04-29T14:26:26.988577Z app=edgex-support-scheduler source=message.go:58 msg="Service started in: 5.421422ms"
  • So I'm not seeing the issue you have reported.
  • Please fee free to reach out to me via Slack and we can compare notes to make this go faster (I couldn't find you on the Edgex Slack channel... My coordinates on Slack are @Akram Ahmad)

@akramtexas
Copy link
Member

  • Adding some details about getting the ID of the midnight interval, followed by deleting the midnight interval by ID, using DEL method that I mention and outline above:
{
  "id": "c92f450c-9c2d-4832-a7c6-d74dff82ba42",
  "created": 1588169225983,
  "modified": 1588169225983,
  "name": "scrub-aged-events",
  "interval": "midnight",
  "target": "core-data",
  "protocol": "http",
  "httpMethod": "DELETE",
  "address": "localhost",
  "port": 48080,
  "path": "/api/v1/event/removeold/age/604800000"
}

And:

{
    "id": "a48fad40-6b87-4b7c-9548-54f074dd3b2d",
    "created": 1588169225983,
    "modified": 1588169225983,
    "name": "scrub-pushed-events",
    "interval": "midnight",
    "target": "core-data",
    "protocol": "http",
    "httpMethod": "DELETE",
    "address": "localhost",
    "port": 48080,
    "path": "/api/v1/event/scrub"
}

@novuscy
Copy link
Author

novuscy commented Apr 30, 2020

Thanks for your investigation into this, but I was using REST API to delete the interval ,not the redis cli.

@akramtexas
Copy link
Member

@novuscy I will follow up, thanks.

@akramtexas
Copy link
Member

akramtexas commented Apr 30, 2020

  • Circling right back as I was triaging another issue. Yes, I have recreated what you observed:
level=INFO ts=2020-04-30T19:44:23.401166Z app=edgex-support-scheduler source=loader.go:144 name:=scrub-pushed-events message="scheduler could not find a interval with interval name : midnight" msg="problem adding support-scheduler interval action"
level=ERROR ts=2020-04-30T19:44:23.401193Z app=edgex-support-scheduler source=init.go:63 msg="Failed to load schedules and events scheduler could not find a interval with interval name : midnight"
  • Investigating root cause 👍

@akramtexas
Copy link
Member

The root cause has two parts:

  • The portion of the application which performs the delete does not first perform a check as to whether the Interval is in use.
  • The REST handler does not understand and handle errors where the Interval is in use.

@akramtexas
Copy link
Member

akramtexas commented May 5, 2020

While working on a fix, and thanks to input from @AnthonyMBonafide, a possible race condition has been discovered surrounding the area related to this issue. That should separately be investigated at some point in time.

@akramtexas
Copy link
Member

  • I have put together a fix for this bug, functionally tested the fix, and am currently working on wrapping up the newly-added unit test cases, following which a PR will be submitted ✅

@akramtexas
Copy link
Member

akramtexas commented May 13, 2020

  • Referring to the previous comment (above), we decided that there may be a bit much work remaining (unit testing, etc.) for this fix to make it into the Geneva release.
  • With that in mind, I am documenting here the workaround to this issue, noting that the impact of this bug is low (for anyone who might use the support-scheduler service).
  • First, a recommendation for the user: Use the REST call to delete by NAME (instead of the REST call to delete scheduler interval by ID)
  • Providing here a related issue to be aware of, related to the deleting an in-use interval and how it puts the system in an invalid state, that issue being related to consecutive delete operations, in light of how, when an in-memory interval is deleted, the DB persists it nonetheless: Delete an In Use Interval Puts System In An Invalid State #2554
  • The problem here (i.e. in Issue-2520) is itself documented in detail in the Description atop the issue.
  • The underlying problem is that the REST call to delete by ID does not perform a check to ascertain that the interval is currently in use.
  • Here is the recommended path for the user to take

interacting with the support-scheduler service.

  • At the outset of interacting with the support-scheduler service, first use the REST API to get the entire interval information, and use the interval's NAME, instead of the ID (for deleting the interval).
  • For example, let's say that the ID happens to have the value 71e5c882-61b8-4d98-a3b7-2aa68d60c5e6.
  • Then issue this cURL command (using the ID value above, to get the full interval information:
curl --location --request GET 'http://localhost:48085/api/v1/interval/71e5c882-61b8-4d98-a3b7-2aa68d60c5e6'
  • For example, let's say we get the following response on issuing the GET call above:
{
    "Timestamps": {
        "created": 1589304811946,
        "modified": 1589304811946
    },
    "id": "71e5c882-61b8-4d98-a3b7-2aa68d60c5e6",
    "name": "midnight",
    "start": "20180101T000000",
    "frequency": "24h"
}
  • From this response received, make a note of the interval's "name", which is "midnight", so we don't lose that information.
  • Use "midnight" to delete scheduler interval (i.e. delete by 'Name').
  • But suppose that deletion of interval (by ID) has already been done.
  • In that scenario, re-add the interval.
  • Restart the scheduler service.

@AnthonyMBonafide AnthonyMBonafide added the hanoi Hanoi release label May 21, 2020
Core WG automation moved this from Bugs Backlog to Done May 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1-low priority denoting isolated changes bug Something isn't working geneva hanoi Hanoi release support-services
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants