Restore python environment state after container restart due to OOM #213

Merged: 9 commits, Jan 19, 2022

@sanketsudake sanketsudake commented Dec 23, 2021

Signed-off-by: Sanket Sudake <sanketsudake@gmail.com>

  • When the function pod starts, we mount the volume /userfunc, which persists across reboots.
  • When a pod gets specialized, we store the function specialization info we receive as state in /userfunc/state.json.
  • state.json persists across reboots and container restarts, as do the function artifacts.
  • So if the container is restarted due to OOM, we can load the state again after the server starts, if state.json exists.
  • Using state.json, the pod can be specialized again and should be ready to serve requests.
  • We also capture signals such as SIGTERM and SIGINT; on receiving them, we clean up the state.
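The steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names (`store_state`, `load_state`, `clear_state`, `install_cleanup_handlers`) are hypothetical, the file name is taken from the "Found state.json" line in the pod logs, and the write-then-rename step is an added assumption to avoid leaving a truncated file behind.

```python
import json
import os
import signal
import sys
import tempfile

# Assumption: same default mount point as the test code in this PR.
USERFUNCVOL = os.environ.get("USERFUNCVOL", "/userfunc")
STATE_FILE = "state.json"  # name taken from the "Found state.json" log line


def state_path(volume=USERFUNCVOL):
    """Location of the persisted specialization info (hypothetical helper)."""
    return os.path.join(volume, STATE_FILE)


def store_state(info, volume=USERFUNCVOL):
    """Persist the specialization request so a restarted container can replay it."""
    # Write to a temporary file first, then rename, so a crash in the middle
    # of the write does not leave a half-written state file.
    fd, tmp = tempfile.mkstemp(dir=volume)
    with os.fdopen(fd, "w") as f:
        json.dump(info, f)
    os.replace(tmp, state_path(volume))


def load_state(volume=USERFUNCVOL):
    """Return the saved specialization info, or None if none was stored."""
    try:
        with open(state_path(volume)) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def clear_state(volume=USERFUNCVOL):
    """Remove the saved state; a missing file is not an error."""
    try:
        os.remove(state_path(volume))
    except FileNotFoundError:
        pass


def install_cleanup_handlers(volume=USERFUNCVOL):
    """Clean up the state on SIGTERM/SIGINT, then exit."""
    def handler(signum, frame):
        clear_state(volume)
        sys.exit(0)

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
```

On server start, the environment would call `load_state()` and, if it returns something other than `None`, run its usual specialization routine with that info before accepting requests. An OOM kill is delivered as SIGKILL, which cannot be caught, so the cleanup handlers do not run in that case and the state file is still there when the container comes back.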

This PR reduces 502 errors after an environment container restarts. The same approach can be used in other environments.


Testing Done

code.py

import os
import json

USERFUNCVOL = os.environ.get("USERFUNCVOL", "/userfunc")

def store_first():
    json.dump({"first_call": "call"}, open(os.path.join(USERFUNCVOL, "first.json"), "w"))

def check_specialize_info_exists2():
    return os.path.exists(os.path.join(USERFUNCVOL, "first.json"))

def main():
    print("Function call received")
    if check_specialize_info_exists2():
        print("Specialization info exists")
        # 2nd function call should return this
        return "Main2"
    store_first()
    a = []
    while True:
        a.append(' ' * 10**6)  # Allocates roughly 1 MB per iteration; causes an OOM within seconds
    # First function call should OOM, so this return is never reached
    return "Main1"

Function creation and invocation

$ fission env create --name python --image tripples/python-env:sanket-dev --mincpu 40 --maxcpu 80   --minmemory 64 --maxmemory 96  --poolsize 1 --version 3
$ fission fn create --name oom-fn --env python --code code.py
$ cat code.py

$ fission fn test --name oom-fn # First call to function
Error: Error calling function oom-fn: 500; Please try again or fix the error: error sending request to function
Error: error getting function response
$ fission fn test --name oom-fn # Second call to function
Main2

Function pod logs

+ poolmgr-python-default-20826-5c5d4d6686-hc2jf › python
+ poolmgr-python-default-20826-5c5d4d6686-hc2jf › fetcher
fetcher {"level":"info","ts":"2021-12-24T06:30:29.714Z","caller":"otel/provider.go:202","msg":"OTEL_EXPORTER_OTLP_ENDPOINT not set, skipping Opentelemtry tracing"}
fetcher {"level":"info","ts":"2021-12-24T06:30:29.716Z","caller":"fetcher/main.go:34","msg":"fetcher ready to receive requests"}
python 2021-12-24 06:30:33,713 - INFO - Starting bjoern based server
fetcher {"level":"info","ts":"2021-12-24T06:37:28.013Z","logger":"fetcher","caller":"fetcher/fetcher.go:690","msg":"successfully placed","trace_id":"d624958ace2d4fc3d957fba1872fc483","location":"/userfunc/deployarchive"}
fetcher {"level":"info","ts":"2021-12-24T06:37:28.013Z","logger":"fetcher","caller":"fetcher/fetcher.go:243","msg":"calling environment v2 specialization endpoint","trace_id":"d624958ace2d4fc3d957fba1872fc483"}
python 2021-12-24 06:37:28,017 - INFO - specialize called with  filepath = "/userfunc/deployarchive"   handler = ""
python 2021-12-24 06:37:28,023 - DEBUG - moduleName = "main"    funcName = "main"
fetcher {"level":"info","ts":"2021-12-24T06:37:28.034Z","logger":"fetcher","caller":"fetcher/fetcher.go:735","msg":"specialize request done","trace_id":"d624958ace2d4fc3d957fba1872fc483","elapsed_time":0.032958641}
poolmgr-python-default-20826-5c5d4d6686-hc2jf python Function call received
- poolmgr-python-default-20826-5c5d4d6686-hc2jf › python
+ poolmgr-python-default-20826-5c5d4d6686-hc2jf › python
python 2021-12-24 06:38:01,229 - INFO - Found state.json
python 2021-12-24 06:38:01,322 - INFO - specialize called with  filepath = "/userfunc/deployarchive"   handler = ""
python 2021-12-24 06:38:01,323 - DEBUG - moduleName = "main"    funcName = "main"
python 2021-12-24 06:38:01,324 - INFO - Loaded user function {'filepath': '/userfunc/deployarchive', 'functionName': '', 'url': '', 'FunctionMetadata': {'name': 'oom-fn', 'namespace': 'default', 'selfLink': '/apis/fission.io/v1/namespaces/default/functions/oom-fn', 'uid': 'cabb5151-4998-463c-bcf2-9e099ef88a36', 'resourceVersion': '22563', 'generation': 2, 'creationTimestamp': '2021-12-24T06:30:08Z', 'managedFields': [{'manager': 'fission-bundle', 'operation': 'Update', 'apiVersion': 'fission.io/v1', 'time': '2021-12-24T06:30:08Z', 'fieldsType': 'FieldsV1', 'fieldsV1': {'f:spec': {'.': {}, 'f:InvokeStrategy': {'.': {}, 'f:ExecutionStrategy': {'.': {}, 'f:ExecutorType': {}, 'f:MaxScale': {}, 'f:MinScale': {}, 'f:SpecializationTimeout': {}, 'f:TargetCPUPercent': {}}, 'f:StrategyType': {}}, 'f:concurrency': {}, 'f:environment': {'.': {}, 'f:name': {}, 'f:namespace': {}}, 'f:functionTimeout': {}, 'f:idletimeout': {}, 'f:package': {'.': {}, 'f:packageref': {'.': {}, 'f:name': {}, 'f:namespace': {}, 'f:resourceversion': {}}}, 'f:requestsPerPod': {}, 'f:resources': {}}}}]}, 'envVersion': 3}
python 2021-12-24 06:38:01,327 - INFO - Starting bjoern based server
python Function call received # Call after python container restart
python Specialization info exists
python Function call received
python Specialization info exists
python Function call received
python Specialization info exists

@sanketsudake changed the title from "Capture exit signal with python environment" to "Restore python environment state after container restart due to OOM" on Dec 24, 2021
Signed-off-by: Sanket Sudake <sanketsudake@gmail.com>