
NewDeployment - Package update issue in 0.8.0 #723

Closed
vishal-biyani opened this issue Jun 5, 2018 · 10 comments

Comments

@vishal-biyani
Collaborator

commented Jun 5, 2018

After the 0.8.0 release there is an issue with package updates for new-deployment executor functions, seen consistently across multiple retries. (Log of the failure: https://api.travis-ci.org/v3/job/388178258/log.txt)

The same test passed in the builds immediately before (https://api.travis-ci.org/v3/job/387779126/log.txt).

This issue is to investigate and track it.

Currently the following tests are failing consistently:

--- PASSED /test_fn_update/test_resource_change.sh
*** FAILED /test_fn_update/test_nd_pkg_update.sh
*** FAILED /test_fn_update/test_configmap_update.sh
*** FAILED /test_fn_update/test_secret_update.sh
--- PASSED /test_fn_update/test_poolmgr_nd.sh
--- PASSED /test_logging/test_function_logs.sh
@smruthi2187

Contributor

commented Jun 7, 2018

I think this is why the tests are failing on our CI cluster. Referring to this log for the explanation: https://api.travis-ci.org/v3/job/389024752/log.txt

  1. First, a function (in this log, hellopython-360269385) is created with executor type newdeploy. The resource version ends up being 41632529. A service, hellopython-360269385-usbzmidi, also gets created with this selector:
"selector": {
                    "environmentName": "python-358862774",
                    "environmentUid": "5494865e-69e6-11e8-91db-42010a80010f",
                    "executorInstanceId": "usbzmidi",
                    "executorType": "newdeploy",
                    "functionName": "hellopython-360269385",
                    "functionUid": "54a90c0e-69e6-11e8-91db-42010a80010f"
                }
  2. Now, when we invoke the function at route /hellopython-360269385, everything works as expected.
  3. Next, the function gets updated with new code (from "hello world" to "hello fission"). The resource version ends up being 41632617.
  4. When we invoke the function at route /hellopython-360269385, the router does recognize the different resource version and asks the executor for a new service. A new pod is spun up with the latest code; the old pod is still terminating. But the service URL returned to the router remains the same, hellopython-360269385-usbzmidi (as expected).
  5. Now, the router makes an HTTP request to this service URL, hellopython-360269385-usbzmidi. But since the old pod is still terminating, the service is still routing to the old pod IP. So the response received is "hello world" instead of "hello fission".

I ran a few tests on CI where I waited for the old pod to get deleted fully and then invoked the function with the HTTP request, and everything worked correctly, i.e. I did get "Hello Fission".

Two solutions I can think of:

  1. We can update the service to have a selector containing the resource version of the function pod, inside the fnUpdate function of newdeploymgr.go, after we do deploy.updateDeployment. (This means that when we do a deployment update, we need to wait to grab the new resource version, attach it to the pod labels, and then update the service object to include this label in the selector.) A rough sketch of this follows the list.
    Or,
  2. For environments that are used by new-deploy functions, the envGracePeriod should always be 0 so the old pod gets deleted as soon as the delete request is made.
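
A minimal sketch of option 1, assuming current client-go APIs; the label key functionResourceVersion and the function name syncServiceSelector are invented for illustration and this is not the actual newdeploymgr.go code:

// Hypothetical sketch of option 1; not Fission's actual implementation.
package executor

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const resourceVersionLabel = "functionResourceVersion" // invented label key

// syncServiceSelector stamps the function's new resource version onto the
// deployment's pod template labels and points the service selector at it, so
// the old (terminating) pods drop out of the service endpoints immediately.
func syncServiceSelector(ctx context.Context, kc kubernetes.Interface,
    ns, deployName, svcName, newResourceVersion string) error {

    // Read the deployment and stamp the new resource version onto the pod labels.
    depl, err := kc.AppsV1().Deployments(ns).Get(ctx, deployName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if depl.Spec.Template.Labels == nil {
        depl.Spec.Template.Labels = map[string]string{}
    }
    depl.Spec.Template.Labels[resourceVersionLabel] = newResourceVersion
    if _, err := kc.AppsV1().Deployments(ns).Update(ctx, depl, metav1.UpdateOptions{}); err != nil {
        return err
    }

    // Point the service selector at the new label value so only the new pods
    // receive traffic.
    svc, err := kc.CoreV1().Services(ns).Get(ctx, svcName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if svc.Spec.Selector == nil {
        svc.Spec.Selector = map[string]string{}
    }
    svc.Spec.Selector[resourceVersionLabel] = newResourceVersion
    _, err = kc.CoreV1().Services(ns).Update(ctx, svc, metav1.UpdateOptions{})
    return err
}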

Now, I don't know how this test worked all this while. Maybe the old pod was getting deleted faster (even before we issued the curl request for the updated function), so we never ran into this issue.

@vishal-biyani

Collaborator Author

commented Jun 8, 2018

Based on the findings above, PR #730 proved that the grace period made the bug disappear - but the underlying problem is something else.

Once the update to the function code happens, the old pod goes into a terminating state and the new pod comes up within the next 10-20 seconds. The terminating pod remains in that state for the duration of the grace period (6 minutes as of the last tests):

$kubectl get pods -w -n f-func-e7675c -o wide
NAME                                              READY   STATUS        RESTARTS   AGE   IP
hellopython-794508904-i4iwiygr-5b86dc5cf8-4nlck   2/2     Running       0          1m    10.52.5.150
hellopython-794508904-i4iwiygr-d97c9767-5pclb     2/2     Terminating   0          1m    10.52.3.106

The endpoint is almost instantly populated with the new pod IP address (in some cases it takes a second or two, but it gets populated pretty fast):

2018/06/08:10:46:05 Updating function with updated package
package 'test-deploy-pkg-zip-k7hr' updated
function 'hellopython-794508904' updated
2018/06/08:10:46:05 Waiting for deployment to update
Doing an HTTP GET on the function's route
Checking for valid response
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13    0    13    0     0      0      0 --:--:--  0:00:15 --:--:--     3
Resp Hello, world!
Name:         hellopython-794508904-i4iwiygr
Namespace:    f-func-e7675c
Labels:       environmentName=python-793686619
              environmentUid=2076d2a4-6b09-11e8-91db-42010a80010f
              executorInstanceId=i4iwiygr
              executorType=newdeploy
              functionName=hellopython-794508904
              functionUid=20890536-6b09-11e8-91db-42010a80010f
Annotations:  <none>
Subsets:
  Addresses:          10.52.5.150
  NotReadyAddresses:  <none>
  Ports:
    Name              Port  Protocol
    ----              ----  --------
    runtime-env-port  8888  TCP
    fetcher-port      8000  TCP
Events:  <none>

So then we look at iptables on each node. 10:46:05 is when the pod update was fired. Within 15-20 seconds, the new IP is populated in the iptables rules of all 6 nodes:

Fri Jun 8 10:46:23 UTC 2018:
Chain KUBE-SEP-WFCAE6PG3KLQWPF3 (1 references)
target     prot opt source               destination         
KUBE-MARK-MASQ  all  --  10.52.5.150          anywhere             /* f-func-e7675c/hellopython-794508904-i4iwiygr:runtime-env-port */
DNAT       tcp  --  anywhere             anywhere             /* f-func-e7675c/hellopython-794508904-i4iwiygr:runtime-env-port */ tcp to:10.52.5.150:8888

Also, the IP address of the old terminating pod was last seen in the node iptables rules at 10:46:15.

Even sending a curl request to the service IP address gives the correct and expected result:

curl 10.55.251.148
Hello, fission!

So it does not look like an issue with iptables update propagation. Investigating more.

vishal-biyani added a commit that referenced this issue Jun 8, 2018

soamvasani added a commit that referenced this issue Jun 8, 2018

@vishal-biyani

Collaborator Author

commented Jun 9, 2018

After ruling out iptables / Kubernetes networking as the problem, another test was done to see if the router is caching something. Although the router says it is reaching the service, it is probably caching the IP address of the target pod at some level (at the Go library level).

To confirm this hypothesis, here is the test conducted: after the function is updated in test_nd_pkg_update.sh - and the test for the new value starts - we kill the router. If there is some cache in the router, it will get cleared and we should get the new result (Hello Fission!). As confirmed by the test in build https://travis-ci.org/fission/fission/builds/389743949, this works, i.e. the router is indeed caching the IP of the pod somewhere and using that to connect.

After updating the function but before killing the router:

Events:  <none>
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    13  100    13    0     0   2298      0 --:--:-- --:--:-- --:--:--  2600

Hello, world!

Then we kill the router pod:

$ kubectl get pods -n f-9553bb
NAME                              READY     STATUS    RESTARTS   AGE
router-6f486fd6f4-vfbfq           1/1       Running   0          4m

$ kubectl delete pods router-6f486fd6f4-vfbfq -n f-9553bb
pod "router-6f486fd6f4-vfbfq" deleted

$ kubectl get pods -n f-9553bb
NAME                              READY     STATUS        RESTARTS   AGE
router-6f486fd6f4-bc5dm           1/1       Running       0          13s

This confirms the hypothesis: some cache in the Go library used by the router is the culprit.

The code in functionHandler.go around line 160 needs a further deep dive at the Go library level to determine how the connection is being cached and how that can be avoided.

		resp, err = transport.RoundTrip(req)
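
For background (illustrative, not the router's actual code or configuration): Go's http.Transport keeps idle TCP connections in a per-host pool and reuses them for later RoundTrip calls, so requests to the same service name can keep landing on the connection to the old pod IP even after the service endpoints have changed. The target URL below is a placeholder.

// Sketch: observe connection reuse on a shared http.Transport via httptrace.
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/http/httptrace"
)

func main() {
    transport := &http.Transport{} // keep-alives are enabled by default

    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            // Reused=true on the second request shows the pooled (possibly
            // stale) connection being picked up again.
            fmt.Printf("remote=%v reused=%v\n", info.Conn.RemoteAddr(), info.Reused)
        },
    }

    for i := 0; i < 2; i++ {
        req, err := http.NewRequest("GET", "http://example.com/", nil)
        if err != nil {
            panic(err)
        }
        req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

        resp, err := transport.RoundTrip(req)
        if err != nil {
            fmt.Println("request failed:", err)
            continue
        }
        // Drain and close the body so the connection goes back to the idle
        // pool and is eligible for reuse.
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }
}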
@vishal-biyani

Collaborator Author

commented Jun 14, 2018

As @smruthi2187 noted in https://api.travis-ci.org/v3/job/389041757/log.txt, although terminationGracePeriodSeconds was 360 seconds, the function was reaching the new pod much faster.

100    13  100    13    0     0      1      0  0:00:13  0:00:12  0:00:01     2
100    13  100    13    0     0      1      0  0:00:13  0:00:12  0:00:01     3
Hello, world!
2018/06/07:01:25:31 Updating the archive
updating: hello.py (stored 0%)
2018/06/07:01:25:31 Updating function with updated package
package 'test-deploy-pkg-zip-vvnb' updated
function 'hellopython-479816342' updated
2018/06/07:01:25:31 Waiting for deployment to update and pod to get recycled
Doing an HTTP GET on the function's route
Checking for valid response
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    15  100    15    0     0    563      0 --:--:-- --:--:-- --:--:--   576
Hello, fission!
2018/06/07:01:28:01 Update function for new deployment executor passed
[SUCCESS]: /test_fn_update/test_nd_pkg_update.sh
~/gopath/src/github.com/fission/fission

So although the change in keep-alive settings in #742 makes it work, we need to investigate the root cause and see if anything at the Linux iptables level is causing a cache. The plan is to use conntrack to investigate and gather some data.

vishal-biyani added this to the 0.9.0 milestone Jun 14, 2018

@vishal-biyani

Collaborator Author

commented Jun 20, 2018

Conntrack logs did not reveal much information in addition to what iptables already showed us. The current fix in #742 sets DisableKeepAlives to true, thus removing all connection caching - which is not good as it will cause performance degradation. Benchmarks with the setting enabled and disabled can be found here.

An alternate solution is to use CloseIdleConnections() on the transport, only in the case of function updates. The overall flow for this change looks like below (a rough sketch of the router side follows the list):

  1. On a function update which changes the IP address of the pod, the executor needs to close all idle connections in the router pods.
  2. The executor finds all pods behind the router service and calls an endpoint to flush connections on all router pods.
  3. All idle connections are closed (need to check if we can do this only at that function/route level).
  4. All further incoming requests for the function will use a fresh connection.
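
A minimal sketch of such a flush endpoint, assuming a shared transport inside the router; the route /v2/flushConnections, the port, and the handler wiring are invented for illustration and are not Fission's actual API:

// Hypothetical flush-connections endpoint on the router (sketch only).
package main

import (
    "log"
    "net/http"
)

type router struct {
    transport *http.Transport // shared transport used to proxy to function pods
}

// flushConnections closes all idle connections held by the router's transport,
// so the next request to an updated function opens a fresh TCP connection to
// the new pod IP instead of reusing one to the terminating pod.
func (r *router) flushConnections(w http.ResponseWriter, req *http.Request) {
    r.transport.CloseIdleConnections()
    w.WriteHeader(http.StatusOK)
}

func main() {
    rt := &router{transport: &http.Transport{}}

    mux := http.NewServeMux()
    mux.HandleFunc("/v2/flushConnections", rt.flushConnections)

    // The executor would call this endpoint on every router pod after a
    // function update (step 2 in the list above).
    log.Fatal(http.ListenAndServe(":8888", mux))
}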

vishal-biyani added a commit that referenced this issue Jun 21, 2018

Disable caching of connections from router to function pods (#742)
A temporary fix for #723 which needs to be optimized in future. Disables caching of connections from the router to function pod/service.
@vishal-biyani

Collaborator Author

commented Jun 21, 2018

As discussed here, we will use DisableKeepAlives for now and optimize later with the solution listed in the previous comment.

@life1347

Member

commented Jun 27, 2018

@vishal-biyani , from the issue (golang/go#13998) and the code change (https://go-review.googlesource.com/c/go/+/22670/2/src/net/http/transport.go#146), it looks like we only need to set a keep-alive timeout via transport.IdleConnTimeout to solve the problem.
(Wait for CI build result: https://travis-ci.org/fission/fission/builds/397229448)
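
For reference, a minimal sketch of that suggestion; the 30-second value is illustrative, not taken from this thread:

// Sketch: bound how long idle connections are kept, instead of disabling
// keep-alives entirely.
package main

import (
    "net/http"
    "time"
)

func main() {
    transport := &http.Transport{
        // Idle connections older than this are closed automatically, which
        // limits how long the router can keep reusing a connection to a pod
        // that has since been replaced.
        IdleConnTimeout: 30 * time.Second,
    }
    client := &http.Client{Transport: transport}
    _ = client // the router would use this client/transport for proxying
}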

@life1347

Member

commented Jun 27, 2018

Also, there is another Japanese post that mentions the DNS result cache problem: https://shogo82148.github.io/blog/2017/01/14/re-golang-dns-cache/

soamvasani removed this from the 0.9.0 milestone Jun 29, 2018

@soamvasani

Member

commented Jun 29, 2018

#742 seems to be a good enough fix for this release.

@vishal-biyani

Collaborator Author

commented Nov 15, 2018

I am closing this for now, but we can revisit it in the future if needed.
