[NA] failed to update serviceName: DEFAULT_GROUP@@prod-zipkin #3876

Closed
hanzhen1989 opened this issue Sep 21, 2020 · 14 comments · Fixed by #4864 or #4866
Labels: contribution welcome, kind/bug (Category issues or prs related to bug.), kind/research

Comments

@hanzhen1989

Describe the bug
There is no zipkin among my services, but occasionally an error is reported saying that the zipkin service cannot be updated. The error message:
[NA] failed to update serviceName: DEFAULT_GROUP@@prod-zipkin.xxxxx.com
java.lang.NullPointerException: null
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at com.alibaba.nacos.client.naming.core.HostReactor.processServiceJson(HostReactor.java:128)
at com.alibaba.nacos.client.naming.core.HostReactor.updateServiceNow(HostReactor.java:333)
at com.alibaba.nacos.client.naming.core.HostReactor$UpdateTask.run(HostReactor.java:397)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Desktop (please complete the following information):

  • OS: Alibaba Cloud Kubernetes
  • Version: nacos-client 1.3.1; spring-cloud-starter-alibaba-nacos-discovery 2.1.2.RELEASE
  • Module: naming
@KomachiSion
Collaborator

What's your version of nacos-server?

@hanzhen1989
Author

The nacos-server version is 1.3.1.

@realJackSun
Collaborator

Is the 'zipkin' blocked due to network problems? I think you should check whether health check fails.

@hanzhen1989
Author

Is the 'zipkin' blocked due to network problems? I think you should check whether health check fails.

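I don't quite follow what you mean. None of my services is zipkin, yet for some reason the client obtained zipkin and tried to update it. Even so, an update should not throw an error.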

@KomachiSion
Collaborator

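These are really two separate questions:

  1. Why is zipkin being updated?
  • Your own application may not use it, but a component you depend on may have subscribed to zipkin in Nacos.
  2. Why does the update fail?
  • This may be a bug: the serviceName in the returned response is null. We need to reproduce it to pin down the scenario in which an empty service name is returned.

For the second question, could you call the API for us and check what it returns?

curl -X GET '${ip}:${port}/nacos/v1/ns/instance/list?serviceName=DEFAULT_GROUP@@prod-zipkin'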
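
The same request URL can also be built in code. Below is a minimal sketch, assuming a hypothetical helper class `InstanceListUrl` and placeholder host and port values; only the path and the `serviceName` parameter come from the curl command above. It percent-encodes the `@@` separator so the service name survives as a single query parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class InstanceListUrl {
    /** Builds the instance-list URL for a service name (hypothetical helper, not part of Nacos). */
    static String build(String host, int port, String serviceName) {
        try {
            // URLEncoder turns '@' into %40 and leaves letters, digits, '-', '_' and '.' unchanged.
            String encoded = URLEncoder.encode(serviceName, "UTF-8");
            return "http://" + host + ":" + port
                    + "/nacos/v1/ns/instance/list?serviceName=" + encoded;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Host and port are placeholders; substitute your own server address.
        System.out.println(build("127.0.0.1", 8848, "DEFAULT_GROUP@@prod-zipkin"));
    }
}
```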

@hanzhen1989
Author

The response is:
{
"name": "DEFAULT_GROUP@@prod-zipkin.xxxxx.com",
"clusters": "",
"hosts": []
}
The problem is not constant; the error occurs only with very low probability.

@KomachiSion
Collaborator

I have never run into this problem, and I can't think of a situation that would cause it. We will first have to work out how to reproduce it.

@chuntaojun
Collaborator

Please post your project's configuration file and your Sleuth version.

@hanzhen1989
Author

Sleuth is 2.1.1.RELEASE.
Configuration file:
spring:
  # zipkin configuration
  zipkin:
    base-url: http://prod-zipkin.xxxx.com
  sleuth:
    web:
      client:
        enabled: true
    sampler:
      # The default sampling probability is 0.1, so not all request data is visible.
      # Setting it to 1 shows all request data, but increases call latency.
      probability: 1.0
  cloud:
    nacos:
      discovery:
        # Nacos registry address
        server-addr: ip:port
        # namespace to register in
        namespace: faaaecea-3e1d-484d-b65f-174ff2c9c091
        # username and password will be moved to the startup command later
        username: nacos

# expose actuator monitoring endpoints
management:
  endpoints:
    web:
      exposure:
        include: '*'

# feign configuration
feign:
  hystrix:
    enabled: true
  okhttp:
    enabled: true
  httpclient:
    enabled: false
  client:
    config:
      default:
        connectTimeout: 10000
        readTimeout: 10000
  # GZIP request compression
  compression:
    request:
      enabled: true
    response:
      enabled: true

# hystrix configuration
hystrix:
  command:
    default:
      execution:
        isolation:
          # thread-pool isolation or semaphore isolation
          strategy: SEMAPHORE
          thread:
            timeoutInMilliseconds: 60000
          semaphore:
            maxConcurrentRequests: 1000
  shareSecurityContext: true

# request processing timeouts
ribbon:
  ReadTimeout: 10000
  ConnectTimeout: 10000
  # ribbon refresh interval
  ServerListRefreshInterval: 5000

@daisy1949

I ran into the same problem. In our case the evaluate service does exist.
failed to update serviceName: DEFAULT_GROUP@@evaluate
java.lang.NullPointerException: null
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936) ~[?:1.8.0_212]
at com.alibaba.nacos.client.naming.core.HostReactor.processServiceJson(HostReactor.java:128) ~[nacos-client-1.3.1.jar!/:?]
at com.alibaba.nacos.client.naming.core.HostReactor.updateServiceNow(HostReactor.java:333) [nacos-client-1.3.1.jar!/:?]
at com.alibaba.nacos.client.naming.core.HostReactor$UpdateTask.run(HostReactor.java:397) [nacos-client-1.3.1.jar!/:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_212]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_212]
at java.util.concur"

@KomachiSion
Collaborator

It may be caused by line 755 of InstanceController in 1.4.0.
The server returns an empty JSON result if it is checking reachProtectThreshold.

@KomachiSion added the kind/bug and contribution welcome labels on Nov 16, 2020
@realJackSun
Collaborator

I will solve it.

@realJackSun
Collaborator

realJackSun commented Feb 3, 2021

It may be caused by line 755 of InstanceController in 1.4.0.
The server returns an empty JSON result if it is checking reachProtectThreshold.

I think @KomachiSion is right.
This problem occurs when the "isCheck" field is set to true.

The 1.4.0 logic is as follows:

        if ((float) ipMap.get(Boolean.TRUE).size() / srvedIPs.size() <= threshold) {
            
            Loggers.SRV_LOG.warn("protect threshold reached, return all ips, service: {}", serviceName);
            if (isCheck) {
                result.put("reachProtectThreshold", true);
            }
            
            ipMap.get(Boolean.TRUE).addAll(ipMap.get(Boolean.FALSE));
            ipMap.get(Boolean.FALSE).clear();
        }
        if (isCheck) {
            result.put("protectThreshold", service.getProtectThreshold());
            result.put("reachLocalSiteCallThreshold", false);
            return JacksonUtils.createEmptyJsonNode();
        }
        
        ArrayNode hosts = JacksonUtils.createEmptyArrayNode();
        
        for (Map.Entry<Boolean, List<Instance>> entry : ipMap.entrySet()) {
            .......
        }
        
        result.replace("hosts", hosts);
        ......
        result.put("name", serviceName);
        ......
        return result;

If isCheck is true, the following block is executed:

        if (isCheck) {
            result.put("protectThreshold", service.getProtectThreshold());
            result.put("reachLocalSiteCallThreshold", false);
            return JacksonUtils.createEmptyJsonNode();
        }

In this case the client gets an empty JSON response. The parsed ServiceInfo then has no key, and because ConcurrentHashMap.get does not accept a null key, the following part of HostReactor.java throws a NullPointerException.

    public ServiceInfo processServiceJson(String json) {
        ServiceInfo serviceInfo = JacksonUtils.toObj(json, ServiceInfo.class);
        ServiceInfo oldService = serviceInfoMap.get(serviceInfo.getKey());
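
The failure can be reproduced without Nacos. The following is a minimal sketch, assuming a hypothetical stand-in class `FakeServiceInfo` whose key is null when the response carries no name; it is not the real `ServiceInfo`. It only shows that `ConcurrentHashMap.get` rejects a null key, which matches the top frame of the stack traces above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullKeyRepro {
    /** Hypothetical stand-in for ServiceInfo: the key is null when no name was returned. */
    static final class FakeServiceInfo {
        private final String name;

        FakeServiceInfo(String name) {
            this.name = name;
        }

        String getKey() {
            return name;
        }
    }

    /** Returns true if looking up the info's key throws a NullPointerException. */
    static boolean lookupThrows(Map<String, FakeServiceInfo> map, FakeServiceInfo info) {
        try {
            map.get(info.getKey());
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, FakeServiceInfo> serviceInfoMap = new ConcurrentHashMap<>();
        serviceInfoMap.put("DEFAULT_GROUP@@evaluate", new FakeServiceInfo("DEFAULT_GROUP@@evaluate"));

        // Normal response: the key is present, so the lookup succeeds.
        System.out.println(lookupThrows(serviceInfoMap, new FakeServiceInfo("DEFAULT_GROUP@@evaluate")));
        // Empty response: the key is null, so ConcurrentHashMap.get throws.
        System.out.println(lookupThrows(serviceInfoMap, new FakeServiceInfo(null)));
    }
}
```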

There are two ways to solve this problem:

1. Modify the server code
Delete the following line:

return JacksonUtils.createEmptyJsonNode();

After deleting it, the logic becomes: if the protection threshold is reached, return all of the IPs (both healthy and unhealthy instances). Part of the traffic then goes to unhealthy instances and is lost, but the healthy instances are protected.
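
The threshold behavior described here can be sketched in isolation. This is an illustrative sketch only, assuming a hypothetical `selectInstances` helper that works on plain lists of addresses instead of Nacos `Instance` objects; the comparison mirrors the one in the 1.4.0 excerpt above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProtectThresholdSketch {
    /**
     * Returns the addresses a caller would receive.
     * The lists are hypothetical stand-ins for the healthy and unhealthy instance groups.
     */
    static List<String> selectInstances(List<String> healthy, List<String> unhealthy, float threshold) {
        int total = healthy.size() + unhealthy.size();
        List<String> result = new ArrayList<>(healthy);
        // Healthy ratio at or below the threshold: return unhealthy instances as well.
        if (total > 0 && (float) healthy.size() / total <= threshold) {
            result.addAll(unhealthy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> healthy = Arrays.asList("10.0.0.1");
        List<String> unhealthy = Arrays.asList("10.0.0.2", "10.0.0.3", "10.0.0.4");

        // 1 of 4 healthy (0.25) is below 0.5, so all four addresses are returned.
        System.out.println(selectInstances(healthy, unhealthy, 0.5f).size());
        // With a threshold of 0.1 the ratio is above it, so only the healthy address is returned.
        System.out.println(selectInstances(healthy, unhealthy, 0.1f).size());
    }
}
```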

2. Modify the client code
Change these lines
    public ServiceInfo processServiceJson(String json) {
        ServiceInfo serviceInfo = JacksonUtils.toObj(json, ServiceInfo.class);
        ServiceInfo oldService = serviceInfoMap.get(serviceInfo.getKey());

to

    public ServiceInfo processServiceJson(String json) {
        ServiceInfo serviceInfo = JacksonUtils.toObj(json, ServiceInfo.class);
        // An empty response has no key, and ConcurrentHashMap.get(null) would throw.
        ServiceInfo oldService = null;
        if (serviceInfo.getKey() != null) {
            oldService = serviceInfoMap.get(serviceInfo.getKey());
        }
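
The guard in the client-side change can be tried out on its own. A minimal sketch, assuming a hypothetical generic helper `getOrNull` rather than the actual `HostReactor` code: it skips the map lookup whenever the key is null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafeLookup {
    /** Returns the cached value for the key, or null when the key itself is null. */
    static <V> V getOrNull(Map<String, V> cache, String key) {
        // ConcurrentHashMap does not permit null keys, so check before calling get.
        return key == null ? null : cache.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        cache.put("DEFAULT_GROUP@@evaluate", "cached entry");

        System.out.println(getOrNull(cache, "DEFAULT_GROUP@@evaluate"));
        System.out.println(getOrNull(cache, null));
    }
}
```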

@realJackSun
Collaborator

PR #4864 corresponds to solution 1 (modify the server code).
PR #4866 corresponds to solution 2 (modify the client code).

realJackSun added a commit to realJackSun/nacos that referenced this issue Feb 4, 2021
@KomachiSion KomachiSion reopened this Feb 4, 2021
realJackSun added a commit to realJackSun/nacos that referenced this issue Feb 4, 2021
5 participants