How do you move configuration from cluster A to cluster B ? #10714

Closed
fabltd opened this issue May 16, 2023 · 28 comments · Fixed by #10884

@fabltd

fabltd commented May 16, 2023

Hi

I have EMQX V5 installed in our dev cluster and would like to migrate the config to the prod cluster.

Whilst I can find the configuration in the pod, I cannot find out where it is stored or how it is shared between the replicas. It's also not documented?

How do you move configuration from cluster A to cluster B ?

Thank you

@fabltd fabltd added the BUG label May 16, 2023
@Rory-Z Rory-Z added help wanted and removed BUG labels May 16, 2023
@Rory-Z Rory-Z transferred this issue from emqx/emqx-operator May 16, 2023
@Rory-Z Rory-Z changed the title Documentation for K8s Operator How do you move configuration from cluster A to cluster B ? May 16, 2023
@HJianBo
Member

HJianBo commented May 16, 2023

Hi, @fabltd Do you know which version of EMQX you are currently using?

The current version (5.0.x) does not have a configuration migration feature. If you need to migrate the configuration in this version, you will need to do it manually.

For example, manually merge /opt/emqx/etc/emqx.conf and /opt/emqx/data/configs/cluster.hocon (or /opt/emqx/data/configs/cluster-override.conf) inside the emqx container, and copy the result to the new node's /opt/emqx/etc/emqx.conf file.


Updates: Added a Feature Request label. We will try delivering this kind of functionality in v5.1.0
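
The manual merge described above could look roughly like the sketch below. It assumes the files have already been copied out of the container (for example with kubectl cp), that none of them wraps its content in root braces, and that HOCON's "later definition wins" rule is what you want, so the file that should take precedence goes last in the list. In 5.0.x that is assumed to be cluster-override.conf; check the precedence for your release. The name merge_hocon_files is made up for illustration and is not an EMQX tool.

```python
from pathlib import Path


def merge_hocon_files(sources: list[Path], target: Path) -> str:
    """Concatenate HOCON files into one file.

    Files later in `sources` are written later in the output, so under
    HOCON's override rule their keys win over keys from earlier files.
    """
    sections = []
    for source in sources:
        if not source.exists():
            # A release may ship only one of cluster.hocon / cluster-override.conf.
            continue
        text = source.read_text(encoding="utf-8").strip()
        if text:
            # '#' starts a comment in HOCON, so this marker line is harmless.
            sections.append(f"## merged from {source.name}\n{text}\n")
    merged = "\n".join(sections)
    target.write_text(merged, encoding="utf-8")
    return merged
```

For example, merge_hocon_files([Path("emqx.conf"), Path("cluster-override.conf")], Path("emqx.merged.conf")) writes a single file that can then be reviewed by hand before it is copied to the new node as /opt/emqx/etc/emqx.conf.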

@fabltd
Author

fabltd commented May 17, 2023

Hi

Yes, using V5.0.25. I can see that the data is not persistent; it's using an empty dir.

I have implemented the config below to add persistence. The PVC is created and I see 3 x disks.

However, EMQX won't start; it's in a continuous CrashLoopBackOff due to the following error:

mkdir: cannot create directory ‘/opt/emqx/data/configs’: Permission denied

Any idea why it cannot write?

apiVersion: apps.emqx.io/v2alpha1
kind: EMQX
metadata:
    name: emqx
# Core
spec:
    image: emqx/emqx:5.0.25
    coreTemplate:
      spec:
        volumeClaimTemplates:
          storageClassName: standard
          resources:
            requests:
              storage: 20Mi
          accessModes:
            - ReadWriteOnce
        replicas: 3
# BootStrap Config
    bootstrapConfig: |
        dashboard {
          default_username: "admin"
          default_password: "public"
        }
# Dashboard
    dashboardServiceTemplate:
      metadata:
        name: emqx-dashboard
      spec:
        type: NodePort
        selector:
          apps.emqx.io/db-role: core
        ports:
          - name: "dashboard-listeners-http-bind"
            protocol: TCP
            port: 18083
            targetPort: 18083
            nodePort: 30008
# Listeners
    listenersServiceTemplate:
      metadata:
        name: emqx-listeners
      spec:
        type: LoadBalancer
        ports:
          - name: "tcp-default"
            protocol: TCP
            port: 1883
            targetPort: 1883

@Rory-Z
Member

Rory-Z commented May 17, 2023

Hi @fabltd Please check this: emqx/emqx-operator#716

@fabltd
Author

fabltd commented May 17, 2023

@Rory-Z - Thanks, that fixed it. Why is it not in the docs?

@Rory-Z
Member

Rory-Z commented May 17, 2023

Hi @fabltd This is in the documentation: https://docs.emqx.com/en/emqx-operator/latest/deployment/on-aws-eks.html#quickly-deploy-an-emqx-cluster

Or please let me know which document you were reading; maybe we missed it there.

@fabltd
Author

fabltd commented May 17, 2023

@Rory-Z

Yes, it's not mentioned in the link above or here:

https://github.com/emqx/emqx-operator/blob/main/docs/en_US/tasks/configure-emqx-persistence.md

It should be added to this doc?

@Rory-Z
Member

Rory-Z commented May 17, 2023

You can open a new PR against the main-2.1 branch of emqx/emqx-operator.git.
On the main branch of emqx/emqx-operator.git, the EMQX Operator already sets a default value for podSecurityContext, but that change has not been released yet.

@fabltd
Author

fabltd commented May 17, 2023

@Rory-Z not sure if you can help with the original ask:

In my dev cluster the config appears to all be in a file called cluster-override.conf.

I have copied this to the prod cluster and restarted the pods, but none of my dev rules are showing.

Any ideas?

@Rory-Z
Member

Rory-Z commented May 17, 2023

> @Rory-Z not sure if you can help with the original ask:
>
> In my dev cluster the config appears to all be in a file called cluster-override.conf.
>
> I have copied this to the prod cluster and restarted the pods, but none of my dev rules are showing.
>
> Any ideas?

Copying cluster-override.conf is the right way, but I also don't know why the rules are missing.
@zhongwencool Any ideas?

@zhongwencool
Member

You should stop all nodes and then copy cluster-override.conf;
otherwise the restarted node will copy cluster-override.conf from a node that is still running with the old configuration.

@Rory-Z
Member

Rory-Z commented May 17, 2023

> you should stop all nodes, then copy cluster-override.conf, otherwise the restart node will copy the old running node's cluster-override.conf

Maybe you can copy the content of cluster-override.conf into .spec.bootstrapConfig of the apps.emqx.io/v2alpha1 EMQX resource (for a bare EMQX node, that is etc/emqx.conf) and create a new cluster?

@fabltd
Author

fabltd commented May 18, 2023

I was unable to get cluster-override.conf to work. I understand later releases of V5 have moved to the file cluster.hocon.

As a test I built a test cluster and configured some options. Following this I built a new cluster and migrated the .hocon file.

However, this did not work as expected; the new cluster gives the following error:

500 INTERNAL_ERROR:error, function_clause, [{emqx_rule_engine_api,'-get_rule_metrics/1-fun-0-',['emqx@10.20.0.7',#{counters => #{},gauges => #{},rate => #{current => 0.0,last5m => 0.0,max => 0.0},slides => #{}}],[{file,"emqx_rule_engine_api.erl"},{line,524}]},{emqx_rule_engine_api,'-get_rule_metrics/1-lc$^1/1-0-',3,[{file,"emqx_rule_engine_api.erl"},{line,567}]},{emqx_rule_engine_api,'/rules/:id/metrics',2,[{file,"emqx_rule_engine_api.erl"},{line,426}]},{minirest_handler,apply_callback,3,[{file,"minirest_handler.erl"},{line,111}]},{minirest_handler,handle,2,[{file,"minirest_handler.erl"},{line,44}]},{minirest_handler,init,2,[{file,"minirest_handler.erl"},{line,27}]},{cowboy_handler,execute,2,[{file,"cowboy_handler.erl"},{line,41}]},{cowboy_stream_h,execute,3,[{file,"cowboy_stream_h.erl"},{line,318}]},{cowboy_stream_h,request_process,3,[{file,"cowboy_stream_h.erl"},{line,302}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]

@fabltd
Author

fabltd commented May 18, 2023

The above error results in the dashboard being unresponsive and I am unable to make changes to the rules engine.

@HJianBo wrote:

> Hi, @fabltd Do you know which version of EMQX you are currently using?
>
> The current version (5.0.x) does not have a configuration migration feature. If you need to migrate the configuration in this version, you will need to do it manually.
>
> For example, manually merge /opt/emqx/etc/emqx.conf and /opt/emqx/data/configs/cluster.hocon (or /opt/emqx/data/configs/cluster-override.conf) inside the emqx container, and copy them to the new node's /opt/emqx/etc/emqx.conf file
>
> Updates: Added a Feature Request label. We will try delivering this kind of functionality in v5.1.0

This does not work.

The crash occurs every time.

Steps to reproduce

Copy config to emqx-core-0:data/configs
Copy certs to emqx-core-0:data/certs

No other files are copied.

All pods restarted:

kubectl -n mqtt rollout restart statefulset emqx-core

Crash seen in dashboard when going to Flows view.

@fabltd
Author

fabltd commented May 18, 2023

@Rory-Z Any idea?

@Rory-Z
Member

Rory-Z commented May 18, 2023

> @Rory-Z Any idea?

I have no idea; I think the function_clause error is an EMQX bug.

@JimMoen
Member

JimMoen commented May 18, 2023

See stack trace

['emqx@10.20.0.7',#{counters => #{},gauges => #{},rate => #{current => 0.0,last5m => 0.0,max => 0.0},slides => #{}}]

It seems that fetching the metrics failed on the node emqx@10.20.0.7. Are you sure the rule has been created on all nodes?

@HJianBo
Member

HJianBo commented May 18, 2023

Can you check if there are any error logs when each EMQX node starts up?

Can you also query the List All Rules API on each node to see whether the rules you specified have been created correctly?
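
The per-node check suggested above might be scripted as in the following sketch. It assumes each node's dashboard port (18083, as in the Service definition earlier in the thread) is reachable directly, that an API key and secret have been created, and that rules are listed at GET /api/v5/rules as a JSON object whose data array holds objects with an id field. The helper names are hypothetical, and only the first page of results is read.

```python
import base64
import json
import urllib.request


def fetch_rule_ids(base_url: str, api_key: str, api_secret: str,
                   timeout: float = 5.0) -> set[str]:
    """Return the rule IDs reported by one node, e.g. base_url='http://10.20.0.7:18083'."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    request = urllib.request.Request(
        f"{base_url}/api/v5/rules",
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return {rule["id"] for rule in body.get("data", [])}


def missing_rules(rules_by_node: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each node, list the rule IDs that some other node has but it lacks."""
    all_ids = set().union(*rules_by_node.values())
    gaps = {}
    for node, ids in rules_by_node.items():
        absent = all_ids - ids
        if absent:
            gaps[node] = absent
    return gaps
```

Calling fetch_rule_ids once per node and passing the collected sets to missing_rules gives an empty dict when every node reports the same rules; any node that appears in the result is a candidate for the kind of metrics failure seen in the stack trace.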

@fabltd
Author

fabltd commented May 18, 2023

I just added the config as suggested and this happens.

It crashes the rules engine.

@HJianBo
Member

HJianBo commented May 18, 2023

Could you please share the .hocon configuration, if possible?

@fabltd
Author

fabltd commented May 18, 2023 via email

@HJianBo
Member

HJianBo commented May 18, 2023

Yes, of course, heeejianbo@gmail.com

@fabltd
Author

fabltd commented May 18, 2023

Emailed. Let me know if you would like access to the cluster; it is running in Google Cloud.

@fabltd
Author

fabltd commented May 22, 2023

Thanks for fixing - How do I update to the fixed version?

@fabltd
Author

fabltd commented May 24, 2023

It looks like there is still an issue with metrics after setting up from scratch. Both of my replicant nodes have crashed.

initial call: mria_rlog_replica:init/1, pid: <0.2102.0>, registered_name: '$mria_meta_shard', exit: {{timeout,{gen_server,call,[mria_lb,{probe,'emqx@emqx-core-2.emqx-headless.mqtt.svc.cluster.local','$mria_meta_shard'}]}},[{gen_server,call,2,[{file,"gen_server.erl"},{line,239}]},{mria_rlog,subscribe,4,[{file,"mria_rlog.erl"},{line,167}]},{mria_rlog_replica,try_connect,3,[{file,"mria_rlog_replica.erl"},{line,395}]},{mria_rlog_replica,handle_reconnect,1,[{file,"mria_rlog_replica.erl"},{line,341}]},{gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,1205}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}, ancestors: [<0.2101.0>,mria_shards_sup,mria_rlog_sup,mria_sup,<0.1902.0>], message_queue_len: 0, messages: [], links: [<0.2101.0>], dictionary: [{rand_seed,{#{bits => 58,jump => #Fun<rand.3.92093067>,next => #Fun<rand.0.92093067>,type => exsss,uniform => #Fun<rand.1.92093067>,uniform_n => #Fun<rand.2.92093067>},[244355015406896546|90618611208143776]}},{'$logger_metadata$',#{domain => [mria,rlog,replica],shard => '$mria_meta_shard'}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 29, reductions: 12103; neighbours:

Any ideas? This all worked in dev but crashes in prod.

@fabltd
Author

fabltd commented May 24, 2023

This error is shown in the dashboard

500 NODE_DOWN:bad rpc call 'emqx@10.140.1.3', Reason {'EXIT', {badarg, [{ets,select_count, [emqx_activated_alarm, [{'$1',[],[true]}]], [{error_info, #{cause => id, module => erl_stdlib_errors}}]}, {emqx_mgmt_api, '-counting_total_fun/1-fun-0-',2, [{file,"emqx_mgmt_api.erl"}, {line,357}]}, {emqx_mgmt_api, maybe_apply_total_query,2, [{file,"emqx_mgmt_api.erl"}, {line,333}]}, {emqx_mgmt_api,do_select,2, [{file,"emqx_mgmt_api.erl"}, {line,299}]}, {emqx_mgmt_api,do_query,2,[]}]}}

This is the IP of the failed replicant pod.

Restarting all core pods and then the replicant pods seems to have the replicants running again.

@fabltd
Author

fabltd commented May 30, 2023

@HJianBo

I am still having issues. I built a new install from scratch. It worked for a few days, but now it's showing the following error again:

500 INTERNAL_ERROR:error, function_clause, [{emqx_rule_engine_api,'-get_rule_metrics/1-fun-0-',['emqx@10.140.5.6',#{counters => #{},gauges => #{},rate => #{current => 0.0,last5m => 0.0,max => 0.0},slides => #{}}],[{file,"emqx_rule_engine_api.erl"},{line,524}]},{emqx_rule_engine_api,'-get_rule_metrics/1-lc$^1/1-0-',3,[{file,"emqx_rule_engine_api.erl"},{line,567}]},{emqx_rule_engine_api,'-get_rule_metrics/1-lc$^1/1-0-',3,[{file,"emqx_rule_engine_api.erl"},{line,568}]},{emqx_rule_engine_api,'/rules/:id/metrics',2,[{file,"emqx_rule_engine_api.erl"},{line,426}]},{minirest_handler,apply_callback,3,[{file,"minirest_handler.erl"},{line,111}]},{minirest_handler,handle,2,[{file,"minirest_handler.erl"},{line,44}]},{minirest_handler,init,2,[{file,"minirest_handler.erl"},{line,27}]},{cowboy_handler,execute,2,[{file,"cowboy_handler.erl"},{line,41}]},{cowboy_stream_h,execute,3,[{file,"cowboy_stream_h.erl"},{line,318}]},{cowboy_stream_h,request_process,3,[{file,"cowboy_stream_h.erl"},{line,302}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]

@fabltd
Author

fabltd commented May 30, 2023

@HJianBo - I have updated to 5.0.26. I note the release notes say the metrics issue should be fixed, but it's still occurring.

@thalesmg
Contributor

Hi @fabltd , thanks for the logs.

The fix mentioned in the changelog you saw was for the bridges API, but the crash you encountered was in the rule engine API. We'll fix it here.
