Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-2588. Consolidate compose environments #238

Closed
wants to merge 8 commits into from

Conversation

@adoroszlai
Copy link
Contributor

adoroszlai commented Nov 20, 2019

What changes were proposed in this pull request?

There are a few slightly different sample docker compose environments: ozone, ozoneperf, ozones3, ozone-recon. This change proposes to merge these 4 by minor additions to ozoneperf:

  1. add recon service from ozone-recon
  2. run GDPR and S3 tests
  3. expose datanode web port (eg. for profiling)

Plus: also run ozone-shell test (from basic suite), which is currently run only in ozonesecure

I also propose to rename ozoneperf to ozone for simplicity.

Consolidating these 4 environments would slightly reduce both code duplication and the time needed for acceptance tests.

https://issues.apache.org/jira/browse/HDDS-2588

How was this patch tested?

Ran acceptance test in ozone dir. Generated keys using freon, verified that Jaeger, Prometheus, Grafana reflect the operations.

Clean CI in private branch.

@dineshchitlangia dineshchitlangia requested a review from elek Nov 20, 2019
@@ -20,14 +20,22 @@ OZONE-SITE.XML_ozone.scm.names=scm
OZONE-SITE.XML_ozone.scm.datanode.id.dir=/data
OZONE-SITE.XML_ozone.scm.block.client.address=scm
OZONE-SITE.XML_ozone.metadata.dirs=/data/metadata
OZONE-SITE.XML_ozone.handler.type=distributed
OZONE-SITE.XML_ozone.recon.db.dir=/data/metadata/recon
OZONE-SITE.XML_ozone.recon.om.db.dir=/data/metadata/recon

This comment has been minimized.

Copy link
@bharatviswa504

bharatviswa504 Nov 21, 2019

Contributor

/data/metadata/reon -> /data/metadata/om

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Nov 21, 2019

Author Contributor

Thanks @bharatviswa504 for spotting this. These config values from ozone-recon env as they were.

@avijayanhwx @swagle can Recon use the same directory for both ozone.recon.db.dir and ozone.recon.om.db.dir?

This comment has been minimized.

Copy link
@avijayanhwx

avijayanhwx Nov 21, 2019

Contributor

@adoroszlai Yes the same directory can be used.

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Nov 21, 2019

Author Contributor

Thanks @avijayanhwx. Then I think we can keep it as is.

@bharatviswa504

This comment has been minimized.

Copy link
Contributor

bharatviswa504 commented Nov 21, 2019

Good idea, overall LGTM. One minor comment.
One more suggestion, In a similar way, we can merge ozone-om-ha and ozone-om-ha-s3 also.

@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Nov 21, 2019

Thanks @bharatviswa504 for the review.

In a similar way, we can merge ozone-om-ha and ozone-om-ha-s3 also.

Unfortunately cannot do that, since ozone-om-ha is very different (custom docker image with SSH, starts/stops OM manually, etc.).

@elek

This comment has been minimized.

Copy link
Member

elek commented Nov 21, 2019

Thanks to work on this @adoroszlai . I always felt the same, that we have too many environments, but I didn't know the right answer. therefore I just share my thoughts this is not a 👍 or 👎 as I don't know which one is the better.

1.THE PHILOSOPHY

It's -- at least partially -- a philosophical question, what is Ozone. (And as it's philosophy, I am interested about the opinion of our philosopher of Ozone cc @anuengineer)

By default I would say:

  • ozone = scm/om/datanode + s3g + recon
  • ozone-perf = ozone + all the tools to monitor

But we can also say what what this patch says:

  • ozone: the full ozone experience (scm/om/datanode + s3g + recon + all the monitoring tools).

(at lest for Apache Ozone, this is not true for proprietary distributions).

  1. THE USABILITY

The other question is usability: Do we need Prometheus and Jaeger all the time. Do we need to start them when we would like to test any of the features of ozone?

Prometheus: maybe
Jaeger: I am not sure.

  1. THE GUARANTEE

My third (actual) problem is the guarantee of a cluster. To avoid same flakiness I would like to declare that we need at least 3 datanodes for ozone compose clusters (see. HDDS-2606). But I am not happy with the usability that Ozone can't work if you don't do a docker-compose scale datanode=3

We may need a smaller subset of ozone (ozone-core) where the three datanodes are not required (but not used to run all the different acceptance testing). In this case the usability problem (2) can be solved as we can have a leaner ozone cluster to test locally.

@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Nov 21, 2019

Thanks @elek for your thoughts. I think (1) and (2) are addressed by the followup commit, which extracts monitoring and profiling into separate configs. (Let me know if any of the configs are miscategorized.) They can be mixed in as desired by one of:

# no COMPOSE_FILE var                                                  # => only Ozone
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml                # => add monitoring
export COMPOSE_FILE=docker-compose.yaml:profiling.yaml                 # => add profiling
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml:profiling.yaml # => add both

I need to think (or talk) about your third point (3) more.

@elek

This comment has been minimized.

Copy link
Member

elek commented Nov 28, 2019

I think (1) and (2) are addressed by the followup commit, which extracts monitoring and profiling into separate configs

Thanks the update @adoroszlai This approach is very smart, but I have some fear how easy is to understand it. (One additional function of the compose folders to provide simple examples to use ozone.)

But let's try out this approach. I am fine with it.

Can you please update the README.txt inside compose/ozone (currently it's the original ozoneperf readme, It can be simplified but we need to add information about the COMPOSE_FILE=... trick)?

@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Nov 28, 2019

Thanks for the feedback @elek.

Can you please update the README.txt

Sure, will do, but didn't want to write doc until the code is OK-ed. ;)

@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Dec 2, 2019

Changes in last two commits besides README update:

  1. fix freon-*.yaml:
    • versions need to match other yamls
    • monitoring config is required for getting Freon spans
  2. set safemode.min.datanode and ozone.replication to value of OZONE_REPLICATION_FACTOR environment variable
  3. add run.sh to make startup simpler (--scale datanodes based on OZONE_REPLICATION_FACTOR)

So docker-compose up without --scale still starts a cluster with 1 datanode, but ozone sh key put works OK, whereas previously it required explicit -r ONE. And getting a 3-datanode cluster is as simple as OZONE_REPLICATION_FACTOR=3 ./run.sh.

Copy link
Member

elek left a comment

Thank you @adoroszlai the update.

I am happy with this approach with setting the scale with OZONE_REPLICATION_FACTOR but wouldn't it be more clean to do it inside #282 / HDDS-2646 for all the environments together?

command: ["ozone","om"]
scm:
<<: *common-config
ports:
- 9876:9876
environment:
ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION

This comment has been minimized.

Copy link
@elek

elek Dec 3, 2019

Member

Why is it better to move out this to a separated file? I think it's easier to overview if it's in this file. Do we really need a separated file to store this one line?

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Dec 3, 2019

Author Contributor

I hated moving it out, but the following considerations together made me do it:

  1. Variable substitution only works in the yaml files, not in configs passed via env_file. So safemode.min.datanode and ozone.replication need to be in environment.
  2. When merging common-config, one complete dict overrides the other: depending on the order either common-config or the specific service gets to define environment.

So I moved out these one-liners to avoid duplicating the "two-liners". Plus it seems unlikely that anyone ever wants to change these infrastructure related variables.

Now another alternative occurred to me: we might define a separate dict for the configs to be merged into environment. Let me experiment a bit with this. If it works, we could avoid the separate config files.

This comment has been minimized.

Copy link
@elek

elek Dec 6, 2019

Member

I would prefer the separated common configs. Especially as we have only a few lines of settings they can be included in the common configs all together. But it's not a blocker for now, we can commit it (thanks to explain the reason behind the small files...)

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Dec 6, 2019

Author Contributor

ae2bc55 moves these one-liners back to docker-compose.yaml and the separate files are no longer needed.


declare -ix OZONE_REPLICATION_FACTOR
: ${OZONE_REPLICATION_FACTOR:=1}
docker-compose up --scale datanode=${OZONE_REPLICATION_FACTOR} --no-recreate "$@"

This comment has been minimized.

Copy link
@elek

elek Dec 3, 2019

Member

Can you please help me to understand why do we need --no-recreate?

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Dec 3, 2019

Author Contributor

Without --no-recreate, if I run docker-compose up a second time to start Freon after cluster is up, Docker Compose may recreate all existing containers. I noticed that this happens when running the second command from another terminal, while following logs in the original terminal. Not sure if this is really the cause, or what other conditions could trigger it. Using the flag helps avoid this situation.

Do you think it may cause problems in other cases?

This comment has been minimized.

Copy link
@anuengineer

anuengineer Dec 4, 2019

Contributor

Well, while I agree the call for re-create is correct here. @adoroszlai You did ask the question. The issue can happen if we have a cluster that was running that has error-ed out. Then re-running this command will not reset the system. But it is probably something that we can live with, or fix much later. I predict for a long time, when someone reports an issue, we will say " make sure you kill all running docker containers". But then, traditionally that is our first debugging step whenever someone tells us that docker based stuff is not stable for them.

Just to make sure, I am +1 and ok with this change. Just responding to your question; that is all.

This comment has been minimized.

Copy link
@elek

elek Dec 6, 2019

Member

In my experience the docker-compose up worked well even from other terminal if nothing has been changed and the docker-compose file set was the same.

Can we start the scm first with docker-compose up -d scm and after everything else with docker-compose up -d with this no-recreate?

This comment has been minimized.

Copy link
@adoroszlai

adoroszlai Dec 6, 2019

Author Contributor

if nothing has been changed and the docker-compose file set was the same.

But the readme says freon compose file should be added only when datanodes are up, so the set is not the same.

Or you can start freon instances in containers (to make it possible to scale them up):
If all the datanodes are started, start the freon instance:
```
docker-compose -f docker-compose.yaml -f freon-ockg.yaml up --scale datanode=3 -d

@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Dec 3, 2019

I am happy with this approach with setting the scale with OZONE_REPLICATION_FACTOR but wouldn't it be more clean to do it inside #282 / HDDS-2646 for all the environments together?

I'm fine with reworking this one to accommodate whatever changes are done in #282. I wanted to avoid blocking the acceptance test fix for a usability improvement.

@anuengineer

This comment has been minimized.

Copy link
Contributor

anuengineer commented Dec 4, 2019

It's -- at least partially -- a philosophical question, what is Ozone. (And as it's philosophy, I am interested about the opinion of our philosopher of Ozone cc @anuengineer)

I am okay with what this patch says -- in reality, irrespective of what we say, there will be monitoring , tracing and logging collectors in place for most data centers. So irrespective of what we do (Prometheus, Jaeger, Grafana, Fluentd) the system admins will do the right thing for them.

We are just show casing that fact that, it is trivial to do this with Ozone. So when someone is evaluating Ozone, the question of how can I really run this service in production is answered via the presence of these tools. I would go a step ahead add these as recipes in the Ozone documentation too.

@elek
elek approved these changes Dec 12, 2019
Copy link
Member

elek left a comment

+1. Let me commit it now.

It's a newer approach and we can improve it in follow-up jiras if needed.

Thanks @bharatviswa504 and @anuengineer the review and @adoroszlai the patch.

@elek elek closed this in e14f709 Dec 12, 2019
@adoroszlai

This comment has been minimized.

Copy link
Contributor Author

adoroszlai commented Dec 12, 2019

Thanks @anuengineer, @avijayanhwx, @bharatviswa504 for review, and @elek for review and commit.

@adoroszlai adoroszlai deleted the adoroszlai:HDDS-2588 branch Dec 12, 2019
ChenSammi added a commit that referenced this pull request Dec 23, 2019
mukul1987 added a commit to mukul1987/hadoop-ozone that referenced this pull request Jan 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.