This repository has been archived by the owner on Nov 23, 2017. It is now read-only.

Elastic spark cluster #39

Open. Wants to merge 6 commits into base: branch-1.6.

Conversation

@ar-ms commented Jul 18, 2016

Elastic Spark Cluster

Prerequisite

A running Spark cluster.

add-slaves

./spark-ec2 -k key-name -i file.pem add-slaves cluster_name

This command adds a new slave to the cluster. To add more slaves, use the -s or --slaves option. You can also request that the slaves be started as spot instances with the --spot-price option, and choose the instance type with --instance-type or -t.
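For example, to add three slaves of a given instance type as spot instances (the instance type and spot price below are illustrative values only):

./spark-ec2 -k key-name -i file.pem -s 3 -t m3.large --spot-price=0.05 add-slaves cluster_name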

remove-slaves

./spark-ec2 -k key-name -i file.pem remove-slaves cluster_name

This command removes a slave, as long as it is not the last slave in the cluster. To remove more slaves, specify the number with the -s or --slaves option. If the requested number is larger than the number of slaves in the cluster, or is equal to -1, every slave is removed except one, which the cluster always keeps.
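For example, combining the options described above, removing two slaves, or all slaves except one:

./spark-ec2 -k key-name -i file.pem -s 2 remove-slaves cluster_name
./spark-ec2 -k key-name -i file.pem -s -1 remove-slaves cluster_name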

Note:

  • The slaves are removed randomly.
  • One slave will be kept.

Testing command shortcuts

The line after each command (slave = N) is the expected number of slaves in the cluster after that step.

  • Environment variables
KEY_NAME="key-name"
ID_FILE="file.pem"
SPARK_EC2_GIT_REPO="https://github.com/tirami-su/spark-ec2"
SPARK_EC2_GIT_BRANCH="elastic-spark-cluster"
CLUSTER_NAME="cluster_name"
  • Creating a cluster
    ./spark-ec2 -k $KEY_NAME -i $ID_FILE --spark-ec2-git-repo=$SPARK_EC2_GIT_REPO --spark-ec2-git-branch=$SPARK_EC2_GIT_BRANCH launch $CLUSTER_NAME

    slave = 1

  • Trying to remove a slave
    ./spark-ec2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slave = 1

  • Adding a slave
    ./spark-ec2 -k $KEY_NAME -i $ID_FILE add-slaves $CLUSTER_NAME

    slave = 2

  • Adding three more slaves as spot instances
    ./spark-ec2 -s 3 -k $KEY_NAME -i $ID_FILE --spot-price=0.05 add-slaves $CLUSTER_NAME

    slave = 5

  • Removing two slaves
    ./spark-ec2 -s 2 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slave = 3

  • Removing all slaves (except one slave)
    ./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE remove-slaves $CLUSTER_NAME

    slave = 1

  • Destroy cluster
    ./spark-ec2 -s -1 -k $KEY_NAME -i $ID_FILE destroy $CLUSTER_NAME

@shivaram (Contributor)

This is great. Thanks @Tirami-su -- I'll take a look at this soon.

@nchammas Would be great if you also took a look.

@ar-ms (Author) commented Jul 18, 2016

Thanks @shivaram !
Super :D !

@@ -787,9 +916,47 @@ def get_instances(group_names):
return (master_instances, slave_instances)


def transfert_SSH_key(opts, master, slave_nodes):
@nchammas (Contributor) Jul 21, 2016
transfert (with a t at the end)?

@ar-ms (Author)
Oops... Sorry, a French reflex...

@nchammas (Contributor) commented Jul 21, 2016

I won't be able to review this PR in detail since I'm struggling to find time to complete even my own implementation of this feature for Flintrock.

However, some high-level questions:

  • How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?
  • Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

@ar-ms (Author) commented Jul 25, 2016

Oh...

How does add-slaves know what version of Spark to install and other setup to do on the new slaves? Can the user inadvertently create a broken cluster with a bad call to add-slaves?

There are three types of deploy folders:

  • deploy.generic:
deploy.generic/
└── root
    └── spark-ec2
        └── ec2-variables.sh
  • entities.generic:
entities.generic/
└── root
    └── spark-ec2
        ├── masters
        └── slaves
  • new_slaves.generic:
new_slaves.generic/
└── root
    └── spark-ec2
        └── new_slaves

deploy.generic, which contains ec2-variables.sh (SPARK_VERSION, HADOOP_MAJOR_VERSION, MODULES, ...), is deployed only when the cluster is created. add-slaves reuses the settings from ec2-variables.sh to set up the new slaves, so it will not break the cluster: the new slaves get the same versions as the existing ones.
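Roughly, ec2-variables.sh is a shell file of exported settings along these lines (the values here are only an illustrative sketch, not what a real cluster generates):

# /root/spark-ec2/ec2-variables.sh -- illustrative sketch only
export SPARK_VERSION="1.6.2"
export HADOOP_MAJOR_VERSION="2"
export MODULES="spark,ephemeral-hdfs,ganglia"

add-slaves reads these values back instead of asking for them again, which is why the new slaves end up with the same Spark and Hadoop versions as the rest of the cluster.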

There is a problem if you mix instance types, for example if you start a cluster with large instances and later add smaller ones. Since the memory and CPU settings are computed only from the master and the first slave in deploy_templates.py, jobs will fail if the new slaves do not have enough memory or CPU.

Another "problem" is that if you modify files present in templates, like core-site.xml, etc.. directly in a running cluster, the modifications will be overridden by the original version present if you launch new instances, because the setup_new_slaves launches deploy_templates.py.

Can you explain the high-level purpose of entities.generic and how it fits in with the current code base?

The purpose of entities.generic is to separate the settings that change over time (masters and, mostly, slaves) from the rest. entities.generic deploys only the masters and slaves files; this separation lets us avoid overwriting ec2-variables.sh when launching new slaves, and so avoids corrupting the cluster.
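For example, masters and slaves are just plain lists of host names, one per line (the hosts below are hypothetical), so redeploying them for a new slave touches no other setting:

# /root/spark-ec2/slaves -- hypothetical contents
ec2-54-0-0-2.compute-1.amazonaws.com
ec2-54-0-0-3.compute-1.amazonaws.com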

Thank you for this review and for the questions. If the answers aren't clear, feel free to ask me for more details, and the same goes for any other questions :)!
