SparkHadoopCluster

Create your own Apache Spark cluster with Hadoop/HDFS installed.

full blog post...

  1. Install Java, of course: sudo apt-get -y install openjdk-8-jdk-headless default-jre
    • This must be done on all nodes.
  2. Install Scala: sudo apt install scala
    • I have no idea if this must be done on all nodes, but I did it on all of them anyway.
  3. Set up password-less SSH between all nodes.
    • sudo apt install openssh-server openssh-client
  4. Create keys: ssh-keygen -t rsa -P ""
    • Copy the .pub key into each worker node's ~/.ssh/authorized_keys file (ssh-copy-id worker1 does this for you).
  5. sudo vim /etc/hosts
    • Add a line for the master and each worker node with its hostname and IP address... something like the following (see the sketch after this list).
    • 173.255.199.161 master and maybe 198.58.124.54 worker1
  6. Install Spark on the master and all worker nodes.... wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
    • Unpack it: tar xvf spark-2.4.3-bin-hadoop2.7.tgz
    • Move it into place: sudo mv spark-2.4.3-bin-hadoop2.7/ /usr/local/spark
  7. We need to set the Spark and Java paths.... sudo vim ~/.bashrc
    • export PATH=/usr/local/spark/bin:$PATH
    • export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    • export PATH=$JAVA_HOME/bin:$PATH
    • source ~/.bashrc
  8. vim /usr/local/spark/conf/spark-env.sh
    • If the file doesn't exist yet, copy it from the template: cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
    • export SPARK_MASTER_HOST=<master-ip> <- fill in the IP address of your master node here.
    • export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  9. Modify the ridiculously named file: vim /usr/local/spark/conf/slaves
    • Add the names of the master and workers from the /etc/hosts file above (see the sketch after this list).
    • Finally, start Spark.... sh /usr/local/spark/sbin/start-all.sh
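For reference, a minimal sketch of the two files from steps 5 and 9, using the example IPs and hostnames above (your own IPs and hostnames will differ; any additional workers get their own lines in both files):

```
# /etc/hosts -- same entries on every node
173.255.199.161 master
198.58.124.54   worker1

# /usr/local/spark/conf/slaves -- one hostname per line for every node that
# should run a Spark worker (include master if you want a worker there too)
master
worker1
```

Once start-all.sh has run on the master, the standalone master's web UI (port 8080 on the master node by default) should list every worker.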

Hadoop

  1. wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    • tar -xvf hadoop-2.7.3.tar.gz
    • mv hadoop-2.7.3 hadoop
  2. vim ~/hadoop/etc/hadoop/hadoop-env.sh
    • export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  3. vim ~/hadoop/etc/hadoop/core-site.xml
    • Set fs.default.name to hdfs://master:9000 (see the snippet below).
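Hadoop expects these settings as <property> entries inside a <configuration> block; a minimal sketch of core-site.xml with the value from step 3 (the XML wrapper is the standard Hadoop config format, only the property and value come from this repo):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```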
  4. vim ~/hadoop/etc/hadoop/hdfs-site.xml
    • Set dfs.namenode.name.dir to /home/beach/data/nameNode, dfs.datanode.data.dir to /home/beach/data/dataNode, and dfs.replication to 2 (see the snippet below).
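A sketch of hdfs-site.xml with the values from step 4, using dfs.datanode.data.dir as the standard property name for the DataNode storage directory:

```xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/beach/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/beach/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```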
  5. cd ~/hadoop/etc/hadoop
    • mv mapred-site.xml.template mapred-site.xml
    • sudo vim ~/hadoop/etc/hadoop/mapred-site.xml
    • Set mapreduce.framework.name to yarn, yarn.app.mapreduce.am.resource.mb to 800, mapreduce.map.memory.mb to 400, and mapreduce.reduce.memory.mb to 400 (see the snippet below).
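A sketch of mapred-site.xml with the values from step 5, wrapped in the standard <configuration>/<property> format:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>800</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>400</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>400</value>
  </property>
</configuration>
```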
  6. vim ~/hadoop/etc/hadoop/slaves
    • localhost
    • worker1
    • worker2
  7. vim ~/hadoop/etc/hadoop/yarn-site.xml
    • Set yarn.acl.enable to 0, yarn.resourcemanager.hostname to master, yarn.nodemanager.aux-services to mapreduce_shuffle, yarn.nodemanager.resource.memory-mb to 800, yarn.scheduler.maximum-allocation-mb to 800, and yarn.scheduler.minimum-allocation-mb to 400 (see the snippet below).
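A sketch of yarn-site.xml with the values from step 7, in the same property format:

```xml
<configuration>
  <property>
    <name>yarn.acl.enable</name>
    <value>0</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>800</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>800</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>400</value>
  </property>
</configuration>
```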
  8. sudo vim ~/.bashrc
    • export PATH=/home/beach/hadoop/bin:/home/beach/hadoop/sbin:$PATH
    • source ~/.bashrc
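With the PATH in place, a typical way to bring HDFS and YARN up from the master node looks like the following. These are standard Hadoop 2.x commands, not steps listed above, so treat them as a suggested finish rather than part of the original write-up.

```bash
# run on the master node only
hdfs namenode -format      # one-time format of dfs.namenode.name.dir
start-dfs.sh               # starts the NameNode plus a DataNode on each host in slaves
start-yarn.sh              # starts the ResourceManager and the NodeManagers
hdfs dfsadmin -report      # sanity check: each worker should appear as a live DataNode
```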
