
VM hadoop setup instructions


Instructions for setting up a CERN VM with Hadoop and WMArchive

Request a VM from openstack.cern.ch

Install the admin, frontend, das, and mongodb packages in the normal way we install software on a cmsweb VM [1]. Please note we need to use the newest architecture, slc6_amd64_gcc493, which contains Python 2.7. Here is what should be done in step 7 of the instructions [1]:

(VER=HG1509a REPO="comp" A=/data/cfg/admin; ARCH=slc6_amd64_gcc493; cd /data; $A/InstallDev -A $ARCH -R comp@$VER -s image -v $VER -r comp=$REPO -p "admin frontend das mongodb backend")
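
If the deployment succeeded, the applications should appear under /data/srv. A minimal sanity check (the exact list depends on what was deployed):

  # expect to see admin, frontend, das, and mongodb among the apps
  ls /data/srv/current/apps/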

Create the /data/wma area and copy the contents of the lxplus.cern.ch:~valya/workspace/wma/ area over there.

Create install area

  mkdir -p /data/wma/usr/lib/python2.7/site-packages

Create setup.sh file

#!/bin/bash
source /data/srv/current/apps/das/etc/profile.d/init.sh
export JAVA_HOME=/usr/lib/jvm/java
#export PATH=$PATH:$PWD/mongodb/bin
export PYTHONPATH=$PYTHONPATH:/data/wma/usr/lib/python2.7/site-packages

Set up your environment with source setup.sh; this will set up MongoDB, Python 2.7, and pymongo.
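
To confirm the environment has been picked up (a quick sanity check; exact versions may differ):

  # Python 2.7 should come from the deployed das stack
  python -V
  # pymongo should be importable
  python -c "import pymongo; print pymongo.version"
  # JAVA_HOME is set by setup.sh
  echo $JAVA_HOME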

Install pip [2] (optional)

curl https://bootstrap.pypa.io/get-pip.py > get-pip.py
python get-pip.py
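
With pip in place you can verify it works and, if desired, direct packages into the /data/wma/usr prefix used throughout this page. The old-style --install-option flag is one way to do that (simplejson is only an illustrative package name):

  pip --version
  # pass --prefix through to setup.py install; simplejson is just an example package
  pip install --install-option="--prefix=/data/wma/usr" simplejson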

Install java on VM

  sudo yum install java-1.8.0-openjdk-devel.x86_64
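
Verify the Java installation (the exact build string will differ):

  java -version
  # setup.sh expects the JVM to live under /usr/lib/jvm/java
  ls -l /usr/lib/jvm/java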

Create /etc/yum.repos.d/cloudera.repo with the following content:

[cloudera]
gpgcheck=0
name=Cloudera
enabled=1
priority=15
baseurl=https://cern.ch/it-service-hadoop/yum/cloudera-cdh542
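
Confirm that yum sees the new repository:

  sudo yum clean all
  # the cloudera repo should appear in the list
  yum repolist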

Install Hadoop and YARN

  sudo yum install hadoop-hdfs.x86_64 hadoop.x86_64 hive.noarch hadoop-libhdfs.x86_64
  sudo yum install hadoop-hdfs-namenode.x86_64 hadoop-hdfs-datanode.x86_64
  sudo yum install hadoop-yarn-nodemanager.x86_64

Configure Hadoop [3, 4]

 sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
 sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
 sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
 ls /etc/hadoop/conf.my_cluster/
 sudo vim /etc/hadoop/conf.my_cluster/core-site.xml

Here is the relevant part you should have in core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
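
To check that Hadoop picks up the active configuration (hdfs getconf reads the configuration selected via alternatives; no daemon needs to be running):

  # should print hdfs://localhost:9000
  hdfs getconf -confKey fs.defaultFS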

Adjust hdfs-site.xml

 sudo vim /etc/hadoop/conf.my_cluster/hdfs-site.xml

Add the relevant part to hdfs-site.xml:

    <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
    </property>
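
Before formatting, the namenode directory referenced above must exist and be writable by the hdfs user, as described in the CDH deployment guide [4]:

  sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
  # the namenode daemon runs as the hdfs user
  sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache
  sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/name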

Format local HDFS

  sudo -u hdfs hdfs namenode -format

Start HDFS

  cd /etc/init.d/
  sudo service hadoop-hdfs-datanode start
  sudo service hadoop-hdfs-namenode start

Start the YARN node manager (for MapReduce jobs):

sudo service hadoop-yarn-nodemanager start
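
At this point the daemons should be up; here are a couple of ways to verify (jps ships with the JDK installed earlier and, run as root, lists JVMs of all users):

  # expect NameNode, DataNode and NodeManager among the entries
  sudo jps
  # report HDFS capacity and live datanodes
  sudo -u hdfs hdfs dfsadmin -report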

Create some areas on HDFS

  sudo -u hdfs hadoop fs -mkdir /tmp
  sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  sudo -u hdfs hadoop fs -mkdir /test
  sudo -u hdfs hadoop fs -chmod -R 1777 /test
  hadoop fs -ls /tmp
  # now we're ready to put anything into HDFS, e.g.
  hadoop fs -put local_file /tmp
  hadoop fs -ls /tmp
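
A quick round trip confirms reads work as well (local_file is whatever file you put above):

  # print the file back from HDFS; the content should match the local copy
  hadoop fs -cat /tmp/local_file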

Install pydoop

  cd /data/wma/soft/pydoop
  python setup.py install --prefix=/data/wma/usr

Install avro

  cd /data/wma/soft/avro-1.7.7
  python setup.py install --prefix=/data/wma/usr

Install bz2file

  cd /data/wma/soft
  git clone git@github.com:nvawda/bz2file.git
  cd bz2file
  python setup.py install --prefix=/data/wma/usr
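
To check that the three packages landed in the /data/wma/usr prefix (this needs the PYTHONPATH from setup.sh; pydoop may additionally require a visible Hadoop installation at import time):

  # all three imports should succeed without a traceback
  python -c "import pydoop; import avro; import bz2file; print 'ok'"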

Fetch WMCore framework

  cd /data/wma
  git clone git@github.com:dmwm/WMCore.git

Get WMArchive framework

  cd /data/wma
  git clone git@github.com:dmwm/WMArchive.git

Remove DAS from the deploy area, otherwise it will be started:

  rm /data/srv/enabled/das

Adjust the wmarch_config.py file if necessary. For the time being we create a static area and copy the necessary web files over there:

  mkdir /data/wma/WMArchive/data
  cp -r WMArchive/src/css /data/wma/WMArchive/data
  cp -r WMArchive/src/js /data/wma/WMArchive/data
  cp -r WMArchive/src/images /data/wma/WMArchive/data
  cp -r WMArchive/src/templates /data/wma/WMArchive/data
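
A quick check that the static area is populated:

  # expect css, js, images and templates subdirectories
  ls /data/wma/WMArchive/data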

Check if the app_wmarchive_*.conf files exist in /data/srv/current/config/frontend/; if not, copy them over there:

  sudo cp app_wmarchive_* /data/srv/current/config/frontend/
  sudo chown _sw /data/srv/current/config/frontend/app_wmarchive_*
  sudo chgrp _config /data/srv/current/config/frontend/app_wmarchive_*

Start WMArchive service

  cd /data/wma
  ./run_wma.sh

Check if web server is running (WMArchive runs on port 8247):

  curl http://localhost:8247/wmarchive/web/

On vocms013 we should be able to access the service via an https request (once the frontend is configured):

  curl -k --key ~/.globus/userkey.pem --cert ~/.globus/usercert.pem https://vocms013.cern.ch/wmarchive/web/

At this point we're ready to insert data into WMArchive. Below are two different approaches. One is to use the testClient script and give it a JSON file:

WMArchive/test/python/testClient.py --json=WMArchive/test/data/fwjr_processing.json
STATUS 200 REASON OK
data {u'result': [u'{"ids": ["9d8bb0d3ddd54b6bc9158b5beb7eeb14", "cfe2a1feec3f0a5d708e5203a3870874", "1b6377ed38074e06da2a76f6efec7c35", "e436a319f6e9068bd8ceb285c0e3a0a3", "51027c9a760c8ac7617da956aaf74d20", "003f1046bbe24c17354848aef7ae611a", "4dac4a63d4612c7ba670437eb4629a46", "c3d7bf06561f59b6988bcc4c5a9b6697", "db7ac9841321ec550a8a68927f682c4e", "cd9f2c9c009ba465df97fa85872c7222"], "stype": "mongodb"}']} <type 'dict'>
Posted {u'result': [u'{"ids": ["9d8bb0d3ddd54b6bc9158b5beb7eeb14", "cfe2a1feec3f0a5d708e5203a3870874", "1b6377ed38074e06da2a76f6efec7c35", "e436a319f6e9068bd8ceb285c0e3a0a3", "51027c9a760c8ac7617da956aaf74d20", "003f1046bbe24c17354848aef7ae611a", "4dac4a63d4612c7ba670437eb4629a46", "c3d7bf06561f59b6988bcc4c5a9b6697", "db7ac9841321ec550a8a68927f682c4e", "cd9f2c9c009ba465df97fa85872c7222"], "stype": "mongodb"}']}

Or, we may run simple tests with the curl client (adjust the ids if necessary). Here are some commands to use:

# single document injection
curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"data\":{\"name\":1}}" http://localhost:8247/wmarchive/data/
# single document retrieval
curl -D /dev/stdout -H "Content-type: application/json" http://localhost:8247/wmarchive/data/eed35faf3b73d58157aa53d097899e8d

# multiple documents injection
curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"data\":[{\"name\":1}, {\"name\":2}]}" http://localhost:8247/wmarchive/data/

# multiple documents retrieval
curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"query\":[\"eed35faf3b73d58157aa53d097899e8d\", \"bcee13403f554bc14f644ffdeaa93372\"]}" http://localhost:8247/wmarchive/data/

References

1. https://cms-http-group.web.cern.ch/cms-http-group/tutorials/environ/vm-setup.html
2. https://pip.pypa.io/en/stable/installing/
3. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
4. http://www.cloudera.com/content/www/en-us/documentation/cdh/5-0-x/CDH5-Installation-Guide/cdh5ig_hdfs_cluster_deploy.html?scroll=topic_11_2