Final project for the Trends in Software Engineering course.
- OS: Ubuntu 16.04
- Disk 1: 1 TB SSD
- Disk 2: 2 TB HDD
- RAM: 8 GB DDR4
- CPU: quad-core @ 2.2 GHz
To set everything up, follow the steps below as a user with sudo privileges:
- Install Oracle Java 8

  Open a terminal session and run the following commands:

  sudo apt-get update
  sudo apt-get install default-jre
  sudo apt-get install default-jdk
  sudo add-apt-repository ppa:webupd8team/java
  sudo apt-get update
  sudo apt-get install oracle-java8-installer
- Configure JAVA_HOME

  Run the following command to determine the Java install path:

  sudo update-alternatives --config java

  Copy the preferred install path and open /etc/environment with your preferred editor (e.g. nano, gedit):

  sudo nano /etc/environment
  or sudo gedit /etc/environment

  Add the following line at the end of the file, replacing <path> with the install path you copied, surrounded by double quotes:

  JAVA_HOME=<path>

  e.g. JAVA_HOME="/usr/lib/jvm/java-8-oracle"

  Save and close the file, then reload it:

  source /etc/environment

  Test the new environment variable:

  echo $JAVA_HOME
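  Optionally, you can confirm that the path points at a usable JDK:

  ```
  $JAVA_HOME/bin/java -version
  ```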
- Install SSH

  Run the following command in a terminal:

  sudo apt-get install ssh
- Update OS limits

  Set limits in /etc/security/limits.conf. Without these, the Hadoop instance may hit the open-file and process limits and occasionally be killed by the OS. First open the file for editing:

  sudo nano /etc/security/limits.conf
  or sudo gedit /etc/security/limits.conf

  Add the following lines at the end of the file:

  * hard nofile 50000
  * soft nofile 50000
  * hard nproc 10000
  * soft nproc 10000
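  After logging out and back in (limits.conf applies per login session), you can verify the new limits:

  ```
  ulimit -n   # open file limit (nofile), expect 50000
  ulimit -u   # max user processes (nproc), expect 10000
  ```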
- Workaround for the "kill" command

  Open /etc/systemd/logind.conf with your preferred editor, e.g.:

  sudo nano /etc/systemd/logind.conf
  or sudo gedit /etc/systemd/logind.conf

  Remove the hash (#) symbol from the line:

  KillUserProcesses=no

  Save and close the file.
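  For the change to take effect without a full reboot, you can restart the logind service (Ubuntu 16.04 is systemd-based):

  ```
  sudo systemctl restart systemd-logind
  ```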
- Add users for HDFS and YARN

  Create the users that will own and run the HDFS and YARN processes, create the hadoop group, and make all big-data users part of it (replace apa with your own user):

  sudo addgroup hadoop
  sudo adduser --ingroup hadoop hduser
  sudo adduser --ingroup hadoop yarn
  sudo usermod -a -G hadoop apa
- Setup SSH

  - Create the public and private key pair for the hdfs user. Note: press ENTER when asked for a file.

    sudo su - hduser
    ssh-keygen -t rsa -P ""

  - Save the generated public key as an authorized key:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

  - Test SSH for the hdfs user. Note: enter 'yes' when prompted for confirmation.

    ssh localhost
    exit
    exit

  - Repeat the previous three (3) steps for the yarn user:

    sudo su - yarn
    ssh-keygen -t rsa -P ""
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
    ssh localhost
    exit
    exit
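  To confirm that passwordless SSH works non-interactively for both users, a quick check (BatchMode makes ssh fail instead of prompting for a password):

  ```
  sudo su - hduser -c 'ssh -o BatchMode=yes localhost true' && echo "hduser OK"
  sudo su - yarn -c 'ssh -o BatchMode=yes localhost true' && echo "yarn OK"
  ```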
-
Download HADOOP distribution
- Locate in the Downloads directory
cd Downloads/
- Download the distribution (2.8.2 at the moment)
wget http://www-us.apache.org/dist/hadoop/common/hadoop-2.8.2/hadoop-2.8.2.tar.gz
- Untar the downloaded file using the following command
tar xvf hadoop-2.8.2.tar.gz
- Move and rename the folder to his final destination
sudo mv hadoop-2.8.2 /usr/local/hadoop
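  To confirm the distribution unpacked correctly, you can print its version:

  ```
  /usr/local/hadoop/bin/hadoop version
  ```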
- Create and set permissions for directories

  - Run the following commands to create the paths for the datanode and namenode:

    sudo mkdir -p /usr/local/hadoop/data/namenode
    sudo mkdir -p /usr/local/hadoop/data/datanode
    sudo mkdir -p /usr/local/hadoop/logs
    sudo chmod 777 /tmp

  - Ensure the hadoop user owns these folders:

    cd /usr/local
    sudo chown -R hduser:hadoop hadoop
    sudo su - hduser
    chmod g+w /usr/local/hadoop/logs
    exit
- Configure Hadoop

  - Set environment variables for Hadoop. The .bashrc file has to be modified for the hduser and yarn accounts as well as your own user. Open it with your preferred editor:

    sudo su - hduser
    gedit .bashrc
    or nano .bashrc

  - Add the following block at the end of the file and then save it:

    #HADOOP VARIABLES START
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    export HADOOP_INSTALL=/usr/local/hadoop
    export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
    export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
    export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native"
    export JAVA_LIBRARY_PATH=$HADOOP_INSTALL/lib/native
    export LD_LIBRARY_PATH=$HADOOP_INSTALL/lib/native:$LD_LIBRARY_PATH
    export YARN_EXAMPLES=$HADOOP_INSTALL/share/hadoop/mapreduce
    export HADOOP_MAPRED_STOP_TIMEOUT=30
    export YARN_STOP_TIMEOUT=30
    #HADOOP VARIABLES END
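  After editing .bashrc for an account, you can reload it in that account's shell and confirm the variables resolve:

  ```
  source ~/.bashrc
  echo $HADOOP_INSTALL   # expect /usr/local/hadoop
  hadoop version         # hadoop should now be on the PATH
  ```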
- Set up the configuration files

  Move to the Hadoop configuration directory to edit the configuration files:

  cd /usr/local/hadoop/etc/hadoop

  There are two (2) options: copy the files from the Hadoop folder in this repository (only if you have followed these steps exactly), or configure each file manually. If configuring manually:
  - Open core-site.xml:

    sudo gedit core-site.xml
    or sudo nano core-site.xml

  - Replace the content between the <configuration> tags with:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
    </property>

  - Save and close core-site.xml.
  - Open hdfs-site.xml:

    sudo gedit hdfs-site.xml
    or sudo nano hdfs-site.xml

  - Replace the content between the <configuration> tags with:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/usr/local/hadoop/data/namenode</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/usr/local/hadoop/data/datanode</value>
    </property>
    <property>
      <name>dfs.permissions.superusergroup</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>

  - Save and close hdfs-site.xml.
  - Create mapred-site.xml from the template:

    su hduser
    cp mapred-site.xml.template mapred-site.xml
    exit

  - As with the previous files, open it:

    sudo gedit mapred-site.xml
    or sudo nano mapred-site.xml

  - Replace the content between the <configuration> tags with:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>mapreduce.reduce.java.opts</name>
      <value>-Xmx500m</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx500m</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>512</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.command-opts</name>
      <value>-Xmx200m</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.done-dir</name>
      <value>/tmp/mr-history/done</value>
    </property>
    <property>
      <name>mapreduce.jobhistory.intermediate-done-dir</name>
      <value>/tmp/mr-history/intermediate</value>
    </property>
    <property>
      <name>mapreduce.task.userlog.limit.kb</name>
      <value>5000</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.task.container.log.backups</name>
      <value>3</value>
    </property>
    <property>
      <name>mapred.userlog.retain.hours.max</name>
      <value>6</value>
    </property>

  - Save and close mapred-site.xml.
  - Open hadoop-env.sh:

    sudo gedit hadoop-env.sh
    or sudo nano hadoop-env.sh

  - Configure JAVA_HOME:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle

  - Save and close hadoop-env.sh.
  - Open mapred-env.sh:

    sudo gedit mapred-env.sh
    or sudo nano mapred-env.sh

  - Configure the following:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    export HADOOP_MAPRED_IDENT_STRING=hduser

  - Save and close mapred-env.sh.
  - Open yarn-site.xml:

    sudo gedit yarn-site.xml
    or sudo nano yarn-site.xml

  - Replace the content between the <configuration> tags with:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>localhost</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>6144</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.minimum-allocation-vcores</name>
      <value>1</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>2</value>
    </property>
    <property>
      <name>yarn.nodemanager.vmem-pmem-ratio</name>
      <value>6</value>
    </property>
    <property>
      <name>yarn.nodemanager.default-container-executor.log-dirs.permissions</name>
      <value>755</value>
    </property>

  - Save and close yarn-site.xml.
- Format the namenode

  Format the namenode, then start and stop HDFS once from the hdfs account (source your environment first):

  source .bashrc
  sudo su - hduser
  $HADOOP_INSTALL/bin/hdfs namenode -format
  $HADOOP_INSTALL/sbin/start-dfs.sh
  (Note: the first start will prompt for confirmation; type 'yes')
  $HADOOP_INSTALL/sbin/stop-dfs.sh
  exit
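  While start-dfs.sh is running, you can check which daemons are up with the JDK's jps tool (run it as hduser, since the daemons belong to that user):

  ```
  jps   # expect NameNode, DataNode and SecondaryNameNode, plus Jps itself
  ```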
- Start HDFS and the ResourceManager

  Run the following commands:

  sudo su -p - hduser -c $HADOOP_INSTALL/sbin/start-dfs.sh
  sudo su -p - yarn -c $HADOOP_INSTALL/sbin/start-yarn.sh
- Create user and history-server directories

  sudo su - hduser
  hdfs dfs -mkdir -p /user/apa
  hdfs dfs -mkdir -p /tmp/mr-history/done
  hdfs dfs -mkdir -p /tmp/mr-history/intermediate
  hdfs dfs -mkdir -p /tmp/hadoop-yarn
  hdfs dfs -chown apa:hadoop /user/apa
  hdfs dfs -chown hduser:hadoop /tmp
  hdfs dfs -chown hduser:hadoop /user
  hdfs dfs -chmod -R 1777 /tmp
  hdfs dfs -chmod 777 /tmp/mr-history
  hdfs dfs -chmod 777 /tmp/mr-history/done
  hdfs dfs -chmod 777 /tmp/mr-history/intermediate
  hdfs dfs -chmod 1777 /tmp/hadoop-yarn
  exit
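  Afterwards you can verify the HDFS layout and permissions (the full path avoids depending on hduser's PATH):

  ```
  sudo su - hduser -c '/usr/local/hadoop/bin/hdfs dfs -ls -R /tmp /user'
  ```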
- Start the history server

  sudo su -p - hduser -c "$HADOOP_INSTALL/sbin/mr-jobhistory-daemon.sh start historyserver"
- Verify the install

  yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.8.2.jar pi 16 1000
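  The job should finish by printing an estimated value of Pi. You can also confirm that the daemons' web UIs respond (assuming the default Hadoop 2.x ports):

  ```
  curl -sI http://localhost:50070 | head -n 1   # NameNode UI
  curl -sI http://localhost:8088 | head -n 1    # ResourceManager UI
  curl -sI http://localhost:19888 | head -n 1   # JobHistory UI
  ```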
- Create a start/stop script

  cat run_hdfs.sh

  #HADOOP VARIABLES START
  export JAVA_HOME=/usr/lib/jvm/java-8-oracle
  export HADOOP_INSTALL=/usr/local/hadoop
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  export PATH=$PATH:$HADOOP_INSTALL/bin
  export PATH=$PATH:$HADOOP_INSTALL/sbin
  export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_HOME=$HADOOP_INSTALL
  export HADOOP_HDFS_HOME=$HADOOP_INSTALL
  export YARN_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native"
  export JAVA_LIBRARY_PATH=$HADOOP_INSTALL/lib/native
  export LD_LIBRARY_PATH=$HADOOP_INSTALL/lib/native:$LD_LIBRARY_PATH
  export YARN_EXAMPLES=$HADOOP_INSTALL/share/hadoop/mapreduce
  export HADOOP_MAPRED_STOP_TIMEOUT=30
  export YARN_STOP_TIMEOUT=30
  #HADOOP VARIABLES END

  case $1 in
  start)
      su -p - hduser -c $HADOOP_INSTALL/sbin/start-dfs.sh
      su -p - yarn -c $HADOOP_INSTALL/sbin/start-yarn.sh
      su -p - hduser -c "$HADOOP_INSTALL/sbin/mr-jobhistory-daemon.sh start historyserver"
      ;;
  stop)
      su -p - hduser -c "$HADOOP_INSTALL/sbin/mr-jobhistory-daemon.sh stop historyserver"
      su -p - yarn -c $HADOOP_INSTALL/sbin/stop-yarn.sh
      su -p - hduser -c $HADOOP_INSTALL/sbin/stop-dfs.sh
      ;;
  esac
- Start and stop HDFS as follows:

  chmod u+x run_hdfs.sh
  sudo ./run_hdfs.sh start
  sudo ./run_hdfs.sh stop
Apache ZooKeeper is used by Hadoop components to coordinate their actions across the cluster.
- Create a zookeeper user; we use this user for the installation:

  sudo adduser --ingroup hadoop zookeeper

- Download ZooKeeper (latest stable version, 3.4.10 at the time of writing):

  cd Downloads/
  wget http://www-us.apache.org/dist/zookeeper/stable/zookeeper-3.4.10.tar.gz

- Untar the binaries and make two (2) copies, one for each instance:

  tar xvf zookeeper-3.4.10.tar.gz
  sudo mv zookeeper-3.4.10 /usr/local/zookeeper
  tar xvf zookeeper-3.4.10.tar.gz
  sudo mv zookeeper-3.4.10 /usr/local/zookeeper1
- Configure ZooKeeper

  - Open zoo.cfg with your preferred editor:

    sudo gedit /usr/local/zookeeper/conf/zoo.cfg
    or sudo nano /usr/local/zookeeper/conf/zoo.cfg

  - Set the file as follows:

    # The number of milliseconds of each tick
    tickTime=2000
    # The number of ticks that the initial
    # synchronization phase can take
    initLimit=10
    # The number of ticks that can pass between
    # sending a request and getting an acknowledgement
    syncLimit=5
    # the directory where the snapshot is stored
    dataDir=/usr/local/zookeeper/data
    # the port at which the clients will connect
    clientPort=2181
    # the maximum number of client connections.
    # increase this if you need to handle more clients
    #maxClientCnxns=60
    #
    # Be sure to read the maintenance section of the
    # administrator guide before turning on autopurge.
    # http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
    #
    # The number of snapshots to retain in dataDir
    #autopurge.snapRetainCount=3
    # Purge task interval in hours
    # Set to "0" to disable auto purge feature
    #autopurge.purgeInterval=1
    server.1=localhost:2888:3888
    server.2=localhost:2889:3889

  - Copy the file java.env located in the Zookeeper folder of this repository to /usr/local/zookeeper/conf:

    sudo mv Zookeeper/java.env /usr/local/zookeeper/conf

  - Create the data directory. Inside it, create a myid file containing the single character 1; this tells the ZooKeeper server which server it is and, accordingly, which ports to use (from the last two lines of zoo.cfg):

    mkdir /usr/local/zookeeper/data
    mkdir /usr/local/zookeeper/logs
    echo "1" > /usr/local/zookeeper/data/myid
- Configure the second server

  - Open zoo.cfg with your preferred editor:

    sudo gedit /usr/local/zookeeper1/conf/zoo.cfg
    or sudo nano /usr/local/zookeeper1/conf/zoo.cfg

  - Set the file as follows (identical to the first server except for dataDir and clientPort):

    # The number of milliseconds of each tick
    tickTime=2000
    # The number of ticks that the initial
    # synchronization phase can take
    initLimit=10
    # The number of ticks that can pass between
    # sending a request and getting an acknowledgement
    syncLimit=5
    # the directory where the snapshot is stored
    dataDir=/usr/local/zookeeper1/data
    # the port at which the clients will connect
    clientPort=2182
    # the maximum number of client connections.
    # increase this if you need to handle more clients
    #maxClientCnxns=60
    #
    # Be sure to read the maintenance section of the
    # administrator guide before turning on autopurge.
    # http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
    #
    # The number of snapshots to retain in dataDir
    #autopurge.snapRetainCount=3
    # Purge task interval in hours
    # Set to "0" to disable auto purge feature
    #autopurge.purgeInterval=1
    server.1=localhost:2888:3888
    server.2=localhost:2889:3889

  - Copy the file java.env located in the Zookeeper1 folder of this repository to /usr/local/zookeeper1/conf:

    sudo mv Zookeeper1/java.env /usr/local/zookeeper1/conf

  - Create the data directory for server 2:

    mkdir /usr/local/zookeeper1/data
    mkdir /usr/local/zookeeper1/logs
    echo "2" > /usr/local/zookeeper1/data/myid
- Change ownership of the directories:

  sudo chown -R zookeeper:hadoop /usr/local/zookeeper
  sudo chown -R zookeeper:hadoop /usr/local/zookeeper1
- Start the servers for test purposes:

  apa@apa-Lenovo-ideapad-700-15ISK:~$ sudo su -p - zookeeper -c "/usr/local/zookeeper/bin/zkServer.sh start"
  ZooKeeper JMX enabled by default
  Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
  Starting zookeeper ... STARTED
  apa@apa-Lenovo-ideapad-700-15ISK:~$ sudo su -p - zookeeper -c "/usr/local/zookeeper1/bin/zkServer.sh start"
  ZooKeeper JMX enabled by default
  Using config: /usr/local/zookeeper1/bin/../conf/zoo.cfg
  Starting zookeeper ... STARTED
  apa@apa-Lenovo-ideapad-700-15ISK:~$ echo srvr | nc localhost 2181
  Zookeeper version: 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
  Latency min/avg/max: 0/0/0
  Received: 1
  Sent: 0
  Connections: 1
  Outstanding: 0
  Zxid: 0x0
  Mode: follower
  Node count: 4
  apa@apa-Lenovo-ideapad-700-15ISK:~$ echo srvr | nc localhost 2182
  Zookeeper version: 3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
  Latency min/avg/max: 0/0/0
  Received: 1
  Sent: 0
  Connections: 1
  Outstanding: 0
  Zxid: 0x200000000
  Mode: leader
  Node count: 4
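  While the servers are running, you can also connect to each instance with the bundled CLI client and list the root znode:

  ```
  sudo su - zookeeper -c '/usr/local/zookeeper/bin/zkCli.sh -server localhost:2181 ls /'
  sudo su - zookeeper -c '/usr/local/zookeeper/bin/zkCli.sh -server localhost:2182 ls /'
  ```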
- Stop the servers:

  sudo su -p - zookeeper -c "/usr/local/zookeeper/bin/zkServer.sh stop"
  sudo su -p - zookeeper -c "/usr/local/zookeeper1/bin/zkServer.sh stop"
- Install prerequisites

  - Install Hadoop (covered above)
  - Install ZooKeeper (covered above)
  - Install MySQL and the MySQL JDBC driver as follows:

    sudo apt-get update
    sudo apt-get install mysql-server
    sudo service mysql status
    sudo mysql_secure_installation
    sudo apt-get install libmysql-java
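  To confirm the JDBC driver landed where the later Hive step expects it:

  ```
  ls -l /usr/share/java/mysql-connector-java.jar
  ```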
- Add environment variables to the .bashrc file of each of the users hduser, zookeeper, and yarn, as well as your own account (apa in this example). For each of the service users, switch to that user first (skip this step for your own account):

  sudo su - hduser
  (likewise sudo su - zookeeper and sudo su - yarn)

  Open the .bashrc file to edit:

  gedit .bashrc
  or nano .bashrc
  or vi .bashrc

  Add the following text at the end of the file:

  #PIG VARIABLES
  export PIG_HOME=/usr/local/pig
  export PATH=$PATH:$PIG_HOME/bin
  export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop
  #PIG VARIABLES END
  #HIVE_VARIABLES
  export HIVE_HOME=/usr/local/hive
  export PATH=$PATH:$HIVE_HOME/bin

  Save and close the file, then close the user's session:

  exit

  Repeat for the zookeeper and yarn users and for your own account.
- Create the hive user:

  sudo adduser --ingroup hadoop hive
- Edit the .bashrc file for the hive user:

  sudo su - hive

  Open the .bashrc file to edit:

  gedit .bashrc
  or nano .bashrc
  or vi .bashrc

  Add the following text at the end of the file:

  #HADOOP VARIABLES START
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  export HADOOP_INSTALL=/usr/local/hadoop
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  export PATH=$PATH:$HADOOP_INSTALL/bin
  export PATH=$PATH:$HADOOP_INSTALL/sbin
  export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_HOME=$HADOOP_INSTALL
  export HADOOP_HDFS_HOME=$HADOOP_INSTALL
  export YARN_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native"
  export JAVA_LIBRARY_PATH=$HADOOP_INSTALL/lib/native
  export LD_LIBRARY_PATH=$HADOOP_INSTALL/lib/native:$LD_LIBRARY_PATH
  export YARN_EXAMPLES=$HADOOP_INSTALL/share/hadoop/mapreduce
  export HADOOP_MAPRED_STOP_TIMEOUT=30
  export YARN_STOP_TIMEOUT=30
  #HADOOP VARIABLES END
  #PIG VARIABLES
  export PIG_HOME=/usr/local/pig
  export PATH=$PATH:$PIG_HOME/bin
  export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop
  #PIG VARIABLES END
  #HIVE_VARIABLES
  export HIVE_HOME=/usr/local/hive
  export PATH=$PATH:$HIVE_HOME/bin

  Save and close the file, then close the hive user session:

  exit
- Download and extract the Hive binary:

  cd Downloads/
  wget http://apache.mirrors.lucidnetworks.net/hive/stable-2/apache-hive-2.3.2-bin.tar.gz
  tar xvf apache-hive-2.3.2-bin.tar.gz
  sudo mv apache-hive-2.3.2-bin /usr/local/hive
  cd /usr/local
  sudo chown -R hive:hadoop hive
- Make the MySQL connector library available to Hive:

  source .bashrc
  sudo su - hive
  cd $HIVE_HOME/lib
  ln -s /usr/share/java/mysql-connector-java.jar mysql-connector-java.jar
  exit
- Create the metastore database and user for Hive:

  mysql -u root -p
  mysql> create database metastore;
  mysql> create user 'hive'@localhost identified by 'hive';
  mysql> grant all on metastore.* to 'hive'@localhost identified by 'hive';
  mysql> flush privileges;
  mysql> quit;
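  You can confirm the new account can reach the database:

  ```
  mysql -u hive -phive -e "show databases;"   # 'metastore' should be listed
  ```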
- Set the Hive configuration

  - Configure hive-env.sh

    Open $HIVE_HOME/conf/hive-env.sh using your preferred editor:

    sudo gedit $HIVE_HOME/conf/hive-env.sh
    or sudo nano $HIVE_HOME/conf/hive-env.sh
    or sudo vi $HIVE_HOME/conf/hive-env.sh

    Add or update HADOOP_HOME in this file as follows:

    # Set HADOOP_HOME to point to a specific hadoop install directory
    HADOOP_HOME=/usr/local/hadoop

    Save and close the file.
  - Configure hive-log4j2.properties

    Open $HIVE_HOME/conf/hive-log4j2.properties using your preferred editor:

    sudo gedit $HIVE_HOME/conf/hive-log4j2.properties
    or sudo nano $HIVE_HOME/conf/hive-log4j2.properties
    or sudo vi $HIVE_HOME/conf/hive-log4j2.properties

    Update the log location (the default is /tmp):

    property.hive.log.dir = /usr/local/hive/logs/${sys:user.name}

    Save and close the file.
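    The daemonizing script later in this guide also redirects its output to /usr/local/hive/logs, and shell redirection does not create directories, so make sure the directory exists and belongs to the hive user:

    ```
    sudo mkdir -p /usr/local/hive/logs
    sudo chown -R hive:hadoop /usr/local/hive/logs
    ```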
  - Configure hive-site.xml

    Open $HIVE_HOME/conf/hive-site.xml using your preferred editor:

    sudo gedit $HIVE_HOME/conf/hive-site.xml
    or sudo nano $HIVE_HOME/conf/hive-site.xml
    or sudo vi $HIVE_HOME/conf/hive-site.xml

    Add the following text between the <configuration> tags (the connection user and password must match the metastore account created above):

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost/metastore</value>
      <description>the URL of the MySQL database</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hive</value>
    </property>
    <property>
      <name>datanucleus.fixedDatastore</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.metastore.schema.verification</name>
      <value>true</value>
    </property>
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://localhost:9083</value>
      <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
    </property>
    <property>
      <name>hive.support.concurrency</name>
      <description>Enable Hive's Table Lock Manager Service</description>
      <value>true</value>
    </property>
    <property>
      <name>datanucleus.autoStartMechanism</name>
      <value>SchemaTable</value>
    </property>
    <property>
      <name>hive.security.authorization.createtable.owner.grants</name>
      <value>ALL</value>
      <description>
        The privileges automatically granted to the owner whenever a table gets created.
        An example like "select,drop" will grant select and drop privilege to the owner
        of the table. Note that the default gives the creator of a table no access to
        the table (but see HIVE-8067).
      </description>
    </property>
    <property>
      <name>hive.warehouse.subdir.inherit.perms</name>
      <value>false</value>
      <description>
        Set this to false if the table directories should be created with the permissions
        derived from dfs umask instead of inheriting the permission of the warehouse or
        database directory.
      </description>
    </property>
    <property>
      <name>hive.security.authorization.enabled</name>
      <value>true</value>
      <description>enable or disable the Hive client authorization</description>
    </property>
    <property>
      <name>hive.users.in.admin.role</name>
      <value>apa,hive</value>
      <description>
        Comma separated list of users who are in admin role for bootstrapping.
        More users can be added in ADMIN role later.
      </description>
    </property>
    <property>
      <name>hive.zookeeper.quorum</name>
      <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
      <value>localhost:2181,localhost:2182</value>
    </property>
    <property>
      <name>hive.server2.thrift.port</name>
      <value>10001</value>
      <description>TCP port number to listen on, default 10000</description>
    </property>

    Save and close the file.
- Initialize the metastore schema:

  sudo su - hive
  schematool -dbType mysql -initSchema
  exit
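  As a sanity check, schematool can report the schema it just created; the version shown should match the Hive release (2.3.x here):

  ```
  sudo su - hive -c '/usr/local/hive/bin/schematool -dbType mysql -info'
  ```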
- Create the HDFS directories for Hive:

  sudo su - hduser
  hdfs dfs -mkdir /user/hive
  hdfs dfs -chmod 755 /user/hive
  hdfs dfs -mkdir /user/hive/warehouse
  hdfs dfs -chmod 1777 /user/hive/warehouse
  hdfs dfs -chown -R hive:hadoop /user/hive
  exit
- Run HiveServer2 and the metastore:

  sudo ./run_hdfs.sh start
  sudo su -p - zookeeper -c "/usr/local/zookeeper/bin/zkServer.sh start"
  sudo su -p - zookeeper -c "/usr/local/zookeeper1/bin/zkServer.sh start"
  sudo su - hive
  $HIVE_HOME/bin/hive --service metastore &
  $HIVE_HOME/bin/hiveserver2
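  Since hiveserver2 stays in the foreground, open a second terminal to check that both services are listening on their ports (9083 for the metastore, 10001 for HiveServer2 as configured above):

  ```
  ss -ltn | grep -E ':(9083|10001)'
  ```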
- Run beeline to verify your installation (replace user apa with your own user):

  apa@apa-Lenovo-ideapad-700-15ISK:~$ $HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10001 -n apa -p apa
  SLF4J: Class path contains multiple SLF4J bindings.
  SLF4J: Found binding in [jar:file:/usr/local/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  Connecting to jdbc:hive2://localhost:10001
  Connected to: Apache Hive (version 2.1.1)
  Driver: Hive JDBC (version 2.1.1)
  17/07/29 23:44:56 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
  Transaction isolation: TRANSACTION_REPEATABLE_READ
  Beeline version 2.1.1 by Apache Hive
  0: jdbc:hive2://localhost:10001> show tables;
  +-----------+--+
  | tab_name  |
  +-----------+--+
  +-----------+--+
  No rows selected (0.444 seconds)
  0: jdbc:hive2://localhost:10001> create table test (foo int, bar string);
  No rows affected (1.316 seconds)
  0: jdbc:hive2://localhost:10001> show tables;
  +-----------+--+
  | tab_name  |
  +-----------+--+
  | test      |
  +-----------+--+
  1 row selected (0.216 seconds)
  0: jdbc:hive2://localhost:10001> !q
  Closing: 0: jdbc:hive2://localhost:10001
  apa@apa-Lenovo-ideapad-700-15ISK:~$
- Daemonize the Hive processes. Create the script run_hive.sh with the following content:

  apa@apa-Lenovo-ideapad-700-15ISK:~$ cat run_hive.sh
  #HADOOP VARIABLES START
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
  export HADOOP_INSTALL=/usr/local/hadoop
  export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  export YARN_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
  export PATH=$PATH:$HADOOP_INSTALL/bin
  export PATH=$PATH:$HADOOP_INSTALL/sbin
  export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_HOME=$HADOOP_INSTALL
  export HADOOP_HDFS_HOME=$HADOOP_INSTALL
  export YARN_HOME=$HADOOP_INSTALL
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
  export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_INSTALL/lib/native"
  export JAVA_LIBRARY_PATH=$HADOOP_INSTALL/lib/native
  export LD_LIBRARY_PATH=$HADOOP_INSTALL/lib/native:$LD_LIBRARY_PATH
  export YARN_EXAMPLES=$HADOOP_INSTALL/share/hadoop/mapreduce
  export HADOOP_MAPRED_STOP_TIMEOUT=30
  export YARN_STOP_TIMEOUT=30
  #HADOOP VARIABLES END
  #PIG VARIABLES
  export PIG_HOME=/usr/local/pig
  export PATH=$PATH:$PIG_HOME/bin
  export PIG_CLASSPATH=$PIG_HOME/conf:$HADOOP_INSTALL/etc/hadoop
  #PIG VARIABLES END
  #HIVE_VARIABLES
  export HIVE_HOME=/usr/local/hive
  export PATH=$PATH:$HIVE_HOME/bin

  case $1 in
  start)
      echo "Starting metastore..."
      su - hive -c "nohup /usr/local/hive/bin/hive --service metastore > /usr/local/hive/logs/hivemetastore.log 2>&1 &"
      sleep 20
      su - hive -c "ps -U hive -f | grep metastore | awk '\$8 ~ /oracle/ {print \$2}' > /usr/local/hive/metastore.pid"
      echo "Starting hiveserver2..."
      su - hive -c "nohup /usr/local/hive/bin/hiveserver2 > /usr/local/hive/logs/hiveserver2.log 2>&1 &"
      sleep 10
      su - hive -c "ps -U hive -f | grep HiveServer | awk '\$8 ~ /oracle/ {print \$2}' > /usr/local/hive/hiveserver2.pid"
      ;;
  stop)
      echo "Stopping hiveserver2..."
      su - hive -c "kill `cat /usr/local/hive/hiveserver2.pid`"
      sleep 10
      echo "Stopping metastore..."
      su - hive -c "kill `cat /usr/local/hive/metastore.pid`"
      sleep 10
      ;;
  esac
- Make the script executable and test it:

  chmod u+x run_hive.sh
  sudo ./run_hive.sh start
  sudo ./run_hive.sh stop
- Download or clone this repository.
- Make sure your 2 TB HDD is the media disk.
- Give your user permission to modify the media disk:

  sudo chown -R apa:hadoop /media/

- Create the path /hive/Data/ on your media unit; the final path should be /media/hive/Data/:

  cd /media/
  mkdir hive
  mkdir hive/Data
  cd
- Create a tools folder in your home directory:

  mkdir tools

- Move or copy the Tools content into your local tools folder.

  Move example with the apa user (change apa to your local user):

  mv /home/apa/software_trend/Tools/ /home/apa/tools

  Copy example with the apa user (change apa to your local user):

  cp -r /home/apa/software_trend/Tools/ /home/apa/tools
- Download the Google N-gram repository by executing the script 5Ngram-EN-US-GB-download.sh located inside the tools folder:

  sudo ./tools/5Ngram-EN-US-GB-download.sh

  After a while (depending on your internet connection's download speed) you will have a set of folders and tar.gz files under /media/hive/Data/:

  /media/hive/Data/5N-EN-US-Part1/googlebooks-eng-us-all-5gram-20120701-0.gz
  /media/hive/Data/5N-EN-US-Part1/googlebooks-eng-us-all-5gram-20120701-1.gz
  /media/hive/Data/5N-EN-US-Part1/googlebooks-eng-us-all-5gram-20120701-2.gz
  /media/hive/Data/5N-EN-US-Part1/googlebooks-eng-us-all-5gram-20120701-3.gz
  ...
  ...
  /media/hive/Data/5N-EN-GB-Part4/googlebooks-eng-gb-all-5gram-20120701-_VERB_.gz
- MapReduce the Google N-gram repository using Hive by executing the following script:

  sudo ./tools/run_all.sh start
- Divide the book to compare (it must be a txt file) into 5-grams by executing:

  hadoop jar Ngram/Ngram/out/artifacts/Ngram_jar/Ngram.jar file:///home/apa/Ngram/Dracula.txt file:///home/apa/Ngram/Output/DemoOutOne 5
- Using Hive, compare the American English 5-grams with the book's 5-grams, and the British English 5-grams with the book's 5-grams. Once the comparison is made, the system calculates the Jaccard coefficient and the Jaccard distance for each pair to determine the more likely language (see the sketch after this step). Execute the following command:

  sudo ./tools/run_all_2.sh start
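  For reference, the Jaccard coefficient of two 5-gram sets A and B is |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 minus that; the language whose corpus has the smaller distance to the book is the more likely one. A minimal shell sketch of the same computation over two hypothetical files with one 5-gram per line (book.txt and corpus.txt are placeholder names, not files produced by the scripts above):

  ```
  # sizes of the intersection and union of the two 5-gram sets
  inter=$(comm -12 <(sort -u book.txt) <(sort -u corpus.txt) | wc -l)
  union=$(sort -u book.txt corpus.txt | wc -l)
  echo "scale=6; $inter / $union" | bc       # Jaccard coefficient
  echo "scale=6; 1 - $inter / $union" | bc   # Jaccard distance
  ```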