---
title: HIVE Lab
type: lab
duration: "1:25"
creator:
    name: Francesco Mosconi
    city: SF
---

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) HIVE Lab

## Introduction
In the previous labs we introduced Hadoop and MRJob and used them to run increasingly complex MapReduce jobs.

It would be nice, however, to be able to use the familiar SQL syntax we learned with relational databases when dealing with Hadoop. Luckily, the Hadoop ecosystem contains several sub-projects (tools), such as Sqoop, Pig, and Hive, that complement the core Hadoop modules. In particular:

- _Sqoop_ is used to import and export data between HDFS and relational databases (RDBMS).
- _Pig_ is a platform with a procedural scripting language used to write MapReduce operations.
- _Hive_ is a platform used to write SQL-like scripts that run as MapReduce operations.

In this lab we will focus on **Hive**.

## Hive

Hive enables analysis of large data sets using a language very similar to standard ANSI SQL. This means anyone who can write SQL queries can access data stored on the Hadoop cluster. Hive offers a simple interface for:

- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence (e.g., Google Analytics)
- Predictive modeling, hypothesis testing

Let's start Hive by typing `hive` at our VM prompt.

**NOTE:** If you turned the VM off, you'll have to restart all the big data services by running `bigdata_start.sh`.

You should see a prompt like this:

    hive>

The `SHOW TABLES;` command lists the existing tables:

    hive> SHOW TABLES;
    
**Check:** Do you remember the equivalent PostgreSQL command?
> Answer: `\dt`

Let's create a table called `gutenberg` where we'll store the word counts for the `project_gutenberg` documents.

```sql
CREATE EXTERNAL TABLE gutenberg (
    word STRING,
    count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/vagrant/output_gutenberg';
```

We have just created a table called `gutenberg` that references the output folder of the `project_gutenberg` Hadoop MapReduce job we ran earlier. Because the table is `EXTERNAL`, Hive reads the files in place, and dropping the table does not delete them.

**Check:** Look at the content of that folder, either in the file browser or from the command line:

    $ hadoop fs -cat /user/vagrant/output_gutenberg/part*

Now that we have defined the table in Hive, we can query it using a SQL-like statement:

    hive> select * from gutenberg order by count desc limit 10;

As you will see, this starts a MapReduce job on the output files and should return something like this:


    Total MapReduce CPU Time Spent: 4 seconds 460 msec
    OK
    the 63656
    of  34367
    and 32787
    to  31399
    a   24811
    in  18168
    I   18070
    his 13485
    he  13299
    was 13029
    Time taken: 37.311 seconds, Fetched: 10 row(s)
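
To make the query above more concrete, here is a minimal Python sketch of what it computes: parse tab-separated `word<TAB>count` lines, sort by count in decreasing order, and keep the first rows. The function name `top_words` and the small `sample` list are illustrative assumptions; the counts are copied from the output shown above.

```python
def top_words(lines, limit=10):
    """Parse tab-separated `word<TAB>count` lines and return the most frequent words."""
    rows = []
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        rows.append((word, int(count)))
    # Equivalent of ORDER BY count DESC followed by LIMIT
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Illustrative rows in the same format as the job output (not the full data set).
sample = ["he\t13299\n", "the\t63656\n", "was\t13029\n", "of\t34367\n"]
print(top_words(sample, limit=3))
```

Hive does the same work, but distributes it across the cluster as a MapReduce job, which is why even a small query takes several seconds.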


## Exercise 1: Word count in Hive (20 min)

Let's go ahead and perform the word count for one of the Project Gutenberg books using Hive.

#### 1. Alice in Wonderland word count

Let's start by counting the words of Alice in Wonderland (pg11.txt).

- Create a table called `alice_text` whose rows map to the lines of the text file.
- Create a table called `alice` that counts the words.
    - Hint: you will need to use the `LATERAL VIEW` keywords to parse the text file table.

You can use these three resources as references to find the appropriate commands:

- https://www.linkedin.com/pulse/word-count-program-using-r-spark-map-reduce-pig-hive-python-sahu
- http://www.hadooplessons.info/2014/12/in-this-post-i-am-going-to-discuss-how.html
- http://stackoverflow.com/questions/10039949/word-count-program-in-hive
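
If the `LATERAL VIEW` hint feels abstract, the following Python sketch mimics what a `LATERAL VIEW explode(split(...))` followed by a `GROUP BY` does conceptually: every line is split into words, each word becomes its own row, and the rows are then counted. It is only an analogy, not the HiveQL answer; the function name `explode_words` and the two text rows are made up for illustration.

```python
from collections import Counter


def explode_words(lines):
    """Yield one word per output row, similar to exploding a split line."""
    for line in lines:
        for word in line.split(" "):
            if word:  # skip empty strings produced by repeated spaces
                yield word


# Two made-up lines standing in for rows of the alice_text table.
text_rows = ["down the rabbit hole", "the rabbit was late"]
word_counts = Counter(explode_words(text_rows))  # plays the role of GROUP BY word + COUNT(*)
print(word_counts["the"], word_counts["hole"])
```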

#### 2. Peter Pan word count

Repeat the operation, creating a new table called `peter` that stores the word counts from Peter Pan (pg16.txt).

Note that you can get the definition of a table by using the `describe` command:

    hive> describe alice;
    hive> describe peter;

## Exercise 2: Joins in Hive (20 min)

The advantage of having a SQL-like interface is that it makes join operations much easier to perform.

Find the words common to the `alice` and `peter` tables and sort them by their combined count in decreasing order. Limit the output to the 20 most common words.

The result should look something like:

|word|alice_count|peter_count|sum|
|---|---|---|---|
|the|1664|2331|3995|
|and|780|1396|2176|
|...|...|...|...|
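
As a reference for what the join should compute, here is a small Python sketch of an inner join on `word` followed by a sort on the combined count. It is an analogy under simple assumptions, not the HiveQL solution: the two dictionaries stand in for the tables, the function name `common_words` is made up, and only the counts for "the" and "and" come from the table above.

```python
def common_words(alice, peter, limit=20):
    """Inner-join two {word: count} mappings and rank rows by the combined count."""
    joined = [
        (word, alice[word], peter[word], alice[word] + peter[word])
        for word in alice.keys() & peter.keys()  # keep only words present in both
    ]
    joined.sort(key=lambda row: row[3], reverse=True)  # ORDER BY sum DESC
    return joined[:limit]


# "the" and "and" use the counts shown above; "rabbit" and "hook" are made-up entries.
alice = {"the": 1664, "and": 780, "rabbit": 50}
peter = {"the": 2331, "and": 1396, "hook": 60}
for row in common_words(alice, peter):
    print(row)
```

Words that appear in only one book ("rabbit", "hook") drop out, which is exactly the behavior of an inner join.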

## Exercise 3: Hive Serialization and Deserialization (30 min)

If your mind is not yet blown by the possibility of running SQL-like queries on Hadoop, the next exercise will surely impress you.

In fact, we can run SQL queries on raw text documents by describing their fields with a regular expression, using the `SERDE` keyword (short for serializer/deserializer).

As an example, we have uploaded a log file that is freely accessible at:

http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html

and we will parse it in Hive.

As explained on that page, the logs are an ASCII file with one line per request, with the following columns:

- host making the request
- timestamp (the timezone is -0400)
- request, given in quotes (careful: sometimes there are additional quotes)
- HTTP status reply code
- bytes in the reply


In order to parse it we will have to use regular expressions. Since you have not been formally introduced to regular expressions yet, we will walk you through this part.

### Regular expressions

A regular expression is a sequence of characters that defines a search pattern, mainly used for string matching, e.g. in "find and replace"-like operations. In Python, regular expressions are implemented in the `re` module. You can find a cheat sheet of patterns [here](https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/).

The `re` module contains several functions to match patterns, including:

- `match`: to match a pattern at the beginning of a string
- `search`: to match a pattern anywhere in the string
- `findall`: to match all occurrences of the pattern in a string
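
The difference between these three functions is easiest to see on a single string. The short sketch below applies the same pattern, `\d+` (one or more digits), to a log line taken from the ClarkNet file; the variable names are arbitrary.

```python
import re

line = 'access9.accsyst.com - - [28/Aug/1995:00:00:35 -0400] "GET /pub/robert/past99.gif HTTP/1.0" 200 4993'

# match only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"\d+", line)  # the line starts with "access9", not with a digit
# search scans the string and returns the first match found anywhere.
first_number = re.search(r"\d+", line)
# findall returns every non-overlapping match as a list of strings.
all_numbers = re.findall(r"\d+", line)

print(at_start)              # None, because the line does not start with digits
print(first_number.group())  # the "9" in "access9"
print(all_numbers[-2:])      # the status code and the byte count
```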

#### Load a few lines of the logs

Using the `gzip` module, load a few lines of the [log file](../../assets/datasets/clarknet_access_log_Aug28.gz) and print them.
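
One possible way to do this is sketched below. The helper name `read_first_lines` is made up, and the `latin-1` encoding is an assumption chosen so that stray non-ASCII bytes do not raise decoding errors. So that the sketch runs anywhere, it first writes a tiny stand-in archive with two invented log lines; in the lab you would pass the path to the real `.gz` file instead.

```python
import gzip
import os
import tempfile
from itertools import islice


def read_first_lines(path, n=5):
    """Return the first n lines of a gzip-compressed text file."""
    # "rt" opens the archive in text mode, so each line is a str rather than bytes.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return list(islice(handle, n))


# Stand-in archive with invented lines; replace demo_path with the real log path in the lab.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_log.gz")
with gzip.open(demo_path, "wt", encoding="latin-1") as handle:
    handle.write('host-a - - [28/Aug/1995:00:00:34 -0400] "GET /a.html HTTP/1.0" 200 3542\n')
    handle.write('host-b - - [28/Aug/1995:00:00:35 -0400] "GET /b.gif HTTP/1.0" 200 4993\n')

lines = read_first_lines(demo_path, n=2)
for line in lines:
    print(line, end="")
```

Using `islice` stops reading after `n` lines, so the whole file is never decompressed into memory.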

#### Matching the IP address

Here are the first 2 lines in the log file:

    1: '204.249.225.59 - - [28/Aug/1995:00:00:34 -0400] "GET /pub/rmharris/catalogs/dawsocat/intro.html HTTP/1.0" 200 3542\n'
    2: 'access9.accsyst.com - - [28/Aug/1995:00:00:35 -0400] "GET /pub/robert/past99.gif HTTP/1.0" 200 4993\n'

Let's write a regular expression to match the IP address or host name that appears at the beginning of the string. Various solutions are possible, and we have already prepared a couple of test cases for you.

    import re

    line1 = '204.249.225.59 - - [28/Aug/1995:00:00:34 -0400] "GET /pub/rmharris/catalogs/dawsocat/intro.html HTTP/1.0" 200 3542\n'
    line2 = 'access9.accsyst.com - - [28/Aug/1995:00:00:35 -0400] "GET /pub/robert/past99.gif HTTP/1.0" 200 4993\n'

    pattern = <WRITE YOUR PATTERN HERE>

    assert re.findall(pattern, line1) == ['204.249.225.59']
    assert re.findall(pattern, line2) == ['access9.accsyst.com']

Let's write a regular expression to match the date and time that appear within square brackets. Various solutions are possible. You can discard the timezone for now.

    pattern = <WRITE YOUR PATTERN HERE>

    assert re.findall(pattern, line1) == ['28/Aug/1995:00:00:34']
    assert re.findall(pattern, line2) == ['28/Aug/1995:00:00:35']

If you are lost because it's your first time with regular expressions, no worries, below is the solution to the problem:

    def get_it(line):
        # Capture groups, in order: host, timestamp (timezone dropped), method, URI, status, bytes
        return re.findall(
            r'^([^ ]+)\s+-\s+-\s+\[([^\]]+)\s+-0400\]\s+\"([^ ]+)?\s+([^\"]+)\s*.*?\"\s+([^ ]+)\s+([^ ]+)',
            line)

    assert get_it('tampico.usc.edu - - [28/Aug/1995:01:04:59 -0400] "GET / " 200 1834') == [
        ('tampico.usc.edu', '28/Aug/1995:01:04:59', 'GET', '/ ', '200', '1834')]

    assert get_it('cconcepts14.cconcepts.co.uk - - [29/Aug/1995:06:57:43 -0400] "GET /" 200 1834') == [
        ('cconcepts14.cconcepts.co.uk', '29/Aug/1995:06:57:43', 'GET', '/', '200', '1834')]

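Let's check that the regex handles every line, including the unusual ones:

    my_regex = r'^([^ ]+)\s+-\s+-\s+\[([^\]]+)\s+-0400\]\s+\"([^ ]+)?\s+([^\"]+)\s*.*?\"\s+([^ ]+)\s+([^ ]+)'
    # `lines` is the list of log lines loaded earlier with gzip
    for l in lines:
        ti = re.findall(my_regex, l)
        try:
            if len(ti[0]) != 6:
                print(len(ti[0]), ti)
        except IndexError:
            # findall returned an empty list: the regex did not match this line
            print(l)
            print(ti)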

We are now ready to parse the log in Hive.

- Create a table in Hive called `logs` with as many fields as the regex above extracts.
- Use the code template below to run the query.

> Instructor note: Students have some freedom in how they split each line and which fields they extract, but they should at least separate these:
>
> - host STRING
> - datetime STRING
> - method STRING
> - uri STRING
> - status INT


```sql
create external table logs(
    <INSERT HERE FIELD DEFINITION>,
    <INSERT HERE FIELD DEFINITION>
    )
  row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
  with serdeproperties ( "input.regex" = <INSERT HERE YOUR REGEX> );

load data local inpath '/home/vagrant/data/logs/clarknet_access_log_Aug28' into table logs;

select * from logs limit 20;
```
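
Conceptually, the regex SerDe assigns each capture group of the regex to the table column in the same position. The Python sketch below imitates that mapping so you can check your field list before creating the table. It assumes the solution regex shown earlier and the column names from the instructor note; `size` is a hypothetical name for the bytes field, and returning `None` for a non-matching line is a choice made for this sketch only. Hive evaluates the pattern with Java's regex engine, which we assume treats this pattern the same way.

```python
import re

# Solution regex from earlier in the lab, written as a raw string.
LOG_REGEX = re.compile(
    r'^([^ ]+)\s+-\s+-\s+\[([^\]]+)\s+-0400\]\s+\"([^ ]+)?\s+([^\"]+)\s*.*?\"\s+([^ ]+)\s+([^ ]+)'
)
# Column names follow the instructor note; "size" is a hypothetical name for the bytes field.
COLUMNS = ["host", "datetime", "method", "uri", "status", "size"]


def to_row(line):
    """Pair each capture group with the column in the same position."""
    match = LOG_REGEX.match(line)
    if match is None:
        return None  # the line does not fit the pattern
    return dict(zip(COLUMNS, match.groups()))


print(to_row('tampico.usc.edu - - [28/Aug/1995:01:04:59 -0400] "GET / " 200 1834'))
```

If the number of capture groups and the number of columns differ, the mapping breaks, so it is worth counting both before running the `create external table` statement.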

## Bonus: Analyze logs

Now that we have a nice relational table defined for our logs, let's ask a couple of questions:

1. What are the top 10 hosts that request pages most frequently?
2. What are the most requested resources (URIs)?
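
Both questions follow the same pattern: group the rows by one column, count them, and keep the largest groups. As a sanity check for your Hive results, the idea can be sketched in Python with a `Counter`. The function name `top_hosts` and the request sequence are made up for illustration; only the host names are borrowed from the examples above.

```python
from collections import Counter


def top_hosts(hosts, limit=10):
    """Count requests per host and return the busiest ones (group, count, sort, limit)."""
    return Counter(hosts).most_common(limit)


# Made-up request sequence reusing host names from the log examples.
requests = [
    "tampico.usc.edu",
    "204.249.225.59",
    "tampico.usc.edu",
    "access9.accsyst.com",
    "tampico.usc.edu",
    "204.249.225.59",
]
print(top_hosts(requests, limit=2))
```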


## Additional Resources

- [Serde example](https://community.hortonworks.com/articles/8313/apache-hive-csv-serde-example.html)
- [Logs Page](http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html)
- [Cloudera Twitter example](https://github.com/cloudera/cdh-twitter-example)
- [AWS Serde example](http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs.html)