## Hive assignment. Task 1
The purpose of this task is to create an external table on the posts data of the stackoverflow.com website.

Create your own database and 'use' it. Create an external table 'posts_sample_external' over the sample posts dataset in the '/data/stackexchange1000' directory. Create a managed table 'posts_sample' and populate it with the data from the external table. The 'posts_sample' table should be partitioned by the year and by the month of post creation. Provide the output of a query that selects the number of lines in each partition, in the format:

```
year <tab> month <tab> lines count
```

where year is in YYYY format and month is in YYYY-MM format. The result is the 3rd line of the last query output.
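
As a quick illustration of the two key formats, the following minimal Python sketch derives them from a `CreationDate` string. It assumes the timestamp starts with `YYYY-MM`; the function name `partition_keys` and the sample timestamp are hypothetical and serve only as an example.

```python
import re

def partition_keys(creation_date):
    """Return (year, month) as ('YYYY', 'YYYY-MM'), or None if the date is malformed."""
    match = re.match(r'^((\d{4})-\d{2})', creation_date)
    if match is None:
        return None
    # group(2) is the year alone, group(1) is the year-month prefix.
    return match.group(2), match.group(1)

# Hypothetical timestamp, for illustration only.
print(partition_keys('2008-10-15T12:34:56.000'))  # ('2008', '2008-10')
```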

## Create database

In [1]:
%%writefile create_db.hql

-- Add jar
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

-- Create database
DROP DATABASE IF EXISTS mydb CASCADE;
CREATE DATABASE IF NOT EXISTS mydb LOCATION '/user/jovyan/somemetastore';


Overwriting create_db.hql


In [2]:
! hive --silent -f create_db.hql

## Create tables

In [3]:
%%writefile create_tables.hql

-- Add jar
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE mydb;


-- Create posts_sample_external table
DROP TABLE IF EXISTS posts_sample_external;
CREATE EXTERNAL TABLE IF NOT EXISTS posts_sample_external(
    Id INT,
    PostTypeId  TINYINT,
    CreationDate STRING,
    Tags STRING,
    OwnerUserId INT,
    ParentId INT,
    Score INT,
    FavoriteCount INT
)
ROW FORMAT 
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = '^<row.*?(?=.*\\bId=\"(\\d+)\")(?=.*\\bPostTypeId=\"(\\d+)\")(?=.*\\bCreationDate=\"([^"]*)\")(?=.*\\bTags=\"([^"]*)\")?(?=.*\\bOwnerUserId=\"(\\d+)\")?(?=.*\\bParentId=\"(\\d+)\")?(?=.*\\bScore=\"(-?\\d+)\")(?=.*\\bFavoriteCount=\"(\\d+)\")?.*$',
      "input.regex.case.insensitive" = 'true'
    )
STORED AS TEXTFILE
LOCATION '/data/stackexchange1000/posts';


-- Create posts_sample table (managed, partitioned by year and month)
DROP TABLE IF EXISTS posts_sample;
CREATE TABLE IF NOT EXISTS posts_sample(
    Id INT,
    PostTypeId  TINYINT,
    CreationDate STRING,
    OwnerUserId INT,
    ParentId INT,
    Score INT,
    FavoriteCount INT,
    Tags array <string>
)
PARTITIONED BY ( 
  year string, 
  month string
)
STORED AS TEXTFILE
LOCATION '/user/jovyan/task1';

Overwriting create_tables.hql


In [4]:
! hive --silent -f create_tables.hql
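
The `input.regex` above relies on lookaheads, apparently so that each attribute of a `<row ... />` line is captured regardless of its position, with some attributes allowed to be absent. To check what a given row would yield before running Hive, the same idea can be sketched in Python. Note that this is a simplified stand-in, not the SerDe's behaviour: it runs one `re.search` per attribute instead of a single lookahead regex, it is case-sensitive, and both the helper name `parse_row` and the sample row are made up for illustration.

```python
import re

# Columns of posts_sample_external, in table order.
COLUMNS = ['Id', 'PostTypeId', 'CreationDate', 'Tags',
           'OwnerUserId', 'ParentId', 'Score', 'FavoriteCount']

def parse_row(line):
    """Map each column to its attribute value, or None when the attribute is absent."""
    values = {}
    for name in COLUMNS:
        # \b keeps 'Id' from matching inside 'PostTypeId' or 'ParentId'.
        match = re.search(r'\b%s="([^"]*)"' % name, line)
        values[name] = match.group(1) if match else None
    return values

# Made-up row for illustration; real rows carry more attributes.
sample = ('<row Id="4" PostTypeId="1" CreationDate="2008-07-31T21:42:52.667" '
          'Score="-3" Tags="&lt;c#&gt;&lt;.net&gt;" />')
row = parse_row(sample)
print(row['Id'], row['Score'], row['ParentId'])
```

Missing attributes come back as `None` here, which is analogous to the NULL columns the table is expected to show for rows without `ParentId` or `FavoriteCount`.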

## Queries

In [5]:
%%writefile query.hql

-- Add jar
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;


-- Enable dynamic partitioning so that year and month are taken from the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

USE mydb;


-- Query
INSERT OVERWRITE TABLE posts_sample
PARTITION (year, month)
SELECT
    Id,
    PostTypeId,
    CreationDate,
    OwnerUserId,
    ParentId,
    Score,
    FavoriteCount,
    split(regexp_replace(Tags, '(&lt\;|&gt\;$)', ''), '&gt\;') AS Tags,
    regexp_extract(CreationDate, '^(\\d{4})', 1) AS year,
    regexp_extract(CreationDate, '^(\\d{4}-\\d{2})', 1) AS month
FROM 
    posts_sample_external
WHERE
    Id IS NOT NULL AND CreationDate IS NOT NULL;


Overwriting query.hql


In [6]:
! hive -f query.hql 2> out.log
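
The `Tags` expression in the query is the least obvious part: it strips every `&lt;` and the trailing `&gt;`, then splits on the remaining `&gt;` separators. A minimal Python sketch of the same transformation follows, assuming tags are stored as escaped `&lt;tag&gt;` sequences (which is what the query implies). The helper name `split_tags` and the sample value are hypothetical, and NULL handling is not modelled.

```python
import re

def split_tags(tags):
    """Mirror split(regexp_replace(Tags, '(&lt;|&gt;$)', ''), '&gt;') from the query."""
    # Drop every '&lt;' and the trailing '&gt;', leaving '&gt;' only between tags.
    stripped = re.sub(r'(&lt;|&gt;$)', '', tags)
    return stripped.split('&gt;')

# Hypothetical Tags value, for illustration only.
print(split_tags('&lt;c#&gt;&lt;.net&gt;&lt;datetime&gt;'))  # ['c#', '.net', 'datetime']
```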

## Output

In [7]:
%%writefile script.py
import re
import sys

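# Read the partition stats line(s) piped in from the Hive log.
text = []
for line in sys.stdin:
    text.append(line.strip())

# Lookaheads pull out year, month and numRows wherever they sit in the line.
year, month, numRows = re.search(r'(?=.*year=(\d+))(?=.*month=(\d+-\d+))(?=.*numRows=(\d+))', text[0]).groups()

# The parenthesized form runs under both python2 and python3.
print('%s\t%s\t%s' % (year, month, numRows))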

Overwriting script.py


In [8]:
cat out.log | grep "Partition mydb.posts_sample" | head -3 | tail -1

Partition mydb.posts_sample{year=2008, month=2008-10} stats: [numFiles=1, numRows=73, totalSize=4051, rawDataSize=3978]
grep: write error: Broken pipe
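
The `Broken pipe` message most likely comes from `head` closing the pipe while `grep` is still writing; the selected line is not affected. One way to sidestep it is to filter and select in a single process. The sketch below is only an illustration: `nth_partition_stats` is a hypothetical helper, and every log line except the 2008-10 one is invented for the example.

```python
import re

STATS = re.compile(
    r'Partition \S+\{year=(\d{4}), month=(\d{4}-\d{2})\} stats: \[.*numRows=(\d+)')

def nth_partition_stats(lines, n=3):
    """Return (year, month, numRows) from the n-th partition stats line, or None."""
    seen = 0
    for line in lines:
        match = STATS.search(line)
        if match:
            seen += 1
            if seen == n:
                return match.groups()
    return None

# Only the last line is taken from the real log; the others are invented.
log = [
    'Partition mydb.posts_sample{year=2008, month=2008-08} stats: [numFiles=1, numRows=10, totalSize=500, rawDataSize=490]',
    'OK',
    'Partition mydb.posts_sample{year=2008, month=2008-09} stats: [numFiles=1, numRows=20, totalSize=900, rawDataSize=880]',
    'Partition mydb.posts_sample{year=2008, month=2008-10} stats: [numFiles=1, numRows=73, totalSize=4051, rawDataSize=3978]',
]
print('\t'.join(nth_partition_stats(log)))
```

Against the real log, the same function could be called as `nth_partition_stats(open('out.log'))`, since it only needs an iterable of lines.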


In [9]:
%%bash

# Print results
# cat out.log | grep "Partition mydb.posts_sample" | head -3 | tail -1 | sed -nr "s/.*year=([0-9]{4}).*month=([0-9]{4}\-[0-9]{2}).*numRows=([0-9]+).*/\1\t\2\t\3/p"
cat out.log | grep "Partition mydb.posts_sample" | head -3 | tail -1 | python2 script.py

# Output stderr
cat out.log >&2


2008	2008-10	73



Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar]
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar]
OK
Time taken: 1.075 seconds
Query ID = jovyan_20180205125353_1a4748d2-f7bf-4167-86fd-4d6377e586bf
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1517822664732_0015, Tracking URL = http://1c945e5f18ec:8088/proxy/application_1517822664732_0015/
Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1517822664732_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2018-02-05 12:53:19,502 Stage-1 map = 0%,  reduce = 0%
2018-02-05 12:53:37,925 Stage-1 