<h3>Parse data from HDFS location '/data/stackexchange1000/posts' into the external table posts_sample_external using a regular expression</h3>

In [60]:
%%writefile query.hql

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE stackoverflow_;

-- Dropping an external table removes only its metadata, the files in LOCATION stay
DROP TABLE IF EXISTS posts_sample_external;

-- One column per capture group in input.regex, in the same order
CREATE EXTERNAL TABLE posts_sample_external (
    row_id string,
    post_type_id string,
    year string,
    month string
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = ".*?(?=.*\\bId=\"(\\d+)\")(?=.*\\bPostTypeId=\"(\\d+)\")(?=.*\\bCreationDate=\"(\\d+)-(\\d+)).*$"
)
LOCATION '/data/stackexchange1000/posts';

Overwriting query.hql
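
The regular expression can be sanity-checked outside Hive before the table is created. The cell below is a minimal Python sketch, assuming Python's `re` module treats this particular pattern the same way as the Java regex engine behind RegexSerDe. `parse_row` is a hypothetical helper, and the sample rows are made up in the style of a Stack Exchange posts dump, not taken from the dataset.

```python
import re

# Same pattern as "input.regex", after Hive has unescaped the string literal
# (\\b becomes \b, \" becomes ").
POST_PATTERN = re.compile(
    r'.*?(?=.*\bId="(\d+)")'
    r'(?=.*\bPostTypeId="(\d+)")'
    r'(?=.*\bCreationDate="(\d+)-(\d+)).*$'
)


def parse_row(line):
    """Return (row_id, post_type_id, year, month), or None if the line does not match."""
    # RegexSerDe needs the whole line to match, so fullmatch is the closest analogue.
    match = POST_PATTERN.fullmatch(line)
    return match.groups() if match else None


# Hypothetical sample lines (illustrative only).
sample = '<row Id="4" PostTypeId="1" CreationDate="2008-07-31T21:42:52.667" Score="322" />'
header = '<posts>'

print(parse_row(sample))
print(parse_row(header))
```

Because each attribute is captured inside its own lookahead, the attributes may appear in any order on the line. Lines that do not match, such as the XML wrapper lines, come out of RegexSerDe with NULL in every column, which is why the insert step filters on `year IS NOT NULL`.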


<h3> Create the table posts_sample partitioned by year and month within HDFS location '/user/jovyan/af_store/' </h3>

In [61]:
%%writefile query2.hql

USE stackoverflow_;

DROP TABLE IF EXISTS posts_sample;

-- Holds one post count per (year, month) partition
CREATE TABLE posts_sample (
    count int
)
PARTITIONED BY (year string, month string)
LOCATION '/user/jovyan/af_store/';

Overwriting query2.hql
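
Hive keeps each partition of a table in its own `key=value` subdirectory under the table location. As a rough illustration of the layout to expect under '/user/jovyan/af_store/', here is a small Python sketch. `partition_path` is a hypothetical helper; it ignores the escaping Hive applies to special characters in partition values, and the month format follows the partition names in the run output further down.

```python
import posixpath

# Table location from the CREATE TABLE statement above.
TABLE_LOCATION = "/user/jovyan/af_store/"


def partition_path(year, month):
    """Build the expected HDFS directory for one (year, month) partition."""
    return posixpath.join(TABLE_LOCATION, f"year={year}", f"month={month}")


print(partition_path("2010", "2010-09"))
```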


<h3> Populate 'posts_sample' table with data from 'posts_sample_external' table </h3>

In [62]:
%%writefile query3.hql

set hive.exec.dynamic.partition.mode=nonstrict;

USE stackoverflow_;

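-- Dynamic partitioning: the last two SELECT columns feed PARTITION (year, month)
-- GROUP BY month refers to the source column, not the concatenated alias
-- Lines that did not match the regex have NULL in every column and are skipped
FROM posts_sample_external
INSERT OVERWRITE TABLE posts_sample
PARTITION (year, month)
SELECT count(*) AS count, year, concat(year, "-", month) AS month
WHERE year IS NOT NULL
GROUP BY year, month;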

Overwriting query3.hql
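
To make the aggregation easier to follow, the same logic can be sketched in plain Python. This is only an illustration of what the query computes, not how Hive executes it; `count_posts` and the input tuples are hypothetical.

```python
from collections import Counter


def count_posts(rows):
    """Count posts per (year, month).

    rows: iterable of (row_id, post_type_id, year, month) tuples,
    where all fields are None for lines the regex did not match.
    """
    counts = Counter(
        (year, month)
        for _row_id, _post_type_id, year, month in rows
        if year is not None  # WHERE year IS NOT NULL
    )
    # Emit (count, year, "year-month"), mirroring the SELECT list.
    result = [(n, year, f"{year}-{month}") for (year, month), n in counts.items()]
    return sorted(result, key=lambda r: (r[1], r[2]))


# Hypothetical parsed rows (illustrative only).
rows = [
    ("4", "1", "2008", "07"),
    ("6", "1", "2008", "07"),
    ("7", "2", "2008", "10"),
    (None, None, None, None),  # a line the regex did not match
]
print(count_posts(rows))
```

Each output tuple corresponds to one partition of `posts_sample`, which matches the run output reporting `numRows=1` per partition.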


<h3> Get the row for month "2008-10"</h3>

In [72]:
%%writefile query4.hql
USE stackoverflow_;
SELECT year, month, count FROM posts_sample WHERE month='2008-10';

Overwriting query4.hql


In [64]:
! hive -f creation_db.hql
! hive -f query.hql
! hive -f query2.hql
! hive -f query3.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.646 seconds
OK
Time taken: 0.321 seconds

Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar]
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar]
OK
Time taken: 1.054 seconds
OK
Time taken: 1.708 seconds
OK
Time taken: 0.575 seconds

Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.124 seconds
OK
Time taken: 4.852 seconds
OK
Time taken: 0.733 seconds

Logging initialized using configuration in jar:file:/usr/loc

Partition stackoverflow_.posts_sample{year=2010, month=2010-09} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2010, month=2010-10} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2010, month=2010-11} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2010, month=2010-12} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2011, month=2011-01} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2011, month=2011-02} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2011, month=2011-03} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stackoverflow_.posts_sample{year=2011, month=2011-04} stats: [numFiles=1, numRows=1, totalSize=4, rawDataSize=3]
Partition stacko

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 36.23 sec   HDFS Read: 60007532 HDFS Write: 7502 SUCCESS
Total MapReduce CPU Time Spent: 36 seconds 230 msec
OK
Time taken: 85.205 seconds


In [77]:
! hive -f query4.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-bin/lib/hive-common-1.1.0.jar!/hive-log4j.properties
OK
Time taken: 1.099 seconds
OK
2008	2008-10	73
Time taken: 2.8 seconds, Fetched: 1 row(s)
