# Hive Assignment 1. DDL: Create Tables

The purpose of this task is to create an external table on the posts data of the *stackoverflow.com* website.

Create your own database and `USE` it. Create an external table `posts_sample_external` over the sample posts dataset in the `/data/stackexchange1000` directory. Then create a managed table `posts_sample` and populate it with the data from the external table. The `posts_sample` table should be partitioned by the year and the month of post creation. Provide the output of a query that counts the rows in each partition, in the format:

```
year <tab> month <tab> lines count
```

where year is in `YYYY` format and month is in `YYYY-MM` format. The result is the 3rd line of the last query output.

The result on the sample dataset:

```
2008    2008-10 73
```


### Step 1. Create database

In [None]:
%%writefile create_db.hql
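-- The database name must match the `USE stackoverflow_;` statements in the later scripts
DROP DATABASE IF EXISTS stackoverflow_ CASCADE;
CREATE DATABASE stackoverflow_ LOCATION '/user/jovyan/stackoverflow_';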

In [None]:
! hive -f create_db.hql

### Step 2. Create tables

**Data example:** 

```
<row Id="1394" PostTypeId="2" ParentId="1390" CreationDate="2008-08-04T16:38:03.667" Score="16" Body="... long text ..." OwnerUserId="91" LastEditorUserId="1" LastEditorDisplayName="Jeff Atwood" LastEditDate="2008-08-27T13:02:50.273" LastActivityDate="2008-08-27T13:02:50.273" CommentCount="1" />
```

**Data model:**

* `id` - _(integer)_ - id of the post
* `year` - _(integer)_ - post creation year in format YYYY
* `month` - _(string)_ - post creation month in format YYYY-MM
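
The three columns are extracted from each `<row ... />` line by the `input.regex` pattern in the table definition below. As a rough sanity check, the pattern can be tried on the sample row in plain Python before running Hive. This is only a sketch: Python's `re` module stands in for the Java regex engine that Hive uses, `re.fullmatch` is assumed to mirror the SerDe's whole-line matching, and `parse_row` is a hypothetical helper name.

```python
import re

# Same pattern as "input.regex", with the HiveQL string escaping removed.
INPUT_REGEX = re.compile(
    r'.*?(?=\bId="(\d+)")'                    # group 1: post id
    r'.*(?=.*\bCreationDate="(\d{4}).*)'      # group 2: year (YYYY)
    r'(?=.*\bCreationDate="(\d{4}-\d{2}).*)'  # group 3: month (YYYY-MM)
    r'.*$'
)

# Sample row copied from the data example above.
SAMPLE_ROW = (
    '<row Id="1394" PostTypeId="2" ParentId="1390" '
    'CreationDate="2008-08-04T16:38:03.667" Score="16" '
    'Body="... long text ..." OwnerUserId="91" LastEditorUserId="1" '
    'LastEditorDisplayName="Jeff Atwood" '
    'LastEditDate="2008-08-27T13:02:50.273" '
    'LastActivityDate="2008-08-27T13:02:50.273" CommentCount="1" />'
)


def parse_row(line):
    """Return (id, year, month) as strings, or None if the line does not match."""
    match = INPUT_REGEX.fullmatch(line)
    return match.groups() if match else None


print(parse_row(SAMPLE_ROW))  # ('1394', '2008', '2008-08')
```

Python returns all three groups as strings; any conversion to numeric column types happens on the Hive side.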

In [None]:
%%writefile create_tables.hql
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-serde.jar;

USE stackoverflow_;
DROP TABLE IF EXISTS posts_sample_external;
DROP TABLE IF EXISTS posts_sample;

-- Create 'posts_sample_external' table with post id, post creation year and post creation month
CREATE EXTERNAL TABLE posts_sample_external (
    id INT,
    year INT,
    month STRING
)
ROW FORMAT
    SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = ".*?(?=\\bId=\"(\\d+)\").*(?=.*\\bCreationDate=\"(\\d{4}).*)(?=.*\\bCreationDate=\"(\\d{4}-\\d{2}).*).*$"
    )
LOCATION '/data/stackexchange1000/posts'
TBLPROPERTIES (
    "skip.header.line.count"="1"
);

-- Create and fill managed table 'posts_sample' with partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.error.on.empty.partition=true;

CREATE TABLE posts_sample (
    id INT
)
PARTITIONED BY (year INT, month STRING);

FROM posts_sample_external
INSERT OVERWRITE TABLE posts_sample
PARTITION (year, month)
SELECT id, year, month;


In [None]:
! hive -f create_tables.hql
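
With dynamic partitioning, Hive assigns each row to a partition based on the values of the trailing `year` and `month` columns in the `SELECT`, and by convention stores each partition under a `year=.../month=...` subdirectory of the table location. The sketch below imitates that grouping in plain Python. The rows are made up for illustration, and `partition_path` is a hypothetical helper, not a Hive API.

```python
from collections import Counter


def partition_path(year, month):
    """Relative directory name conventionally used for a (year, month) partition."""
    return f"year={year}/month={month}"


# Hypothetical (id, year, month) rows, invented for illustration only.
rows = [
    (1, 2008, "2008-08"),
    (2, 2008, "2008-08"),
    (3, 2008, "2008-09"),
]

# Count how many rows would land in each partition directory.
counts = Counter(partition_path(year, month) for _, year, month in rows)
for path, n in sorted(counts.items()):
    print(path, n)
```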

### Step 3. Count posts by month

In [None]:
%%writefile query.hql
ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
USE stackoverflow_;

SELECT year, month, COUNT(*) AS lines_count
FROM posts_sample
GROUP BY year, month
ORDER BY month;

In [None]:
! hive -f query.hql
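
The task asks for the 3rd line of the query output in `year <tab> month <tab> lines count` form. A minimal sketch of picking that line from a Python cell follows. It assumes the `hive` CLI is on `PATH` and writes result rows to standard output while logs go to standard error. `run_query`, `nth_line` and `parse_line` are hypothetical helper names.

```python
import subprocess


def run_query(path="query.hql"):
    """Run a HiveQL file with the hive CLI and return its standard output."""
    done = subprocess.run(
        ["hive", "-f", path], capture_output=True, text=True, check=True
    )
    return done.stdout


def nth_line(output, n=3):
    """Return the n-th line (1-based) of the query output."""
    lines = output.splitlines()
    if len(lines) < n:
        raise ValueError(f"expected at least {n} lines, got {len(lines)}")
    return lines[n - 1]


def parse_line(line):
    """Split a tab-separated result line into (year, month, count)."""
    year, month, count = line.split("\t")
    return year, month, int(count)


# Usage in the notebook (needs a working Hive installation):
# print(nth_line(run_query()))
```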