#### Common warnings:

1. __Back up your solution to the 'work' directory inside the home directory ('/home/jovyan'). It is the only directory whose state is saved between sessions.__

1. Please ensure that you call the right interpreter (python2 or python3). Do not write just "python" without the major version: there is no guarantee that any particular version of Python is the default in the Grading system.

1. One cell must contain only one programming language.
E.g. if a cell contains Python code and you also want to call a bash command (using “!”) in it, you should move the bash command to another cell.

1. Our IPython converter is an improved version of the standard converter Nbconvert, and it can process most of Jupyter's magic commands correctly (e.g. it understands "%%bash" and executes the cell as a bash script). However, we highly recommend avoiding magics wherever possible.

#### Hive-specific warning:

__A Hive task notebook works only in the home directory!__ (The parent directory for starter_and_demos)
Do not submit `CREATE/DROP DATABASE` lines to the Grading system. You will most likely get a "Permission denied" error if you submit such a line.


## Demo notebook for Hive assignments

To run any HiveQL query in the notebook you should:
1. write the query code into a separate file using the `%%writefile [-a] <file>` magic,
2. execute this file in Hive using the `! hive -f <file>` command.

For the grading system to check a task correctly, the execution command must be in a separate cell.

### 1. Creating the database

First, create your Hive database. You can name the database whatever you want.

Let's drop the database if it already exists.

In [1]:
# %%writefile creation_db.hql

# DROP DATABASE IF EXISTS demodb CASCADE;

Writing creation_db.hql


And now create it.

In [2]:
# %%writefile -a creation_db.hql
# CREATE DATABASE demodb LOCATION '/user/jovyan/demodb';

Appending to creation_db.hql


Finally, execute the file we filled earlier.

In [3]:
# ! hive -f creation_db.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.6-bin/lib/hive-common-2.3.6.jar!/hive-log4j2.properties Async: true
OK
Time taken: 3.871 seconds
OK
Time taken: 0.226 seconds


On the real Hadoop cluster where your submission will be checked, Hive databases are already precreated for all users. This helps to avoid database name conflicts. If you are a new user, the database will be created during your first submission of a Hive assignment. The system won't allow you to create your own database on the Hadoop cluster, so when you submit the final version of the task you should **remove or comment out** all the lines related to dropping and creating the database.

You can leave all the lines with `USE` unchanged. The grading system will replace the database name with the name of the precreated database. In assignments 2 and 3 you'll need to use the `stackoverflow_` database. This database's name will not be changed by the grading system.
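
As a rough pre-submission check, the Python 3 sketch below scans HiveQL text for lines that create or drop a database. The function name `find_database_ddl` and the sample text are hypothetical and are not part of the grading system. The check is deliberately simple: it only looks at the start of each line, so a statement split across lines (`CREATE` on one line, `DATABASE` on the next) would not be detected. Lines commented out with `--` are ignored.

```python
import re

# Matches CREATE/DROP DATABASE (or its HiveQL synonym SCHEMA) at the start
# of a line, ignoring case. A line starting with "--" does not match.
DB_DDL = re.compile(r"^\s*(CREATE|DROP)\s+(DATABASE|SCHEMA)\b", re.IGNORECASE)


def find_database_ddl(hql_text):
    """Return (line_number, line) pairs for lines that create or drop a database."""
    return [
        (number, line.strip())
        for number, line in enumerate(hql_text.splitlines(), start=1)
        if DB_DDL.match(line)
    ]


# Hypothetical file contents: the DROP line is already commented out.
sample = """\
-- DROP DATABASE IF EXISTS demodb CASCADE;
CREATE DATABASE demodb LOCATION '/user/jovyan/demodb';
USE demodb;
"""

for number, line in find_database_ddl(sample):
    print("line %d still touches the database: %s" % (number, line))
```

To check a real file, pass its contents instead of `sample`, for example `find_database_ddl(open("creation_db.hql").read())`.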

### 2. Creating the external table

Suppose our source dataset has 2 columns:
* IP address,
* its subnet mask.

For example:
```
148.45.113.216	255.255.255.248
203.98.141.0	255.255.255.240
183.168.36.0	255.255.255.128
111.157.172.232	255.255.255.248
80.46.87.0	255.255.255.0
247.248.233.0	255.255.255.128
```
Now we'll create the external table with 2 fields: ip and mask.

In [4]:
# %%writefile exteral_table.hql

# ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

# USE demodb;
# DROP TABLE IF EXISTS Subnets;

# CREATE EXTERNAL TABLE Subnets (
#     ip STRING,
#     mask STRING
# )
# ROW FORMAT DELIMITED FIELDS TERMINATED BY  '\t'
# STORED AS TEXTFILE
# LOCATION '/data/subnets/ips';

Writing exteral_table.hql


In [6]:
# ! hive -f exteral_table.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.6-bin/lib/hive-common-2.3.6.jar!/hive-log4j2.properties Async: true
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar]
OK
Time taken: 0.683 seconds
OK
Time taken: 0.848 seconds
OK
Time taken: 0.18 seconds


### 3. Demo query on the created table

Let's write a simple query:
 > Compute the average number of IPs per subnet mask.

In [7]:
%%writefile query.hql

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;
USE demodb;

Writing query.hql


In [8]:
%%writefile -a query.hql

SELECT AVG(counts.cnt)
FROM (
    SELECT mask, count(ip) as cnt
    FROM Subnets
    GROUP BY mask
) counts;

Appending to query.hql
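
To make the intent of the query concrete, here is a plain Python 3 sketch that performs the same two steps on the six sample rows listed in section 2. It is only an illustration, not how Hive executes the query. The names `SAMPLE_LINES` and `average_ips_per_mask` are hypothetical, and the sketch assumes every row has a non-empty ip, so that counting rows matches `count(ip)`.

```python
from collections import Counter

# The six tab-separated sample rows (ip, mask) from the dataset example.
SAMPLE_LINES = [
    "148.45.113.216\t255.255.255.248",
    "203.98.141.0\t255.255.255.240",
    "183.168.36.0\t255.255.255.128",
    "111.157.172.232\t255.255.255.248",
    "80.46.87.0\t255.255.255.0",
    "247.248.233.0\t255.255.255.128",
]


def average_ips_per_mask(lines):
    """Mirror the demo query: count rows per mask, then average the counts."""
    # Inner query: SELECT mask, count(ip) AS cnt FROM Subnets GROUP BY mask
    counts = Counter(line.split("\t")[1] for line in lines)
    if not counts:
        return None  # no groups, so there is nothing to average
    # Outer query: SELECT AVG(counts.cnt)
    return sum(counts.values()) / len(counts)


# 6 rows spread over 4 distinct masks
print(average_ips_per_mask(SAMPLE_LINES))  # prints 1.5
```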


Please take into account that the grading system catches all output (both the result and the MapReduce logs) from the last cell of the notebook, so __don't__ redirect any output from this cell to `/dev/null`.

#### Final notice:

1. Please take into account that you must __not__ redirect __stderr__ anywhere. Hadoop, Hive, and Spark print their logs to stderr, and the Grading system also reads and analyses it.

1. When checking the code from the notebook, the system runs all of the notebook's cells and reads the output of only the last non-empty cell. Therefore, none of the cells may throw an exception when run. If you decide to write some text in a cell, you should change the cell type to Markdown (Cell -> Cell type -> Markdown).

1. The Grader takes into account the output from the sample dataset you have in the notebook. Therefore, you have to "Run All" cells in the notebook before you send the ipynb solution.

1. The name of the notebook must contain only Latin letters, numbers and the characters “-” or “_”. For example, Windows adds something like " (2)" (with the leading space) at the end of a filename if you download a file whose name already exists. This is a problem, because the name will then contain a space character and the parentheses "(" and ")".
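
The naming rule in the last item can be checked before uploading with a short Python 3 sketch like the one below. The function name `is_valid_notebook_name` is hypothetical, and requiring the `.ipynb` extension is an assumption based on the notebook format, not a documented rule of the Grading system.

```python
import re
from pathlib import Path

# Letters A-Z and a-z, digits, "-" and "_" only (no spaces, no parentheses).
ALLOWED_STEM = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_notebook_name(filename):
    """Check the part of the name before ".ipynb" against the allowed characters."""
    path = Path(filename)
    # Assumption: the submitted file is expected to end with ".ipynb".
    return path.suffix == ".ipynb" and ALLOWED_STEM.fullmatch(path.stem) is not None


print(is_valid_notebook_name("hive_task-1.ipynb"))    # True
print(is_valid_notebook_name("hive_task (2).ipynb"))  # False: space and parentheses
```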

In [9]:
! hive -f query.hql


Logging initialized using configuration in jar:file:/usr/local/apache-hive-2.3.6-bin/lib/hive-common-2.3.6.jar!/hive-log4j2.properties Async: true
Added [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar]
OK
Time taken: 0.72 seconds
Query ID = jovyan_20201214183854_af206af2-be00-4d3d-bddd-36d61696e014
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1607970259371_0001, Tracking URL = http://172.17.0.22:8088/proxy/application_1607970259371_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1607970259371_0001
Hadoop jo