
# MSBA 6330 Homework 3 - Hive Data Management and Optimization - Danny Moncada

# Part I. Short Answers

## Hive vs RDBMS - RDBMS advantage

### 1. What are some reasons you wouldn't replace a typical relational database with Hive? Give at least three.

- RDBMS (client-server) has very fast response time, often returning data in milliseconds.
- RDBMS allows for frequent modification of a small number of records - which is helpful when you are working with small data sets and not millions of rows of data.
- RDBMS provides the capability to serve thousands of simultaneous clients; this is extremely useful when you have a tool like Oracle Business Intelligence Enterprise Edition (OBIEE) and you want to provide users with access to a reporting environment across the various campuses at the University.
- RDBMS has very low latency in comparison with Hive; this is a point that cannot be overstated, especially when working daily with SQL development tools and being used to getting results back very quickly.
- RDBMS allows you to update/delete individual records, while in Hive this can only be done in "managed" tables (and only when ACID transactions are enabled on them).

## Hive vs RDBMS - Hive Benefits

### 2. Name at least two benefits that Hive and Hadoop have over typical data warehousing and RDBMS systems.

- Hive is more productive than writing MapReduce directly; in 5 lines of HiveQL you can do the same amount of work as 100 lines of Java.
- Hive brings large scale data analysis to a broader audience, since someone new to the tool can leverage their existing SQL knowledge and doesn't need any previous software development experience.
- Hive offers interoperability with other systems, like JDBC/ODBC & external files; in addition, many Business Intelligence tools like Tableau and Qlikview support Hive.
- Hive supports very large datasets, up to petabytes in size, and storage costs are very low compared to RDBMS.

In a traditional RDBMS, you must follow what is known as __"schema on write"__; this means that you _have_ to create the table with a rigid structure before any data is loaded.  If you try to load the value "Hello world" into a column with the DATE data type, your INSERT query will error out because of the incompatibility between the column type and the data being loaded.  In contrast, you can store data in Hive/HDFS without knowing its format in advance.  Hive only checks the format, including the fields and types of the data, when you read it.  This is known as __"schema on read"__.  It provides far more flexibility when loading data and makes writes faster, since nothing is validated at load time.
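
The contrast can be sketched locally with plain shell tools. This is only an analogy, not Hive itself: the file name `events.csv`, its two columns, the date pattern, and the function `read_with_schema` are all made up for illustration. Writing accepts any line; the "schema" is applied only when the file is read, and a value that does not fit the expected type comes back as NULL, which is also how Hive surfaces values it cannot parse.

```shell
#!/bin/sh
# Analogy only: a plain local file stands in for an HDFS-backed table.
workdir=$(mktemp -d)
data="$workdir/events.csv"   # hypothetical file name

# "Write" step: nothing validates the content, so both lines are stored.
printf '%s\n' '2016-01-01,AA' 'Hello world,DL' > "$data"

# "Read" step: the schema (column 1 must look like YYYY-MM-DD) is applied now.
read_with_schema() {
  # $1 = file to read
  awk -F, '{
    d = ($1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) ? $1 : "NULL"
    print d "," $2
  }' "$1"
}

read_with_schema "$data"
rm -rf "$workdir"
```

The second line is stored without complaint and only shows up as `NULL,DL` when it is read.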

## Hive partition

### 3. How does table partitioning affect Hive query performance? What are some of the best practices in creating Hive partitions?

By default, all the data files for a table are stored in a single directory, and ALL the files in the directory are read during the query execution.  When you are running a query, it doesn't make sense to do a full table scan when you only need a small subset of data or are filtering on certain criteria/columns.  In order to improve query performance, it makes much more sense to partition the data tables so that it is faster to read through and you only pull back the information that is being requested.  What partitioning does is __subdivide__ the data, physically dividing the data during the load based on values in a certain column or columns into _subdirectories_.

This, in turn, speeds up any queries that filter on the partition column because only the files that contain the specified data need to be read.  Queries that filter on partitioned fields limit the _total_ amount of data read.  This is really important for Big Data, where most of the data sets you are working with are in the millions of rows and a long running query could be very costly, both in time and CPU/memory/resources.
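
The subdirectory idea can be imitated on a local file system. The sketch below is an analogy under stated assumptions, not what Hive runs internally: local directories stand in for partition subdirectories, and the names `load_partitioned`, `count_month`, and `part-0.csv` are hypothetical. Rows are split into one `month=N` directory per value, and a query for one month then touches only that directory.

```shell
#!/bin/sh
# Analogy only: local directories stand in for Hive partition subdirectories.
table=$(mktemp -d)           # hypothetical "table" directory

# Split rows of "month,carrier" into one subdirectory per month value.
# The month is encoded in the directory name and not stored in the file.
load_partitioned() {
  # $1 = input CSV, $2 = table directory
  while IFS=, read -r month rest; do
    mkdir -p "$2/month=$month"
    printf '%s\n' "$rest" >> "$2/month=$month/part-0.csv"
  done < "$1"
}

# Count rows for one month by reading only that month's subdirectory.
count_month() {
  # $1 = table directory, $2 = month
  cat "$1/month=$2"/*.csv | wc -l | tr -d ' '
}

input=$(mktemp)
printf '%s\n' '1,AA' '1,DL' '2,UA' '3,WN' '1,AA' > "$input"
load_partitioned "$input" "$table"
ls "$table"              # lists month=1, month=2, month=3
count_month "$table" 1   # prints 3
```

Only the files under `month=1` are opened for the count, which is the effect partition pruning has on a filtered Hive query.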

You should use partitions when:
* A query that reads the entire data set would take too long to run
* Queries almost always filter on the partition column (like date or state)
* A partition column has a reasonable (not too many, not too few) number of different values (like month or zip code)
* A data generation or ETL process already splits the data by file or directory name; e.g. when a batch job uploads new data every month, you would partition by month so that each load creates a new subdirectory and does not overwrite any previously created subdirectories
* The partition column is not part of the data itself

You should NOT use partitions when:
* A column contains too many unique values, like first name or address; this will fragment the data and create too many subdirectories.
* A column generates an excessive number of partitions, because you may end up creating many small partitions that are only a few MB or KB in size, which is not optimal for Hive's performance.

# Part II. Hands on

In [None]:
## 5. Create a directory /home/cloudera/flights on your local machine. Download the data file into this folder using linux 
## utility wget: shell wget http://idsdl.csom.umn.edu/c/share/airport_data4mon.zip

[cloudera@quickstart ~]$ mkdir /home/cloudera/flights

[cloudera@quickstart ~]$ cd flights
[cloudera@quickstart flights]$ pwd
/home/cloudera/flights

[cloudera@quickstart flights]$ wget http://idsdl.csom.umn.edu/c/share/airport_data4mon.zip
--2019-07-01 10:32:09--  http://idsdl.csom.umn.edu/c/share/airport_data4mon.zip
Resolving idsdl.csom.umn.edu... 134.84.138.46, 2607:ea00:101:480a:250:56ff:febb:e76b
Connecting to idsdl.csom.umn.edu|134.84.138.46|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24319838 (23M) [application/zip]
Saving to: “airport_data4mon.zip”

100%[================================================================>] 24,319,838  25.9M/s   in 0.9s

2019-07-01 10:32:10 (25.9 MB/s) - “airport_data4mon.zip” saved [24319838/24319838]
        
## 6. Unzip the data into /home/cloudera/flights/ using unzip [filename].

[cloudera@quickstart flights]$ unzip airport_data4mon.zip
Archive:  airport_data4mon.zip
  inflating: 297004386_T_ONTIME_1.csv
  inflating: 297004386_T_ONTIME_2.csv
  inflating: 297004386_T_ONTIME_3.csv
  inflating: 297004386_T_ONTIME_4.csv

## 7. How many lines are there across the 4 files extracted from the zip file? [Report Answer]

[cloudera@quickstart flights]$ wc -l *.csv
   445828 297004386_T_ONTIME_1.csv
   423890 297004386_T_ONTIME_2.csv
   479123 297004386_T_ONTIME_3.csv
   461631 297004386_T_ONTIME_4.csv
  1810472 total
    
## There are 1,810,472 lines across the 4 files.

## 8. Using HDFS's put method to copy the 4 files into the HDFS directory /user/cloudera/flights/ (create it if needed).

[cloudera@quickstart flights]$ hadoop fs -mkdir /user/cloudera/flights

[cloudera@quickstart flights]$ hadoop fs -ls /user/cloudera/
drwxr-xr-x   - cloudera cloudera          0 2019-07-01 10:45 /user/cloudera/flights

## 9. Verify that all the files have been copied to /user/cloudera/flights/ by listing file names and sizes in the directory 
## [Report Result]

[cloudera@quickstart flights]$ hadoop fs -put /home/cloudera/flights/*.csv /user/cloudera/flights

[cloudera@quickstart flights]$ hadoop fs -ls /user/cloudera/flights
Found 4 items
-rw-r--r--   1 cloudera cloudera   54599300 2019-07-01 10:49 /user/cloudera/flights/297004386_T_ONTIME_1.csv
-rw-r--r--   1 cloudera cloudera   51994206 2019-07-01 10:49 /user/cloudera/flights/297004386_T_ONTIME_2.csv
-rw-r--r--   1 cloudera cloudera   58775991 2019-07-01 10:49 /user/cloudera/flights/297004386_T_ONTIME_3.csv
-rw-r--r--   1 cloudera cloudera   56643780 2019-07-01 10:49 /user/cloudera/flights/297004386_T_ONTIME_4.csv

## 10. Our flight data contains 4 different files, each around 60 MB in size. These are smaller than a block size, so perhaps 
## it would be beneficial to create one large file in HDFS by combining the contents of these 4 files. First delete the files 
## in the /user/cloudera/flights/ directory.

[cloudera@quickstart flights]$ hadoop fs -rm -r /user/cloudera/flights/*
Deleted /user/cloudera/flights/297004386_T_ONTIME_1.csv
Deleted /user/cloudera/flights/297004386_T_ONTIME_2.csv
Deleted /user/cloudera/flights/297004386_T_ONTIME_3.csv
Deleted /user/cloudera/flights/297004386_T_ONTIME_4.csv

## 11. Then upload 4 files on your local machine as a single file /user/cloudera/flights/all_flights.csv on HDFS 
## (hint: consider using cat & pipe).

## I spent about 40 minutes trying to get cat & pipe to work but kept getting permission denied errors, so I used getmerge and put instead.

[cloudera@quickstart flights]$ hadoop fs -getmerge /user/cloudera/flights/*.csv /home/cloudera/flights/all_flights.csv

[cloudera@quickstart flights]$ hadoop fs -put /home/cloudera/flights/all_flights.csv /user/cloudera/flights/all_flights.csv
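
## For reference, a cat & pipe variant might look like the sketch below. It was not run in this session, so treat it as
## untested here. It relies on "hadoop fs -put -" reading its input from standard input. The names upload_merged and
## HDFS_PUT are hypothetical and exist only so the pipeline can be dry-run without a cluster.

```shell
# Concatenate every CSV in a local directory and stream the result to one HDFS file.
upload_merged() {
  # $1 = local directory holding the CSVs, $2 = destination path on HDFS
  # HDFS_PUT (hypothetical override) defaults to the real upload command.
  cat "$1"/*.csv | ${HDFS_PUT:-hadoop fs -put -} "$2"
}

# Intended usage on the VM (assumes /user/cloudera/flights already exists):
# upload_merged /home/cloudera/flights /user/cloudera/flights/all_flights.csv
```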

## 12. Verify the pooled file on HDFS by listing the content of the directory /user/cloudera/flights/. [Report Result]

[cloudera@quickstart flights]$ hadoop fs -ls /user/cloudera/flights
Found 1 items
-rw-r--r--   1 cloudera cloudera  222013277 2019-07-01 12:00 /user/cloudera/flights/all_flights.csv

## 13. Verify that the pooled file has the same number of lines as the original in Q7. [Report Result]

[cloudera@quickstart flights]$ hadoop fs -cat /user/cloudera/flights/* | wc -l
1810472

## Same line count as the original.

## 14. Create a database called flights.

[cloudera@quickstart flights]$ beeline -u jdbc:hive2://
scan complete in 9ms
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 1.1.0-cdh5.10.0)
Driver: Hive JDBC (version 1.1.0-cdh5.10.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.10.0 by Apache Hive

0: jdbc:hive2://> CREATE DATABASE flights;
OK
No rows affected (3.942 seconds)

## 15. List the databases in Hive [Report Result]

0: jdbc:hive2://> SHOW DATABASES;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| dualcore       |
| flights        |
+----------------+--+
3 rows selected (0.792 seconds)

## 16. Create a simple external table flight_data_raw based on the data the flights directory with the following fields: 
## sql YEAR STRING, MONTH STRING, DAY_OF_MONTH STRING, FL_DATE STRING, UNIQUE_CARRIER STRING, AIRLINE_ID STRING, CARRIER STRING,
## TAIL_NUM STRING, FL_NUM STRING, ORIGIN_AIRPORT_ID STRING, ORIGIN_AIRPORT_SEQ_ID STRING, ORIGIN STRING, DEST_AIRPORT_ID 
## STRING, DEST_AIRPORT_SEQ_ID STRING, DEST STRING, DEP_DELAY STRING, ARR_DELAY STRING, CANCELLED STRING, DIVERTED STRING, 
## DISTANCE STRING
    
0: jdbc:hive2://> CREATE EXTERNAL TABLE flight_data_raw (YEAR STRING, MONTH STRING, DAY_OF_MONTH STRING, FL_DATE STRING, UNIQUE_CARRIER STRING, AIRLINE_ID STRING, 
                                                         CARRIER STRING, TAIL_NUM STRING, FL_NUM STRING, ORIGIN_AIRPORT_ID STRING, ORIGIN_AIRPORT_SEQ_ID STRING, 
                                                         ORIGIN STRING, DEST_AIRPORT_ID STRING, DEST_AIRPORT_SEQ_ID STRING, DEST STRING, DEP_DELAY STRING, 
                                                         ARR_DELAY STRING, CANCELLED STRING, DIVERTED STRING, DISTANCE STRING) 
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
            LINES TERMINATED BY '\n' 
            LOCATION '/user/cloudera/flights/' 
            TBLPROPERTIES("skip.header.line.count"="1");
OK
No rows affected (0.121 seconds)

## 17. Query flight_data_raw to obtain the number of rows in the table [Report Result & Answer]: - How long did the query take 
## to run (report Time Taken displayed at the end of the query log)?

0: jdbc:hive2://> SELECT COUNT(YEAR) FROM flight_data_raw;

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 17.6 sec   HDFS Read: 222026153 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 17 seconds 600 msec
OK
+----------+--+
|   _c0    |
+----------+--+
| 1810471  |
+----------+--+
1 row selected (70.604 seconds)

# There are 1,810,471 rows.
# It took 70.604 seconds to run.

## 18. Query the first 3 rows, and display only YEAR, CARRIER, TAIL_NUM, FL_NUM. [Report Result]

0: jdbc:hive2://> SELECT YEAR, CARRIER, TAIL_NUM, FL_NUM FROM flight_data_raw LIMIT 3;
OK
19/07/02 12:26:17 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------+----------+-----------+---------+--+
| year  | carrier  | tail_num  | fl_num  |
+-------+----------+-----------+---------+--+
| 2016  | "AA"     | "N4YBAA"  | "43"    |
| 2016  | "AA"     | "N434AA"  | "43"    |
| 2016  | "AA"     | "N541AA"  | "43"    |
+-------+----------+-----------+---------+--+
3 rows selected (0.117 seconds)

## 19. There are two issues with the data returned. What are they? [Report Answer]

## The first issue is that the first row of data is simply the column names of the CSV file - I handled this in the table definition by setting skip.header.line.count to 1.
## The second issue is that the string values are still wrapped in double quotes.

## 21. Create a new EXTERNAL table called flight_data_csv using the same data location and all the same columns as the 
## flight_data_raw, except set the data types to be appropriate types (choose from STRING, INT, TINYINT, FLOAT).

0: jdbc:hive2://> CREATE EXTERNAL TABLE flight_data_csv (YEAR INT, MONTH TINYINT, DAY_OF_MONTH TINYINT, FL_DATE STRING, UNIQUE_CARRIER STRING, AIRLINE_ID INT, 
                                                         CARRIER STRING, TAIL_NUM STRING, FL_NUM STRING, 
                                                         ORIGIN_AIRPORT_ID INT, ORIGIN_AIRPORT_SEQ_ID INT, ORIGIN STRING, DEST_AIRPORT_ID INT, DEST_AIRPORT_SEQ_ID INT, 
                                                         DEST STRING, DEP_DELAY FLOAT, ARR_DELAY FLOAT, 
                                                         CANCELLED FLOAT, DIVERTED FLOAT, DISTANCE FLOAT) 
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
            WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"") 
            LOCATION '/user/cloudera/flights/' 
            TBLPROPERTIES("skip.header.line.count"="1");
OK
No rows affected (0.175 seconds)

## 22. Query the first 3 rows and limit the query to YEAR, CARRIER, TAIL_NUM, FL_NUM. What is the difference in the output 
## compared to the raw table? [Report Result & Answer]

0: jdbc:hive2://> DESCRIBE flight_data_csv;
OK
+------------------------+------------+--------------------+--+
|        col_name        | data_type  |      comment       |
+------------------------+------------+--------------------+--+
| year                   | string     | from deserializer  |
| month                  | string     | from deserializer  |
| day_of_month           | string     | from deserializer  |
| fl_date                | string     | from deserializer  |
| unique_carrier         | string     | from deserializer  |
| airline_id             | string     | from deserializer  |
| carrier                | string     | from deserializer  |
| tail_num               | string     | from deserializer  |
| fl_num                 | string     | from deserializer  |
| origin_airport_id      | string     | from deserializer  |
| origin_airport_seq_id  | string     | from deserializer  |
| origin                 | string     | from deserializer  |
| dest_airport_id        | string     | from deserializer  |
| dest_airport_seq_id    | string     | from deserializer  |
| dest                   | string     | from deserializer  |
| dep_delay              | string     | from deserializer  |
| arr_delay              | string     | from deserializer  |
| cancelled              | string     | from deserializer  |
| diverted               | string     | from deserializer  |
| distance               | string     | from deserializer  |
+------------------------+------------+--------------------+--+
20 rows selected (0.127 seconds)

0: jdbc:hive2://> SELECT YEAR, CARRIER, TAIL_NUM, FL_NUM FROM flight_data_csv LIMIT 3;
OK
+-------+----------+-----------+---------+--+
| year  | carrier  | tail_num  | fl_num  |
+-------+----------+-----------+---------+--+
| 2016  | AA       | N4YBAA    | 43      |
| 2016  | AA       | N434AA    | 43      |
| 2016  | AA       | N541AA    | 43      |
+-------+----------+-----------+---------+--+
3 rows selected (0.111 seconds)

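## The quotes that surrounded the string values in the raw table are gone. Note that DESCRIBE reports every column as string, because OpenCSVSerde treats all columns as STRING regardless of the declared types.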

## 23.  Determine the number of rows that don't have a year in the YEAR column, rather have the text YEAR. Why do you think this happened? [Report Result & Answer].

0: jdbc:hive2://> SELECT COUNT(*) FROM flight_data_csv WHERE YEAR = 'YEAR';
MapReduce Total cumulative CPU time: 44 seconds 420 msec
Ended Job = job_1561941999858_0015
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 44.42 sec   HDFS Read: 222026755 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 44 seconds 420 msec
OK
+------+--+
| _c0  |
+------+--+
| 3    |
+------+--+
1 row selected (97.248 seconds)

0: jdbc:hive2://> SELECT * FROM flight_data_csv WHERE YEAR = 'YEAR';
OK
+-----------------------+------------------------+-------------------------------+--------------------------+---------------------------------+-----------------------------+--------------------------+---------------------------+-------------------------+------------------------------------+----------------------------------------+-------------------------+----------------------------------+--------------------------------------+-----------------------+----------------------------+----------------------------+----------------------------+---------------------------+---------------------------+--+
| flight_data_csv.year  | flight_data_csv.month  | flight_data_csv.day_of_month  | flight_data_csv.fl_date  | flight_data_csv.unique_carrier  | flight_data_csv.airline_id  | flight_data_csv.carrier  | flight_data_csv.tail_num  | flight_data_csv.fl_num  | flight_data_csv.origin_airport_id  | flight_data_csv.origin_airport_seq_id  | flight_data_csv.origin  | flight_data_csv.dest_airport_id  | flight_data_csv.dest_airport_seq_id  | flight_data_csv.dest  | flight_data_csv.dep_delay  | flight_data_csv.arr_delay  | flight_data_csv.cancelled  | flight_data_csv.diverted  | flight_data_csv.distance  |
+-----------------------+------------------------+-------------------------------+--------------------------+---------------------------------+-----------------------------+--------------------------+---------------------------+-------------------------+------------------------------------+----------------------------------------+-------------------------+----------------------------------+--------------------------------------+-----------------------+----------------------------+----------------------------+----------------------------+---------------------------+---------------------------+--+
| YEAR                  | MONTH                  | DAY_OF_MONTH                  | FL_DATE                  | UNIQUE_CARRIER                  | AIRLINE_ID                  | CARRIER                  | TAIL_NUM                  | FL_NUM                  | ORIGIN_AIRPORT_ID                  | ORIGIN_AIRPORT_SEQ_ID                  | ORIGIN                  | DEST_AIRPORT_ID                  | DEST_AIRPORT_SEQ_ID                  | DEST                  | DEP_DELAY                  | ARR_DELAY                  | CANCELLED                  | DIVERTED                  | DISTANCE                  |
| YEAR                  | MONTH                  | DAY_OF_MONTH                  | FL_DATE                  | UNIQUE_CARRIER                  | AIRLINE_ID                  | CARRIER                  | TAIL_NUM                  | FL_NUM                  | ORIGIN_AIRPORT_ID                  | ORIGIN_AIRPORT_SEQ_ID                  | ORIGIN                  | DEST_AIRPORT_ID                  | DEST_AIRPORT_SEQ_ID                  | DEST                  | DEP_DELAY                  | ARR_DELAY                  | CANCELLED                  | DIVERTED                  | DISTANCE                  |
| YEAR                  | MONTH                  | DAY_OF_MONTH                  | FL_DATE                  | UNIQUE_CARRIER                  | AIRLINE_ID                  | CARRIER                  | TAIL_NUM                  | FL_NUM                  | ORIGIN_AIRPORT_ID                  | ORIGIN_AIRPORT_SEQ_ID                  | ORIGIN                  | DEST_AIRPORT_ID                  | DEST_AIRPORT_SEQ_ID                  | DEST                  | DEP_DELAY                  | ARR_DELAY                  | CANCELLED                  | DIVERTED                  | DISTANCE                  |
+-----------------------+------------------------+-------------------------------+--------------------------+---------------------------------+-----------------------------+--------------------------+---------------------------+-------------------------+------------------------------------+----------------------------------------+-------------------------+----------------------------------+--------------------------------------+-----------------------+----------------------------+----------------------------+----------------------------+---------------------------+---------------------------+--+

## 3 rows have the text YEAR in the YEAR column instead of a year.  These are the header rows of three of the four CSVs that were concatenated into all_flights.csv earlier.  Only the first header row is skipped,
## because skip.header.line.count = 1 applies once per file and the merged data is a single file.
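
## One way to avoid the stray header rows would be to drop the repeated headers while merging. The sketch below is an
## assumption about how that could be done and was not part of this homework run; merge_csv_single_header is a
## hypothetical helper name.

```shell
# Concatenate CSV files, keeping the header line of the first file only.
merge_csv_single_header() {
  # NR == 1 keeps the very first line overall (the first file's header);
  # FNR > 1 keeps every non-header line of each file.
  awk 'NR == 1 || FNR > 1' "$@"
}

# Intended usage on the VM:
# merge_csv_single_header /home/cloudera/flights/297004386_T_ONTIME_*.csv > all_flights.csv
```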

## 24. Create a Hive Managed table flight_data_parquet with parquet storage format and partitioned by MONTH and DAY_OF_MONTH.

0: jdbc:hive2://> CREATE TABLE flight_data_parquet (YEAR INT, FL_DATE STRING, UNIQUE_CARRIER STRING, AIRLINE_ID INT, CARRIER STRING, TAIL_NUM STRING, FL_NUM INT, ORIGIN_AIRPORT_ID INT, ORIGIN_AIRPORT_SEQ_ID INT, 
                                                    ORIGIN STRING, DEST_AIRPORT_ID INT, DEST_AIRPORT_SEQ_ID INT, DEST STRING, DEP_DELAY FLOAT, ARR_DELAY FLOAT, CANCELLED FLOAT, DIVERTED FLOAT, DISTANCE FLOAT) 
            PARTITIONED BY (MONTH TINYINT, DAY_OF_MONTH TINYINT) STORED AS PARQUET;
OK
No rows affected (0.151 seconds)

## 25. Using the above example, load the data for January.

0: jdbc:hive2://> INSERT INTO TABLE flights.flight_data_parquet PARTITION(MONTH = 1, DAY_OF_MONTH) SELECT F.YEAR, F.FL_DATE, 
            F.UNIQUE_CARRIER, F.AIRLINE_ID, F.CARRIER, 
            F.TAIL_NUM, F.FL_NUM, F.ORIGIN_AIRPORT_ID, 
            F.ORIGIN_AIRPORT_SEQ_ID,  F.ORIGIN, F.DEST_AIRPORT_ID, F.DEST_AIRPORT_SEQ_ID, F.DEST, F.DEP_DELAY, F.ARR_DELAY, 
            F.CANCELLED, F.DIVERTED, F.DISTANCE, 
            F.DAY_OF_MONTH FROM flights.flight_data_csv F 
            WHERE F.YEAR != 'YEAR' AND F.MONTH = 1;

19/07/02 18:06:19 [main]: WARN parse.BaseSemanticAnalyzer: Dynamic partitioning is used; only validating 1 columns
Query ID = cloudera_20190702180606_d316cfef-0043-481b-9066-2639d3c3a80a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
19/07/02 18:06:20 [HiveServer2-Background-Pool: Thread-472]: WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1561941999858_0016, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1561941999858_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1561941999858_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
19/07/02 18:06:44 [HiveServer2-Background-Pool: Thread-472]: WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2019-07-02 18:06:44,615 Stage-1 map = 0%,  reduce = 0%
2019-07-02 18:07:45,684 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 45.61 sec
2019-07-02 18:07:54,811 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 56.52 sec
2019-07-02 18:08:06,344 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 65.15 sec
MapReduce Total cumulative CPU time: 1 minutes 8 seconds 600 msec
Ended Job = job_1561941999858_0016
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/flights.db/flight_data_parquet/month=1/.hive-staging_hive_2019-07-02_18-06-19_654_2849159941001265115-1/-ext-10000
Loading data to table flights.flight_data_parquet partition (month=1, day_of_month=null)
         Time taken for load dynamic partitions : 4288
        Loading partition {month=1, day_of_month=17}
        Loading partition {month=1, day_of_month=20}
        Loading partition {month=1, day_of_month=21}
        Loading partition {month=1, day_of_month=1}
        Loading partition {month=1, day_of_month=2}
        Loading partition {month=1, day_of_month=3}
        Loading partition {month=1, day_of_month=4}
        Loading partition {month=1, day_of_month=24}
        Loading partition {month=1, day_of_month=25}
        Loading partition {month=1, day_of_month=22}
        Loading partition {month=1, day_of_month=23}
        Loading partition {month=1, day_of_month=26}
        Loading partition {month=1, day_of_month=27}
        Loading partition {month=1, day_of_month=28}
        Loading partition {month=1, day_of_month=31}
        Loading partition {month=1, day_of_month=29}
        Loading partition {month=1, day_of_month=9}
        Loading partition {month=1, day_of_month=7}
        Loading partition {month=1, day_of_month=8}
        Loading partition {month=1, day_of_month=6}
        Loading partition {month=1, day_of_month=5}
        Loading partition {month=1, day_of_month=30}
        Loading partition {month=1, day_of_month=11}
        Loading partition {month=1, day_of_month=12}
        Loading partition {month=1, day_of_month=10}
        Loading partition {month=1, day_of_month=14}
        Loading partition {month=1, day_of_month=19}
        Loading partition {month=1, day_of_month=13}
        Loading partition {month=1, day_of_month=18}
        Loading partition {month=1, day_of_month=16}
        Loading partition {month=1, day_of_month=15}
         Time taken for adding to write entity : 13
Partition flights.flight_data_parquet{month=1, day_of_month=1} stats: [numFiles=1, numRows=13019, totalSize=241978, rawDataSize=234342]
Partition flights.flight_data_parquet{month=1, day_of_month=10} stats: [numFiles=1, numRows=13988, totalSize=256251, rawDataSize=251784]
Partition flights.flight_data_parquet{month=1, day_of_month=11} stats: [numFiles=1, numRows=15174, totalSize=269653, rawDataSize=273132]
Partition flights.flight_data_parquet{month=1, day_of_month=12} stats: [numFiles=1, numRows=14566, totalSize=260199, rawDataSize=262188]
Partition flights.flight_data_parquet{month=1, day_of_month=13} stats: [numFiles=1, numRows=14800, totalSize=263475, rawDataSize=266400]
Partition flights.flight_data_parquet{month=1, day_of_month=14} stats: [numFiles=1, numRows=15295, totalSize=271127, rawDataSize=275310]
Partition flights.flight_data_parquet{month=1, day_of_month=15} stats: [numFiles=1, numRows=15308, totalSize=271435, rawDataSize=275544]
Partition flights.flight_data_parquet{month=1, day_of_month=16} stats: [numFiles=1, numRows=11563, totalSize=220315, rawDataSize=208134]
Partition flights.flight_data_parquet{month=1, day_of_month=17} stats: [numFiles=1, numRows=12970, totalSize=239354, rawDataSize=233460]
Partition flights.flight_data_parquet{month=1, day_of_month=18} stats: [numFiles=1, numRows=15107, totalSize=268260, rawDataSize=271926]
Partition flights.flight_data_parquet{month=1, day_of_month=19} stats: [numFiles=1, numRows=14580, totalSize=260326, rawDataSize=262440]
Partition flights.flight_data_parquet{month=1, day_of_month=2} stats: [numFiles=1, numRows=14869, totalSize=267443, rawDataSize=267642]
Partition flights.flight_data_parquet{month=1, day_of_month=20} stats: [numFiles=1, numRows=14790, totalSize=261209, rawDataSize=266220]
Partition flights.flight_data_parquet{month=1, day_of_month=21} stats: [numFiles=1, numRows=15285, totalSize=269271, rawDataSize=275130]
Partition flights.flight_data_parquet{month=1, day_of_month=22} stats: [numFiles=1, numRows=15290, totalSize=269669, rawDataSize=275220]
Partition flights.flight_data_parquet{month=1, day_of_month=23} stats: [numFiles=1, numRows=11732, totalSize=216583, rawDataSize=211176]
Partition flights.flight_data_parquet{month=1, day_of_month=24} stats: [numFiles=1, numRows=14001, totalSize=250626, rawDataSize=252018]
Partition flights.flight_data_parquet{month=1, day_of_month=25} stats: [numFiles=1, numRows=15177, totalSize=269490, rawDataSize=273186]
Partition flights.flight_data_parquet{month=1, day_of_month=26} stats: [numFiles=1, numRows=14545, totalSize=255909, rawDataSize=261810]
Partition flights.flight_data_parquet{month=1, day_of_month=27} stats: [numFiles=1, numRows=14763, totalSize=259035, rawDataSize=265734]
Partition flights.flight_data_parquet{month=1, day_of_month=28} stats: [numFiles=1, numRows=15271, totalSize=268392, rawDataSize=274878]
Partition flights.flight_data_parquet{month=1, day_of_month=29} stats: [numFiles=1, numRows=15293, totalSize=268974, rawDataSize=275274]
Partition flights.flight_data_parquet{month=1, day_of_month=3} stats: [numFiles=1, numRows=15878, totalSize=280991, rawDataSize=285804]
Partition flights.flight_data_parquet{month=1, day_of_month=30} stats: [numFiles=1, numRows=11699, totalSize=218693, rawDataSize=210582]
Partition flights.flight_data_parquet{month=1, day_of_month=31} stats: [numFiles=1, numRows=13817, totalSize=250644, rawDataSize=248706]
Partition flights.flight_data_parquet{month=1, day_of_month=4} stats: [numFiles=1, numRows=15570, totalSize=275065, rawDataSize=280260]
Partition flights.flight_data_parquet{month=1, day_of_month=5} stats: [numFiles=1, numRows=14582, totalSize=261423, rawDataSize=262476]
Partition flights.flight_data_parquet{month=1, day_of_month=6} stats: [numFiles=1, numRows=14683, totalSize=262763, rawDataSize=264294]
Partition flights.flight_data_parquet{month=1, day_of_month=7} stats: [numFiles=1, numRows=15193, totalSize=270452, rawDataSize=273474]
Partition flights.flight_data_parquet{month=1, day_of_month=8} stats: [numFiles=1, numRows=15228, totalSize=271282, rawDataSize=274104]
Partition flights.flight_data_parquet{month=1, day_of_month=9} stats: [numFiles=1, numRows=11791, totalSize=222267, rawDataSize=212238]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 68.6 sec   HDFS Read: 222025083 HDFS Write: 7994969 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 8 seconds 600 msec
OK
No rows affected (121.498 seconds)

## 26. Truncate the table using TRUNCATE TABLE tablename. Tip: If for some reason your data loading is unsuccessful, you need to truncate the table to restart.

0: jdbc:hive2://> TRUNCATE TABLE flight_data_parquet;
OK
No rows affected (6.335 seconds)

## 27. Loading the 4 partitions one by one will be a slightly tedious process because the query to select from flight_data_csv is doing a full table scan across all 5M records for every insert statement you run. 
## Instead of manually executing one statement at a time, create a file called load_parquet.sql that contains similarly formulated data loading commands for all 4 partitions (4 months). Then execute the commands as a batch. 
## This query may take a while to finish as each insert may take a few minutes on our VM. [Report Script load_parquet.sql at the end of your report]

[cloudera@quickstart ~]$ hive -f /home/cloudera/load_parquet.sql

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
Query ID = cloudera_20190702183939_f7ce2c48-6450-4b72-b380-a0fb6ee42542
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1561941999858_0017, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1561941999858_0017/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1561941999858_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-02 18:40:09,770 Stage-1 map = 0%,  reduce = 0%
2019-07-02 18:41:05,451 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 51.32 sec
2019-07-02 18:41:16,762 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 65.44 sec
MapReduce Total cumulative CPU time: 1 minutes 5 seconds 900 msec
Ended Job = job_1561941999858_0017
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/flights.db/flight_data_parquet/month=1/.hive-staging_hive_2019-07-02_18-39-41_045_154619857942660049-1/-ext-10000
Loading data to table flights.flight_data_parquet partition (month=1, day_of_month=null)
         Time taken for load dynamic partitions : 6970
        Loading partition {month=1, day_of_month=14}
        Loading partition {month=1, day_of_month=5}
        Loading partition {month=1, day_of_month=15}
        Loading partition {month=1, day_of_month=4}
        Loading partition {month=1, day_of_month=3}
        Loading partition {month=1, day_of_month=12}
        Loading partition {month=1, day_of_month=13}
        Loading partition {month=1, day_of_month=2}
        Loading partition {month=1, day_of_month=7}
        Loading partition {month=1, day_of_month=6}
        Loading partition {month=1, day_of_month=20}
        Loading partition {month=1, day_of_month=8}
        Loading partition {month=1, day_of_month=9}
        Loading partition {month=1, day_of_month=21}
        Loading partition {month=1, day_of_month=22}
        Loading partition {month=1, day_of_month=1}
        Loading partition {month=1, day_of_month=28}
        Loading partition {month=1, day_of_month=25}
        Loading partition {month=1, day_of_month=26}
        Loading partition {month=1, day_of_month=29}
        Loading partition {month=1, day_of_month=27}
        Loading partition {month=1, day_of_month=24}
        Loading partition {month=1, day_of_month=31}
        Loading partition {month=1, day_of_month=23}
        Loading partition {month=1, day_of_month=30}
        Loading partition {month=1, day_of_month=10}
        Loading partition {month=1, day_of_month=11}
        Loading partition {month=1, day_of_month=17}
        Loading partition {month=1, day_of_month=16}
        Loading partition {month=1, day_of_month=18}
        Loading partition {month=1, day_of_month=19}
         Time taken for adding to write entity : 80
Partition flights.flight_data_parquet{month=1, day_of_month=1} stats: [numFiles=1, numRows=13019, totalSize=241978, rawDataSize=234342]
Partition flights.flight_data_parquet{month=1, day_of_month=10} stats: [numFiles=1, numRows=13988, totalSize=256251, rawDataSize=251784]
Partition flights.flight_data_parquet{month=1, day_of_month=11} stats: [numFiles=1, numRows=15174, totalSize=269653, rawDataSize=273132]
Partition flights.flight_data_parquet{month=1, day_of_month=12} stats: [numFiles=1, numRows=14566, totalSize=260199, rawDataSize=262188]
Partition flights.flight_data_parquet{month=1, day_of_month=13} stats: [numFiles=1, numRows=14800, totalSize=263475, rawDataSize=266400]
Partition flights.flight_data_parquet{month=1, day_of_month=14} stats: [numFiles=1, numRows=15295, totalSize=271127, rawDataSize=275310]
Partition flights.flight_data_parquet{month=1, day_of_month=15} stats: [numFiles=1, numRows=15308, totalSize=271435, rawDataSize=275544]
Partition flights.flight_data_parquet{month=1, day_of_month=16} stats: [numFiles=1, numRows=11563, totalSize=220315, rawDataSize=208134]
Partition flights.flight_data_parquet{month=1, day_of_month=17} stats: [numFiles=1, numRows=12970, totalSize=239354, rawDataSize=233460]
Partition flights.flight_data_parquet{month=1, day_of_month=18} stats: [numFiles=1, numRows=15107, totalSize=268260, rawDataSize=271926]
Partition flights.flight_data_parquet{month=1, day_of_month=19} stats: [numFiles=1, numRows=14580, totalSize=260326, rawDataSize=262440]
Partition flights.flight_data_parquet{month=1, day_of_month=2} stats: [numFiles=1, numRows=14869, totalSize=267443, rawDataSize=267642]
Partition flights.flight_data_parquet{month=1, day_of_month=20} stats: [numFiles=1, numRows=14790, totalSize=261209, rawDataSize=266220]
Partition flights.flight_data_parquet{month=1, day_of_month=21} stats: [numFiles=1, numRows=15285, totalSize=269271, rawDataSize=275130]
Partition flights.flight_data_parquet{month=1, day_of_month=22} stats: [numFiles=1, numRows=15290, totalSize=269669, rawDataSize=275220]
Partition flights.flight_data_parquet{month=1, day_of_month=23} stats: [numFiles=1, numRows=11732, totalSize=216583, rawDataSize=211176]
Partition flights.flight_data_parquet{month=1, day_of_month=24} stats: [numFiles=1, numRows=14001, totalSize=250626, rawDataSize=252018]
Partition flights.flight_data_parquet{month=1, day_of_month=25} stats: [numFiles=1, numRows=15177, totalSize=269490, rawDataSize=273186]
Partition flights.flight_data_parquet{month=1, day_of_month=26} stats: [numFiles=1, numRows=14545, totalSize=255909, rawDataSize=261810]
Partition flights.flight_data_parquet{month=1, day_of_month=27} stats: [numFiles=1, numRows=14763, totalSize=259035, rawDataSize=265734]
Partition flights.flight_data_parquet{month=1, day_of_month=28} stats: [numFiles=1, numRows=15271, totalSize=268392, rawDataSize=274878]
Partition flights.flight_data_parquet{month=1, day_of_month=29} stats: [numFiles=1, numRows=15293, totalSize=268974, rawDataSize=275274]
Partition flights.flight_data_parquet{month=1, day_of_month=3} stats: [numFiles=1, numRows=15878, totalSize=280991, rawDataSize=285804]
Partition flights.flight_data_parquet{month=1, day_of_month=30} stats: [numFiles=1, numRows=11699, totalSize=218693, rawDataSize=210582]
Partition flights.flight_data_parquet{month=1, day_of_month=31} stats: [numFiles=1, numRows=13817, totalSize=250644, rawDataSize=248706]
Partition flights.flight_data_parquet{month=1, day_of_month=4} stats: [numFiles=1, numRows=15570, totalSize=275065, rawDataSize=280260]
Partition flights.flight_data_parquet{month=1, day_of_month=5} stats: [numFiles=1, numRows=14582, totalSize=261423, rawDataSize=262476]
Partition flights.flight_data_parquet{month=1, day_of_month=6} stats: [numFiles=1, numRows=14683, totalSize=262763, rawDataSize=264294]
Partition flights.flight_data_parquet{month=1, day_of_month=7} stats: [numFiles=1, numRows=15193, totalSize=270452, rawDataSize=273474]
Partition flights.flight_data_parquet{month=1, day_of_month=8} stats: [numFiles=1, numRows=15228, totalSize=271282, rawDataSize=274104]
Partition flights.flight_data_parquet{month=1, day_of_month=9} stats: [numFiles=1, numRows=11791, totalSize=222267, rawDataSize=212238]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 65.9 sec   HDFS Read: 222025070 HDFS Write: 7994969 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 5 seconds 900 msec
OK
Time taken: 108.767 seconds
Query ID = cloudera_20190702184141_18aa79e0-1bc4-43af-a5ce-b545cf6d19d6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1561941999858_0018, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1561941999858_0018/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1561941999858_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-02 18:41:48,228 Stage-1 map = 0%,  reduce = 0%
2019-07-02 18:42:42,035 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 52.59 sec
2019-07-02 18:42:54,234 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 66.75 sec
MapReduce Total cumulative CPU time: 1 minutes 6 seconds 750 msec
Ended Job = job_1561941999858_0018
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/flights.db/flight_data_parquet/month=2/.hive-staging_hive_2019-07-02_18-41-29_669_504214137251046844-1/-ext-10000
Loading data to table flights.flight_data_parquet partition (month=2, day_of_month=null)
         Time taken for load dynamic partitions : 3860
        Loading partition {month=2, day_of_month=27}
        Loading partition {month=2, day_of_month=3}
        Loading partition {month=2, day_of_month=2}
        Loading partition {month=2, day_of_month=28}
        Loading partition {month=2, day_of_month=26}
        Loading partition {month=2, day_of_month=25}
        Loading partition {month=2, day_of_month=24}
        Loading partition {month=2, day_of_month=1}
        Loading partition {month=2, day_of_month=4}
        Loading partition {month=2, day_of_month=7}
        Loading partition {month=2, day_of_month=6}
        Loading partition {month=2, day_of_month=5}
        Loading partition {month=2, day_of_month=21}
        Loading partition {month=2, day_of_month=8}
        Loading partition {month=2, day_of_month=22}
        Loading partition {month=2, day_of_month=9}
        Loading partition {month=2, day_of_month=20}
        Loading partition {month=2, day_of_month=23}
        Loading partition {month=2, day_of_month=29}
        Loading partition {month=2, day_of_month=14}
        Loading partition {month=2, day_of_month=13}
        Loading partition {month=2, day_of_month=10}
        Loading partition {month=2, day_of_month=11}
        Loading partition {month=2, day_of_month=12}
        Loading partition {month=2, day_of_month=18}
        Loading partition {month=2, day_of_month=15}
        Loading partition {month=2, day_of_month=16}
        Loading partition {month=2, day_of_month=19}
        Loading partition {month=2, day_of_month=17}
         Time taken for adding to write entity : 43
Partition flights.flight_data_parquet{month=2, day_of_month=1} stats: [numFiles=1, numRows=15202, totalSize=265965, rawDataSize=273636]
Partition flights.flight_data_parquet{month=2, day_of_month=10} stats: [numFiles=1, numRows=14879, totalSize=262440, rawDataSize=267822]
Partition flights.flight_data_parquet{month=2, day_of_month=11} stats: [numFiles=1, numRows=15508, totalSize=274463, rawDataSize=279144]
Partition flights.flight_data_parquet{month=2, day_of_month=12} stats: [numFiles=1, numRows=15603, totalSize=275593, rawDataSize=280854]
Partition flights.flight_data_parquet{month=2, day_of_month=13} stats: [numFiles=1, numRows=12038, totalSize=225383, rawDataSize=216684]
Partition flights.flight_data_parquet{month=2, day_of_month=14} stats: [numFiles=1, numRows=13508, totalSize=249218, rawDataSize=243144]
Partition flights.flight_data_parquet{month=2, day_of_month=15} stats: [numFiles=1, numRows=15459, totalSize=276061, rawDataSize=278262]
Partition flights.flight_data_parquet{month=2, day_of_month=16} stats: [numFiles=1, numRows=15288, totalSize=273264, rawDataSize=275184]
Partition flights.flight_data_parquet{month=2, day_of_month=17} stats: [numFiles=1, numRows=15326, totalSize=271823, rawDataSize=275868]
Partition flights.flight_data_parquet{month=2, day_of_month=18} stats: [numFiles=1, numRows=15573, totalSize=275377, rawDataSize=280314]
Partition flights.flight_data_parquet{month=2, day_of_month=19} stats: [numFiles=1, numRows=15566, totalSize=275891, rawDataSize=280188]
Partition flights.flight_data_parquet{month=2, day_of_month=2} stats: [numFiles=1, numRows=14561, totalSize=262121, rawDataSize=262098]
Partition flights.flight_data_parquet{month=2, day_of_month=20} stats: [numFiles=1, numRows=12266, totalSize=227025, rawDataSize=220788]
Partition flights.flight_data_parquet{month=2, day_of_month=21} stats: [numFiles=1, numRows=14411, totalSize=258871, rawDataSize=259398]
Partition flights.flight_data_parquet{month=2, day_of_month=22} stats: [numFiles=1, numRows=15533, totalSize=272164, rawDataSize=279594]
Partition flights.flight_data_parquet{month=2, day_of_month=23} stats: [numFiles=1, numRows=15146, totalSize=267036, rawDataSize=272628]
Partition flights.flight_data_parquet{month=2, day_of_month=24} stats: [numFiles=1, numRows=15335, totalSize=274143, rawDataSize=276030]
Partition flights.flight_data_parquet{month=2, day_of_month=25} stats: [numFiles=1, numRows=15574, totalSize=276730, rawDataSize=280332]
Partition flights.flight_data_parquet{month=2, day_of_month=26} stats: [numFiles=1, numRows=15577, totalSize=275143, rawDataSize=280386]
Partition flights.flight_data_parquet{month=2, day_of_month=27} stats: [numFiles=1, numRows=12253, totalSize=228795, rawDataSize=220554]
Partition flights.flight_data_parquet{month=2, day_of_month=28} stats: [numFiles=1, numRows=14420, totalSize=256661, rawDataSize=259560]
Partition flights.flight_data_parquet{month=2, day_of_month=29} stats: [numFiles=1, numRows=15561, totalSize=270878, rawDataSize=280098]
Partition flights.flight_data_parquet{month=2, day_of_month=3} stats: [numFiles=1, numRows=14786, totalSize=264270, rawDataSize=266148]
Partition flights.flight_data_parquet{month=2, day_of_month=4} stats: [numFiles=1, numRows=15290, totalSize=269146, rawDataSize=275220]
Partition flights.flight_data_parquet{month=2, day_of_month=5} stats: [numFiles=1, numRows=15351, totalSize=273963, rawDataSize=276318]
Partition flights.flight_data_parquet{month=2, day_of_month=6} stats: [numFiles=1, numRows=11612, totalSize=217178, rawDataSize=209016]
Partition flights.flight_data_parquet{month=2, day_of_month=7} stats: [numFiles=1, numRows=12409, totalSize=229098, rawDataSize=223362]
Partition flights.flight_data_parquet{month=2, day_of_month=8} stats: [numFiles=1, numRows=15232, totalSize=271104, rawDataSize=274176]
Partition flights.flight_data_parquet{month=2, day_of_month=9} stats: [numFiles=1, numRows=14622, totalSize=261553, rawDataSize=263196]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 66.75 sec   HDFS Read: 222025070 HDFS Write: 7583618 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 6 seconds 750 msec
OK
Time taken: 92.324 seconds
Query ID = cloudera_20190702184343_fc3983a6-aad1-4761-ae7a-e54f78eee5c8
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1561941999858_0019, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1561941999858_0019/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1561941999858_0019
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-02 18:43:19,826 Stage-1 map = 0%,  reduce = 0%
2019-07-02 18:44:00,893 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 38.15 sec
2019-07-02 18:44:19,772 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 58.38 sec
MapReduce Total cumulative CPU time: 58 seconds 380 msec
Ended Job = job_1561941999858_0019
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/flights.db/flight_data_parquet/month=3/.hive-staging_hive_2019-07-02_18-43-02_003_6896598711090725783-1/-ext-10000
Loading data to table flights.flight_data_parquet partition (month=3, day_of_month=null)
         Time taken for load dynamic partitions : 3961
        Loading partition {month=3, day_of_month=18}
        Loading partition {month=3, day_of_month=20}
        Loading partition {month=3, day_of_month=21}
        Loading partition {month=3, day_of_month=22}
        Loading partition {month=3, day_of_month=12}
        Loading partition {month=3, day_of_month=2}
        Loading partition {month=3, day_of_month=1}
        Loading partition {month=3, day_of_month=11}
        Loading partition {month=3, day_of_month=3}
        Loading partition {month=3, day_of_month=10}
        Loading partition {month=3, day_of_month=19}
        Loading partition {month=3, day_of_month=13}
        Loading partition {month=3, day_of_month=14}
        Loading partition {month=3, day_of_month=17}
        Loading partition {month=3, day_of_month=15}
        Loading partition {month=3, day_of_month=16}
        Loading partition {month=3, day_of_month=7}
        Loading partition {month=3, day_of_month=8}
        Loading partition {month=3, day_of_month=9}
        Loading partition {month=3, day_of_month=5}
        Loading partition {month=3, day_of_month=6}
        Loading partition {month=3, day_of_month=4}
        Loading partition {month=3, day_of_month=30}
        Loading partition {month=3, day_of_month=31}
        Loading partition {month=3, day_of_month=25}
        Loading partition {month=3, day_of_month=26}
        Loading partition {month=3, day_of_month=24}
        Loading partition {month=3, day_of_month=23}
        Loading partition {month=3, day_of_month=27}
        Loading partition {month=3, day_of_month=28}
        Loading partition {month=3, day_of_month=29}
         Time taken for adding to write entity : 33
Partition flights.flight_data_parquet{month=3, day_of_month=1} stats: [numFiles=1, numRows=15151, totalSize=270554, rawDataSize=272718]
Partition flights.flight_data_parquet{month=3, day_of_month=10} stats: [numFiles=1, numRows=16105, totalSize=283774, rawDataSize=289890]
Partition flights.flight_data_parquet{month=3, day_of_month=11} stats: [numFiles=1, numRows=16107, totalSize=283811, rawDataSize=289926]
Partition flights.flight_data_parquet{month=3, day_of_month=12} stats: [numFiles=1, numRows=13545, totalSize=247695, rawDataSize=243810]
Partition flights.flight_data_parquet{month=3, day_of_month=13} stats: [numFiles=1, numRows=15459, totalSize=275705, rawDataSize=278262]
Partition flights.flight_data_parquet{month=3, day_of_month=14} stats: [numFiles=1, numRows=16092, totalSize=283246, rawDataSize=289656]
Partition flights.flight_data_parquet{month=3, day_of_month=15} stats: [numFiles=1, numRows=15643, totalSize=274200, rawDataSize=281574]
Partition flights.flight_data_parquet{month=3, day_of_month=16} stats: [numFiles=1, numRows=15796, totalSize=278091, rawDataSize=284328]
Partition flights.flight_data_parquet{month=3, day_of_month=17} stats: [numFiles=1, numRows=16162, totalSize=281613, rawDataSize=290916]
Partition flights.flight_data_parquet{month=3, day_of_month=18} stats: [numFiles=1, numRows=16125, totalSize=283300, rawDataSize=290250]
Partition flights.flight_data_parquet{month=3, day_of_month=19} stats: [numFiles=1, numRows=13575, totalSize=248202, rawDataSize=244350]
Partition flights.flight_data_parquet{month=3, day_of_month=2} stats: [numFiles=1, numRows=15471, totalSize=272074, rawDataSize=278478]
Partition flights.flight_data_parquet{month=3, day_of_month=20} stats: [numFiles=1, numRows=15490, totalSize=275132, rawDataSize=278820]
Partition flights.flight_data_parquet{month=3, day_of_month=21} stats: [numFiles=1, numRows=16095, totalSize=283467, rawDataSize=289710]
Partition flights.flight_data_parquet{month=3, day_of_month=22} stats: [numFiles=1, numRows=15617, totalSize=271368, rawDataSize=281106]
Partition flights.flight_data_parquet{month=3, day_of_month=23} stats: [numFiles=1, numRows=15779, totalSize=277437, rawDataSize=284022]
Partition flights.flight_data_parquet{month=3, day_of_month=24} stats: [numFiles=1, numRows=16152, totalSize=286250, rawDataSize=290736]
Partition flights.flight_data_parquet{month=3, day_of_month=25} stats: [numFiles=1, numRows=16086, totalSize=283660, rawDataSize=289548]
Partition flights.flight_data_parquet{month=3, day_of_month=26} stats: [numFiles=1, numRows=13551, totalSize=248367, rawDataSize=243918]
Partition flights.flight_data_parquet{month=3, day_of_month=27} stats: [numFiles=1, numRows=15296, totalSize=272587, rawDataSize=275328]
Partition flights.flight_data_parquet{month=3, day_of_month=28} stats: [numFiles=1, numRows=16095, totalSize=283175, rawDataSize=289710]
Partition flights.flight_data_parquet{month=3, day_of_month=29} stats: [numFiles=1, numRows=15612, totalSize=273379, rawDataSize=281016]
Partition flights.flight_data_parquet{month=3, day_of_month=3} stats: [numFiles=1, numRows=15849, totalSize=279474, rawDataSize=285282]
Partition flights.flight_data_parquet{month=3, day_of_month=30} stats: [numFiles=1, numRows=15768, totalSize=278946, rawDataSize=283824]
Partition flights.flight_data_parquet{month=3, day_of_month=31} stats: [numFiles=1, numRows=16065, totalSize=283097, rawDataSize=289170]
Partition flights.flight_data_parquet{month=3, day_of_month=4} stats: [numFiles=1, numRows=15827, totalSize=279075, rawDataSize=284886]
Partition flights.flight_data_parquet{month=3, day_of_month=5} stats: [numFiles=1, numRows=13115, totalSize=242439, rawDataSize=236070]
Partition flights.flight_data_parquet{month=3, day_of_month=6} stats: [numFiles=1, numRows=14870, totalSize=267925, rawDataSize=267660]
Partition flights.flight_data_parquet{month=3, day_of_month=7} stats: [numFiles=1, numRows=15812, totalSize=279011, rawDataSize=284616]
Partition flights.flight_data_parquet{month=3, day_of_month=8} stats: [numFiles=1, numRows=15341, totalSize=272906, rawDataSize=276138]
Partition flights.flight_data_parquet{month=3, day_of_month=9} stats: [numFiles=1, numRows=15471, totalSize=275283, rawDataSize=278478]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 58.38 sec   HDFS Read: 222025073 HDFS Write: 8497658 SUCCESS
Total MapReduce CPU Time Spent: 58 seconds 380 msec
OK
Time taken: 86.68 seconds
Query ID = cloudera_20190702184444_9f4c5e56-8ee4-4521-9c85-0c8facf6ca4e
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1561941999858_0020, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1561941999858_0020/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1561941999858_0020
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-02 18:44:48,185 Stage-1 map = 0%,  reduce = 0%
2019-07-02 18:45:23,980 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 28.69 sec
2019-07-02 18:45:48,321 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 57.97 sec
MapReduce Total cumulative CPU time: 1 minutes 2 seconds 400 msec
Ended Job = job_1561941999858_0020
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://quickstart.cloudera:8020/user/hive/warehouse/flights.db/flight_data_parquet/month=4/.hive-staging_hive_2019-07-02_18-44-28_688_4011904282550586482-1/-ext-10000
Loading data to table flights.flight_data_parquet partition (month=4, day_of_month=null)
         Time taken for load dynamic partitions : 3618
        Loading partition {month=4, day_of_month=25}
        Loading partition {month=4, day_of_month=6}
        Loading partition {month=4, day_of_month=5}
        Loading partition {month=4, day_of_month=4}
        Loading partition {month=4, day_of_month=26}
        Loading partition {month=4, day_of_month=23}
        Loading partition {month=4, day_of_month=24}
        Loading partition {month=4, day_of_month=22}
        Loading partition {month=4, day_of_month=3}
        Loading partition {month=4, day_of_month=8}
        Loading partition {month=4, day_of_month=7}
        Loading partition {month=4, day_of_month=30}
        Loading partition {month=4, day_of_month=9}
        Loading partition {month=4, day_of_month=18}
        Loading partition {month=4, day_of_month=2}
        Loading partition {month=4, day_of_month=1}
        Loading partition {month=4, day_of_month=15}
        Loading partition {month=4, day_of_month=10}
        Loading partition {month=4, day_of_month=19}
        Loading partition {month=4, day_of_month=13}
        Loading partition {month=4, day_of_month=14}
        Loading partition {month=4, day_of_month=17}
        Loading partition {month=4, day_of_month=16}
        Loading partition {month=4, day_of_month=12}
        Loading partition {month=4, day_of_month=11}
        Loading partition {month=4, day_of_month=20}
        Loading partition {month=4, day_of_month=21}
        Loading partition {month=4, day_of_month=27}
        Loading partition {month=4, day_of_month=28}
        Loading partition {month=4, day_of_month=29}
         Time taken for adding to write entity : 26
Partition flights.flight_data_parquet{month=4, day_of_month=1} stats: [numFiles=1, numRows=16021, totalSize=282491, rawDataSize=288378]
Partition flights.flight_data_parquet{month=4, day_of_month=10} stats: [numFiles=1, numRows=15304, totalSize=272047, rawDataSize=275472]
Partition flights.flight_data_parquet{month=4, day_of_month=11} stats: [numFiles=1, numRows=16128, totalSize=281014, rawDataSize=290304]
Partition flights.flight_data_parquet{month=4, day_of_month=12} stats: [numFiles=1, numRows=15625, totalSize=271533, rawDataSize=281250]
Partition flights.flight_data_parquet{month=4, day_of_month=13} stats: [numFiles=1, numRows=15802, totalSize=273670, rawDataSize=284436]
Partition flights.flight_data_parquet{month=4, day_of_month=14} stats: [numFiles=1, numRows=16108, totalSize=277825, rawDataSize=289944]
Partition flights.flight_data_parquet{month=4, day_of_month=15} stats: [numFiles=1, numRows=16134, totalSize=280829, rawDataSize=290412]
Partition flights.flight_data_parquet{month=4, day_of_month=16} stats: [numFiles=1, numRows=12800, totalSize=233679, rawDataSize=230400]
Partition flights.flight_data_parquet{month=4, day_of_month=17} stats: [numFiles=1, numRows=15186, totalSize=272290, rawDataSize=273348]
Partition flights.flight_data_parquet{month=4, day_of_month=18} stats: [numFiles=1, numRows=16109, totalSize=282907, rawDataSize=289962]
Partition flights.flight_data_parquet{month=4, day_of_month=19} stats: [numFiles=1, numRows=15749, totalSize=273492, rawDataSize=283482]
Partition flights.flight_data_parquet{month=4, day_of_month=2} stats: [numFiles=1, numRows=13426, totalSize=247536, rawDataSize=241668]
Partition flights.flight_data_parquet{month=4, day_of_month=20} stats: [numFiles=1, numRows=15819, totalSize=273668, rawDataSize=284742]
Partition flights.flight_data_parquet{month=4, day_of_month=21} stats: [numFiles=1, numRows=16100, totalSize=280355, rawDataSize=289800]
Partition flights.flight_data_parquet{month=4, day_of_month=22} stats: [numFiles=1, numRows=16156, totalSize=283198, rawDataSize=290808]
Partition flights.flight_data_parquet{month=4, day_of_month=23} stats: [numFiles=1, numRows=12795, totalSize=235973, rawDataSize=230310]
Partition flights.flight_data_parquet{month=4, day_of_month=24} stats: [numFiles=1, numRows=15197, totalSize=266825, rawDataSize=273546]
Partition flights.flight_data_parquet{month=4, day_of_month=25} stats: [numFiles=1, numRows=16105, totalSize=278439, rawDataSize=289890]
Partition flights.flight_data_parquet{month=4, day_of_month=26} stats: [numFiles=1, numRows=15746, totalSize=278455, rawDataSize=283428]
Partition flights.flight_data_parquet{month=4, day_of_month=27} stats: [numFiles=1, numRows=15820, totalSize=278745, rawDataSize=284760]
Partition flights.flight_data_parquet{month=4, day_of_month=28} stats: [numFiles=1, numRows=16124, totalSize=283195, rawDataSize=290232]
Partition flights.flight_data_parquet{month=4, day_of_month=29} stats: [numFiles=1, numRows=16143, totalSize=284755, rawDataSize=290574]
Partition flights.flight_data_parquet{month=4, day_of_month=3} stats: [numFiles=1, numRows=15333, totalSize=273550, rawDataSize=275994]
Partition flights.flight_data_parquet{month=4, day_of_month=30} stats: [numFiles=1, numRows=12770, totalSize=238692, rawDataSize=229860]
Partition flights.flight_data_parquet{month=4, day_of_month=4} stats: [numFiles=1, numRows=16084, totalSize=283111, rawDataSize=289512]
Partition flights.flight_data_parquet{month=4, day_of_month=5} stats: [numFiles=1, numRows=15740, totalSize=274810, rawDataSize=283320]
Partition flights.flight_data_parquet{month=4, day_of_month=6} stats: [numFiles=1, numRows=15834, totalSize=274025, rawDataSize=285012]
Partition flights.flight_data_parquet{month=4, day_of_month=7} stats: [numFiles=1, numRows=16122, totalSize=283685, rawDataSize=290196]
Partition flights.flight_data_parquet{month=4, day_of_month=8} stats: [numFiles=1, numRows=16151, totalSize=283665, rawDataSize=290718]
Partition flights.flight_data_parquet{month=4, day_of_month=9} stats: [numFiles=1, numRows=13199, totalSize=242157, rawDataSize=237582]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 62.4 sec   HDFS Read: 222025073 HDFS Write: 8148954 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 2 seconds 400 msec
OK
Time taken: 91.101 seconds
WARN: The method class org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
WARN: Please see http://www.slf4j.org/codes.html#release for an explanation.
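
The cost of the batch run can be totalled from the four `Time taken` lines in the log above. The snippet below is a small sketch that copies those lines verbatim and sums them; the helper name `total_seconds` is only illustrative.

```python
import re

# "Time taken" lines copied from the load_parquet.sql run above.
log_lines = [
    "Time taken: 108.767 seconds",
    "Time taken: 92.324 seconds",
    "Time taken: 86.68 seconds",
    "Time taken: 91.101 seconds",
]


def total_seconds(lines):
    """Sum the elapsed seconds reported on Hive CLI 'Time taken' lines."""
    pattern = re.compile(r"^Time taken: ([0-9.]+) seconds")
    matches = (pattern.match(line) for line in lines)
    return sum(float(m.group(1)) for m in matches if m)


total = total_seconds(log_lines)
average = total / len(log_lines)
print(f"{len(log_lines)} inserts: {total:.3f} s total, {average:.3f} s on average")
```

Each of the four jobs also reports an `HDFS Read` of about 222 MB, which is consistent with every insert scanning the whole of flight_data_csv.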

In [None]:
## 28. What is the size on disk of the data in each table? Fill in the result in the following table. Hint: hadoop fs -du -h hdfs_directory. [Report Results & Fill Table]

[cloudera@quickstart ~]$ hadoop fs -du -h /user/hive/warehouse/flights.db
211.7 M  211.7 M  /user/hive/warehouse/flights.db/flight_data_csv
30.7 M   30.7 M   /user/hive/warehouse/flights.db/flight_data_parquet

|Comparison |flight_data_csv |flight_data_parquet |
|-----------|:--------------:|-------------------:|
|disk usage| 211.7 M| 30.7 M |
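
As a sanity check on the table above, the 30.7 M figure can be compared with the `HDFS Write` byte counts that the four insert jobs reported. This is a sketch under one assumption: that `hadoop fs -du -h` prints binary megabytes (MiB).

```python
# "HDFS Write" byte counts from the four insert jobs in the load_parquet.sql run.
hdfs_write_bytes = [7994969, 7583618, 8497658, 8148954]

# Sizes reported by `hadoop fs -du -h`, assumed here to be binary megabytes (MiB).
csv_mib = 211.7
parquet_mib = 30.7

written_mib = sum(hdfs_write_bytes) / 1024 ** 2
ratio = csv_mib / parquet_mib

print(f"bytes written by the inserts: {written_mib:.1f} MiB")
print(f"CSV / Parquet size ratio:     {ratio:.1f}x")
```

The ratio should be read with some care: only months 1 through 4 were loaded into flight_data_parquet, so if flight_data_csv holds more months than that (the assignment mentions roughly 5M records), the difference reflects the smaller row count as well as the storage format.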

In [None]:
## 29. What is the average flight distance in March? How long did the query take? [Report Results and Fill Time in Table]

0: jdbc:hive2://> SELECT AVG(DISTANCE) FROM flight_data_csv WHERE MONTH = 3;

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 41.41 sec   HDFS Read: 222027268 HDFS Write: 18 SUCCESS
Total MapReduce CPU Time Spent: 41 seconds 410 msec
OK
+--------------------+--+
|        _c0         |
+--------------------+--+
| 851.4734618740113  |
+--------------------+--+
1 row selected (77.059 seconds)

0: jdbc:hive2://> SELECT AVG(DISTANCE) FROM flight_data_parquet WHERE MONTH = 3;

Total MapReduce CPU Time Spent: 19 seconds 580 msec
OK
+--------------------+--+
|        _c0         |
+--------------------+--+
| 851.4734618740113  |
+--------------------+--+
1 row selected (55.19 seconds)

|Comparison |flight_data_csv |flight_data_parquet |
|-----------|:--------------:|-------------------:|
|query time, avg distance in March| 77.059 s| 55.19 s|
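
The gap between the two runs can be expressed as a ratio, both for wall-clock time and for the MapReduce CPU time reported in the logs above. A minimal sketch using only those reported numbers:

```python
# Wall-clock seconds from the "1 row selected" lines above.
csv_wall_s = 77.059
parquet_wall_s = 55.19

# "Total MapReduce CPU Time Spent" for each query, in seconds.
csv_cpu_s = 41.41
parquet_cpu_s = 19.58

wall_speedup = csv_wall_s / parquet_wall_s
cpu_speedup = csv_cpu_s / parquet_cpu_s

print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"CPU-time speedup:   {cpu_speedup:.2f}x")
```

The CPU ratio is noticeably larger than the wall-clock ratio, which suggests that a fixed job start-up overhead on the VM accounts for a good share of both elapsed times. These are single runs, so the figures are indicative only.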


In [None]:
## 30. Compare the results in the table. Explain why you are seeing a difference in query performance across the two storage formats. [Report Answer]

## Partitioning clearly helps the query performance for the parquet table. Since the query filters on month, and we partitioned on month and day_of_month, the query does not have to do an entire table scan.
## Instead, it only has to locate the subdirectory for month = 3 (March) and perform the calculation on that subset of data, whereas the CSV query read the whole file (HDFS Read: 222027268 bytes).
## The storage format itself also contributes: Parquet is columnar, so the query can read just the DISTANCE column instead of parsing every field of every text row.
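
The effect of filtering on a partition column can be illustrated with a toy model of the table's directory layout. This is not how Hive implements pruning internally; it only mimics the idea of skipping directories whose `key=value` names do not match the filter. The day counts per month are taken from the load log above, and the function name `prune` is hypothetical.

```python
# Days loaded per month, as seen in the "Loading partition" log lines.
days_in_month = {1: 31, 2: 29, 3: 31, 4: 30}

# Toy model of the HDFS layout: one directory per (month, day_of_month) partition.
partitions = [
    f"month={month}/day_of_month={day}"
    for month, last_day in days_in_month.items()
    for day in range(1, last_day + 1)
]


def prune(paths, **wanted):
    """Keep only partition paths whose key=value segments match every filter."""
    def matches(path):
        spec = dict(segment.split("=") for segment in path.split("/"))
        return all(spec.get(key) == str(value) for key, value in wanted.items())
    return [path for path in paths if matches(path)]


march = prune(partitions, month=3)
print(f"{len(march)} of {len(partitions)} partition directories need to be read")
```

A filter on a non-partition column, such as DISTANCE, would not narrow this list at all; every directory would still have to be opened.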

In [None]:
## 31. The datasets we have created take up space on HDFS and on your local machine. We ask you to properly remove the Hive databases/tables and HDFS/local datasets that were generated as a result of this assignment
## (but you can keep your load_parquet.sql).

0: jdbc:hive2://> DROP DATABASE flights CASCADE;
OK
No rows affected (0.415 seconds)

[cloudera@quickstart ~]$ rm -r /home/cloudera/flights
[cloudera@quickstart ~]$

[cloudera@quickstart ~]$ hadoop fs -rm -r /user/cloudera/flights
Deleted /user/cloudera/flights

# load_parquet.sql

In [None]:
INSERT INTO TABLE flights.flight_data_parquet PARTITION(MONTH = 1, DAY_OF_MONTH) SELECT F.YEAR, F.FL_DATE, F.UNIQUE_CARRIER, F.AIRLINE_ID, F.CARRIER, F.TAIL_NUM, F.FL_NUM, F.ORIGIN_AIRPORT_ID, F.ORIGIN_AIRPORT_SEQ_ID,  
F.ORIGIN, F.DEST_AIRPORT_ID, F.DEST_AIRPORT_SEQ_ID, F.DEST, F.DEP_DELAY, F.ARR_DELAY, F.CANCELLED, F.DIVERTED, F.DISTANCE, F.DAY_OF_MONTH FROM flights.flight_data_csv F WHERE F.YEAR != 'YEAR' AND F.MONTH = 1;
INSERT INTO TABLE flights.flight_data_parquet PARTITION(MONTH = 2, DAY_OF_MONTH) SELECT F.YEAR, F.FL_DATE, F.UNIQUE_CARRIER, F.AIRLINE_ID, F.CARRIER, F.TAIL_NUM, F.FL_NUM, F.ORIGIN_AIRPORT_ID, F.ORIGIN_AIRPORT_SEQ_ID,  
F.ORIGIN, F.DEST_AIRPORT_ID, F.DEST_AIRPORT_SEQ_ID, F.DEST, F.DEP_DELAY, F.ARR_DELAY, F.CANCELLED, F.DIVERTED, F.DISTANCE, F.DAY_OF_MONTH FROM flights.flight_data_csv F WHERE F.YEAR != 'YEAR' AND F.MONTH = 2;
INSERT INTO TABLE flights.flight_data_parquet PARTITION(MONTH = 3, DAY_OF_MONTH) SELECT F.YEAR, F.FL_DATE, F.UNIQUE_CARRIER, F.AIRLINE_ID, F.CARRIER, F.TAIL_NUM, F.FL_NUM, F.ORIGIN_AIRPORT_ID, F.ORIGIN_AIRPORT_SEQ_ID,  
F.ORIGIN, F.DEST_AIRPORT_ID, F.DEST_AIRPORT_SEQ_ID, F.DEST, F.DEP_DELAY, F.ARR_DELAY, F.CANCELLED, F.DIVERTED, F.DISTANCE, F.DAY_OF_MONTH FROM flights.flight_data_csv F WHERE F.YEAR != 'YEAR' AND F.MONTH = 3;
INSERT INTO TABLE flights.flight_data_parquet PARTITION(MONTH = 4, DAY_OF_MONTH) SELECT F.YEAR, F.FL_DATE, F.UNIQUE_CARRIER, F.AIRLINE_ID, F.CARRIER, F.TAIL_NUM, F.FL_NUM, F.ORIGIN_AIRPORT_ID, F.ORIGIN_AIRPORT_SEQ_ID,  
F.ORIGIN, F.DEST_AIRPORT_ID, F.DEST_AIRPORT_SEQ_ID, F.DEST, F.DEP_DELAY, F.ARR_DELAY, F.CANCELLED, F.DIVERTED, F.DISTANCE, F.DAY_OF_MONTH FROM flights.flight_data_csv F WHERE F.YEAR != 'YEAR' AND F.MONTH = 4;
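
Because the four statements differ only in the month number, the script could also be generated rather than typed by hand. The sketch below rebuilds equivalent statements from a template; the names `COLUMNS` and `insert_statement` are illustrative, and the output puts each statement on a single line instead of two.

```python
# Columns selected from flight_data_csv, in the order used by load_parquet.sql.
COLUMNS = [
    "YEAR", "FL_DATE", "UNIQUE_CARRIER", "AIRLINE_ID", "CARRIER", "TAIL_NUM",
    "FL_NUM", "ORIGIN_AIRPORT_ID", "ORIGIN_AIRPORT_SEQ_ID", "ORIGIN",
    "DEST_AIRPORT_ID", "DEST_AIRPORT_SEQ_ID", "DEST", "DEP_DELAY", "ARR_DELAY",
    "CANCELLED", "DIVERTED", "DISTANCE",
    "DAY_OF_MONTH",  # the dynamic partition column comes last in the SELECT
]


def insert_statement(month):
    """Build the INSERT for one month: static MONTH, dynamic DAY_OF_MONTH."""
    select_list = ", ".join(f"F.{column}" for column in COLUMNS)
    return (
        "INSERT INTO TABLE flights.flight_data_parquet "
        f"PARTITION(MONTH = {month}, DAY_OF_MONTH) "
        f"SELECT {select_list} "
        "FROM flights.flight_data_csv F "
        f"WHERE F.YEAR != 'YEAR' AND F.MONTH = {month};"
    )


script = "\n".join(insert_statement(month) for month in range(1, 5)) + "\n"
print(script)
```

Saving the printed text as load_parquet.sql and running it with `hive -f` would then proceed as in question 27.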