## Inserting Data using Stage Table

Let us understand how to insert data into `order_items`, a table created with the Parquet file format, when the source data is in text file format.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - DML and Partitioning").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@3925730f


org.apache.spark.sql.SparkSession@3925730f

If you are going to use CLIs, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

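As the data is in text file format and our table is created with the Parquet file format, we should not use the `LOAD` command to load the data. `LOAD` only copies the files into the table directory; it does not convert them to the table's file format.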

In [5]:
%%sql
USE itv002461_retail

++
||
++
++



In [6]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/order_items'
    OVERWRITE INTO TABLE order_items

Moved: 'hdfs://m01.itversity.com:9000/user/itv002461/warehouse/itv002461_retail.db/order_items/part-00000' to trash at: hdfs://m01.itversity.com:9000/user/itv002461/.Trash/Current


++
||
++
++



* The above `LOAD` command succeeds, but querying the table afterwards fails because the query expects the data to be in Parquet file format.

In [7]:
%%sql

SELECT * FROM order_items LIMIT 10



Magic sql failed to execute with error: 
Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 4, w01.itversity.com, executor 2): java.lang.RuntimeException: hdfs://m01.itversity.com:9000/user/itv002461/warehouse/itv002461_retail.db/order_items/part-00000 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [48, 46, 48, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:524)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:505)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:499)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:448)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:371)
	at org.apache.spark.sql.execution.datasources.parquet
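
The numbers in the error message are byte values. The following is a minimal sketch, using only the JVM standard library, that decodes them as ASCII text to show why the reader rejected the file. The helper name `looksLikeParquetTail` is hypothetical and is not part of Spark or Parquet.

```scala
import java.nio.charset.StandardCharsets.US_ASCII
import java.util.Arrays

// Byte values copied from the error message above.
val expectedTail = Array[Byte](80, 65, 82, 49)
val foundTail = Array[Byte](48, 46, 48, 10)

// Decode both byte sequences as ASCII text.
val expectedText = new String(expectedTail, US_ASCII) // the Parquet marker "PAR1"
val foundText = new String(foundTail, US_ASCII)       // tail of a text line: "0.0" plus newline

// Hypothetical helper: do the last 4 bytes of a file match the Parquet marker?
def looksLikeParquetTail(tail: Array[Byte]): Boolean =
  Arrays.equals(tail, "PAR1".getBytes(US_ASCII))

println(s"expected: $expectedText, found ends with a newline: ${foundText.endsWith("\n")}")
println(s"found tail is Parquet: ${looksLikeParquetTail(foundTail)}")
```

The file ends with the last characters of a comma-delimited text record rather than the Parquet marker, which confirms that `LOAD` copied the text file as is.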

In [8]:
%%sql

TRUNCATE TABLE order_items

++
||
++
++



The following steps get data into a table that was created with a different file format or delimiter than the source data.

* Create a stage table with text file format and comma as the delimiter (`order_items_stage`).
* Load data from the files in the local file system into the stage table.
* Run an `INSERT` command that selects from the stage table to insert the data into the target table (`order_items`).
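
The three steps can be sketched as a sequence of SQL statements built in Scala. This is only an illustration: `stageLoadStatements` and its parameter names are hypothetical, and the column list is taken from the `order_items_stage` definition used in this notebook.

```scala
// Hypothetical helper: returns the statements of the stage-table workflow, in order.
def stageLoadStatements(
    stageTable: String,
    targetTable: String,
    columnsDdl: String,
    localPath: String): Seq[String] = Seq(
  // Step 1: stage table stored as comma-delimited text, matching the source files
  s"CREATE TABLE $stageTable ($columnsDdl) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','",
  // Step 2: LOAD only copies files, which is fine here because the formats match
  s"LOAD DATA LOCAL INPATH '$localPath' INTO TABLE $stageTable",
  // Step 3: INSERT ... SELECT reads the text rows and writes them in the target's format
  s"INSERT INTO TABLE $targetTable SELECT * FROM $stageTable"
)

val statements = stageLoadStatements(
  "order_items_stage",
  "order_items",
  "order_item_id INT, order_item_order_id INT, order_item_product_id INT, " +
    "order_item_quantity INT, order_item_subtotal FLOAT, order_item_product_price FLOAT",
  "/data/retail_db/order_items")

statements.foreach(println)
```

With an active `SparkSession` named `spark`, the statements could be run in order with `statements.foreach(stmt => spark.sql(stmt))`.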

Let us see an example of inserting data into the target table from the stage table.

In [9]:
%%sql

USE itv002461_retail

++
||
++
++



In [10]:
%%sql

SHOW TABLES

+----------------+-----------+-----------+
|        database|  tableName|isTemporary|
+----------------+-----------+-----------+
|itv002461_retail| categories|      false|
|itv002461_retail|  customers|      false|
|itv002461_retail|departments|      false|
|itv002461_retail|order_items|      false|
|itv002461_retail|     orders|      false|
|itv002461_retail|   products|      false|
+----------------+-----------+-----------+



In [11]:
%%sql

CREATE TABLE order_items_stage (
  order_item_id INT,
  order_item_order_id INT,
  order_item_product_id INT,
  order_item_quantity INT,
  order_item_subtotal FLOAT,
  order_item_product_price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','

++
||
++
++
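
To make the effect of `FIELDS TERMINATED BY ','` concrete, here is a small plain-Scala sketch of how one comma-delimited line maps onto the six columns of the stage table. The case class `OrderItem` and the function `parseOrderItem` are hypothetical names used only for illustration, and the sample line is assumed from the first row of `order_items` shown in this notebook. Spark performs the equivalent parsing internally when the stage table is queried.

```scala
// Hypothetical case class mirroring the columns of order_items_stage.
case class OrderItem(
  orderItemId: Int,
  orderItemOrderId: Int,
  orderItemProductId: Int,
  orderItemQuantity: Int,
  orderItemSubtotal: Float,
  orderItemProductPrice: Float)

// Split one comma-delimited line and convert each field to its column type.
def parseOrderItem(line: String): OrderItem = {
  val fields = line.split(",")
  OrderItem(
    fields(0).toInt,
    fields(1).toInt,
    fields(2).toInt,
    fields(3).toInt,
    fields(4).toFloat,
    fields(5).toFloat)
}

// Sample line assumed from the first record of the source data.
val first = parseOrderItem("1,1,957,1,299.98,299.98")
println(first)
```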



In [12]:
spark.sql("DESCRIBE FORMATTED order_items_stage").show(200, false)

+----------------------------+--------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                   |comment|
+----------------------------+--------------------------------------------------------------------------------------------+-------+
|order_item_id               |int                                                                                         |null   |
|order_item_order_id         |int                                                                                         |null   |
|order_item_product_id       |int                                                                                         |null   |
|order_item_quantity         |int                                                                                         |null   |
|order_item_subtotal         |float                                         

In [13]:
%%sql

LOAD DATA LOCAL INPATH '/data/retail_db/order_items' INTO TABLE order_items_stage

++
||
++
++



In [14]:
%%sql

SELECT * FROM order_items_stage LIMIT 10



+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [15]:
%%sql

TRUNCATE TABLE order_items

++
||
++
++



In [16]:
%%sql

INSERT INTO TABLE order_items
SELECT * FROM order_items_stage

++
||
++
++



In [17]:
%%sql

SELECT * FROM order_items LIMIT 10



+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [18]:
%%sql

SELECT count(1) FROM order_items

+--------+
|count(1)|
+--------+
|  172198|
+--------+



* `INSERT INTO` appends data to the target table by adding new files. Running it again with the same source therefore doubles the row count.

In [19]:
%%sql

INSERT INTO TABLE order_items
SELECT * FROM order_items_stage

++
||
++
++



In [20]:
%%sql

SELECT * FROM order_items LIMIT 10



+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [21]:
%%sql

SELECT count(1) FROM order_items

+--------+
|count(1)|
+--------+
|  344396|
+--------+



* `INSERT OVERWRITE` replaces the data in the target table. The files holding the old data are deleted from the directory that the Spark Metastore table points to, and new files are written in their place.

In [22]:
%%sql

INSERT OVERWRITE TABLE order_items
SELECT * FROM order_items_stage

++
||
++
++



In [23]:
%%sql

SELECT * FROM order_items LIMIT 10



+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|order_item_id|order_item_order_id|order_item_product_id|order_item_quantity|order_item_subtotal|order_item_product_price|
+-------------+-------------------+---------------------+-------------------+-------------------+------------------------+
|            1|                  1|                  957|                  1|             299.98|                  299.98|
|            2|                  2|                 1073|                  1|             199.99|                  199.99|
|            3|                  2|                  502|                  5|              250.0|                    50.0|
|            4|                  2|                  403|                  1|             129.99|                  129.99|
|            5|                  4|                  897|                  2|              49.98|                   24.99|
|            6| 

In [24]:
%%sql

SELECT count(1) FROM order_items

+--------+
|count(1)|
+--------+
|  172198|
+--------+
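
The difference between appending and overwriting, and the row counts reported above, can be mimicked with a toy model in plain Scala. This is only a sketch: a table is modelled as the list of data files in its directory, and the function names and file names are hypothetical placeholders rather than Spark APIs or real warehouse paths.

```scala
// Toy model: a table is the list of data files in its directory.
def insertInto(existing: Seq[String], added: Seq[String]): Seq[String] =
  existing ++ added // append: old files stay and new files are added

def insertOverwrite(existing: Seq[String], added: Seq[String]): Seq[String] =
  added // overwrite: old files are dropped and only the new files remain

// Hypothetical file names used only for illustration.
val afterAppend = insertInto(Seq("old-part-0"), Seq("new-part-0"))
val afterReplace = insertOverwrite(afterAppend, Seq("fresh-part-0"))

// Row counts as reported by the notebook cells above.
val stageRows = 172198L
val afterFirstInsert = stageRows                     // table was truncated first
val afterSecondInsert = afterFirstInsert + stageRows // INSERT INTO appends
val afterOverwrite = stageRows                       // INSERT OVERWRITE replaces

println(s"files after append: $afterAppend, after overwrite: $afterReplace")
println(s"rows: $afterFirstInsert, $afterSecondInsert, $afterOverwrite")
```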



In [25]:
import sys.process._

// List the files backing the order_items table; the expression returns the exit code
s"hdfs dfs -ls /user/${username}/warehouse/${username}_retail.db/order_items".!

Found 3 items
-rw-r--r--   3 itv002461 supergroup          0 2022-05-30 00:31 /user/itv002461/warehouse/itv002461_retail.db/order_items/_SUCCESS
-rw-r--r--   3 itv002461 supergroup     862839 2022-05-30 00:31 /user/itv002461/warehouse/itv002461_retail.db/order_items/part-00000-9a6ae15f-203c-4a37-ae6c-b2963bde9837-c000.snappy.parquet
-rw-r--r--   3 itv002461 supergroup     858034 2022-05-30 00:31 /user/itv002461/warehouse/itv002461_retail.db/order_items/part-00001-9a6ae15f-203c-4a37-ae6c-b2963bde9837-c000.snappy.parquet




0