## Loading Data into Tables - HDFS

Let us understand how to load data from an HDFS location into a Spark Metastore table.

Let us start the Spark context for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@bbf3cdd


org.apache.spark.sql.SparkSession@bbf3cdd

If you are going to use CLIs, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* We can use the `LOAD` command without **LOCAL** to get data from an HDFS location into a Spark Metastore table.
* The user running the load command needs write permissions on the source location, because the data is moved rather than copied (the files are removed from the source and placed in the table's directory).
* So first we need to copy the data into an HDFS location where the user has write permissions.
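
The permission requirement above can be checked before running the load. The following is a minimal sketch, assuming the `hadoop` CLI is on the path and that your Hadoop version supports the `-w` option of `hadoop fs -test` (available in recent releases). The helper names `writeCheckCommand` and `canWrite` are made up for illustration.

```scala
import sys.process._

val username = System.getProperty("user.name")

// Hypothetical helper: builds the arguments for `hadoop fs -test -w <path>`.
// Assumption: the installed Hadoop version supports the -w option.
def writeCheckCommand(path: String): Seq[String] =
  Seq("hadoop", "fs", "-test", "-w", path)

// Hypothetical helper: runs the check and interprets the exit code.
// Exit code 0 means the path exists and the current user can write to it.
def canWrite(path: String): Boolean =
  writeCheckCommand(path).! == 0

// Staging directory used in this notebook
val sourceDir = s"/user/${username}/retail_db"
```

Calling `canWrite(sourceDir)` on the gateway node should return `true` for a directory under your own HDFS home. If it returns `false`, copy the data to a location you own before loading it.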

In [1]:
import sys.process._
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
s"hadoop fs -rm -R /user/${username}/retail_db/orders" !

2022-05-27 07:53:26,743 INFO fs.TrashPolicyDefault: Moved: 'hdfs://m01.itversity.com:9000/user/itv002461/retail_db/orders' to trash at: hdfs://m01.itversity.com:9000/user/itv002461/.Trash/Current/user/itv002461/retail_db/orders1653652406726




0

In [3]:
s"hadoop fs -mkdir /user/${username}/retail_db" !

mkdir: `/user/itv002461/retail_db': File exists




1

In [4]:
s"hadoop fs -put -f /data/retail_db/orders /user/${username}/retail_db" !



0

In [5]:
s"hadoop fs -ls /user/${username}/retail_db/orders" !

Found 1 items
-rw-r--r--   3 itv002461 supergroup    2999944 2022-05-27 07:54 /user/itv002461/retail_db/orders/part-00000




0

* Here is the script which truncates the table and then loads the data from the HDFS location into the Spark Metastore table.

In [3]:
%%sql

USE itv002461_retail

Waiting for a Spark session to start...

++
||
++
++



In [4]:
%%sql

TRUNCATE TABLE orders

++
||
++
++



In [6]:
%%sql

LOAD DATA INPATH '/user/itv002461/retail_db/orders' 
  INTO TABLE orders

++
||
++
++



In [9]:
import sys.process._
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [10]:
s"hadoop fs -ls /user/${username}/warehouse/${username}_retail.db/orders" !

Found 1 items
-rwxr-xr-x   3 itv002461 supergroup    2999944 2022-05-27 07:54 /user/itv002461/warehouse/itv002461_retail.db/orders/part-00000




0

In [11]:
s"hadoop fs -ls /user/${username}/retail_db/orders" !



0

In [12]:
%%sql

SELECT * FROM orders LIMIT 10



+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   34565|2014-02-23 00:00:...|             8702|       COMPLETE|
|   34566|2014-02-23 00:00:...|             3066|PENDING_PAYMENT|
|   34567|2014-02-23 00:00:...|             7314|SUSPECTED_FRAUD|
|   34568|2014-02-23 00:00:...|             1271|       COMPLETE|
|   34569|2014-02-23 00:00:...|            11083|       COMPLETE|
|   34570|2014-02-23 00:00:...|             3159|         CLOSED|
|   34571|2014-02-23 00:00:...|             4551|         CLOSED|
|   34572|2014-02-23 00:00:...|             8135|        PENDING|
|   34573|2014-02-23 00:00:...|             7497|PENDING_PAYMENT|
|   34574|2014-02-23 00:00:...|             1868|        ON_HOLD|
+--------+--------------------+-----------------+---------------+



In [13]:
%%sql

SELECT count(1) FROM orders

+--------+
|count(1)|
+--------+
|   68883|
+--------+



* Using Spark SQL with Python or Scala. Replace `itversity` in the database name and the path with your own username.

In [None]:
spark.sql("USE itversity_retail")

In [None]:
spark.sql("TRUNCATE TABLE orders")

In [None]:
spark.sql("""
LOAD DATA INPATH '/user/itversity/retail_db/orders' 
  INTO TABLE orders""")

In [None]:
s"hadoop fs -ls /user/${username}/retail_db/orders" !

In [None]:
spark.sql("SELECT * FROM orders LIMIT 10")

In [None]:
spark.sql("SELECT count(1) FROM orders")
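
The truncate-then-load steps above can also be written as a single statement, because `LOAD DATA` accepts an `OVERWRITE` keyword that replaces the existing contents of the table. Here is a small sketch that builds the statement from the current username instead of hard-coding it. The helper `loadStatement` is hypothetical, and the database name `${username}_retail` is assumed to follow the convention used earlier in this notebook.

```scala
// Hypothetical helper: builds a LOAD DATA statement for an HDFS path.
// With overwrite = true, existing table data is replaced, so no TRUNCATE is needed.
def loadStatement(inputPath: String, table: String, overwrite: Boolean = false): String = {
  val mode = if (overwrite) "OVERWRITE " else ""
  s"LOAD DATA INPATH '${inputPath}' ${mode}INTO TABLE ${table}"
}

val username = System.getProperty("user.name")
val stmt = loadStatement(s"/user/${username}/retail_db/orders", "orders", overwrite = true)

// Print the statement so that it can be reviewed before it is executed
println(stmt)

// In a notebook or spark-shell session where `spark` is available:
// spark.sql(s"USE ${username}_retail")
// spark.sql(stmt)
```

Keeping the statement in a string makes it easy to inspect before passing it to `spark.sql`.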

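* If you list **/user/${username}/retail_db/orders** after the load, you will see that the data files are no longer there. They have been moved into the table's directory under the warehouse.
* Within HDFS, moving a file is much faster than copying it, because only the file metadata is updated and no blocks are copied. Hence the load command from an HDFS location always tries to move the files.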
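
The effect of a move on the source location can be illustrated with a local-filesystem analogy. This sketch does not use HDFS or Spark. It uses `java.nio.file.Files.move` on temporary directories (all paths and the file content are made up) to show that after a move the file exists only at the target, which is what we observed for the `orders` directory.

```scala
import java.nio.file.{Files, Path}

// Local analogy only: temporary directories stand in for the HDFS source
// directory and the table's warehouse directory.
val source: Path = Files.createTempDirectory("retail_db_orders")
val warehouse: Path = Files.createTempDirectory("warehouse_orders")

// Create a small placeholder data file in the source directory
val part: Path = Files.write(source.resolve("part-00000"), "sample\n".getBytes("UTF-8"))

// Move the file into the target directory; nothing remains at the source
val moved: Path = Files.move(part, warehouse.resolve("part-00000"))

println(Files.exists(part))  // false: the source file is gone
println(Files.exists(moved)) // true: the file now lives in the target directory
```

A copy would have left both files in place, which is why loading with **LOCAL** (a copy from the local file system) leaves the source untouched while loading from HDFS does not.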