## Managed Tables - Exercise

Let us use NYSE data to see how we can create tables in the Spark Metastore.

Let us start the Spark session for this notebook so that we can execute the code provided. You can sign up for our [10 node state of the art cluster/labs](https://labs.itversity.com/plans) to learn Spark SQL using our unique integrated LMS.

In [1]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@917bba7


org.apache.spark.sql.SparkSession@917bba7

If you prefer to work from a CLI, you can launch Spark SQL using one of the following 3 approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Duration: **30 Minutes**
* Data Location (Local): /data/nyse_all/nyse_data
* Create a database with the name - YOUR_OS_USER_NAME_nyse
* Table Name: nyse_eod
* File Format: TEXTFILE (default)
* Review the files by running Linux commands before using the data sets. The data is compressed, and the files can be loaded as is.
* Copy one of the zip files to your home directory and preview the data. There should be 7 fields. You need to determine the delimiter.
* Field Names: stockticker, tradedate, openprice, highprice, lowprice, closeprice, volume
* Determine the correct data types based on the values. For example, you need to use `BIGINT` for volume, not `INT`.
* Create a managed table with the default delimiter.
> As the delimiters in the data and in the table are not the same, you need to figure out how to get the data into the target table.
* Make sure the data is copied into the table as per the structure defined, and validate it.
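
One way to approach the preview steps above is to take a single line from one of the files and test candidate delimiters against the expected field count of 7. The following is a minimal, self-contained sketch, not part of the original exercise. The sample line is an assumption modeled on the field list (replace it with a line previewed from your own copy of the data), and `PreviewNyseLine`, `detectDelimiter`, and `fitsInInt` are hypothetical names.

```scala
import java.util.regex.Pattern

object PreviewNyseLine {
  // Hypothetical sample line (an assumption); paste a real line from your preview here.
  val sampleLine: String = "A,20010101,54.75,54.75,54.75,54.75,0"

  // Candidate delimiters: comma, tab, pipe, and Ctrl-A (the Hive text default).
  val candidates: Seq[Char] = Seq(',', '\t', '|', '\u0001')

  // Count the fields obtained by splitting on a literal delimiter.
  // Pattern.quote avoids regex surprises with characters such as '|'.
  def fieldCount(line: String, delimiter: Char): Int =
    line.split(Pattern.quote(delimiter.toString), -1).length

  // Return the first candidate that yields exactly `expectedFields` fields.
  def detectDelimiter(line: String, expectedFields: Int = 7): Option[Char] =
    candidates.find(d => fieldCount(line, d) == expectedFields)

  // Check whether a value fits into a 32-bit INT (maximum 2,147,483,647).
  def fitsInInt(value: Long): Boolean =
    value >= Int.MinValue && value <= Int.MaxValue
}
```

For the sample line, `PreviewNyseLine.detectDelimiter(PreviewNyseLine.sampleLine)` returns `Some(',')`. A hypothetical volume of `3000000000L` fails `fitsInInt`, which illustrates why `BIGINT` is the safer choice for the volume column.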

In [3]:
%%sql 
 DROP DATABASE IF EXISTS itv002461_nyse

Waiting for a Spark session to start...

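Magic sql failed to execute with error: 
org.apache.hadoop.hive.ql.metadata.HiveException: InvalidOperationException(message:Database itv002461_nyse is not empty. One or more tables exist.);

The drop fails because the database still contains tables from an earlier run. To drop the database together with its tables, use `DROP DATABASE IF EXISTS itv002461_nyse CASCADE`.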

In [None]:
%%sql

CREATE DATABASE IF NOT EXISTS itv002461_nyse

In [4]:
%%sql
USE itv002461_nyse 


In [5]:
%%sql 
DROP TABLE IF EXISTS nyse_eod


In [6]:
%%sql 
CREATE TABLE IF NOT EXISTS nyse_eod(
    stockticker STRING,
    tradedate INT,
    openprice FLOAT,
    highprice FLOAT,
    lowprice FLOAT,
    closeprice FLOAT,
    volume BIGINT  
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ','


In [7]:
%%sql 
LOAD DATA LOCAL INPATH '/home/itv002461/tables' OVERWRITE INTO TABLE nyse_eod
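
The `CREATE TABLE` statement in this solution declares `FIELDS TERMINATED BY ','`, whereas the exercise asks for a managed table with the default delimiter. One common way to reconcile the two is to load the files into a staging table that matches the file delimiter, then copy the rows into a target table created without a `ROW FORMAT` clause. The sketch below only builds the SQL statements. It assumes the files are comma-delimited (as the statement above suggests); the staging table name `nyse_eod_stage`, the object `StagingLoad`, and the path passed in are hypothetical.

```scala
object StagingLoad {
  // Column definitions shared by the staging table and the target table.
  val columns: Seq[String] = Seq(
    "stockticker STRING",
    "tradedate INT",
    "openprice FLOAT",
    "highprice FLOAT",
    "lowprice FLOAT",
    "closeprice FLOAT",
    "volume BIGINT"
  )

  private val columnList: String = columns.mkString(", ")

  // Build the statements in execution order.
  // `stage` is a hypothetical staging table name; `localPath` is a placeholder.
  def statements(db: String, localPath: String, stage: String = "nyse_eod_stage"): Seq[String] = Seq(
    s"USE $db",
    // Staging table: delimiter matches the files (assumed to be a comma).
    s"CREATE TABLE IF NOT EXISTS $stage ($columnList) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','",
    s"LOAD DATA LOCAL INPATH '$localPath' OVERWRITE INTO TABLE $stage",
    // Target table: no ROW FORMAT clause, so the default delimiter applies.
    s"CREATE TABLE IF NOT EXISTS nyse_eod ($columnList) STORED AS TEXTFILE",
    // Rewrite the rows into the target table using its own delimiter.
    s"INSERT OVERWRITE TABLE nyse_eod SELECT * FROM $stage",
    s"DROP TABLE IF EXISTS $stage"
  )
}

// In the notebook, where `spark` and `username` are already defined:
// StagingLoad.statements(s"${username}_nyse", "/path/to/local/files").foreach(stmt => spark.sql(stmt))
```

Keeping the statements in a plain `Seq[String]` makes it easy to print and review them before running anything against the metastore.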



### Validation

Run the following queries to ensure that you will be able to read the data.

```
DESCRIBE FORMATTED YOUR_OS_USER_NAME_nyse.nyse_eod;
SELECT * FROM YOUR_OS_USER_NAME_nyse.nyse_eod LIMIT 10;
SELECT count(1) FROM YOUR_OS_USER_NAME_nyse.nyse_eod;
```

In [8]:
// The output should not list a field delimiter, as the requirement is to use the default delimiter
spark.sql("DESCRIBE FORMATTED itv002461_nyse.nyse_eod").show(200, false)

+----------------------------+---------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                        |comment|
+----------------------------+---------------------------------------------------------------------------------+-------+
|stockticker                 |string                                                                           |null   |
|tradedate                   |int                                                                              |null   |
|openprice                   |float                                                                            |null   |
|highprice                   |float                                                                            |null   |
|lowprice                    |float                                                                            |null   |
|closeprice                  |fl

In [9]:
%%sql

SELECT * FROM itv002461_nyse.nyse_eod LIMIT 10


+-----------+---------+---------+---------+--------+----------+------+
|stockticker|tradedate|openprice|highprice|lowprice|closeprice|volume|
+-----------+---------+---------+---------+--------+----------+------+
|          A| 20010101|    54.75|    54.75|   54.75|     54.75|     0|
|         AA| 20010101|    100.5|    100.5|   100.5|     100.5|     0|
|        ABB| 20010101|    20.25|    20.25|   20.25|     20.25|     0|
|        ABC| 20010101|   12.625|   12.625|  12.625|    12.625|     0|
|        ABM| 20010101|    15.31|    15.31|   15.31|     15.31|     0|
|        ABT| 20010101|    48.44|    48.44|   48.44|     48.44|     0|
|        ABX| 20010101|    16.38|    16.38|   16.38|     16.38|     0|
|        ACP| 20010101|     8.94|     8.94|    8.94|      8.94|     0|
|        ACV| 20010101|    28.54|    28.54|   28.54|     28.54|     0|
|        ADC| 20010101|    13.75|    13.75|   13.75|     13.75|     0|
+-----------+---------+---------+---------+--------+----------+------+



In [10]:
%%sql

SELECT count(1) FROM itv002461_nyse.nyse_eod

+--------+
|count(1)|
+--------+
| 9384739|
+--------+
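
Beyond the row count, it can help to check for NULLs. With Hive text tables, a line that does not contain the table's delimiter typically lands entirely in the first column and leaves the remaining columns NULL, so a non-zero count of such rows hints at a delimiter mismatch. The sketch below builds that check as a query string; `ValidateLoad` and `nullCheckQuery` are hypothetical names, and the column list is taken from the exercise.

```scala
object ValidateLoad {
  // Column names from the exercise, in table order.
  val columns: Seq[String] = Seq(
    "stockticker", "tradedate", "openprice", "highprice",
    "lowprice", "closeprice", "volume"
  )

  // Build a query that counts rows where any column after the first is NULL.
  def nullCheckQuery(table: String): String = {
    val predicate = columns.tail.map(c => s"$c IS NULL").mkString(" OR ")
    s"SELECT count(1) AS suspect_rows FROM $table WHERE $predicate"
  }
}

// In the notebook, where `spark` and `username` are already defined:
// spark.sql(ValidateLoad.nullCheckQuery(s"${username}_nyse.nyse_eod")).show()
```

A result of 0 is consistent with every line having been split into 7 fields, although it does not by itself prove that each value was parsed into the intended type.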

