## Overview of Data Types

Let us get an overview of the data types we can use when creating tables with Spark SQL.

Let us start the Spark context for this notebook so that we can execute the code provided.

In [2]:
import org.apache.spark.sql.SparkSession

val username = System.getProperty("user.name")
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Managing Tables - Basic DDL and DML").
    master("yarn").
    getOrCreate

username = itv002461
spark = org.apache.spark.sql.SparkSession@52cbb2f4


org.apache.spark.sql.SparkSession@52cbb2f4

If you prefer the command line, you can run Spark SQL using one of the following three approaches.

**Using Spark SQL**

```
spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Scala**

```
spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using PySpark**

```
pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Syntactically, Hive and Spark SQL are almost the same.
* Go to this [Hive page](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL) and review the supported data types.
* Spark Metastore supports all the standard data types.
  * Numeric - INT, BIGINT, FLOAT, etc.
  * Alphanumeric or String - CHAR, VARCHAR, STRING
  * Date and Timestamp - DATE, TIMESTAMP
  * Special Data Types - ARRAY, STRUCT, etc.
  * Boolean - BOOLEAN
* If the table is stored as a text file and has columns of special types, we need to specify the other clauses under ROW FORMAT DELIMITED (such as COLLECTION ITEMS TERMINATED BY), unless the default delimiters are acceptable.
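
To see how these type names map onto Spark's own type classes, here is a minimal sketch. It assumes a Spark 2.x shell or notebook where `StructType.fromDDL` is available, and it only parses a column list without creating any table. The columns mirror part of the `students` table used later in this section.

```scala
import org.apache.spark.sql.types._

// Parse a DDL-style column list into a Spark schema (no table is created).
val schema = StructType.fromDDL(
  """student_id INT,
    |student_first_name STRING,
    |student_phone_numbers ARRAY<STRING>,
    |student_address STRUCT<street:STRING, city:STRING>""".stripMargin)

// Print every column along with the type it resolved to.
schema.printTreeString()

// Simple and special types are represented by different classes.
println(schema("student_id").dataType)            // a numeric type
println(schema("student_phone_numbers").dataType) // an array of strings
```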

In [3]:
%%sql

DROP DATABASE IF EXISTS itv002461_sms CASCADE

++
||
++
++



In [4]:
%%sql

CREATE DATABASE IF NOT EXISTS itv002461_sms

++
||
++
++



In [5]:
%%sql

USE itv002461_sms

++
||
++
++



In [6]:
%%sql

CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_numbers ARRAY<STRING>,
    student_address STRUCT<street:STRING, city:STRING, state:STRING, zip:STRING>
) STORED AS TEXTFILE
ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','

++
||
++
++



In [7]:
%%sql

DESCRIBE students

+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|          student_id|                 int|   null|
|  student_first_name|              string|   null|
|   student_last_name|              string|   null|
|student_phone_num...|       array<string>|   null|
|     student_address|struct<street:str...|   null|
+--------------------+--------------------+-------+
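
The output above truncates long column names and type definitions. As a small sketch, assuming the `spark` session and the `students` table from this notebook, the full definitions can be printed by turning off truncation or by printing the schema as a tree.

```scala
// Assumes the `spark` session and the `students` table created above.

// show(false) disables truncation of long cell values.
spark.sql("DESCRIBE students").show(false)

// printSchema lists nested fields of ARRAY and STRUCT columns on separate lines.
spark.table("students").printSchema()
```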



In [8]:
%%sql

INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)

++
||
++
++



In [9]:
%%sql

SELECT * FROM students

+----------+------------------+-----------------+---------------------+---------------+
|student_id|student_first_name|student_last_name|student_phone_numbers|student_address|
+----------+------------------+-----------------+---------------------+---------------+
|         1|             Scott|            Tiger|                 null|           null|
+----------+------------------+-----------------+---------------------+---------------+



In [10]:
%%sql

INSERT INTO students VALUES (2, 'Donald', 'Duck', ARRAY('1234567890', '2345678901'), NULL)

++
||
++
++



In [11]:
%%sql

SELECT * FROM students

+----------+------------------+-----------------+---------------------+---------------+
|student_id|student_first_name|student_last_name|student_phone_numbers|student_address|
+----------+------------------+-----------------+---------------------+---------------+
|         1|             Scott|            Tiger|                 null|           null|
|         2|            Donald|             Duck| [1234567890, 2345...|           null|
+----------+------------------+-----------------+---------------------+---------------+



In [12]:
%%sql

INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', ARRAY('1234567890', '2345678901'), STRUCT('A Street', 'One City', 'Some State', '12345')),
    (4, 'Bubble', 'Guppy', ARRAY('5678901234', '6789012345'), STRUCT('Bubbly Street', 'Guppy', 'La la land', '45678'))

++
||
++
++



In [13]:
%%sql

SELECT * FROM students

+----------+------------------+-----------------+---------------------+--------------------+
|student_id|student_first_name|student_last_name|student_phone_numbers|     student_address|
+----------+------------------+-----------------+---------------------+--------------------+
|         1|             Scott|            Tiger|                 null|                null|
|         3|            Mickey|            Mouse| [1234567890, 2345...|[A Street, One Ci...|
|         2|            Donald|             Duck| [1234567890, 2345...|                null|
|         4|            Bubble|            Guppy| [5678901234, 6789...|[Bubbly Street, G...|
+----------+------------------+-----------------+---------------------+--------------------+
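
Once the data is loaded, individual array elements and struct fields can be read directly in a query. The following is a minimal sketch, assuming the `spark` session and the `students` table populated above; the aliases `first_phone` and `city` are only illustrative.

```scala
// Assumes the `spark` session and the `students` table populated above.
// student_phone_numbers[0] reads an array element by zero-based index.
// student_address.city reads a struct field by name.
val contacts = spark.sql("""
SELECT student_id,
       student_phone_numbers[0] AS first_phone,
       student_address.city AS city
FROM students
ORDER BY student_id
""")

// Rows whose array or struct is NULL produce NULL in the derived columns.
contacts.show(false)
```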



In [14]:
val username = System.getProperty("user.name")

username = itv002461


itv002461

In [15]:
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_sms.db/students".!

Found 4 items
-rwxr-xr-x   3 itv002461 supergroup         20 2022-05-27 01:20 /user/itv002461/warehouse/itv002461_sms.db/students/part-00000-14259197-9477-42d2-93de-d30919e2f1a6-c000
-rwxr-xr-x   3 itv002461 supergroup         72 2022-05-27 01:20 /user/itv002461/warehouse/itv002461_sms.db/students/part-00000-21ad5ceb-9c6a-4e89-9205-2fb2ad524009-c000
-rwxr-xr-x   3 itv002461 supergroup         39 2022-05-27 01:20 /user/itv002461/warehouse/itv002461_sms.db/students/part-00000-e235990f-7cae-48f9-b3f3-a5998840b946-c000
-rwxr-xr-x   3 itv002461 supergroup         74 2022-05-27 01:20 /user/itv002461/warehouse/itv002461_sms.db/students/part-00001-21ad5ceb-9c6a-4e89-9205-2fb2ad524009-c000




0

In [16]:
s"hdfs dfs -cat /user/${username}/warehouse/${username}_sms.db/students/*".!

1	Scott	Tiger	\N	\N
3	Mickey	Mouse	1234567890,2345678901	A Street,One City,Some State,12345
2	Donald	Duck	1234567890,2345678901	\N
4	Bubble	Guppy	5678901234,6789012345	Bubbly Street,Guppy,La la land,45678




0
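
The file contents above show how the delimiters from the `CREATE TABLE` statement are applied: tabs separate the columns, commas separate both array items and struct fields, and `\N` stands for NULL. The sketch below reproduces that layout in plain Scala to make it concrete. `StudentLine` and `parseStudentLine` are hypothetical names used only for illustration, and this is not how Spark itself reads the files.

```scala
// Hypothetical model and helper for illustration; neither is part of Spark or Hive.
case class StudentLine(
    id: Int,
    firstName: String,
    lastName: String,
    phoneNumbers: Option[List[String]],
    address: Option[List[String]])

def parseStudentLine(line: String): StudentLine = {
  // FIELDS TERMINATED BY '\t': columns are separated by tabs.
  val fields = line.split("\t", -1)
  // COLLECTION ITEMS TERMINATED BY ',': array items and struct fields
  // are separated by commas. The text format writes NULL as \N.
  def items(value: String): Option[List[String]] =
    if (value == "\\N") None else Some(value.split(",", -1).toList)
  StudentLine(fields(0).toInt, fields(1), fields(2), items(fields(3)), items(fields(4)))
}

val mickey = parseStudentLine(
  "3\tMickey\tMouse\t1234567890,2345678901\tA Street,One City,Some State,12345")
println(mickey.phoneNumbers)       // Some(List(1234567890, 2345678901))
println(mickey.address.map(_(1)))  // Some(One City)

val scott = parseStudentLine("1\tScott\tTiger\t\\N\t\\N")
println(scott.phoneNumbers)        // None
```

Because array items and struct fields share the same delimiter, a value that itself contains a comma (for example, a street name) would be split in the wrong place, which is one reason to pick the delimiters with the data in mind.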

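* Using Spark SQL with Python or Scala - the same statements can be passed as strings to `spark.sql`. The cells below use a database named `itversity_sms`.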

In [None]:
spark.sql("DROP DATABASE IF EXISTS itversity_sms CASCADE")

In [None]:
spark.sql("CREATE DATABASE IF NOT EXISTS itversity_sms")

In [None]:
spark.sql("USE itversity_sms")

In [None]:
spark.sql("DROP TABLE IF EXISTS students")

In [None]:
spark.sql("""
CREATE TABLE students (
    student_id INT,
    student_first_name STRING,
    student_last_name STRING,
    student_phone_numbers ARRAY<STRING>,
    student_address STRUCT<street:STRING, city:STRING, state:STRING, zip:STRING>
) STORED AS TEXTFILE
ROW FORMAT
    DELIMITED FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':'
""")

In [None]:
spark.sql("INSERT INTO students VALUES (1, 'Scott', 'Tiger', NULL, NULL)")

In [None]:
spark.sql("INSERT INTO students VALUES (2, 'Donald', 'Duck', ARRAY('1234567890', '2345678901'), NULL)")

In [None]:
spark.sql("""
INSERT INTO students VALUES 
    (3, 'Mickey', 'Mouse', ARRAY('1234567890', '2345678901'), STRUCT('A Street', 'One City', 'Some State', '12345')),
    (4, 'Bubble', 'Guppy', ARRAY('5678901234', '6789012345'), STRUCT('Bubbly Street', 'Guppy', 'La la land', '45678'))
""")

In [None]:
spark.sql("SELECT * FROM students").show()