# ELT With Spark
Querying files directly by path:  
```sql
select * from file_format.`/path/to/file`
```

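`file_format` can be `csv`, `json`, `parquet`, etc.

The `text` format reads each line as a raw string, which is useful when the file might be corrupted:

```sql
select * from text.`/path/to/file` -- when file might be corrupted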
```
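
As a small illustration of the direct-query syntax above, the sketch below builds the statement from a format name and a path. The helper name `direct_file_query` is hypothetical, and the snippet only constructs the SQL string; running it would need an active session (assumed here to be called `spark`).

```python
def direct_file_query(file_format: str, path: str) -> str:
    """Build a `SELECT * FROM <format>.`<path>`` statement for Spark SQL."""
    if not file_format.isidentifier():
        raise ValueError(f"unexpected file format: {file_format!r}")
    # Backticks delimit the path, so a literal backtick inside it is doubled
    escaped = path.replace("`", "``")
    return f"SELECT * FROM {file_format}.`{escaped}`"


print(direct_file_query("json", "/path/to/file"))
```

With a session available, the result could be passed straight to `spark.sql(...)`, for example `spark.sql(direct_file_query("text", "/path/to/file")).show()`.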

CTAS statement:  

```sql
CREATE TABLE table_name
AS SELECT * FROM file_format.`/path/to/file`
```

CTAS does not support manual schema declaration, so it is useful only when the source has a well-defined schema of its own. It also does not support file options (headers, separators, etc.).


The following form supports options, but no data is moved: the table is external and only references the source. It is a non-Delta table, so there is no time travel and none of Delta's performance benefits.
```sql
CREATE TABLE table_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (key1 = val1, key2 = val2, ...)
LOCATION 'path';

-- example (JDBC sources take their schema from the remote table)
CREATE TABLE table_name
USING JDBC
OPTIONS (
  url = "",
  dbtable = "",
  user = "",
  password = ""
);
```

To get Delta's performance benefits, first create a `TEMP VIEW` over the external source, then load it into a Delta table:
```sql
CREATE TEMP VIEW temp_view_name (col_name1 col_type1, ...)
USING data_source
OPTIONS (path = "/path/to/file", ...);

-- creates the Delta table (Delta is the default on Databricks; elsewhere add USING DELTA)
CREATE TABLE table_name
AS SELECT * FROM temp_view_name;
```
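
The same two-step idea (read the external source with an explicit schema and options, then materialize it as Delta) can be sketched with the DataFrame API. This is a minimal sketch, not the only way to do it: the function name `load_as_delta` and its parameters are hypothetical, and it assumes `spark` is a Delta-enabled `SparkSession`.

```python
def load_as_delta(spark, source_format, path, schema_ddl, table_name, **options):
    """Read an external source with explicit schema/options and save it as a Delta table."""
    df = (
        spark.read.format(source_format)
        .options(**options)     # e.g. header="true", delimiter=";"
        .schema(schema_ddl)     # DDL string such as "Year INT, Origin STRING"
        .load(path)
    )
    # Unlike the external table, this step copies the data into Delta format
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return df
```

A call might look like `load_as_delta(spark, "csv", "/path/to/file", "id INT, name STRING", "table_name", header="true")`, where the path and schema are placeholders.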

In [1]:
```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Enable the Delta Lake SQL extension and catalog on the session builder
builder = (
    SparkSession.builder.appName('delta-tutorial')
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

io.delta#delta-core_2.12 added as a dependency
	found io.delta#delta-core_2.12;2.3.0 in central
	found io.delta#delta-storage;2.3.0 in central
	found org.antlr#antlr4-runtime;4.8 in central


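In [None]:
```python
# Direct file queries work poorly unless the format carries its own schema:
# CSV columns come back as _c0, _c1, ... and the header row is read as data
spark.sql('''
select *, input_file_name() as source_file from csv.`/storage/data/airline_2m.csv`
''').limit(10).show()
```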

In [14]:
```python
# Keep a few columns and write them back out as CSV (Spark writes a directory of part files)
(
    spark.read.option('header', 'true').csv('/storage/data/airline_2m.csv')
    .select('Year', 'Month', 'FlightDate', 'Reporting_Airline', 'Origin', 'Dest')
    .write.mode('overwrite').csv('/storage/data/airlines_2m_small.csv')
)
```

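In [16]:
```python
spark.sql('DROP TABLE IF EXISTS airline')
# NOTE: the earlier cell wrote to airlines_2m_small.csv; the path used here
# (airline_2m_small.csv) does not exist, which explains the warning in the output
spark.sql('''
CREATE TABLE airline
  (FlightDate timestamp, Reporting_Airline string, Flight_Number_Reporting_Airline int, Origin string, Dest string, DepTime int, DepDelay double, ArrTime int, ArrDelay double)
USING CSV
OPTIONS (
  header="true",
  delimiter=";"
)
LOCATION "/storage/data/airline_2m_small.csv"
''')
```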

23/06/28 08:41:14 WARN HadoopFSUtils: The directory file:/storage/data/airline_2m_small.csv was not found. Was it deleted very recently?


DataFrame[]

In [22]:
```python
# No data moved during table creation: the location is still the external CSV path
# and all metadata is stored in the metastore
spark.sql('describe extended airline').show(truncate=False)
```

+-------------------------------+-----------------------------------------+-------+
|col_name                       |data_type                                |comment|
+-------------------------------+-----------------------------------------+-------+
|FlightDate                     |timestamp                                |null   |
|Reporting_Airline              |string                                   |null   |
|Flight_Number_Reporting_Airline|int                                      |null   |
|Origin                         |string                                   |null   |
|Dest                           |string                                   |null   |
|DepTime                        |int                                      |null   |
|DepDelay                       |double                                   |null   |
|ArrTime                        |int                                      |null   |
|ArrDelay                       |double                                   |n
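
To check the "external, non-Delta" claim programmatically rather than by reading the full listing, one option is to pick out a few detail rows from `DESCRIBE EXTENDED`. The sketch below assumes the output contains rows named `Type`, `Provider` and `Location` in its `col_name` column; the helper name `table_storage_info` is hypothetical.

```python
def table_storage_info(spark, table_name):
    """Return the Type, Provider and Location rows of DESCRIBE EXTENDED as a dict."""
    wanted = {"Type", "Provider", "Location"}
    rows = spark.sql(f"DESCRIBE EXTENDED {table_name}").collect()
    # Each row has col_name / data_type / comment; detail rows reuse the same columns
    return {r["col_name"]: r["data_type"] for r in rows if r["col_name"] in wanted}
```

For the `airline` table defined with `USING CSV` and a `LOCATION`, `table_storage_info(spark, "airline")` would be expected to show a CSV provider and the source path as the location, though this was not run against the notebook above.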

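In [27]:
```python
spark.sql('DROP TABLE IF EXISTS airline')
# No provider (USING ...) is given, so Spark tries to create a Hive serde table,
# which fails on a session without Hive support (see the error in the output)
spark.sql('''
CREATE TABLE airline
AS SELECT * FROM
csv.`/storage/data/airline_2m.csv`
''')
```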

23/06/28 08:47:02 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.


AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);
'CreateTable `default`.`airline`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- Project [_c0#2155, _c1#2156, _c2#2157, _c3#2158, _c4#2159, _c5#2160, _c6#2161, _c7#2162, _c8#2163, _c9#2164, _c10#2165, _c11#2166, _c12#2167, _c13#2168, _c14#2169, _c15#2170, _c16#2171, _c17#2172, _c18#2173, _c19#2174, _c20#2175, _c21#2176, _c22#2177, _c23#2178, ... 85 more fields]
   +- Relation [_c0#2155,_c1#2156,_c2#2157,_c3#2158,_c4#2159,_c5#2160,_c6#2161,_c7#2162,_c8#2163,_c9#2164,_c10#2165,_c11#2166,_c12#2167,_c13#2168,_c14#2169,_c15#2170,_c16#2171,_c17#2172,_c18#2173,_c19#2174,_c20#2175,_c21#2176,_c22#2177,_c23#2178,... 85 more fields] csv
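
The error above arises because the CTAS names no table provider, so a Hive serde table is attempted. One likely way around it on a Delta-enabled session is to state the provider explicitly with `USING DELTA`. The sketch below only builds such a statement; the helper name `ctas_with_provider` is hypothetical, and the resulting statement has not been run against this dataset.

```python
def ctas_with_provider(table_name, file_format, path, provider="DELTA"):
    """Build a CTAS statement that names its table provider explicitly."""
    return (
        f"CREATE TABLE {table_name}\n"
        f"USING {provider}\n"
        f"AS SELECT * FROM {file_format}.`{path}`"
    )


print(ctas_with_provider("airline", "csv", "/storage/data/airline_2m.csv"))
```

The string could then be passed to `spark.sql(...)`. Note that CTAS still cannot take CSV options, so the columns would keep the `_c0`, `_c1`, ... names seen in the plan above.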


In [13]:
```python
# Read the JSON file as raw text: one row per line, in a single `value` column
spark.sql('''
select * from text.`/storage/data/electric-vehicle.json`
''').show(truncate=False)
```

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                              |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{                                                                                                                                                                                                                  |
|  "meta" : {                                                                                                                                   