<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data with Spark - Tables

**Technical Accomplishments:**
- Demonstrate how to pre-register data sources in the Hive data warehouse.
- Introduce temporary views over files.
- Read data from tables/views.

## Getting Started

Let's start by importing libraries and creating some useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

s3 = boto3.client('s3')
baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder 
    .master("local[*]")
    .appName("test")
    .getOrCreate()
        )
qcutils.init_spark_session(spark)

## Registering Tables

So far we've seen purely programmatic methods for reading in data.

Spark allows us to "register" the equivalent of "tables" so that they can be easily accessed by all users. 

## Register a Table/View
* An in-memory DataFrame can be easily persisted as a table using the `saveAsTable(...)` method.
* In our case we are going to persist the file `wikipedia_pageviews_by_second.parquet`, a Parquet file we are already able to read.

In [None]:
parquetFile = baseUri + "wikipedia_pageviews_by_second.parquet"
df = spark.read.parquet(parquetFile)
df.write.saveAsTable("pageviews_by_second")

**NOTE** If a table already exists, you must specify the writing `mode`; otherwise the write fails. More details in the Writing Data notebook.

The writing `mode` can be `overwrite`, `append`, `ignore`, or `errorifexists` (also accepted as `error`; this is the default).

In [None]:
df.write.mode("overwrite").saveAsTable("pageviews_by_second")

## Reading from a Table/View

We can now read in the "table" **pageviews_by_second** as a `DataFrame` with one simple command (and then print the schema):

In [None]:
pageviewsBySecondsExampleDF = spark.read.table("pageviews_by_second")

pageviewsBySecondsExampleDF.printSchema()

And of course we can now view that data as well:

In [None]:
pageviewsBySecondsExampleDF

### Review: Reading from Tables
* No job is executed - the schema is stored in the table definition in the local Hive data warehouse.
* The data types shown here are those we defined when we registered the table.
* In our case, the file was stored locally on the cluster.
* The "registration" of the table simply makes future access, or access by multiple users, easier.
* The users of the notebook cannot see usernames and passwords, secret keys, tokens, etc.

## Temporary Views

Tables that are loadable by the call `spark.read.table(..)` are also accessible through the SQL APIs.

In [None]:
spark.sql("select * from pageviews_by_second limit 5")

You can also take an existing `DataFrame` and register it as a view, exposing it as a table to the SQL API.

If you recall from earlier, we have an instance called `pageviewsBySecondsExampleDF`.

We can create a temporary view with this call:

In [None]:
# create a temporary view from the resulting DataFrame
pageviewsBySecondsExampleDF.createOrReplaceTempView("temp_pageviews_by_second")

And now we can use the SQL API to reference that same `DataFrame` as the table **temp_pageviews_by_second**.

In [None]:
spark.sql("select * from temp_pageviews_by_second limit 5")

**Final Notes** 

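The method `createOrReplaceTempView(..)` is bound to the `SparkSession`, meaning the view is discarded once the session ends.

On the other hand, the method `createOrReplaceGlobalTempView(..)` is bound to the Spark application. A global temporary view is registered in the `global_temp` database and must be referenced as `global_temp.<view_name>`.

Or to put that another way, I can use a view created with `createOrReplaceTempView(..)` in this notebook only. However, I can call `createOrReplaceGlobalTempView(..)` in this notebook and then access the view from another notebook, provided both notebooks share the same Spark application.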

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.