<div style="text-align: center; line-height: 0; padding-top: 2px;">
  <img src="https://www.quantiaconsulting.com/logos/quantia_logo_orizz.png" alt="Quantia Consulting" style="width: 600px; height: 250px">
</div>

# ![Spark Logo Tiny](https://www.quantiaconsulting.com/logos/logo_spark_tiny.png) Reading Data with Spark - CSV Files

**Technical Accomplishments:**
- Start working with the API documentation
- Introduce the class `DataFrameReader`
- Read data from:
  * CSV without a Schema.
  * CSV with a Schema.

## Getting Started

Let's start by importing libraries and creating some useful variables.

In [None]:
%load_ext autotime

import os
import qcutils
from pyspark.sql import SparkSession
import boto3
import io

baseUri = "s3a://quantia-master/training/"

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.5 pyspark-shell'

spark = (SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)
qcutils.init_spark_session(spark)

spark

## Reading from CSV w/InferSchema

We are going to start by reading in a very simple text file.

### The Data Source
* For this exercise, we will be using a tab-separated file called **wikipedia_pageviews_by_second.tsv** (<a href="https://datahub.io/en/dataset/english-wikipedia-pageviews-by-second" target="_blank">255 MB</a> file from Wikipedia)

In [None]:
qcutils.list_s3_bucket_objects(limit=10)

In [None]:
qcutils.print_s3_bucket_object(key='training/wikipedia_pageviews_by_second.tsv')

We can use `qcutils.print_s3_bucket_object(...)` to peek at the first couple thousand characters of the file.

There are a couple of things to note here:
* The file has a header.
* The file is tab separated (we can infer that from the file extension and the lack of other characters between each "column").
* The first two columns are strings and the third is a number.
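The same observations can be checked without Spark. The sketch below uses Python's standard `csv` module on a small in-memory sample; the sample rows are made up for illustration and only mimic the layout described above.

```python
import csv
import io

# Hypothetical sample standing in for the first lines of the TSV file;
# the values are illustrative, not copied from the real data set.
sample = (
    "timestamp\tsite\trequests\n"
    "2015-03-16T00:09:55\tmobile\t1595\n"
    "2015-03-16T00:10:39\tmobile\t1544\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)      # first line: the column names
first_row = next(reader)   # second line: the first data record

print(header)
print(first_row)
```

Splitting on the tab character yields three fields per line, and the first line holds names rather than data.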

Knowing those details, we can read in the "CSV" file.

### Step #1 - Read The CSV File
Let's start with the bare minimum by specifying the tab character as the delimiter and the location of the file:

In [None]:
# A reference to our tab-separated file
csvFile = baseUri + "wikipedia_pageviews_by_second.tsv"

tempDF = (spark.read           # The DataFrameReader
   .option("sep", "\t")        # Use tab delimiter (default is comma-separator)
   .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)

In [None]:
tempDF

This is guaranteed to <u>trigger one job</u>.

A *Job* is triggered anytime we are "physically" __required to touch the data__.

In some cases, __one action may create multiple jobs__ (multiple reasons to touch the data).

In this case, the reader has to __"peek" at the first line__ of the file to determine how many columns of data we have.

We can see the structure of the `DataFrame` by calling the method `printSchema()`.

It prints to the console the name of each column, its data type, and whether it is nullable.

**Note:** We will be covering the other `DataFrame` functions in other notebooks.

In [None]:
tempDF.printSchema()

We can see from the schema that...
* there are three columns
* the column names are **_c0**, **_c1**, and **_c2** (automatically generated names)
* all three columns are **strings**
* all three columns are **nullable**
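To make the naming rule concrete, here is a minimal pure-Python sketch of how positional names such as `_c0` can be derived from a single line. The function `default_column_names` is hypothetical and is not Spark's implementation; it only mirrors the behavior observed above.

```python
def default_column_names(first_line, sep=","):
    """Generate positional names (_c0, _c1, ...) for each field in a line."""
    field_count = len(first_line.rstrip("\n").split(sep))
    return [f"_c{i}" for i in range(field_count)]

# Illustrative first line of a tab-separated file
first_line = "timestamp\tsite\trequests\n"

names_with_tab = default_column_names(first_line, sep="\t")
names_with_comma = default_column_names(first_line)  # wrong delimiter on purpose

print(names_with_tab)
print(names_with_comma)
```

With the wrong delimiter the whole line is treated as a single field, which is why the `sep` option matters for a tab-separated file.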

And if we take a quick peek at the data, we can see that line #1 contains the headers and not data:

In [None]:
tempDF.show(5)

### Step #2 - Use the File's Header
Next, we can add an option that tells the reader that the data contains a header and to use that header to determine our column names.

**NOTE:** *We know we have a header based on what we saw in the "head" of the file earlier.*

In [None]:
tempDF2 = (spark.read          # The DataFrameReader
   .option("sep", "\t")        # Use tab delimiter (default is comma-separator)
   .option("header", "true")   # Use first line of all files as header
   .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)

In [None]:
tempDF2.printSchema()

A couple of notes about this iteration:
* again, only one job
* there are three columns
* all three columns are **strings**
* all three columns are **nullable**
* the column names are specified: **timestamp**, **site**, and **requests** (the change we were looking for)

A "peek" at the first line of the file is all that the reader needs to determine the number of columns and the name of each column.

Before going on, make a note of the duration of the previous call - it should be just under 3 seconds.

### Step #3 - Infer the Schema

Lastly, we can add an option that tells the reader to infer each column's data type (that is, the schema).

In [None]:
(spark.read                        # The DataFrameReader
   .option("header", "true")       # Use first line of all files as header
   .option("sep", "\t")            # Use tab delimiter (default is comma-separator)
   .option("inferSchema", "true")  # Automatically infer data types
   .csv(csvFile)                   # Creates a DataFrame from CSV after reading in the file
   .printSchema()
)

### Review: Reading CSV w/InferSchema
* we still have three columns
* all three columns are still **nullable**
* all three columns have their proper names
* two jobs were executed (not one as in the previous example)
* our three columns now have distinct data types:
  * **timestamp** == **timestamp**
  * **site** == **string**
  * **requests** == **integer**

**Question:** Why were there two jobs?

**Question:** How long did the last job take?

**Question:** Why did it take so much longer?

Discuss...
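As input to the discussion, the following pure-Python sketch shows why type inference needs to look at the data itself and not just the header. It is an illustration under simplifying assumptions (three candidate types, made-up rows), not Spark's inference algorithm.

```python
from datetime import datetime

def infer_value_type(value):
    """Classify one raw string as 'integer', 'timestamp', or 'string'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(value)
        return "timestamp"
    except ValueError:
        return "string"

def merge_types(current, new):
    """Keep a type only if every value seen so far agrees; otherwise fall back to string."""
    if current is None or current == new:
        return new
    return "string"

def infer_schema(rows):
    """Visit every row and narrow down one type per column."""
    column_types = None
    for row in rows:
        if column_types is None:
            column_types = [None] * len(row)
        column_types = [
            merge_types(current, infer_value_type(value))
            for current, value in zip(column_types, row)
        ]
    return column_types

# Made-up rows in the same layout as the pageviews file
rows = [
    ["2015-03-16T00:09:55", "mobile", "1595"],
    ["2015-03-16T00:10:39", "mobile", "1544"],
    ["2015-03-16T00:19:39", "desktop", "2460"],
]
inferred = infer_schema(rows)
print(inferred)
```

The loop in `infer_schema` touches every row, because a single non-numeric value anywhere in a column would change that column's type.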

## Reading from CSV w/User-Defined Schema

This time we are going to read the same file.

The difference here is that we are going to define the schema beforehand and hopefully avoid the execution of any extra jobs.

### Step #1
Declare the schema.

This is just a list of field names and data types.

In [None]:
# Import only the types used in the schema below
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

csvSchema = StructType([
  StructField("timestamp", StringType(), nullable=False),
  StructField("site", StringType(), nullable=False),
  StructField("requests", IntegerType(), nullable=False)
])

### Step #2
Read in our data (and print the schema).

We can specify the schema, or rather the `StructType`, with the `schema(..)` method:

In [None]:
tempDF3 = (spark.read         # The DataFrameReader
  .option('header', 'true')   # Ignore line #1 - it's a header
  .option('sep', "\t")        # Use tab delimiter (default is comma-separator)
  .schema(csvSchema)          # Use the specified schema
  .csv(csvFile)               # Creates a DataFrame from CSV after reading in the file
)

In [None]:
tempDF3.printSchema()

In [None]:
tempDF3.show(5)

### Review: Reading CSV w/ User-Defined Schema
* We still have three columns
* All three columns are **NOT** nullable because we declared them as such.
* All three columns have their proper names
* Zero jobs were executed
* Our three columns now have distinct data types:
  * **timestamp** == **string**
  * **site** == **string**
  * **requests** == **integer**

**Question:** What is different about the data types of these columns compared to the previous exercise, and why?

**Question:** Do I need to indicate that the file has a header?

**Question:** Do the declared column names need to match the columns in the header of the TSV file?

Discuss...

For a list of all the options related to reading CSV files, please see the documentation for `DataFrameReader.csv(..)`.
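In PySpark, the options used in this notebook can also be passed to `csv(..)` as keyword arguments instead of chained `option(..)` calls. The sketch below assumes a PySpark version that accepts a DDL-formatted schema string (2.3 or later, to the best of our knowledge); the helper `read_pageviews` and the constant `PAGEVIEWS_DDL` are hypothetical names introduced for illustration.

```python
# DDL-formatted schema string with the same column names and types as the
# StructType declared earlier; nullability is not expressed in this form.
PAGEVIEWS_DDL = "timestamp STRING, site STRING, requests INT"

def read_pageviews(spark, path):
    """Read the tab-separated pageviews file using keyword arguments.

    `spark` is an existing SparkSession; `path` is the file location.
    """
    return spark.read.csv(
        path,
        sep="\t",              # tab delimiter
        header=True,           # first line holds the column names
        schema=PAGEVIEWS_DDL,  # user-defined schema, so no inference pass
    )

# Usage in this notebook (not executed here):
# tempDF4 = read_pageviews(spark, csvFile)
```

The keyword form keeps all reader settings in one call, which can be easier to scan when many options are set.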

### A note on the time to count and sum requests when the source is a CSV file

In [None]:
spark.read.option('header', 'true').option('sep', "\t").schema(csvSchema).csv(csvFile).count()

It took approximately 3 seconds. Make a note of it: we will compare it with the time needed to read a Parquet file.

In [None]:
df = spark.read.option('header', 'true').option('sep', "\t").schema(csvSchema).csv(csvFile)
df.select(df.requests).groupBy().sum()

It took approximately 9 seconds. Make a note of this as well for the comparison with a Parquet file.
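This notebook relies on the `autotime` extension to report durations. Outside a notebook, a small helper like the one below could record timings for the later comparison; `timed` is a hypothetical name, and the workload shown is only a stand-in for a Spark action.

```python
import time

def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in this notebook it could be `lambda: df.count()`.
result, elapsed = timed(lambda: sum(range(1_000_000)))
print(f"result={result}, elapsed={elapsed:.3f}s")
```

Storing the elapsed values for the CSV runs makes it straightforward to compare them with the Parquet runs afterwards.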

##### ![Quantia Tiny Logo](https://www.quantiaconsulting.com/logos/quantia_logo_tiny.png) 2020 Quantia Consulting, srl. All rights reserved.