# Reading Data Lab
* The goal of this lab is to put into practice what you have learned about reading data with Apache Spark.
* At the bottom of this notebook are additional cells that will help verify that your work is accurate.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Instructions
0. Start with the file **dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv**, a file you have not worked with yet.
0. Read in the data and assign it to a `DataFrame` named **testDF**.
0. Run the last cell to verify that the data was loaded correctly and to print its schema.

**Note:** Data types:
 * **prev_id**: integer
 * **curr_id**: integer
 * **n**: integer
 * **prev_title**: string
 * **curr_title**: string
 * **type**: string

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run "../Includes/Classroom Setup"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Show Your Work

### The Data Source
* For this exercise, we will be using a tab-separated file called **2015_02_clickstream.tsv**.
* We can use **%fs head ...** to display the first part of the file.

In [0]:
%fs head "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"
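The output of `%fs head` answers two questions that drive the reader options: which character separates the fields, and whether the first line is a header. The following is a minimal sketch of that inspection in plain Python. The helper `inspect_head` and the sample text are hypothetical: the column names come from this lab, but the data row is invented for illustration and is not taken from the real file.

```python
import csv
import io

# Hypothetical stand-in for the text that %fs head prints.
# Column names are from this lab; the data row is invented for illustration.
sample_head = (
    "prev_id\tcurr_id\tn\tprev_title\tcurr_title\ttype\n"
    "1\t2\t3\tPage_A\tPage_B\tlink\n"
)

def inspect_head(text, candidates=("\t", ",", ";", "|")):
    """Guess the delimiter from the first line and return it with the parsed rows."""
    first_line = text.splitlines()[0]
    # Choose the candidate that occurs most often in the first line.
    delimiter = max(candidates, key=first_line.count)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    return delimiter, rows

delimiter, rows = inspect_head(sample_head)
header, first_row = rows[0], rows[1]

print(repr(delimiter))  # the separator to pass as the "sep" option
print(header)           # non-numeric names in the first line suggest a header row
```

A tab delimiter and a first line of column names correspond to the `sep` and `header` options used when reading the file.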

In [0]:
# Required for StructType, StructField, and the column data types
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Data Source
tsvFile = "dbfs:/mnt/training/wikipedia/clickstream/2015_02_clickstream.tsv"

# Predefined Schema
tsvSchema = StructType([
  StructField("prev_id", IntegerType(), True),
  StructField("curr_id", IntegerType(), True),
  StructField("n", IntegerType(), True),
  StructField("prev_title", StringType(), True),
  StructField("curr_title", StringType(), True),
  StructField("type", StringType(), True)
])

testDF = (spark.read
        .option("sep", "\t")
        .option("header", "true")
        .schema(tsvSchema)
        .csv(tsvFile))
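As an alternative to `StructType`, the schema can be written as a DDL-formatted string, which `DataFrameReader.schema` accepts in recent Spark versions. The sketch below builds that string from the data-type list in the instructions. The names `schema_fields`, `to_ddl`, and `tsv_schema_ddl` are hypothetical, and the sketch assumes your Spark version supports DDL strings in `.schema(...)`.

```python
# Column names and Spark SQL type names, taken from the data-type list in the instructions.
schema_fields = [
    ("prev_id", "INT"),
    ("curr_id", "INT"),
    ("n", "INT"),
    ("prev_title", "STRING"),
    ("curr_title", "STRING"),
    ("type", "STRING"),
]

def to_ddl(fields):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    # Backticks quote each column name so that a name such as `type`
    # is never parsed as a keyword.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)

tsv_schema_ddl = to_ddl(schema_fields)
print(tsv_schema_ddl)
```

Under that assumption, passing `tsv_schema_ddl` to `.schema(...)` in place of `tsvSchema` should yield the same six columns.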

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Verify Your Work
Run the following cell to verify that your `DataFrame` was created properly.

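**Remember:** Because the schema is supplied up front, Spark does not need to scan the file to infer column types, so this should execute without triggering a single job.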

In [0]:
testDF.printSchema()

columns = testDF.dtypes
assert len(columns) == 6, "Expected 6 columns but found " + str(len(columns))

assert columns[0][0] == "prev_id",    "Expected column 0 to be \"prev_id\" but found \"" + columns[0][0] + "\"."
assert columns[0][1] == "int",        "Expected column 0 to be of type \"int\" but found \"" + columns[0][1] + "\"."

assert columns[1][0] == "curr_id",    "Expected column 1 to be \"curr_id\" but found \"" + columns[1][0] + "\"."
assert columns[1][1] == "int",        "Expected column 1 to be of type \"int\" but found \"" + columns[1][1] + "\"."

assert columns[2][0] == "n",          "Expected column 2 to be \"n\" but found \"" + columns[2][0] + "\"."
assert columns[2][1] == "int",        "Expected column 2 to be of type \"int\" but found \"" + columns[2][1] + "\"."

assert columns[3][0] == "prev_title", "Expected column 3 to be \"prev_title\" but found \"" + columns[3][0] + "\"."
assert columns[3][1] == "string",     "Expected column 3 to be of type \"string\" but found \"" + columns[3][1] + "\"."

assert columns[4][0] == "curr_title", "Expected column 4 to be \"curr_title\" but found \"" + columns[4][0] + "\"."
assert columns[4][1] == "string",     "Expected column 4 to be of type \"string\" but found \"" + columns[4][1] + "\"."

assert columns[5][0] == "type",       "Expected column 5 to be \"type\" but found \"" + columns[5][0] + "\"."
assert columns[5][1] == "string",     "Expected column 5 to be of type \"string\" but found \"" + columns[5][1] + "\"."

print("Congratulations, all tests passed... that is if no jobs were triggered :-)\n")
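The assertions above compare each `(name, type)` pair in `testDF.dtypes` against the expected value one at a time. A more compact sketch of the same check collects every mismatch in one pass. The helper `dtype_mismatches` is hypothetical, and a plain list stands in for `testDF.dtypes` so that the sketch runs without Spark.

```python
def dtype_mismatches(actual, expected):
    """Return messages describing where actual (name, type) pairs differ from expected."""
    problems = []
    if len(actual) != len(expected):
        problems.append(f"Expected {len(expected)} columns but found {len(actual)}")
    # zip stops at the shorter list, so only columns present in both are compared.
    for i, ((name, dtype), (exp_name, exp_dtype)) in enumerate(zip(actual, expected)):
        if name != exp_name:
            problems.append(f'Column {i}: expected name "{exp_name}" but found "{name}"')
        if dtype != exp_dtype:
            problems.append(f'Column {i}: expected type "{exp_dtype}" but found "{dtype}"')
    return problems

expected_dtypes = [
    ("prev_id", "int"),
    ("curr_id", "int"),
    ("n", "int"),
    ("prev_title", "string"),
    ("curr_title", "string"),
    ("type", "string"),
]

# In the notebook, the first argument would be testDF.dtypes.
print(dtype_mismatches(expected_dtypes, expected_dtypes))
```

An empty list means every column name and type matched; otherwise each message identifies the column that differs.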
