# `DataFrame`

* Are among the most efficient Spark APIs, thanks to the Catalyst optimizer
* Are available in all of Spark's supported languages
* Represent a table of data with rows and columns
* Are analogous to a spreadsheet or a database table
* Are distributed, spanning multiple machines
* Are the easiest API to use, particularly for non-functional programmers

## The `SparkSession`

* Most jobs that Spark runs require the `SparkSession`
* The `SparkSession` is the entry point to programming Spark with the `Dataset` and `DataFrame` API
* In REPL and notebook environments, it is already assigned to the value `spark`

In [31]:
spark

res16: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@b7597ab
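
Outside a REPL or notebook, no `spark` value exists yet, so the session has to be built by hand. The following is a minimal sketch, assuming local mode and a made-up application name (`dataframe-intro`); neither comes from the original notebook.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode with a made-up application name; a cluster deployment
// would normally supply the master through spark-submit instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataframe-intro")
  .getOrCreate() // returns the existing session if one is already running

val sparkVersion = spark.version
println(s"Running Spark $sparkVersion")
```

Because `getOrCreate()` reuses a running session, the same lines are harmless when pasted into a notebook that already defines `spark`.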


## Creating a Range

* `range` is a method on the `SparkSession` that returns a `Dataset`
* A `DataFrame` is actually a `Dataset[Row]`, where `Row` represents one row of data
* We will discuss `Row` and `Dataset` later

In [36]:
val dataset = spark.range(1, 100)

dataset: org.apache.spark.sql.Dataset[Long] = [id: bigint]


## Changing a `Dataset[Long]` to a `DataFrame`

In [38]:
val dataFrame = dataset.toDF("myRange")

dataFrame: org.apache.spark.sql.DataFrame = [myRange: bigint]
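
The two steps above, creating a range and renaming its column with `toDF`, can be combined in a short sketch. It assumes a local session for standalone runs, and the value names (`numbers`, `byFives`, `numbersDF`) are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for standalone runs; in a notebook,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("range-sketch").getOrCreate()

// range(start, end) excludes `end`, so this holds 1 through 99.
val numbers = spark.range(1, 100)
// The three-argument overload adds a step: 0, 5, 10, ..., 95.
val byFives = spark.range(0, 100, 5)

// toDF renames the single `id` column and returns a DataFrame (a Dataset[Row]).
val numbersDF = numbers.toDF("myRange")

val rowCount = numbersDF.count()          // 99 rows
val stepCount = byFives.count()           // 20 rows
val columnNames = numbersDF.columns.toSeq // Seq("myRange")
```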


## Making a `DataFrame` from the `RDD`

## `show()`

* Prints the rows of the `DataFrame` as a table
* Shows 20 rows by default
* Pass a row count to show more or fewer rows, for example `show(40)`

In [39]:
dataFrame.show(40)

+-------+
|myRange|
+-------+
|      1|
|      2|
|      3|
|      4|
|      5|
|      6|
|      7|
|      8|
|      9|
|     10|
|     11|
|     12|
|     13|
|     14|
|     15|
|     16|
|     17|
|     18|
|     19|
|     20|
|     21|
|     22|
|     23|
|     24|
|     25|
|     26|
|     27|
|     28|
|     29|
|     30|
|     31|
|     32|
|     33|
|     34|
|     35|
|     36|
|     37|
|     38|
|     39|
|     40|
+-------+
only showing top 40 rows



The following examples use the Goodreads books dataset, downloaded from Kaggle: https://www.kaggle.com/jealousleopard/goodreadsbooks

## `spark.read`
* Reads data from a filesystem
* You should specify the file format, for example `format("csv")`
* The data preferably lives on a distributed file system such as HDFS
* Uses `load` to load the data from a location, for example:
  * Use `"hdfs://"` to load from HDFS
  * Use `"s3a://"` to load from S3 on AWS
* Here we will use the local file system
* **Question:** What is wrong with the results? Hint: You may want to view the [data file](../data/books.csv)

In [40]:
val booksDF = spark.read
                   .format("csv")
                   .load("../data/books.csv")
booksDF.show()

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|   _c0|                 _c1|                 _c2|           _c3|       _c4|          _c5|          _c6|        _c7|          _c8|               _c9|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|        652|      1944099|             26249|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|        870|      1996446|             27613|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          

booksDF: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 8 more fields]
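
As a side note on `format` and `load`, the reader also offers a `csv` shortcut method that does the same job. The sketch below is an illustration under assumptions: it writes a small hypothetical file to a temporary location rather than using `books.csv`, and it passes the location as a `file://` URI, the same position an `hdfs://` or `s3a://` URI would take.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("read-sketch").getOrCreate()

// Hypothetical sample file, written to a temp location so the sketch is self-contained.
val samplePath = Files.createTempFile("sample", ".csv")
Files.write(samplePath, "id,name\n1,alpha\n2,beta\n".getBytes("UTF-8"))

// A file:// URI for the local file system.
val location = samplePath.toUri.toString

// Explicit format plus load, as in the cell above.
val viaLoad = spark.read.format("csv").load(location)
// Shortcut method for the same format.
val viaShortcut = spark.read.csv(location)

val sameSchema = viaLoad.schema == viaShortcut.schema
val loadedColumns = viaLoad.columns.toSeq
```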


## Schema

* By default, the schema is assumed from the structure of the data
* We can view the schema of a `DataFrame` by calling `printSchema`
* A schema is a `StructType` made up of a number of fields called `StructField`s
* A `StructField` has:
  * A `name`
  * A `type`
  * A `boolean` that specifies whether the column is nullable
* A schema can also contain other `StructType`s (Spark complex types)
* The schema can also be overridden by your own custom schema, which is preferred for production
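
To make the `StructType` and `StructField` relationship concrete, here is a small sketch that builds a schema by hand and inspects it. The schema itself (`personSchema`, with a nested `addressType`) is hypothetical and is not part of the books dataset.

```scala
import org.apache.spark.sql.types._

// Hypothetical nested type: a StructType used as the type of another field.
val addressType = StructType(Seq(
  StructField("city", StringType, nullable = true),
  StructField("zip", StringType, nullable = true)))

// Hypothetical top-level schema: each StructField has a name, a type, and a nullable flag.
val personSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  StructField("address", addressType, nullable = true)))

val fieldNames = personSchema.fieldNames.toSeq
val ageField = personSchema("age") // look up a StructField by name
val nestedNames =
  personSchema("address").dataType.asInstanceOf[StructType].fieldNames.toSeq

// Prints the same tree layout that printSchema() shows for a DataFrame.
println(personSchema.treeString)
```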

## Fixing the schema

* View the schema of the `DataFrame` read above by calling `printSchema()`
* This shows the name, type, and nullability of each column
* **Question:** What do you think is wrong with the schema that was determined?

In [41]:
booksDF.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)



## Inferring the schema and bringing in the header

* Setting the option `inferSchema` lets Spark set the schema based on the data
* Setting the option `header` makes the first row the header

In [42]:
val booksDF = spark.read.format("csv")
                     .option("inferSchema", "true")
                     .option("header", "true")
                     .load("../data/books.csv")

booksDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]


In [43]:
booksDF.show(5)

+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|               title|             authors|average_rating|      isbn|       isbn13|language_code|# num_pages|ratings_count|text_reviews_count|
+------+--------------------+--------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|     1|Harry Potter and ...|J.K. Rowling-Mary...|          4.56|0439785960|9780439785969|          eng|        652|      1944099|             26249|
|     2|Harry Potter and ...|J.K. Rowling-Mary...|          4.49|0439358078|9780439358071|          eng|        870|      1996446|             27613|
|     3|Harry Potter and ...|J.K. Rowling-Mary...|          4.47|0439554934|9780439554930|          eng|        320|      5629932|             70390|
|     4|Harry Potter and ...|        J.K. Rowling|          4.41|0439554896|9780439554893|          

## `show` for the title looks cramped for space

* `show` has some other signatures that are worth investigating
* The [API](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset) lists the following signatures
   * `show(numRows: Int, truncate: Int, vertical: Boolean): Unit`
   * `show(numRows: Int, truncate: Int): Unit`
   * `show(numRows: Int, truncate: Boolean): Unit`
   * `show(truncate: Boolean): Unit`
   * `show(): Unit`
   * `show(numRows: Int): Unit`
* `numRows` is the number of rows you wish to show
* `truncate` as a `Boolean`: if `true`, strings longer than 20 characters are truncated; if `false`, the full text is shown
* `truncate` as an `Int`: if greater than `0`, strings are truncated to `truncate` characters and all cells are right-aligned
* `vertical = true` prints each record as a vertical list of fields for easier viewing. Let's try some of these in turn using `booksDF`

In [44]:
booksDF.show(numRows=30, truncate=false)

+------+------------------------------------------------------------------------------------------------------------+---------------------------------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|bookID|title                                                                                                       |authors                                      |average_rating|isbn      |isbn13       |language_code|# num_pages|ratings_count|text_reviews_count|
+------+------------------------------------------------------------------------------------------------------------+---------------------------------------------+--------------+----------+-------------+-------------+-----------+-------------+------------------+
|1     |Harry Potter and the Half-Blood Prince (Harry Potter  #6)                                                   |J.K. Rowling-Mary GrandPré                   |4.56          |0439785960|9780439785969|eng     




In [45]:
booksDF.show(numRows=30, vertical=true, truncate=30)

-RECORD 0--------------------------------------------
 bookID             | 1                              
 title              | Harry Potter and the Half-B... 
 authors            | J.K. Rowling-Mary GrandPré     
 average_rating     | 4.56                           
 isbn               | 0439785960                     
 isbn13             | 9780439785969                  
 language_code      | eng                            
 # num_pages        | 652                            
 ratings_count      | 1944099                        
 text_reviews_count | 26249                          
-RECORD 1--------------------------------------------
 bookID             | 2                              
 title              | Harry Potter and the Order ... 
 authors            | J.K. Rowling-Mary GrandPré     
 average_rating     | 4.49                           
 isbn               | 0439358078                     
 isbn13             | 9780439358071                  
 language_code      | eng   

 isbn13             | 9780618391004                  
 language_code      | eng                            
 # num_pages        | 218                            
 ratings_count      | 18934                          
 text_reviews_count | 43                             
-RECORD 28-------------------------------------------
 bookID             | 37                             
 title              | The Lord of the Rings: Comp... 
 authors            | Jude Fisher                    
 average_rating     | 4.50                           
 isbn               | 0618510826                     
 isbn13             | 9780618510825                  
 language_code      | eng                            
 # num_pages        | 224                            
 ratings_count      | 343                            
 text_reviews_count | 6                              
-RECORD 29-------------------------------------------
 bookID             | 38                             
 title              | The Lo
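
The remaining `show` signatures can be tried the same way. This sketch uses a tiny hypothetical `DataFrame` (`titlesDF`) instead of `booksDF` so that it runs on its own; the session setup is an assumption for standalone runs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("show-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data with one long string to make truncation visible.
val titlesDF = Seq(
  (1, "A fairly long title that will not fit in twenty characters"),
  (2, "Short title")
).toDF("id", "title")

titlesDF.show()            // up to 20 rows, long strings cut to 20 characters
titlesDF.show(1)           // first row only
titlesDF.show(false)       // up to 20 rows, no truncation
titlesDF.show(2, 15)       // two rows, strings cut to 15 characters
titlesDF.show(2, 15, true) // same, printed one field per line

val shownRows = titlesDF.count()
```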

## Spark Data Types

In Spark, it helps to know the available data types so that we can interpret or cast columns correctly. Here is the list of data types. All types in the **Spark Type** column are in the `org.apache.spark.sql.types` package. The latest types API [can be found here](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html)


| Spark Type             | Scala Type                 | Scala API       |
| -----------------------| :-------------------------:| ---------------:|
| `ByteType`             | `Byte`                     | `ByteType`      |
| `ShortType`            | `Short`                    | `ShortType`     |
| `IntegerType`          | `Int`                      | `IntegerType`   |
| `LongType`             | `Long`                     | `LongType`      |
| `FloatType`            | `Float`                    | `FloatType`     |
| `DoubleType`           | `Double`                   | `DoubleType`    |
| `DecimalType`          | `java.math.BigDecimal`     | `DecimalType`   |
| `StringType`           | `String`                   | `StringType`    |
| `BinaryType`           | `Array[Byte]`              | `BinaryType`    |
| `TimestampType`        | `java.sql.Timestamp`       | `TimestampType` |
| `DateType`             | `java.sql.Date`            | `DateType`      |
| `ArrayType`            | `scala.collection.Seq`     | `ArrayType`     |
| `MapType`              | `scala.collection.Map`     | `MapType`       |
| `StructType`           | `org.apache.spark.sql.Row` | `StructType`    |
| `StructField`          | Scala type of the field's data type (for example, `Int` for `IntegerType`) | `StructField`   |
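
One common use of these types is casting a column that was read as a string. The sketch below assumes a small hypothetical `DataFrame` (`rawDF`) whose values mimic two columns of the books data; it is an illustration, not the notebook's own code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, IntegerType}

// Assumption: a local session for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
import spark.implicits._

// Hypothetical string-typed data, as a CSV read without inference would produce.
val rawDF = Seq(("4.56", "652"), ("4.49", "870")).toDF("average_rating", "num_pages")

// cast takes a Spark type from the table above and converts the column.
val typedDF = rawDF
  .withColumn("average_rating", col("average_rating").cast(DoubleType))
  .withColumn("num_pages", col("num_pages").cast(IntegerType))

val typeNames = typedDF.schema.fields.map(_.dataType).toSeq
val totalPages = typedDF.collect().map(_.getAs[Int]("num_pages")).sum // 652 + 870
```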

## Explicitly Setting our Schema

First, let's take a look at the inferred schema

In [46]:
booksDF.printSchema()

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- # num_pages: string (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- text_reviews_count: integer (nullable = true)



In [47]:
import org.apache.spark.sql.types._
// Each StructField takes a column name, a Spark type, and a nullable flag
val bookSchema = new StructType(Array(
   new StructField("bookID", IntegerType, false),
   new StructField("title", StringType, false),
   new StructField("authors", StringType, false),
   new StructField("average_rating", FloatType, false),
   new StructField("isbn", StringType, false),
   new StructField("isbn13", StringType, false),
   new StructField("language_code", StringType, false),
   // The CSV header calls this column "# num_pages"; the schema gives it a cleaner name
   new StructField("num_pages", IntegerType, false),
   new StructField("ratings_count", IntegerType, false),
   new StructField("text_reviews_count", IntegerType, false)))

import org.apache.spark.sql.types._
bookSchema: org.apache.spark.sql.types.StructType = StructType(StructField(bookID,IntegerType,false), StructField(title,StringType,false), StructField(authors,StringType,false), StructField(average_rating,FloatType,false), StructField(isbn,StringType,false), StructField(isbn13,StringType,false), StructField(language_code,StringType,false), StructField(num_pages,IntegerType,false), StructField(ratings_count,IntegerType,false), StructField(text_reviews_count,IntegerType,false))


## Reading the Data Again with a Schema

In [48]:
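val booksSchemaDF = spark.read.format("csv")
                         .schema(bookSchema)       // explicit schema, so inferSchema is not needed
                         .option("header", "true") // treat the first row as the header, not data
                         .load("../data/books.csv")
// The printed schema reports nullable = true even though bookSchema says false:
// file-based sources read every column as nullable
booksSchemaDF.printSchema()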

root
 |-- bookID: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- authors: string (nullable = true)
 |-- average_rating: float (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- num_pages: integer (nullable = true)
 |-- ratings_count: integer (nullable = true)
 |-- text_reviews_count: integer (nullable = true)



booksSchemaDF: org.apache.spark.sql.DataFrame = [bookID: int, title: string ... 8 more fields]
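
A schema can also be passed to the reader as a DDL-style string instead of a `StructType`. The sketch below assumes Spark 2.3 or later, where `schema` accepts a string, and uses a hypothetical two-row file in place of `books.csv`.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session for standalone runs, on Spark 2.3 or later.
val spark = SparkSession.builder().master("local[*]").appName("ddl-sketch").getOrCreate()

// Hypothetical two-row file standing in for books.csv.
val samplePath = Files.createTempFile("books-sample", ".csv")
Files.write(samplePath, "bookID,# num_pages\n1,652\n2,870\n".getBytes("UTF-8"))

// The DDL string declares the column names and types in one line.
val ddlDF = spark.read.format("csv")
  .schema("bookID INT, num_pages INT")
  .option("header", "true")
  .load(samplePath.toUri.toString)

// Column names come from the supplied schema rather than from the file's header.
val ddlColumns = ddlDF.columns.toSeq
val totalPages = ddlDF.collect().map(_.getInt(1)).sum // 652 + 870
```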


## Creating a Custom DataFrame

## Creating from an `RDD`

* Reminder: an `RDD` (Resilient Distributed Dataset) is often less performant than its `DataFrame`, `Dataset`, and Spark SQL counterparts
* It is still useful and can be used to create a `DataFrame` in the first place, especially with `parallelize`
* `parallelize` is a factory method on an object called the `SparkContext`
* `SparkContext` was used extensively in the 1.x versions of Spark
* It can be obtained from the `SparkSession` through its `sparkContext` method

In [54]:
import org.apache.spark.sql.Row

val sparkContext = spark.sparkContext
// Each Row holds the values of one record; null marks a missing middle name
val rddRows = sparkContext.parallelize(Seq(Row("Abe", null, "Lincoln", 40000),
           Row("Martin", "Luther", "King", 80000),
           Row("Ben", null, "Franklin", 82000),
           Row("Toni", null, "Morrisson", 82000)))


sparkContext: org.apache.spark.SparkContext = org.apache.spark.SparkContext@5c909ebd
rddRows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[93] at parallelize at <console>:48


In [59]:
val employeeSchema = new StructType(Array(
      StructField("firstName", StringType, nullable = false),
      StructField("middleName", StringType, nullable = true),
      StructField("lastName", StringType, nullable = false),
      StructField("salaryPerYear", IntegerType, nullable = false)
    ))

val dataFrame = spark.createDataFrame(rddRows, employeeSchema)
dataFrame.show()

+---------+----------+---------+-------------+
|firstName|middleName| lastName|salaryPerYear|
+---------+----------+---------+-------------+
|      Abe|      null|  Lincoln|        40000|
|   Martin|    Luther|     King|        80000|
|      Ben|      null| Franklin|        82000|
|     Toni|      null|Morrisson|        82000|
+---------+----------+---------+-------------+



employeeSchema: org.apache.spark.sql.types.StructType = StructType(StructField(firstName,StringType,false), StructField(middleName,StringType,true), StructField(lastName,StringType,false), StructField(salaryPerYear,IntegerType,false))
dataFrame: org.apache.spark.sql.DataFrame = [firstName: string, middleName: string ... 2 more fields]
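
For small, in-memory data there is also a route that skips the `RDD` and the explicit schema: calling `toDF` on a local `Seq` of tuples. This is a sketch of that alternative, not the approach used in the cells above; the value names are illustrative and the session setup is an assumption for standalone runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for standalone runs.
val spark = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()
import spark.implicits._

// Option[String] models the nullable middle name; an empty Option becomes null.
val employeesDF = Seq(
  ("Abe", Option.empty[String], "Lincoln", 40000),
  ("Martin", Option("Luther"), "King", 80000)
).toDF("firstName", "middleName", "lastName", "salaryPerYear")

employeesDF.show()

// Count the rows whose middle name is missing.
val withoutMiddleName = employeesDF.filter(col("middleName").isNull).count()
```

Here the column types are derived from the tuple's element types, whereas `createDataFrame` with a `StructType` lets you state them explicitly.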


## Lab: Read Wine Data

For our lab, we will be reading data from the Wine Data Set at Kaggle: https://www.kaggle.com/zynicide/wine-reviews. It has already been downloaded and made part of your notebook in the `data` directory.

**Step 1:** Read the wine data from `../data/winemag.csv`, first without setting headers or inferring the schema

**Step 2:** See what you can glean from the data

**Step 3:** Apply headers, infer the schema

**Step 4:** `show` some of the data

**Step 5:** Apply your own schema