## **Spark**

In [1]:
spark

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.15:4041
SparkContext available as 'sc' (version = 2.4.4, master = local[*], app id = local-1588681607572)
SparkSession available as 'spark'


res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74007749


## **Transformations**

In [2]:
val myRange = spark.range(1000).toDF("number")

myRange: org.apache.spark.sql.DataFrame = [number: bigint]


In [3]:
myRange.show(2)

+------+
|number|
+------+
|     0|
|     1|
+------+
only showing top 2 rows



In [4]:
val divisBy2 = myRange.where("number % 2 = 0")

divisBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]


## **Action**

In [10]:
divisBy2.count()

res7: Long = 500
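
The cells above rely on Spark's lazy evaluation: `where` only extends the query plan, and nothing runs until an action such as `count` is called. The following is a minimal sketch of that distinction, not part of the original notebook. The names `session`, `multiplesOfTen`, and `howMany` are made up for illustration, and `getOrCreate()` is assumed to return the notebook's existing session when one is running.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: reuses the notebook's session if one is already running.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-evaluation-sketch")
  .getOrCreate()

// Transformations: each call only extends the logical plan; no job runs yet.
val numbers = session.range(1000).toDF("number")
val evens = numbers.where("number % 2 = 0")
val multiplesOfTen = evens.where("number % 5 = 0")

// explain() prints the physical plan without executing the query.
multiplesOfTen.explain()

// Action: count() is what actually triggers a job.
val howMany = multiplesOfTen.count()
println(howMany) // 100 (the multiples of 10 between 0 and 990)
```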


## **Load data**

In [2]:
val flightData2015 = spark
  .read
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("../data/flight-data/csv/2015-summary.csv")

flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [4]:
flightData2015.take(3)

res0: Array[org.apache.spark.sql.Row] = Array([United States,Romania,15], [United States,Croatia,1], [United States,Ireland,344])


In [5]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [6]:
flightData2015.createOrReplaceTempView("flight_data_2015")

In [17]:
flightData2015.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



## **SQL way + dataFrame way**

In [20]:
val sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
val dataFrameWay = flightData2015.groupBy('DEST_COUNTRY_NAME).count()

sqlWay.explain 
dataFrameWay.explain

== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
== Physical Plan ==
*(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[count(1)])
+- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
   +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_count(1)])
      +- *(1) FileScan csv [DEST_COUNTRY_NAME#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAM

sqlWay: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, count(1): bigint]
dataFrameWay: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, count: bigint]


In [22]:
//sql
spark.sql("SELECT max(count) from flight_data_2015").take(1)

res9: Array[org.apache.spark.sql.Row] = Array([370002])


In [26]:
//dataframe
import org.apache.spark.sql.functions.max 
flightData2015.select(max("count")).take(1)

import org.apache.spark.sql.functions.max
res11: Array[org.apache.spark.sql.Row] = Array([370002])


In [27]:
//sql
val maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.collect()

maxSql: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, destination_total: bigint]
res12: Array[org.apache.spark.sql.Row] = Array([United States,411352], [Canada,8399], [Mexico,7140], [United Kingdom,2025], [Japan,1548])


In [28]:
//dataframe
import org.apache.spark.sql.functions.desc
flightData2015
.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total") 
.sort(desc("destination_total"))
.limit(5)
.collect()

import org.apache.spark.sql.functions.desc
res13: Array[org.apache.spark.sql.Row] = Array([United States,411352], [Canada,8399], [Mexico,7140], [United Kingdom,2025], [Japan,1548])
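
As a possible variation on the DataFrame version above, the aggregate can be named directly with `agg(sum(...).as(...))` instead of renaming the generated `sum(count)` column afterwards. This is a sketch only: it uses the five rows printed by `flightData2015.show(5)` as an in-memory sample rather than the full CSV, and `session`, `sampleFlights`, and `topDestinations` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

// Assumption: reuses the notebook's session if one is already running.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("agg-alias-sketch")
  .getOrCreate()
import session.implicits._

// Sample rows copied from the show(5) output earlier in the notebook.
val sampleFlights = Seq(
  ("United States", "Romania", 15),
  ("United States", "Croatia", 1),
  ("United States", "Ireland", 344),
  ("Egypt", "United States", 15),
  ("United States", "India", 62)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// Name the aggregate column in the same step that computes it.
val topDestinations = sampleFlights
  .groupBy("DEST_COUNTRY_NAME")
  .agg(sum("count").as("destination_total"))
  .sort(desc("destination_total"))
  .limit(5)
  .collect()

topDestinations.foreach(println)
```

On this five-row sample the totals naturally differ from those computed over the full file.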


## **set of instructions**

In [29]:
flightData2015
.groupBy("DEST_COUNTRY_NAME")
.sum("count")
.withColumnRenamed("sum(count)", "destination_total")
.sort(desc("destination_total"))
.limit(5)
.explain()

== Physical Plan ==
TakeOrderedAndProject(limit=5, orderBy=[destination_total#99L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#10,destination_total#99L])
+- *(2) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[sum(cast(count#12 as bigint))])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#10, 5)
      +- *(1) HashAggregate(keys=[DEST_COUNTRY_NAME#10], functions=[partial_sum(cast(count#12 as bigint))])
         +- *(1) FileScan csv [DEST_COUNTRY_NAME#10,count#12] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/erikapat/Dropbox/PRUEBAS_DATA_SCIENCE/SPARK/GIT_SPARK-PRACTICE-NOTE..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


## **class**

In [35]:
// A Scala case class (similar to a struct) that will automatically
// be mapped into a structured data table in Spark
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)
val flightsDF = spark.read.parquet("../data/flight-data/parquet/2010-summary.parquet/") 

org.apache.spark.sql.AnalysisException:  Unable to infer schema for Parquet. It must be specified manually.;

In [39]:
val flights = flightsDF.as[Flight]
flights.filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada").take(5)

<console>: 26: error: not found: value flightsDF
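
Both cells above failed: Spark could not infer a schema from the Parquet path, so `flightsDF` was never defined. To still illustrate the typed Dataset API, here is a minimal sketch that builds the Dataset from in-memory rows instead. `FlightRecord` is a hypothetical stand-in for `Flight` (with `count` declared as `Long` for simplicity), the rows are copied from the `show(5)` output earlier, and the filter condition is chosen only so that it removes a row from this sample.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for the notebook's Flight case class.
case class FlightRecord(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Long)

// Assumption: reuses the notebook's session if one is already running.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("typed-dataset-sketch")
  .getOrCreate()
import session.implicits._

// toDS() yields a Dataset[FlightRecord], so fields are checked at compile time.
val sampleFlights = Seq(
  FlightRecord("United States", "Romania", 15L),
  FlightRecord("United States", "Croatia", 1L),
  FlightRecord("United States", "Ireland", 344L),
  FlightRecord("Egypt", "United States", 15L),
  FlightRecord("United States", "India", 62L)
).toDS()

// The lambda receives FlightRecord objects rather than untyped Rows.
val notFromUS = sampleFlights
  .filter(flight => flight.ORIGIN_COUNTRY_NAME != "United States")
  .collect()

notFromUS.foreach(println)
```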

## **Retail data**

In [40]:
val staticDataFrame = spark.read.format("csv")
                        .option("header", "true")
                        .option("inferSchema", "true")
                        .load("../data/retail-data/by-day/*.csv")
staticDataFrame.createOrReplaceTempView("retail_data") 
val staticSchema = staticDataFrame.schema

staticDataFrame: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,TimestampType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))


In [47]:
import org.apache.spark.sql.functions.{window, column, desc, col}

staticDataFrame
    .selectExpr(
        "CustomerId",
        "(UnitPrice * Quantity) as total_cost",
        "InvoiceDate")
    .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
    .sum("total_cost")
    .show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   14075.0|[2011-12-05 01:00...|316.78000000000003|
|   18180.0|[2011-12-05 01:00...|            310.73|
|   15358.0|[2011-12-05 01:00...| 830.0600000000003|
|   15392.0|[2011-12-05 01:00...|304.40999999999997|
|   15290.0|[2011-12-05 01:00...|263.02000000000004|
+----------+--------------------+------------------+
only showing top 5 rows



import org.apache.spark.sql.functions.{window, column, desc, col}


## **readStream**

In [43]:
val streamingDataFrame = spark.readStream.schema(staticSchema) 
.option("maxFilesPerTrigger", 1) 
.format("csv")
.option("header", "true") 
.load("../data/retail-data/by-day/*.csv")

streamingDataFrame: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]


In [48]:
streamingDataFrame.isStreaming // returns true

res23: Boolean = true


In [45]:
val purchaseByCustomerPerHour = streamingDataFrame 
                                .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
                                .groupBy($"CustomerId", window($"InvoiceDate", "1 day")) 
                                .sum("total_cost")

purchaseByCustomerPerHour: org.apache.spark.sql.DataFrame = [CustomerId: double, window: struct<start: timestamp, end: timestamp> ... 1 more field]


In [46]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [49]:
purchaseByCustomerPerHour.writeStream
    .format("memory") // memory = store the results in an in-memory table
    .queryName("customer_purchases") // customer_purchases = name of the in-memory table
    .outputMode("complete") // complete = the whole result table is written out on every trigger
    .start()

res24: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@7eee33fb


In [50]:
spark.sql("""
SELECT *
FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""")
.show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|      null|[2011-11-14 01:00...|          55316.08|
|      null|[2011-03-29 02:00...| 33521.39999999998|
|      null|[2011-11-29 01:00...|23744.250000000055|
|      null|[2011-08-30 02:00...| 23032.59999999993|
|   15749.0|[2011-01-11 01:00...|           22998.4|
+----------+--------------------+------------------+
only showing top 5 rows



* Notice that as we read in more data, the composition of the table changes. Each new file may or may not change the results, depending on the data it contains. Because we are grouping by customer, we would expect the top customer purchase amounts to increase over time (and they do, for a while). Another option is to write the results to the console instead.

* Neither of these streaming sinks (memory or console) should be used in production, but they make for a convenient demonstration of Structured Streaming’s power. Notice that the window is built on event time, not on the time at which Spark processes the data. This was one of the shortcomings of Spark Streaming that Structured Streaming has resolved. We cover Structured Streaming in depth in Part V of this book.
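
To make the event-time point concrete, here is a small batch sketch (not a streaming job) showing that `window` buckets rows by the timestamp stored in the data, regardless of the order in which the rows arrive. The names `session`, `dayStart`, `events`, and `dailyTotals` and the sample values are made up for illustration. Timestamps are built from epoch seconds so that the grouping does not depend on the local time zone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, window}

// Assumption: reuses the notebook's session if one is already running.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("event-time-sketch")
  .getOrCreate()
import session.implicits._

// 2011-03-03 00:00:00 UTC, expressed as seconds since the epoch.
val dayStart = 1299110400L

// Rows are listed in "arrival" order, which deliberately differs from event order.
val events = Seq(
  (dayStart + 86400L + 60L, 2.5), // event on the following day, arrives first
  (dayStart + 3600L, 10.0),       // event on the first day
  (dayStart + 7200L, 5.0)         // event on the first day
).toDF("epochSeconds", "total_cost")
  .withColumn("InvoiceDate", col("epochSeconds").cast("timestamp"))

// One-day tumbling windows keyed on the event timestamp, as in the cells above.
val dailyTotals = events
  .groupBy(window(col("InvoiceDate"), "1 day"))
  .agg(sum("total_cost").as("daily_total"))
  .orderBy(col("window.start"))
  .select("daily_total")
  .collect()
  .map(_.getDouble(0))

println(dailyTotals.mkString(", ")) // 15.0, 2.5
```

The two first-day rows land in the same window even though a later event was listed before them, which is the behavior the streaming query relies on.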

In [51]:
purchaseByCustomerPerHour.writeStream 
    .format("console") 
    .queryName("customer_purchases_2") 
    .outputMode("complete")
    .start()

res26: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@30736cf9


-------------------------------------------
Batch: 0
-------------------------------------------
+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   14239.0|[2011-03-03 01:00...|             -56.1|
|   17700.0|[2011-03-03 01:00...| 602.6099999999999|
|   15932.0|[2011-03-03 01:00...|             -7.65|
|   16191.0|[2011-03-03 01:00...|             -1.65|
|   17646.0|[2011-03-03 01:00...|            345.85|
|   18041.0|[2011-03-03 01:00...|            148.49|
|   18102.0|[2011-03-03 01:00...|            1396.0|
|   13630.0|[2011-03-03 01:00...|             -14.4|
|   17652.0|[2011-03-03 01:00...|             222.3|
|   17567.0|[2011-03-03 01:00...|            535.38|
|   15596.0|[2011-03-03 01:00...|            303.03|
|   13476.0|[2011-03-03 01:00...| 727.5999999999999|
|   14524.0|[2011-03-03 01:00...|            210.05|
|   12500.0|[2011-03-03 01:00...|            249.84|
| 