##Chapter 1

1. What is big data?
- Big data refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods.
2. Why Spark?
- Spark is used for fast, interactive computation that runs in memory, enabling machine learning and data transformations to run quickly.
3. What is Spark?
- Spark is a general-purpose distributed processing system used for big data workloads.
4. What are Spark's high-level concepts? SparkSession, DataFrame, partitions, transformations, actions, lazy evaluation
- Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster.
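
The transformation/action split described above can be illustrated without Spark. The sketch below is a plain-Python toy, not Spark's implementation: `LazyDataset`, `double`, and `calls` are hypothetical names invented for this note. Transformations only record a step in a plan; the action (`collect`) is what runs the plan.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps

    # Transformations: return a new object with a longer plan, do no work.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._rows, self._plan + (("map", func),))

    # Action: walks the recorded plan and produces a result.
    def collect(self):
        out = []
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                out.append(row)
        return out


calls = []


def double(x):
    calls.append(x)  # records that work was actually done
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(double)
work_before_action = len(calls)  # nothing has executed yet
result = ds.collect()            # the action triggers execution
```

Because the whole plan is known before anything runs, an engine like Spark can reorder or combine steps before executing them; this toy simply runs them in the order given.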

##Chapter 2

In [0]:
from pyspark import SparkFiles


In [0]:
url = 'https://raw.githubusercontent.com/databricks/Spark-The-Definitive-Guide/master/data/flight-data/csv/2015-summary.csv'
spark.sparkContext.addFile(url)

In [0]:
df = spark.read.csv("file://"+SparkFiles.get('2015-summary.csv'),header=True,inferSchema=True,sep=',')

In [0]:
df.show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [0]:
schema = df.schema
schema

In [0]:
df.take(3)

Out[5]: [Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

In [0]:
df.sort("count").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#47 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#47 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=65]
      +- FileScan csv [DEST_COUNTRY_NAME#45,ORIGIN_COUNTRY_NAME#46,count#47] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/local_disk0/spark-4c41330a-e7fc-476e-84a9-fc010754f237/userFiles..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>
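
The `Exchange rangepartitioning(..., 200)` step in the plan above redistributes rows into 200 value ranges so that each partition can be sorted on its own and the partitions, read in order, are globally sorted. A minimal plain-Python sketch of that idea follows. The boundaries are hypothetical and hard-coded here, whereas Spark picks them by sampling the data, and only three partitions are used instead of 200.

```python
from bisect import bisect_left


def range_partition(values, boundaries):
    """Assign each value to a partition using sorted upper boundaries."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for v in values:
        partitions[bisect_left(boundaries, v)].append(v)
    return partitions


counts = [15, 1, 344, 15, 62]  # the `count` column from df.show(5)
boundaries = [10, 100]         # hypothetical split points: 3 partitions

parts = range_partition(counts, boundaries)
sorted_parts = [sorted(p) for p in parts]               # local sort per partition
globally_sorted = [v for p in sorted_parts for v in p]  # concatenate in order
```

The number of ranges is what `spark.sql.shuffle.partitions` controls, which is why the notebook lowers it from 200 for this small dataset.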




In [0]:
df.sort("count").show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



In [0]:
spark.conf.set("spark.sql.shuffle.partitions","5")

In [0]:
df.sort("count").show(5)

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|               Malta|      United States|    1|
|Saint Vincent and...|      United States|    1|
|       United States|            Croatia|    1|
|       United States|          Gibraltar|    1|
|       United States|          Singapore|    1|
+--------------------+-------------------+-----+
only showing top 5 rows



In [0]:
df.createOrReplaceTempView("sqldf")

In [0]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM sqldf
GROUP BY DEST_COUNTRY_NAME
""")
sqlWay.show(5)

+--------------------+--------+
|   DEST_COUNTRY_NAME|count(1)|
+--------------------+--------+
|             Moldova|       1|
|             Bolivia|       1|
|             Algeria|       1|
|Turks and Caicos ...|       1|
|            Pakistan|       1|
+--------------------+--------+
only showing top 5 rows



In [0]:
dataFrameWay = df.groupBy("DEST_COUNTRY_NAME").count()
dataFrameWay.show(5)

+--------------------+-----+
|   DEST_COUNTRY_NAME|count|
+--------------------+-----+
|             Moldova|    1|
|             Bolivia|    1|
|             Algeria|    1|
|Turks and Caicos ...|    1|
|            Pakistan|    1|
+--------------------+-----+
only showing top 5 rows



In [0]:
spark.sql("SELECT max(count) FROM sqldf").take(1)

Out[22]: [Row(max(count)=370002)]

In [0]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in max
df.select(F.max("count")).take(1)

Out[23]: [Row(max(count)=370002)]

In [0]:
df.groupBy('DEST_COUNTRY_NAME').sum('count').sort('sum(count)',ascending=False).withColumnRenamed('sum(count)','destination_total').limit(5).show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [0]:
maxsql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) AS destination_total
FROM sqldf
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxsql.show(5)

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+
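
As a sanity check on what the group-by, sum, sort, and limit chain computes, here is the same logic in plain Python over the five rows printed by `df.show(5)`. This is only a sketch of the semantics. Because it uses five rows rather than the full file, the totals differ from the output above.

```python
from collections import defaultdict

# (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count) rows copied from df.show(5)
rows = [
    ("United States", "Romania", 15),
    ("United States", "Croatia", 1),
    ("United States", "Ireland", 344),
    ("Egypt", "United States", 15),
    ("United States", "India", 62),
]

# GROUP BY DEST_COUNTRY_NAME with sum(count)
totals = defaultdict(int)
for dest, _origin, count in rows:
    totals[dest] += count

# ORDER BY sum(count) DESC LIMIT 5
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]
```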



##Chapter 3

In [0]:
#%fs rm -r dbfs:/FileStore/tables  #to delete folders from dbfs

In [0]:
static_retail = spark.read.csv("dbfs:/FileStore/by_day_retail_data/*.csv",header=True, inferSchema=True)

In [0]:
static_retail.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+
only showing top 5 rows



In [0]:
print(static_retail.count())
static_schema = static_retail.schema

541909


In [0]:
static_schema

Out[5]: StructType([StructField('InvoiceNo', StringType(), True), StructField('StockCode', StringType(), True), StructField('Description', StringType(), True), StructField('Quantity', IntegerType(), True), StructField('InvoiceDate', TimestampType(), True), StructField('UnitPrice', DoubleType(), True), StructField('CustomerID', DoubleType(), True), StructField('Country', StringType(), True)])

In [0]:
from pyspark.sql.functions import window, col
purchase_by_customer = static_retail.selectExpr("CustomerID", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")\
.groupby(col("CustomerID"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")
purchase_by_customer.show(5)

+----------+--------------------+-----------------+
|CustomerID|              window|  sum(total_cost)|
+----------+--------------------+-----------------+
|   16057.0|{2011-12-05 00:00...|            -37.6|
|   14126.0|{2011-11-29 00:00...|643.6300000000001|
|   13500.0|{2011-11-16 00:00...|497.9700000000001|
|   17160.0|{2011-11-08 00:00...|516.8499999999999|
|   15608.0|{2011-11-11 00:00...|            122.4|
+----------+--------------------+-----------------+
only showing top 5 rows



In [0]:
purchase_by_customer.schema

Out[51]: StructType([StructField('CustomerID', DoubleType(), True), StructField('window', StructType([StructField('start', TimestampType(), True), StructField('end', TimestampType(), True)]), False), StructField('sum(total_cost)', DoubleType(), True)])
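
The `window` column is a struct with a `start` and an `end`, as the schema shows. A rough sketch of how a fixed-size (tumbling) window can be derived from a timestamp is below. It assumes windows are aligned to the Unix epoch and ignores time zones; `tumbling_window` is a hypothetical helper, not a Spark function.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_window(ts, size):
    """Return the [start, end) window of length `size` that contains `ts`."""
    elapsed = ts - EPOCH
    start = EPOCH + (elapsed // size) * size  # round down to a window boundary
    return start, start + size


invoice = datetime(2011, 12, 5, 8, 38)  # an InvoiceDate from static_retail.show(5)
day_window = tumbling_window(invoice, timedelta(days=1))
hour_window = tumbling_window(invoice, timedelta(hours=1))
```

Every row whose timestamp falls in the same window gets the same `(start, end)` pair, which is what makes the window usable as a grouping key.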

In [0]:
purchase_by_customer.sort('CustomerID').sample(fraction=0.5).show(5,False)

+----------+------------------------------------------+------------------+
|CustomerID|window                                    |sum(total_cost)   |
+----------+------------------------------------------+------------------+
|null      |{2011-11-23 00:00:00, 2011-11-24 00:00:00}|8124.250000000004 |
|null      |{2011-12-08 00:00:00, 2011-12-09 00:00:00}|31975.590000000007|
|null      |{2011-11-04 00:00:00, 2011-11-05 00:00:00}|6878.12000000001  |
|null      |{2011-11-13 00:00:00, 2011-11-14 00:00:00}|5462.140000000003 |
|null      |{2011-11-24 00:00:00, 2011-11-25 00:00:00}|12399.519999999964|
+----------+------------------------------------------+------------------+
only showing top 5 rows



In [0]:
# streaming: read the same CSV files as a stream, one file per trigger
streamingDataFrame = spark.readStream\
.schema(static_schema)\
.option("maxFilesPerTrigger", 1)\
.format("csv")\
.option("header", "true")\
.load("dbfs:/FileStore/by_day_retail_data/*.csv")

In [0]:
streamingDataFrame.isStreaming

Out[12]: True

In [0]:
# transformation (despite the variable name, the window used here is 1 hour, not 1 day)
purchase_by_cust_day = streamingDataFrame.selectExpr("CustomerID", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")\
.groupby(col("CustomerID"), window(col("InvoiceDate"), '1 hour')).sum('total_cost')

In [0]:
# action: start the query and keep the full result in an in-memory table
purchase_by_cust_day.writeStream.format('memory')\
.queryName('customer_purchases')\
.outputMode('complete').start()


Out[54]: <pyspark.sql.streaming.query.StreamingQuery at 0x7f6bf9c6d100>

In [0]:
spark.sql("""
SELECT * FROM customer_purchases
ORDER BY `sum(total_cost)` DESC
""").show(5,False)

+----------+------------------------------------------+------------------+
|CustomerID|window                                    |sum(total_cost)   |
+----------+------------------------------------------+------------------+
|18102.0   |{2010-12-07 16:00:00, 2010-12-07 17:00:00}|25920.37          |
|null      |{2010-12-10 15:00:00, 2010-12-10 16:00:00}|12633.669999999996|
|null      |{2010-12-03 11:00:00, 2010-12-03 12:00:00}|12187.780000000002|
|null      |{2010-12-03 14:00:00, 2010-12-03 15:00:00}|10661.690000000004|
|15061.0   |{2010-12-02 15:00:00, 2010-12-02 16:00:00}|9407.339999999998 |
+----------+------------------------------------------+------------------+
only showing top 5 rows
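
With `maxFilesPerTrigger` set to 1, each trigger reads one file, and `complete` output mode rewrites the whole aggregated table after every trigger, so the query above returns different results depending on when it is run. The plain-Python sketch below mimics that behavior. `run_complete_mode` and the batch values are hypothetical, made up for illustration.

```python
from collections import defaultdict


def run_complete_mode(batches):
    """Yield the full aggregate table after each micro-batch."""
    state = defaultdict(float)  # running sum per (customer, window) key
    for batch in batches:
        for key, total_cost in batch:
            state[key] += total_cost
        yield dict(state)  # complete mode: emit every row seen so far


# Two hypothetical micro-batches of ((customer, window_start), total_cost)
batches = [
    [(("A", "09:00"), 10.0), (("B", "09:00"), 5.0)],
    [(("A", "09:00"), 2.5)],
]
snapshots = list(run_complete_mode(batches))
```

Each snapshot corresponds to what the in-memory `customer_purchases` table would hold after that trigger.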



In [0]:
static_retail.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)



In [0]:
from pyspark.sql.functions import date_format
prep_df = static_retail.na.fill(0)\
.withColumn("day_of_week", date_format(col("InvoiceDate"), 'EEEE')).coalesce(5)

In [0]:
prep_df.show(5)

+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|InvoiceNo|StockCode|         Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|day_of_week|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+
|   580538|    23084|  RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|     Monday|
|   580538|    23077| DOUGHNUT LIP GLOSS |      20|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|     Monday|
|   580538|    22906|12 MESSAGE CARDS ...|      24|2011-12-05 08:38:00|     1.65|   14075.0|United Kingdom|     Monday|
|   580538|    21914|BLUE HARMONICA IN...|      24|2011-12-05 08:38:00|     1.25|   14075.0|United Kingdom|     Monday|
|   580538|    22467|   GUMBALL COAT RACK|       6|2011-12-05 08:38:00|     2.55|   14075.0|United Kingdom|     Monday|
+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+-----------+

In [0]:
trainDataFrame = prep_df\
.where("InvoiceDate < '2011-07-01'")
testDataFrame = prep_df\
.where("InvoiceDate >= '2011-07-01'")

In [0]:
print(trainDataFrame.count())
print(testDataFrame.count())

245903
296006


In [0]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
indexer = StringIndexer()
indexer.setInputCol("day_of_week").setOutputCol("day_of_week_index")
encoder = OneHotEncoder()
encoder.setInputCol('day_of_week_index').setOutputCol('day_of_week_encoded')

Out[81]: OneHotEncoder_207dfe547aa3
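
For intuition about what `StringIndexer` learns when fitted, here is a plain-Python sketch. It assumes the default ordering (most frequent label gets index 0) and that ties are broken alphabetically. `fit_string_indexer` and the sample values are hypothetical.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to a float index, most frequent label first."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical day_of_week values, not taken from the retail data
days = ["Monday", "Thursday", "Thursday", "Monday", "Thursday", "Sunday"]
mapping = fit_string_indexer(days)
```

The indices depend on the data the indexer was fitted on, which is why the pipeline is fitted on the training set and then reused on the test set.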

In [0]:
vectorAssembler = VectorAssembler()\
.setInputCols(["UnitPrice", "Quantity", "day_of_week_encoded"])\
.setOutputCol("features")

In [0]:
from pyspark.ml import Pipeline
transformationPipeline = Pipeline().setStages([indexer,encoder,vectorAssembler])

In [0]:
fittedPipeline = transformationPipeline.fit(trainDataFrame)

In [0]:
transformedTraining = fittedPipeline.transform(trainDataFrame)

In [0]:
transformedTraining.show(1,False)

+---------+---------+------------------------+--------+-------------------+---------+----------+--------------+-----------+-----------------+-------------------+--------------------------+
|InvoiceNo|StockCode|Description             |Quantity|InvoiceDate        |UnitPrice|CustomerID|Country       |day_of_week|day_of_week_index|day_of_week_encoded|features                  |
+---------+---------+------------------------+--------+-------------------+---------+----------+--------------+-----------+-----------------+-------------------+--------------------------+
|537226   |22811    |SET OF 6 T-LIGHTS CACTI |6       |2010-12-06 08:34:00|2.95     |15987.0   |United Kingdom|Monday     |2.0              |(5,[2],[1.0])      |(7,[0,1,4],[2.95,6.0,1.0])|
+---------+---------+------------------------+--------+-------------------+---------+----------+--------------+-----------+-----------------+-------------------+--------------------------+
only showing top 1 row
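
The sparse notation in the last two columns reads as `(size, [indices], [values])`. The sketch below reproduces the row above in plain Python. It assumes six distinct day names in the training data (inferred from the encoded size of 5) and that the encoder drops the last category. All three helpers are hypothetical, not Spark APIs.

```python
def one_hot_drop_last(index, num_categories):
    """Sparse one-hot (size, indices, values); the last category is all zeros."""
    size = num_categories - 1
    if index == size:
        return size, [], []
    return size, [index], [1.0]


def assemble(scalars, sparse):
    """Concatenate scalar columns and one sparse vector into a sparse vector."""
    size, indices, values = sparse
    offset = len(scalars)
    out_idx = [i for i, v in enumerate(scalars) if v != 0]
    out_val = [float(scalars[i]) for i in out_idx]
    out_idx += [offset + i for i in indices]
    out_val += values
    return offset + size, out_idx, out_val


def to_dense(sparse):
    """Expand (size, indices, values) into a plain list."""
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


encoded = one_hot_drop_last(index=2, num_categories=6)  # Monday has index 2.0
features = assemble([2.95, 6], encoded)                 # UnitPrice, Quantity
dense = to_dense(features)
```

Positions 0 and 1 hold `UnitPrice` and `Quantity`, and the one-hot slot for index 2 lands at position 2 + 2 = 4, matching `(7,[0,1,4],[2.95,6.0,1.0])`.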



In [0]:
transformedTraining.cache()

Out[95]: DataFrame[InvoiceNo: string, StockCode: string, Description: string, Quantity: int, InvoiceDate: timestamp, UnitPrice: double, CustomerID: double, Country: string, day_of_week: string, day_of_week_index: double, day_of_week_encoded: vector, features: vector]

In [0]:
from pyspark.ml.clustering import KMeans
kmeans = KMeans()\
.setK(20)

In [0]:
kmModel = kmeans.fit(transformedTraining)
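
As a reminder of what fitting k-means involves, here is a minimal plain-Python sketch of the classic assign-then-update loop (Lloyd's algorithm). It is not Spark's implementation: it starts from fixed, hand-picked centroids and runs a fixed number of iterations, and the points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Return centroids after repeated assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its points.
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]   # hypothetical 2-D points
centers = kmeans(points, centroids=[(0, 0), (10, 0)])
```

In the notebook, `setK(20)` plays the role of the number of centroids, and the `features` vector column supplies the points.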

In [0]:
transformedTest = fittedPipeline.transform(testDataFrame)
pred = kmModel.transform(transformedTest)

In [0]:
from pyspark.ml.evaluation import ClusteringEvaluator
# Evaluate clustering by computing Silhouette score
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(pred)
print("Silhouette with squared euclidean distance = " + str(silhouette))

Silhouette with squared euclidean distance = 0.46523507334366837
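
For reference, the silhouette score ranges from -1 to 1, with higher values meaning points sit closer to their own cluster than to the nearest other cluster. The sketch below follows the textbook definition with squared Euclidean distance on four hypothetical points. It is meant to show the formula, not how Spark computes it at scale, and it assumes at least two clusters.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    """Mean silhouette coefficient using squared Euclidean distance."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)

    scores = []
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(sq_dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(sq_dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != c
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # hypothetical, well separated
labels = [0, 0, 1, 1]
score = silhouette(points, labels)
```

Here every point has a = 1 and b = 100.5, so the score is 99.5 / 100.5, close to 1. The 0.465 reported above indicates much less cleanly separated clusters.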


In [0]:
# RDD: build an RDD of Row objects and convert it to a DataFrame
from pyspark.sql import Row
spark.sparkContext.parallelize([Row(1), Row(2), Row(3)]).toDF()

Out[103]: DataFrame[_1: bigint]