# SparkSession

The `SparkSession` class, defined in the `pyspark.sql` package, is the entry point
to programming Spark with the Dataset and DataFrame APIs. To do
anything useful with a Spark cluster, you first create an instance of this
class, which also gives you access to a `SparkContext` instance.

# SparkContext

The `SparkContext` class, defined in the `pyspark` package, is the main entry point
for Spark functionality. A `SparkContext` holds a connection to the Spark cluster
manager and can be used to create RDDs and broadcast variables on the cluster.

When you create an instance of `SparkSession`, the `SparkContext` becomes **available**
as the `sparkContext` attribute of that instance (for example, `spark.sparkContext`).
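In hosted notebooks, `spark` and `sc` are often created for you. Where they are not, the relationship described above can be sketched as follows. The function name `build_session` and the application name `orders-demo` are illustrative assumptions; `SparkSession.builder`, `appName`, `getOrCreate` and `sparkContext` are the standard PySpark calls.

```python
def build_session(app_name="orders-demo"):
    """Create (or reuse) a SparkSession and return it with its SparkContext.

    `app_name` is a hypothetical label; any string works.
    """
    # Imported inside the function so this file still loads on a machine
    # where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # The SparkContext is an attribute of the session, not a separate object
    # you have to construct yourself.
    return spark, spark.sparkContext


# In a notebook or script with PySpark available:
# spark, sc = build_session()
```

`getOrCreate()` returns the existing session if one is already running, so calling this in a notebook that predefines `spark` should hand back the same session rather than start a second one.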

# Why do we need SparkSession?

Prior to **Spark 2.0**, Spark applications had several separate entry points:
- `SparkContext`, used for Spark core applications
- `SQLContext` and `HiveContext`, used for Spark SQL applications
- `StreamingContext`, used for Spark Streaming applications

The `SparkSession` object, introduced in Spark 2.0, **combines all these objects into a single entry point that can be used for
all Spark applications**.

In [1]:
spark

In [2]:
sc  # SparkContext: needed here because we are working with RDDs

In [3]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')  # Lazy: defines the RDD; nothing is read until an action runs

In [5]:
type(rdd1)

pyspark.rdd.RDD

In [6]:
rdd1.take(10)  # Action: triggers the read and returns the first 10 lines

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [14]:
rdd1.count()

                                                                                

68881

In [None]:
rdd1.collect()  # Action: returns every line to the driver; prefer take(n) for a quick look at large data

# Find the number of orders for each order status

In [9]:
a = '1,2013-07-25 00:00:00.0,11599,CLOSED'
a.split(',')[-1]

'CLOSED'

In [10]:
rdd2 = rdd1.map(lambda line: (line.split(',')[-1], 1))

In [15]:
rdd2.count()

68881

In [11]:
rdd2.take(10)

                                                                                

[('CLOSED', 1),
 ('PENDING_PAYMENT', 1),
 ('COMPLETE', 1),
 ('CLOSED', 1),
 ('COMPLETE', 1),
 ('COMPLETE', 1),
 ('COMPLETE', 1),
 ('PROCESSING', 1),
 ('PENDING_PAYMENT', 1),
 ('PENDING_PAYMENT', 1)]

In [12]:
rdd3 = rdd2.reduceByKey(lambda x, y: x + y)  # Transformation

In [13]:
rdd3.collect() # Action 

                                                                                

[('CLOSED', 7556),
 ('CANCELED', 1428),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8274),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7609),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558)]
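The `map` and `reduceByKey` steps above can be checked on a handful of rows without a cluster. The following is a plain-Python sketch, not Spark code: `reduce_by_key` is a hypothetical stand-in that mimics what `reduceByKey` computes, and `SAMPLE` is the first ten lines shown in the earlier `take` output.

```python
from operator import add

# First ten lines of orders.txt, copied from the take() output above.
SAMPLE = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
    '6,2013-07-25 00:00:00.0,7130,COMPLETE',
    '7,2013-07-25 00:00:00.0,4530,COMPLETE',
    '8,2013-07-25 00:00:00.0,2911,PROCESSING',
    '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
    '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT',
]


def status_pair(line):
    """Map one CSV order line to (order_status, 1)."""
    return (line.split(',')[-1], 1)


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: combine each key's values with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


status_counts = reduce_by_key(map(status_pair, SAMPLE), add)
print(status_counts)
```

Because `status_pair` and `operator.add` are ordinary functions, the same two names could be passed to the RDD version, as in `rdd1.map(status_pair).reduceByKey(add)`, once the local check looks right.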

# Find the top 10 premium customers (by number of orders)

In [22]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [23]:
rdd1.take(10)

                                                                                

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [24]:
rdd2 = rdd1.map(lambda line: (line.split(',')[2], 1))

In [25]:
rdd2.take(10) 

                                                                                

[('11599', 1),
 ('256', 1),
 ('12111', 1),
 ('8827', 1),
 ('11318', 1),
 ('7130', 1),
 ('4530', 1),
 ('2911', 1),
 ('5657', 1),
 ('5648', 1)]

In [26]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)

In [27]:
rdd3.take(20)

                                                                                

[('256', 10),
 ('12111', 6),
 ('11318', 6),
 ('7130', 7),
 ('2911', 6),
 ('5657', 12),
 ('9149', 4),
 ('9842', 7),
 ('7276', 5),
 ('9488', 7),
 ('2711', 3),
 ('333', 6),
 ('656', 5),
 ('6983', 6),
 ('4189', 3),
 ('4840', 2),
 ('5863', 6),
 ('8214', 5),
 ('7776', 8),
 ('1549', 4)]

In [34]:
rdd4 = rdd3.sortBy(lambda x: x[-1], ascending=False)

In [35]:
rdd4.take(10)

[('5897', 16),
 ('6316', 16),
 ('12431', 16),
 ('569', 16),
 ('4320', 15),
 ('221', 15),
 ('5624', 15),
 ('5283', 15),
 ('12284', 15),
 ('5654', 15)]
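Sorting the whole RDD just to read the first ten rows does more work than needed. The idea can be illustrated in plain Python on a few `(customer_id, order_count)` pairs taken from the `reduceByKey` output above; `order_count` is a hypothetical helper name. PySpark exposes a similar shortcut as `RDD.top(n, key=...)` and `RDD.takeOrdered(n, key=...)`, which is worth confirming against the version in use.

```python
import heapq

# A few (customer_id, order_count) pairs copied from the reduceByKey output above.
counts = [('256', 10), ('12111', 6), ('5657', 12), ('7776', 8), ('4840', 2)]


def order_count(pair):
    """Sort key: the count in a (customer_id, order_count) pair."""
    return pair[1]


# Equivalent of sortBy(order_count, ascending=False) followed by take(3).
top3_sorted = sorted(counts, key=order_count, reverse=True)[:3]

# Same answer without ordering every element: keep only the 3 largest.
top3_heap = heapq.nlargest(3, counts, key=order_count)

print(top3_sorted)
print(top3_heap)
```

On the cluster, the corresponding call would be along the lines of `rdd3.top(10, key=lambda kv: kv[1])`, which returns a list to the driver directly.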

# Distinct count of customers who have placed at least one order 

In [39]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [40]:
rdd1.take(5)

                                                                                

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE']

In [41]:
rdd2 = rdd1.map(lambda line: line.split(',')[2])

In [42]:
rdd2.take(10)

                                                                                

['11599',
 '256',
 '12111',
 '8827',
 '11318',
 '7130',
 '4530',
 '2911',
 '5657',
 '5648']

In [43]:
rdd3 = rdd2.distinct()

In [44]:
rdd3.count()

                                                                                

12405

In [45]:
rdd1.count()

68881
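The two counts above can be combined into a quick sanity check: the average number of orders per customer. This is simple arithmetic on the numbers already shown, with variable names chosen for illustration.

```python
total_orders = 68881        # result of rdd1.count()
distinct_customers = 12405  # result of rdd3.count()

# Average orders per customer who placed at least one order.
avg_orders_per_customer = total_orders / distinct_customers
print(round(avg_orders_per_customer, 2))
```

An average between 5 and 6 is consistent with the per-customer counts seen earlier, where the busiest customers have 15 or 16 orders.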

# Find the customers with the most orders in `PENDING_PAYMENT` status

In [9]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [10]:
rdd2 = rdd1.map(lambda line: line.split(',')[-2:])  # keep [customer_id, order_status]

In [11]:
rdd3 = rdd2.filter(lambda line: line[1] == 'PENDING_PAYMENT')

In [12]:
rdd4 = rdd3.map(lambda x: (x[0], 1))

In [13]:
rdd5 = rdd4.reduceByKey(lambda x, y: x + y)

In [63]:
rdd6 = rdd5.sortBy(lambda x: x[-1], ascending=False)
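The pipeline above ends with `sortBy`, which is a transformation, so nothing is computed until an action such as `rdd6.take(10)` is called. The logic of the steps can be checked locally first. The sketch below is plain Python rather than Spark; the last two rows are hypothetical (not from the dataset) and exist only so that one customer appears more than once.

```python
from collections import Counter

ROWS = [
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
    '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT',
    # Hypothetical rows, not from orders.txt, added so customer 256 repeats.
    '900001,2013-07-26 00:00:00.0,256,PENDING_PAYMENT',
    '900002,2013-07-26 00:00:00.0,256,CLOSED',
]


def customer_and_status(line):
    """Return [customer_id, order_status], the last two CSV fields."""
    return line.split(',')[-2:]


# filter on status, then count per customer (the map + reduceByKey steps).
pending = Counter(
    customer
    for customer, status in map(customer_and_status, ROWS)
    if status == 'PENDING_PAYMENT'
)

# Highest count first, like sortBy(..., ascending=False).
ranked = pending.most_common()
print(ranked)
```

Note that the `CLOSED` order for customer 256 is excluded by the filter, so that customer ends up with a count of 2, not 3.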

                                                                                

# Word Count Problem and Chaining Functions

In [None]:
data_set = 's3://fcc-spark-example/dataset/gutenberg_books/11-0.txt'

In [2]:
# - load the file
# - split each line on spaces
# - map each word to a pair: 'word1', 'word2', ... => ('word1', 1), ('word2', 1), ...
# - reduceByKey() to sum the counts per word
# - sortBy() to order by count (not applied in the chain below)

In [5]:
result = (sc.textFile(data_set) 
            .flatMap(lambda line: line.split(' ')) 
            .map(lambda word: (word, 1)) 
            .reduceByKey(lambda x, y: x + y)
         )

In [6]:
result.take(10)

                                                                                

[('Project', 78),
 ('Gutenberg’s', 2),
 ('Alice’s', 11),
 ('in', 401),
 ('Wonderland,', 3),
 ('Lewis', 4),
 ('Carroll', 4),
 ('', 2244),
 ('is', 89),
 ('use', 24)]
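The `('', 2244)` entry in the output above most likely comes from `split(' ')`, which yields an empty string for every blank line and for every extra space between words. The plain-Python sketch below shows the difference on three hypothetical lines (not taken from the book) and adds the sort step that the plan lists but the chain omits.

```python
from collections import Counter

# Hypothetical sample lines: one has a double space, one is blank.
lines = ["the cat  and the hat", "", "the end"]


def words_on_space(line):
    """Split on single spaces, as in the chain above; can yield '' tokens."""
    return line.split(' ')


def words_on_whitespace(line):
    """Split on runs of whitespace; never yields '' tokens."""
    return line.split()


space_tokens = [w for line in lines for w in words_on_space(line)]
clean_tokens = [w for line in lines for w in words_on_whitespace(line)]

empty_with_space = space_tokens.count('')
empty_with_whitespace = clean_tokens.count('')
print(empty_with_space, empty_with_whitespace)

# Count, then order by frequency: the reduceByKey and sortBy steps.
ranked = sorted(Counter(clean_tokens).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])
```

In the Spark chain, the corresponding changes would be `.flatMap(lambda line: line.split())` and a final `.sortBy(lambda kv: kv[1], ascending=False)`. Punctuation attached to words, as in `'Wonderland,'`, would still need separate handling.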