# SparkSession

The `SparkSession` class, defined in the `pyspark.sql` package, is the entry point
to programming Spark with the Dataset and DataFrame APIs. To do
anything useful with a Spark cluster, you first need to create an instance of this
class, which also gives you access to an instance of `SparkContext`.

# SparkContext

The `SparkContext` class, defined in the `pyspark` package, is the main entry point
for Spark functionality. A `SparkContext` holds a connection to the Spark cluster
manager and can be used to create RDDs and broadcast variables in the cluster.

When you create an instance of `SparkSession`, the `SparkContext` becomes **available**
inside your session as an attribute, `SparkSession.sparkContext`.
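
In a managed notebook, `spark` and `sc` are usually created for you. Elsewhere, a session can be built explicitly. The following is a minimal sketch, assuming `pyspark` is installed; the helper name `create_local_session`, the application name, and the `local[2]` master are illustrative choices and not part of the original notebook.

```python
def create_local_session(app_name="rdd-examples", master="local[2]"):
    """Build (or reuse) a SparkSession and return it with its SparkContext."""
    # Imported lazily so that defining this helper does not itself require pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)      # assumption: run locally with two worker threads
        .appName(app_name)   # hypothetical application name
        .getOrCreate()
    )
    # The SparkContext is exposed as an attribute of the session.
    return spark, spark.sparkContext
```

Calling `spark, sc = create_local_session()` would then provide the two objects that the notebook cells rely on.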

# Why do we need SparkSession?

Prior to **Spark 2.0**, the entry points for Spark applications included:
- `SparkContext`, used for Spark core applications
- `SQLContext` and `HiveContext`, used for Spark SQL applications
- `StreamingContext`, used for Spark Streaming applications

The `SparkSession` object introduced in Spark 2.0 **combines all these objects into a single entry point that can be used for
all Spark applications**.

In [1]:
spark

In [2]:
sc  # The SparkContext is needed here because we are working with RDDs

In [3]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')  # Lazily defines the RDD; nothing is read until an action runs

In [5]:
type(rdd1)

pyspark.rdd.RDD

In [6]:
rdd1.take(10)  # Action

                                                                                

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [14]:
rdd1.count()

                                                                                

68881

In [None]:
rdd1.collect()

# Find the number of orders for each order status

In [9]:
a = '1,2013-07-25 00:00:00.0,11599,CLOSED'
a.split(',')[-1]

'CLOSED'

In [10]:
rdd2 = rdd1.map(lambda line: (line.split(',')[-1], 1))

In [15]:
rdd2.count()

68881

In [11]:
rdd2.take(10)

                                                                                

[('CLOSED', 1),
 ('PENDING_PAYMENT', 1),
 ('COMPLETE', 1),
 ('CLOSED', 1),
 ('COMPLETE', 1),
 ('COMPLETE', 1),
 ('COMPLETE', 1),
 ('PROCESSING', 1),
 ('PENDING_PAYMENT', 1),
 ('PENDING_PAYMENT', 1)]

In [12]:
rdd3 = rdd2.reduceByKey(lambda x, y: x + y)  # Transformation

In [13]:
rdd3.collect() # Action 

                                                                                

[('CLOSED', 7556),
 ('CANCELED', 1428),
 ('PENDING_PAYMENT', 15030),
 ('COMPLETE', 22899),
 ('PROCESSING', 8274),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7609),
 ('ON_HOLD', 3798),
 ('SUSPECTED_FRAUD', 1558)]
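
To see what `map` followed by `reduceByKey` computes, the same logic can be modeled in plain Python on the ten sample lines printed earlier. This is only a local sketch: `reduce_by_key` is a hypothetical helper written for illustration, not a Spark API, and it ignores partitioning and shuffling.

```python
from collections import defaultdict
from functools import reduce

# The ten order lines shown in the rdd1 output above
sample_lines = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3,2013-07-25 00:00:00.0,12111,COMPLETE',
    '4,2013-07-25 00:00:00.0,8827,CLOSED',
    '5,2013-07-25 00:00:00.0,11318,COMPLETE',
    '6,2013-07-25 00:00:00.0,7130,COMPLETE',
    '7,2013-07-25 00:00:00.0,4530,COMPLETE',
    '8,2013-07-25 00:00:00.0,2911,PROCESSING',
    '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
    '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT',
]


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in: fold all values that share a key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Same lambdas as in the notebook cells
pairs = [(line.split(',')[-1], 1) for line in sample_lines]
status_counts = reduce_by_key(pairs, lambda x, y: x + y)
print(status_counts)
```

On these ten lines the result contains four statuses whose counts add up to 10, mirroring how the full result adds up to the 68881 input lines.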

# Find out the top 10 premium customers 

In [22]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [23]:
rdd1.take(10)

                                                                                

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [24]:
rdd2 = rdd1.map(lambda line: (line.split(',')[2], 1))

In [25]:
rdd2.take(10) 

                                                                                

[('11599', 1),
 ('256', 1),
 ('12111', 1),
 ('8827', 1),
 ('11318', 1),
 ('7130', 1),
 ('4530', 1),
 ('2911', 1),
 ('5657', 1),
 ('5648', 1)]

In [26]:
rdd3 = rdd2.reduceByKey(lambda x, y: x+y)

In [27]:
rdd3.take(20)

                                                                                

[('256', 10),
 ('12111', 6),
 ('11318', 6),
 ('7130', 7),
 ('2911', 6),
 ('5657', 12),
 ('9149', 4),
 ('9842', 7),
 ('7276', 5),
 ('9488', 7),
 ('2711', 3),
 ('333', 6),
 ('656', 5),
 ('6983', 6),
 ('4189', 3),
 ('4840', 2),
 ('5863', 6),
 ('8214', 5),
 ('7776', 8),
 ('1549', 4)]

In [34]:
rdd4 = rdd3.sortBy(lambda x: x[-1], ascending=False)

In [35]:
rdd4.take(10)

[('5897', 16),
 ('6316', 16),
 ('12431', 16),
 ('569', 16),
 ('4320', 15),
 ('221', 15),
 ('5624', 15),
 ('5283', 15),
 ('12284', 15),
 ('5654', 15)]
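
The `sortBy(..., ascending=False)` plus `take(n)` pattern is a descending sort followed by a slice. A plain-Python sketch on the first ten pairs of the `rdd3.take(20)` output shows the idea; it runs locally and does not model how Spark distributes the sort.

```python
# First ten (customer_id, order_count) pairs from the rdd3.take(20) output above
customer_counts = [
    ('256', 10), ('12111', 6), ('11318', 6), ('7130', 7), ('2911', 6),
    ('5657', 12), ('9149', 4), ('9842', 7), ('7276', 5), ('9488', 7),
]

# Sort by the count (last element of each pair), largest first, then keep three
top_three = sorted(customer_counts, key=lambda x: x[-1], reverse=True)[:3]
print(top_three)
```

When only the top few records are needed, PySpark's `rdd3.takeOrdered(10, key=lambda x: -x[1])` is an alternative worth considering, since it returns the ten largest counts without first building a fully sorted RDD.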

# Distinct count of customers who have placed at least one order 

In [39]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [40]:
rdd1.take(5)

                                                                                

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE']

In [41]:
rdd2 = rdd1.map(lambda line: line.split(',')[2])

In [42]:
rdd2.take(10)

                                                                                

['11599',
 '256',
 '12111',
 '8827',
 '11318',
 '7130',
 '4530',
 '2911',
 '5657',
 '5648']

In [43]:
rdd3 = rdd2.distinct()

In [44]:
rdd3.count()

                                                                                

12405

In [45]:
rdd1.count()

68881

# Find the customers with the most orders in `PENDING_PAYMENT` status

In [9]:
rdd1 = sc.textFile('s3://fcc-spark-example/dataset/2023/orders.txt')

In [10]:
rdd2 = rdd1.map(lambda line: line.split(',')[-2:])  # keep [customer_id, order_status]

In [11]:
rdd3 = rdd2.filter(lambda line: line[1] == 'PENDING_PAYMENT')

In [12]:
rdd4 = rdd3.map(lambda x: (x[0], 1))

In [13]:
rdd5 = rdd4.reduceByKey(lambda x, y: x + y)

In [63]:
rdd6 = rdd5.sortBy(lambda x: x[-1], ascending=False)

                                                                                

# Word Count Problem and `Chaining functions`

In [None]:
data_set = 's3://fcc-spark-example/dataset/gutenberg_books/11-0.txt'

In [2]:
# Plan:
# - load the file
# - split each line on spaces
# - map each word to a pair: ['word1', 'word2', ...] => [('word1', 1), ('word2', 1), ...]
# - reduceByKey() to sum the counts per word
# - sortBy() to order the words by count

In [5]:
result = (sc.textFile(data_set) 
            .flatMap(lambda line: line.split(' ')) 
            .map(lambda word: (word, 1)) 
            .reduceByKey(lambda x, y: x + y)
         )

In [6]:
result.take(10)

                                                                                

[('Project', 78),
 ('Gutenberg’s', 2),
 ('Alice’s', 11),
 ('in', 401),
 ('Wonderland,', 3),
 ('Lewis', 4),
 ('Carroll', 4),
 ('', 2244),
 ('is', 89),
 ('use', 24)]
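
The `('', 2244)` entry in this output comes from `line.split(' ')`, which yields empty strings for blank lines and for runs of consecutive spaces. The sketch below reproduces the effect locally; the three sample lines are made up for illustration and are not taken from the book file.

```python
from collections import Counter

# Hypothetical sample lines: one with a double space, one empty
lines = ["Alice was  beginning", "", "to get very tired"]

# Equivalent of flatMap(lambda line: line.split(' ')): one flat list of tokens
tokens_single_space = [word for line in lines for word in line.split(' ')]

# split() with no argument splits on any whitespace run and drops empty tokens
tokens_whitespace = [word for line in lines for word in line.split()]

empty_count = Counter(tokens_single_space)['']
print(empty_count, len(tokens_whitespace))
```

Using `line.split()` in the `flatMap` step, or filtering out empty strings afterwards, would keep the empty token out of the word counts.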

In [1]:
spark

In [2]:
sc

In [3]:
my_list = ['US', 'India', 123, 'Boston', 'Suman', 1, 2, 3, 4, 5]

In [4]:
type(my_list)

list

In [5]:
rdd = sc.parallelize(my_list)

In [6]:
rdd.collect()

                                                                                

['US', 'India', 123, 'Boston', 'Suman', 1, 2, 3, 4, 5]

In [7]:
rdd.getNumPartitions() 

2

In [9]:
sc.defaultParallelism

2

In [10]:
l = range(10)

In [12]:
list(l)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [13]:
rdd = sc.parallelize(l)

In [14]:
rdd.getNumPartitions()

2
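
A partition is simply a slice of the data that one task processes. The sketch below models an even, contiguous split in plain Python; `split_into_partitions` is a hypothetical helper, and the exact boundaries Spark chooses are an assumption here. On a real RDD, `rdd.glom().collect()` shows the actual contents of each partition.

```python
def split_into_partitions(items, num_partitions):
    """Cut items into num_partitions contiguous slices of near-equal size."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


two_parts = split_into_partitions(range(10), 2)
three_parts = split_into_partitions(range(10), 3)
print(two_parts)
print(three_parts)
```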

In [15]:
rdd2 = rdd.map(lambda x: (x, x*2)) 

In [16]:
rdd2.collect()

                                                                                

[(0, 0),
 (1, 2),
 (2, 4),
 (3, 6),
 (4, 8),
 (5, 10),
 (6, 12),
 (7, 14),
 (8, 16),
 (9, 18)]

In [17]:
rdd2.getNumPartitions()

2

In [18]:
rdd2.saveAsTextFile('s3://fcc-spark-example/output/test')

                                                                                

In [19]:
data_set ='s3://fcc-spark-example/dataset/gutenberg_books/'

rdd = sc.textFile(data_set)

In [20]:
rdd.getNumPartitions() 

6

In [30]:
sc.defaultMinPartitions

2

In [31]:
rdd = sc.textFile('s3://fcc-spark-example/dataset/gutenberg_books/1342-0.txt')
rdd.getNumPartitions()

2

In [38]:
rdd.getNumPartitions()

2

In [39]:
new_rdd = rdd.repartition(10)

In [40]:
new_rdd.getNumPartitions()

10

In [42]:
new_rdd.getNumPartitions()

10

In [43]:
rdd = new_rdd.repartition(3)
rdd.getNumPartitions()

3

In [44]:
new_rdd.getNumPartitions()

10

In [46]:
a = new_rdd.coalesce(5)
a.getNumPartitions()

5
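
The two calls above differ in how they move data: `repartition(n)` always shuffles records and can raise or lower the partition count, while `coalesce(n)` by default only merges existing partitions and therefore cannot increase the count. A simplified local model of the merging behaviour, where `coalesce_partitions` is a hypothetical helper and the grouping rule is an assumption (Spark also takes data locality into account):

```python
def coalesce_partitions(partitions, num_partitions):
    """Merge whole partitions into at most num_partitions groups; never split one."""
    target = min(num_partitions, len(partitions))  # cannot grow without a shuffle
    merged = [[] for _ in range(target)]
    for index, partition in enumerate(partitions):
        # Neighbouring partitions land in the same output group
        merged[index * target // len(partitions)].extend(partition)
    return merged


partitions = [[0, 1], [2, 3], [4, 5], [6, 7]]
fewer = coalesce_partitions(partitions, 2)
unchanged = coalesce_partitions(partitions, 10)
print(fewer)
print(len(unchanged))
```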

In [47]:
dataset = 's3://amazon-reviews-pds/tsv/amazon_reviews_us_Books*'

In [48]:
rdd1 = sc.textFile(dataset)

rdd1.take(5)

                                                                                

['marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date',
 'US\t25933450\tRJOVP071AVAJO\t0439873800\t84656342\tThere Was an Old Lady Who Swallowed a Shell!\tBooks\t5\t0\t0\tN\tY\tFive Stars\tI love it and so does my students!\t2015-08-31',
 'US\t1801372\tR1ORGBETCDW3AI\t1623953553\t729938122\tI Saw a Friend\tBooks\t5\t0\t0\tN\tY\tPlease buy "I Saw a Friend"! Your children will be delighted!\tMy wife and I ordered 2 books and gave them as presents...one to a friend\'s daughter and the other to our grandson! Both children were so happy with the story, by author Katrina Streza, and they were overjoyed with the absolutely adorable artwork, by artist Michele Katz, throughout the book! We highly recommend &#34;I Saw a Friend&#34; to all your little ones!!!\t2015-08-31',
 'US\t5782091\tR7TNRFQAOUTX5\t142151981X\t678139048\tBlack Lagoon, Vol. 6

In [57]:
header = rdd1.first()

                                                                                

In [58]:
header

'marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_parent\tproduct_title\tproduct_category\tstar_rating\thelpful_votes\ttotal_votes\tvine\tverified_purchase\treview_headline\treview_body\treview_date'

In [59]:
rdd2 = rdd1.filter(lambda line: line != header)

In [60]:
rdd2.take(5)

                                                                                

['US\t25933450\tRJOVP071AVAJO\t0439873800\t84656342\tThere Was an Old Lady Who Swallowed a Shell!\tBooks\t5\t0\t0\tN\tY\tFive Stars\tI love it and so does my students!\t2015-08-31',
 'US\t1801372\tR1ORGBETCDW3AI\t1623953553\t729938122\tI Saw a Friend\tBooks\t5\t0\t0\tN\tY\tPlease buy "I Saw a Friend"! Your children will be delighted!\tMy wife and I ordered 2 books and gave them as presents...one to a friend\'s daughter and the other to our grandson! Both children were so happy with the story, by author Katrina Streza, and they were overjoyed with the absolutely adorable artwork, by artist Michele Katz, throughout the book! We highly recommend &#34;I Saw a Friend&#34; to all your little ones!!!\t2015-08-31',
 'US\t5782091\tR7TNRFQAOUTX5\t142151981X\t678139048\tBlack Lagoon, Vol. 6\tBooks\t5\t0\t0\tN\tY\tShipped fast.\tGreat book just like all the others in the series.\t2015-08-31',
 'US\t32715830\tR2GANXKDIFZ6OI\t014241543X\t712432151\tIf I Stay\tBooks\t5\t0\t0\tN\tN\tFive Stars\tSo bea

In [61]:
rdd3 = rdd2.map(lambda x: x.split('\t'))
rdd4 = rdd3.map(lambda x: (x[3], 1))

In [62]:
rdd4.take(5)

                                                                                

[('0439873800', 1),
 ('1623953553', 1),
 ('142151981X', 1),
 ('014241543X', 1),
 ('1604600527', 1)]

In [69]:
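# groupByKey() collects all values for a key into one iterable, e.g.
# ('0439873800', [1, 1, 1, 1, 1]),
# ('1623953553', [1, 1, 1]),
# so the length of that iterable is the number of reviews per product.

result = rdd4.groupByKey() \
             .map(lambda x: (x[0], len(x[1])))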

In [70]:
result.take(10)

                                                                                

[('0786665963', 17),
 ('125007181X', 209),
 ('0080969925', 2),
 ('0994081707', 39),
 ('1618650025', 15),
 ('0679600213', 102),
 ('099385723X', 2),
 ('067988629X', 146),
 ('0694003611', 1118),
 ('B000N7FEN8', 2)]

In [71]:
result2 = rdd4.reduceByKey(lambda x, y: x + y)

In [72]:
result2.take(5)

                                                                                

[('142151981X', 4),
 ('1604600527', 4),
 ('0399170863', 254),
 ('0671728725', 3),
 ('1570913722', 9)]
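
Both approaches produce the same counts, but `reduceByKey` combines values inside each partition before the shuffle, whereas `groupByKey` sends every record across it. The plain-Python sketch below counts the records that would be shuffled in each case. The two-partition layout is made up for illustration, and both helper functions are hypothetical models rather than Spark APIs.

```python
from collections import defaultdict

# Hypothetical layout: (product_id, 1) pairs spread over two partitions
partitions = [
    [('0439873800', 1), ('1623953553', 1), ('0439873800', 1)],
    [('0439873800', 1), ('1623953553', 1), ('142151981X', 1)],
]


def count_with_group_by_key(partitions):
    """Model of groupByKey: every pair is shuffled, then grouped and measured."""
    shuffled = [pair for partition in partitions for pair in partition]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return {key: len(values) for key, values in grouped.items()}, len(shuffled)


def count_with_reduce_by_key(partitions):
    """Model of reduceByKey: sum per partition first, then merge the partial sums."""
    shuffled = []
    for partition in partitions:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_counts, grouped_shuffled = count_with_group_by_key(partitions)
reduced_counts, reduced_shuffled = count_with_reduce_by_key(partitions)
print(grouped_counts == reduced_counts, grouped_shuffled, reduced_shuffled)
```

The saving grows with the number of repeated keys per partition, which is why `reduceByKey` is generally preferred for simple aggregations such as counting.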

# JOIN

In [107]:
# Loading the ORDERS data

data_set = 's3://fcc-spark-example/dataset/2023/orders.csv'
rdd1 = sc.textFile(data_set)

header = rdd1.first()
rdd1 = rdd1.filter(lambda x : x != header)

orders_rdd = rdd1.map(lambda line: (line.split(',')[2], line.split(',')[3])) 

                                                                                

In [108]:
orders_rdd.take(5)

                                                                                

[('11599', 'CLOSED'),
 ('256', 'PENDING_PAYMENT'),
 ('12111', 'COMPLETE'),
 ('8827', 'CLOSED'),
 ('11318', 'COMPLETE')]

In [109]:
# Loading the CUSTOMERS data

data_set = 's3://fcc-spark-example/dataset/2023/customers.csv'
rdd1 = sc.textFile(data_set)

header = rdd1.first()
rdd1 = rdd1.filter(lambda x : x != header)

customer_rdd = rdd1.map(lambda x: (x.split(',')[0], x.split(',')[-1]))

                                                                                

In [110]:
customer_rdd.take(5)

                                                                                

[('1', '78521'),
 ('2', '80126'),
 ('3', '00725'),
 ('4', '92069'),
 ('5', '00725')]

In [114]:
joined_rdd = customer_rdd.join(orders_rdd)
# self: (k, v1)    other: (k, v2)
# result: (k, (v1, v2))

In [115]:
joined_rdd.take(10)

                                                                                

[('4', ('92069', 'COMPLETE')),
 ('4', ('92069', 'COMPLETE')),
 ('4', ('92069', 'PENDING_PAYMENT')),
 ('4', ('92069', 'COMPLETE')),
 ('4', ('92069', 'COMPLETE')),
 ('4', ('92069', 'CLOSED')),
 ('10', ('22554', 'COMPLETE')),
 ('10', ('22554', 'COMPLETE')),
 ('12', ('78227', 'PENDING_PAYMENT')),
 ('12', ('78227', 'PENDING_PAYMENT'))]

In [116]:
orders_rdd.join(customer_rdd).take(10)

                                                                                

[('256', ('PENDING_PAYMENT', '60625')),
 ('256', ('CLOSED', '60625')),
 ('256', ('PENDING_PAYMENT', '60625')),
 ('256', ('PENDING', '60625')),
 ('256', ('PENDING_PAYMENT', '60625')),
 ('256', ('CLOSED', '60625')),
 ('256', ('PENDING', '60625')),
 ('256', ('PENDING', '60625')),
 ('256', ('PROCESSING', '60625')),
 ('256', ('COMPLETE', '60625'))]
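
The shape of the join result, and why swapping the two RDDs swaps the values inside the tuple, can be modeled locally. This is a sketch of inner-join semantics only: `inner_join` is a hypothetical helper, and the small samples are picked from values visible in the outputs above.

```python
from collections import defaultdict

# Small samples built from pairs that appear in the outputs above
customers = [('4', '92069'), ('10', '22554'), ('12', '78227')]
orders = [('4', 'COMPLETE'), ('4', 'CLOSED'), ('10', 'COMPLETE'), ('11599', 'CLOSED')]


def inner_join(left, right):
    """Return (k, (v1, v2)) for every pair of left and right records sharing key k."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_index.get(key, [])
    ]


customers_first = inner_join(customers, orders)
orders_first = inner_join(orders, customers)
print(customers_first)
print(orders_first)
```

Keys present on only one side (customer `'12'` and order key `'11599'` in this sample) are dropped, as expected for an inner join.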

In [117]:
# Loading the orders data
data_set = 's3://fcc-spark-example/dataset/2023/orders_cust_id_999/*'
rdd1 = sc.textFile(data_set)

orders_rdd = rdd1.map(lambda line: (int(line.split(',')[2]), line.split(',')[-1])) 

In [118]:
orders_rdd.take(5)

                                                                                

[(256, 'PENDING_PAYMENT'),
 (918, 'PAYMENT_REVIEW'),
 (333, 'COMPLETE'),
 (656, 'COMPLETE'),
 (196, 'PROCESSING')]

In [119]:
# Loading the customers data
data_set = 's3://fcc-spark-example/dataset/2023/customers_cust_id_999/*'
rdd1 = sc.textFile(data_set)

customer_rdd = rdd1.map(lambda x: (int(x.split(',')[0]), x.split(',')[-1]))

In [120]:
customer_rdd.take(5)

[(1, '78521'), (2, '80126'), (3, '00725'), (4, '92069'), (5, '00725')]

In [126]:
broadcast_variable = sc.broadcast(dict(customer_rdd.collect()))

In [129]:
def get_zip_code(customer_id):
    # Look up the ZIP code in the broadcast dictionary; returns None if the ID is missing
    return broadcast_variable.value.get(customer_id)

In [131]:
get_zip_code(15)

'92336'

In [134]:
# (256, 'PENDING_PAYMENT')

joined_rdd = orders_rdd.map(lambda x: (x[0], (x[1], get_zip_code(x[0]))))

In [136]:
joined_rdd.take(10)

                                                                                

[(256, ('PENDING_PAYMENT', '60625')),
 (918, ('PAYMENT_REVIEW', '00725')),
 (333, ('COMPLETE', '00725')),
 (656, ('COMPLETE', '91767')),
 (196, ('PROCESSING', '48126')),
 (662, ('PENDING_PAYMENT', '00725')),
 (674, ('PROCESSING', '19067')),
 (824, ('ON_HOLD', '21740')),
 (395, ('PROCESSING', '02151')),
 (104, ('PROCESSING', '55112'))]
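
This broadcast approach replaces the shuffle-based `join` with a dictionary lookup inside `map`. A local sketch of the same logic follows; the first two orders and the ZIP codes come from the output above, while customer ID `99999` is a made-up value added to show what happens to an unmatched key.

```python
# Lookup table as it would exist inside the broadcast variable
zip_by_customer = {256: '60625', 918: '00725', 333: '00725'}

orders = [
    (256, 'PENDING_PAYMENT'),
    (918, 'PAYMENT_REVIEW'),
    (99999, 'COMPLETE'),  # hypothetical customer ID with no entry in the lookup table
]

# Same shape as the notebook's map step: (customer_id, (status, zip_code))
joined = [
    (customer_id, (status, zip_by_customer.get(customer_id)))
    for customer_id, status in orders
]
print(joined)
```

Because `dict.get` returns `None` for a missing key, unmatched orders are kept with a `None` ZIP code, so this behaves like a left outer join rather than the inner join produced by `join`. It also assumes the lookup table is small enough to fit in memory on the driver and on every executor.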

